fix failure to OCR: general quality issue due to LSTM being fed noisy/crappy *original* image pixels instead of cleaned-up binarized pixels. #4111
Conversation
(…message chain in mailing list: https://groups.google.com/g/tesseract-ocr/c/5jrGvsrdqig/m/jvTG6L9zBgAJ)

It turns out tesseract erroneously grabs the ORIGINAL image (instead of the THRESHOLDED/BINARIZED one!) to extract the word box (`Tesseract::GetRectImage()`), which is then fed into the LSTM OCR neural net in order to OCR the detected text area.

Ergo: this fix SHOULD improve OCR results generally, as this is a generic bug which impacts ALL text bboxes found in a given input page image, which are then pumped into the LSTM engine to obtain OCR'ed text.

This fix was verified to work in an otherwise patched/augmented tesseract rig (GerHobbelt/tesseract: commit series bb37cf3, ffc1997, 15d2952, 69416e5, f49826b, d53c1a2, 44f2f84), where I worked on removing the curious `BestPix()` API, which SEEMINGLY was originally meant for ScrollView-et-al debug display purposes but is (IMO) an ill-named API for that purpose.

- remove the accompanying, now obsolete, comment
- also remove the need for the `BestPix()` API in `EquationDetect::PrintSpecialBlobsDensity()` by invoking the API that delivers what's actually used there: the image height. Here `BestPix()` usage is also (theoretically) wrong, as the sought-after image height is the height of the binarized image data, which represents the cleaned-up-and-ready-for-use OCR source image data.
I tested this commit on a greyscale newspaper image, and the result is mixed. Some lines were recognized better, others got worse. Generally it is intentional to use the LSTM with original images: ideally the neural network is trained to handle noise and even to work better with greyscale images than with binarized ones.
Do you have a sample newspaper scan where this occurs? 🤔
The fundamental issue with the old code is that it circumnavigates the tesseract thresholding/cleanup process, using no obvious criteria and outside the user's control, resulting in undesirable source-image noise getting into the engine. Pondering both your results and mine, this would mean `BestPix()` needs a user-control parameter. (Or folks should be relegated to using fully external image cleanup processes, but that would still leave us with a non-obvious decision-making process in `BestPix()`.) 🤔
As I'm digging further into the catacombs of tesseract (RTFC) I realize that, technically, the LSTM engine is engineered to accept raw RGB input (3 channels) or, failing that, greyscale input (1 channel), so from that perspective the reported issue is NOT a bug but rather a tesseract 4/5 feature.

Meanwhile I find that the same source image is fed through binarization, resulting in a pure-black-and-white, 1-bit-per-pixel image which is used by the segmentation logic, which clips (or extracts, when we use a different jargon for the same "cutting out lines and feeding them to the OCR engine" process) boxes = segments of detected text lines to feed to the LSTM engine.

The LSTM engine may once have been trained to tolerate noise, but given the actual behaviour for, for example, the rather noisy (yet very legible) wildlife camera text referenced earlier, I'd argue that the noise tolerance of the LSTM engine is only 'good enough' when we feed tesseract very tightly preprocessed scans/images, such as can be expected off a professional book/paper scanning rig.

Anyhow, I feel the currently filed pull request is subpar quality-wise, given my latest insights into tesseract, and should be closed. Before I do so, I'd love some feedback from @stweil on the next two questions:

1. Procedure: would you like me to file subsequent (re)work on this subject under this same PR, or would you rather see a fresh PR which mentions this one?
2. The planned rework is to combine this effort with an improved greyscale image preprocessing stage, where the greyscale is "filtered" through the thresholding mask (after thickening/dilating/brick-closing it), so that the resulting greyscale image keeps all noisy data only under the masked areas, which represent the (binarized) text mask.
The cruddy OCR results that the wildlife cam sample produced are due to nearly invisible pixel noise in the background, far removed from the actual characters, such that it may be argued to be "adversarial input": it produces an OCR text result that's far off the mark, with seemingly sensible, high certainty values at the moment. Filtering the raw input through the (dilated) threshold mask would kill this and a lot of other "weird OCR results" that I got for old and otherwise low-quality book scans I have been testing on. Hence the questions:

2a. Do you have a link for me where I can download some or all of the pages you are working on, so I can compare against your samples as well?
2b. Would you be interested in this work at all?

Hm, what I'm looking for, I guess, is a project-lead policy question (and answer), I suppose... Does tesseract, as a general policy, wish to receive maximum-effort, cleaned-up, black-on-white greyscale scans as source images, or is tesseract's strategy/long-term goal to include an industrial-quality preprocessing stage? I'm asking because I know where I want to go, but I'm not clear on where tesseract wants to go. It may be my limited skill at comprehending the documentation, but I was unable to find where y'all want to be in X years with this. (And, yes, open source is primarily a labor of love, so no harm nor foul when the goals take forever, but I'm lacking that vision from the tesseract core as is. 🙏)

So @stweil: if you could give me a hint at where you want to take tesseract, I'd be much obliged. That way I would be much better informed whether some work I've done, and intend to do, is worth filing a PR for. 🙏😚 Thanks for bearing with me. Cheers, Ger
Killing it; feeding the engine B&W pixels is not the answer. See also the @stweil comment above: #4111 (comment)
fix Bushnell OCR bug (failure to properly OCR the number "11"; see message chain in mailing list: https://groups.google.com/g/tesseract-ocr/c/5jrGvsrdqig/m/jvTG6L9zBgAJ; this includes sample images, text output and context as originally reported by Astro/Nor):

root cause: it turns out tesseract erroneously grabs the ORIGINAL image (instead of the THRESHOLDED/BINARIZED one!) to extract the word box (`Tesseract::GetRectImage()`), which is then fed into the LSTM OCR neural net in order to OCR the detected text area.

Ergo: this fix SHOULD improve OCR results generally, as this is a generic bug which impacts ALL text bboxes found in a given input page image, which are then pumped into the LSTM engine to obtain OCR'ed text.

This fix was verified to work in an otherwise patched/augmented tesseract rig (GerHobbelt/tesseract: commit series bb37cf3, ffc1997, 15d2952, 69416e5, f49826b, d53c1a2, 44f2f84), where I worked on removing the curious `BestPix()` API, which SEEMINGLY was originally meant for ScrollView-et-al debug display purposes but is (IMO) an ill-named API for that purpose.

- remove the accompanying, now obsolete, comment
- also remove the need for the `BestPix()` API in `EquationDetect::PrintSpecialBlobsDensity()` by invoking the API that delivers what's actually used there: the image height. Here `BestPix()` usage is also (theoretically) wrong, as the sought-after image height is the height of the binarized image data, which represents the cleaned-up-and-ready-for-use OCR source image data.

Corollary of this bug: anyone feeding tesseract monochrome (pre-thresholded/binarized) images from an external cleanup + binarization process SHOULD already get the best OCR results and SHOULD NOT be impacted by this bug, nor by this fix. (As then there is no difference between 'original pix' and 'binary pix' from tesseract's perspective.)