
Some errors in the 800K datasets: oversized word/char box, missing labels #101

Jyouhou opened this issue Feb 7, 2018 · 8 comments

Jyouhou commented Feb 7, 2018

Hi, I am having similar problems to those discussed in #13 and #15.

I am using the pre-generated 800K dataset to train a model, and found the following issues:

(1) Some word/char boxes are oversized, as discussed in #13 and #15.
(2) Some word recognition annotations are wrong.
(3) There are some confusing bounding-box coordinate values, e.g. negative values, coordinates that cross the image boundary, and char boxes whose coordinates consist of only 2 distinct vertices (e.g. p1, p1, p2, p2, where 4 distinct points are expected).

ankush-me (Owner) commented Feb 7, 2018

Hi,
What percentage of the images have these problems? If it is only a small fraction, then (1) and (3) can be checked automatically and such samples can be discarded.
Re (2): how is the annotation wrong? Please give some examples.

Jyouhou (Author) commented Feb 9, 2018

@ankush-me
Hi,

For (1), it's hard to check whether a box is oversized; I think an automatic check is an AI task in itself :) But as you replied in #13 and #15, it may have been caused by the fonts used in those samples. The percentage is low (we didn't count it manually).

For (3), the percentage is a bit high, but it is easy to check and correct. We print an error message every time our pre-processing program encounters this problem, and it seems that roughly half of the samples contain such errors (we shuffle the sample list every time the program starts). We simply ignore and discard those invalid boxes (see the sketch below).

For (2), wrong annotations include: (a) the GT characters differ from what actually appears in the image (e.g. the GT chars are 'the' while the chars in the image are 'HHH'); (b) there are no corresponding chars in the image even though the GT char list indicates there should be.

P.S. the (p1, p1, p2, p2) boxes mentioned above have no corresponding chars in the image either.
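
A discard check of this kind might look roughly like the sketch below. It assumes the 2x4xN charBB layout of the released gt.mat; `filter_char_boxes` is a made-up name, not code from this repo:

```python
import numpy as np

def filter_char_boxes(charBB, img_w, img_h):
    """Return a boolean keep-mask over the N boxes in a 2x4xN array."""
    n = charBB.shape[-1]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        box = charBB[:, :, i]  # 2x4: row 0 = x coords, row 1 = y coords
        # negative coordinates, or coordinates beyond the image boundary
        if (box < 0).any() or (box[0] > img_w).any() or (box[1] > img_h).any():
            keep[i] = False
            continue
        # degenerate boxes like (p1, p1, p2, p2): fewer than 4 distinct vertices
        if len({tuple(v) for v in box.T.tolist()}) < 4:
            keep[i] = False
    return keep
```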

ankush-me (Owner) commented

Hi, apologies for the delay in replying.

Regarding incorrect GT:

  1. This "HHH" is a known issue, and we don't really know the cause. Please let me know if you have any information regarding this.
  2. This could be due to extreme values of the alpha channel -- hopefully there aren't too many of these.
  3. p1,p1,p2,p2 again, can be checked automatically and discarded.

Jyouhou (Author) commented Feb 28, 2018

Hi, thanks for your reply.

We checked the source code and found a possible reason why some chars may be missing from the image.

The way your code prints chars onto the image is:

(1) Compute a 'surface' from the depth information and the semantic segmentation result, so that the 'surface' is a connected subset of the pixels in the original image.
(2) Select a 'surface' and use pygame to render the text onto it one char at a time. Once the next char falls outside the 'surface', the rendering process is interrupted.

When the interruption happens, chars that have already been rendered are not removed, and the full text is still kept in the set of contours that becomes the ground truth (see the sketch below). This is where the problem is.

If I misunderstood it, please correct me.
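
Schematically, the flow I mean is something like this (a hand-written sketch with made-up helpers `fits` and `blit`, not the actual code):

```python
def print_text_on_surface(surface, text, blit, fits):
    """Sketch of the char-by-char rendering flow described above."""
    rendered = []
    for ch in text:
        if not fits(surface, ch):   # the next char falls outside the 'surface'
            break                   # rendering is interrupted here
        blit(surface, ch)           # this char is now permanently in the image
        rendered.append(ch)
    # After an interruption, the image contains only `rendered`, while the
    # ground-truth contours can still end up recording all of `text`.
    return rendered
```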

@ankush-me
Copy link
Owner

ankush-me commented Feb 28, 2018

I do not really understand your interpretation of the "interruption" ---
The code renders the full text string. It does check whether the text fits in the "surface" patch (as you call it above), but if it doesn't, it just moves on to the next sample, discarding the current one.

Can you point to the code where you think the text is still kept after "interruption"?

Jyouhou (Author) commented Apr 15, 2018

Hi,

I think the problem is here in this function.

`i` loops across `order`, instead of in incremental order. So when that line indicates a collision has happened, the function returns the first `i` elements of `loc`, `bbs`, and `order`.

(1) It's possible that the first `i` elements include chars that have not been rendered yet, while omitting ones that have already been rendered (sketched below).
(2) Line 383 doesn't filter out text that has not been rendered yet.

Please correct me if I'm wrong.
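
To make the pattern concrete, here is a sketch (made-up helper names, not the repository's code) of the buggy return and one possible fix:

```python
import numpy as np

def place_chars_buggy(loc, bbs, order, collides, render):
    # `order` is a shuffled index array, so the loop variable i jumps around
    # (e.g. 5, 2, 7, ...). On a collision, slicing with [:i] keeps the first
    # i *positions*, not the chars that were actually rendered so far.
    for i in order:
        if collides(bbs[i]):
            return loc[:i], bbs[:i], order[:i]  # wrong subset is returned
        render(loc[i])
    return loc, bbs, order

def place_chars_fixed(loc, bbs, order, collides, render):
    # One possible fix: track the indices that were actually rendered
    # and return exactly those, in render order.
    placed = []
    for i in order:
        if collides(bbs[i]):
            break
        render(loc[i])
        placed.append(i)
    idx = np.asarray(placed, dtype=int)
    return loc[idx], bbs[idx], idx
```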

ankush-me (Owner) commented

@Jyouhou Thank you -- yes, it does seem like a bug!

Heermosi commented

Hi,

> I think the problem is here in this function.
>
> `i` loops across `order`, instead of in incremental order. So when that line indicates a collision has happened, the function returns the first `i` elements of `loc`, `bbs`, and `order`.
>
> (1) It's possible that the first `i` elements include chars that have not been rendered yet, while omitting ones that have already been rendered.
> (2) Line 383 doesn't filter out text that has not been rendered yet.
>
> Please correct me if I'm wrong.

That seems like a bug, but it is not really a bug now!
Every time this function is called now, it only puts one text instance (a line, word, or paragraph) into the background.
Maybe that was what you'd call a bug fix.....
