
Some errors in the 800K datasets: oversized word/char box, missing labels #101

Jyouhou opened this issue Feb 7, 2018 · 8 comments

Jyouhou commented Feb 7, 2018

Hi, I am having similar problems to those discussed in #13 and #15.

I am using the pre-generated 800K dataset to train a model, and found the following issues:

(1) Some word/char boxes are oversized, as discussed in #13 and #15.
(2) Some word recognition annotations are wrong.
(3) There are some confusing bounding-box coordinate values, e.g. negative values, coordinates that cross the image boundary, and char boxes whose coordinates consist of only 2 distinct vertices (e.g. p1, p1, p2, p2, where 4 distinct points are expected).

ankush-me (Owner) commented Feb 7, 2018

Hi,
What percentage of the images have these problems? If it is only a small fraction, then (1) and (3) can be checked automatically and such samples can be discarded.
Re (2): how is the annotation wrong? Please give some examples.

Jyouhou (Author) commented Feb 9, 2018

@ankush-me
Hi,

For (1), it's hard to check whether a box is oversized; I think an automatic check is an AI task in itself :) But as you replied in #13 and #15, it may have been caused by the fonts used in those samples. The percentage is low (we didn't count it manually).

For (3), the percentage is a bit high, but it is easy to check and correct. We print an error message every time our pre-processing program encounters this problem, and it seems that roughly half of the samples contain such errors (we shuffle the sample list every time the program starts). We simply ignore and discard those invalid boxes (see the sketch below).

For (2), wrong annotations include: (a) the GT characters differ from what actually appears in the image (e.g. the GT chars are 'the' while the chars in the image are 'HHH'); (b) there are no corresponding chars in the image even though the GT char list indicates there should be.

P.S. the (p1, p1, p2, p2) boxes mentioned above have no corresponding chars in the image either.
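
A discard check of this kind might look roughly like the sketch below. It assumes the 2x4xN charBB layout of the released gt.mat; `filter_char_boxes` is a made-up name, not code from this repo:

```python
import numpy as np

def filter_char_boxes(charBB, img_w, img_h):
    """Return a boolean keep-mask over the N boxes in a 2x4xN array."""
    n = charBB.shape[-1]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        box = charBB[:, :, i]  # 2x4: row 0 = x coords, row 1 = y coords
        # negative coordinates, or coordinates beyond the image boundary
        if (box < 0).any() or (box[0] > img_w).any() or (box[1] > img_h).any():
            keep[i] = False
            continue
        # degenerate boxes like (p1, p1, p2, p2): fewer than 4 distinct vertices
        if len({tuple(v) for v in box.T.tolist()}) < 4:
            keep[i] = False
    return keep
```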

ankush-me (Owner) commented

Hi, apologies for the delay in replying.

Regarding incorrect GT:

  1. This "HHH" is a known issue, and we don't really know the cause. Please let me know if you have any information regarding this.
  2. This could be due to extreme values of the alpha channel -- hopefully there aren't too many of these.
  3. p1,p1,p2,p2 again, can be checked automatically and discarded.

Jyouhou (Author) commented Feb 28, 2018

Hi, thanks for your reply.

We checked the source code and found a possible reason why some chars may be missing from the image.

The way your code prints chars onto the image is:

(1) Compute a 'surface' from the depth information and the semantic segmentation result, so that the 'surface' is a connected subset of the pixels in the original image.
(2) Select a 'surface' and use pygame to render the text onto it one char at a time. Once the next char falls outside the 'surface', the rendering process is interrupted.

When the interruption happens, chars that have already been rendered are not removed, and the full text is still kept in the set of contours that becomes the ground truth (see the sketch below). This is where the problem is.

If I misunderstood it, please correct me.
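
Schematically, the flow I mean is something like this (a hand-written sketch with made-up helpers `fits` and `blit`, not the actual code):

```python
def print_text_on_surface(surface, text, blit, fits):
    """Sketch of the char-by-char rendering flow described above."""
    rendered = []
    for ch in text:
        if not fits(surface, ch):   # the next char falls outside the 'surface'
            break                   # rendering is interrupted here
        blit(surface, ch)           # this char is now permanently in the image
        rendered.append(ch)
    # After an interruption, the image contains only `rendered`, while the
    # ground-truth contours can still end up recording all of `text`.
    return rendered
```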

@ankush-me
Copy link
Owner

ankush-me commented Feb 28, 2018

I do not really understand your interpretation of the "interruption" ---
The code renders the full text string. It does check whether the text fits in the "surface" patch (as you call it above), but if it doesn't, it just moves on to the next sample, discarding the current one.

Can you point to the code where you think the text is still kept after "interruption"?

Jyouhou (Author) commented Apr 15, 2018

Hi,

I think the problem is here in this function.

`i` loops across `order`, instead of in incremental order. So when that line indicates a collision has happened, the function returns the first `i` elements of `loc`, `bbs`, and `order`.

(1) It's possible that the first `i` elements include chars that have not been rendered yet, while omitting ones that have already been rendered (sketched below).
(2) Line 383 doesn't filter out text that has not been rendered yet.

Please correct me if I'm wrong.
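
To make the pattern concrete, here is a sketch (made-up helper names, not the repository's code) of the buggy return and one possible fix:

```python
import numpy as np

def place_chars_buggy(loc, bbs, order, collides, render):
    # `order` is a shuffled index array, so the loop variable i jumps around
    # (e.g. 5, 2, 7, ...). On a collision, slicing with [:i] keeps the first
    # i *positions*, not the chars that were actually rendered so far.
    for i in order:
        if collides(bbs[i]):
            return loc[:i], bbs[:i], order[:i]  # wrong subset is returned
        render(loc[i])
    return loc, bbs, order

def place_chars_fixed(loc, bbs, order, collides, render):
    # One possible fix: track the indices that were actually rendered
    # and return exactly those, in render order.
    placed = []
    for i in order:
        if collides(bbs[i]):
            break
        render(loc[i])
        placed.append(i)
    idx = np.asarray(placed, dtype=int)
    return loc[idx], bbs[idx], idx
```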

ankush-me (Owner) commented

@Jyouhou Thank you -- yes, it does seem like a bug!

Heermosi commented

Hi,

> I think the problem is here in this function.
>
> `i` loops across `order`, instead of in incremental order. So when that line indicates a collision has happened, the function returns the first `i` elements of `loc`, `bbs`, and `order`.
>
> (1) It's possible that the first `i` elements include chars that have not been rendered yet, while omitting ones that have already been rendered.
> (2) Line 383 doesn't filter out text that has not been rendered yet.
>
> Please correct me if I'm wrong.

That seems like a bug, but it is not really a bug now!
Every time this function is called now, it only puts one text instance (a line, word, or paragraph) into the background.
Maybe that was what you'd call a bug fix.....
