extract-file failed: 'attempt to subtract with overflow' #139
Comments
Good catch! Can you attach the txt file here so I can try to reproduce?
Here is the link to the txt file (51 MB).
Thank you. With the fix I just pushed, I was able to run through the whole file (it took some time, but it worked).
Thank you! That was quick!
Btw, the resulting file from this process will not pass the legal requirement, right? It doesn't guarantee that only 3 sentences will be picked from each article. I just want to confirm that we cannot submit the output to the Sentence Collector. Thanks.
If no manual intervention is needed, we might be able to find a solution, even if it involves more than just the code in this repo. However, we definitely need to make sure we're not taking more than 3 sentences per article (and no sentences from articles with fewer than 3 sentences). For this case I'm not sure how we can guarantee that, though :/
The output of the extraction wouldn't go through the Sentence Collector. Once extractor rule files get merged, we can run an automatic extraction and then add the output directly to the Common Voice repo. The important thing is that it's run through our process, so we can guarantee that we did not take more than 3 sentences per article.
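To make the constraint concrete, here is a minimal sketch of the per-article rule described above. It is not code from this repo: `pick_sentences` and `MAX_SENTENCES_PER_ARTICLE` are hypothetical names, and it simply takes the first 3 sentences to stay deterministic, which is an assumption about the selection step rather than a description of what the extractor does.

```rust
// Hypothetical constant and function names, for illustration only.
const MAX_SENTENCES_PER_ARTICLE: usize = 3;

/// Returns at most 3 sentences from one article, and none at all
/// if the article has fewer than 3 sentences.
/// Assumption: takes the first 3; the real selection may differ.
fn pick_sentences<'a>(article: &[&'a str]) -> Vec<&'a str> {
    if article.len() < MAX_SENTENCES_PER_ARTICLE {
        // Too short: contribute nothing.
        return Vec::new();
    }
    article
        .iter()
        .copied()
        .take(MAX_SENTENCES_PER_ARTICLE)
        .collect()
}
```

Note that the function needs the sentences grouped by article. A flat file with one sentence per line no longer carries article boundaries, which is why the limit is hard to guarantee for that input.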
Thank you for the clarification.
(In an attempt to fix #133)
As an experiment, to see how the sentence extractor rules for Thai would work with a proper sentence splitter,
I extracted all the text from Wikipedia using this command:
cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt
Then I used an external sentence splitter ( https://pythainlp.github.io/docs/2.3/api/tokenize.html#module-pythainlp.tokenize.crfcut ) to get better sentence boundaries and stored the result in another text file.
Then I tried to extract the sentences that match the rules from that line-break-separated file (one sentence per line),
and I got this error message:
thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63
The full error message and backtrace are here:
Note that this is not urgent for me.
But anyone who is interested in extract-file may like to know about this.
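For context on the panic message: in debug builds (the default for `cargo run`), Rust panics with 'attempt to subtract with overflow' when an unsigned subtraction would go below zero. The sketch below shows two standard-library ways to avoid that. It is not the code at src/extractor.rs:101, and the helper names `remaining` and `remaining_checked` are hypothetical.

```rust
/// Hypothetical helper, not taken from src/extractor.rs.
/// Clamps at zero instead of panicking when `taken > available`.
fn remaining(available: usize, taken: usize) -> usize {
    available.saturating_sub(taken)
}

/// Same subtraction, but reports the underflow as `None`
/// so the caller can decide how to handle it.
fn remaining_checked(available: usize, taken: usize) -> Option<usize> {
    available.checked_sub(taken)
}
```

A plain `available - taken` with `taken > available` is the kind of expression that produces this panic. Whether clamping or returning `None` is the right behaviour depends on what the subtraction at that line is meant to compute.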