extract-file failed: 'attempt to subtract with overflow' #139

bact · 2021-04-17T16:12:04Z

(in attempt to fix #133)

As an experiment, to see how the sentence extractor rules for Thai would work with a proper sentence splitter,
I extracted all the text from Wikipedia using this command:

cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt

Then I used an external sentence splitter ( https://pythainlp.github.io/docs/2.3/api/tokenize.html#module-pythainlp.tokenize.crfcut ) to get better sentence boundaries and stored the result in another text file.

Then I tried to extract the sentences that match the rules from that file (one sentence per line),
and I got this error message:

thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63

The full error message and backtrace are here:

$ cargo run -- extract-file -l th -d ../texts/ >> wiki.th-new.txt
    Finished dev [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/common_voice_sentence_collector extract-file -l th -d ../texts/`
Loading rules at "./src/rules/th.toml"
Using Rules Rules { min_trimmed_length: 3, min_word_count: 1, max_word_count: 5, min_characters: 6, may_end_with_colon: false, quote_start_with_letter: true, needs_punctuation_end: false, needs_uppercase_start: false, needs_letter_start: true, allowed_symbols_regex: "[0-9 \u{200b}\u{200c}ก-ฮะ-\u{e39}เ-ๅ\u{e47}-\u{e4c}\\-\\.‚;:!\\?“”‘’\"'`]", disallowed_symbols: [], disallowed_words: {}, broken_whitespace: [String("  "), String(" ,"), String(" ."), String(" ;")], abbreviation_patterns: [String("[A-Z]{2,}"), String("[A-Z]+\\.*[A-Z]+"), String("[ก-ฮ]{1,3}\\.([ก-ฮ]{1,3}\\.)+")], other_patterns: [String("[\\.,:;-]$"), String("[,:;]\\S"), String("[\\.|\\?|!].+$"), String("^.{81,}$"), String("(^|\\s+)[ะาำๅ\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[เแโใไ](\\s+|$)"), String("[\u{200b}\u{200c}ก-ฮะ-\u{e39}เ-\u{e4c}‘’‚;:“”\"'`\\-\\?\\.!]{55,}"), String("^[\u{200b}\u{200c}]*[^ณ]\\s"), String("^[\u{200b}\u{200c}]*[บ\u{e49}าง|ก\u{e48}อน|เลย|แล\u{e49}ว|หร\u{e37}อไม\u{e48}|ไหม|ล\u{e48}ะ|ด\u{e49}วย|อ\u{e35}ก|และ|หร\u{e37}อ|ก\u{e31}บ|ก\u{e47}]\\s"), String("^\\S{2,3}[\u{200b}\u{200c}]*\\s"), String("\\s\\S{1,3}[\u{200b}\u{200c}]*$"), String("\\s[และ|หร\u{e37}อ|ก\u{e31}บ|เช\u{e48}น][\u{200b}\u{200c}]*$"), String("[เแโใไ]{2,}"), String("[ะาำๅ]{2,}"), String("[\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}]{2,}"), String("[\u{e38}\u{e39}]{2,}"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}]{2,}"), String("\u{e3a}{2,}"), String("\u{e4c}{2,}"), String("\u{e4d}{2,}"), String("\u{e4e}{2,}"), String("[เแโใไะาำๅ][\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}][\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}]")], replacements: [Array([String("\u{200b}"), String("")]), Array([String("\u{200c}"), String("")]), Array([String(" พ.ร.บ."), String(" พระราชบ\u{e31}ญญ\u{e31}ต\u{e34}")]), Array([String(" พ.ร.ก."), String(" 
พระราชกำหนด")]), Array([String(" พ.ศ. "), String(" พ\u{e38}ทธศ\u{e31}กราช ")]), Array([String(" ค.ศ. "), String(" คร\u{e34}สต\u{e4c}ศ\u{e31}กราช ")]), Array([String(" ม.ร.ว."), String(" หม\u{e48}อมราชวงศ\u{e4c}")]), Array([String(" ."), String(".")]), Array([String(" ,"), String(" ")]), Array([String(" :"), String(":")]), Array([String(" ;"), String(";")]), Array([String(" !"), String("!")]), Array([String(" ?"), String("?")]), Array([String(":"), String(": ")]), Array([String("?"), String("? ")]), Array([String("!"), String("! ")]), Array([String(","), String(" ")]), Array([String(".."), String(" ")]), Array([String("..."), String(" ")]), Array([String("...."), String(" ")]), Array([String(" ."), String(".")]), Array([String("    "), String(" ")]), Array([String("   "), String(" ")]), Array([String("  "), String(" ")]), Array([String("เเ"), String("แ")]), Array([String("\u{e4d}า"), String("ำ")]), Array([String("\u{e4d}\u{e48}า"), String("\u{e48}ำ")]), Array([String("\u{e4d}\u{e49}า"), String("\u{e49}ำ")]), Array([String("\u{e4d}\u{e4a}า"), String("\u{e4a}ำ")]), Array([String("\u{e4d}\u{e4b}า"), String("\u{e4b}ำ")]), Array([String("ฤา"), String("ฤๅ")]), Array([String("ฦา"), String("ฦๅ")])], even_symbols: [String("\""), String("'")], matching_symbols: [Array([String("‘"), String("’")]), Array([String("“"), String("”")])] }
Using disallowed_word_file = false
file_name = "../texts/wiki.th.all-filtered.txt"
thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63
stack backtrace:
   0:        0x103142b64 - std::backtrace_rs::backtrace::libunwind::trace::h79c24a8108eef51e
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:        0x103142b64 - std::backtrace_rs::backtrace::trace_unsynchronized::hf491b9388f4887f5
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x103142b64 - std::sys_common::backtrace::_print_fmt::h5132bce5284c3ec2
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x103142b64 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hba4e1e451ca8711d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:46:22
   4:        0x103160b4e - core::fmt::write::h7baaf1618474dae0
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/fmt/mod.rs:1094:17
   5:        0x10314011a - std::io::Write::write_fmt::hd293de47cc154cdf
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/io/mod.rs:1580:15
   6:        0x10314481f - std::sys_common::backtrace::_print::hb9d4bc7b9e0ae081
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:49:5
   7:        0x10314481f - std::sys_common::backtrace::print::h82a68481004d7b57
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:36:9
   8:        0x10314481f - std::panicking::default_hook::{{closure}}::h11b9cc5ac5c4d127
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:208:50
   9:        0x103144329 - std::panicking::default_hook::hfe650a460287c541
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:225:9
  10:        0x103144f75 - std::panicking::rust_panic_with_hook::h5212f5e986dcd234
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:591:17
  11:        0x103144ac9 - std::panicking::begin_panic_handler::{{closure}}::hd4a4baba3ac1c064
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:495:13
  12:        0x103143008 - std::sys_common::backtrace::__rust_end_short_backtrace::h5a76e76b61bd088d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:141:18
  13:        0x103144a5a - rust_begin_unwind
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:493:5
  14:        0x10315f00f - core::panicking::panic_fmt::h6b7498085d32aaee
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:92:14
  15:        0x10315ef67 - core::panicking::panic::he65ad651ff2e7951
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:50:5
  16:        0x102da151b - common_voice_sentence_collector::extractor::pick_sentences::h5dff4df1baa07207
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:101:63
  17:        0x102da0f5f - common_voice_sentence_collector::extractor::choose::hf29c68337c2c6437
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:68:9
  18:        0x102da0612 - common_voice_sentence_collector::extractor::extract::ha227c2098ed10762
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:27:29
  19:        0x102d8cdd3 - common_voice_sentence_collector::app::start::h28d652a6aaf2528f
                               at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:80:16
  20:        0x102d6400d - common_voice_sentence_collector::app::run::he91c6de970f5ecc7
                               at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:59:5
  21:        0x102d77b26 - common_voice_sentence_collector::main::hfd4bf9963f894313
                               at /Users/arthit/projects/cv-sentence-extractor/src/main.rs:8:5
  22:        0x102d77bc5 - core::ops::function::FnOnce::call_once::hacfea633331549bd
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:227:5
  23:        0x102d668cc - std::sys_common::backtrace::__rust_begin_short_backtrace::h741cc0dfecc9cbff
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:125:18
  24:        0x102d67f78 - std::rt::lang_start::{{closure}}::ha40e5aeaf02316c1
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:66:18
  25:        0x1031452e4 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h88801ec30fa967bc
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:259:13
  26:        0x1031452e4 - std::panicking::try::do_call::ha5838b1ed53bb3ce
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:379:40
  27:        0x1031452e4 - std::panicking::try::h2c2c426e3f3c01a8
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:343:19
  28:        0x1031452e4 - std::panic::catch_unwind::h383eb7eff10b175f
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panic.rs:431:14
  29:        0x1031452e4 - std::rt::lang_start_internal::h09b48eb36ffca70d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:51:25
  30:        0x102d67f4e - std::rt::lang_start::hc9ed7f08068d5206
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:65:5
  31:        0x102d77b46 - _main

Note that this is not urgent for me.
But anyone who is interested in extract-file may want to know about this.


MichaelKohler · 2021-04-17T16:16:13Z

Good catch! Can you attach the txt file here so I can try to reproduce?

bact · 2021-04-17T16:51:43Z

Here is the link to the txt file (51 MB):
https://drive.google.com/file/d/13GGr0wxwQXhWrTXTvmzdCodhJ9Atf9NJ/view?usp=sharing

MichaelKohler · 2021-04-17T22:23:18Z

Thank you. With the fix I just pushed, I was able to run through the whole file (it took some time, but it worked).
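
For readers unfamiliar with this panic: in debug builds, Rust panics with 'attempt to subtract with overflow' when a larger unsigned integer is subtracted from a smaller one. The sketch below is only a generic illustration of that failure mode and of two standard-library ways to guard against it. It is not the code at src/extractor.rs:101 and not the actual fix, and the function and parameter names are hypothetical.

```rust
/// Hypothetical helper (not from extractor.rs): how many more items may be picked.
/// Writing `limit - already_picked` directly would panic in a debug build
/// whenever `already_picked > limit`, because `usize` cannot go below zero.
fn remaining_slots(limit: usize, already_picked: usize) -> usize {
    // `saturating_sub` clamps the result at 0 instead of overflowing.
    limit.saturating_sub(already_picked)
}

/// Same idea, but the caller is told explicitly when the subtraction
/// would have gone below zero (`None`) instead of silently getting 0.
fn remaining_slots_checked(limit: usize, already_picked: usize) -> Option<usize> {
    limit.checked_sub(already_picked)
}
```

With these helpers, `remaining_slots(3, 5)` returns 0 rather than panicking, and `remaining_slots_checked(3, 5)` returns `None`, so the caller can decide how to handle the case. Which of the two is appropriate depends on whether going below zero is expected or a sign of a logic error.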

bact · 2021-04-18T02:45:03Z

Thank you! That was quick!

bact · 2021-04-18T03:01:24Z

Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.

Just to confirm that we cannot submit the output to the Sentence Collector. thx

MichaelKohler · 2021-04-18T09:50:55Z

> Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.

If there is no manual intervention needed, we might be able to find a solution, even if it's not just the code in this repo. However, we definitely need to make sure we're not taking more than 3 sentences per article (and no sentences from articles with fewer than 3 sentences in them). For this case, I'm not sure how we can guarantee that, though :/

> Just to confirm that we cannot submit the output to the Sentence Collector. thx

The output of the extraction wouldn't go through the Sentence Collector. Once extractor rule files get merged we can run an automatic extraction and then add the output directly to the Common Voice repo. The important thing here is that it's run through our process so we can guarantee that we indeed did not take more than 3 per article.
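
To make the per-article constraint concrete, here is a minimal sketch of the rule as stated in this thread: at most 3 sentences per article, and none from articles with fewer than 3 sentences. It assumes article boundaries are known, which is exactly what a one-sentence-per-line file does not provide. The function name and the `is_valid` callback are hypothetical stand-ins, not the repo's API, and the sketch simply takes the first valid sentences rather than reproducing whatever selection the extractor actually performs.

```rust
/// Limit discussed in this thread: no more than 3 sentences per article.
const MAX_SENTENCES_PER_ARTICLE: usize = 3;

/// Hypothetical sketch: choose sentences from a single article.
/// `is_valid` stands in for the language rule checks.
fn pick_for_article<'a>(
    sentences: &[&'a str],
    is_valid: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    // Articles with fewer than 3 sentences contribute nothing.
    if sentences.len() < MAX_SENTENCES_PER_ARTICLE {
        return Vec::new();
    }
    sentences
        .iter()
        .copied()
        // Keep only sentences that pass the (stand-in) rule check.
        .filter(|&s| is_valid(s))
        // Never take more than the per-article limit.
        .take(MAX_SENTENCES_PER_ARTICLE)
        .collect()
}
```

Because the limit is enforced per call, the guarantee only holds if each call receives exactly one article's sentences; a flat file with no article markers cannot be fed to a function like this without losing that guarantee.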

bact · 2021-04-18T15:29:18Z

Thank you for the clarification.

MichaelKohler added the bug (Something isn't working), extract-improvements, needs debugging, and P1 labels Apr 17, 2021

MichaelKohler self-assigned this Apr 17, 2021

MichaelKohler closed this as completed in b8d03e8 Apr 17, 2021

bact mentioned this issue Jun 9, 2021

Adding Thai rules for CV Sentence Extractor #137

Draft
