Add initial rules and blocklist for Ukrainian #152

somerandomguyontheweb · 2021-07-21T16:19:00Z

This is an adaptation of the Belarusian rules to Ukrainian.

How many sentences did you get at the end?

654224 sentences.

How did you create the blocklist file?

I took a grammatical dictionary of Ukrainian here, split the Wikipedia export into tokens, and kept in the blocklist only those tokens that don't occur in the dictionary, regardless of their frequency. Note that I didn't use the full export with --no-check, as it would bring in many irrelevant tokens (non-Cyrillic spellings, words that occur only in sentences that are filtered out anyway, etc.). Instead, I temporarily set max_sentences_per_text to std::usize::MAX, so that only tokens from sentences that pass the rules are considered.
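For readers who want to reproduce the idea, here is a minimal sketch of the dictionary-based filtering described above. It is not the script actually used for this PR: the function name `build_blocklist`, the tokenization regex, and the lowercasing step are all assumptions made for illustration, and the inputs are assumed to be the sentences that already passed the rules plus an iterable of dictionary word forms.

```python
import re

# Assumption: a token is a run of Ukrainian Cyrillic letters, optionally
# joined by an apostrophe or hyphen (e.g. "м'яч", "будь-який").
LETTERS = "а-щьюяєіїґ"
TOKEN_RE = re.compile(rf"[{LETTERS}]+(?:['’ʼ-][{LETTERS}]+)*")


def build_blocklist(sentences, dictionary_words):
    """Return a sorted list of tokens that are absent from the dictionary.

    Frequency is deliberately ignored: a single occurrence of an unknown
    token is enough to put it on the blocklist.
    """
    # Assumption: comparison is case-insensitive, so everything is lowercased.
    known = {word.lower() for word in dictionary_words}
    blocked = set()
    for sentence in sentences:
        for token in TOKEN_RE.findall(sentence.lower()):
            if token not in known:
                blocked.add(token)
    return sorted(blocked)
```

Calling `build_blocklist` with the extracted sentences and the dictionary's word forms yields one candidate blocklist entry per unknown token; writing the result one token per line gives a file in the usual blocklist shape.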

Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio and comment (or link their comment) in the PR.

Spreadsheet here, not yet reviewed. As I'm not a competent speaker of Ukrainian myself, I'm going to contact the Common Voice Ukrainian community and update this PR once the review is complete.

somerandomguyontheweb · 2021-08-19T17:36:45Z

As advised by Ukrainian colleagues who reviewed several hundred sentences in the sample, I'm adding more patterns to the rules file uk.toml. The number of extracted sentences decreased to 620015; the updated spreadsheet is here. This is still a work in progress.

somerandomguyontheweb · 2021-08-26T15:55:03Z

Some more sentences have been reviewed by native speakers of Ukrainian, and it is clear that the ratio of errors is still higher than acceptable. I'm putting this effort on hold for now, as there isn't any obvious way to filter out the remaining issues automatically.

Add initial rules and blocklist for Ukrainian

4585e6c

MichaelKohler added the waiting on error rate review label Jul 21, 2021

MichaelKohler approved these changes Jul 21, 2021

Update rules for Ukrainian

ceb5bc4

somerandomguyontheweb force-pushed the ukwiki branch from 01df52e to ceb5bc4 on August 19, 2021 17:30
