Add initial rules and blocklist for Ukrainian #152

Open
wants to merge 2 commits into
base: main

somerandomguyontheweb
Contributor

This is an adaptation of the Belarusian rules to Ukrainian.

  • How many sentences did you get at the end?

654224 sentences.

  • How did you create the blocklist file?

I took a grammatical dictionary of Ukrainian (linked here), split the Wikipedia export into tokens, and kept in the blocklist only those tokens that do not occur in the dictionary, regardless of their frequency. Note that I didn't use the full export produced with --no-check, as it would bring in many irrelevant tokens (non-Cyrillic spellings, words that occur only in sentences that are filtered out anyway, etc.). Instead, I temporarily set max_sentences_per_text to std::usize::MAX, so that only tokens from sentences that pass the rules are considered.
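
For anyone who wants to reproduce the dictionary-based filtering described above, here is a minimal sketch in Python. It is not the script actually used for this PR: the function names, the token pattern, and the choice to lowercase everything are assumptions made for illustration, and the real tokenization may differ.

```python
import re

# Assumed token pattern: runs of letters, optionally joined by an apostrophe
# or hyphen (e.g. "м'ята", "пів-яблука"). This is not the extractor's own tokenizer.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into lowercased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(sentence)]


def build_blocklist(sentences, dictionary_forms):
    """Return every token that is absent from the dictionary, whatever its frequency.

    `sentences` is assumed to contain only sentences that already passed the rules;
    `dictionary_forms` is assumed to be an iterable of all inflected word forms.
    """
    known = {form.lower() for form in dictionary_forms}
    blocked = set()
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in known:
                blocked.add(token)
    return sorted(blocked)
```

Feeding it only the sentences that survive the rules (rather than the full --no-check export) keeps the blocklist limited to tokens that could actually end up in the output.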

  • Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio and comment (or link their comment) in the PR.

Spreadsheet here, not yet reviewed. As I'm not a competent speaker of Ukrainian myself, I'm going to contact the Common Voice Ukrainian community and update this PR once the review is complete.
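
A rough sketch of how the sample and the average error ratio could be computed once the reviews are in. The helper names and the data layout (one list of "has an error" flags per reviewer) are assumptions for illustration, not the format of the spreadsheet.

```python
import random


def draw_sample(sentences, size, seed=0):
    """Draw a reproducible random sample of at most `size` sentences."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(size, len(sentences)))


def error_ratio(flags):
    """Share of reviewed sentences that a single reviewer marked as erroneous."""
    if not flags:
        raise ValueError("no reviewed sentences")
    return sum(flags) / len(flags)


def average_error_ratio(reviews):
    """Mean of the per-reviewer error ratios.

    `reviews` maps a reviewer name to a list of booleans,
    where True means the sentence was marked as containing an error.
    """
    ratios = [error_ratio(flags) for flags in reviews.values()]
    return sum(ratios) / len(ratios)
```

Averaging per reviewer (instead of pooling all flags) gives each reviewer equal weight even if they reviewed different numbers of sentences.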

@somerandomguyontheweb
Contributor Author

As advised by Ukrainian colleagues who reviewed several hundred sentences in the sample, I'm adding more patterns to the rules file uk.toml. The number of extracted sentences has decreased to 620,015; the updated spreadsheet is here. This is still a work in progress.

@somerandomguyontheweb
Contributor Author

Some more sentences have been reviewed by native speakers of Ukrainian, and it is clear that the error ratio is still higher than acceptable. I'm putting this effort on hold for now, as there is no obvious way to filter out the remaining issues automatically.
