Add initial rules and blocklist for Ukrainian #152
This is an adaptation of the Belarusian rules to Ukrainian.
654224 sentences.
I took a grammatical dictionary of Ukrainian here, split the Wikipedia export into tokens, and kept in the blocklist only those tokens that don't occur in the dictionary, regardless of their frequency. Note that I didn't use the full export with `--no-check`, as it would bring in many irrelevant tokens (non-Cyrillic spellings, words that only occur in sentences which are filtered out anyway, etc.). Instead, I temporarily set `max_sentences_per_text` to `std::usize::MAX`, so that only tokens from sentences that pass the rules are considered.

Spreadsheet here, not yet reviewed. As I'm not a competent speaker of Ukrainian myself, I'm going to contact the Common Voice Ukrainian community and update this PR once the review is complete.
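For reviewers, the blocklist construction described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this PR: the tokenizer regex and the names `tokenize` and `build_blocklist` are assumptions, and the dictionary is taken to be a plain collection of word forms.

```python
import re

# Assumed tokenizer: runs of letters, optionally joined by an apostrophe
# or a hyphen (e.g. "м'ята"). The real tokenization may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into lowercased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(sentence)]


def build_blocklist(sentences, dictionary):
    """Return the sorted tokens that never occur in the dictionary.

    `sentences` should already be limited to sentences that pass the
    rules; frequency is deliberately ignored.
    """
    known = {word.lower() for word in dictionary}
    blocklist = set()
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in known:
                blocklist.add(token)
    return sorted(blocklist)
```

Because frequency is ignored, a token seen only once still ends up in the blocklist if the dictionary does not list it, which is why the result needs review by a competent speaker.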