Initial addition of the Russian language #66
Draft
iLeonidze wants to merge 4 commits into common-voice:main from iLeonidze:russian
Changes from 3 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,258 @@ | ||
хуй | ||
хуина | ||
хуйло | ||
опизденевшие | ||
пизда | ||
др | ||
доп | ||
ул | ||
им | ||
ст | ||
св | ||
чел | ||
шт | ||
пр | ||
см | ||
мн | ||
пл | ||
мл | ||
уд | ||
ср | ||
др | ||
рус | ||
ед | ||
чл | ||
корр | ||
еп | ||
пп | ||
оз | ||
кг | ||
гв | ||
рр | ||
тд | ||
км | ||
кн | ||
мм | ||
юр | ||
ур | ||
дв | ||
ев | ||
яп | ||
шп | ||
яз | ||
цз | ||
тт | ||
сб | ||
пн | ||
вт | ||
ср | ||
чт | ||
пт | ||
вск | ||
эп | ||
зп | ||
сц | ||
уу | ||
ув | ||
оо | ||
би | ||
мя | ||
ал | ||
сс | ||
уг | ||
ол | ||
сл | ||
узб | ||
эк | ||
кр | ||
хр | ||
кс | ||
рч | ||
вн | ||
ов | ||
аг | ||
уч | ||
хх | ||
дд | ||
тп | ||
мч | ||
вр | ||
ьо | ||
ин | ||
оф | ||
ус | ||
тж | ||
жд | ||
дл | ||
мд | ||
фр | ||
эм | ||
ит | ||
оп | ||
лл | ||
ак | ||
эл | ||
рп | ||
вм | ||
3-бет | ||
аббр | ||
аббрев | ||
абл | ||
абс | ||
абх | ||
авар | ||
Авв | ||
авг | ||
Авд | ||
австр | ||
австрал | ||
авт | ||
Агг | ||
агр | ||
адж | ||
адм | ||
адыг | ||
азерб | ||
азиат | ||
акад | ||
академ | ||
акк | ||
акц | ||
алб | ||
алг | ||
алгебр | ||
алж | ||
алт | ||
алф | ||
альм | ||
альп | ||
ам | ||
Ам | ||
амер | ||
анат | ||
англ | ||
ангол | ||
аннот | ||
антич | ||
ао | ||
ап | ||
Апок | ||
апп | ||
апр | ||
ар | ||
араб | ||
арам | ||
аргент | ||
арифм | ||
арм | ||
арт | ||
артез | ||
арх | ||
археол | ||
архиеп | ||
архим | ||
архип | ||
архит | ||
ас | ||
асб | ||
асс | ||
ассир | ||
ассист | ||
астр | ||
астрон | ||
ат | ||
ата | ||
ати | ||
атм | ||
афг | ||
афр | ||
ацет | ||
б-ка | ||
б-н | ||
б-ца | ||
б-чка | ||
бат-н | ||
башк | ||
бел | ||
белорус | ||
бзн | ||
библ | ||
биогр | ||
биол | ||
бирм | ||
Бл | ||
блгв | ||
блгвв | ||
блж | ||
блр | ||
больн | ||
бр | ||
браз | ||
брет | ||
брит | ||
бц | ||
быв | ||
Быт | ||
бюдж | ||
бюлл | ||
вл | ||
Вл | ||
вс | ||
вт | ||
вып | ||
г-жа | ||
г-н | ||
Гбайт | ||
ГВт | ||
гг | ||
Гкал | ||
гл | ||
глаг | ||
гм | ||
гос | ||
гр | ||
грн | ||
дал | ||
дБ | ||
деепр | ||
дееприч | ||
Дж | ||
диак | ||
долл | ||
дптр | ||
др | ||
зак | ||
зам | ||
Зв | ||
изд-во | ||
кал | ||
кат | ||
кв | ||
кВА | ||
кВт | ||
кВтч | ||
ккал | ||
корп | ||
корр | ||
Мб | ||
Мбит | ||
МВт | ||
мг | ||
МГц | ||
межд | ||
междунар | ||
мес | ||
мест | ||
нареч | ||
Бк | ||
Вт | ||
га | ||
гг | ||
Гг | ||
Ггц | ||
кг | ||
км | ||
кт | ||
мкс | ||
мм | ||
сек |
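The diff does not show how this word list is consumed. As a minimal sketch, assuming it acts as a blocklist where a sentence is rejected if any whitespace-separated token (with surrounding punctuation stripped) equals an entry, the check could look like the following. The names `load_blocklist` and `contains_blocked_word` are hypothetical, the comparison is case-sensitive by assumption, and only a handful of the 258 entries are used.

```python
import string

# A few entries copied from the list above; the real file has 258 lines.
SAMPLE_BLOCKLIST = """\
др
ул
км
г-н
кВт
"""

# Punctuation stripped from token edges (a hypothetical choice, not from the PR).
EDGE_PUNCT = string.punctuation + "«»…"


def load_blocklist(text):
    """Parse one entry per line, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def contains_blocked_word(sentence, blocklist):
    """Return True if any token, stripped of edge punctuation, is in the blocklist."""
    for token in sentence.split():
        if token.strip(EDGE_PUNCT) in blocklist:
            return True
    return False


blocklist = load_blocklist(SAMPLE_BLOCKLIST)
print(contains_blocked_word("Он живёт на ул. Ленина.", blocklist))
print(contains_blocked_word("Мы гуляли по улице Ленина.", blocklist))
```

Stripping only the edges of each token keeps hyphenated entries such as `г-н` intact while still catching `ул.` written with its trailing period.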
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
min_trimmed_length = 3 | ||
|
||
# For count = 2 we have sentences like: "Это он", "Будет беда" | ||
# and a lot of trash abbreviations like "до н.", "и доп." | ||
min_word_count = 3 | ||
|
||
max_word_count = 14 | ||
|
||
# In Russian, words can consist of one letter, so no restrictions | ||
min_characters = 0 | ||
|
||
may_end_with_colon = false | ||
quote_start_with_letter = true | ||
needs_punctuation_end = false | ||
needs_letter_start = true | ||
|
||
# Apparently, in some places the sentences are cut incorrectly, | ||
# which is why we get some part of the sentence, and not its entirety. | ||
# This is required to fix sentences like: | ||
# с Иваном Галамяном. | ||
needs_uppercase_start = true | ||
|
||
# The following symbols are either absent from the Russian language | |
# or take part in constructs that are difficult or impossible to voice, | |
# or cause problems in sentences | |
# (for example, opening quotes without closing ones) | |
disallowed_symbols = [ | ||
'<', '>', '+', '*', '\', '#', '@', '^', '[', ']', '(', ')', '/', | ||
'é', 'è', 'à', 'ç', 'Å', '薛', '⋅', '·', 'ѿ', '=', '|', '≡', '→', | ||
'×', 'Ա', 'բ', 'դ', 'ի', 'ս', 'յ', 'շ', 'ո', '•', 'थ्', '©', '¸', | ||
'天', '不', '言', ',', '以', '行', '與', '事', '示', '之', '而', | ||
'已', '矣', '《', '大', '学', '集', '注', '》', '多', '识', '录', | ||
'「', '鴃', '」', '鳩', '鳧', '鶯', 'ا', 'ل', 'ر', 'د', 'ن', | ||
'เ', 'วฺ', 'ส', 'ลฺ', 'ยุ', 'ทิ', 'โ', 'ร', 'ชฺ', 'ท', 'ยุ', 'ตฺ', 'สฺ', | ||
'ย', 'พ', 'ทฺ', 'นิ', 'इ', '_', 'Ѳ', '‹', '›', '`', | ||
'і', 'ў', '~', '®', '%', '⟨', '⟩', '¬', '&', '≤', | ||
'α', 'β', 'Γ', 'γ', 'Δ', 'δ', 'ε', 'ζ', 'η', 'Θ', 'θ', 'ι', 'Ә', | ||
'Λ', 'λ', 'μ', 'ν', 'Ξ', 'ξ', 'ρ', 'Σ', 'σ', 'ς', 'τ', | ||
'υ', 'Φ', 'φ', 'χ', 'Ψ', 'ψ', 'Ω', 'ω', 'a', 'A', 'b', 'B', 'c', | ||
'C', 'd', 'D', 'e', 'E', 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', | ||
'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'N', 'o', 'O', 'p', | ||
'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'v', 'V', | ||
'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z', | ||
'җ', 'ә', 'ү', 'ː', 'қ', 'ћ', 'Ä', 'Ҋ', 'ј', 'Ι', 'Χ', | ||
|
||
# There are a lot of broken sentences like: | |
# У меня щемило сердце: «Что мы сделаем? | ||
# Такой он был веселый, бодрый, нас воодушевлял: „Вы знаете, не бойтесь! | ||
# История и культура»". | ||
# Пушкин, прочитав поэму, сказал «— Далеко мальчик пойдет». | ||
# После увольнения он пытается создать новую жизнь...Жизнь, в свою очередь, строит ему множество «баррикад" | ||
# Пушкин, Набоков… мало кого еще можно поставить в этот ряд? | ||
# Let's remove them all; we have plenty of good sentences without these problems | |
'«', '»', '"', '„', '“', '”', '‘', '…', | ||
'{', '}', '(', ')', | ||
|
||
|
||
# Sentences with accented chars cannot be voiced and | |
# should be removed, for example: | |
# е́р | ||
'ѣ', 'Ѣ', 'І', 'ң', 'ă', 'ĕ', ' ́', '́', | ||
|
||
# Sentences with invisible chars (use vim to work with them) | ||
'', '', '', '་' | ||
] | ||
|
||
broken_whitespace = [" ", " ,", " .", " ?", " !", " ;", " \""] | ||
|
||
abbreviation_patterns = [ | ||
# Abbreviations: | ||
|
||
# 1. М.Н.С. | ||
"[А-Я]+\\.*[А-Я]", | ||
|
||
# 2. А. Пушкин | ||
"[А-Я]\\.", | ||
|
||
# 3. Дж. | ||
"[А-Я][а-я]\\.", | ||
|
||
# 4. СССР | ||
"[А-Я]{2,}", | ||
|
||
# 5. г. Пушкина. | ||
"\\s[а-я]\\.", | ||
|
||
# 6. — — | ||
"— —", | ||
|
||
# 7. сайка фито— и зоопланктоном | ||
"[а-я]—\\s[а-я]", | ||
|
||
# 8. Повод —первое упоминание | ||
"[а-я]\\s—[а-я]", | ||
|
||
# 9. с разрывом связей С—С и образованием | ||
"[А-Я]—[А-Я]", | ||
|
||
# 10. Words that look like ordinary words but cannot end a sentence, | |
# which means they are abbreviations | |
"\\s(ор|ок|ом|ум|те|ил)\\.", | ||
|
||
# 11. по учению св.отцов | ||
"[а-я]\\.[а-я]", | ||
|
||
# 12. др.-евр. чл.-корр. | ||
"[а-я]{1,3}\\.-[а-я]{1,4}\\.", | ||
|
||
# 13. ул.Тюменская | ||
"[А-Яа-я]\\.[А-Яа-я]", | ||
|
||
# 14. Ставропольский кр., | ||
"\\.,", | ||
|
||
# It is not possible to disallow ALL bad symbols one by one, so let's use a regex for this | |
# 15. Remove any other bad symbols | |
"[^\\dА-Яа-яёЁ\\s:,.-‑?;!—‐–'·=’―−”‘]" | ||
|
||
] | ||
|
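The numbered patterns above can be tried against the example phrases from their own comments. This is a sketch only: it assumes a sentence is rejected when any pattern matches anywhere in it (the diff does not show how `abbreviation_patterns` is applied), and it uses Python's `re` rather than whatever regex engine the project runs. The TOML escape `\\.` becomes `\.` in a Python raw string, and `matching_patterns` is a hypothetical helper name.

```python
import re

# A subset of the patterns from the config, keyed by their comment number.
PATTERNS = {
    2: r"[А-Я]\.",                     # А. Пушкин
    4: r"[А-Я]{2,}",                   # СССР
    5: r"\s[а-я]\.",                   # г. Пушкина.
    8: r"[а-я]\s—[а-я]",               # Повод —первое упоминание
    12: r"[а-я]{1,3}\.-[а-я]{1,4}\.",  # др.-евр. чл.-корр.
    14: r"\.,",                        # Ставропольский кр.,
}
COMPILED = {num: re.compile(pattern) for num, pattern in PATTERNS.items()}


def matching_patterns(sentence):
    """Return the numbers of all patterns that match somewhere in the sentence."""
    return [num for num, rx in COMPILED.items() if rx.search(sentence)]


examples = [
    "Стихи написал А. Пушкин",
    "Он родился в СССР",
    "Повод —первое упоминание",
    "Сегодня хорошая погода.",
]
for sentence in examples:
    print(sentence, "->", matching_patterns(sentence))
```

Returning the pattern numbers rather than a bare boolean makes it easier to see which rule rejected a sentence when tuning the list.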
Note that this won't be used as `allowed_symbols_regex` is set. However, given that those invisible chars should not be allowed by the above regex, I guess that's fine and it can be completely removed?
No. As I found out, invisible chars are not detected by the regex for some reason. Perhaps because they are part of `\s`. Therefore, they had to be defined here, and can't be removed.
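One other possible explanation, offered as a hypothesis rather than a confirmed diagnosis: in the negated character class of pattern 15 in the diff, the sequence `.-‑` places an ASCII hyphen between two characters, so a regex engine reads it as a range from `.` (U+002E) up to the character after the hyphen. If that character is the non-breaking hyphen U+2011, as assumed below, the range covers Latin letters and zero-width characters such as U+200B, and the class never flags them. The sketch uses Python's `re` (the project's engine may differ in details) and U+200B as a stand-in, since the actual invisible characters in the diff cannot be identified from the page.

```python
import re

# Pattern 15 as written in the config, with the TOML escapes undone.
# Non-ASCII punctuation is spelled with \u escapes so it is visible;
# \u2011 right after "-" is an assumption about the original character.
AS_WRITTEN = re.compile(
    "[^\\dА-Яа-яёЁ\\s:,.-\u2011?;!\u2014\u2010\u2013'\u00b7=\u2019\u2015\u2212\u201d\u2018]"
)

# Same class with the ASCII hyphen escaped, so it is a literal and not a range.
HYPHEN_ESCAPED = re.compile(
    "[^\\dА-Яа-яёЁ\\s:,.\\-\u2011?;!\u2014\u2010\u2013'\u00b7=\u2019\u2015\u2212\u201d\u2018]"
)

ZWSP = "\u200b"  # zero-width space, a stand-in invisible character

samples = {
    "latin": "Это test",
    "zwsp": "Это" + ZWSP + "тест",
    "clean": "Это тест.",
}

# Columns: sample name, flagged by the pattern as written, flagged with "\-".
for name, text in samples.items():
    print(name, bool(AS_WRITTEN.search(text)), bool(HYPHEN_ESCAPED.search(text)))

# In Python, \s does not match U+200B, so \s alone would not let it through.
print(bool(re.fullmatch(r"\s", ZWSP)))
```

If this reading is right, it might also explain why the Latin alphabet had to be listed explicitly in `disallowed_symbols`, though that is speculation.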
yeah, that's probably what is happening here. Could you just use the specific character you want for whitespace instead of `\s` or is my Regex knowledge completely failing me here?
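The difference between `\s` and a literal space can be checked with a small comparison. This is a sketch in Python with deliberately shortened character classes, not the full pattern from the config, and the `flagged` helper is a hypothetical name. In Python's `re`, `\s` admits tabs, newlines and the no-break space, while a literal space admits only U+0020.

```python
import re

# Shortened, illustrative classes; the real pattern allows more punctuation.
WITH_BACKSLASH_S = re.compile(r"[^А-Яа-яёЁ\s.,]")
WITH_LITERAL_SPACE = re.compile(r"[^А-Яа-яёЁ .,]")

samples = {
    "plain space": "Это тест.",
    "tab": "Это\tтест.",
    "no-break space": "Это\u00a0тест.",
    "zero-width space": "Это\u200bтест.",
}


def flagged(rx, text):
    """Return True if the text contains a character outside the allowed class."""
    return rx.search(text) is not None


# Columns: sample name, flagged with \s in the class, flagged with a literal space.
for name, text in samples.items():
    print(name, flagged(WITH_BACKSLASH_S, text), flagged(WITH_LITERAL_SPACE, text))
```

With these shortened classes the zero-width space is flagged in both variants, so switching to a literal space mainly tightens the handling of tabs and no-break spaces.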