added hindi language toml and wiki sample #89
base: main
Thanks. A few comments and suggestions:
- Are there any Hindi-script-specific symbols we might not want? (I don't know Hindi at all.)
- Are there Hindi specific abbreviation patterns?
- Did you check if some of the newer rules might be helpful such as even_symbols or replacements?
- Did you run the blacklist generation script referenced in the Readme? For other languages, disallowing rarely used words greatly increased the quality, since it let us remove uncommon foreign words and foreign names.
- How many sentences did you get in total? I assume 4500 is just for the review?
Happy to help as much as I can, probably mostly on the technical side as I don't know Hindi at all.
Also, can you remove the sample file and host it somewhere online instead? Ultimately, we do not want it to be part of the source code here.
Thanks, Michael. Responses to your questions below.
Thanks.
Thanks for your answers!
Which limit did you choose?
A maximum of 3 sentences per article is a legal requirement, we can't go higher than that. Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?
Also note that the local command for extraction will now be:
Happy to answer any question you may have and thanks for your efforts! I'll comment on the change in |
Pulling latest changes from Common voice master
@karthiksibm can you please also have a look at the other comments I've made?
I've made the updates to hi.toml. Thanks for your comments.
Thanks for the additions, this starts to look really good! I have two more comments.
…Updated disallowed symbols in toml
Thanks for all the changes. This looks good to me from a technical perspective. Let's see if the error rate is good as well.
Error Rate Review:
These numbers are a bit too high. @nukeador I forgot what the required minimum was, can you remind me? Can you look at the sentences and see if you can
Thanks for your efforts!
The error rate should be between 5% and 7%. Anything lower is of course great, but probably very hard to achieve.
Thanks @MichaelKohler . Looks like it has too many complicated, long words which make them hard to pronounce. To filter out such long words, is there a parameter to set the Meanwhile, I will play around to try and catch them into a better blacklist words set. |
There is currently no such setting, but you could use a Regex in the |
@karthiksibm it seems you can use |
If you merge latest master into your branch, you can also use the |
.. but those words are still appearing with a high frequency? If not, increasing the minimum frequency of the blacklist might also be a way to go |
Error Rate Review:
Looks better now after improving the blacklist.
Thanks, that looks better. How did you improve the blacklist? Which maximum frequency were you using before, and which one now? Also, does this PR include the latest list? I'm not seeing a recent commit. I just saw the following sentence:
You might want to have a look at the |
@karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences? Feel free to join our Matrix chat so we can help you get more sentences. My understanding is that Hindi Wikipedia has 180K articles, so it's odd that you are only getting 90K sentences. https://chat.mozilla.org/#/room/#common-voice-sentence-extractor:mozilla.org
Check #89 (comment) where the answer is "We get around 90K sentences." However, the following questions should still be answered before we proceed here. I'm mostly worried about not having a recent commit for the blacklist change.
@MichaelKohler sorry, I got busy with other projects. I'll get back quickly with the answer to your question.
@MichaelKohler I've checked in the latest rules and the blacklist file. The blacklist was generated with a frequency of 50 and also includes words longer than 9 characters. That improved readability.
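For readers unfamiliar with the approach, here is a minimal Python sketch of a blacklist built along these lines. It assumes that "frequency of 50" means words seen fewer than 50 times in the extracted text are blacklisted, and that word length is counted in Unicode code points (so Devanagari combining marks each count as one character). `build_blacklist` and its parameters are hypothetical names for illustration; this is not the script referenced in the Readme.

```python
from collections import Counter


def build_blacklist(sentences, max_frequency=50, max_word_length=9):
    """Return a sorted list of words to disallow.

    Assumption: a word is blacklisted if it occurs fewer than
    `max_frequency` times, or if it is longer than `max_word_length`
    code points, regardless of how often it occurs.
    """
    # Count whitespace-separated words across all sentences.
    counts = Counter(word for s in sentences for word in s.split())
    return sorted(
        word
        for word, n in counts.items()
        if n < max_frequency or len(word) > max_word_length
    )
```

Raising `max_frequency` or lowering `max_word_length` makes the blacklist stricter, which tends to lower the error rate at the cost of fewer extracted sentences.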
@@ -0,0 +1,21 @@
min_trimmed_length = 3
min_word_count = 5
Can you elaborate on the change from 1 to 5 here? What do sentences shorter than 5 words look like? Would some of them be valid?
Having a lower limit will result in a lot of sentences shorter than 5 words, which look like:
यह बड़ा है (this is big)
और कहाँ है (where else is this)
क्या मिलेगा (what will you get)
..and so on
These don't seem to be very useful and will dominate the resulting dataset. What do you think?
Smaller sentences are also OK if they make sense. How many are we losing when this is applied, what's the total with and without these sentences?
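One way to answer this is to count candidate sentences with and without the limit. Below is a minimal Python sketch, assuming whitespace-separated words and a plain list of candidate sentences. `split_by_min_words` is a hypothetical helper for illustration and is not part of the extractor.

```python
def split_by_min_words(sentences, min_word_count):
    """Return (kept, dropped) counts for a given minimum word count."""
    kept = sum(1 for s in sentences if len(s.split()) >= min_word_count)
    return kept, len(sentences) - kept


# The short examples quoted earlier in this thread.
samples = ["यह बड़ा है", "और कहाँ है", "क्या मिलेगा"]

for limit in (1, 5):
    kept, dropped = split_by_min_words(samples, limit)
    print(f"min_word_count={limit}: kept={kept}, dropped={dropped}")
```

Running the same comparison on the full extraction output would give the totals with and without the short sentences.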
This PR will need to be updated to resolve the merge conflicts, and there are still open questions, such as nukeador's question about losing sentences. Additionally, it's worth investing a bit more time to bring down the error rate (preferably < 5%).
How many sentences did you get at the end?
4500 lines of output.
How did you create the blacklist file?
Removed all English-language characters.
Review
For review please use sample file wiki.hi.txt.