Initial Polish language rules and blocklist #168
base: main
Conversation
Added a blocklist based on word frequency with a threshold of 40. The blocklist also incorporates an English wordlist, as many sentences, e.g. from Wikipedia, contain English words. Added additional rules to pl.toml to filter out some foreign letter combinations. Added two more abbreviations to the Python segmenter to improve sentence segmentation.
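For reference, a minimal sketch of how such a frequency-threshold blocklist could be generated. The file paths, the `build_blocklist` name, and the exact tokenization are assumptions for illustration, not the actual script used in this PR:

```python
# Hypothetical sketch: block words rarer than the threshold, plus English words.
from collections import Counter
import re

THRESHOLD = 40  # minimum corpus frequency mentioned in this PR

def build_blocklist(corpus_path, english_wordlist_path, out_path):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # \w+ is Unicode-aware in Python 3, so Polish letters are kept
            counts.update(re.findall(r"\w+", line.lower()))

    # Words below the threshold are likely typos, foreign words, or rare names
    blocked = {word for word, n in counts.items() if n < THRESHOLD}

    # Merge in an English wordlist (e.g. the one from the dwyl repo),
    # since many Wikipedia sentences contain English words
    with open(english_wordlist_path, encoding="utf-8") as f:
        blocked.update(w.strip().lower() for w in f if w.strip())

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(blocked)))
```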
Thanks for doing this! I only have one comment, the rest looks good to me and I don't have any more suggestions. Well done!
Most problems are with surnames and place names, but I don't have a reasonably complete way of filtering them at the moment. I'm open to additional suggestions.
I looked into this a while ago and was very surprised how often quite hard-to-pronounce words appear in Wikipedia. There is only so much we can do, and that's why we don't require a 0% error rate.
[" w pow. ", " w powiecie "], | ||
[" w tzw. ", " w tak zwanym "], | ||
[" tzw. ", " tak zwany "], | ||
# Centuries |
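To illustrate the mechanism, here is a hedged sketch of how pairs like these could be applied to a sentence; `apply_replacements` is a hypothetical stand-in, not the extractor's actual code:

```python
# Hypothetical illustration of applying replacement pairs to a sentence.
replacements = [
    [" w pow. ", " w powiecie "],
    [" w tzw. ", " w tak zwanym "],
    [" tzw. ", " tak zwany "],
]

def apply_replacements(sentence, pairs):
    for old, new in pairs:
        sentence = sentence.replace(old, new)
    return sentence

print(apply_replacements("Wieś w pow. krakowskim.", replacements))
# -> "Wieś w powiecie krakowskim."
```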
Hah, interesting idea. I kinda like it :O
Actually it was started by user Scarfmonster, but I liked the idea, though it may be a bit tedious.
)

sentence_tokenizer = nltk.data.load("tokenizers/punkt/polish.pickle")
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
These abbreviations would then still end up in the sentences, right? I see that a few of those are indeed being replaced or checked for, but quite a few are not. Are those all easily pronounceable? Alternatively, I would suggest adding those to replacements as well, and then we could also simplify this part here. What do you think?
Those abbreviations do end up in the split sentences. The reason for feeding them to the Python punkt segmenter is that without them some sentences get split incorrectly (e.g. "Jego współpracownikami byli inż. A. Schmidta oraz technolodzy: Lipko, Obryś i Rohn." is split into two). So here it is purely to improve the general quality of the splitting. They are later filtered out by the abbreviations regex if not replaced, as most of them may be read incorrectly.
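A small hedged demo of the effect, assuming the NLTK punkt Polish model is installed; the exact split counts depend on the punkt model version:

```python
# Hypothetical demo: punkt mis-splits after "inż." until the abbreviation
# is registered in abbrev_types (stored lowercase, without the dot).
import nltk

text = ("Jego współpracownikami byli inż. A. Schmidta oraz "
        "technolodzy: Lipko, Obryś i Rohn.")

tokenizer = nltk.data.load("tokenizers/punkt/polish.pickle")
print(len(tokenizer.tokenize(text)))  # likely 2: split after "inż."

tokenizer._params.abbrev_types.update({"inż"})
print(len(tokenizer.tokenize(text)))  # 1: kept as a single sentence
```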
Sadly, replacements are not always possible, as many abbreviations change conjugation in expanded form or vary in hard-to-manage ways. But it occurred to me that maybe this Python part could be moved to a separate .py file to hide the not-so-pretty word listings, and in segmenter.rs we could just call one function :)
So all of these here have to be followed by a period then? So that's kinda implicit when defining abbreviations here? If so, I'm totally fine with this and all is good :)
But it occurred to me that maybe this Python part could be moved to a separate .py file to hide the not-so-pretty word listings, and in segmenter.rs we could just call one function :)
When I set it up I couldn't figure out how to do that, but I didn't want to spend too much time on it. If such a setup is doable, I'd definitely prefer that as well. However I don't see this as a requirement to get this PR merged.
Yes, most Polish abbreviations end with a dot, and even some exceptions that should not have a dot are often found with one :)
As for cleaning up the Python code into another file... it seems more difficult than I expected. I don't quite grasp how Rust works, and inline Python is no exception. It seems to me that regular .py files lack something for inline Python to process them correctly. But I found a workaround: I can have a separate Rust function which creates a context with all the arrays of strings and returns it, and that context is then used to run the splitting. The context setup function can live in a separate file, hiding what does not have to be visible in segmenter.rs. Anyway, I need to learn Rust :D
I can have a separate Rust function which creates a context with all the arrays of strings and returns it, and that context is then used to run the splitting. The context setup function can live in a separate file, hiding what does not have to be visible in segmenter.rs.
I'm not sure I understand your workaround. In any case, I think your additions are fine for now, I wouldn't block on that :)
Approving this for now. Please ping me once the review is done and we can get this merged :)
For now, one more Polish contributor and I have reviewed it; we estimated 96% and 94% OK sentences from a random sample of 400. I'll try to get at least one more reviewer, but I don't have high hopes.
@J-Wrobel did you have any luck getting a third reviewer?
No, I tried to get someone to look at it, but no luck despite some initial interest. Everyone seems busy :)
Sentences obtained after blocklist incorporation: 214838
The blocklist is based on word frequency with a threshold of 40, plus the English wordlist taken from the wordlist repo by dwyl.
After a few iterations, my estimated correctness based on a 400-sentence sample is ~96%. It could probably be better in reality, as I think I marked sentences with hard-to-read words quite conservatively. Most problems are with surnames and place names, but I don't have a reasonably complete way of filtering them at the moment. I'm open to additional suggestions.
Link to review spreadsheet with 400 sentences reviewed by me as Reviewer1: link
Additional sentence selection and filtering info:
- removed most abbreviations which could be misspelled or read in the wrong form by a reader
- based on Scarfmonster's initial config file, I left replacements for centuries like "XII w."
- replaced some abbreviations with their full form where one was available
- filtered other patterns of 2-4 letters uncommon in Polish words, such as qu, sh, łł, heim
- left x, v and q as allowed letters - debatable
Used the Python segmenter with punkt, which in trials gave better results than the Rust implementation (a topic started by user Scarfmonster some time ago), and added some abbreviation tweaks found in gists (marked in the source).