.dic and .aff content by param. #19
Hello!

Would it be possible to populate the dictionary by submitting a list with the content of the .dic and .aff files? This is useful in the case of Spark UDFs, where it is easier to pass list variables than to copy .dic and .aff files from the driver node to the executors.

Btw, is there any way to do stemming like the original hunspell library? Or is there some alternative for stemming?

Comments
It is theoretically possible. You'll need to implement some wrapper around a list that meets two requirements:
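The two requirements themselves are not preserved in this copy of the thread. Judging from the calls just below and from how spylls' own FileReader behaves, they are most likely: (1) iterating the wrapper yields (line_number, line) pairs, and (2) it can be read again from the start (the .aff is re-read, for example, after a SET directive changes the encoding). A minimal sketch under those assumptions, using the MyReader name from the snippet that follows; the exact contract should be checked against spylls' FileReader:

```python
class MyReader:
    """List-backed stand-in for spylls' file reader (a sketch)."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # Yield (line_number, line) pairs, numbering from 1,
        # the way a file reader would.
        for num, line in enumerate(self.lines, start=1):
            yield num, line.rstrip('\n')

    def reset(self, encoding=None):
        # A file must be reopened to be read twice; an in-memory list
        # needs nothing, but the method should exist. The encoding
        # argument is ignored, since the lines are already str.
        pass
```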
Once you have this, you can just:

```python
aff, context = spylls.hunspell.readers.read_aff(MyReader(af_lines_list))
dic = spylls.hunspell.readers.read_dic(MyReader(dic_lines_list), aff=aff, context=context)
dictionary = spylls.hunspell.Dictionary(aff, dic)
```
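As a quick check (an aside, assuming the lists carry en_US content), the resulting dictionary should behave exactly like a file-loaded one; lookup and suggest are spylls' documented entry points:

```python
print(dictionary.lookup('kitten'))   # True if the word is in the .dic
print(*dictionary.suggest('kiten'))  # spelling suggestions, e.g. "kitten"
```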
Not that convenient, but it should work.

As for stemming, you can do something like:

```python
import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

dic = spylls.hunspell.Dictionary.from_files('examples/en_US')
for form in dic.lookuper.affix_forms('kittens', captype=CapType.NO):
    print(form.stem)

# prints: "kitten"
```
Thanks for the reply. About the first issue: I was able to populate the dictionary with the method you suggested. I have some problems with the encoding of special chars, but that is something I will address next week. On the second issue, stemming, I did a test with the code you provided, but it seems that some import (or library version) is preventing the captype from being passed:
The complete code snippet is this (ignore the Spark UDF wrapper):
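The snippet itself is not preserved in this copy of the thread. For illustration only, a hypothetical shape of such a UDF, wrapping the recipe from above (the stem_udf name, the dictionary path, and the once-per-executor caching are all assumptions, not code from the thread):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

_DICTIONARY = None  # built lazily, once per executor process

def _dictionary():
    global _DICTIONARY
    if _DICTIONARY is None:
        _DICTIONARY = spylls.hunspell.Dictionary.from_files('examples/en_US')
    return _DICTIONARY

@udf(returnType=StringType())
def stem_udf(word):
    if word is None:
        return None
    # Take the stem of the first affix form spylls can produce;
    # fall back to the word itself if it is unknown.
    for form in _dictionary().lookuper.affix_forms(word, captype=CapType.NO):
        return form.stem
    return word
```

Applying it would then be the usual column call, e.g. df.withColumn('stem', stem_udf('word')).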
That's very weird! Can you show a full backtrace of the error?
Thanks for the fast reply. The stacktrace shows a lot of Spark garbage that is not informative, and the only Python-related message is the weird one. But looking at your message, it seems to be something related to the Spark environment. I have executed the code in a local instance of Python, on the driver side of the Spark (PySpark) environment, and it works properly. So I suppose there is something with the Python versions on the executors and the imports of the hunspell library: it is not being imported, or it is being imported as None. I will check that and come back with the solution.
I found the problem. As I suspected, the executors' Python instances weren't able to install the hunspell library, and the failing import produced a cascade of Scala<->Java errors (common in PySpark stacktraces) that was hiding the main problem; I had to log into the cluster manager to find that error. Summarizing: you were totally right, and your code can be integrated into a Spark UDF. Thanks!
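A tip for anyone hitting the same wall: the failure can be surfaced from the driver by running the import on the executors themselves, instead of digging through the Scala/Java layers. A sketch, assuming a SparkSession named spark:

```python
def probe_import(_):
    # Runs on an executor: report whether spylls is importable there.
    try:
        import spylls.hunspell
        return ['ok: ' + spylls.hunspell.__file__]
    except ImportError as exc:
        return ['failed: ' + repr(exc)]

print(spark.sparkContext.parallelize(range(4), 4).flatMap(probe_import).collect())
```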
Victor, one final question about the stemming process. What is the procedure for stemming accented words like "específicos"? It seems that the affix_forms method requires non-accented words, am I right? Thanks!
It should depend on the dictionary only (if the dictionary has accents, they should be properly processed); but with Unicode quirks you never know :) |
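One concrete Unicode quirk worth ruling out with accented words (an aside, not from the thread): the same word can arrive precomposed (NFC) or decomposed (NFD), and string comparison, hence dictionary matching, distinguishes the two. Normalizing the input before lookup removes that variable:

```python
import unicodedata

word = 'específicos'                             # precomposed: 'í' is one code point
decomposed = unicodedata.normalize('NFD', word)  # 'i' followed by a combining accent
print(word == decomposed)                        # False, despite looking identical
print(unicodedata.normalize('NFC', decomposed) == word)  # True once re-composed
```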