.dic and .aff content by param. #19
Hello!

Would it be possible to populate the dictionary by submitting a list with the content of the .dic and .aff files? This is useful in the case of Spark UDFs, where it is easier to pass list variables than to copy .dic and .aff files from the driver node to the executors.

Btw, is there any way to do stemming like the original hunspell library? Or is there some alternative for stemming?

Comments
It is theoretically possible. You'll need to implement some wrapper around a list that meets two requirements:
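The two requirements themselves are not preserved in this copy of the thread. Judging from the calls just below and from how spylls' own FileReader behaves, they are most likely: (1) iterating the wrapper yields (line_number, line) pairs, and (2) it can be read again from the start (the .aff is re-read, for example, after a SET directive changes the encoding). A minimal sketch under those assumptions, using the MyReader name from the snippet that follows; the exact contract should be checked against spylls' FileReader:

```python
class MyReader:
    """List-backed stand-in for spylls' file reader (a sketch)."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # Yield (line_number, line) pairs, numbering from 1,
        # the way a file reader would.
        for num, line in enumerate(self.lines, start=1):
            yield num, line.rstrip('\n')

    def reset(self, encoding=None):
        # A file must be reopened to be read twice; an in-memory list
        # needs nothing, but the method should exist. The encoding
        # argument is ignored, since the lines are already str.
        pass
```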
Once you have this, you can just:

```python
aff, context = spylls.hunspell.readers.read_aff(MyReader(af_lines_list))
dic = spylls.hunspell.readers.read_dic(MyReader(dic_lines_list), aff=aff, context=context)
dictionary = spylls.hunspell.Dictionary(aff, dic)
```
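As a quick check (an aside, assuming the lists carry en_US content), the resulting dictionary should behave exactly like a file-loaded one; lookup and suggest are spylls' documented entry points:

```python
print(dictionary.lookup('kitten'))   # True if the word is in the .dic
print(*dictionary.suggest('kiten'))  # spelling suggestions, e.g. "kitten"
```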
Not that convenient, but it should work.

As for stemming, you can do something like:

```python
import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

dic = spylls.hunspell.Dictionary.from_files('examples/en_US')
for form in dic.lookuper.affix_forms('kittens', captype=CapType.NO):
    print(form.stem)

# prints: "kitten"
```
Thanks for the reply. About the first issue: I was able to populate the dictionary with the method you suggested. I have some problems with the encoding of special chars, but that is something I will address next week. On the second issue, stemming, I did a test with the code you provided, but it seems that some import (or library version) is preventing the captype from being passed:
The complete code snippet is this (ignore the Spark UDF wrapper):
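The snippet itself is not preserved in this copy of the thread. For illustration only, a hypothetical shape of such a UDF, wrapping the recipe from above (the stem_udf name, the dictionary path, and the once-per-executor caching are all assumptions, not code from the thread):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import spylls.hunspell
from spylls.hunspell.algo.capitalization import Type as CapType

_DICTIONARY = None  # built lazily, once per executor process

def _dictionary():
    global _DICTIONARY
    if _DICTIONARY is None:
        _DICTIONARY = spylls.hunspell.Dictionary.from_files('examples/en_US')
    return _DICTIONARY

@udf(returnType=StringType())
def stem_udf(word):
    if word is None:
        return None
    # Take the stem of the first affix form spylls can produce;
    # fall back to the word itself if it is unknown.
    for form in _dictionary().lookuper.affix_forms(word, captype=CapType.NO):
        return form.stem
    return word
```

Applying it would then be the usual column call, e.g. df.withColumn('stem', stem_udf('word')).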
That's very weird! Can you show a full backtrace of the error?
Thanks for the fast reply. The stacktrace shows a lot of Spark garbage that is not informative, and the only Python-related message is the weird one. But looking at your message, it seems to be something related to the Spark environment. I have executed the code in a local instance of Python, on the driver side of the Spark (PySpark) environment, and it works properly. So I suppose there is something with the Python versions on the executors and the imports of the hunspell library: it is not being imported, or it is being imported as None. I will check that and come back with the solution.
I found the problem. As I suspected, the executors' Python instances weren't able to install the hunspell library, and the failing import produced a cascade of Scala<->Java errors (common in PySpark stacktraces) that was hiding the main problem; I had to log into the cluster manager to find that error. Summarizing: you were totally right, and your code can be integrated into a Spark UDF. Thanks!
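A tip for anyone hitting the same wall: the failure can be surfaced from the driver by running the import on the executors themselves, instead of digging through the Scala/Java layers. A sketch, assuming a SparkSession named spark:

```python
def probe_import(_):
    # Runs on an executor: report whether spylls is importable there.
    try:
        import spylls.hunspell
        return ['ok: ' + spylls.hunspell.__file__]
    except ImportError as exc:
        return ['failed: ' + repr(exc)]

print(spark.sparkContext.parallelize(range(4), 4).flatMap(probe_import).collect())
```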
Victor, one final question about the stemming process. What is the procedure for stemming accented words like "específicos"? It seems that the affix_forms method requires non-accented words, am I right? Thanks!
It should depend on the dictionary only (if the dictionary has accents, they should be properly processed); but with Unicode quirks you never know :) |
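One concrete Unicode quirk worth ruling out with accented words (an aside, not from the thread): the same word can arrive precomposed (NFC) or decomposed (NFD), and string comparison, hence dictionary matching, distinguishes the two. Normalizing the input before lookup removes that variable:

```python
import unicodedata

word = 'específicos'                             # precomposed: 'í' is one code point
decomposed = unicodedata.normalize('NFD', word)  # 'i' followed by a combining accent
print(word == decomposed)                        # False, despite looking identical
print(unicodedata.normalize('NFC', decomposed) == word)  # True once re-composed
```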