Sharing pipeline configuration between Python (train env) and Java (prod env) #6315
-
I was wondering what the best practice is for sharing Spark NLP pipeline configurations between different environments, for instance when models are developed and trained in a Python environment but the production environment uses Java. Thankfully, Spark NLP supports both languages, and models can be shared easily. However, when moving to production, it is vital to apply the same pre-processing pipeline to incoming data that was used to train the model.
-
Hi @mwunderlich. Spark NLP extends Spark ML Pipeline natively; in that spirit, every model or PipelineModel is saved with metadata (both default and explicitly set parameters). Regardless of where they are trained/saved and loaded (Python, Scala, Java, or R), the metadata saved for each stage (annotator) is loaded alongside it, and the stage behaves exactly the same. The metadata looks something like this:
{"class":"org.apache.spark.ml.PipelineModel","timestamp":1632168876633,"sparkVersion":"3.0.2","uid":"RECURSIVE_PIPELINE_b04dd1c887aa","paramMap":{"stageUids":["document_811d40a38b24","SENTENCE_ce56851acebe","REGEX_TOKENIZER_78daa3b4692f","SPELL_79c88338ef12","LEMMATIZER_c62ad8f355f9","STEMMER_caf11d1f4d0e","POS_dbb704204f6f"]},"defaultParamMap":{}}
{"class":"com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel","timestamp":1632168877138,"sparkVersion":"3.0.2","uid":"SPELL_79c88338ef12","paramMap":{"dupsLimit":2,"shortCircuit":false,"doubleVariants":false,"caseSensitive":true,"intersections":10,"outputCol":"spell","inputCols":["token"],"frequencyPriority":true,"wordSizeIgnore":3,"vowelSwapLimit":6,"reductLimit":3},"defaultParamMap":{"dupsLimit":2,"shortCircuit":false,"caseSensitive":true,"doubleVariants":false,"intersections":10,"lazyAnnotator":false,"frequencyPriority":true,"vowelSwapLimit":6,"wordSizeIgnore":3,"reductLimit":3}} You are very correct, Spark NLP supports multiple programming languages/environments/and many other things so this makes it crucial to have the same behavior when moving models/pipelines from one point to another. Thanks to this native feature of Spark ML in Spark NLP, you can preserve/persist any model or pipeline on disk by using any language and load them later across Scala, Java, or Python with the very same parameters (inputCols, outputCol, hyper-parameters, etc.) |