Sharing pipeline configuration between Python (train env) and Java (prod env) #6315
-
I was wondering what the best practice is for sharing Spark NLP pipeline configurations between different environments, for instance when models are developed and trained in a Python environment but the production environment uses Java. Thankfully, Spark NLP supports both languages, and models can be shared easily. However, when moving to production, it is vital to apply the same pre-processing pipeline to incoming data that was used to train the model.
-
Hi @mwunderlich. Spark NLP extends Spark ML Pipeline natively; in that spirit, every model or PipelineModel is saved with metadata (both default and explicitly set parameters). Regardless of where they are trained/saved and loaded (Python, Scala, Java, or R), the metadata saved for each stage (annotator) is loaded alongside it, and the stage behaves exactly the same. The metadata looks something like this:
{"class":"org.apache.spark.ml.PipelineModel","timestamp":1632168876633,"sparkVersion":"3.0.2","uid":"RECURSIVE_PIPELINE_b04dd1c887aa","paramMap":{"stageUids":["document_811d40a38b24","SENTENCE_ce56851acebe","REGEX_TOKENIZER_78daa3b4692f","SPELL_79c88338ef12","LEMMATIZER_c62ad8f355f9","STEMMER_caf11d1f4d0e","POS_dbb704204f6f"]},"defaultParamMap":{}}
{"class":"com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel","timestamp":1632168877138,"sparkVersion":"3.0.2","uid":"SPELL_79c88338ef12","paramMap":{"dupsLimit":2,"shortCircuit":false,"doubleVariants":false,"caseSensitive":true,"intersections":10,"outputCol":"spell","inputCols":["token"],"frequencyPriority":true,"wordSizeIgnore":3,"vowelSwapLimit":6,"reductLimit":3},"defaultParamMap":{"dupsLimit":2,"shortCircuit":false,"caseSensitive":true,"doubleVariants":false,"intersections":10,"lazyAnnotator":false,"frequencyPriority":true,"vowelSwapLimit":6,"wordSizeIgnore":3,"reductLimit":3}} You are very correct, Spark NLP supports multiple programming languages/environments/and many other things so this makes it crucial to have the same behavior when moving models/pipelines from one point to another. Thanks to this native feature of Spark ML in Spark NLP, you can preserve/persist any model or pipeline on disk by using any language and load them later across Scala, Java, or Python with the very same parameters (inputCols, outputCol, hyper-parameters, etc.) |