John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more! #6721
maziyarpanahi
announced in
Announcement
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Overview
We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022! 🎉
Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.
This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
GPT2Transformer
uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀 .GPT-2
is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentencesRoBertaForSequenceClassification
can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by usingRobertaForSequenceClassification
for PyTorch orTFRobertaForSequenceClassification
for TensorFlow models in HuggingFace 🤗XlmRoBertaForSequenceClassification
can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by usingXLMRobertaForSequenceClassification
for PyTorch orTFXLMRobertaForSequenceClassification
for TensorFlow models in HuggingFace 🤗LongformerForSequenceClassification
can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by usingLongformerForSequenceClassification
for PyTorch orTFLongformerForSequenceClassification
for TensorFlow models in HuggingFace 🤗AlbertForSequenceClassification
can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by usingAlbertForSequenceClassification
for PyTorch orTFAlbertForSequenceClassification
for TensorFlow models in HuggingFace 🤗XlnetForSequenceClassification
can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by usingXLNetForSequenceClassification
for PyTorch orTFXLNetForSequenceClassification
for TensorFlow models in HuggingFace 🤗useBestModel
param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the trainingspark-nlp-spark32
andspark-nlp-gpu-spark32
packagesspark32=True
)pip install spark-nlp pyspark==3.1.2
ContextSpellCheckerApproach
annotatorgraphFolder
param toContextSpellCheckerApproach
annotator. This param allows to train ContextSpellChecker from a custom made TensorFlow graphgraphFolder
param. Starting Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on DatabricksForTokenClassification
andForSequenceClassification
) to retrieve classes from the pretrained modelsinputFormats
param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search in the text. Consequently, the output format will be defining the output pattern for the unique output format.readDataset
in CoNLL() classBug Fixes
lemma_antbnc
,sentiment_vivekn
, andspellcheck_norvig
for Spark 3.xclean_slang
,check_spelling
,match_chunks
, andmatch_datetime
for Spark 3.xsaveModel
in TrainingHelperModels Hub
Models Hub now comes with new features to easily filter and find your desired models & pipelines by:
In addition, you can also filter models & pipelines by:
As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever! 🚀
Models and Pipelines
Spark NLP 3.4.0 comes with state-of-the-art pre-trained transformer models. Models Hub supports over 15 NLP tasks:
Named Entity Recognition
,Text Classification
,Sentiment Analysis
,Translation
,Question Answering
,Summarization
,Sentence Detection
,Embeddings
,Language Detection
,Stop Words Removal
,Word Segmentation
,Part of Speech Tagging
,Lemmatization
,Spell Check
,Dependency Parser
, andText Generation
Featured Models
en
en
en
en
en
fr
en
en
en
en
en
en
en
it
en
ru
en
en
es
is
en
id
en
xx
tr
id
de
xx
hi
xx
hi
nl
zh
te
sw
fr
fi
en
ur
ur
en
en
en
Spark NLP covers the following languages:
English
,Multilingual
,Afrikaans
,Afro-Asiatic languages
,Albanian
,Altaic languages
,American Sign Language
,Amharic
,Arabic
,Argentine Sign Language
,Armenian
,Artificial languages
,Atlantic-Congo languages
,Austro-Asiatic languages
,Austronesian languages
,Azerbaijani
,Baltic languages
,Bantu languages
,Basque
,Basque (family)
,Belarusian
,Bemba (Zambia)
,Bengali, Bangla
,Berber languages
,Bihari
,Bislama
,Bosnian
,Brazilian Sign Language
,Breton
,Bulgarian
,Catalan
,Caucasian languages
,Cebuano
,Celtic languages
,Central Bikol
,Chichewa, Chewa, Nyanja
,Chilean Sign Language
,Chinese
,Chuukese
,Colombian Sign Language
,Congo Swahili
,Croatian
,Cushitic languages
,Czech
,Danish
,Dholuo, Luo (Kenya and Tanzania)
,Dravidian languages
,Dutch
,East Slavic languages
,Eastern Malayo-Polynesian languages
,Efik
,Esperanto
,Estonian
,Ewe
,Fijian
,Finnish
,Finnish Sign Language
,Finno-Ugrian languages
,French
,French-based creoles and pidgins
,Ga
,Galician
,Ganda
,Georgian
,German
,Germanic languages
,Gilbertese
,Greek (modern)
,Greek languages
,Gujarati
,Gun
,Haitian, Haitian Creole
,Hausa
,Hebrew (modern)
,Hiligaynon
,Hindi
,Hiri Motu
,Hungarian
,Icelandic
,Igbo
,Iloko
,Indic languages
,Indo-European languages
,Indo-Iranian languages
,Indonesian
,Irish
,Isoko
,Isthmus Zapotec
,Italian
,Italic languages
,Japanese
,Japanese
,Kabyle
,Kalaallisut, Greenlandic
,Kannada
,Kaonde
,Kinyarwanda
,Kirundi
,Kongo
,Korean
,Kwangali
,Kwanyama, Kuanyama
,Latin
,Latvian
,Lingala
,Lithuanian
,Louisiana Creole
,Lozi
,Luba-Katanga
,Luba-Lulua
,Lunda
,Lushai
,Luvale
,Macedonian
,Malagasy
,Malay
,Malayalam
,Malayo-Polynesian languages
,Maltese
,Manx
,Marathi (Marāṭhī)
,Marshallese
,Mexican Sign Language
,Mon-Khmer languages
,Morisyen
,Mossi
,Multiple languages
,Ndonga
,Nepali
,Niger-Kordofanian languages
,Nigerian Pidgin
,Niuean
,North Germanic languages
,Northern Sotho, Pedi, Sepedi
,Norwegian
,Norwegian Bokmål
,Norwegian Nynorsk
,Nyaneka
,Oromo
,Pangasinan
,Papiamento
,Persian (Farsi)
,Peruvian Sign Language
,Philippine languages
,Pijin
,Pohnpeian
,Polish
,Portuguese
,Portuguese-based creoles and pidgins
,Punjabi (Eastern)
,Romance languages
,Romanian
,Rundi
,Russian
,Ruund
,Salishan languages
,Samoan
,San Salvador Kongo
,Sango
,Semitic languages
,Serbo-Croatian
,Seselwa Creole French
,Shona
,Sindhi
,Sino-Tibetan languages
,Slavic languages
,Slovak
,Slovene
,Somali
,South Caucasian languages
,South Slavic languages
,Southern Sotho
,Spanish
,Spanish Sign Language
,Sranan Tongo
,Swahili
,Swati
,Swedish
,Tagalog
,Tahitian
,Tai
,Tamil
,Telugu
,Tetela
,Tetun Dili
,Thai
,Tigrinya
,Tiv
,Tok Pisin
,Tonga (Tonga Islands)
,Tonga (Zambia)
,Tsonga
,Tswana
,Tumbuka
,Turkic languages
,Turkish
,Tuvalu
,Tzotzil
,Ukrainian
,Umbundu
,Uralic languages
,Urdu
,Venda
,Venezuelan Sign Language
,Vietnamese
,Wallisian
,Walloon
,Waray (Philippines)
,Welsh
,West Germanic languages
,West Slavic languages
,Western Malayo-Polynesian languages
,Wolaitta, Wolaytta
,Wolof
,Xhosa
,Yapese
,Yiddish
,Yoruba
,Yucatec Maya, Yucateco
,Zande (individual language)
,Zulu
The complete list of all 4100+ models & pipelines in 230+ languages is available on Models Hub
Backward Compatibility
dateFormat
in DateMatcher and MultiDateMatcher annotators has been renamed tooutputFormat
:CMLM
models available which outperform xling models with support for more languages)New Notebooks
Import hundreds of models in different languages to Spark NLP
You can visit Import Transformers in Spark NLP for more info
New Word2Vec notebook
Documentation
Installation
Python
#PyPI pip install spark-nlp==3.4.0
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
GPU
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
GPU
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
GPU
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
GPU
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 3.2.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.4.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.3.x:
spark-nlp-gpu:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.0.jar
GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.0.jar
CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.0.jar
GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.0.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.0.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.0.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.0.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.0.jar
What's Changed
New Contributors
Full Changelog: 3.3.4...3.4.0
This discussion was created from the release John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!.
Beta Was this translation helpful? Give feedback.
All reactions