John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more! #6721

maziyarpanahi · 2022-01-05T15:25:58Z


Overview

We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022! 🎉

Spark NLP 3.4.0 extends support to Apache Spark 3.2.x on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases (2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x) at once, helping our community migrate from earlier Apache Spark versions to newer releases without worrying about Spark NLP end-of-life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.

This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new useBestModel param in NerDL to keep the best model during training, bug fixes, and lots more!

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Introducing GPT2Transformer annotator in Spark NLP 🚀 for Text Generation purposes. GPT2Transformer uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀 . GPT-2 is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP 🚀. RoBertaForSequenceClassification can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForSequenceClassification for PyTorch or TFRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP 🚀. XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForSequenceClassification for PyTorch or TFXLMRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP 🚀. LongformerForSequenceClassification can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForSequenceClassification for PyTorch or TFLongformerForSequenceClassification for TensorFlow models in HuggingFace 🤗
NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP 🚀. AlbertForSequenceClassification can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForSequenceClassification for PyTorch or TFAlbertForSequenceClassification for TensorFlow models in HuggingFace 🤗
NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP 🚀. XlnetForSequenceClassification can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForSequenceClassification for PyTorch or TFXLNetForSequenceClassification for TensorFlow models in HuggingFace 🤗
NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL
Introducing useBestModel param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have spark-nlp-spark32 and spark-nlp-gpu-spark32 packages
Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (spark32=True)
Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as pip install spark-nlp pyspark==3.1.2
Add new scripts/notebook to generate custom TensorFlow graphs for ContextSpellCheckerApproach annotator
Add a new graphFolder param to ContextSpellCheckerApproach annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
Support DBFS file system in graphFolder param. Starting with Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
Add a new feature to all classifiers (ForTokenClassification and ForSequenceClassification) to retrieve classes from the pretrained models

sequenceClassifier = XlmRoBertaForSequenceClassification \
      .pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class')

print(sequenceClassifier.getClasses())

#Sports, Business, World, Sci/Tech
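
To make "a linear layer on top of the pooled output" a bit more concrete, here is a minimal pure-Python sketch of what such a classification head computes. It is only an illustration and not how Spark NLP or HuggingFace implement it: the 3-dimensional pooled vector, the weights, and the bias are made-up values, and the function names are hypothetical. Only the four labels come from the getClasses() output above.

```python
import math

# Labels as printed by getClasses() in the example above.
LABELS = ["Sports", "Business", "World", "Sci/Tech"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias, labels=LABELS):
    """Apply one linear layer to a pooled vector and pick the top label."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical pooled output and made-up head parameters (one row per label).
pooled_output = [0.2, -1.0, 0.5]
head_weights = [
    [1.0, 0.0, 0.0],   # Sports
    [0.0, 1.0, 0.0],   # Business
    [0.0, 0.0, 2.0],   # World
    [-1.0, 0.0, 0.0],  # Sci/Tech
]
head_bias = [0.0, 0.0, 0.0, 0.0]

label, probabilities = classify(pooled_output, head_weights, head_bias)
print(label)
```

In a real model the pooled vector has hundreds of dimensions and the head parameters are learned during fine-tuning; the annotators load both from the pretrained model.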

Add inputFormats param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search in the text. The output format then defines the single pattern used for every matched date in the output.

# setOutputFormat was previously called setDateFormat
date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
    .setOutputFormat("yyyyMM") \
    .setSourceLanguage("en")
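
The idea of several accepted input patterns feeding one output pattern can be sketched without Spark at all. The snippet below is only an illustration using Python's standard datetime module, not the DateMatcher implementation: it parses a whole string rather than searching inside a text, the helper names are hypothetical, and the pattern translation covers only the yyyy, MM, and dd tokens used in the example above.

```python
from datetime import datetime


def java_to_strptime(pattern):
    """Translate the few date-pattern tokens used above into strptime codes."""
    for token, code in (("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")):
        pattern = pattern.replace(token, code)
    return pattern


def normalize_date(text, input_formats, output_format):
    """Re-render text in output_format, or return None if no input format fits."""
    for fmt in input_formats:
        try:
            parsed = datetime.strptime(text, java_to_strptime(fmt))
        except ValueError:
            continue  # this pattern does not match, try the next one
        return parsed.strftime(java_to_strptime(output_format))
    return None


formats = ["yyyy", "yyyy/dd/MM", "MM/yyyy"]
print(normalize_date("2021/25/12", formats, "yyyyMM"))
print(normalize_date("03/2020", formats, "yyyyMM"))
```

Whichever input pattern matches, the result is always rendered with the same output pattern, which is the behavior the inputFormats and outputFormat params describe.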

Enable batch processing in T5Transformer and MarianTransformer annotators
Add Schema to readDataset in CoNLL() class
Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.0
- Databricks 10.0 ML GPU
- Databricks 10.1
- Databricks 10.1 ML GPU
- Databricks 10.2
- Databricks 10.2 ML GPU
Welcoming 3x new EMR 5.x and 6.x releases to our Spark NLP family:
- EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
- EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
- EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
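
As a rough illustration of the useBestModel priority described in the feature list above (testDataset micro F1 first, then validationSplit micro F1, otherwise the training loss), here is a minimal pure-Python sketch. It is not the NerDLApproach implementation: the function names, the metric keys, and the per-epoch numbers are all hypothetical.

```python
def tracked_metric(has_test_dataset, has_validation_split):
    """Pick the metric to follow and whether a higher value is better."""
    if has_test_dataset:
        return "test_micro_f1", True
    if has_validation_split:
        return "validation_micro_f1", True
    return "loss", False  # nothing to evaluate on, so follow the training loss


def best_epoch(history, has_test_dataset=False, has_validation_split=False):
    """Return (epoch_index, metric_name) of the checkpoint worth restoring."""
    name, higher_is_better = tracked_metric(has_test_dataset, has_validation_split)
    scores = [epoch[name] for epoch in history]
    pick = max if higher_is_better else min
    index = pick(range(len(scores)), key=scores.__getitem__)
    return index, name


# Made-up metrics for three training epochs.
history = [
    {"loss": 0.90, "validation_micro_f1": 0.71, "test_micro_f1": 0.70},
    {"loss": 0.40, "validation_micro_f1": 0.83, "test_micro_f1": 0.85},
    {"loss": 0.35, "validation_micro_f1": 0.84, "test_micro_f1": 0.82},
]

print(best_epoch(history, has_test_dataset=True, has_validation_split=True))
print(best_epoch(history, has_validation_split=True))
print(best_epoch(history))
```

Note how the chosen epoch can differ depending on which metric is available, which is why the priority order matters.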

Bug Fixes

Fix a race condition in cluster mode where, on first use, the TF session was accessed as many times as there are available cores on the Driver machine. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on a machine with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution (TensorFlow functions refactor and fix race condition, #6575)
Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using either of these two annotators (Fix ignored tokens processing in seq2seq models, #6605)
Fix a bug in model resolution by not filtering based on the timestamp
Fix configProtoBytes param type in Python (#6549)
Fix missing DefaultParamsReadable in RegexTokenizer annotator (Added DefaultParamsReadable to RegexTokenizer companion obj, #6653)
Fix missing models lemma_antbnc, sentiment_vivekn, and spellcheck_norvig for Spark 3.x
Fix missing pipelines clean_slang, check_spelling, match_chunks, and match_datetime for Spark 3.x
Fix saveModel in TrainingHelper
Fix Keyword/Yake module naming in Scala (#6562)
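
For readers wondering what kind of race the first fix refers to, the general pattern is several workers hitting an uninitialized resource at the same moment and each loading it. The sketch below shows one common guard (a lock with a second check) in plain Python. It is not the code from #6575: the class, the loader, and the thread count are hypothetical stand-ins.

```python
import threading


class LazyModel:
    """Load an expensive resource once, even when many threads ask at once."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._model = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self):
        if self._model is None:          # fast path once the model is loaded
            with self._lock:
                if self._model is None:  # re-check: another thread may have won
                    self._model = self._loader()
                    self.load_count += 1
        return self._model


def fake_loader():
    """Stand-in for reading a large model from disk."""
    return {"name": "hypothetical-model"}


lazy = LazyModel(fake_loader)
threads = [threading.Thread(target=lazy.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lazy.load_count)
```

Without the lock, each of the eight threads could run the loader on first access, which is the extra disk usage and IO pressure described above.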

Models Hub

Models Hub now comes with new features to easily filter and find your desired models & pipelines by:

NLP Task
Natural Language
Spark NLP version

In addition, you can also filter models & pipelines by:

Models or Pipelines (finally! 😃 )
Tags used inside Model's card
Or even by predicted entities (which labels/classes a model can predict)

As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever! 🚀

Models and Pipelines

Spark NLP 3.4.0 comes with state-of-the-art pre-trained transformer models. Models Hub supports over 15 NLP tasks: Named Entity Recognition, Text Classification, Sentiment Analysis, Translation, Question Answering, Summarization, Sentence Detection, Embeddings, Language Detection, Stop Words Removal, Word Segmentation, Part of Speech Tagging, Lemmatization, Spell Check, Dependency Parser, and Text Generation

Featured Models

Model	Name	Lang
GPT2Transformer	gpt2_distilled	`en`
GPT2Transformer	gpt2	`en`
GPT2Transformer	gpt2_medium	`en`
GPT2Transformer	gpt2_large	`en`
XlmRoBertaForSequenceClassification	xlm_roberta_base_sequence_classifier_imdb	`en`
XlmRoBertaForSequenceClassification	xlm_roberta_base_sequence_classifier_allocine	`fr`
XlmRoBertaForSequenceClassification	xlm_roberta_base_sequence_classifier_ag_news	`en`
RoBertaForSequenceClassification	roberta_base_sequence_classifier_imdb	`en`
RoBertaForSequenceClassification	roberta_base_sequence_classifier_ag_news	`en`
AlbertForSequenceClassification	albert_base_sequence_classifier_ag_news	`en`
AlbertForSequenceClassification	albert_base_sequence_classifier_imdb	`en`
LongformerForSequenceClassification	longformer_base_sequence_classifier_ag_news	`en`
LongformerForSequenceClassification	longformer_base_sequence_classifier_imdb	`en`
BertForSequenceClassification	bert_sequence_classifier_sentiment	`it`
BertForSequenceClassification	bert_sequence_classifier_finbert_tone	`en`
BertForSequenceClassification	bert_sequence_classifier_toxicity	`ru`
XlnetForSequenceClassification	xlnet_base_sequence_classifier_imdb	`en`
XlnetForSequenceClassification	xlnet_base_sequence_classifier_ag_news	`en`
RoBertaForTokenClassification	roberta_token_classifier_bne_capitel_ner	`es`
RoBertaForTokenClassification	roberta_token_classifier_icelandic_ner	`is`
RoBertaForTokenClassification	roberta_token_classifier_ticker	`en`
RoBertaForTokenClassification	roberta_token_classifier_pos_tagger	`id`
RoBertaForTokenClassification	roberta_token_classifier_timex_semeval	`en`
XlmRoBertaForTokenClassification	xlm_roberta_large_token_classifier_masakhaner	`xx`
XlmRoBertaForTokenClassification	xlm_roberta_base_token_classifier_ner	`tr`
XlmRoBertaForTokenClassification	xlm_roberta_large_token_classifier_ner	`id`
XlmRoBertaForTokenClassification	xlm_roberta_large_token_classifier_conll03	`de`
XlmRoBertaForTokenClassification	xlm_roberta_large_token_classifier_hrl	`xx`
BertForTokenClassification	bert_hi_en_ner	`hi`
BertForTokenClassification	bert_token_classifier_scandi_ner	`xx`
BertForTokenClassification	bert_token_classifier_hi_en_ner	`hi`
BertForTokenClassification	bert_token_classifier_dutch_udlassy_ner	`nl`
BertForTokenClassification	bert_token_classifier_chinese_ner	`zh`
DistilBertEmbeddings	distilbert_uncased	`te`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_swahili	`sw`
BertEmbeddings	bert_base_finnish_uncased	`fi`
BertEmbeddings	bert_base_finnish_cased	`fi`
BertEmbeddings	electra_medal_acronym	`en`
ClassifierDLModel	classifierdl_urduvec_fakenews	`ur`
ClassifierDLModel	classifierdl_bert_news	`ur`
NerDLModel	nerdl_restaurant_100d	`en`
Word2VecModel	word2vec_gigaword_wiki_300	`en`
Word2VecModel	word2vec_gigaword_300	`en`

Spark NLP covers the following languages:

English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu

The complete list of all 4100+ models & pipelines in 230+ languages is available on Models Hub

Backward Compatibility

The parameter dateFormat in DateMatcher and MultiDateMatcher annotators has been renamed to outputFormat:


# previously
.setDateFormat("yyyy/MM/dd")

# after 3.4.0 release
.setOutputFormat("yyyy/MM/dd")

Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are CMLM models available which outperform xling models with support for more languages)
Deprecating Finnish old BERT models (there are newer models available now)
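
If the same pipeline code has to run both before and after the dateFormat to outputFormat rename noted above, one option is a small helper that prefers the new setter and falls back to the old one. This is only a sketch: the helper name is hypothetical, and the two classes below are stand-ins for illustration, not the real DateMatcher.

```python
def set_output_format(matcher, pattern):
    """Call setOutputFormat when present, otherwise fall back to setDateFormat."""
    setter = getattr(matcher, "setOutputFormat", None)
    if setter is None:
        setter = matcher.setDateFormat  # name used before the 3.4.0 release
    return setter(pattern)


class OldStyleMatcher:
    """Stand-in for an annotator from a release before 3.4.0."""

    def setDateFormat(self, pattern):
        self.pattern = pattern
        return self


class NewStyleMatcher:
    """Stand-in for an annotator from 3.4.0 onward."""

    def setOutputFormat(self, pattern):
        self.pattern = pattern
        return self


old = set_output_format(OldStyleMatcher(), "yyyy/MM/dd")
new = set_output_format(NewStyleMatcher(), "yyyy/MM/dd")
print(old.pattern, new.pattern)
```

Because the helper returns whatever the setter returns, it can sit at the end of the usual chained setter calls.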

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP	HuggingFace Notebooks	Colab
AlbertForSequenceClassification	HuggingFace in Spark NLP - AlbertForSequenceClassification
RoBertaForSequenceClassification	HuggingFace in Spark NLP - RoBertaForSequenceClassification
XlmRoBertaForSequenceClassification	HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification
XlnetForSequenceClassification	HuggingFace in Spark NLP - XlnetForSequenceClassification

You can visit Import Transformers in Spark NLP for more info

New Word2Vec notebook

Spark NLP	Jupyter Notebook
Word2VecApproach	Train Word2Vec and NER models

Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.4.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0
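
The package names above follow a regular pattern, so the right coordinate can be derived from the Spark version. The helper below is a small sketch of that mapping, built only from the coordinates listed in this section; the function name and its arguments are hypothetical and not part of the sparknlp package.

```python
def spark_nlp_coordinate(spark_version, gpu=False, release="3.4.0"):
    """Build the --packages coordinate for a given Apache Spark version."""
    # (artifact suffix, Scala version) per Spark major.minor, as listed above.
    variants = {
        "3.0": ("", "2.12"),
        "3.1": ("", "2.12"),
        "3.2": ("-spark32", "2.12"),
        "2.4": ("-spark24", "2.11"),
        "2.3": ("-spark23", "2.11"),
    }
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in variants:
        raise ValueError(f"no Spark NLP {release} package listed for Spark {spark_version}")
    suffix, scala = variants[major_minor]
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{release}"


print(spark_nlp_coordinate("3.2.0"))
print(spark_nlp_coordinate("2.4.8", gpu=True))
```

The returned string can be passed to --packages on the command line or set as the spark.jars.packages option when building a SparkSession.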

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.0.jar
GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.0.jar
CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.0.jar
GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.0.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.0.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.0.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.0.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.0.jar

What's Changed

Added missing pretrained models in models hub by @DevinTDHa in #6504
2021-11-23-sbiobertresolve_loinc_augmented_en by @jsl-models in #6509
2021-11-24-sbertresolve_ner_model_finder_en by @jsl-models in #6511
2021-11-24-ner_model_finder_en by @jsl-models in #6510
2021-11-15-sbiobertresolve_clinical_snomed_procedures_measurements_en by @jsl-models in #6470
2021-11-16-redl_nihss_biobert_en by @jsl-models in #6477
2021-11-15-ner_nihss_en by @jsl-models in #6473
removed broken link from menu by @diatrambitas in #6516
Add new demos by @agsfer in #6518
2021-11-26-ner_biomarker_en by @jsl-models in #6519
Update demomenu.html by @agsfer in #6520
2021-11-27-sbiobertresolve_ndc_en by @jsl-models in #6521
2021-11-29-ner_deid_subentity_augmented_i2b2_en by @jsl-models in #6526
deprecate tag added to chunk resolvers by @galiph in #6528
update 2021-11-11-sbiobertresolve_snomed_procedures_measurements_en.md by @Ahmetemintek in #6531
[skip ci] Create PR 3.3.4-healthcare-docs-3ae5966fd16758f401475f3fe1faf5ecb5c59365-4 by @jsl-builder in #6533
update in compat table by @albertoandreottiATgmail in #6534
Add new pages by @agsfer in #6530
added healthcare 3.3.4 release notes by @albertoandreottiATgmail in #6536
Updated issues from models_hub by @gadde5300 in #6537
Merge branch 'models_hub' into 'master' by @muhammetsnts in #6529
Changes name model in release notes 3.3.2 example by @josejuanmartinez in #6541
2021-11-29-classifierdl_bert_sentiment_pipeline_es by @jsl-models in #6527
fix python-scla code, added 'pretrained()' by @murat-gunay in #6542
Introducing ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer ForSequenceClassification annotators by @maziyarpanahi in #6538
Add new getClasses function to retrieve labels from pretrained models by @maziyarpanahi in #6544
GPT-2 transformer implementation by @vankov in #6523
Add some fixes in demos by @agsfer in #6543
2021-12-02-xlm_roberta_base_token_classifier_ner_tr by @jsl-models in #6548
licensed release notes v3.3.4 updated by @Ahmetemintek in #6554
2021-12-03-gpt2_distilled_en by @jsl-models in #6553
2021-12-03-gpt2_medium_en by @jsl-models in #6552
2021-12-03-gpt2_en by @jsl-models in #6551
2021-12-03-gpt_large_en by @jsl-models in #6559
Enable batch processing for GPT2 and T5 by @vankov in #6560
Feature/date matcher input formats by @wolliq in #6556
GPT2 tests updated by @vankov in #6557
2021-12-04-electra_medal_acronym_en by @jsl-models in #6564
Introducing Distributed and Trainable Word2Vec annotators by @maziyarpanahi in #6566
Fix Keyword/Yake module naming in Scala by @maziyarpanahi in #6562
Use the best model while training NerDLApproach annotator by @maziyarpanahi in #6567
2021-12-03-nerdl_fewnerd_100d_pipeline_en by @jsl-models in #6555
2021-12-03-xlm_roberta_large_token_classification_ner_id by @jsl-models in #6558
Added predicted entities to bert_token_classifier_ner_btc by @gadde5300 in #6561
2021-12-06-xlm_roberta_large_token_classifier_masakhaner_xx by @jsl-models in #6569
Move Detect Entities in tweets into Recognize Entities by @agsfer in #6570
340 docs update by @DevinTDHa in #6572
2021-12-06-sbiobertresolve_umls_drug_substance_en by @jsl-models in #6573
Updated OCR docs for 3.9.1 by @xyutech in #6547
Add release notes for NLP Server 0.4.0 by @pabla in #6579
Feature/pabla nlp server release notes by @diatrambitas in #6580
Docs/alab2.4.0 by @diatrambitas in #6581
Fixes bad model name in python code by @josejuanmartinez in #6584
2021-12-07-roberta_token_classifier_bne_capitel_ner_es by @jsl-models in #6577
2021-12-07-bert_token_classifier_chinese_ner_zh by @jsl-models in #6578
2021-12-08-bert_token_classifier_dutch_udlassy_ner_nl by @jsl-models in #6585
2021-12-09-text_detection_v1_en by @jsl-models in #6591
FEATURE NMH-51: Tracked latest model and change id for indexed models [skip test] by @KshitizGIT in #6550
BUGFIX-NMH55: Fix all commits dont trigger github action when merged [skip test] by @KshitizGIT in #6574
2021-12-08-indic_transformers_te_distilbert_spark_nlp_te by @jsl-models in #6589
2021-12-09-bert_token_classifier_scandi_ner_xx by @jsl-models in #6596
Fixes typo in entities by @josejuanmartinez in #6594
Removes typos from entities by @josejuanmartinez in #6595
FEATURE NMH-53: Add deprecated label in models [skip test] by @KshitizGIT in #6590
FEATURE NMH-56: Add spark_version to elastic search models id [skip test] by @KshitizGIT in #6593
Fixes the code examples in snomed (es) by @josejuanmartinez in #6597
Deprecating models(Open-Source) by @gadde5300 in #6599
T5, MarianMT and GPT2 batch processing fixes and optimization by @vankov in #6600
2021-12-10-classifierdl_bert_news_ur by @jsl-models in #6598
Add 4 new demos by @agsfer in #6601
Feature/graph folder context spell by @danilojsl in #6582
TensorFlow functions refactor and fix race condition by @maziyarpanahi in #6575
2021-12-11-sbiobertresolve_clinical_abbreviation_acronym_en by @jsl-models in #6604
Fix ignored tokens processing in seq2seq models by @vankov in #6605
2021-12-06-roberta_token_classifier_icelandic_ner_is by @jsl-models in #6571
2021-12-14-distilbert_uncased_te by @jsl-models in #6609
Feature/sbt 156 by @maziyarpanahi in #6607
2021-12-10-bert_hi_en_ner_hi by @jsl-models in #6602
Add filters by type, tags, and predicted entities to the models' page by @pabla in #6610
NER Model for Hindi+English and CPUvsGPUBenchmark by @agsfer in #6612
2021-12-14-text_detection_v1_en by @jsl-models in #6611
2021-12-15-mediacal_bert_token_classifier_ner_bacteria_en by @jsl-models in #6617
Restore missing deprecated label [skip test] by @pabla in #6615
added release notes for Annotation Lab 2.5.0 by @diatrambitas in #6618
Fix filter by checking timestamp on model resolution by @danilojsl in #6613
GPU vs CPU benchmarks changes by @agsfer in #6616
2021-12-16-albert_base_sequence_classifier_ag_news_en by @jsl-models in #6619
2021-12-16-albert_base_sequence_classifier_imdb_en by @jsl-models in #6620
2021-12-16-longformer_base_sequence_classifier_ag_news_en by @jsl-models in #6621
2021-12-16-longformer_base_sequence_classifier_imdb_en by @jsl-models in #6622
Add Radiology demos by @agsfer in #6623
Detect Anatomical and Observation Entities in Chest Radiology Reports by @agsfer in #6625
Deprecated models by @Ahmetemintek in #6635
2021-12-17-bert_base_finnish_cased_fi by @jsl-models in #6636
2021-12-17-bert_base_finnish_uncased_fi by @jsl-models in #6638
2021-12-17-bert_token_classifier_drug_development_trials_en by @jsl-models in #6639
Move NER Model for Hindi+Englishdemo to top by @agsfer in #6640
2021-12-20-ner_drugprot_clinical_en by @jsl-models in #6641
CoNLL Reader: Add Schema to Dataframe by @albertoandreottiATgmail in #6637
2021-12-16-roberta_base_sequence_classifier_ag_news_en by @jsl-models in #6627
2021-12-16-roberta_base_sequence_classifier_imdb_en by @jsl-models in #6628
Update 2020-08-31-sent_bert_finnish_cased.md by @Ahmetemintek in #6630
Update 2020-08-31-sent_bert_finnish_uncased.md by @Ahmetemintek in #6631
Models hub by @maziyarpanahi in #6643
updated description with reasoning for entity labels by @luca-martial in #6642
2021-12-21-text_cleaner_v1_en by @jsl-models in #6645
2021-12-21-text_cleaner_v1_en by @jsl-models in #6646
2021-12-21-jsl_sbert_medium_rxnorm_en by @jsl-models in #6649
2021-12-21-bert_sequence_classifier_sentiment_it by @jsl-models in #6644
2021-12-21-bert_sequence_classifier_finbert_tone_en by @jsl-models in #6647
2021-12-22-icd10_icd9_mapping_en by @jsl-models in #6652
2021-12-22-bert_sequence_classifier_toxicity_ru by @jsl-models in #6651
2021-12-23-xlm_roberta_base_sequence_classifier_ag_news_en by @jsl-models in #6655
2021-12-23-xlm_roberta_base_sequence_classifier_imdb_en by @jsl-models in #6656
2021-12-23-xlm_roberta_base_sequence_classifier_allocine_fr by @jsl-models in 2021-12-23-xlm_roberta_base_sequence_classifier_allocine_fr #6657
2021-12-23-xlnet_base_sequence_classifier_ag_news_en by @jsl-models in 2021-12-23-xlnet_base_sequence_classifier_ag_news_en #6658
2021-12-23-xlnet_base_sequence_classifier_imdb_en by @jsl-models in 2021-12-23-xlnet_base_sequence_classifier_imdb_en #6659
2021-12-23-sbiobert_jsl_rxnorm_cased_en by @jsl-models in 2021-12-23-sbiobert_jsl_rxnorm_cased_en #6662
Add support for Apache Spark 3.2.0 on Scala 2.12 by @maziyarpanahi in Add support for Apache Spark 3.2.0 on Scala 2.12 #6333
2021-12-24-sblubertresolve_loinc_uncased_en by @jsl-models in 2021-12-24-sblubertresolve_loinc_uncased_en #6663
2021-12-24-sbiobertresolve_loinc_cased_en by @jsl-models in 2021-12-24-sbiobertresolve_loinc_cased_en #6664
2021-12-27-roberta_token_classifier_pos_tagger_id by @jsl-models in 2021-12-27-roberta_token_classifier_pos_tagger_id #6667
2021-12-26-xlm_roberta_large_token_classifier_hrl_xx by @jsl-models in 2021-12-26-xlm_roberta_large_token_classifier_hrl_xx #6666
2021-12-27-bert_token_classifier_hi_en_ner_hi by @jsl-models in 2021-12-27-bert_token_classifier_hi_en_ner_hi #6669
2021-12-23-sbert_jsl_medium_rxnorm_uncased_en by @jsl-models in 2021-12-23-sbert_jsl_medium_rxnorm_uncased_en #6661
move Detect biological concepts by @agsfer in move Detect biological concepts #6670
models tagged as deprecated by @Ahmetemintek in models tagged as deprecated #6671
2021-12-27-roberta_token_classifier_ticker_en by @jsl-models in 2021-12-27-roberta_token_classifier_ticker_en #6672
2021-12-25-xlm_roberta_large_token_classifier_conll03_de by @jsl-models in 2021-12-25-xlm_roberta_large_token_classifier_conll03_de #6665
2021-12-28-roberta_token_classifier_timex_semeval_en by @jsl-models in 2021-12-28-roberta_token_classifier_timex_semeval_en #6674
2021-12-28-sbertresolve_jsl_rxnorm_augmented_med_en by @jsl-models in 2021-12-28-sbertresolve_jsl_rxnorm_augmented_med_en #6675
2021-12-28-sbluebertresolve_rxnorm_augmented_uncased_en by @jsl-models in 2021-12-28-sbluebertresolve_rxnorm_augmented_uncased_en #6678
2021-12-28-sbiobertresolve_rxnorm_augmented_cased_en by @jsl-models in 2021-12-28-sbiobertresolve_rxnorm_augmented_cased_en #6677
2021-12-27-sbiobertresolve_jsl_rxnorm_augmented_en by @jsl-models in 2021-12-27-sbiobertresolve_jsl_rxnorm_augmented_en #6676
Fix configProtoBytes param type in Python by @maziyarpanahi in Fix configProtoBytes param type in Python #6549
Added DefaultParamsReadable to RegexTokenizer companion obj by @wolliq in Added DefaultParamsReadable to RegexTokenizer companion obj #6653
Add new demos by @agsfer in Add new demos #6681
2021-12-29-classifierdl_xlm_roberta_sentiment_sw by @jsl-models in 2021-12-29-classifierdl_xlm_roberta_sentiment_sw #6680
2021-12-30-layoutlmv2_funsd_en by @jsl-models in 2021-12-30-layoutlmv2_funsd_en #6682
2021-12-30-ner_abbreviation_clinical_en by @jsl-models in 2021-12-30-ner_abbreviation_clinical_en #6684
modelhub md files updated by @Cabir40 in modelhub md files updated #6685
2021-12-31-sbluebertresolve_loinc_uncased_en by @jsl-models in 2021-12-31-sbluebertresolve_loinc_uncased_en #6686
2022-01-01-sbiobertresolve_snomed_drug_en by @jsl-models in 2022-01-01-sbiobertresolve_snomed_drug_en #6687
2021-12-29-classifierdl_urduvec_fakenews_ur by @jsl-models in 2021-12-29-classifierdl_urduvec_fakenews_ur #6683
2021-12-31-nerdl_restaurant_100d_en by @jsl-models in 2021-12-31-nerdl_restaurant_100d_en #6691
Add demos fixes by @agsfer in Add demos fixes #6692
2022-01-03-bert_token_classifier_ner_bionlp_en by @jsl-models in 2022-01-03-bert_token_classifier_ner_bionlp_en #6695
2022-01-03-clean_slang_en by @jsl-models in 2022-01-03-clean_slang_en #6694
2022-01-03-word2vec_gigaword_300_en by @jsl-models in 2022-01-03-word2vec_gigaword_300_en #6696
2022-01-03-word2vec_gigaword_wiki_300_en by @jsl-models in 2022-01-03-word2vec_gigaword_wiki_300_en #6697
Models hub by @maziyarpanahi in Models hub #6698
2022-01-03-sbiobertresolve_rxnorm_augmented_en by @jsl-models in 2022-01-03-sbiobertresolve_rxnorm_augmented_en #6702
2022-01-03-sbiobertresolve_clinical_abbreviation_acronym_en by @jsl-models in 2022-01-03-sbiobertresolve_clinical_abbreviation_acronym_en #6701
2022-01-03-sbert_jsl_medium_rxnorm_uncased_en by @jsl-models in 2022-01-03-sbert_jsl_medium_rxnorm_uncased_en #6700
Preserve the actual trained weights from NerDLApproach by @maziyarpanahi in Preserve the actual trained weights from NerDLApproach #6699
2022-01-03-bert_base_finnish_cased_fi by @jsl-models in 2022-01-03-bert_base_finnish_cased_fi #6703
2022-01-03-bert_base_finnish_uncased_fi by @jsl-models in 2022-01-03-bert_base_finnish_uncased_fi #6704
2022-01-04-electra_medal_acronym_en by @jsl-models in 2022-01-04-electra_medal_acronym_en #6712
modelhub md files updated by @gadde5300 in modelhub md files updated #6713
Models hub by @maziyarpanahi in Models hub #6714
2022-01-05-layoutlmv2_funsd_en by @jsl-models in 2022-01-05-layoutlmv2_funsd_en #6715
2022-01-04-match_datetime_en by @jsl-models in 2022-01-04-match_datetime_en #6711
2022-01-04-match_chunks_en by @jsl-models in 2022-01-04-match_chunks_en #6710
2022-01-04-check_spelling_dl_en by @jsl-models in 2022-01-04-check_spelling_dl_en #6709
Models hub by @maziyarpanahi in Models hub #6716
Models hub internal - release for M12.2 by @josejuanmartinez in Models hub internal - release for M12.2 #6693
Release/340 release candidate by @maziyarpanahi in Release/340 release candidate #6546

New Contributors

@galiph made their first contribution in deprecate tag added to chunk resolvers #6528
@Ahmetemintek made their first contribution in update 2021-11-11-sbiobertresolve_snomed_procedures_measurements_en.md #6531
@xyutech made their first contribution in Updated OCR docs for 3.9.1 #6547
@KshitizGIT made their first contribution in FEATURE NMH-51: Tracked latest model and change id for indexed models [skip test] #6550
@luca-martial made their first contribution in updated description with reasoning for entity labels #6642
@Cabir40 made their first contribution in modelhub md files updated #6685

Full Changelog: 3.3.4...3.4.0

This discussion was created from the release John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!.
