John Snow Labs Spark-NLP 3.2.0: New Longformer embeddings, BERT and DistilBERT for Token Classification, GraphExtraction, Spark NLP Configurations, new state-of-the-art multilingual NER models, and lots more! #5942
maziyarpanahi announced in Announcement
Overview
We are very excited to release Spark NLP 🚀 3.2.0! This is a big release with new Longformer models for long documents, BertForTokenClassification and DistilBertForTokenClassification for existing or fine-tuned models on HuggingFace, GraphExtraction and GraphFinisher to find relevant relationships between words, support for multilingual date matching, new Pydoc for Python APIs, and much more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
Longformer is a transformer model for long documents. It is a BERT-like model that starts from the RoBERTa checkpoint and is pretrained for masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens. We have trained two NER models based on Longformer Base and Large embeddings.
BertForTokenClassification can load BERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named Entity Recognition (NER) tasks. This annotator is compatible with all models trained or fine-tuned with BertForTokenClassification or TFBertForTokenClassification in HuggingFace 🤗.
DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named Entity Recognition (NER) tasks. This annotator is compatible with all models trained or fine-tuned with DistilBertForTokenClassification or TFDistilBertForTokenClassification in HuggingFace 🤗.
GraphExtraction takes the entities extracted by, for example, a NerDLModel and creates a dependency tree that describes how the entities relate to each other. A triple store format is used for this: nodes represent the entities and edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.
application.conf usage: you can easily change Spark NLP configurations in SparkSession. For more examples, please visit Spark NLP Configuration.
log_folder Spark NLP config and outputLogsPath param in the NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach annotators.
cache_folder, log_folder, and cluster_tmp_dir parameters in the sparknlp.start() function to set Spark NLP configurations.
Models and Pipelines
Spark NLP 3.2.0 comes with new LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification annotators.
New Longformer Models
Language: en (2 models)
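As a rough illustration of how the new LongformerEmbeddings annotator might be placed in a pipeline, here is a minimal sketch. The model name `longformer_base_4096`, the column names, and the helper `build_longformer_pipeline` are assumptions made for this example and are not taken from the release notes; check Models Hub for the exact model names. The 4,096 limit comes from the description above.

```python
# Upper bound on sequence length stated in the release notes.
MAX_LONGFORMER_TOKENS = 4096


def build_longformer_pipeline(model_name="longformer_base_4096",
                              max_sentence_length=MAX_LONGFORMER_TOKENS):
    """Return an unfitted Spark ML Pipeline ending in Longformer embeddings.

    `model_name` is an assumed Models Hub name; replace it with a real one.
    """
    if not 1 <= max_sentence_length <= MAX_LONGFORMER_TOKENS:
        raise ValueError(
            f"max_sentence_length must be between 1 and {MAX_LONGFORMER_TOKENS}"
        )

    # Imported lazily so this module loads without a Spark installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, LongformerEmbeddings

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = (
        LongformerEmbeddings.pretrained(model_name, "en")  # downloads the model
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
        .setMaxSentenceLength(max_sentence_length)
    )
    return Pipeline(stages=[document, tokenizer, embeddings])


# Usage, assuming a running session created with sparknlp.start():
#   pipeline = build_longformer_pipeline()
#   df = spark.createDataFrame([["A very long document ..."]]).toDF("text")
#   pipeline.fit(df).transform(df).select("embeddings").show()
```

The length check runs before anything is downloaded, so an out-of-range value fails fast rather than after the model has been fetched.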
Featured NerDL Models
New NER models for CoNLL (4 entities) and OntoNotes (18 entities) trained by using BERT, RoBERTa, DistilBERT, XLM-RoBERTa, and Longformer Embeddings:
Language: en (10 models)
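Entities produced by NER models like these are what GraphExtraction organizes into a graph. The plain-Python sketch below only illustrates the triple-store idea described under Major features (nodes are words or entities, edges are relations). It does not call Spark NLP, and the triples, relation labels, and function names are hypothetical examples, not real annotator output.

```python
from collections import deque

# Hypothetical (head, relation, tail) triples for "John visited Paris in 2019".
TRIPLES = [
    ("visited", "nsubj", "John"),
    ("visited", "obj", "Paris"),
    ("visited", "obl", "2019"),
]


def build_graph(triples):
    """Build an adjacency map; reverse edges are marked with a '~' prefix."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, []).append(("~" + relation, head))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search returning [node, relation, node, ...] or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [relation, neighbour]))
    return None


graph = build_graph(TRIPLES)
print(find_path(graph, "John", "Paris"))
# ['John', '~nsubj', 'visited', 'obj', 'Paris']
```

Storing both directions of every edge lets the search connect two entities that only meet through a shared head word, which is the typical shape of a relationship between two named entities in a dependency tree.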
BERT and DistilBERT for Token Classification
New BERT and DistilBERT fine-tuned for the Named Entity Recognition (NER) in English, Persian, Spanish, Swedish, and Turkish:
Languages: en (8 models), fa (4 models), tr (1 model), es (1 model), sv (1 model)
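A token classification model can be dropped into a pipeline roughly as follows. This is a sketch under assumptions: the helper name, the `architecture` switch, and the column names are invented for the example, and the model name in the usage comment is a placeholder that should be replaced with an actual name from Models Hub.

```python
# Maps a short key to the annotator class name introduced in this release.
ARCHITECTURES = {
    "bert": "BertForTokenClassification",
    "distilbert": "DistilBertForTokenClassification",
}


def build_token_classification_pipeline(model_name, lang="en", architecture="bert"):
    """Return an unfitted Pipeline: document -> token -> ner -> entities."""
    if architecture not in ARCHITECTURES:
        raise ValueError(f"architecture must be one of {sorted(ARCHITECTURES)}")

    # Imported lazily so this module loads without a Spark installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    import sparknlp.annotator as annotator

    classifier_cls = getattr(annotator, ARCHITECTURES[architecture])

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = annotator.Tokenizer().setInputCols(["document"]).setOutputCol("token")
    ner = (
        classifier_cls.pretrained(model_name, lang)  # downloads the model
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    # Merges the per-token IOB tags into whole entity chunks.
    converter = (
        annotator.NerConverter()
        .setInputCols(["document", "token", "ner"])
        .setOutputCol("entities")
    )
    return Pipeline(stages=[document, tokenizer, ner, converter])


# Usage, assuming a running session created with sparknlp.start():
#   pipeline = build_token_classification_pipeline("<model name from Models Hub>")
#   df = spark.createDataFrame([["John Snow Labs is based in Delaware."]]).toDF("text")
#   pipeline.fit(df).transform(df).select("entities.result").show(truncate=False)
```

Because the classification head is part of the model, no separate embeddings stage or NerDLModel is needed in this pipeline.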
The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Import hundreds of models in different languages to Spark NLP
You can visit Import Transformers in Spark NLP for more info
New Multilingual DateMatcher and MultiDateMatcher
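For the multilingual date matching mentioned here, a minimal sketch might look like the following. It assumes the DateMatcher setter is named `setSourceLanguage`, which should be checked against the API documentation for this release; the helper name, column names, and the French example are illustrative assumptions.

```python
def build_date_pipeline(source_language="fr", date_format="yyyy/MM/dd"):
    """Return an unfitted Pipeline that extracts one date per document."""
    if not source_language:
        raise ValueError("source_language must be a non-empty language code")

    # Imported lazily so this module loads without a Spark installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import DateMatcher

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    date = (
        DateMatcher()
        .setInputCols(["document"])
        .setOutputCol("date")
        .setFormat(date_format)              # output format of the matched date
        .setSourceLanguage(source_language)  # assumed setter, see lead-in
    )
    return Pipeline(stages=[document, date])


# Usage, assuming a running session created with sparknlp.start():
#   pipeline = build_date_pipeline("fr")
#   df = spark.createDataFrame([["Nous nous sommes rencontrés le 13/05/2018."]]).toDF("text")
#   pipeline.fit(df).transform(df).select("date.result").show(truncate=False)
```

MultiDateMatcher can be substituted for DateMatcher when a document may contain more than one date.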
Documentation
Installation
Python
# PyPI
pip install spark-nlp==3.2.0
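Once the package is installed, the configuration options listed under Major features can be set from Python. The sketch below shows two routes: passing `cache_folder`, `log_folder`, and `cluster_tmp_dir` to `sparknlp.start()`, as described in this release, and setting the equivalent properties on a SparkSession builder. The `spark.jsl.settings.*` property names are quoted from memory of the Spark NLP Configuration documentation and should be verified there; the helper names and folder paths are hypothetical.

```python
def spark_nlp_settings(cache_folder=None, log_folder=None, cluster_tmp_dir=None):
    """Map folder options to SparkSession properties, skipping unset ones.

    The property names are assumptions; verify them in the configuration docs.
    """
    candidates = {
        "spark.jsl.settings.pretrained.cache_folder": cache_folder,
        "spark.jsl.settings.annotator.log_folder": log_folder,
        "spark.jsl.settings.storage.cluster_tmp_dir": cluster_tmp_dir,
    }
    return {key: value for key, value in candidates.items() if value}


def apply_settings(builder, settings):
    """Apply each property to a SparkSession.builder-like object."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


def start_with_folders(cache_folder="", log_folder="", cluster_tmp_dir=""):
    """Start a session through sparknlp.start() with the new parameters."""
    import sparknlp  # imported lazily; requires spark-nlp and pyspark

    return sparknlp.start(
        cache_folder=cache_folder,
        log_folder=log_folder,
        cluster_tmp_dir=cluster_tmp_dir,
    )


# Usage with hypothetical paths:
#   spark = start_with_folders(cache_folder="/tmp/sparknlp_cache",
#                              log_folder="/tmp/sparknlp_logs")
```

The builder route is useful when the session is created by your own code rather than by `sparknlp.start()`, for example on a managed cluster.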
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
GPU
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
GPU
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
GPU
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.4.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.3.x:
spark-nlp-gpu:
FAT JARs
CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.0.jar
GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.0.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.0.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.0.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.0.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.0.jar