Spark-nlp

========
3.4.3
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForSequenceClassification** annotator in Spark NLP 🚀. `DeBertaForSequenceClassification` can load DeBERTa v2 & v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaForSequenceClassification` for **PyTorch** or `TFDebertaForSequenceClassification` for **TensorFlow** models in HuggingFace (see the sketch after this list)
* New multi-label support in all the *ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification
* New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL
* New impossiblePenultimates param in SentenceDetectorDLModel (both shown in the second sketch after this list)
* New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol
* New formCol and lemmaCol parameters in Lemmatizer annotator
* Add new functionality to download and extract models from S3 via direct link
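
A minimal PySpark sketch of the new annotator together with the multi-label option. The default `pretrained()` model and the `setActivation("sigmoid")` setter name are assumptions based on the feature description above; check the Models Hub and API docs for the exact names. The `getClasses()` call relies on the class-retrieval feature added in 3.4.0:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DeBertaForSequenceClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Downloads the default pretrained model; pass an explicit model name to pin a checkpoint
classifier = DeBertaForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")  # assumed setter: sigmoid instead of softmax for multi-label

print(classifier.getClasses())  # labels shipped with the pretrained model

pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])
```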

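A second sketch covering the new SentenceDetectorDL parameters; the values here are purely illustrative, not recommendations:

```python
from sparknlp.annotator import SentenceDetectorDLModel

sentence_detector = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setMinLength(5) \
    .setMaxLength(512) \
    .setSplitLength(512) \
    .setCustomBounds(["\n\n"]) \
    .setUseCustomBoundsOnly(False) \
    .setImpossiblePenultimates(["Dr", "Prof"])  # tokens before a period that should not trigger a sentence break
```
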
----------------
Bug Fixes & Enhancements
----------------
* Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
* Update SentenceDetector documentation
* Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation


========
3.4.2
========
----------------
New Features
----------------
* Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%).
This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for **PyTorch** or `TFDebertaV2Model` for **TensorFlow** models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace (see the sketch after this list)
* Introducing a new `enableCaching` param in Doc2VecApproach and Word2VecApproach which, if enabled, speeds up training (see the second sketch after this list)
* Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
* Support EMR emr-5.34.0 and emr-6.5.0
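
A minimal sketch of DeBertaEmbeddings in a pipeline; that a default `pretrained()` model exists is an assumption, so pass an explicit name from the Models Hub to pin one:

```python
from sparknlp.annotator import DeBertaEmbeddings

# Expects DOCUMENT and TOKEN input columns from an upstream DocumentAssembler/Tokenizer
embeddings = DeBertaEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```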

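A second sketch shows the new caching option; `setEnableCaching` is the assumed Python setter for the `enableCaching` param:

```python
from sparknlp.annotator import Word2VecApproach

word2vec = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings") \
    .setEnableCaching(True)  # assumed setter: cache intermediate data to speed up training
```
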
----------------
Bug Fixes
----------------
* Fix `bestModelMetric` param so that the set value is no longer ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978


========
3.4.1
========
----------------
New Features & Enhancements
----------------
* Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the warmup session, all inferences, including the first one, now take the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6773
* Add bestModelMetric param to choose between Micro-average or Macro-average for best model https://github.com/JohnSnowLabs/spark-nlp/pull/6749
* Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806
* Add a new `sentenceMatch` param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841
* Add support for using the `spark32` and `real_time_output` flags in the `sparknlp.start()` function at the same time (see the example after this list) https://github.com/JohnSnowLabs/spark-nlp/pull/6822
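
A short example of starting a session with both flags enabled at once:

```python
import sparknlp

# Apache Spark 3.2.x session with real-time output of training logs
spark = sparknlp.start(spark32=True, real_time_output=True)
```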

----------------
Bug Fixes
----------------
* Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741
* Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748
* Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750
* Fix ContextSpellCheckerModel returning tokens in the wrong order when it is used with sentence detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799
* Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845
* Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849
* Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867
* Fix missing `setExceptionsPath` param in Tokenizer when it is used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868
* Fix the wrong metric being mentioned when `useBestModel` was enabled: the documentation said micro-averaged F1, but it was in fact macro-averaged F1 (the new `bestModelMetric` param now lets you choose which metric to track)
* Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767


========
3.4.0
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **GPT2Transformer** annotator in Spark NLP 🚀, supporting OpenAI GPT2 models via HuggingFace `TFGPT2LMHeadModel`
* **NEW:** Introducing **RoBertaForSequenceClassification** annotator in Spark NLP 🚀. `RoBertaForSequenceClassification` can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for **PyTorch** or `TFRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForSequenceClassification** annotator in Spark NLP 🚀. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for **PyTorch** or `TFXLMRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForSequenceClassification** annotator in Spark NLP 🚀. `LongformerForSequenceClassification` can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for **PyTorch** or `TFLongformerForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **AlbertForSequenceClassification** annotator in Spark NLP 🚀. `AlbertForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for **PyTorch** or `TFAlbertForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlnetForSequenceClassification** annotator in Spark NLP 🚀. `XlnetForSequenceClassification` can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for **PyTorch** or `TFXLNetForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML
* Support for Apache Spark and PySpark 3.2.x on Scala 2.12
* Introducing `useBestModel` param in NerDLApproach annotator. This param preserves and restores the model that achieved the best performance at the end of training. The priority is metrics from testDataset (micro F1), then metrics from validationSplit (micro F1); if neither is set, it keeps track of loss during training (see the second sketch after this list)
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.0
* Databricks 10.0 ML GPU
* Databricks 10.1
* Databricks 10.1 ML GPU
* Databricks 10.2
* Databricks 10.2 ML GPU
* Welcoming 3x new EMR releases to our Spark NLP family:
* EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
* EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
* EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
* Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
* Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
* Support the DBFS file system in the `graphFolder` param. Starting with Spark NLP 3.4.0, you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
* Add new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
* Add `inputFormats` param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input date patterns to search for in the text, while `outputFormat` defines the single pattern used for the output (see the first sketch after this list)
* Enable batch processing in T5Transformer and MarianTransformer annotators
* Add Schema to `readDataset` in CoNLL() class
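
A short sketch of the new input/output format params on DateMatcher; `setInputFormats` is the assumed Python setter for the `inputFormats` param:

```python
from sparknlp.annotator import DateMatcher

# The input list defines acceptable input patterns to search for,
# while setOutputFormat defines the single pattern used for the output
date_matcher = DateMatcher() \
    .setInputCols(["document"]) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy/MM/dd", "MM/dd/yyyy"]) \
    .setOutputFormat("yyyy/MM/dd")
```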

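A second sketch combining `useBestModel` with a DBFS-hosted graph folder; the DBFS path is illustrative:

```python
from sparknlp.annotator import NerDLApproach

ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setUseBestModel(True) \
    .setGraphFolder("dbfs:/FileStore/tf_graphs")  # illustrative DBFS path to custom TF graphs
```
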
----------------
Bug Fixes
----------------
* Fix a race condition in cluster mode where, on first use, the TF session was accessed as many times as the number of available cores on the Driver machine. Loading a model multiple times results in disk activity, and IO becomes a bottleneck for larger models, especially on machines with slower disks https://github.com/JohnSnowLabs/spark-nlp/pull/6575
* Fix a performance issue introduced in the 3.3.3 release for the T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes making them twice as slow). Please update to 3.4.0 if you are using either of these annotators https://github.com/JohnSnowLabs/spark-nlp/pull/6605
* Fix a bug in model resolution by not filtering based on the timestamp
* Fix configProtoBytes param type in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6549
* Fix missing DefaultParamsReadable in RegexTokenizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6653
* Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
* Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
* Fix `saveModel` in TrainingHelper
* Fix Keyword/Yake module naming in Scala https://github.com/JohnSnowLabs/spark-nlp/pull/6562

----------------
Backward Compatibility
----------------

* The parameter `dateFormat` in DateMatcher and MultiDateMatcher annotators has been renamed to `outputFormat`:

```python
# previously
.setDateFormat("yyyy/MM/dd")

# after the 3.4.0 release
.setOutputFormat("yyyy/MM/dd")
```

* Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are `CMLM` models available which outperform xling models with support for more languages)
* Deprecating Finnish old BERT models (there are newer models available now)

========
3.3.4
========
----------------
Patch release
----------------
* Fix "ClassCastException" error in pretrained function for DistilBertForSequenceClassification in Python

========
3.3.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **DistilBertForSequenceClassification** annotator in Spark NLP 🚀. `DistilBertForSequenceClassification` can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForSequenceClassification` or `TFDistilBertForSequenceClassification` in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed **Doc2Vec** annotators based on Word2Vec in Spark ML (see the sketch after this list)
* Improve BertEmbeddings performance for DataFrames with a single document/sentence per row on a single machine with a GPU device
* Improve BertSentenceEmbeddings performance for DataFrames with a single document/sentence per row on a single machine with a GPU device
* Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
* Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
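
A minimal sketch of training Doc2Vec embeddings; the parameter values are illustrative, and the setters are assumed to mirror Word2Vec in Spark ML:

```python
from sparknlp.annotator import Doc2VecApproach

doc2vec = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("sentence_embeddings") \
    .setVectorSize(100) \
    .setMinCount(2)
```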

----------------
Bug Fixes
----------------
* Improve model and pipeline resolution in Spark NLP so that wrong models/pipelines are no longer downloaded regardless of their Apache Spark version
* Fix MarianTransformer bug on empty sequences
* Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
* Fix MarianTransformer multi-lingual models and pipelines such as `opus_mt_mul_en` and `opus_mt_en_mul`

