Spark-nlp

Latest version: v5.5.3

Safety actively analyzes 723607 Python packages for vulnerabilities to keep your Python projects secure.

Page 11 of 23

3.1.0

Not secure

========

----------------
New Features
----------------
* **NEW:** Introducing DistiBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of BERT’s performances
* **NEW:** Introducing RoBERTaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard
* **NEW:** Introducing XlmRoBERTaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data with 100 different languages. It also outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model
* **NEW:** Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting this release, you can easily use the `saved_model` feature in HuggingFace within a few lines of codes and import any BERT, DistilBERT, RoBERTa, and XLM-RoBERTa models to Spark NLP. We will work on the remaining annotators and extend this support to the rest with each release - For more information please visit [this discussion](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669)
* **NEW:** Migrate MarianTransformer to BatchAnnotate to control the throughput when you are on accelerated hardware such as GPU to fully utilize it
* Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Update to CUDA11 and cuDNN 8.0.2 for GPU support
* Implement ModelSignatureManager to automatically detect inputs, outputs, save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external Encoders such as HuggingFace and TF Hub (coming soon!)
* Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer will use the custom tokens from `Tokenizer` or `RegexTokenizer` and generates token pieces, encodes, and decodes the results
* Welcoming new Databricks runtimes to our Spark NLP family:
* Databricks 8.1 ML & GPU
* Databricks 8.2 ML & GPU
* Databricks 8.3 ML & GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)

----------------
Backward compatibility
----------------

* We have updated our MarianTransformer annotator to be compatible with TF v2 models. This change is not compatible with previous models/pipelines. However, we have updated and uploaded all the models and pipelines for `3.1.x` release. You can either use `MarianTransformer.pretrained(MODEL_NAME)` and it will automatically download the compatible model or you can visit [Models Hub](https://sparknlp.org/models) to download the compatible models for offline use via `MarianTransformer.load(PATH)`

========

3.0.3

Not secure

========

----------------
New Features
----------------
* Add new functionalities for text generation in T5Transformer

----------------
Bug Fixes
----------------
* Fix ChunkEmbeddings Array out of bounds exception
* Fix pretrained tfhub_use_multi and tfhub_use_multi_lg models in UniversalSentenceEncoder
* Fix anchorDateMonth in Python and case sensitivity in relative dates

========

3.0.2

Not secure

[[named_entity, 0, 4, B-LOC, [B-LOC -> 0.9998, I-ORG -> 0.0, I-MISC -> 0.0, I-LOC -> 0.0, I-PER -> 0.0, B-MISC -> 0.0, B-ORG -> 1.0E-4, word -> Japan, O -> 0.0, B-PER -> 0.0], []]

* Add confidence score to NerConverter metadata https://github.com/JohnSnowLabs/spark-nlp/pull/2784

[chunk, 30, 37, john, [entity -> PERSON, sentence -> 0, chunk -> 0, confidence -> 0.44035]

* Refactoring SentencePiece encoding in AlbertEmbeddings and XlnetEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2777

----------------
Bug Fixes
----------------
* Fix an exception in NerConverter when the documents/sentences don't carry the used tokens in NerDLModel https://github.com/JohnSnowLabs/spark-nlp/pull/2784
* Fix an exception in AlbertEmbeddings when the original tokens are longer than the piece tokens https://github.com/JohnSnowLabs/spark-nlp/pull/2777

========

3.0.1

Not secure

========

----------------
New Features
----------------
* Add minLength and maxLength parameters to Normalizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/2614
* 1 line to setup [Google Colab](https://github.com/JohnSnowLabs/spark-nlp#google-colab-notebook)
* 1 line to setup [Kaggle Kernel](https://github.com/JohnSnowLabs/spark-nlp#kaggle-kernel)

----------------
Enhancements
----------------
* Adjust shading rule for amazon AWS to support sub-projects from Spark NLP Fat JAR https://github.com/JohnSnowLabs/spark-nlp/pull/2613
* Fix the missing variables in BertSentenceEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2615
* Restrict loading Sentencepiece ops only to supported models https://github.com/JohnSnowLabs/spark-nlp/pull/2623
* improve dependency management and resolvers https://github.com/JohnSnowLabs/spark-nlp/pull/2479

========

3.0.0

Not secure

========
----------------
New Features
----------------
* Support for Apache Spark and PySpark 3.0.x on Scala 2.12
* Support for Apache Spark and PySpark 3.1.x on Scala 2.12
* Migrate to TensorFlow v2.3.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Welcoming 9x new Databricks runtimes to our Spark NLP family:
* Databricks 7.3
* Databricks 7.3 ML GPU
* Databricks 7.4
* Databricks 7.4 ML GPU
* Databricks 7.5
* Databricks 7.5 ML GPU
* Databricks 7.6
* Databricks 7.6 ML GPU
* Databricks 8.0
* Databricks 8.0 ML (there is no GPU in 8.0)
* Databricks 8.1 Beta
* Welcoming 2x new EMR 6.x series to our Spark NLP family:
* EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
* EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
* Starting Spark NLP 3.0.0 the default packages for CPU and GPU will be based on Apache Spark 3.x and Scala 2.12 (`spark-nlp` and `spark-nlp-gpu` will be compatible only with Apache Spark 3.x and Scala 2.12)
* Starting Spark NLP 3.0.0 we have two new packages to support Apache Spark 2.4.x and Scala 2.11 (`spark-nlp-spark24` and `spark-nlp-gpu-spark24`)
* Spark NLP 3.0.0 still is and will be compatible with Apache Spark 2.3.x and Scala 2.11 (`spark-nlp-spark23` and `spark-nlp-gpu-spark23`)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 2.4.x (`spark24=True`)
* Adding a new param to adjust Driver memory in sparknlp.start() function (`memory="16G"`)

----------------
Performance Improvements
----------------
Introducing a new batch annotation technique implemented in Spark NLP 3.0.0 for NerDLModel, BertEmbeddings, and BertSentenceEmbeddings annotators to radically improve prediction/inferencing performance.
From now on the `batchSize` for these annotators means the number of rows that can be fed into the models for prediction instead of sentences per row.
You can control the throughput when you are on accelerated hardware such as GPU to fully utilize it.

----------------
Breaking changes
----------------
There are only 5 annotators that are not compatible with both Scala 2.11 (Apache Spark 2.3 and Apache Spark 2.4) and Scala 2.12 (Apache Spark 3.x).
You can either train and use them on Apache Spark 2.3.x/2.4.x or train and use them on Apache Spark 3.x. The rest of our models/pipelines can be used on all Apache Spark and Scala major versions.

- TokenizerModel
- PerceptronApproach (POS Tagger)
- WordSegmenter
- DependencyParser
- TypedDependencyParser

========

2.7.5

Not secure

========
----------------
Bugfixes
----------------
* Fix BigDecimal error in NerDL when includeConfidence is true

----------------
Enhancements
----------------
* Shade Hadoop AWS and AWS Java SDK dependencies

========

Page 11 of 23

Releases

Has known vulnerabilities

Previous Next

Spark-nlp

Page 11 of 23

3.1.0

3.0.3

3.0.2

3.0.1

3.0.0

2.7.5

Page 11 of 23

Links

Releases