Spark-nlp

Latest version: v5.5.1


Page 2 of 23

5.3.1

========
----------------
Bug Fixes
----------------
* Fix M2M100 not working on the second run (closing the ONNX Session by mistake)
* Fix ONNX models failing in clusters like Databricks
* Fix `ZeroShotNerClassification` issue with NerConverter
* Add Colab notebook for M2M100


========

5.3.0

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing Llama-2 and all the models fine-tuned based on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for quantization in INT4 and INT8 for CPUs.
* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
* **NEW:** Introducing `M2M100` state-of-the-art multilingual translation. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Implement retrieval feature in our `DocumentSimilarity` annotator. The new DocumentSimilarity ranker is a powerful tool for ranking documents based on their similarity to a given query document. It is designed to be efficient and scalable, making it ideal for a variety of RAG applications.
* Add ONNX support for `BertForZeroShotClassification` annotator.
* Add support for in-memory use of the `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside a Kubernetes cluster without any `HDFS`; today, however, it runs without any issue `locally` and on Google `Colab`, `Kaggle`, `Databricks`, `AWS EMR`, `GCP`, and `AWS Glue`.
* New Whisper Large and Distil models.
* Update ONNX Runtime to 1.17.0
* Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, 14.3 GPU
* Support new EMR 6.15.0 and 7.0.0 versions
* Add notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it to Spark NLP
* Add notebook to import BERT for Zero-Shot classification from Hugging Face
* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
* Update EntityRuler documentation
* Improve SBT project and resolve warnings (almost!)
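
The 9,900 figure quoted for M2M100 above follows from counting ordered (source, target) pairs among 100 languages. A minimal sketch of that arithmetic, where `translation_directions` is a hypothetical helper and not part of Spark NLP:

```python
from itertools import permutations


def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


# Cross-check the closed form against explicit enumeration on a small case.
langs = ["en", "fr", "de"]  # illustrative language codes
pairs = list(permutations(langs, 2))
assert len(pairs) == translation_directions(len(langs))

print(translation_directions(100))  # 9900
```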

----------------
Bug Fixes
----------------
* Fix Spark NLP configuration to set `cluster_tmp_dir` on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
* Fix optional input col validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177
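
The `cluster_tmp_dir` fix above refers to the `spark.jsl.settings.storage.cluster_tmp_dir` setting. As a rough sketch of how such a key might be assembled, the snippet below builds the entry as plain data without starting Spark. The helper names and the `dbfs:/tmp/spark-nlp` path are assumptions for illustration only; the key itself is the one named in the release note.

```python
# Configuration key taken from the release note above.
CLUSTER_TMP_DIR_KEY = "spark.jsl.settings.storage.cluster_tmp_dir"


def spark_nlp_conf(cluster_tmp_dir: str) -> dict:
    """Hypothetical helper: return the Spark conf entries as a plain dict."""
    return {CLUSTER_TMP_DIR_KEY: cluster_tmp_dir}


def as_submit_flags(conf: dict) -> list:
    """Hypothetical helper: render conf entries as `--conf key=value` arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = spark_nlp_conf("dbfs:/tmp/spark-nlp")  # hypothetical DBFS location
print(" ".join(as_submit_flags(conf)))
```

The same key/value pairs could presumably be passed to `SparkSession.builder.config(key, value)` in PySpark or entered in a cluster's Spark configuration; which route applies depends on the environment.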

========

5.2.3

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForQuestionAnswering annotator
* Refactor AWS SDK use in Spark NLP to reduce the overall size of the library. We have dropped the use of `bundle` and started directly using the `S3` SDK. This will also minimize incompatibilities with other libraries that use AWS SDKs
* Add new notebooks to import DeBertaForQuestionAnswering, DebertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
* Add a new `DocumentTokenSplitter` notebook
* Add a new NER training notebook using DeBerta Embeddings
* Add a new text classification training notebook using INSTRUCTOR Embeddings
* Update `RoBertaForTokenClassification` notebook
* Update `RoBertaForSequenceClassification` notebook
* Update `OpenAICompletion` notebook with new `gpt-3.5-turbo-instruct` model


----------------
Bug Fixes
----------------
* Fix `BGEEmbeddings` not downloading in Python



========

5.2.2

========
----------------
Enhancements
----------------
* Update `aws-java-sdk-bundle` dependency to a version without any CVEs

----------------
Bug Fixes
----------------
* Fix the missing `BGEEmbeddings` from annotator in Python
* Add a new BGE notebook to import models into Spark NLP
* Upload the new true `BGE` models to Spark NLP for text embeddings


========

5.2.1

========
----------------
New Features & Enhancements
----------------
* Add support for Spark and PySpark 3.5 major release
* Support Databricks Runtimes of 14.0, 14.1, 14.2, 14.0 ML, 14.1 ML, 14.2 ML, 14.0 GPU, 14.1 GPU, and 14.2 GPU
* **NEW:** Introducing the `BGEEmbeddings` annotator for Spark NLP. This annotator enables the integration of `BGE` models, based on the BERT architecture, into Spark NLP. The `BGEEmbeddings` annotator is designed for generating dense vectors suitable for a variety of applications, including `retrieval`, `classification`, `clustering`, and `semantic search`. Additionally, it is compatible with `vector databases` used in `Large Language Models (LLMs)`.
* **NEW:** Introducing support for ONNX Runtime in DeBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForQuestionAnswering annotator
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with TensorFlow format
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with ONNX format
* Add a new notebook to show how to import any model from `MarianNMT` family into Spark NLP with ONNX format
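
The `BGEEmbeddings` entry above mentions dense vectors for retrieval and semantic search. A minimal sketch of how such vectors are commonly compared, using cosine similarity over toy 3-dimensional vectors: this is plain Python for illustration, not the annotator's API, and the function names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(query, documents)
print([doc_id for doc_id, _ in ranking])
```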

----------------
Bug Fixes
----------------
* Fix serialization issue in `DocumentTokenSplitter` annotator failing to be saved and loaded in a Pipeline
* Fix serialization issue in `DocumentCharacterTextSplitter` annotator failing to be saved and loaded in a Pipeline


========

5.2.0

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `CLIPForZeroShotClassification` for Zero-Shot Image Classification using OpenAI's CLIP models
* **NEW:** Introducing the `DocumentTokenSplitter` which allows users to split large documents into smaller chunks to be used in RAG with LLM models
* **NEW:** Introducing support for ONNX Runtime in T5Transformer annotator
* **NEW:** Introducing support for ONNX Runtime in MarianTransformer annotator
* **NEW:** Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
* Add caching support for newly imported T5 models in TF format so that their performance is competitive with the ONNX version
* Improve ZIP util and add tests for both ZipArchiveUtil and OnnxWrapper
* Refactor ONNX and add OnnxSession to broadcast
* Update ONNX Runtime to 1.16.3
* Add a new notebook for Structured Streaming
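
The `DocumentTokenSplitter` entry above describes splitting large documents into smaller chunks for RAG. The general idea can be sketched in plain Python as follows. This is not the annotator's implementation: the function, its `num_tokens` and `token_overlap` parameters, and the whitespace tokenization are all assumptions made for illustration.

```python
def split_by_tokens(text, num_tokens, token_overlap=0):
    """Split text into chunks of at most num_tokens tokens, sharing token_overlap tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be in [0, num_tokens)")
    tokens = text.split()  # assumption: simple whitespace tokenization
    step = num_tokens - token_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_by_tokens("a b c d e f g", num_tokens=3, token_overlap=1))
# ['a b c', 'c d e', 'e f g']
```

Overlapping chunks repeat a few tokens across boundaries, which can help a retriever keep context that would otherwise be cut in half.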

----------------
Bug Fixes
----------------
* Fix random dimension mismatch in E5Embeddings and MPNetEmbeddings due to a missing average_pool after last_hidden_state in the output
* Fix batching exception in E5 and MPNet embeddings annotators failing when sentence is used instead of document
* Fix chunk construction when an entity is found
* Fix a bug in the library's version in Scala
* Fix Whisper models not downloading due to the wrong library version
* Fix and refactor saving best model based on given metrics during NerDL training
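
The first fix in this list mentions a missing `average_pool` after `last_hidden_state`. As an illustration of what mean pooling over token vectors does (a sketch in plain Python, not the library's code; the function signature and toy inputs are assumptions), the snippet below shows that inputs with different numbers of real tokens still produce a vector of the same size:

```python
def average_pool(last_hidden_state, attention_mask):
    """Average the token vectors whose mask is 1; returns one vector of size hidden_dim."""
    hidden_dim = len(last_hidden_state[0])
    totals = [0.0] * hidden_dim
    count = 0
    for vector, keep in zip(last_hidden_state, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        return totals  # no real tokens: return the zero vector
    return [total / count for total in totals]


# Two inputs padded to 3 tokens with hidden_dim = 2; the first has one padding token.
padded = average_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
full = average_pool([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], [1, 1, 1])
print(padded, full)  # [2.0, 3.0] [2.0, 2.0]
```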


========

