Spark-nlp

Latest version: v5.5.1

Page 6 of 23

4.2.6

========
----------------
Enhancements
----------------
* Updating Spark & PySpark dependencies from 3.2.1 to 3.2.3 in provided scripts and in all the documentation

----------------
Bug Fixes
----------------
* Fix the broken TypedDependencyParserApproach and TypedDependencyParserModel annotators used in Python (this bug was introduced in the 4.2.5 release)
* Fix the broken Python API documentation


========

4.2.5

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **CamemBertForSequenceClassification** annotator in Spark NLP 🚀. `CamemBertForSequenceClassification` can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForSequenceClassification` for PyTorch or `TFCamembertForSequenceClassification` for TensorFlow in HuggingFace 🤗
* **NEW:** Add `AnnotatorType` validation in Spark NLP `LightPipeline`. Previously, a misconfigured `inputCols` in a pipeline annotator raised an exception in the `transform` method, but `LightPipeline` only returned empty values. Because this behavior could confuse users, `LightPipeline` now performs the same validation and raises an exception too.
* Add outputAnnotatorType for all annotators in Python
* Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from `AnnotatorApproach` and `AnnotatorModel`
* Adding AnnotatorType validation in `LightPipeline`
* Add validation for the number and type of columns set in the `TFNerDLGraphBuilder` annotator, to prevent incorrect column definitions when using Spark NLP annotators in Python
* Add more details to Alphabet error message in `EntityRuler` annotator to better guide users
* Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
* Refactor and implement better error handling in ResourceDownloader. This change removes `getObjectFromS3`, allowing the AWS SDK to raise the corresponding error. In addition, this change refactors ResourceDownloader to reflect the intention of each credential type on the downloader
* Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
* Upgrade `sbt-assembly` to `1.2.0`, which comes with many performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
* Update `sbt` to `1.8.0` with improvements and bug fixes, but mostly for CVE fixes:
  * Updates to Coursier 2.1.0-RC1 to address [https://github.com/advisories/GHSA-wv7w-rj2x-556x](https://github.com/advisories/GHSA-wv7w-rj2x-556x "https://github.com/advisories/GHSA-wv7w-rj2x-556x")
  * Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address [https://github.com/advisories/GHSA-wv7w-rj2x-556x](https://github.com/advisories/GHSA-wv7w-rj2x-556x "https://github.com/advisories/GHSA-wv7w-rj2x-556x")
* Use the new `withIncludeScala` in `assemblyOption` instead of value
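
The `AnnotatorType` validation described above can be pictured with a small plain-Python sketch. It is not Spark NLP's implementation: the `Stage` class and `validate_pipeline` function are hypothetical names invented for illustration, and the check shown (every required input type must be supplied by one of the stage's input columns) is an assumption about what such a validation might look like.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for an annotator; not a Spark NLP class."""
    name: str
    input_cols: List[str]
    input_annotator_types: List[str]  # types the stage requires
    output_col: str
    output_annotator_type: str


def validate_pipeline(stages: List[Stage]) -> None:
    """Raise ValueError if a stage is wired to columns of the wrong type."""
    produced: Dict[str, str] = {}  # column name -> annotator type
    for stage in stages:
        available = []
        for col in stage.input_cols:
            if col not in produced:
                raise ValueError(
                    f"{stage.name}: column '{col}' is not produced by an earlier stage"
                )
            available.append(produced[col])
        missing = [t for t in stage.input_annotator_types if t not in available]
        if missing:
            raise ValueError(
                f"{stage.name}: missing input annotator types {missing}, got {available}"
            )
        produced[stage.output_col] = stage.output_annotator_type


# A correctly wired pipeline passes silently.
validate_pipeline([
    Stage("document_assembler", [], [], "document", "document"),
    Stage("tokenizer", ["document"], ["document"], "token", "token"),
    Stage("pos_tagger", ["document", "token"], ["document", "token"], "pos", "pos"),
])
```

Failing fast with a descriptive error, rather than returning empty annotations, is the behavior the release note describes for both `transform` and `LightPipeline`.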


----------------
Bug Fixes
----------------
* Fix an issue with the `BigTextMatcher` annotator, where it would not match entities with overlapping definitions. For example, if both `lung` and `lung cancer` are defined, `lung` would not be matched in a given text. This was due to an abstraction error in one of the subclasses of the `BigTextMatcher` during construction of the underlying data structure
* Fix indexing issue for `RegexTokenizer` annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
* Refactor the `Resolvers` object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new `sbt`
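
To illustrate the overlapping-definitions case from the `BigTextMatcher` fix, here is a minimal plain-Python sketch of dictionary matching over a token trie that reports every entry, not only the longest one. It does not reproduce Spark NLP's internal data structure; `build_trie` and `match_all` are hypothetical helpers, and whitespace tokenization is a simplifying assumption.

```python
from typing import Dict, List, Tuple

_END = object()  # sentinel key marking "a phrase ends here"; never equals a token


def build_trie(phrases: List[str]) -> Dict:
    """Build a token-level trie from the dictionary phrases."""
    root: Dict = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_END] = phrase
    return root


def match_all(text: str, trie: Dict) -> List[Tuple[int, int, str]]:
    """Return (first_token, last_token, phrase) for every match, overlaps included."""
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            if _END in node:
                # Record the shorter phrase too, then keep walking for longer ones.
                matches.append((start, end, node[_END]))
    return matches


trie = build_trie(["lung", "lung cancer"])
matches = match_all("patient has lung cancer", trie)
```

With both `lung` and `lung cancer` defined, `matches` contains an entry for each phrase, which is the behavior the fix restores.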


========

4.2.4

Not secure
========
----------------
New Features & Enhancements
----------------
* Introduce support for GCP storage to be allowed as `cache_pretrained` directory for keeping all downloaded models and pipelines
* Update to TensorFlow 2.7.4 with bug and CVE fixes
* Update documentation on how to use `testDataset` param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach
* Update installation instructions for Apple M1 chip
* Improve error handling while importing external TensorFlow models into Spark NLP
* Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS
* Add support for future decoder-encoder models (2 separate models)

----------------
Bug Fixes
----------------
* Add missing setPreservePosition in NerConverter
* Add missing inputAnnotatorTypes to BigTextMatcher, ViveknSentimentModel, and NerConverter annotators
* Fix the incorrect example code provided for LemmatizerModel in Models Hub
* Fix provided notebook to import Longformer models from HF: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20Longformer.ipynb
* Fix the t5_grammar_error_corrector model to be compatible with Spark NLP 4.0+


========

4.2.3

Not secure
========
----------------
New Features & Enhancements
----------------
* Implement a new check on the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed in `setInputCols`
* Add a metadata sentence key parameter to the CoNLLGenerator annotator to select which metadata field to use as the sentence
* Include escaping in the CoNLLGenerator annotator when writing to CSV, and preserve special character tokens
* Add documentation for new `IAnnotation` feature for Scala users
* Add rules and delimiter parameters to RegexMatcher annotator to support string as input in addition to a file
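```python
from sparknlp.annotator import RegexMatcher

# Each rule is "<regex><delimiter><identifier>"; the delimiter here is ","
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")
```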
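
As a rough illustration of how rule strings of the form `regex,identifier` can be interpreted, the sketch below parses the same two rules with Python's standard `re` module and collects every match. It is not the `RegexMatcher` implementation: `parse_rules` and `match_all` are hypothetical helpers, and splitting on the last delimiter is an assumption made here so that a pattern may itself contain the delimiter.

```python
import re
from typing import List, Tuple


def parse_rules(rules: List[str], delimiter: str) -> List[Tuple["re.Pattern", str]]:
    """Split each rule into (compiled pattern, identifier) at the last delimiter."""
    parsed = []
    for rule in rules:
        pattern, identifier = rule.rsplit(delimiter, 1)
        parsed.append((re.compile(pattern), identifier))
    return parsed


def match_all(text: str, parsed: List[Tuple["re.Pattern", str]]) -> List[Tuple[str, str]]:
    """Return (matched_text, identifier) for every match of every rule."""
    results = []
    for pattern, identifier in parsed:
        for m in pattern.finditer(text):
            results.append((m.group(0), identifier))
    return results


rules = ["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]
found = match_all("Due 2022/11/05.", parse_rules(rules, ","))
```

Note that with a match-all approach the `short_date` rule also matches inside the longer date, so both identifiers appear in `found`.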


----------------
Bug Fixes
----------------
* Fix NotSerializableException when WordEmbeddings is used over K8s cluster while `setEnableInMemoryStorage` is set to `true`
* Fix a bug in the RegexTokenizer annotator where it outputs wrong indexes when the pattern includes splits that are not followed by a space
* Fix the training module failing on EMR due to incorrect Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
* Fix a bug in the CoNLLGenerator annotator when a token has non-integer metadata
* Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
* Fix `NaN` results in some ViTForImageClassification models/pipelines
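
The RegexTokenizer index fix concerns token offsets when a split is not followed by a space. A minimal sketch of one way to keep offsets correct, assuming the goal is inclusive `begin`/`end` character indexes, is to derive them from the actual positions of the separator matches instead of assuming one space between tokens. `tokenize_with_offsets` is a hypothetical helper, not Spark NLP code.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str, split_pattern: str) -> List[Tuple[str, int, int]]:
    """Split text on split_pattern; return (token, begin, end) with inclusive end."""
    tokens = []
    cursor = 0  # start of the token currently being collected
    for sep in re.finditer(split_pattern, text):
        if sep.start() > cursor:
            tokens.append((text[cursor:sep.start()], cursor, sep.start() - 1))
        cursor = sep.end()  # skip the separator, whatever its length
    if cursor < len(text):
        tokens.append((text[cursor:], cursor, len(text) - 1))
    return tokens


text = "2022-11-05 ok"
tokens = tokenize_with_offsets(text, r"-|\s")
```

Each returned token satisfies `text[begin:end + 1] == token`, which is a convenient invariant to assert in tests for downstream annotators that rely on these indexes.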

========

4.2.2

Not secure
========
----------------
New Features & Enhancements
----------------

* Add support for importing TensorFlow SavedModel from remote storages like DBFS, S3, and HDFS
* Add support for `fullAnnotate` in `LightPipeline` for path of images in Scala
* Add `fullAnnotate` method in `PretrainedPipeline` for Scala
* Add `fullAnnotateJava` method in `PretrainedPipeline` for Java
* Add `fullAnnotateImage` to `PretrainedPipeline` for Scala
* Add `fullAnnotateImageJava` to `PretrainedPipeline` for Java
* Add support for QA in `fullAnnotate` method in `PretrainedPipeline`
* Add `Predicted Entities` to all Vision Transformers (ViT) models and pipelines

----------------
Bug Fixes
----------------
* Unify `annotatorType` name in Python and Scala for Spark schema in Annotation, AnnotationImage and AnnotationAudio
* Fix missing indexes in `RecursiveTokenizer` annotator

========

4.2.1

Not secure
========
----------------
New Features & Enhancements
----------------

* Support for multi-lingual WordSegmenter. Add `enableRegexTokenizer` feature in WordSegmenter to support word segmentation within mixed and multi-lingual content https://github.com/JohnSnowLabs/spark-nlp/pull/12854
* Add Audio/ASR (Wav2Vec2) support to LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12895
* Add support for Double type in addition to Float type to AudioAssembler annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12904
* Improve error handling in fullAnnotateImage for LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12868

* Add SpanBertCoref annotator to all docs https://github.com/JohnSnowLabs/spark-nlp/pull/12889

----------------
Bug Fixes
----------------

* Fix passing a list to `fullAnnotate` in `LightPipeline`, which started to fail in the 4.2.0 release
* Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true https://github.com/JohnSnowLabs/spark-nlp/pull/12875
* Fix exception in Chunker annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12901

========
