
Latest version: v5.5.1



Page 19 of 23


Not secure
We're glad to announce a new release of Spark NLP. This one is dedicated to the community, whose bug reports and feedback
contributed immensely to the library. This release focuses on various bug fixes around DeepSentenceDetector
and on Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. In addition, embeddings are now cached in the models folder, with further improvements to accessing
them through S3 storage. Finally, we have made serious improvements to the notebooks and documentation around the library.
Special thanks to Tshimanga and haimco10 for very interesting contributions. See you on Slack!

* Improved OCR performance in skew detection
* SentenceDetector now better handles single quote protections (Thanks haimco10)
* DeepSentenceDetector now can explodeSentences (Thanks Tshimanga from
* EmbeddingsHelper now is capable of caching downloaded embeddings to avoid re-downloading
* Application.conf file may now be read from an s3 location
* DeepSentenceDetector now has access to all pragmatic SentenceDetector params in order to fine-tune it
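
The caching behaviour described for EmbeddingsHelper can be pictured with a minimal sketch. This is not Spark NLP's implementation: the `cached_fetch` helper, the hash-based file naming, the fake downloader and the `s3a://bucket/glove.txt` location are all assumptions made for illustration. It only shows the general idea of checking a local folder before downloading again.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_fetch(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local copy of `url`, calling `download` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on a hash of the source location (hypothetical scheme).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        target.write_bytes(download(url))
    return target


# Fake downloader that records every call, standing in for a real remote fetch.
calls: List[str] = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"the 0.1 0.2\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("s3a://bucket/glove.txt", Path(tmp), fake_download)
    second = cached_fetch("s3a://bucket/glove.txt", Path(tmp), fake_download)
    same_path = first == second
    cached_bytes = second.read_bytes()
```

The second call finds the file already in the cache folder, so the downloader is invoked only once.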

* Fixed ambiguous classpath resolution in pyspark, causing errors in deserializing some models
* Fixed DeepSentenceDetector not being deserializable in PySpark
* Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
* Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks Tshimanga from
* Fixed a bug where DeepSentenceDetector would miss sentence parts when NER model missed header sentence (Thanks Tshimanga from
* Cleaned and optimized DeepSentenceDetector code (Thanks danilojsl)
* Fixed a missing dependency for OCR

Documentation and notebooks
* Added support and instructions for Anaconda deployment (Thanks Maziyar)
* Updated various python notebooks to show utilization of spark packages instead of jars
* Added a new conference talk with Spark NLP in French at XebiCon'18
* Updated documentation towards less use of jars in favor of dependency solving



Not secure
This release aims to improve performance and resource usage in some pipelines that use word embeddings. It also comes
with a very interesting autorotation feature in OCR, and a couple of additions to solve particular needs, including the ChunkTokenizer annotator
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback all the time.

New Features
* OCR now capable of automatic rotation, significantly improving accuracy in some scenarios
* ChunkTokenizer is a new annotator that Tokenizes CHUNK type annotations. Extends Tokenizer algorithm and stores chunk ID for reference.
* SentenceDetector new Param maxLength now cuts off sentences longer than (by default) 240 characters. It avoids Deep Learning annotator issues and may improve performance in some scenarios.
* NerConverter new Param whiteList now allows a list of NER labels to be considered, while discarding the rest. May be useful for selective CHUNKing pipelines.
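
As a rough picture of what the two new Params do, here is a small plain-Python sketch. It does not use the Spark NLP API: `split_by_max_length` and `filter_chunks` are hypothetical helpers, and the sketch assumes that "cuts off" means splitting an overlong sentence into pieces rather than discarding the remainder.

```python
from typing import List, Set, Tuple


def split_by_max_length(sentence: str, max_length: int = 240) -> List[str]:
    """Cut a sentence into consecutive pieces of at most `max_length` characters."""
    return [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]


def filter_chunks(chunks: List[Tuple[str, str]], white_list: Set[str]) -> List[Tuple[str, str]]:
    """Keep only (text, label) chunks whose NER label is in `white_list`."""
    return [(text, label) for text, label in chunks if label in white_list]


# A 500-character "sentence" is cut into pieces no longer than 240 characters.
pieces = split_by_max_length("x" * 500)
lengths = [len(p) for p in pieces]

# Only PER and ORG chunks survive; the LOC chunk is discarded.
chunks = [("Paris", "LOC"), ("John Snow", "PER"), ("Spark NLP", "ORG")]
kept = filter_chunks(chunks, {"PER", "ORG"})
```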

* Pipelines using Word Embeddings should now perform faster due to a group of RocksDB optimizations allowing annotators to reuse current open connections to DB

* Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks Tshimanga from Deep6!)
* Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
* Fixed NerCrfModel when prefetching RocksDB causing slower performance

* Added missing artifact resolution dependencies for OCR Module
* Started adding and organizing multilanguage models (Thanks maziyarpanahi)
* Updated RocksDB to 5.17.2



Not secure
This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating Graphs for NerDL and also for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine learning based sentence detector, which uses NER for boundary detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy, and thanks again for the community contributions!

New Features
* New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection

* Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted levenshtein distance
* OCR process now defaults to split content in rows whether paragraphs or pages are identified for improved parallelism. May be turned off
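
To illustrate what re-ranking candidates by a weighted Levenshtein distance can look like, here is a minimal sketch. It is not the ContextSpellChecker code: the function names, the substitution-cost table and the example weights are assumptions. The idea is that some substitutions (here `1` for `l`, a typical OCR confusion) cost less than others, so candidates reachable through cheap edits rank first.

```python
from typing import Dict, List, Tuple


def weighted_levenshtein(a: str, b: str,
                         sub_cost: Dict[Tuple[str, str], float],
                         default_cost: float = 1.0) -> float:
    """Edit distance where selected substitutions are cheaper than the default."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = [j * default_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * default_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                substitution = prev[j - 1]
            else:
                substitution = prev[j - 1] + sub_cost.get((ca, cb), default_cost)
            curr.append(min(prev[j] + default_cost,      # delete ca
                            curr[j - 1] + default_cost,  # insert cb
                            substitution))
        prev = curr
    return prev[-1]


def rerank(word: str, candidates: List[str],
           sub_cost: Dict[Tuple[str, str], float]) -> List[str]:
    """Order candidates from smallest to largest weighted distance to `word`."""
    return sorted(candidates, key=lambda c: weighted_levenshtein(word, c, sub_cost))


# Hypothetical weights: confusing the digit 1 with the letter l is a cheap edit.
costs = {("1", "l"): 0.2}
distance = weighted_levenshtein("c1ean", "clean", costs)
ranked = rerank("c1ean", ["ocean", "clean"], costs)
```

With uniform costs, "clean" would be one full edit away; the weighting widens the gap between it and "ocean", which needs two full-cost edits.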

Examples and use cases
* Added Scala examples for Sentiment analysis and Lemmatizer in Italian (Thanks Vincenzo Gaudenzi from for dataset and model contribution!!!)

* Fixed a bug in Norvig and Symmetric SpellCheckers where the pattern parameter was not provided properly in Scala side (Thanks johnmccain for reporting!)

* Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)

* Metadata files for pretrained model downloads code is now included. This may be useful if anyone wants to setup their own private local model downloader service
* NerDL Graphs generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.

Special mentions
* Vincenzo Gaudenzi ( for contributing italian datasets and models. maziyar for creating examples with them.
* correlator from for contributing feedback in slack and features feedback in general
* johnmccain for reporting bugs in spell checker
* rohit-nlp for delivering maven coordinates for OCR
* haimco10 for contributing a sentence detector improvement for the apostrophe use case. Not merged due to specific issues involved.



Not secure
This release is huge! Spark-NLP made the leap into Spark 2.4.0, even with the challenge of not having everyone on board there yet (e.g. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators. Two for dependency parsing processes and one for contextual deep learning based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on tesseract.
Finally, there's plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to shoot us with any feedback you have! Particularly on your Spark 2.4.x experience.

New Features
* Built on top of Spark 2.4.0
* Dependency Parser annotator allows for sentence relationship encoding
* Typed Dependency Parser annotator allows for labeling relationships within dependency tags
* ContextSpellChecker is our first Deep Learning based Spell Checker that evaluates context and not only tokens

* More OCR parameters exposed for further fine tuning, including preferred methods priority and page segmentation modes
* OCR now has a setting setSplitPages() which allows setting whether to output one page per row or the entire document instead
* Improved word embeddings performance when working in local filesystems
* Reduced the amount of disk IO when working with Word Embeddings
* All python notebooks improved for better readability and better documentation
* Simplified PySpark interface API
* CoNLLGenerator utility class which helps building CoNLL-2003 files for NER training
* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
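
For readers unfamiliar with the CoNLL-2003 layout that such a utility produces, the sketch below builds one by hand. It does not call CoNLLGenerator; `to_conll2003` is a hypothetical helper and the example tags are made up. The layout itself is the standard one: a `-DOCSTART-` header, one token per line with four space-separated columns (word, POS tag, chunk tag, NER tag), and a blank line after each sentence.

```python
from typing import List, Tuple

# (word, POS tag, syntactic chunk tag, NER tag)
Token = Tuple[str, str, str, str]


def to_conll2003(sentences: List[List[Token]]) -> str:
    """Render tagged sentences in the four-column CoNLL-2003 text layout."""
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for word, pos, chunk, ner in sentence:
            lines.append(f"{word} {pos} {chunk} {ner}")
        lines.append("")  # a blank line terminates each sentence
    return "\n".join(lines)


conll = to_conll2003([
    [("John", "NNP", "B-NP", "B-PER"), ("smiled", "VBD", "B-VP", "O")],
])
conll_lines = conll.splitlines()
```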

* Solved race-condition issues regarding cluster usage of the RocksDB index for embeddings
* Fixed application.conf reading bug which didn't properly refresh AWS credentials
* RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
* Solved various python default parameter settings
* Fixed circular dependency with jbig pdfbox image OCR

* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP



Not secure
This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks such as Databricks, while keeping the 1.7.x performance benefits. Various YARN-based clusters, Databricks cloud among them, have been used to test this hotfix.
Aside from that, multiple improvements have been committed towards better support of PySpark-NLP, fixing diverse technical issues in the API that help consistency in Annotator's super classes.
Finally, PIP installation has been made easier with a SparkNLP class that creates a SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.

* Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using _.functions implicits
* Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
* Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
* Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
* Fixed Spark addFiles missing local file causing Word Embeddings not to work properly in some cluster-based frameworks
* Fixed broadcast NoSuchElementException `Failed to get broadcast_6_piece0 of broadcast_6` causing pretrained models not to work in cluster frameworks (thanks EnricoMi)

Developer API
* EmbeddingsHelper.setRef() has been removed. Reference is now set implicitly through EmbeddingsHelper.load(). Does not need to be loaded before deserializing models.
* Fixed and properly renamed chunk2doc and doc2chunk transformers, which should now work as expected
* Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regular expressions are used in this Param
* Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
* Simplified cluster path resolution for word embeddings
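
The rename above stresses that composite tokens are given as regular expressions. The following sketch shows the general idea of keeping regex-matched spans together during tokenization. It is not the Spark NLP Tokenizer: the `tokenize` function and its word and punctuation rules are assumptions chosen for illustration.

```python
import re
from typing import List


def tokenize(text: str, composite_patterns: List[str]) -> List[str]:
    """Split text into tokens, keeping any composite pattern match whole."""
    # Composite patterns come first in the alternation, so they take priority
    # over the generic word and punctuation rules.
    alternatives = [f"(?:{p})" for p in composite_patterns] + [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives))
    return [m.group(0) for m in pattern.finditer(text)]


plain = tokenize("Lives in New York, works remotely.", [])
protected = tokenize("Lives in New York, works remotely.", [r"New\s+York"])
```

Without a composite pattern, "New" and "York" come out as separate tokens; with the pattern, "New York" stays a single token.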

* sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using appropriate jar settings. Helps newcomers get started in PySpark NLP.



Not secure
Quick release with another hotfix, due to a newly found bug when deserializing word embeddings in a distributed filesystem. It also changes the application.conf reader
to allow run-time changes, and renames functions in the EmbeddingsHelper API.

* Fixed embeddings deserialization from a distributed filesystem (caused by the Windows path fix)
* Fixed application.conf not reading changes in runtime
* Added missing remote_locs argument in python pretrained() functions
* Fixed wrong build version introduced in 1.7.1 to detect proper pretrained models version

Developer API
* Renamed EmbeddingsHelper functions for more convenience



© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.