Spark-nlp

Latest version: v5.5.1

2.0.5

Not secure
========
---------------
Overview
---------------
This release bumps the default Apache Spark version for Spark NLP to 2.4.3. Spark had been testing builds on Scala 2.12 and is now back on 2.11, so this should be a working release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which is one of the most popular, if not the most popular, based on user feedback.
Users can now point to graphs they create without recompiling the library, and graph options, including whether to use TensorFlow contrib, are now user defined.
Particular thanks to CyborgDroid for reporting important and well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and giving feedback; we always welcome more. Join us on Slack!

---------------
Enhancements
---------------
* ViveknSentiment annotator now includes confidence score in metadata
* NerDL now has setGraphFolder to set the path to a folder with custom graphs generated using Python/TensorFlow code
* NerDL now has setConfigProtoBytes to let users submit their own (serialized) ConfigProto for the graph settings
* NerDLApproach now has setUseContrib to let the user who trains decide whether or not to use contrib. Contrib LSTM cells have been shown to return more accurate results, but do not work on Windows yet.
* Updated default TensorFlow settings to include GPU allow_growth by default and disabled the noisy log device placement messages
* Spark version bumped to 2.4.3
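
A minimal sketch of how the new NerDL options might be wired together from Python, assuming the setters named above (setGraphFolder, setUseContrib) are exposed on NerDLApproach in the Python API. The helper names nerdl_settings, apply_settings and build_nerdl, the column names, and any graph folder path are hypothetical. The sparknlp import is deferred so the helpers can be read and exercised without a Spark installation.

```python
import platform


def nerdl_settings(graph_folder, os_name=None):
    """Map NerDLApproach setter names to values (hypothetical helper).

    The notes say contrib LSTM cells do not work on Windows yet, so
    contrib is switched off there and left on everywhere else.
    """
    os_name = os_name or platform.system()
    return {
        "setGraphFolder": graph_folder,         # folder with custom TensorFlow graphs
        "setUseContrib": os_name != "Windows",  # contrib only outside Windows
    }


def apply_settings(annotator, settings):
    """Call each named setter on the annotator and return the annotator."""
    for setter, value in settings.items():
        getattr(annotator, setter)(value)
    return annotator


def build_nerdl(graph_folder):
    """Create a NerDLApproach with the settings above (needs spark-nlp installed)."""
    from sparknlp.annotator import NerDLApproach

    ner = (
        NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])  # assumed column names
        .setLabelColumn("label")
        .setOutputCol("ner")
    )
    return apply_settings(ner, nerdl_settings(graph_folder))
```

Keeping the settings in a plain dictionary makes the Windows fallback easy to check on its own, before any Spark session exists.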

---------------
Bugfixes
---------------
* Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks CyborgDroid)
* Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
* Fixed DependencyParser pretrained models not working properly in Python
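
For reference, the start call named in the OCR fix could be wrapped as follows. This is a sketch that assumes the include_ocr keyword exactly as written in these notes; other releases may expose a different signature. start_session is a hypothetical helper name.

```python
def start_session(with_ocr=False):
    """Start a local Spark session through Spark NLP.

    The import is deferred so this module loads even without spark-nlp.
    include_ocr is the keyword quoted in the 2.0.5 notes.
    """
    import sparknlp

    return sparknlp.start(include_ocr=with_ocr)
```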

---------------
Models and Pipelines
---------------
* NerDL will download the noncontrib model if Windows is detected, for better compatibility
* Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
* Improved the error message shown when a user on Windows tries to load a contrib NerDL model
* Fixed ViveknSentimentModel not working properly (Thanks CyborgDroid)

---------------
Developer API
---------------
* Embeddings in Python moved to the annotator module for consistency
* SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
* Metadata model reader now ignores empty lines instead of failing
* Unified the attribute name to lang instead of language in the pretrained API
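
As an illustration of the unified attribute name, a pretrained pipeline could be requested with the lang keyword as sketched below. This assumes PretrainedPipeline is importable from sparknlp.pretrained and accepts lang as a keyword; load_pipeline is a hypothetical helper, and the pipeline name is left to the caller because the available names are listed in the documentation rather than here.

```python
def load_pipeline(name, lang="en"):
    """Fetch a pretrained pipeline by name and language code.

    The import is deferred so this module loads even without spark-nlp.
    """
    from sparknlp.pretrained import PretrainedPipeline

    # lang (not language) is the attribute name after this release.
    return PretrainedPipeline(name, lang=lang)
```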

========

2.0.4

Not secure
========
---------------
Overview
---------------
We are excited that the Spark NLP workshop (spark-nlp-workshop repository) has been so useful for many users.
We have also taken a step forward by moving the website's documentation to an easy-to-maintain wiki. The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.

---------------
Bugfixes
---------------
* Fixed DependencyParser and TypedDependencyParser working inaccurately
* Fixed a bug preventing the load of WordEmbeddingsModel class from python
* Fixed wrong pretrained model names that prevented some pretrained models from working properly
* Fixed BertEmbeddings not being able to load from file due to a reader exception

---------------
Documentation
---------------
* Website documentation migrated to GitHub wiki page (WIP)

---------------
Developer API
---------------
* OcrHelper now reports failed file name when throwing exceptions (Thanks kgeis)
* Fixed Annotation function explodeAnnotations to consider replacing output column scenarios
* Fixed Travis CI unit tests

========

2.0.3

Not secure
========
---------------
Overview
---------------
Shortly after 2.0.2, a hotfix release was made to address two bugs that prevented users from using pretrained TensorFlow models in clusters.
Please read release notes for 2.0.2 to catch up!

---------------
Bugfixes
---------------
* Fixed logger serialization, which caused issues when executors serialized TensorflowWrapper
* Fixed contrib loading in cluster, when retrieving a Tensorflow session

========

2.0.2

Not secure
========
---------------
Overview
---------------
Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word Embeddings continue to be improved for better performance and a lower memory footprint.
Context Spell Checker continues to receive enhancements in concurrency and in its use of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback; we welcome any issue reports!

---------------
New Features
---------------
* NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in metadata

---------------
Enhancements
---------------
* Cluster mode performance improved in tensorflow annotators by serializing to bytes internal information
* Doc2Chunk annotator added new params startCol, startColByTokenIndex, failOnMissing and lowerCase, which allow better chunking of documents
* All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
* ContextSpellChecker now creates a window around the token to improve computation performance
* Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
* WordEmbeddings won't load twice if already loaded
* WordEmbeddings can use embeddingsRef if source was not provided, improving reuse of embeddings in a pipeline
* New WordEmbeddings param includeEmbeddings lets annotators avoid saving the entire embeddings source along with them
* Contrib tensorflow dependencies now only load if necessary
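
The sentence and chunk IDs now carried in metadata make it possible to regroup annotations after the fact. The sketch below works on plain dictionaries with result and metadata fields; the metadata key name "sentence" is an assumption rather than something these notes specify, group_by_sentence is a hypothetical helper, and the sample annotations are hand-written for illustration, not real library output.

```python
from collections import defaultdict


def group_by_sentence(annotations, key="sentence"):
    """Group annotation results by the sentence ID found in their metadata.

    key is an assumed metadata key name; adjust it to whatever your
    annotations actually contain.
    """
    groups = defaultdict(list)
    for annotation in annotations:
        groups[annotation["metadata"][key]].append(annotation["result"])
    return dict(groups)


# Illustrative, hand-written annotations (not real library output).
tokens = [
    {"result": "Hello", "metadata": {"sentence": "0"}},
    {"result": "world", "metadata": {"sentence": "0"}},
    {"result": "Bye", "metadata": {"sentence": "1"}},
]
grouped = group_by_sentence(tokens)
print(grouped)
```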

---------------
Bugfixes
---------------
* Added missing Symmetric delete pretrained model
* Fixed a broken param name in Normalizer (thanks RobertSassen)
* Fixed Cloudera cluster support
* Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
* Fixed POS dataset creator to better handle corrupted pairs
* Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
* Fixed OCR Tess4J initialization problems in concurrent scenarios

---------------
Models and Pipelines
---------------
* Renaming of models and pipelines (work in progress)
* Better output column naming in pipelines

---------------
Developer API
---------------
* Unified more WordEmbeddings interface with dimension params and individual setters
* Improved unit tests for better compatibility on Windows
* Python embeddings moved to sparknlp.embeddings

========

2.0.1

Not secure
========
---------------
Overview
---------------
Thanks for following up after our 2.0.0 release! This release fills a few gaps left by the immense 2.0.0 release
and addresses high-priority issues found after it. More importantly, the library should now behave correctly when using
Spark cluster modes, and memory and CPU utilization should be back to normal levels after some serious profiling of serialization
revealed a number of problems. Aside from performance and resource management improvements, we include an OCR dependency handler in the start() function and
improve GPU support for NER Deep Learning models. Finally, check out our spark-nlp-workshop repo, it has cool features!

---------------
Enhancements
---------------
* Improved serialization of Deep Learning models, shows performance boosts of up to 2.5 times over 1.8.3
* Tensorflow contrib libraries now managed correctly across a cluster
* Reverted useFeatureBroadcasting after internal benchmarks proved it was performing better
* SparkNLP.start() and sparknlp.start() now accept an includeOCR parameter that automatically includes the OCR library
* Recreated NerDL Graphs to allow GPU allow_growth in tensorflow to improve memory management with GPU
* Expanded GPU coverage in NerDL graph
* Reduced NerDL Batch Size for better compatibility with GPUs

---------------
Bugfixes
---------------
* Fixed deep learning models not working across a cluster due to a bug in inputBuffers from graph reading
* Fixed a bug in POS() training function which did not work correctly from Python
* Fixed a bug in OCR where page number and intersection were not correctly matched
* Correctly handle exceptions when training Norvig and Symmetric Spell Checkers from dataframes

---------------
Developer API
---------------
* ContextSpellChecker now follows Features API correctly

---------------
Documentation
---------------
* spark-nlp-workshop repository has been expanded with better documentation and new notebooks
* We are still catching up with the 2.x release!

========

2.0.0

Not secure
========
---------------
Overview
---------------
Thank you for following along with the biggest changelog ever for Spark NLP: Spark NLP 2.0.0! Where to begin?
We merged no fewer than 50 pull requests this time. Most importantly, we became the first library to have a production-ready
implementation of BERT embeddings. Along with this deep learning, context-based embeddings algorithm, here is a quick overview of what is new:
* Word Embeddings as well as Bert Embeddings are now annotators, just like any other component in the library. This means embeddings can be
cached in memory through DataFrames, saved to disk, and shared as part of pipelines!
* We revamped and enhanced the Named Entity Recognition (NER) Deep Learning models to a new state-of-the-art level, reaching up to 93% micro-averaged F1 accuracy in the industry standard.
* We upgraded the TensorFlow version and started using contrib LSTM cells.
* Performance and memory usage also improved: serialization throughput of Deep Learning annotators increased thanks to feedback from Apache Spark contributor Davies Liu.
* We revamped and expanded our list of pretrained pipelines and added new pretrained models for different languages, together with
tons of new example notebooks, which include changes that aim to make the library easier to use. The API overall was modified to help newcomers get started.
* The OCR module comes with a handful of improvements that increase accuracy.
All of this comes together with a full range of bug fixes and annotator improvements; see the details below!
Bear with us, since documentation is still catching up and new models have yet to be made available. Stay tuned on Slack!

----------------
New Features
----------------
* BertEmbeddings annotator, with four Google models ready to be used through Spark NLP as part of your pipelines; includes WordPiece tokenization.
* WordEmbeddings, our previous embeddings system, is now an Annotator that can be serialized along with Spark ML pipelines
* Created training helper functions that create spark datasets from files, such as CoNLL and POS tagging
* NER DL has been revamped by using contrib LSTM Cells. Added library handling for different OS.

----------------
Enhancements
----------------
* OCR improved handling of images by adding binarizing of buffered segments
* OCR now allows automatic adaptive scaling
* SentenceDetector params merged between DL and Rule based annotators
* SentenceDetector max length has been disabled by default, and now truncates by whitespace
* Part of Speech, NER, Spell Checking and Vivekn Sentiment Analysis annotators now train from dataset passed to fit() using Spark in the process
* Tokens and Chunks now hold metadata information regarding which sentence they belong to by sentence ID
* AnnotatorApproach annotators now accept a trainingCols param, which lets them use different inputs in training and in prediction. This improves Pipeline versatility.
* LightPipelines now allow transform() to be called against a DataFrame
* Noticeable performance gains by improving serialization performance in annotators through removal of transient variables
* Spark NLP in 30 seconds now provides a function SparkNLP.start() and sparknlp.start() (python) that automatically creates a local Spark session.
* Improved DateMatcher accuracy
* Improved Normalizer annotator by supporting and tokenizing a slang dictionary, with case sensitivity matching option
* ContextSpellChecker now is capable of handling multiple sentences in a row
* PretrainedPipeline feature now allows handling John Snow Labs remote pretrained pipelines to make it easy to update and access new models
* Symmetric Delete spell checking model improved training performance
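
A short sketch of the LightPipeline change listed above, assuming LightPipeline is importable from sparknlp.base and wraps a fitted PipelineModel. annotate_dataframe is a hypothetical helper; the fitted model and the DataFrame are expected to come from your own session, for example one created with sparknlp.start().

```python
def annotate_dataframe(pipeline_model, dataframe):
    """Run a fitted pipeline over a DataFrame through LightPipeline.

    The import is deferred so this module loads even without spark-nlp.
    """
    from sparknlp.base import LightPipeline

    light = LightPipeline(pipeline_model)
    # transform() can be called against a DataFrame as of 2.0.0.
    return light.transform(dataframe)
```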

----------------
Models and Pipelines
----------------
* Added more than 15 pretrained pipelines that cover a huge range of use cases. To be documented
* Improved multi-language support by adding French and Italian pipelines and models. More to come!
* Dependency Parser annotators now include a pretrained English model based on CoNLL-U 2009

----------------
Bugfixes
----------------
* Fixed python classname reference when deserializing pipelines
* Fixed serialization in ContextSpellChecker
* Fixed a bug in LightPipeline that caused output from embedded pipelines in a PipelineModel not to be included
* Fixed a wrong DateMatcher param name that prevented the param from being accessed properly
* Fixed a bug where DateMatcher didn't know how to handle dash in dates where year had two digits instead of four
* Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
* Fixed a bug in OCR that made it blow up with some image formats when using text preferred method
* Fixed a bug in OCR that prevented params from working in cluster mode
* Fixed OCR setSplitPages and setSplitRegions to work properly if tesseract detected multiple regions

----------------
Developer API
----------------
* AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
* Embeddings now serialize along a FloatArray in Annotation class
* Disabled useFeatureBroadcasting; disabling it showed better performance numbers when training large models in annotators that use Features
* OCR must be instantiated
* OCR works best with 4.0.0-beta.1

----------------
Build and release
----------------
* Added GPU build with tensorflow-gpu to Maven coordinates
* Removed .jar file from pip package

========
