Sparknlp

Latest version: v1.0.0

Safety actively analyzes 682361 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 5 of 12

2.3.0

========
---------------
Overview
---------------
Thanks for your contributions and feedback on Slack. This amazing release comes with many new features in the embeddings scope,
allowing pipeline builders to retrieve embeddings for specific bodies of texts in any form given, from sentences to chunks or n-grams.
We also worked a lot on making sure Spark NLP on Java works as intended. Finally, we improved aws profiles compatibility for frameworks
that utilize multiple credential profiles. Unfortunately, we have deprected Eval and OCR due to internal patents in some of the latest improvements
John Snow Labs has contributed to.

---------------
New Features
---------------
* New SentenceEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate sentence or document embeddings
* New ChunkEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from Chunker or NGramGenerator outputs
* New StopWordsCleaner integrates Spark ML StopWordsRemoval function into Spark NLP pipeline
* New NGramGenerator annotator integrates Spark ML NGram function into Spark ML with a new cumulative feature to also generate range ngrams like the scikit-learn library

---------------
Enhancements
---------------
* Improved Java intercompatibility on Pretrained and LightPipeline APIs. Examples added.
* Finisher and LightPipelines Parse Embeddings Vector flag allows for optional vector processing to save memory and improve performance
* setInputCols in python can be passed as *args
* new Param enableScore in SentimentDetector to switch output types between confidence score and results (Thanks maxwellpaulm)
* spark_nlp profile name by default in AWS config allows for multiple profile download compatible

---------------
Bugfixes
---------------
* Fixed POS training dataset creator to improve performance

---------------
Deprecations
---------------
* OCR Module dropped from open source support
* Eval Module dropped from open source support

========

2.2.2

========
---------------
Overview
---------------
Thank you again for all your feedback and questions in our Slack channel. Such feedback from users and contributors
(thank you Stuart Lynn sllynn) helped to find several python module bugs. We also fixed and improved OCR support
towards extracting page coordinates and fixed NerDL evaluator from Python

---------------
Enhancements
---------------
* Added a create_models.py python script to generate Graphs for NerDL without the need of jupyter
* Added a new annotator Token2Chunk to convert all tokens to chunk types (useful for extracting token coordinates from OCR)
* Added OCR Page Dimensions
* Python setInputCols now accepts *args no need to input list

---------------
Bugfixes
---------------
* Fixed python support of NerDL evaluation not taking all params appropriately
* Fixed a bug in case sensitivity matching of embeddings format in python (Thanks sllynn)
* Fixed a bug in python DateMatcher with dateFormat param not working (Thanks sllynn)
* Fixed a bug in PositionFinder reporting duplicate coordinate elements

----------------
Developer API
----------------
* Renamed trainValidationProp to validationSplit in NerDLApproach

----------------
Documentation
----------------
* Added several missing annotator documentation in docs page

========

2.2.1

========
---------------
Overview
---------------
This short release is to address a few uncovered issues in the previous 2.2.0 release. Thank you all for quick feedback.

---------------
Enhancements
---------------
* NerDLApproach new param includeValidationProp allows partitioning the training set and exclude a fraction
* NerDLApproach trainValidationProp now randomly samples the data as opposed to head first

---------------
Bugfixes
---------------
* Fixed a bug in ResourceHelper causing folder resources to fail when a folder is empty (affects various annotators)
* Fixed a bug in python embeddings format not parsed to upper case
* Fixed a bug in python causing an incapability to load PipelineModels after loading embeddings

========

2.2.0

========
---------------
Overview
---------------
Last time, following a release candidate schedule proved to be a quite effective method to avoid silly bugs right after release!
Fortunately, there were no breaking bugs by carefully testing releases alongside the community,
which ended up in various pull requests. This huge release features OCR based coordinate highlighting, BERT embeddings refactor and tuning, more tools for accuracy evaluation in python, and much more.
We welcome your feedback in our Slack channels, as always!

---------------
New Features
---------------
* OCRHelper now returns coordinate positions matrix for text converted from PDF
* New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
* Evaluation module now also ported to Python
* WordEmbeddings now include coverage metadata information and new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
* NerDL Now has `includeConfidence` param that enables confidence scores on prediction metadata
* NerDLApproach now has `enableOutputLog` outputs training metric logs to file
* New Param in BERT `poolingLayer` allows for polling layer selection

---------------
Enhancements
---------------
* BERT Embeddings now merges much better with Spark NLP, returning state of the art accuracy numbers for NER (Details will be expanded). Thank you for community feedback.
* Progress bar and size estimate report when downloading pretrained models and loading embeddings
* Models and pipeline cache now more efficiently managed and includes CRC (not retroactive)
* Finisher and LightPipeline now deal with embeddings properly, including them in pre processed result (Thank you Will Held)
* Tokenizer now allows regular expressions in the list of Exceptions (Thank you atomobianco)
* PretrainedPipelines now allow function `fullAnnotate` to retrieve fully information of Annotations
* DocumentAssembler new cleanup modes: each, each_full and delete_full allow more control over text cleaning up (different ways of dealing with new lines and tabs)

---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter
* Fixed a bug where parameters from a BERT model were incorrectly being read from python because of not being correctly serialized
* Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
* Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)

========

2.1.1

========
---------------
Overview
---------------
Thank you so much for your feedback on slack. This release is to extend life length of the 2.1.x release, with important bugfixes from upstream

---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter

========

2.1.0

========
---------------
Overview
---------------
Thank you for following up with release candidates. This release is backwards breaking because two basic annotators have been redesigned.
The tokenizer now has easier to customize params and simplified exception management.
DocumentAssembler `trimAndClearNewLiens` was redesigned into a `cleanupMode` for further control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be capable of accessing any of our language based Tokenizers.
Another big introduction is the `eval` module. An optional Spark NLP sub-module that provides evaluation scripts, to
make it easier when looking to measure your own models are against a validation dataset, now using MLFlow.
Some work also began on metrics during training, starting now with the `NerDLApproach`.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to csnardi for fixing a bug in one of the release candidates.

---------------
New Features
---------------
* Spark NLP Eval module, includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)

---------------
Enhancements
---------------
* DocumentAssembler new param `cleanupMode` allows user to decide what kind of cleanup to apply to source
* Tokenizer has been severely enhanced to allow easier and more intuitive customization
* Norvig and Symmetric spell checkers now report confidence scores in metadata
* NerDLApproach now reports metrics and f1 scores with an automated dataset splitting through `setTrainValidationProp`
* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence score, etc), sets ground base for further development

---------------
Bugfixes
---------------
* Fixed Dependency Parser not reporting offsets correctly
* Dependency Parser now only shows head token as part of the result, instead of pairs
* Fixed NerDLModel not allowing to pick noncontrib versions from linux
* Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
* Removed unintentional gc calls causing some performance issues

---------------
Framework
---------------
* ResourceDownloader now capable of utilizing credentials from aws standard means (variables, credentials folder)

---------------
Documentation
---------------
* Scaladocs for Spark NLP reference
* Added Google Colab workthrough guide
* Added Approach and Model class names in reference documentation
* Fixed various typos and outdated pieces in documentation

========

Page 5 of 12

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.