========
---------------
Overview
---------------
This release is huge! Spark-NLP made the leap to Spark 2.4.0, even though not everyone is on board there yet (e.g. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators: two for dependency parsing and one for contextual, deep-learning-based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
Finally, there are plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to shoot us any feedback you have, particularly on your Spark 2.4.x experience!
---------------
New Features
---------------
* Built on top of Spark 2.4.0
* Dependency Parser annotator encodes the relationships between words in a sentence
* Typed Dependency Parser annotator labels the relationships in a dependency parse with dependency tags
* ContextSpellChecker is our first deep-learning-based spell checker that evaluates context, not just individual tokens
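
To illustrate the difference between an untyped and a typed dependency parse, here is a minimal plain-Python sketch. It does not use the Spark-NLP API: the `Arc` type and the `typed` and `describe` helpers are hypothetical names made up for this example, and the relation labels (`nsubj`, `dobj`) are assumed for illustration only.

```python
from typing import List, NamedTuple, Optional


class Arc(NamedTuple):
    """One dependency arc (hypothetical type, not a Spark-NLP class)."""
    token: str
    head: int                    # 0 means ROOT; otherwise 1-based index of the head token
    label: Optional[str] = None  # relation type; None for an untyped parse


def typed(arcs: List[Arc], labels: List[str]) -> List[Arc]:
    """Attach one relation label to each untyped arc."""
    if len(arcs) != len(labels):
        raise ValueError("one label per arc is required")
    return [arc._replace(label=label) for arc, label in zip(arcs, labels)]


def describe(arcs: List[Arc]) -> List[str]:
    """Render each arc as 'head -> dependent', plus the label when present."""
    lines = []
    for arc in arcs:
        head = "ROOT" if arc.head == 0 else arcs[arc.head - 1].token
        suffix = f" ({arc.label})" if arc.label else ""
        lines.append(f"{head} -> {arc.token}{suffix}")
    return lines


# "She reads books": 'reads' is the root; 'She' and 'books' depend on it.
untyped_parse = [Arc("She", 2), Arc("reads", 0), Arc("books", 2)]
typed_parse = typed(untyped_parse, ["nsubj", "root", "dobj"])
```

The untyped parse only records which token each word attaches to; the typed parse adds the kind of relation on top of the same structure.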
---------------
Enhancements
---------------
* More OCR parameters exposed for further fine-tuning, including preferred methods priority and page segmentation modes
* OCR now has a setSplitPages() setting, which controls whether to output one page per row or the entire document instead
* Improved word embeddings performance when working in local filesystems
* Reduced the amount of disk IO when working with Word Embeddings
* All Python notebooks improved for better readability and documentation
* Simplified PySpark interface API
* CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
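
For readers unfamiliar with the CoNLL-2003 layout used for NER training, the following is a rough plain-Python sketch of the file format. It is not the CoNLLGenerator API: `to_conll2003` is a hypothetical helper, and the sample tags (including the `B-` prefixed entity scheme) are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

# One token: (word, part-of-speech tag, chunk tag, named-entity tag)
Token = Tuple[str, str, str, str]


def to_conll2003(sentences: Iterable[List[Token]]) -> str:
    """Serialize tagged sentences into CoNLL-2003 style text (hypothetical helper)."""
    # A document starts with a -DOCSTART- marker line followed by a blank line.
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for word, pos, chunk, ner in sentence:
            # One token per line, four space-separated columns.
            lines.append(f"{word} {pos} {chunk} {ner}")
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines)


sample = [
    [
        ("John", "NNP", "B-NP", "B-PER"),
        ("lives", "VBZ", "B-VP", "O"),
        ("in", "IN", "B-PP", "O"),
        ("Berlin", "NNP", "B-NP", "B-LOC"),
        (".", ".", "O", "O"),
    ]
]
conll_text = to_conll2003(sample)
```

Each token occupies one line with its word, part-of-speech tag, chunk tag and entity tag, and sentences are separated by blank lines.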
---------------
Bugfixes
---------------
* Solved race-condition issues in cluster usage of the RocksDB index for embeddings
* Fixed an application.conf reading bug that prevented AWS credentials from being refreshed properly
* RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
* Fixed various Python default parameter settings
* Fixed circular dependency with jbig pdfbox image OCR
---------------
Deprecations
---------------
* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
========