Dedoc

Latest version: v2.3.1


2.3.1

* Fixed a bug with bold lines in `DocxReader` (see [issue 479](https://github.com/ispras/dedoc/issues/479)).
* Upgraded `requirements.txt` (`beautifulsoup4` to version 4.12.3).
* Added support for external GROBID (added support for the `Authorization` parameter).
* Added GOST (Russian government standard) frame recognition in `PdfTabbyReader` (`need_gost_frame_analysis` parameter).
* Updated documentation (added GOST frame recognition).
* Added multi-page table handling to `PdfTabbyReader`.
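
As a rough illustration of the `need_gost_frame_analysis` parameter mentioned above, the following minimal sketch builds a parameters dictionary and passes it to Dedoc. It rests on assumptions the changelog does not state: that parameter values are passed as strings, that `pdf_with_text_layer="tabby"` selects `PdfTabbyReader`, and that `DedocManager.parse` accepts `file_path` and `parameters`. The helper names `gost_frame_parameters` and `parse_pdf` are hypothetical.

```python
from typing import Dict


def gost_frame_parameters(reader: str = "tabby") -> Dict[str, str]:
    """Build a parameters dict that enables GOST frame analysis.

    Assumption: Dedoc expects string values, and "tabby" picks PdfTabbyReader.
    """
    return {
        "pdf_with_text_layer": reader,
        "need_gost_frame_analysis": "true",
    }


def parse_pdf(path: str) -> object:
    """Parse a PDF with GOST frame analysis enabled (requires dedoc installed)."""
    # Imported lazily so the helper above stays usable without dedoc.
    from dedoc import DedocManager

    manager = DedocManager()
    return manager.parse(file_path=path, parameters=gost_frame_parameters())
```

Calling `parse_pdf("drawing.pdf")` would then run the whole pipeline on a hypothetical file; the exact parameter values should be checked against the Dedoc documentation for the installed version.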

2.3

* Created the [Dedoc Telegram chat](https://t.me/dedoc_chat).
* Added `patterns` parameter for configuring default structure type.
* Added notebooks with Dedoc usage (see [issue 484](https://github.com/ispras/dedoc/issues/484)).
* Fixed the `OutOfMemoryError: Java heap space` bug in `PdfTabbyReader` (see [issue 489](https://github.com/ispras/dedoc/issues/489)).
* Fixed a bug with numeration in `DocxReader` (see [issue 494](https://github.com/ispras/dedoc/issues/494)).
* Added GOST (Russian government standard) frame recognition in `PdfImageReader` and `PdfTxtlayerReader` (`need_gost_frame_analysis` parameter).

2.2.7

* Fixed bugs with `start` and `end` of `BBoxAnnotation` in `PdfTabbyReader`.
* Improved column classification and orientation detection for PDF and images (`is_one_column_document` and `document_orientation` parameters).
* Upgraded Docker usage: `docker-compose` is no longer supported, use `docker compose` instead.
* Fixed a table parsing bug in `DocxReader` (see [issue 478](https://github.com/ispras/dedoc/issues/478)).
* Added simple textual layer detection in `PdfAutoReader` (`fast_textual_layer_detection` parameter).
* Improved paragraph extraction from PDF documents and images.
* Retrained the classifier for diplomas (`document_type="diploma"`) on a new dataset.
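
The layout-related parameters named in this release can be combined in one parameters dictionary. The sketch below is illustrative only: the allowed values listed in `ALLOWED` are assumptions not given in the changelog, as is the use of `pdf_with_text_layer="auto"` to select `PdfAutoReader`. The helper `layout_parameters` is hypothetical and simply validates overrides before they would be sent to Dedoc.

```python
from typing import Dict, Set

# Assumed allowed values; verify against the Dedoc parameter reference.
ALLOWED: Dict[str, Set[str]] = {
    "is_one_column_document": {"auto", "true", "false"},
    "document_orientation": {"auto", "no_change"},
    "fast_textual_layer_detection": {"true", "false"},
}


def layout_parameters(**overrides: str) -> Dict[str, str]:
    """Return layout parameters with validated overrides applied."""
    params = {
        "pdf_with_text_layer": "auto",  # assumption: routes to PdfAutoReader
        "is_one_column_document": "auto",
        "document_orientation": "auto",
        "fast_textual_layer_detection": "true",
    }
    for name, value in overrides.items():
        if name not in ALLOWED:
            raise KeyError(f"unknown layout parameter: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"unsupported value {value!r} for {name}")
        params[name] = value
    return params
```

For example, `layout_parameters(document_orientation="no_change")` keeps automatic column detection while skipping orientation correction, under the assumptions above.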

2.2.6

* Upgraded dependencies: `numpy<2.0` and `dedoc-utils==0.3.7`.

2.2.5

* Added internal functions and classes to support integration of Dedoc into [langchain](https://github.com/langchain-ai/langchain).
* Upgraded some dependencies, in particular `xgboost>=1.6.0`, `pandas`, and `pdfminer.six`.

2.2.4

* Page division and page numbers are now shown in the HTML output representation (API usage, `return_format="html"`).
* Made imports from the dedoc library faster.
* Added a tutorial on how to add a new language to Dedoc (not yet complete).
* Added additional `page_id` metadata for multi-page nodes (`structure_type="tree"` in the API, `TreeConstructor` in the library).
* Updated OCR and orientation/columns classification benchmarks.
* Minor edits of `README.md`.
* Fixed empty cells handling in `CSVReader`.
* Fixed bounding boxes extraction for text in tables for `PdfTabbyReader`.
