Dedoc

Latest version: v2.3.2

Page 2 of 4

2.2.4

* Show page division and page numbers in the HTML output representation (API usage, return_format="html").
* Made imports from the dedoc library faster.
* Added a tutorial on how to add a new language to dedoc (not yet complete).
* Added `page_id` metadata for multi-page nodes (structure_type="tree" in API, `TreeConstructor` in the library).
* Updated OCR and orientation/columns classification benchmarks.
* Minor edits of `README.md`.
* Fixed empty cells handling in `CSVReader`.
* Fixed bounding boxes extraction for text in tables for `PdfTabbyReader`.

2.2.3

* Show attached images and allow downloading attached files in the HTML output representation (API usage, return_format="html").
* Added hierarchy level information and annotations to `PptxReader`.

2.2.2

* Added images extraction to `ArticleReader`.
* Added attachments and references to them in the HTML output representation (return_format="html").
* Fixed the behavior of the `need_content_analysis` parameter.
* Fixed `CSVReader` (exclude BOM character from the output).
* Added handling of files with a wrong or missing extension to `DedocManager` (the file type is detected from the file content).
* Update `README.md`.
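
Two of the fixes above (excluding the BOM character from CSV output and detecting a file type by its content) can be illustrated with a minimal sketch. This is not dedoc's implementation: the names `decode_without_bom`, `guess_type_by_content`, and the `SIGNATURES` table are hypothetical, and the sketch uses only the Python standard library with a handful of well-known magic numbers.

```python
# Hypothetical illustration only; none of these names come from dedoc.

# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container format of docx, xlsx, pptx
]


def guess_type_by_content(data: bytes) -> str:
    """Return a coarse type label based on the first bytes, or "unknown"."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def decode_without_bom(raw: bytes) -> str:
    """Decode UTF-8 text, dropping a leading byte order mark if one is present."""
    # The "utf-8-sig" codec skips a leading BOM and otherwise behaves like UTF-8.
    return raw.decode("utf-8-sig")
```

Checking the leading bytes instead of the file name is what lets a mislabeled or extensionless file still be routed to a suitable reader; a real detector would need many more signatures and a fallback for plain-text formats.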

2.2.1

* Added `fintoc` structure type for parsing financial prospectuses according to the [FinTOC 2022 Shared task](https://wp.lancs.ac.uk/cfie/fintoc2022/) (`FintocStructureExtractor`).
* Fixed small bugs in `ArticleReader`: colspan for tables, keywords, section numbering, etc.
* Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
* Removed `other_fields` from `LineMetadata` and `DocumentMetadata`.
* Update `README.md`.

2.2

* Improved `PdfTabbyReader`: bug fixes and faster partial PDF extraction (with the `pages` parameter).
* Added benchmarks for evaluating the performance of PDF readers.
* Added `ReferenceAnnotation` class.
* Fixed a bug in the `can_read` method for all readers.
* Added `article` structure type for parsing scientific articles using [GROBID](https://grobid.readthedocs.io) (`ArticleReader`, `ArticleStructureExtractor`).

2.1.1

* Update `README.md`.
* Update table and time benchmarks.
* Re-label line-classifier datasets (law, diploma, paragraphs datasets).
* Update tasker creators (for the labeling system).
* Fix HTML table parsing.
