Dedoc

Latest version: v2.3.1

2.2.3

* Added display of attached images and the ability to download attached files in the HTML output representation (API usage, return_format="html").
* Added hierarchy level information and annotations to `PptxReader`.

2.2.2

* Added images extraction to `ArticleReader`.
* Added attachments and references to them in the HTML output representation (return_format="html").
* Fixed the behavior of the `need_content_analysis` parameter.
* Fixed `CSVReader` (exclude BOM character from the output).
* Added handling of files with a wrong extension or without an extension to `DedocManager` (the file type is detected by its content).
* Updated `README.md`.
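
The BOM fix and the content-based type detection listed for 2.2.2 can be illustrated with a small standard-library sketch. This is not Dedoc's implementation: the helper names `read_csv_rows` and `sniff_file_type` and the short signature table are hypothetical, chosen only to show the general idea.

```python
import csv
import io
from typing import List, Optional

# Hypothetical, deliberately short table of well-known file signatures.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",  # also the container format of DOCX, PPTX, XLSX
}


def read_csv_rows(raw: bytes) -> List[List[str]]:
    """Parse CSV bytes, dropping a leading UTF-8 BOM if one is present."""
    # The "utf-8-sig" codec decodes UTF-8 and silently skips a leading BOM.
    text = raw.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text)))


def sniff_file_type(raw: bytes) -> Optional[str]:
    """Guess a file type from its first bytes, ignoring the file name."""
    for signature, file_type in MAGIC_SIGNATURES.items():
        if raw.startswith(signature):
            return file_type
    return None  # unknown content: the caller decides how to fall back


rows = read_csv_rows(b"\xef\xbb\xbfname,value\r\nalpha,1\r\n")
guessed = sniff_file_type(b"%PDF-1.7\n...")
```

With the BOM removed during decoding, the first header cell is `name` rather than `\ufeffname`, and a PDF saved under a misleading extension is still recognized from its leading bytes.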

2.2.1

* Added `fintoc` structure type for parsing financial prospectuses according to the [FinTOC 2022 Shared task](https://wp.lancs.ac.uk/cfie/fintoc2022/) (`FintocStructureExtractor`).
* Fixed small bugs in `ArticleReader`: colspan for tables, keywords, section numbering, etc.
* Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
* Removed `other_fields` from `LineMetadata` and `DocumentMetadata`.
* Updated `README.md`.

2.2

* Improved `PdfTabbyReader`: bug fixes and faster partial PDF extraction (with the `pages` parameter).
* Added benchmarks for evaluating the performance of PDF readers.
* Added `ReferenceAnnotation` class.
* Fixed a bug in the `can_read` method for all readers.
* Added `article` structure type for parsing scientific articles using [GROBID](https://grobid.readthedocs.io) (`ArticleReader`, `ArticleStructureExtractor`).
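
As a hedged illustration of how the `pages` parameter and the new structure type might be combined, the sketch below builds a parameters dictionary. The helper `build_pdf_parameters` is hypothetical. The key names `pdf_with_text_layer` and `document_type`, the `"tabby"` value, and the `start:end` slice format for `pages` are assumptions based on the Dedoc documentation and should be checked against the installed version.

```python
from typing import Dict, Optional


def build_pdf_parameters(
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    document_type: Optional[str] = None,
) -> Dict[str, str]:
    """Build a parameters dict for partial PDF extraction (hypothetical helper)."""
    start = "" if first_page is None else str(first_page)
    end = "" if last_page is None else str(last_page)
    parameters = {
        "pdf_with_text_layer": "tabby",  # assumed value selecting PdfTabbyReader
        "pages": f"{start}:{end}",       # assumed slice format, e.g. "1:3" or ":3"
    }
    if document_type is not None:
        parameters["document_type"] = document_type  # e.g. "article"
    return parameters


def parse_with_dedoc(file_path: str, parameters: Dict[str, str]):
    """Pass the parameters to Dedoc; requires the dedoc package to be installed."""
    from dedoc import DedocManager  # imported lazily so the sketch loads without dedoc

    manager = DedocManager()
    return manager.parse(file_path=file_path, parameters=parameters)


params = build_pdf_parameters(first_page=1, last_page=3, document_type="article")
```

Note that the `article` structure type relies on GROBID, so running `parse_with_dedoc` with it presumably also needs a reachable GROBID service.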

2.1.1

* Updated `README.md`.
* Updated table and time benchmarks.
* Re-labeled line-classifier datasets (law, diploma, paragraphs datasets).
* Updated tasker creators (for the labeling system).
* Fixed HTML table parsing.

2.1

* Removed custom loggers (the common logger is now used for all dedoc classes).
* Changed the orientation correction function: the document image is left unchanged if its orientation is already correct.
* Only `PdfTabbyReader` is now used to detect a textual layer in PDF files.
* Refactored the code related to the labeling mode and removed it from the library package (it now lives in a separate directory).
* Added `BoldAnnotation` for words in `PdfImageReader`.
* Added more benchmarks: parsing of table images and postprocessing of Tesseract OCR.
* Fixed several issues in the Dedoc web form.
* Added a tutorial on how to add a new structure type to Dedoc.
* Fixed parsing of EML and HTML files.

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.