Dedoc

Latest version: v2.2.1


2.2.1

* Added the `fintoc` structure type for parsing financial prospectuses according to the [FinTOC 2022 Shared task](https://wp.lancs.ac.uk/cfie/fintoc2022/) (`FintocStructureExtractor`).
* Fixed small bugs in `ArticleReader`: colspan for tables, keywords, section numbering, etc.
* Added references to nodes and fixed small bugs in the HTML output representation (`return_format="html"`).
* Removed `other_fields` from `LineMetadata` and `DocumentMetadata`.
* Updated `README.md`.

2.2

* Improved `PdfTabbyReader`: bug fixes and faster partial PDF extraction (with the `pages` parameter).
* Added benchmarks for evaluating the performance of PDF readers.
* Added `ReferenceAnnotation` class.
* Fixed a bug in the `can_read` method of all readers.
* Added the `article` structure type for parsing scientific articles using [GROBID](https://grobid.readthedocs.io) (`ArticleReader`, `ArticleStructureExtractor`).
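
To illustrate what partial extraction with a `pages` parameter involves, here is a minimal sketch of validating a page-slice string before reading. The `start:end` string format, the 1-based numbering, and the helper name `parse_pages` are assumptions made for illustration and are not taken from Dedoc's code; check the Dedoc documentation for the actual syntax.

```python
from typing import Optional, Tuple


def parse_pages(pages: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse a hypothetical "start:end" page slice into (start, end).

    Assumption: pages are numbered from 1 and either bound may be omitted.
    A missing bound is returned as None.
    """
    if pages.count(":") != 1:
        raise ValueError(f"expected exactly one ':' in {pages!r}")
    left, right = (part.strip() for part in pages.split(":"))
    start = int(left) if left else None
    end = int(right) if right else None
    if start is not None and start < 1:
        raise ValueError("start page must be at least 1")
    if end is not None and end < 1:
        raise ValueError("end page must be at least 1")
    if start is not None and end is not None and end < start:
        raise ValueError("end page must not precede start page")
    return start, end


if __name__ == "__main__":
    for value in ("2:5", ":3", "4:", ":"):
        print(value, "->", parse_pages(value))
```

Validating the slice up front lets a reader skip unneeded pages instead of parsing the whole file and discarding most of the result.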

2.1.1

* Update README.md.
* Update table and time benchmarks.
* Re-label line-classifier datasets (law, diploma, paragraphs datasets).
* Update tasker creators (for the labeling system).
* Fix HTML table parsing.

2.1

* Removed custom loggers (a common logger is now used for all dedoc classes).
* Do not change the document image if it has a correct orientation (orientation correction function changed).
* Use only `PdfTabbyReader` during detection of a textual layer in PDF files.
* Refactored the code related to the labeling mode and moved it out of the library package into a separate directory.
* Added `BoldAnnotation` for words in `PdfImageReader`.
* Added more benchmarks: parsing of table images and postprocessing of Tesseract OCR output.
* Fixed several issues in the Dedoc web form.
* Added a tutorial on how to add a new structure type to Dedoc.
* Fixed parsing of EML and HTML files.

2.0

* Fix table extraction from `PDF` using empty config (see [issue](https://github.com/ispras/dedoc/issues/373)).
* Add more benchmarks for Tesseract.
* Fix extension extraction for file names with several dots.
* Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors).
  See the `Package reference` section of the [documentation](https://dedoc.readthedocs.io) for more details.
* Add `AttachAnnotation` and `TableAnnotation` to `PPTX` (see [discussion](https://github.com/ispras/dedoc/discussions/386)).
* Fix bugs in `DOCX` handling (see issues [378](https://github.com/ispras/dedoc/issues/378), [379](https://github.com/ispras/dedoc/issues/379)).
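
The extension fix in this release concerns file names with several dots. As a rough illustration of the general problem, the sketch below uses only the standard library's `pathlib`; it is not Dedoc's implementation, and the helper names `split_name_and_extension` and `all_suffixes` are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def split_name_and_extension(file_name: str) -> Tuple[str, str]:
    """Split a file name into (stem, extension) using only the last dot.

    Dots earlier in the name stay part of the stem, so a name such as
    "report.v2.final.PDF" keeps its full stem. The extension is lowercased.
    """
    path = Path(file_name)
    return path.stem, path.suffix.lower()


def all_suffixes(file_name: str) -> str:
    """Return every dot-separated suffix joined together, lowercased."""
    return "".join(Path(file_name).suffixes).lower()


if __name__ == "__main__":
    print(split_name_and_extension("report.v2.final.PDF"))
    print(all_suffixes("archive.tar.gz"))
```

Taking only the last suffix is the safer default for document formats, while compound extensions such as `.tar.gz` need the joined form.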

1.1.1

* Use an older `pydantic` version to improve compatibility with other libraries.
* Add support for `RTF` format.
* Fix a bug in handling file names with dots and spaces.
* Fix handling of non-integer text formatting values in `DocxReader`.
* Add support for the `on_gpu` parameter in `config`.
* Add attached images extraction for `PdfTabbyReader`.
* Fix partial file reading for `PdfTabbyReader`.
* Add a tutorial on how to create dedoc's basic data structures.
* Fix `attachments_dir` parameter for readers and attachments extractors.
