Dedoc

Latest version: v2.3.1

Page 3 of 4

2.0

* Fix table extraction from `PDF` using empty config (see [issue](https://github.com/ispras/dedoc/issues/373))
* Add more benchmarks for Tesseract
* Fix extension extraction for file names with several dots
* Rename some methods and their parameters in all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors).
See the `Package reference` section of the [documentation](https://dedoc.readthedocs.io) for details.
* Add `AttachAnnotation` and `TableAnnotation` to `PPTX` (see [discussion](https://github.com/ispras/dedoc/discussions/386))
* Fix bugs in `DOCX` handling (see issues [378](https://github.com/ispras/dedoc/issues/378), [379](https://github.com/ispras/dedoc/issues/379))
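
The extension fix for file names with several dots, listed above, addresses a common pitfall. The sketch below illustrates the general problem using only the Python standard library. It is not dedoc's implementation: the function name `split_name_extension` and the `COMPOUND_EXTENSIONS` set are hypothetical and chosen purely for illustration.

```python
from pathlib import PurePath

# Hypothetical set of multi-part extensions, for illustration only.
COMPOUND_EXTENSIONS = {".tar.gz", ".tar.bz2", ".tar.xz"}


def split_name_extension(file_name: str) -> tuple[str, str]:
    """Split a file name into (stem, extension).

    Only the last dot-separated part counts as the extension,
    unless the name ends with a known compound extension.
    """
    name = PurePath(file_name).name  # drop any directory part
    lower = name.lower()

    # Known compound extensions take priority over the last-dot rule.
    for ext in COMPOUND_EXTENSIONS:
        if lower.endswith(ext) and len(name) > len(ext):
            return name[:-len(ext)], name[-len(ext):]

    stem, dot, ext = name.rpartition(".")
    # No dot, a leading dot only (".bashrc"), or a trailing dot ("file."):
    # treat the whole name as the stem.
    if not dot or not stem or not ext:
        return name, ""
    return stem, "." + ext
```

With this approach, a name such as `report.v2.final.docx` keeps everything before the last dot as the stem, and names without a real extension come back unchanged.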

1.1.1

* Use an older `pydantic` version to improve compatibility with other libraries.
* Add support for `RTF` format.
* Fix bug in handling file names with dots and spaces.
* Fix bug with non-integer values of text formatting in `DocxReader`.
* Add support for the `on_gpu` parameter in `config`.
* Add attached images extraction for `PdfTabbyReader`.
* Fix partial file reading for `PdfTabbyReader`.
* Add a tutorial on how to create dedoc's basic data structures.
* Fix `attachments_dir` parameter for readers and attachments extractors.

1.1.0

* Add `BBoxAnnotation` to table cells for `PdfTabbyReader`.
* Fix Swagger, add API schema classes, remove `to_dict` method from `ParsedDocument`.
* Improve PDF parsing in `PdfTxtlayerReader`, add benchmarks.
* Fix `BBoxAnnotation` extraction for tables in `PdfImageReader` using `table_type=split_last_column` parameter.
* Change base method of metadata extractors, rename it to `extract_metadata`.
* Unify `BBoxAnnotation` extraction for all PDF readers: return only word bounding boxes.
* Increase timeout value for all converters.

1.0

* Remove `is_one_column_document_list` parameter.
* Add a tutorial on adding support for a new document type to the documentation.
* Improve textual layer correctness classifier.
* Improve orientation and columns classifier.
* Change the table output structure: cells are now `CellWithMeta` objects instead of text strings.
* Add `BBoxAnnotation` to table cells for `PdfTxtlayerReader` and `PdfImageReader`.
* Add `ConfidenceAnnotation` to table cells for `PdfImageReader`.
* Remove `insert_table` parameter.
* Add information about table rotation to the table metadata and about page rotation to the document metadata.
* Use [dedoc-utils](https://pypi.org/project/dedoc-utils) library for document images preprocessing.
* Change web interface, fix online-examples of document processing.
* Add comparison operator to `LineWithMeta`.

0.11.2

* Remove plexus-utils-1.1.jar.
* Update installation documentation.
* Add documentation for Tesseract OCR installation.
* Add documentation for annotations.
* Add documentation for secure torch.
* Fix examples.

0.11.1

* Add bbox annotations in `PdfTabbyReader`.
* Add bbox annotations for words in `PdfTxtlayerReader`.
* Add an option `plain_text` to the `return_format` parameter.
* Reduce size of the dedoc base image, move dockerfiles to a [separate repository](https://github.com/ispras/dedockerfiles).
* Refactor script for Tesseract benchmarking.
* Replace pinned dedoc dependency versions with version ranges.
* Add table cell properties in `PdfTabbyReader`.

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.