Dedoc

Latest version: v2.3.2


Page 3 of 4

2.1

* Custom loggers removed; a common logger is now used for all dedoc classes.
* Do not change the document image if it has a correct orientation (orientation correction function changed).
* Use only `PdfTabbyReader` during detection of a textual layer in PDF files.
* Code related to the labeling mode refactored and removed from the library package (it is located in the separate directory).
* Added `BoldAnnotation` for words in `PdfImageReader`.
* Added more benchmarks: parsing of table images and postprocessing of Tesseract OCR output.
* Fixed several issues in the Dedoc web form.
* Added a tutorial on how to add a new structure type to Dedoc.
* Parsing of EML and HTML files fixed.

2.0

* Fix table extraction from `PDF` using empty config (see [issue](https://github.com/ispras/dedoc/issues/373))
* Add more benchmarks for Tesseract
* Fix extension extraction for file names with several dots
* Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors).
See the `Package reference` section of the [documentation](https://dedoc.readthedocs.io) for details.
* Add `AttachAnnotation` and `TableAnnotation` to `PPTX` (see [discussion](https://github.com/ispras/dedoc/discussions/386))
* Fix bugs in `DOCX` handling (see issues [378](https://github.com/ispras/dedoc/issues/378), [379](https://github.com/ispras/dedoc/issues/379))
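
The extension fix for file names with several dots is easy to get wrong, so a small illustration may help. The sketch below is not dedoc's implementation; it only uses the standard `pathlib` module, and the helper names `split_name_and_extension` and `compound_extension` are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def split_name_and_extension(file_name: str) -> Tuple[str, str]:
    """Split a file name into (stem, extension).

    Only the last suffix is treated as the extension, so earlier dots
    stay in the stem. Hypothetical helper, not part of dedoc.
    """
    path = Path(file_name)
    return path.stem, path.suffix.lower()


def compound_extension(file_name: str) -> str:
    """Join all suffixes, which is useful for names like 'archive.tar.gz'."""
    return "".join(Path(file_name).suffixes).lower()


# A name with several dots keeps everything before the last dot as the stem.
name, extension = split_name_and_extension("report.v2.final.PDF")
print(name, extension)

# Compound extensions need all suffixes, not just the last one.
print(compound_extension("archive.tar.gz"))
```

Which of the two behaviours is appropriate depends on the format: a naive split on the first dot would report `.v2.final.pdf` for the first name, while taking only the last suffix would lose the `.tar` part of the second.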

1.1.1

* Use older `pydantic` version for improving compatibility with other libraries.
* Add support for `RTF` format.
* Fix bug in handling file names with dots and spaces.
* Fix bug in non-integer values of text formatting in `DocxReader`.
* Add support for the `on_gpu` parameter in `config`.
* Add attached images extraction for `PdfTabbyReader`.
* Fix partial file reading for `PdfTabbyReader`.
* Add a tutorial on how to create dedoc's basic data structures.
* Fix `attachments_dir` parameter for readers and attachments extractors.

1.1.0

* Add `BBoxAnnotation` to table cells for `PdfTabbyReader`.
* Fix swagger, add api schema classes, remove `to_dict` method from `ParsedDocument`.
* Improve parsing PDF by `PdfTxtlayerReader`, add benchmarks.
* Fix `BBoxAnnotation` extraction for tables in `PdfImageReader` using `table_type=split_last_column` parameter.
* Change base method of metadata extractors, rename it to `extract_metadata`.
* Unify `BBoxAnnotation` extraction for all PDF readers: return only word bounding boxes.
* Increase timeout value for all converters.

1.0

* Remove `is_one_column_document_list` parameter.
* Add a tutorial on supporting a new document type to the documentation.
* Improve textual layer correctness classifier.
* Improve orientation and columns classifier.
* Change the table output structure: cells are now `CellWithMeta` objects instead of plain text strings.
* Add `BBoxAnnotation` to table cells for `PdfTxtlayerReader` and `PdfImageReader`.
* Add `ConfidenceAnnotation` to table cells for `PdfImageReader`.
* Remove `insert_table` parameter.
* Added information about table and page rotation to the table and document metadata respectively.
* Use [dedoc-utils](https://pypi.org/project/dedoc-utils) library for document images preprocessing.
* Change web interface, fix online-examples of document processing.
* Add comparison operator to `LineWithMeta`.
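
To illustrate what moving from plain strings to cells with metadata means for calling code, here is a minimal sketch. The `Cell` and `Annotation` classes and all of their fields are hypothetical stand-ins written for this example; they are not dedoc's `CellWithMeta` or its annotation classes, whose actual attributes are described in the package reference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Hypothetical annotation: a named value attached to a character span."""
    start: int
    end: int
    name: str
    value: str


@dataclass
class Cell:
    """Hypothetical table cell that carries metadata alongside its text."""
    text: str
    rowspan: int = 1
    colspan: int = 1
    annotations: List[Annotation] = field(default_factory=list)


def table_to_text(rows: List[List[Cell]]) -> List[List[str]]:
    """Recover the older 'table of strings' view from cells with metadata."""
    return [[cell.text for cell in row] for row in rows]


# One row with two cells; the first carries a confidence-style annotation.
rows = [[
    Cell(text="Total", annotations=[Annotation(0, 5, "confidence", "0.97")]),
    Cell(text="42"),
]]
print(table_to_text(rows))
```

Code that previously consumed cells as strings would need a conversion step of this kind, while new code can read per-cell annotations directly.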

0.11.2

* Remove plexus-utils-1.1.jar.
* Update installation documentation.
* Add documentation for Tesseract OCR installation.
* Add documentation for annotations.
* Add documentation for secure torch.
* Fix examples.


Dedoc

Page 3 of 4

2.1

2.0

1.1.1

1.1.0

1.0

0.11.2

Page 3 of 4

Links

Releases