* Fix bug with bold lines in `DocxReader` (see [issue 479](https://github.com/ispras/dedoc/issues/479)).
* Upgraded requirements.txt (`beautifulsoup4` to version 4.12.3).
* Added support for external grobid (added support for the `Authorization` parameter).
* Added GOST (Russian government standard) frame recognition in `PdfTabbyReader` (`need_gost_frame_analysis` parameter).
* Updated documentation (added GOST frame recognition).
* Added multi-page table handling to `PdfTabbyReader`.
2.3
* [Dedoc telegram chat](https://t.me/dedoc_chat) created.
* Added `patterns` parameter for configuring default structure type.
* Added notebooks with Dedoc usage (see [issue 484](https://github.com/ispras/dedoc/issues/484)).
* Fix bug `OutOfMemoryError: Java heap space` in `PdfTabbyReader` (see [issue 489](https://github.com/ispras/dedoc/issues/489)).
* Fix bug with numbering in `DocxReader` (see [issue 494](https://github.com/ispras/dedoc/issues/494)).
* Added GOST (Russian government standard) frame recognition in `PdfImageReader` and `PdfTxtlayerReader` (`need_gost_frame_analysis` parameter).
2.2.7
* Fix bugs with the `start` and `end` values of `BBoxAnnotation` in `PdfTabbyReader`.
* Improve column classification and orientation detection for PDF and images (`is_one_column_document` and `document_orientation` parameters).
* Upgrade `docker`: `docker-compose` is no longer supported, use `docker compose` instead.
* Fix bug in table parsing in `DocxReader` (see [issue](https://github.com/ispras/dedoc/issues/478)).
* Added simple textual layer detection in `PdfAutoReader` (`fast_textual_layer_detection` parameter).
* Improve paragraph extraction from PDF documents and images.
* Retrain the classifier for diplomas (`document_type="diploma"`) on a new dataset.
2.2.6
* Upgrade dependencies: `numpy<2.0` and `dedoc-utils==0.3.7`.
2.2.5
* Added internal functions and classes to support integration of Dedoc into [langchain](https://github.com/langchain-ai/langchain).
* Upgrade some dependencies, in particular, `xgboost>=1.6.0`, `pandas`, `pdfminer.six`.
2.2.4
* Show page division and page numbers in the HTML output representation (API usage, `return_format="html"`).
* Make imports from dedoc library faster.
* Added a tutorial on how to add a new language to dedoc (not yet complete).
* Added additional `page_id` metadata for multi-page nodes (`structure_type="tree"` in API, `TreeConstructor` in the library).
* Updated OCR and orientation/columns classification benchmarks.
* Minor edits of `README.md`.
* Fixed empty cells handling in `CSVReader`.
* Fixed bounding boxes extraction for text in tables for `PdfTabbyReader`.