Unstructured

Latest version: v0.16.11

Page 4 of 34

0.15.8

Enhancements

* **Bump unstructured.paddleocr to 2.8.1.0.**

Features

* **Add MixedbreadAI embedder** Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

* **Replace `pillow-heif` with `pi-heif`**. Replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the wheel for `pi-heif`.
* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
* **Fall back to filename extension-based file-type detection for unidentified OLE files.** Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
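
The extension-based fallback described in the last fix can be sketched roughly as follows. This is an illustration only, not the library's implementation: `detect_ole_subtype`, `EXTENSION_TYPES`, and the `sniffed` parameter are hypothetical names, and the only fact taken as given is the standard 8-byte OLE compound-file signature.

```python
from pathlib import Path
from typing import Optional

# Standard signature at the start of every OLE compound file (DOC, XLS, PPT, MSG).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical mapping used only when content sniffing cannot decide.
EXTENSION_TYPES = {".doc": "doc", ".xls": "xls", ".ppt": "ppt", ".msg": "msg"}


def detect_ole_subtype(
    head: bytes, filename: str, sniffed: Optional[str] = None
) -> Optional[str]:
    """Return a file-type label for an OLE container, or None if undetermined.

    `sniffed` stands in for whatever a content-based detector reported;
    None means the detector could not identify the file.
    """
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE container at all
    if sniffed is not None:
        return sniffed  # trust content-based detection when it succeeds
    # Fall back to the filename extension instead of guessing a default type.
    return EXTENSION_TYPES.get(Path(filename).suffix.lower())
```

The point of the design is that an unidentified OLE file is never assigned a default type such as MSG; the extension decides, and an unknown extension yields `None`.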

0.15.7

Enhancements

Features

Fixes

* **Fix NLTK data download path to prevent nested directories**. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
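
One way to avoid the nested directory described above is to append `nltk_data` only when the given path does not already end with it. The sketch below assumes that approach; `resolve_nltk_data_dir` is a hypothetical helper, not a function from `unstructured` or `nltk`.

```python
from pathlib import Path


def resolve_nltk_data_dir(base: str) -> Path:
    """Return the directory NLTK data should live in, without nesting.

    Hypothetical helper: if `base` already points at an "nltk_data"
    directory, use it as-is rather than creating "nltk_data/nltk_data".
    """
    path = Path(base)
    return path if path.name == "nltk_data" else path / "nltk_data"
```

With this rule, both a parent directory and an existing `nltk_data` directory resolve to the same location, so checks for existing downloads look in one place.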

0.15.6

Enhancements

Features

Fixes

* **Bump to NLTK 3.9.x** Bumps to the latest `nltk` version to resolve a CVE.
* **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
* **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.
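
The row-boundary behavior in the `TableChunk` fix can be illustrated with a small, simplified sketch. It is not the library's chunker: `split_table`, `row_text`, and `rows_to_html` are hypothetical names, the table is modeled as a plain list of rows, and a single row longer than the window is emitted as one oversized chunk instead of being split further.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def row_text(row: Row) -> str:
    """Plain text of one row: non-empty cells joined by spaces."""
    return " ".join(cell for cell in row if cell)


def rows_to_html(rows: List[Row]) -> str:
    """Minimal, always well-formed HTML for a group of whole rows."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) pairs, each built from the same whole rows."""
    batch: List[Row] = []
    size = 0
    for row in rows:
        added = len(row_text(row)) + (1 if batch else 0)  # +1 for the newline
        if batch and size + added > max_chars:
            # Close the current chunk on a row boundary.
            yield "\n".join(row_text(r) for r in batch), rows_to_html(batch)
            batch, size = [], 0
            added = len(row_text(row))
        batch.append(row)
        size += added
    if batch:
        yield "\n".join(row_text(r) for r in batch), rows_to_html(batch)
```

Because the text and the HTML of each chunk are generated from the same list of rows, they cannot drift apart, and every HTML fragment is a complete `<table>`.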

0.15.5

Enhancements

Features

Fixes

* **Revert to using `unstructured.pytesseract` fork**. Due to the unavailability of some recent release versions of `pytesseract` on PyPI, the project now uses the `unstructured.pytesseract` fork to ensure stability and continued support.
* **Bump `libreoffice` version in image.** Bumps the `libreoffice` version to `25.2.5.2` to address CVEs.
* **Downgrade NLTK dependency version for compatibility**. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.

0.15.4

Enhancements

Features

Fixes

* **Resolve an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`.**

0.15.3

Enhancements

Features

Fixes

* **Remove the custom index URL from `extra-paddleocr.in` to resolve the error in the `setup.py` configuration.**

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.