Unstructured

Latest version: v0.17.2

Page 23 of 39

0.6.11

Not secure
Enhancements

* Supports epub tests now that pandoc is updated in the base image

Features

Fixes

0.6.10

Not secure
Enhancements

* Adds XLS support to auto partition

Features

Fixes

0.6.9

Not secure
Enhancements

* The `"fast"` strategy for PDFs now keeps element bounding box data
* Refactors `setup.py`

Features

Fixes

* Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
* Adds additional MIME types for CSV
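
The encoding fallback described in the first fix can be illustrated with a minimal sketch. The function name `decode_with_fallback` and the list of candidate encodings are assumptions made for illustration; the encodings the library actually tries, and their order, are not stated in these notes.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical fallback list; not taken from the library.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    fallbacks: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the given bytes."""
    if encoding is not None:
        # The user chose an encoding, so respect it and let any error propagate.
        return data.decode(encoding), encoding

    last_error = None
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next candidate
    raise ValueError("none of the fallback encodings could decode the data") from last_error
```

Fallbacks apply only when no encoding is passed, which mirrors the condition in the fix: an explicit user choice is never silently overridden.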

0.6.8

Not secure
Enhancements

Features

* Add `partition_csv` for CSV files.

Fixes

0.6.7

Not secure
Enhancements

* Deprecate `--s3-url` in favor of `--remote-url` in CLI
* Refactor out non-connector-specific config variables
* Add `file_directory` to metadata
* Add `page_name` to metadata. Currently used for the sheet name in XLSX documents.
* Added a `--partition-strategy` parameter to unstructured-ingest so that users can specify
the partition strategy in the CLI. For example, `--partition-strategy fast`.
* Added metadata for filetype.
* Add Discord connector to pull messages from a list of channels
* Refactor `unstructured/file-utils/filetype.py` to make better use of a hashmap when returning the MIME type.
* Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for `test_filetype.py`.
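
As a rough illustration of the hashmap-based lookup mentioned in the `filetype.py` refactor, the sketch below replaces a chain of conditionals with a single dictionary. The table contents, the mapping direction, and the name `filetype_from_mime` are hypothetical and are not the library's actual code.

```python
from typing import Optional

# Hypothetical lookup table, for illustration only.
MIME_TO_FILETYPE = {
    "text/csv": "csv",
    "application/csv": "csv",
    "text/xml": "xml",
    "application/xml": "xml",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def filetype_from_mime(mime_type: str) -> Optional[str]:
    """Look up a file type label, ignoring parameters such as `; charset=...`."""
    normalized = mime_type.split(";", 1)[0].strip().lower()
    return MIME_TO_FILETYPE.get(normalized)  # None when the MIME type is unknown
```

A dictionary keeps each MIME type on one line, so adding a type (such as the extra CSV MIME types in 0.6.9) is a one-line change.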

Features

* Add `partition_xml` for XML files.
* Add `partition_xlsx` for Microsoft Excel documents.

Fixes

* Supports `hml` filetype for partition as a variation of html filetype.
* Makes `pytesseract` a function level import in `partition_pdf` so you can use the `"fast"`
or `"hi_res"` strategies if `pytesseract` is not installed. Also adds the
`required_dependencies` decorator for the `"hi_res"` and `"ocr_only"` strategies.
* Fix to ensure `filename` is tracked in metadata for `docx` tables.

0.6.6

Not secure
Enhancements

* Adds an `"auto"` strategy that chooses the partitioning strategy based on document
characteristics and function kwargs. This is the new default strategy for `partition_pdf`
and `partition_image`. Users can maintain existing behavior by explicitly setting
`strategy="hi_res"`.
* Added an additional trace logger for NLP debugging.
* Add `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
* Clean up the `filename` attribute on `ElementMetadata` to remove the full filepath.
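
The `get_date` enhancement can be pictured with a small sketch. `MetadataSketch` is a hypothetical stand-in for `ElementMetadata`, and the assumption that the datestring is ISO 8601 is not confirmed by these notes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for `ElementMetadata`; only the date field is modeled."""

    date: Optional[str] = None  # assumed to be an ISO 8601 string

    def get_date(self) -> Optional[datetime]:
        """Convert the stored datestring to a `datetime`, or return None if unset."""
        if self.date is None:
            return None
        return datetime.fromisoformat(self.date)
```

Keeping the stored value as a string and converting on demand leaves the metadata easy to serialize while still giving callers a `datetime` when they need one.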

Features

* Added table reading as HTML, with URL parsing, to `partition_docx`
* Added a `text_as_html` metadata field for docx files

Fixes

* The `fileutils/file_type` check for JSON and EML files now ignores decode errors
* `partition_email` was updated to more flexibly handle deviations from the RFC-2822 standard.
The time in the metadata returns `None` if the time does not match RFC-2822 at all.
* Include all metadata fields when converting to dataframe or CSV
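
The `None` fallback described for `partition_email` can be sketched with the standard library alone. The helper name `parse_email_date` is hypothetical, and this is not the library's implementation; it only shows the "return `None` when the value does not match RFC-2822" behavior.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[datetime]:
    """Parse an RFC-2822 date header, returning None when it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError and newer ones raise ValueError
        # for unparseable input, so both are caught here.
        return None
```

For example, `parse_email_date("Mon, 01 May 2023 12:30:00 +0000")` yields a timezone-aware `datetime`, while `parse_email_date("not a date")` yields `None`.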

