Unstructured

Latest version: v0.16.11

Safety actively analyzes 687918 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 27 of 34

0.4.0

* Added generic `partition` brick that detects the file type and routes a file to the appropriate
partitioning brick.
* Added a file type detection module.
* Updated `partition_html` and `partition_eml` to support file-like objects in 'rb' mode.
* Cleaning brick for removing ordered bullets `clean_ordered_bullets`.
* Extract brick method for ordered bullets `extract_ordered_bullets`.
* Test for `clean_ordered_bullets`.
* Test for `extract_ordered_bullets`.
* Added `partition_docx` for pre-processing Word Documents.
* Added new REGEX patterns to extract email header information
* Added new functions to extract header information `parse_received_data` and `partition_header`
* Added new function to parse plain text files `partition_text`
* Added new cleaners functions `extract_ip_address`, `extract_ip_address_name`, `extract_mapi_id`, `extract_datetimetz`
* Add new `Image` element and function to find embedded images `find_embedded_images`
* Added `get_directory_file_info` for summarizing information about source documents

0.3.7

Fixes

* **Correct fsspec connectors date metadata field types** - sftp, azure, box and gcs
* **Fix Kafka source connection problems**
* **Fix Azure AI Search session handling**
* **Fixes issue with SingleStore Source Connector not being available**
* **Fixes issue with SQLite Source Connector using wrong Indexer** - Caused indexer config parameter error when trying to use SQLite Source
* **Fixes issue with Snowflake Destination Connector `nan` values** - `nan` values were not properly replaced with `None`
* **Fixes Snowflake source `'SnowflakeCursor' object has no attribute 'mogrify'` error**
* **Box source connector can now use raw JSON as access token instead of file path to JSON**
* **Fix fsspec upload paths to be OS independent**
* **Properly log elasticsearch upload errors**

Enhancements

* **Kafka source connector has new field: group_id**
* **Support personal access token for confluence auth**
* **Leverage deterministic id for uploaded content**
* **Makes multiple SQL connectors (Snowflake, SingleStore, SQLite) more robust against SQL injection.**
* **Optimizes memory usage of Snowflake Destination Connector.**
* **Added Qdrant Cloud integration test**
* **Add DuckDB destination connector** Adds support storing artifacts in a local DuckDB database.
* **Add MotherDuck destination connector** Adds support storing artifacts in MotherDuck database.
* **Update weaviate v2 example**

0.3.6

Fixes

* **Fix Azure AI Search Error handling**

0.3.5

* Add support for local inference
* Add new pattern to recognize plain text dash bullets
* Add test for bullet patterns
* Fix for `partition_html` that allows for processing `div` tags that have both text and child
elements
* Add ability to extract document metadata from `.docx`, `.xlsx`, and `.jpg` files.
* Helper functions for identifying and extracting phone numbers
* Add new function `extract_attachment_info` that extracts and decodes the attachment
of an email.
* Staging brick to convert a list of `Element`s to a `pandas` dataframe.
* Add plain text functionality to `partition_email`

0.3.4

* Python-3.7 compat

0.3.3

* Removes BasicConfig from logger configuration
* Adds the `partition_email` partitioning brick
* Adds the `replace_mime_encodings` cleaning bricks
* Small fix to HTML parsing related to processing list items with sub-tags
* Add `EmailElement` data structure to store email documents

Page 27 of 34

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.