Unstructured

Latest version: v0.16.11

Page 14 of 34

0.10.13

Enhancements

* Updated documentation: Added back supported document types for partitioning, more Python code examples in the API page, RAG definition, and use case.
* Updated Hi-Res Metadata: PDFs and images processed with the Hi-Res strategy now have layout model class probabilities added to metadata.
* Updated the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of file when it is not a zip file.
* Tesseract minor version bump to 5.3.2

Features

* Add Jira Connector to be able to pull issues from a Jira organization
* Add `clean_ligatures` function to expand ligatures in text
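To illustrate what expanding ligatures means, here is a minimal sketch using only the standard library. The name `expand_ligatures` and the mapping table are assumptions made for illustration; this is not the library's `clean_ligatures` implementation, which may cover a different set of characters.

```python
# Illustrative mapping of single-codepoint Latin ligatures to their
# expanded letter sequences (assumed subset, not the library's table).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter expansion."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return text
```

Text extracted from PDFs often contains such characters, which can break exact-match search and tokenization until they are expanded.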

Fixes

* Fix `partition_html` breaking on `<br>` elements.
* Ingest error handling to properly raise errors when wrapped
* GH issue 1361: fixes a sorting error that prevented some PDFs from being parsed
* Bump unstructured-inference
* Brings back embedded images in PDFs (0.5.23)

0.10.12

Enhancements

* Removed PIL pin as issue has been resolved upstream
* Bump unstructured-inference
* Support for yolox_quantized layout detection model (0.5.20)
* YoloX element types added

Features

* Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead

Fixes

* Bump unstructured-inference
* Avoid divide-by-zero errors with `safe_division` (0.5.21)
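A guarded division of the kind named above can be sketched as follows. The function name `safe_division_sketch`, its signature, and the `epsilon` default are assumptions for illustration; the actual `safe_division` helper in unstructured-inference may differ.

```python
def safe_division_sketch(
    numerator: float, denominator: float, epsilon: float = 1e-8
) -> float:
    """Divide, clamping the denominator up to `epsilon`.

    Intended for non-negative denominators such as bounding-box areas,
    where a zero-area box would otherwise raise ZeroDivisionError.
    """
    return numerator / max(denominator, epsilon)
```

Clamping keeps ratios such as overlap-over-area finite for degenerate boxes, at the cost of returning a very large value instead of raising an error.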

0.10.11

Enhancements

* Bump unstructured-inference
* Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)

Features

* Add an S3 writer to the ingest CLI

Fixes

* Fix a bug where `xy-cut` sorting attempts to sort elements without valid coordinates; now `xy-cut` sorting only runs when **all** elements have valid coordinates
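The guard described in this fix can be sketched as below. `PositionedElement` and `sort_if_all_have_coordinates` are hypothetical names, and a plain top-to-bottom, left-to-right sort stands in for the real XY-cut algorithm, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PositionedElement:
    """Hypothetical stand-in for a document element."""

    text: str
    # Top-left corner as (x, y); None when no valid coordinates exist.
    top_left: Optional[Tuple[float, float]] = None


def sort_if_all_have_coordinates(
    elements: List[PositionedElement],
) -> List[PositionedElement]:
    """Sort spatially only when every element has coordinates."""
    if not all(e.top_left is not None for e in elements):
        # At least one element lacks coordinates: keep the input order.
        return list(elements)
    # Stand-in for XY-cut: order by y first, then by x.
    return sorted(elements, key=lambda e: (e.top_left[1], e.top_left[0]))
```

Falling back to the original order avoids comparing against missing coordinates while leaving documents with partial layout information usable.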

0.10.10

Enhancements

* Adds `text` as an input parameter to `partition_xml`.
* `partition_xml` no longer runs through `partition_text`, avoiding incorrect splitting
on carriage returns in the XML. Since `partition_xml` no longer calls `partition_text`,
`min_partition` and `max_partition` are no longer supported in `partition_xml`.
* Bump `unstructured-inference==0.5.18`, change non-default detectron2 classification threshold
* Upgrade base image from rockylinux 8 to rockylinux 9
* Serialize IngestDocs to JSON when passing to subprocesses
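The `partition_xml` entries above describe accepting XML as a string and no longer splitting element text on line breaks. The sketch below illustrates that idea with the standard library only; `leaf_texts_from_xml` is a hypothetical helper, not how `partition_xml` is implemented.

```python
import xml.etree.ElementTree as ET
from typing import List


def leaf_texts_from_xml(text: str) -> List[str]:
    """Return the text of each leaf element as one unsplit string."""
    root = ET.fromstring(text)
    results = []
    for node in root.iter():
        # A leaf has no child elements; skip empty or whitespace-only text.
        if len(node) == 0 and node.text and node.text.strip():
            results.append(node.text.strip())
    return results
```

Because each leaf is kept whole, a line break inside an element does not produce two separate pieces of text.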

Features

Fixes

* Fix a bug where mismatched `elements` and `bboxes` are passed into `add_pytesseract_bbox_to_elements`

0.10.9

Enhancements

* Fix `test_json` to handle only non-extra dependencies file types (plain-text)

Features

* Adds `chunk_by_title` to break a document into sections based on the presence of `Title`
elements.
* Add new extraction function `extract_image_urls_from_html` to extract all image-related URLs from HTML text.
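The sectioning idea behind `chunk_by_title` can be sketched as follows. `DocElement` and `chunk_by_title_sketch` are hypothetical names for illustration; the library function operates on its own element classes and may apply additional rules not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocElement:
    """Hypothetical stand-in for a partitioned document element."""

    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_title_sketch(elements: List[DocElement]) -> List[List[DocElement]]:
    """Start a new section at every Title element."""
    sections: List[List[DocElement]] = []
    for element in elements:
        # Open a new section on a Title, or for leading non-title content.
        if element.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections
```

Each returned section begins with its `Title` element (when one exists), followed by the elements that appeared under it.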

Fixes

* Make cv2 dependency optional
* Edit `add_pytesseract_bbox_to_elements`'s (`ocr_only` strategy) `metadata.coordinates.points` return type to `Tuple` for consistency.
* Re-enable test-ingest-confluence-diff for ingest tests
* Fix syntax for ingest test check number of files
* Fix CSV and TSV partitioners losing the first line of the file when creating elements

0.10.8

Enhancements

* Release docker image that installs Python 3.10 rather than 3.8

Features

Fixes

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.