Unstructured

Latest version: v0.13.7

0.13.7

Enhancements

* **Remove `page_number` metadata fields** for HTML partition until we have a better strategy to decide page counting.
* **Extract OCRAgent.get_agent().** Generalize access to the configured OCRAgent instance beyond its use for PDFs.
* **Add calculation of table related metrics which take into account colspans and rowspans**

Features

* **Add ability to get ratio of `cid` characters in embedded text extracted by `pdfminer`**.
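
As a rough illustration of what such a ratio measures: `pdfminer` emits placeholders of the form `(cid:N)` when it cannot map a glyph to a Unicode character. The sketch below is not the library's implementation; the function name `cid_ratio` and the choice to count each placeholder as a single character are assumptions made for this example.

```python
import re

# pdfminer writes "(cid:123)"-style placeholders for glyphs it cannot decode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the share of characters in `text` that are cid placeholders.

    Assumption for this sketch: each "(cid:N)" token counts as one character.
    """
    cid_count = len(CID_PATTERN.findall(text))
    decoded_count = len(CID_PATTERN.sub("", text))
    total = cid_count + decoded_count
    return cid_count / total if total else 0.0


ratio = cid_ratio("ab(cid:12)c")  # one placeholder among four characters
```

A high ratio suggests the embedded text layer is largely undecodable, in which case OCR may give better results than the extracted text.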

Fixes

* **`partition_docx()` handles short table rows.** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.
* **Remedy macOS test failure not triggered by CI.** Generalize temp-file detection beyond hard-coded Linux-specific prefix.
* **Remove unnecessary warning log for using default layout model.**
* **Add chunking to `partition_tsv()`.** Even though `partition_tsv()` produces a single Table element, chunking is made available because the Table element is often larger than the desired chunk size and must be divided into smaller chunks.
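
To illustrate why a single oversized Table element needs dividing, here is a minimal, library-independent sketch of splitting text into chunks of bounded size. The helper `split_text` is hypothetical and is not how `unstructured` chunks tables internally; only the parameter name `max_characters` is borrowed from the library's chunking options.

```python
def split_text(text: str, max_characters: int) -> list[str]:
    """Split `text` into chunks no longer than `max_characters`.

    Breaks on the last space inside the window when one exists,
    otherwise falls back to a hard cut at the size limit.
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        # Look for a space at or before the size limit.
        split_at = remaining.rfind(" ", 0, max_characters + 1)
        if split_at <= 0:
            split_at = max_characters  # no usable space: hard cut
        chunks.append(remaining[:split_at].rstrip())
        remaining = remaining[split_at:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


table_text = "a b c d e f"
chunks = split_text(table_text, max_characters=5)
```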

0.13.6

Enhancements

Features

Fixes

* **ValueError: Invalid file (FileType.UNK) when parsing Content-Type header with charset directive** URL response Content-Type headers are now parsed according to RFC 9110.
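
The underlying issue is that a header such as `text/html; charset=utf-8` carries parameters after the media type, so it cannot be compared directly against a bare MIME type. The following sketch shows one way to separate the two using Python's standard library. It is an illustration only, not the code used in `unstructured`, and the function name `mime_type_from_header` is an assumption.

```python
from email.message import Message


def mime_type_from_header(content_type: str) -> str:
    """Return the lowercased media type, ignoring parameters such as charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type()


mime_type = mime_type_from_header("text/html; charset=UTF-8")
```

With the parameters stripped, the media type can be matched against known file types regardless of which charset the server declares.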

0.13.5

Enhancements

Features

Fixes

* **KeyError raised when updating parent_id** In the past, combining `ListItem` elements could result in reusing the same memory location, which then led to unexpected side effects when updating element IDs.
* **Bump unstructured-inference==0.7.29**: Table transformer predictions are now removed if their confidence is below the threshold.

0.13.4

Enhancements

* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning
function are now deterministic and unique at the document level by default. Before, hashes were
based only on text; however, they now also take into account the element's sequence number on a
page, the page's number in the document, and the document's file name.
* **Enable remote chunking via unstructured-ingest** Chunking using unstructured-ingest was
previously limited to local chunking using the strategies `basic` and `by_title`. Remote chunking
options via the API are now accessible.
* **Save table in cells format**. `UnstructuredTableTransformerModel` can now return the predicted table in cells format.
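
The deterministic element IDs described in the first item above can be pictured with a short sketch. This is not the hashing scheme `unstructured` actually uses; the function name, the field separator, and the 32-character truncation are all assumptions chosen for illustration. The point is only that including position and file name makes identical text in different places hash differently, while repeated runs give the same ID.

```python
import hashlib


def element_hash_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Derive a repeatable ID from an element's text and its position in the document."""
    # A separator keeps adjacent fields from running together ("1"+"12" vs "11"+"2").
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


first = element_hash_id("Total", sequence_number=0, page_number=1, filename="report.pdf")
second = element_hash_id("Total", sequence_number=3, page_number=1, filename="report.pdf")
```

Here `first` and `second` share the same text but differ in sequence number, so their IDs differ. A text-only hash would have produced the same ID for both.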

Features

* **Add a `PDF_ANNOTATION_THRESHOLD` environment variable to control the capture of embedded links in `partition_pdf()` for `fast` strategy**.
* **Add integration with the Google Cloud Vision API**. Adds a third OCR provider, alongside Tesseract and Paddle: the Google Cloud Vision API.

Fixes

* **Remove `ElementMetadata.section` field.** This field was unused and not populated by any partitioner.

0.13.3

Enhancements

* **Remove duplicate image elements**. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
* **Add support for `start_index` in `html` links extraction**
* **Add `strategy` arg value to `_PptxPartitionerOptions`.** This makes this partitioning option available for sub-partitioners to come that may optionally use inference or other expensive operations to improve the partitioning.
* **Support pluggable sub-partitioner for PPTX Picture shapes.** Use a distinct sub-partitioner for partitioning PPTX Picture (image) shapes and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.
* **Introduce `starting_page_number` parameter to partitioning functions** It applies to those partitioners that support `page_number` in element metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
* **Redesign the internal mechanism of assigning element IDs** This allows for further enhancements related to element IDs such as deterministic and document-unique hashes. The way partitioning functions operate hasn't changed, which means `unique_element_ids` continues to be `False` by default, utilizing text hashes.
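
The effect of `starting_page_number` can be sketched without the library: page numbers are offset so that numbering begins at the given value instead of 1. The helper `number_pages` below is hypothetical and only illustrates the arithmetic, which is useful when a document has been split into parts and each part is partitioned separately.

```python
def number_pages(page_texts: list[str], starting_page_number: int = 1) -> list[tuple[int, str]]:
    """Pair each page's text with a page number that starts at `starting_page_number`."""
    return [
        (starting_page_number + offset, text)
        for offset, text in enumerate(page_texts)
    ]


# Second half of a document whose first half ended on page 4.
numbered = number_pages(["page five text", "page six text"], starting_page_number=5)
```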

Features

Fixes

* **Add support for extracting text from tag tails in HTML**. This fix adds the ability to generate separate elements using tag tails.
* **Add support for extracting text from `<b>` tags in HTML** Now `partition_html()` can extract text from `<b>` tags inside container tags (like `<div>`, `<pre>`).
* **Fix pip-compile make target** Add the `base.in` dependency that was missing from the requirements Makefile.
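
For readers unfamiliar with the term, a tag's "tail" is the text that follows its closing tag but still belongs to the parent element. The sketch below uses the standard library's `xml.etree.ElementTree`, which exposes the same `.text` and `.tail` attributes as lxml, on a small well-formed snippet. It is a simplified illustration, not `partition_html()` itself, and the function name `iter_text_fragments` is an assumption.

```python
import xml.etree.ElementTree as ET


def iter_text_fragments(element):
    """Yield text in document order: element text, child text, then each child's tail."""
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from iter_text_fragments(child)
        # The tail is the text after the child's closing tag.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


root = ET.fromstring("<div>Lead text <b>bold text</b> tail text</div>")
fragments = list(iter_text_fragments(root))
```

A parser that reads only `.text` would miss both the `<b>` content and the trailing "tail text" in this snippet.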

0.13.2

Enhancements

Features

Fixes

* **Brings back missing word list files** that caused `partition` failures in 0.13.1.
