Unstructured

Latest version: v0.17.2


Page 18 of 39

0.10.5

Not secure

Enhancements

* Create new CI Pipelines
- Checking text, xml, email, and html doc tests against the library installed without extras
- Checking each library extra against their respective tests
* `partition` raises an error and tells the user to install the appropriate extra if a filetype
is detected that is missing dependencies.
* Add custom errors to ingest
* Bump `unstructured-inference==0.5.15`
- Handle an uncaught TesseractError (0.5.15)
- Add TIFF test file and TIFF filetype to `test_from_image_file` in `test_layout` (0.5.14)
* Use `entire_page` OCR mode for PDFs and images
* Add notes on extra installs to docs
* Adds ability to reuse connections per process in unstructured-ingest
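
The missing-extra check mentioned in the list above can be illustrated with a minimal sketch. This is not the library's implementation: `require_extra` is a hypothetical helper, the extra names are illustrative, and the install hint assumes extras are installed as `unstructured[<extra>]`.

```python
import importlib.util


def require_extra(module_name: str, extra: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing.

    Hypothetical helper for illustration only; not part of unstructured.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Missing dependency '{module_name}'. "
            f'Install it with: pip install "unstructured[{extra}]"'
        )


# A standard-library module is always present, so this call passes silently.
require_extra("json", "all-docs")

message = ""
try:
    # Assumed-absent module name, used only to trigger the error path.
    require_extra("module_that_is_not_installed_xyz", "pdf")
except ImportError as err:
    message = str(err)
```

Checking for the dependency before partitioning lets the error name the extra to install, rather than surfacing a bare import failure from deep inside a file-type handler.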

Features

* Add delta table connector

Fixes

0.10.4

Not secure

Enhancements

* Pass `ocr_mode` in `partition_pdf` and set the default back to individual pages for now
* Add diagrams and descriptions for ingest design in the ingest README

Features

* Supports multipage TIFF image partitioning

Fixes

0.10.2

Not secure

Enhancements

* Bump `unstructured-inference==0.5.13`:
- Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.

Features

Fixes

0.10.1

Not secure

Enhancements

* Bump `unstructured-inference==0.5.12`:
- Fix to avoid trace for certain PDFs (0.5.12)
- Better defaults for DPI for hi_res and Chipper (0.5.11)
- Implement full-page OCR (0.5.10)

Features

Fixes

* Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)
* Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)

0.10.0

Not secure

Enhancements

* Add `include_header` kwarg to `partition_xlsx` and change default behavior to `True`
* Update the `links` and `emphasized_texts` metadata fields

Features

Fixes

0.9.3

Not secure

Enhancements

* Pinned dependency cleanup.
* Update `partition_csv` to always use `soupparser_fromstring` to parse `html text`
* Update `partition_tsv` to always use `soupparser_fromstring` to parse `html text`
* Add `metadata.section` to capture epub table of contents data
* Add `unique_element_ids` kwarg to partition functions. If `True`, will use a UUID
for element IDs instead of a SHA-256 hash.
* Update `partition_xlsx` to always use `soupparser_fromstring` to parse `html text`
* Add functionality to switch `html` text parser based on whether the `html` text contains emoji
* Add functionality to check if a string contains any emoji characters
* Add CI tests around Notion
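
The difference between the two ID schemes behind `unique_element_ids` can be sketched as follows. `make_element_id` is a hypothetical helper written for illustration; the inputs the library actually hashes, and the exact ID format it emits, may differ.

```python
import hashlib
import uuid


def make_element_id(text: str, unique_element_ids: bool = False) -> str:
    """Return an ID for an element's text.

    Hypothetical helper: a random UUID when unique_element_ids is True,
    otherwise a deterministic SHA-256 hex digest of the text.
    """
    if unique_element_ids:
        return str(uuid.uuid4())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hash-based IDs are deterministic: identical text yields identical IDs.
hashed_a = make_element_id("Hello world")
hashed_b = make_element_id("Hello world")

# UUID-based IDs are random: identical text yields distinct IDs.
unique_a = make_element_id("Hello world", unique_element_ids=True)
unique_b = make_element_id("Hello world", unique_element_ids=True)
```

Hash-based IDs are reproducible across runs but collide when the same text appears more than once; UUIDs avoid that collision at the cost of reproducibility.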

Features

* Add Airtable Connector to be able to pull views/tables/bases from an Airtable organization

Fixes

* Fix PDF partition of list items being detected as titles in OCR-only mode
* Make notion module discoverable
* Fix emails with `Content-Disposition: inline` and `Content-Disposition: attachment` with no filename
* Fix email attachment filenames which had `=` in the filename itself
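
Why a `=` inside an attachment filename is easy to mishandle can be shown with the standard-library `email` package. This is an illustrative sketch, not the library's fix; the message and filename below are made up.

```python
from email import message_from_string

# Made-up message whose attachment filename contains "=".
raw = (
    "Content-Type: text/plain\n"
    'Content-Disposition: attachment; filename="report=final.txt"\n'
    "\n"
    "body\n"
)
msg = message_from_string(raw)
header = msg["Content-Disposition"]

# Naive parsing: splitting on every "=" cuts the filename short.
naive = header.split("=")[1].strip('"')

# The standard-library parser splits each parameter on the first "=" only.
parsed = msg.get_filename()
```

Here `naive` keeps only the part of the filename before the second `=`, while `parsed` preserves the full name.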
