* Adds functionality to sort elements in `partition_pdf` for `fast` strategy
* Adds ingest tests with `--fast` strategy on PDF documents
* Adds `--api-key` to `unstructured-ingest`
Features
* Adds `partition_rst` for processing reStructuredText documents.
Fixes
* Adds handling for emails that do not have a datetime to extract.
* Adds pdf2image package as core requirement of unstructured (with no extras)
0.7.4
Enhancements
* Allows passing kwargs to request data field for `partition_via_api` and `partition_multiple_via_api`
* Enable MIME type detection if libmagic is not available
* Adds handling for empty files in `detect_filetype` and `partition`.
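The libmagic fallback noted above can be illustrated with a minimal sketch. This is not the library's implementation. It assumes the `python-magic` package (whose import fails when libmagic is missing) and falls back to extension-based guessing with the standard-library `mimetypes` module. The names `guess_mime_from_name` and `detect_mime_type` are hypothetical.

```python
import mimetypes
from typing import Optional

try:
    # Assumption: python-magic, which raises ImportError if libmagic is absent.
    import magic
    LIBMAGIC_AVAILABLE = True
except ImportError:
    LIBMAGIC_AVAILABLE = False


def guess_mime_from_name(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone (no file access)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def detect_mime_type(filename: str) -> Optional[str]:
    """Prefer content-based detection; fall back to the extension."""
    if LIBMAGIC_AVAILABLE:
        # Content-based detection: reads the file's leading bytes.
        return magic.from_file(filename, mime=True)
    return guess_mime_from_name(filename)
```

Extension-based guessing is less reliable than inspecting file contents, so a fallback like this trades accuracy for not requiring a system library.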
Features
Fixes
* Resolve `grpcio` import issue on `weaviate.schema.validate_schema` for python 3.9 and 3.10
* Remove building `detectron2` from source in Dockerfile
0.7.3
Enhancements
* Update IngestDoc abstractions and add data source metadata in ElementMetadata
Features
Fixes
* Pass `strategy` parameter down from `partition` for `partition_image`
* Fixes filetype detection when a CSV has a `text/plain` MIME type
* `convert_office_doc` no longer prints file conversion info messages to stdout.
* `partition_via_api` reflects the actual filetype for the file processed in the API.
0.7.2
Enhancements
* Adds an optional encoding kwarg to `elements_to_json` and `elements_from_json`
* Bump version of base image to use new stable version of tesseract
Features
Fixes
* Update the `read_txt_file` utility function to keep using `spooled_to_bytes_io_if_needed` for xml
* Add functionality to the `read_txt_file` utility function to handle file-like objects from a URL
* Remove the unused parameter `encoding` from `partition_pdf`
* Change auto.py to have a `None` default for encoding
* Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
* Adds benchmark test with test docs in example-docs
* Re-enable test_upload_label_studio_data_with_sdk
* File detection now detects code files as plain text
* Adds `tabulate` explicitly to dependencies
* Fixes an issue in `metadata.page_number` of pptx files
* Adds showing help if no parameters are passed
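The encoding fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: the function name `decode_with_fallback` and the candidate list are hypothetical, and the encodings the library actually tries are not stated here.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list, tried in order when no encoding is given.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    candidates: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Return (text, encoding_used).

    If the caller names an encoding, use it and let any decode error
    propagate. Only when no encoding is given are the candidates tried.
    """
    if encoding is not None:
        return data.decode(encoding), encoding

    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # try the next candidate

    raise ValueError("could not decode data with any candidate encoding")
```

Respecting an explicit `encoding` and failing loudly keeps the fallback from silently overriding a choice the user made.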
0.7.1
Enhancements
Features
* Add `stage_for_weaviate` to stage `unstructured` outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
* Builds from the Unstructured base image, built on Rocky Linux 8.7, which resolves almost all CVEs in the image.
Fixes
0.7.0
Enhancements
* Installing `detectron2` from source is no longer required when using the `local-inference` extra.
* Updates `.pptx` parsing to include text in tables.
Features
Fixes
* Fixes an issue in `_add_element_metadata` that caused all elements to have `page_number=1` in the element metadata.
* Adds `.log` as a file extension for TXT files.
* Adds functionality to try other common encodings for email (`.eml`) files if an error related to the encoding is raised and the user has not specified an encoding.
* Allow passed encoding to be used in `replace_mime_encodings`
* Fixes page metadata for `partition_html` when `include_metadata=False`
* A `ValueError` is now raised if `file_filename` is not specified when you use `partition_via_api` with a file-like object.