What's Changed
- Improve HTML linearization
- Add HTML table linearization format that uses merged cells information for `colspan` and `rowspan`
- Add prefix and suffix for `LAYOUT_FOOTER` and `LAYOUT_ENTITY`
- Add `<html><body>...</body></html>` to the output when calling `Document.to_html()`
- Use `pypdfium2` for PDF rasterization when available instead of `pdf2image`. `pypdfium2` has no dependency on OS libraries, so it is more portable and should work out of the box with Lambda and SageMaker.
- Fix expenses with no summary fields
- Replace the region mismatch exception with an invalid S3 object exception
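
The "use `pypdfium2` when available" behaviour noted above can be illustrated with a small backend-selection sketch. This is not the library's actual code: the helper name `pick_pdf_backend` is hypothetical, and the sketch only checks whether each package is importable without calling either rasterizer.

```python
import importlib.util


def pick_pdf_backend(preferred=("pypdfium2", "pdf2image")):
    """Return the name of the first importable backend, or None.

    Hypothetical helper for illustration only; the order of `preferred`
    mirrors the preference described in the release notes.
    """
    for name in preferred:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    backend = pick_pdf_backend()
    if backend is None:
        print("No PDF rasterization backend is installed")
    else:
        print(f"PDF rasterization would use: {backend}")
```

Checking importability up front keeps the fallback decision in one place, so an environment without OS-level PDF libraries only needs `pypdfium2` installed.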
Backward-incompatible changes
* This update removes `s3_output_path` from the synchronous functions, because the Textract synchronous API does not support that parameter.
* This update changes the exception raised by the `textractor.py` functions, which no longer raise `RegionMismatchError` (the class is kept in `textractor.exceptions` for backward compatibility).
* This update removes `confidence_score` from `KeyValue` entities in favour of `_confidence`, which is used for all other entities.
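
For code that must run against both older and newer versions, the `confidence_score` removal can be bridged with a small accessor. The sketch below is an assumption-laden illustration, not part of the library: `read_confidence` is a hypothetical helper, the public `confidence` attribute is assumed to be the accessor in front of `_confidence`, and the stand-in objects are not real `KeyValue` entities.

```python
from types import SimpleNamespace


def read_confidence(entity):
    """Return an entity's confidence regardless of which attribute it exposes.

    Hypothetical helper: tries the assumed public `confidence` accessor,
    then `_confidence` (named in the release notes), then the removed
    `confidence_score` for objects created by older versions.
    """
    for attr in ("confidence", "_confidence", "confidence_score"):
        value = getattr(entity, attr, None)
        if value is not None:
            return value
    raise AttributeError("entity exposes no known confidence attribute")


if __name__ == "__main__":
    # Stand-in objects for illustration only; not real Textractor entities.
    new_style = SimpleNamespace(_confidence=0.97)
    old_style = SimpleNamespace(confidence_score=0.91)
    print(read_confidence(new_style))
    print(read_confidence(old_style))
```

Once every environment is on v1.8.0 or later, the helper can be dropped in favour of reading the attribute directly.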
**Full Changelog**: https://github.com/aws-samples/amazon-textract-textractor/compare/v1.7.12...v1.8.0