Added
- New unified `edspdf.data` API (PDF files, pandas, parquet) and `LazyCollection` object
to efficiently read and write data from and to different formats and sources. This API
has been heavily inspired by the `edsnlp.data` API.
- New unified processing API to select the execution backend via `data.set_processing(...)`
to replace the old `accelerators` API (which is now deprecated, but still available).
- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:

```ini
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title" : "title",
}
```
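
To make the one-to-many mapping above concrete, here is a minimal pure-Python sketch of the idea. It is not the `simple-aggregator` implementation: the `aggregate_lines` helper, the `(label, text)` line representation, and the newline join are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_lines(lines, label_map):
    """Group (label, text) lines into output fields (hypothetical helper).

    `label_map` maps an output field name to one source label
    or to a list of source labels.
    """
    # Invert the mapping: source label -> output fields it contributes to.
    targets = defaultdict(list)
    for field, sources in label_map.items():
        if isinstance(sources, str):
            sources = [sources]
        for source in sources:
            targets[source].append(field)

    # Collect the text of each line under every field its label feeds.
    fields = {field: [] for field in label_map}
    for label, text in lines:
        for field in targets.get(label, []):
            fields[field].append(text)

    # Assumption: lines of a field are joined with a newline.
    return {field: "\n".join(parts) for field, parts in fields.items()}


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
lines = [
    ("title", "Discharge summary"),
    ("body", "The patient was admitted on Monday."),
    ("footer", "Page 1/2"),
    ("table", "Hb 13.2 g/dL"),
]
result = aggregate_lines(lines, label_map)
```

In this sketch the "title" line ends up in both `result["text"]` and `result["title"]`, while the "footer" line, which no field lists as a source, is dropped.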
Fixed
- `huggingface-embedding` now resizes bbox features for large PDFs, instead of crashing the model
- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly