Edspdf

Latest version: v0.9.1



0.9.1

Fixed

- It is now possible to recursively retrieve pdf files in a directory using `edspdf.data.read_files`
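To illustrate what recursive retrieval means here, the following standard-library sketch walks a directory tree and collects every `.pdf` file. It does not use or reproduce `edspdf.data.read_files`; the helper name `find_pdf_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_pdf_files(root: str) -> List[Path]:
    """Return every PDF under `root`, including nested subdirectories."""
    # rglob("*") descends into all subdirectories; the suffix check is
    # case-insensitive so that "scan.PDF" is picked up as well.
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```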

0.9.0

Added

- New unified `edspdf.data` API (PDF files, pandas, parquet) and LazyCollection object
to efficiently read / write data from / to different formats & sources. This API
has been heavily inspired by the `edsnlp.data` API.
- New unified processing API to select the execution backend via `data.set_processing(...)`
to replace the old `accelerators` API (which is now deprecated, but still available).
- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:

```ini
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title": "title",
}
```
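The effect of such a one-to-many mapping can be sketched in plain Python. This is only an illustration of the idea, not the `simple-aggregator` implementation: the function `aggregate_lines` and the `(label, text)` line representation are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

LabelMap = Dict[str, Union[str, List[str]]]


def aggregate_lines(lines: List[Tuple[str, str]], label_map: LabelMap) -> Dict[str, str]:
    """Group (label, text) lines into output fields according to label_map."""
    # Invert the map: source label -> list of output fields it feeds.
    targets = defaultdict(list)
    for field, sources in label_map.items():
        for source in [sources] if isinstance(sources, str) else sources:
            targets[source].append(field)

    # A line may land in several fields; unmapped labels are dropped.
    fields: Dict[str, List[str]] = {field: [] for field in label_map}
    for label, text in lines:
        for field in targets.get(label, []):
            fields[field].append(text)
    return {field: "\n".join(parts) for field, parts in fields.items()}


label_map = {"text": ["title", "body", "table"], "title": "title"}
lines = [("title", "Report"), ("body", "First paragraph."), ("footer", "Page 1")]
result = aggregate_lines(lines, label_map)
```

Here the "title" line contributes to both the "text" and "title" fields, while the "footer" line, which no field maps to, is left out.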


Fixed

- `huggingface-embedding` now resizes bbox features for large PDFs, instead of making the model crash
- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly

0.8.1

Fixed

- Fix typing to allow passing an accelerator dict to `Pipeline.pipe(...)`
- Removed multiprocessing accelerator debug output
- Fixed absolute links in github-pages docs (e.g. image assets)

Changed

- Added auto-links to components in the docs (by comparing span contents with entry points)

0.8.0

Added

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple single-process support and multi-GPU / CPU support
- Add packaging utils (`pipeline.package(...)`) to make a pip-installable package from a pipeline

Changed

- Updated API to follow EDS-NLP's refactoring
- Updated `confit` to 0.4.2 (better errors) and `foldedtensor` to 0.3.0 (better multiprocess support)
- Removed `pipeline.score`. You should use `pipeline.pipe`, a custom scorer and `pipeline.select_pipes` instead.
- Better test coverage
- Use `hatch` instead of `setuptools` to build the package / docs and run the tests
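With `pipeline.score` removed, evaluation becomes an ordinary function applied to the pipeline's output. A minimal sketch of such a custom scorer is shown below; representing each document as a list of box labels and the name `label_accuracy` are assumptions for illustration, not part of the EDS-PDF API.

```python
from typing import Iterable, List


def label_accuracy(predicted: Iterable[List[str]], gold: Iterable[List[str]]) -> float:
    """Fraction of boxes whose predicted label matches the gold label."""
    correct = total = 0
    for pred_labels, gold_labels in zip(predicted, gold):
        if len(pred_labels) != len(gold_labels):
            raise ValueError("predicted and gold documents must have the same number of boxes")
        correct += sum(p == g for p, g in zip(pred_labels, gold_labels))
        total += len(gold_labels)
    # Avoid dividing by zero when every document is empty.
    return correct / total if total else 0.0
```

In practice, the predicted labels would be read from the documents returned by `pipeline.pipe(...)`, and the gold labels from an annotated copy of the same documents.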

Fixed

- Fixed `attrs` dependency only being installed in dev mode

0.7.0

Major refactoring of the library:

Core features
- new pipeline system whose API is inspired by spaCy
- first-class support for pytorch
- hybrid model inference and training (rules + deep learning)
- moved from pandas DataFrame to attrs dataclasses (`PDFDoc`, `Page`, `Box`, ...) for representing PDF documents
- new configuration system based on [confit](https://github.com/aphp/confit), with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features
- new extractors: pymupdf and poppler (separate packages for licensing reasons)
- many deep learning layers (box-transformer, 2d attention with relative position information, ...)
- trainable deep learning classifier
- training recipes for deep learning models

0.6.3

Fixed

- Corrupted PDFs no longer raise an error by default (they are treated as empty PDFs)
- Fix classification and aggregation for empty PDFs
