# EDS-NLP

Latest version: v0.11.2


## 0.11.2

### Fixed
- Fixed incorrect file system detection in `edsnlp.utils.file_system.normalize_fs_path`
- Improved performance of `edsnlp.data` methods over a filesystem (`fs` parameter)

## 0.11.1

### Added

- Automatic estimation of the CPU count when using multiprocessing
- `optim.initialize()` method to create the optimizer state before the first backward pass

### Changed

- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)

### Fixed

- Corrected inconsistencies in `eds.span_linker`

## 0.11.0

### Added

- Support for a `filesystem` parameter in every `edsnlp.data.read_*` and `edsnlp.data.write_*` function
- Pipes of a pipeline are now easily accessible with `nlp.pipes.xxx` instead of `nlp.get_pipe("xxx")`
- Support built-in Span attributes in the converters' `span_attributes` parameter, e.g.:
```python
import edsnlp

nlp = ...
nlp.add_pipe("eds.sentences")

data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
```

- Support assigning Brat AnnotatorNotes as span attributes: `edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")`
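The Brat `.ann` syntax for such notes is standard; the parsing helper below is a hypothetical, plain-Python sketch of how a note's text can become a span attribute value, not the actual edsnlp implementation:

```python
# Hypothetical sketch: map a Brat AnnotatorNotes line onto a span attribute,
# mirroring the effect of read_standoff(..., notes_as_span_attribute="cui").
# The .ann line format below is standard Brat; the helper itself is illustrative.

def parse_annotator_note(line: str):
    """Parse a Brat line like '#1\tAnnotatorNotes T1\tC0011849'."""
    _note_id, ref, value = line.split("\t")
    kind, target = ref.split(" ")
    assert kind == "AnnotatorNotes"
    # `target` is the annotated span id, `value` becomes the attribute value
    return target, value

target, cui = parse_annotator_note("#1\tAnnotatorNotes T1\tC0011849")
print(target, cui)  # T1 C0011849
```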
- Support for mapping full batches in `edsnlp.processing` pipelines with the `map_batches` lazy collection method:
```python
import edsnlp

data = edsnlp.data.from_xxx(...)
data = data.map_batches(lambda batch: do_something(batch))
data.to_pandas()
```

- New `data.map_gpu` method to map a deep learning operation on some data and take advantage of EDS-NLP's multi-GPU inference capabilities
- Added average precision computation to the EDS-NLP `span_classification` scorer
- You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection, and type checking!

```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
# instead of nlp.add_pipe("eds.sentences")
```


*The previous way of adding pipes is still supported.*
- New `eds.span_linker` deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.

### Changed

- `nlp.preprocess_many` now uses lazy collections to enable parallel processing
- :warning: Breaking change. Improved and simplified `eds.span_qualifier`: combination groups were not supported before, so this feature was scrapped for now. Splitting values of a single qualifier between different span labels is now supported.
- Optimized `edsnlp.data` batching, especially for large batch sizes (removed a quadratic loop)
- :warning: Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class `__init__` signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".

### Fixed

- Flatten list outputs (such as those of the "ents" converter) when iterating: `nlp.map(data).to_iterable("ents")` is now a list of entities, not a list of lists of entities
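A plain-Python illustration of the fixed flattening behavior (the entity names are placeholders, not edsnlp objects):

```python
from itertools import chain

# Before the fix, iterating yielded one list of entities per document;
# after it, the per-document lists are flattened into a single list.
per_doc_ents = [["ent1", "ent2"], [], ["ent3"]]
flat_ents = list(chain.from_iterable(per_doc_ents))
print(flat_ents)  # ['ent1', 'ent2', 'ent3']
```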
- Allow span pooler to choose between multiple base embedding spans (as likely produced by `eds.transformer`) by sorting them by Dice overlap score.
- EDS-NLP does not raise an error anymore when saving a model to an already existing, but empty directory

## 0.10.7

### Added

- Support empty writer converter by default in `edsnlp.data` readers / writers (do not convert by default)
- Add support for polars data import / export
- Allow kwargs in `eds.transformer` to be passed through to the underlying transformer model

### Changed

- Saving pipelines no longer saves the `disabled` status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was unused and caused issues when saving a model wrapped in an `nlp.select_pipes` context.

### Fixed

- Allow missing `meta.json`, `tokenizer` and `vocab` paths when loading saved models
- Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
- Fix automatic `batch_size` estimation in `eds.transformer` when `max_tokens_per_device` is set to `auto` and multiple GPUs are used
- Fix JSONL file parsing

## 0.10.6

### Added

- Added `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size`, and `disable_implicit_parallelism` parameters to the processing (`simple` and `multiprocessing`) backends to improve performance and memory usage. Sorting chunks can improve throughput by up to **2x** in some cases.
- The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
- Added `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU.

### Changed

- Improved speed and memory usage of the `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup up to 1.3x in real-world use cases.
- Deprecated the converters' `bool_attributes` parameter (especially for BRAT/Standoff
  data) in favor of a general `default_attributes` parameter. This new mapping describes
  how to set attributes on spans for which no attribute value was found in the input
  format. This is especially useful for negation, or for frequent attribute values
  (e.g., "negated" is often False, "temporal" is often "present") that annotators may
  not want to annotate every time.
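The semantics can be sketched in plain Python (attribute names taken from the example above; the helper is hypothetical, not the edsnlp converter code):

```python
# Hypothetical sketch of the `default_attributes` semantics: annotated values
# are kept, and attributes missing from the input get their default value.

default_attributes = {"negated": False, "temporal": "present"}

def apply_defaults(span_attrs: dict, defaults: dict) -> dict:
    # Defaults first, then the annotated values override them.
    return {**defaults, **span_attrs}

print(apply_defaults({"negated": True}, default_attributes))
# {'negated': True, 'temporal': 'present'}
```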
- The default `eds.ner_crf` window is now 40 with a stride of 20, as this does not
  affect throughput (compared to the previous window of 20) and improves accuracy.
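The overlapping-window scheme these defaults imply can be sketched as follows (a hypothetical helper, not the `eds.ner_crf` code):

```python
# Hypothetical sketch: overlapping windows of 40 tokens starting every
# 20 tokens (the stride), clipped to the sequence length.

def token_windows(n_tokens: int, window: int = 40, stride: int = 20):
    starts = range(0, max(n_tokens - window + stride, 1), stride)
    return [(s, min(s + window, n_tokens)) for s in starts]

print(token_windows(90))  # [(0, 40), (20, 60), (40, 80), (60, 90)]
```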
- New default `overlap_policy='merge'` option and parameter renaming in
`eds.span_context_getter` (which replaces `eds.span_sentence_getter`)

### Fixed

- Improved error handling in `multiprocessing` backend (e.g., no more deadlock)
- Various improvements to the data processing related documentation pages
- Begin-of-sentence / end-of-sentence transitions of the `eds.ner_crf` component are now
  disabled when windows are used (i.e., neither `window=1`, equivalent to a softmax, nor
  `window=0`, equivalent to the default full-sequence Viterbi decoding)
- The `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
- Only match the 'ne' negation pattern when it is not part of another word, to avoid false positives such as `u[ne] cure de 10 jours`
- Disabled pipes are now correctly ignored in the `Pipeline.preprocess` method
- Add "eventuel*" patterns to `eds.hypothesis`

## 0.10.5

### Fixed

- Allow non-URL paths when a parquet filesystem is given
