edsnlp

Latest version: v0.12.0


0.12.0

Added

- The `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method; see the breaking change below), which are prepended to each window of text to embed.
- `LazyCollection.map` / `map_batches` now support generator functions as arguments.
- Window stride can now be disabled during training (i.e., stride = window) in the `eds.transformer` component by setting `training_stride = False`
- Added a new `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting a match as true when the Dice overlap is above a given threshold
- `edsnlp.load` now accepts EDS-NLP models from the Hugging Face hub 🤗! See the sketch after this list.
- New `python -m edsnlp.package` command to package a model for the huggingface hub or pypi-like registries
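
As a minimal sketch of loading a pipeline from the hub (the repository id below is a placeholder, not a real model):

```python
import edsnlp

# "my-org/my-eds-model" is a placeholder Hugging Face hub repository id
nlp = edsnlp.load("my-org/my-eds-model")
doc = nlp("Le patient est diabétique.")
```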

Changed

- :boom: Major breaking change in trainable components, moving towards a more "task-centric" design:
  - the `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
  - similarly, the `eds.span_pooler` is no longer responsible for deciding which spans to pool; it now pools all spans passed to it in the `preprocess` method.

Consequently, the `eds.transformer` and `eds.span_pooler` no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:

```diff
- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter

- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter
```


and as an example for the `eds.span_linker` component:

```diff
 nlp.add_pipe(
     eds.span_linker(
         metric="cosine",
         probability_mode="sigmoid",
+        span_getter="ents",
+        context_getter="ents",  # by default, same as span_getter
         embedding=eds.span_pooler(
             hidden_size=128,
-            span_getter="ents",
             embedding=eds.transformer(
-                span_getter="ents",
                 model="prajjwal1/bert-tiny",
                 window=128,
                 stride=96,
             ),
         ),
     ),
     name="linker",
 )
```

- Trainable embedding components now all use `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor.
- :boom: TorchComponent `__call__` no longer applies the end to end method, and instead calls the `forward` method directly, like all torch modules.
- The trainable `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose: it can predict not only qualifiers, but any attribute of a span, using its context or not.
- The `omop` converter now takes the `note_datetime` field into account by default when building a document
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account (see the sketch after this list)
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
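
A minimal sketch of the `note_datetime` resolution, assuming the `eds.dates` pipe and a French relative date:

```python
import datetime
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates())

doc = nlp("Patient hospitalisé il y a trois jours.")
doc._.note_datetime = datetime.datetime(2024, 5, 1)

date_span = doc.spans["dates"][0]
# the relative date is resolved against doc._.note_datetime
print(date_span._.date.to_datetime())
```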

Fixed

- `edsnlp.data.read_json` now correctly reads files from the directory passed as an argument, and not from its parent directory
- Overwrote spaCy's Doc, Span and Token pickling utilities to allow recursively storing Doc, Span and Token objects in extension values (in particular `span._.date.doc`)
- Removed the pendulum dependency, solving various pickling, multiprocessing and missing-attribute errors

0.11.2

Fixed
- Fixed incorrect file system detection in `edsnlp.utils.file_system.normalize_fs_path`
- Improved performance of `edsnlp.data` methods over a filesystem (`fs` parameter)

0.11.1

Added

- Automatic estimation of cpu count when using multiprocessing
- `optim.initialize()` method to create optim state before the first backward pass

Changed

- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)

Fixed

- Corrected inconsistencies in `eds.span_linker`

0.11.0

Added

- Support for a `filesystem` parameter in every `edsnlp.data.read_*` and `edsnlp.data.write_*` functions
- Pipes of a pipeline are now easily accessible with `nlp.pipes.xxx` instead of `nlp.get_pipe("xxx")`
- Support builtin Span attributes in converters `span_attributes` parameter, e.g.
```python
import edsnlp

nlp = ...
nlp.add_pipe("eds.sentences")

data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
```

- Support assigning Brat AnnotatorNotes as span attributes: `edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")`
- Support for mapping full batches in `edsnlp.processing` pipelines with `map_batches` lazy collection method:
```python
import edsnlp

data = edsnlp.data.from_xxx(...)
data = data.map_batches(lambda batch: do_something(batch))
data.to_pandas()
```

- New `data.map_gpu` method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities
- Added average precision computation to the edsnlp `span_classification` scorer
- You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking!

```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
# instead of nlp.add_pipe("eds.sentences")
```


*The previous way of adding pipes is still supported.*
- New `eds.span_linker` deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.

Changed

- `nlp.preprocess_many` now uses lazy collections to enable parallel processing
- :warning: Breaking change. Improved and simplified `eds.span_qualifier`: combination groups were not supported before, so this feature has been dropped for now. Splitting the values of a single qualifier between different span labels is now supported.
- Optimized edsnlp.data batching, especially for large batch sizes (removed a quadratic loop)
- :warning: Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class `__init__` signature. For most EDS-NLP components, this changes the name from "eds.xxx" to "xxx", as shown in the sketch after this list.
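
A quick illustration of the new default naming (assuming the `eds.sentences` pipe):

```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
print(nlp.pipe_names)  # ["sentences"], previously ["eds.sentences"]
```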

Fixed

- Flatten list outputs (such as "ents" converter) when iterating: `nlp.map(data).to_iterable("ents")` is now a list of entities, and not a list of lists of entities
- Allow span pooler to choose between multiple base embedding spans (as likely produced by `eds.transformer`) by sorting them by Dice overlap score.
- EDS-NLP does not raise an error anymore when saving a model to an already existing, but empty directory

0.10.7

Added

- Support empty writer converter by default in `edsnlp.data` readers / writers (do not convert by default)
- Added support for polars data import / export (see the sketch after this list)
- Allow kwargs in `eds.transformer` to pass to the transformer model
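
A minimal sketch of the polars support; the `omop` converter and column names are assumptions based on the other `edsnlp.data` readers:

```python
import edsnlp, edsnlp.pipes as eds
import polars as pl

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())

df = pl.DataFrame({"note_id": ["doc-1"], "note_text": ["Le patient est diabétique."]})

# the "omop" converter maps note_id / note_text columns to Doc objects
data = edsnlp.data.from_polars(df, converter="omop")
data = data.map_pipeline(nlp)
out = data.to_polars(converter="omop")
```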

Changed

- Saving pipelines no longer saves the `disabled` status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was unused and caused issues when saving a model wrapped in a `nlp.select_pipes` context.

Fixed

- Allow missing `meta.json`, `tokenizer` and `vocab` paths when loading saved models
- Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
- Fix automatic `batch_size` estimation in `eds.transformer` when `max_tokens_per_device` is set to `auto` and multiple GPUs are used
- Fix JSONL file parsing

0.10.6

Added

- Added `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size` and `disable_implicit_parallelism` parameters to the processing (`simple` and `multiprocessing`) backends to improve performance and memory usage; sorting chunks can improve throughput by up to **2×** in some cases (see the sketch after this list).
- The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
- Added `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU.
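
A minimal sketch of how these parameters might be passed via a lazy collection's `set_processing` method; the data source and pipeline are placeholders:

```python
import edsnlp

data = edsnlp.data.from_xxx(...)  # any edsnlp.data reader
data = data.map_pipeline(nlp)     # nlp: an existing edsnlp pipeline
data = data.set_processing(
    backend="multiprocessing",
    batch_by="words",   # batch documents by word count rather than doc count
    sort_chunks=True,   # sort docs inside each chunk to reduce padding
    chunk_size=1024,    # number of docs per sorting chunk
)
data.to_pandas()
```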

Changed

- Improved speed and memory usage of the `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup of up to 1.3× in real-world use cases.
- Deprecated the converters' `bool_attributes` parameter (especially for BRAT/Standoff data) in favor of a more general `default_attributes`. This new mapping describes how to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporal" is often "present") that annotators may not want to annotate every time. See the sketch after this list.
- The default `eds.ner_crf` window is now set to 40 and the stride to 20, as this does not affect throughput (compared to the previous window of 20) and improves accuracy.
- New default `overlap_policy='merge'` option and parameter renaming in
`eds.span_context_getter` (which replaces `eds.span_sentence_getter`)
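
A minimal sketch of the new `default_attributes` mapping, assuming a local BRAT directory (the path and attribute names are placeholders):

```python
import edsnlp

# spans lacking these attributes in the .ann files receive the given defaults
data = edsnlp.data.read_standoff(
    "path/to/brat/corpus",
    default_attributes={"negated": False, "temporal": "present"},
)
```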

Fixed

- Improved error handling in `multiprocessing` backend (e.g., no more deadlock)
- Various improvements to the data processing related documentation pages
- Begin-of-sentence / end-of-sentence transitions of the `eds.ner_crf` component are now disabled when windows are used (i.e., with neither `window=1`, which is equivalent to a softmax, nor `window=0`, which is equivalent to the default full-sequence Viterbi decoding)
- The `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
- Only match the 'ne' negation pattern when it is not part of another word, to avoid false positives such as `u[ne] cure de 10 jours`
- Disabled pipes are now correctly ignored in the `Pipeline.preprocess` method
- Added "eventuel*" patterns to `eds.hypothesis`
