Added
- The `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method, see breaking change below) to add before each window of text to embed.
- `LazyCollection.map` / `map_batches` now support generator functions as arguments.
- Window stride can now be disabled (i.e., stride = window) during training in the `eds.transformer` component by `training_stride = False`
- Added a new `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting true when the dice overlap is above a given threshold
- `edsnlp.load` now accepts EDS-NLP models from the huggingface hub 🤗 !
- New `python -m edsnlp.package` command to package a model for the huggingface hub or pypi-like registries
- Improve table detection in `eds.tables` and support new options in `table._.to_pd_table(...)`:
- `header=True` to use first row as header
- `index=True` to use first column as index
- `as_spans=True` to fill cells as document spans instead of strings
Changed
- :boom: Major breaking change in trainable components, moving towards a more "task-centric" design:
- the `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
- similarly the `eds.span_pooler` is now longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the `preprocess` method.
Consequently, the `eds.transformer` and `eds.span_pooler` no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:
diff
- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter
- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter
and as an example for the `eds.span_linker` component:
diff
nlp.add_pipe(
eds.span_linker(
metric="cosine",
probability_mode="sigmoid",
+ span_getter="ents",
+ context_getter="ents", -> by default, same as span_getter
embedding=eds.span_pooler(
hidden_size=128,
- span_getter="ents",
embedding=eds.transformer(
- span_getter="ents",
model="prajjwal1/bert-tiny",
window=128,
stride=96,
),
),
),
name="linker",
)
- Trainable embedding components now all use `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor.
- :boom: TorchComponent `__call__` no longer applies the end to end method, and instead calls the `forward` method directly, like all torch modules.
- The trainable `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
- `omop` converter now takes the `note_datetime` field into account by default when building a document
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
Fixed
- `edsnlp.data.read_json` now correctly read the files from the directory passed as an argument, and not from the parent directory.
- Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, span._.date.doc)
- Removed pendulum dependency, solving various pickling, multiprocessing and missing attributes errors