Added
- Added `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size` and `disable_implicit_parallelism` parameters to the processing backends (`simple` and `multiprocessing`) to improve performance
and memory usage. Sorting chunks can yield up to **twice the speed** in some cases.
- The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
- Added `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU.
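
To illustrate why sorting a chunk by length can help, here is a minimal sketch in plain Python. It is not the library's implementation: the helper names `make_batches` and `padded_cost` and the toy lengths are assumptions made for illustration. It counts how many token slots are needed when every sequence is padded to the longest one in its batch, with and without sorting the chunk first.

```python
from typing import List


def make_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padded_cost(batches: List[List[int]]) -> int:
    """Total token slots when each batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) for batch in batches)


# Hypothetical document lengths (in tokens) inside one chunk.
chunk = [5, 120, 8, 110, 6, 115, 7, 118]

# Batching in arrival order mixes short and long documents.
unsorted_cost = padded_cost(make_batches(chunk, batch_size=4))

# Sorting the chunk first groups documents of similar length together.
sorted_cost = padded_cost(make_batches(sorted(chunk), batch_size=4))

print(unsorted_cost, sorted_cost)
```

With these toy lengths, sorting cuts the padded size from 952 to 512 token slots; real gains depend on how varied the document lengths are.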
Changed
- Improved speed and memory usage of the `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup up to 1.3x in real-world use cases.
- Deprecated the converters' `bool_attributes` parameter (especially relevant for
BRAT/Standoff data) in favor of the more general `default_attributes`. This new mapping
describes how to set attributes on spans for which no attribute value was found in the
input format. This is especially useful for negation, or for frequent attribute values
(e.g. "negated" is often False, "temporal" is often "present"), that annotators may not
want to annotate every time.
- The default `eds.ner_crf` window is now 40 with a stride of 20: compared to the
previous default (window of 20), this doesn't affect throughput and improves accuracy.
- New default `overlap_policy='merge'` option and parameter renaming in
`eds.span_context_getter` (which replaces `eds.span_sentence_getter`)
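
As an illustration of the `default_attributes` idea described in this section, the sketch below fills in missing span attributes from a mapping of defaults. It is plain Python over dictionaries, not the converters' actual code: the function name `apply_default_attributes` and the dictionary-based span representation are assumptions.

```python
from typing import Any, Dict


def apply_default_attributes(
    span_attributes: Dict[str, Any],
    default_attributes: Dict[str, Any],
) -> Dict[str, Any]:
    """Return the span attributes, using defaults for keys that were not annotated."""
    merged = dict(default_attributes)
    merged.update(span_attributes)  # annotated values take precedence over defaults
    return merged


# Defaults mirroring the examples given in the changelog entry.
defaults = {"negated": False, "temporal": "present"}

# Hypothetical span where the annotator only marked negation.
annotated = {"negated": True}

result = apply_default_attributes(annotated, defaults)
print(result)
```

The annotated value wins for `negated`, while `temporal` falls back to its default, so annotators only need to mark the uncommon cases.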
Fixed
- Improved error handling in `multiprocessing` backend (e.g., no more deadlock)
- Various improvements to the data processing related documentation pages
- Beginning-of-sentence / end-of-sentence transitions of the `eds.ner_crf` component are now
disabled when windows are used (i.e., when `window` is neither 1, equivalent to softmax,
nor 0, equivalent to the default full-sequence Viterbi decoding)
- `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
- Only match 'ne' negation pattern when not part of another word to avoid false positive cases like `u[ne] cure de 10 jours`
- Disabled pipes are now correctly ignored in the `Pipeline.preprocess` method
- Added "eventuel*" patterns to `eds.hypothesis`
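
The 'ne' fix above can be illustrated with a word-boundary regular expression. This is a minimal sketch using Python's standard `re` module; the pattern `NE_PATTERN` is an assumption for illustration and is not necessarily the pattern used by the negation component.

```python
import re

# Assumed pattern: "ne" delimited by word boundaries on both sides,
# so it cannot match inside a longer word such as "une".
NE_PATTERN = re.compile(r"\bne\b", flags=re.IGNORECASE)

false_positive = NE_PATTERN.search("une cure de 10 jours")
true_negation = NE_PATTERN.search("le patient ne fume pas")

print(false_positive is None, true_negation is not None)
```

Without the boundaries, a bare `ne` pattern would match the last two letters of "une" and wrongly flag the sentence as negated.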