Edsnlp

Latest version: v0.16.0

Safety actively analyzes 722491 Python packages for vulnerabilities to keep your Python projects secure.

Page 1 of 8

0.16.0

Added
- Hyperparameter Tuning for EDS-NLP: introduced a new script `edsnlp.tune` for hyperparameter tuning using Optuna. This feature allows users to efficiently optimize model parameters with options for single-phase or two-phase tuning strategies. Includes support for parameter importance analysis, visualization, pruning, and automatic handling of GPU time budgets.
- Provided a [detailed tutorial](https://aphp.github.io/edsnlp/v0.16.0/tutorials/tuning/) on hyperparameter tuning, covering usage scenarios and configuration options.
- `ScheduledOptimizer` (e.g., `core: "optimizer"`) now supports importing optimizers using their qualified name (e.g., `optim: "torch.optim.Adam"`).
- `eds.ner_crf` now computes confidence score on spans.

Changed

- The loss of `eds.ner_crf` is now computed as the mean over the words instead of the sum. This change is compatible with multi-gpu training.
- Having multiple stats keys matching a batching pattern now warns instead of raising an error.

Fixed

- Support packaging with poetry 2.0
- Solve pickling issues with multiprocessing when pytorch is installed
- Allow deep attributes like `a.b.c` for `span_attributes` in Standoff and OMOP doc2dict converters
- Fixed various aspects of stream shuffling:

- Ensure the Parquet reader shuffles the data when `shuffle=True`
- Ensure we don't overwrite the RNG of the data reader when calling `stream.shuffle()` with no seed
- Raise an error if the batch size in `stream.shuffle(batch_size=...)` is not compatible with the stream
- `eds.split` now keeps doc and span attributes in the sub-documents.

0.15.0

Added

- `edsnlp.data.read_parquet` now accept a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of row. When this is enabled, workers do not read every fragment while skipping 1 in n rows, but read all rows of 1/n fragments, which should be faster.
- Accept no validation data in `edsnlp.train` script
- Log the training config at the beginning of the trainings
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`)
- Specify whether to log the validation results or not (`logger=False`)
- Added support for the CoNLL format with `edsnlp.data.read_conll` and with a specific `eds.conll_dict2doc` converter
- Added a Trainable Biaffine Dependency Parser (`eds.biaffine_dep_parser`) component and metrics
- New `eds.extractive_qa` component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.

Fixed

- Fix `join_thread` missing attribute in `SimpleQueue` when cleaning a multiprocessing executor
- Support huggingface transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings)
- Fix changing scorers dict size issue when evaluating during training
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers : this is important for
1. reproducibility
2. in multiprocessing mode, ensure that the same data is shuffled in the same way in all workers
- Bubble BaseComponent instantiation errors correctly
- Improved support for multi-gpu gradient accumulation (only sync the gradients at the end of the accumulation), now controled by the optiona `sub_batch_size` argument of `TrainingData`.
- Support again edsnlp without pytorch installed
- We now test that edsnlp works without pytorch installed
- Fix units and scales, ie 1l = 1dm3, 1ml = 1cm3

0.14.0

Added

- Support for setuptools based projects in `edsnlp.package` command
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the core = "pipeline" or "load" field in the pipeline section)
- `edsnlp.load` now correctly takes disable, enable and exclude parameters into account
- Pipeline now has a basic repr showing is base langage (mostly useful to know its tokenizer) and its pipes
- New `python -m edsnlp.evaluate` script to evaluate a model on a dataset
- Sentence detection can now be configured to change the minimum number of newlines to consider a newline-triggered sentence, and disable capitalization checking.
- New `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training)
- Allow `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter
- New revamped and documented `edsnlp.train` script and API
- Support YAML config files (supported only CFG/INI files before)
- Most of EDS-NLP functions are now clickable in the documentation
- ScheduledOptimizer now accepts schedules directly in place of parameters, and easy parameter selection:

ScheduledOptimizer(
optim="adamw",
module=nlp,
total_steps=2000,
groups={
"^transformer": {
lr will go from 0 to 5e-5 then to 0 for params matching "transformer"
"lr": {"schedules": "linear", "warmup_rate": 0.1, "start_value": 0 "max_value": 5e-5,},
},
"": {
lr will go from 3e-4 during 200 steps then to 0 for other params
"lr": {"schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4 "max_value": 3e-4,},
},
},
)

Changed

- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU Workers since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters.
- The `batch_size` argument of `Pipeline` is deprecated and is not used anymore. Use the `batch_size` argument of `stream.map_pipeline` instead.

Fixed

- Sort files before iterating over a standoff or json folder to ensure reproducibility
- Sentence detection now correctly match capitalized letters + apostrophe
- We now ensure that the workers pool is properly closed whatever happens (exception, garbage collection, data ending) in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- Propagate torch sharing strategy to other workers in the `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. Torch sharing strategy can also be set via an environment variable `TORCH_SHARING_STRATEGY` (default is `file_descriptor`, [consider using `file_system` if you encounter issues](https://pytorch.org/docs/stable/multiprocessing.html#file-system-file-system)).

Data API changes

- `LazyCollection` objects are now called `Stream` objects
- By default, `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method
- :rocket: Parallelized GPU inference throughput improvements !

- For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order)
- For multitask pipelines, GPU inference can be up to twice as fast (measured in a two-tasks BERT+NER+Qualif pipeline on T4 and A100 GPUs)

- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of having a single batch size for all pipes
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training)
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it
- In `multiprocessing` mode, file based readers now read the data in the workers (was an option before)
- We now support two new special batch sizes

- "fragment" in the case of parquet datasets: rows of a full parquet file fragment per batch
- "dataset" which is mostly useful during training, for instance to shuffle the dataset at each epoch.
These are also compatible in batched writer such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.

- :boom: Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users since most writers (to_pandas, to_polars, to_parquet, ...) still flatten the output
- :boom: Breaking change: the `chunk_size` and `sort_chunks` are now deprecated : to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`

Training API changes

- We now provide a training script `python -m edsnlp.train --config config.cfg` that should fit many use cases. Check out the docs !
- In particular, we do not require pytorch's Dataloader for training and can rely solely on EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (ie different result each time we apply a noised preprocessing op on a sample).
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...):

- these stats are both used for batching (e.g., make batches of no more than "25000 tokens")
- for logging
- for computing correct loss means when accumulating gradients over multiple mini-mini-batches
- for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs

- Support multi GPU training via hugginface `accelerate` and EDS-NLP `Stream` API consideration of env['WOLRD_SIZE'] and env['LOCAL_RANK'] environment variables

0.13.1

Added

- `eds.tables` accepts a minimum_table_size (default 2) argument to reduce pollution
- `RuleBasedQualifier` now expose a `process` method that only returns qualified entities and token without actually tagging them, deferring this task to the `__call__` method.
- Added new patterns for metastasis detection. Developed on CT-Scan reports.
- Added citation of articles

Changed

- Renamed `edsnlp.scorers` to `edsnlp.metrics` and removed the `_scorer` suffix from their
registry name (e.g, `scorers = ner_overlap_scorer` → `metrics = ner_overlap`)
- Rename `eds.measurements` to `eds.quantities`
- scikit-learn (used in `eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`

Fixed

- Disorder and Behavior pipes don't use a "PRESENT" or "ABSENT" `status` anymore. Instead, `status=None` by default,
and `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the *tobacco* and *alcohol*
now use the `NegationQualifier` internally.
- Numbers are now only detected without trying to remove the pollution in between digits, ie `55 77777` could be detected as a full number before, but not anymore.
- Resolve encoding-related data reading issues by forcing utf-8

0.13.0

Added

- `data.set_processing(...)` now expose an `autocast` parameter to disable or tweak the automatic casting of the tensor
during the processing. Autocasting should result in a slight speedup, but may lead to numerical instability.
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference.
- Added a new NER pipeline for suicide attempt detection
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`
- Added tables processing in eds.measurement
- Added 'all' as possible input in eds.measurement measurements config
- Added new units in eds.measurement

Changed

- Default to mixed precision inference

Fixed

- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the python pip
(especially on Colab) to auto-install the model dependencies
- We now better handle empty documents in the `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components
- Support mixed precision in `eds.text_cnn` and `eds.ner_crf` components
- Support pre-quantization (<4.30) transformers versions
- Verify that all batches are non empty
- Fix `span_context_getter` for `context_words` = 0, `context_sents` > 2 and support assymetric contexts
- Don't split sentences on rare unicode symbols
- Better detect abbreviations, like `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`]

0.12.3

Changed

Packages:

- Pip-installable models are now built with `hatch` instead of poetry, which allows us to expose `artifacts` (weights)
at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts.
- Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
- Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
- Huggingface models are now installed with pip *editable* installations, which is faster since it doesn't copy around the weights

Page 1 of 8

Releases

Has known vulnerabilities

Edsnlp

Page 1 of 8

0.16.0

0.15.0

0.14.0

0.13.1

0.13.0

0.12.3

Page 1 of 8

Links

Releases