MidiTok

Latest version: v3.0.5.post1

3.0.0

Switch to symusic

This major version marks the switch from the [miditoolkit](https://github.com/YatingMusic/miditoolkit) MIDI reading/writing library to [**symusic**](https://github.com/Yikai-Liao/symusic), and a large optimisation of the MIDI preprocessing steps.

Symusic is a MIDI reading / writing library written in C++ with Python bindings, offering unmatched speeds, [**up to 500 times faster than native Python libraries**](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1895948962). It is based on [minimidi](https://github.com/lzqlzzq/minimidi). The two libraries are created and maintained by Yikai-Liao and lzqlzzq, who did amazing work, which is still ongoing as many useful features are on the roadmap! 🫶

**Tokenizers from previous versions are compatible with this new version, but there might be some time variations if you compare how MIDIs are tokenized and tokens decoded.**

Performance boost

These changes result in much faster MIDI loading/writing and tokenization times! **The overall tokenization (loading MIDI and tokenizing it) is** [**between 5 and 12 times faster**](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1896286910) depending on the tokenizer and data. You can find other benchmarks [here](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1895948962).

This huge speed gain makes it possible to drop the previously recommended step of pre-tokenizing MIDI files as json tokens, and to **directly tokenize the MIDIs on the fly while training/using a model**! We updated the [usage examples of the docs](https://miditok.readthedocs.io/en/latest/examples.html) accordingly; the code is now simpler.

Other major changes

* When using time signatures, time tokens are now computed in ticks per beat, as opposed to ticks per quarter note as done previously. This change is in line with the definition of time and duration tokens, which was not handled following the MIDI norm for note values other than the quarter note until now (https://github.com/Natooz/MidiTok/pull/124);
* Adding new ruff rules and their fixes to comply, increasing the code quality in https://github.com/Natooz/MidiTok/pull/115;
* MidiTok still supports `miditoolkit.MidiFile` objects, but those will be converted on the fly to a `symusic.Score` object and a deprecation warning will be thrown;
* The data augmentation methods on the token level have been removed, in favour of better data augmentation operating directly on MIDIs, now much faster, simplifying processes and now handling durations;
* The docs are fixed;
* The tokenization test workflows have been unified and considerably simplified, leading to more robust test assertions. We also increased the number of test cases and configurations, while decreasing the test time.
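
To make the ticks-per-beat change more concrete, here is a small arithmetic sketch. It assumes that a beat is the note value given by the time signature denominator; the function name `ticks_per_beat` and the resolution of 480 ticks per quarter note are chosen for illustration and are not taken from the MidiTok code.

```python
def ticks_per_beat(ticks_per_quarter: int, denominator: int) -> int:
    """Ticks in one beat, assuming a beat is 1/denominator of a whole note."""
    # A quarter note is 1/4 of a whole note, so scale by 4 / denominator.
    return ticks_per_quarter * 4 // denominator


tpq = 480  # illustrative MIDI resolution (ticks per quarter note)
for numerator, denominator in [(4, 4), (6, 8), (3, 2)]:
    tpb = ticks_per_beat(tpq, denominator)
    print(f"{numerator}/{denominator}: {tpb} ticks per beat, {numerator * tpb} ticks per bar")
```

Under this assumption, a beat in 6/8 lasts half as many ticks as a quarter note, while a beat in 3/2 lasts twice as many. The two units only coincide when the denominator is 4.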

Other minor changes

* Setting special tokens values in TokenizerConf in https://github.com/Natooz/MidiTok/pull/114
* Update README.md by kalyani2003 in https://github.com/Natooz/MidiTok/pull/120
* Readthedocs preview action for PRs in https://github.com/Natooz/MidiTok/pull/125

New Contributors
* kalyani2003 made their first contribution in https://github.com/Natooz/MidiTok/pull/120

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.8...v3.0.0

2.1.8

This new version brings a new additional token type: pitch intervals. It makes it possible to represent pitch intervals for simultaneous and successive notes. You can read more details about how it works in [the docs](https://miditok.readthedocs.io/en/v2.1.7/).
We greatly improved the tests and CI workflow, fixed a few minor bugs, and made some improvements along the way.
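
As a rough illustration of the idea, the sketch below computes the interval in semitones between each note and the one before it, and labels it according to whether the two notes share an onset. This is only a toy model: the function name, the labels, and the choice of the previous note as reference are assumptions, and the actual MidiTok token names and rules are described in the docs.

```python
def pitch_intervals(notes: list[tuple[int, int]]) -> list[tuple[str, int]]:
    """Label the interval between each note and its predecessor.

    `notes` is a list of (onset_tick, pitch) pairs sorted by onset, then pitch.
    """
    intervals = []
    for (prev_onset, prev_pitch), (onset, pitch) in zip(notes, notes[1:]):
        # Same onset: the notes sound together. Otherwise they follow each other.
        kind = "simultaneous" if onset == prev_onset else "successive"
        intervals.append((kind, pitch - prev_pitch))
    return intervals


# A C major triad followed by two single notes (illustrative data).
notes = [(0, 60), (0, 64), (0, 67), (480, 65), (960, 62)]
print(pitch_intervals(notes))
```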

This new version also **drops support for Python 3.7, and now requires Python 3.8 and newer**. You can read more about the decision and how to make it retro-compatible in the docs.

**We encourage you to update to the latest [miditoolkit](https://github.com/YatingMusic/miditoolkit) version**, which also features some fixes and improvements. The most notable one is a cleanup of the dependencies, and **compatibility with recent numpy versions!**

What's Changed

* Typos fixes in docs by eltociear (89), gfggithubleet (91 and 93), shresthasurav (94), THEFZNKHAN (98 and 99)
* Fixing a bug when learning bpe without special tokens by Natooz in https://github.com/Natooz/MidiTok/pull/92
* Switch lint/isort/format to Ruff by akx in https://github.com/Natooz/MidiTok/pull/105
* Adding pitch interval option by Natooz in https://github.com/Natooz/MidiTok/pull/103
* Switching to pyproject.toml and hatch packaging by Natooz in https://github.com/Natooz/MidiTok/pull/106
* Fix data augment by parneyw in https://github.com/Natooz/MidiTok/pull/109
* dealing with empty midi file by feiyuehchen in https://github.com/Natooz/MidiTok/pull/110
* Better tests + minor improvements by Natooz in https://github.com/Natooz/MidiTok/pull/108

New Contributors

* eltociear made their first contribution in https://github.com/Natooz/MidiTok/pull/89
* gfggithubleet made their first contribution in https://github.com/Natooz/MidiTok/pull/91
* shresthasurav made their first contribution in https://github.com/Natooz/MidiTok/pull/94
* THEFZNKHAN made their first contribution in https://github.com/Natooz/MidiTok/pull/98
* akx made their first contribution in https://github.com/Natooz/MidiTok/pull/105
* parneyw made their first contribution in https://github.com/Natooz/MidiTok/pull/109
* feiyuehchen made their first contribution in https://github.com/Natooz/MidiTok/pull/110

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.7...v2.1.8

2.1.7

**This release brings the integration of the Hugging Face Hub, along with a few important fixes and improvements!**

What's Changed

* 87 Hugging Face hub integration! You can now push and load MidiTok tokenizers from the Hugging Face hub, using the `.from_pretrained` and `push_to_hub` methods as you would do for your models! Special thanks to Wauplin and julien-c for the help and support! 🤗🤗
* 80 (78 leleogere) Adding `func_to_get_labels` argument to `DatasetTok` allowing to use it to retrieve labels when loading data;
* 81 (74 Chunyuan-Li) Fixing multi-stream decoding with several identical programs + fixes with the encoding / decoding of time signatures for Bar-based tokenizers;
* 84 (77 VDT5702) Fix in `detect_chords` when checking whether to use unknown chords;
* 82 (79 leleogere) `tokenize_midi_dataset` now reproduces the file tree of the source files. This change fixes issues when files with the same name were overwritten in the previous method. You can also specify whether to overwrite files in the destination directory or not.
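
The file-tree change in the last item can be pictured with a short `pathlib` sketch. The helper `mirrored_output_path` and the directory names are hypothetical and only show why mirroring the source tree avoids name collisions; this is not the code of `tokenize_midi_dataset`.

```python
from pathlib import Path


def mirrored_output_path(src_file: Path, src_root: Path, out_root: Path) -> Path:
    """Place the output under out_root at the same relative location as the source."""
    return out_root / src_file.relative_to(src_root).with_suffix(".json")


src_root, out_root = Path("midis"), Path("tokens")
a = mirrored_output_path(Path("midis/bach/prelude.mid"), src_root, out_root)
b = mirrored_output_path(Path("midis/chopin/prelude.mid"), src_root, out_root)

# A flat layout that keeps only the file name would map both files to the same path.
flat_a = out_root / Path("midis/bach/prelude.mid").with_suffix(".json").name
flat_b = out_root / Path("midis/chopin/prelude.mid").with_suffix(".json").name

print(a, b)            # two distinct paths
print(flat_a == flat_b)  # True: the second file would overwrite the first
```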

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.6...v2.1.7

2.1.6

Changelog

* 72 (71) adding `program_change` config option, that will insert `Program` tokens whenever an event is from a different track than the previous one. They mimic the MIDI `ProgramChange` messages. If this parameter is disabled (by default), a `Program` token will prepend each track programs (as done in previous versions);
* 72 `MIDILike` decoding optimized;
* 72 deduplicating overlapping pitch bends during preprocess;
* 72 `tokenize_check_equals` test method and more test cases;
* 75 and 76 (73 and 74 by Chunyuan-Li) Fixing time signature encoding / decoding workflows for `Bar`/`Position`-based tokenizers (`REMI`, `CPWord`, `Octuple`, `MMM`);
* 76 `Octuple` is now tested with time signature disabled: as `TimeSig` tokens are only carried with notes, `Octuple` cannot accurately represent time signatures; as a result, if a time signature change occurs and the following bar does not contain any note, the time will be shifted by one or multiple bars depending on the previous time signature numerator and the time gap between the last and current note. We do not recommend using `Octuple` with MIDIs with several time signature changes (at least numerator changes);
* 76 `MMM` tokenization workflow speedup.
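
The behaviour described for `program_change` can be sketched as follows. This is a simplified illustration, not the MidiTok implementation: it assumes events are already merged in time order as `(program, token)` pairs, uses the program number to tell tracks apart, and the function name and token strings are made up for the example.

```python
def insert_program_tokens(events: list[tuple[int, str]]) -> list[str]:
    """Emit a Program token each time the program differs from the previous event's."""
    tokens = []
    current_program = None
    for program, token in events:
        if program != current_program:
            # Mimics a MIDI ProgramChange message: announce the new program once.
            tokens.append(f"Program_{program}")
            current_program = program
        tokens.append(token)
    return tokens


events = [(0, "Pitch_60"), (0, "Pitch_64"), (33, "Pitch_36"), (0, "Pitch_67")]
print(insert_program_tokens(events))
```

In this sketch, consecutive events from the same program share a single `Program` token, so the sequence stays shorter than if every note carried its own.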

2.1.5

Changelog

* 69 bacea19e70ba596a05fbbcf9f2bf53beb9714540 sort notes in all cases when tokenizing as MIDIs can contain unsorted notes;
* 70 (68) New `one_token_stream_for_programs` parameter that treats all tracks of a MIDI as a single stream of tokens (adding `Program` tokens before `Pitch`/`NoteOn`...). This option is enabled by default, and corresponds to the default code behaviour of the previous versions. Disabling it makes it possible to have `Program` tokens in the vocabulary (`config.use_programs` enabled) while converting each track independently;
* 70 (68) `TimeShift` and `Rest` tokens can now be created successively during the tokenization, happening when the largest `TimeShift` / `Rest` value of the tokenizer isn't sufficient;
* 70 (68) Rests are now represented using the same format as `TimeShift`s, and the `config.rest_range` parameter has been renamed `beat_res_rest` for simplicity and flexibility. The default value is `{(0, 1): 8, (1, 2): 4, (2, 12): 2}`;
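
The idea of emitting several `TimeShift` or `Rest` tokens in a row when the largest value is too small can be sketched with a greedy decomposition. The function name and the tick values below are assumptions for illustration; how MidiTok actually selects the successive values may differ.

```python
def split_duration(ticks: int, available: list[int]) -> list[int]:
    """Greedily cover `ticks` with the available values, largest first."""
    parts = []
    remaining = ticks
    for value in sorted(available, reverse=True):
        while remaining >= value:
            parts.append(value)
            remaining -= value
    # Any remainder smaller than the smallest value is dropped in this sketch.
    return parts


# Illustrative values in ticks; the largest single shift is 1920.
available = [120, 240, 480, 960, 1920]
print(split_duration(4560, available))
```

A gap of 4560 ticks exceeds the largest single value, so it is expressed as several successive values whose sum equals the gap.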

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.4...v2.1.5

Thanks to caenopy for reporting the bugs fixed here.

Compatibility

* Tokenizers of previous versions with the `rest_range` parameter will be converted to the new `beat_res_rest` format.

2.1.4

Changelog

* ilya16 2e1978f5c533b0989c2c4929f5e976511e06c6bb Fix in `save_tokens` method, reading `kwargs` in the json file saved;
* 67 Adding sustain pedal and pitch bend tokens for `REMI`, `TSD` and `MIDILike` tokenizers

Compatibility

* `MMM` now adds additional tokens in the same order as other tokenizers, meaning previously saved `MMM` tokenizers with these tokens would need to be converted if needed.