Miditok

Latest version: v3.0.5.post1

1.3.0

Highlight

Version 1.3.0 changes the way the vocabulary, and by extension tokenizers, handle special tokens: `PAD`, `SOS`, `EOS` and `MASK`. It brings a cleaner way to instantiate these classes.
It might bring incompatibilities with data and models used with previous MidiTok versions.

Changes
* b9218bf7cba9695f37f74424325037bb4dda1cb7 `Vocabulary` class now takes a `pad` argument specifying whether to include the special padding token. This option is set to True by default, as it is more common to train networks with batches of unequal sequence lengths.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 `Vocabulary` class: the `event_to_token` argument of the constructor is renamed `events` and has to be given as a list of events.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 `Vocabulary` class: when adding a token to the vocabulary, the index is automatically set. The index argument is removed as it could cause issues / confusion when mapping indexes with models.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 The `Event` class now takes the `value` argument in second position
* b9218bf7cba9695f37f74424325037bb4dda1cb7 Fix when learning BPE if `files_lim` was higher than the number of files itself
* f9cb1098df1334fb618550ea5c946657aea05881 For all tokenizers, a new constructor argument `pad` specifies whether to use a padding token, and the `sos_eos_tokens` argument is renamed to `sos_eos`
* f9cb1098df1334fb618550ea5c946657aea05881 When creating a `Vocabulary`, the *SOS* and *EOS* tokens are now registered before the *MASK* token. This way the order matches the order of the special token arguments in tokenizer constructors, and the *SOS* and *EOS* tokens are more commonly used in symbolic music applications.
* 84db19d16efb3db566d468e6c71a912b1193ae05 The dummy `StructuredEncoding`, `MuMIDIEncoding`, `OctupleEncoding` and `OctupleMonoEncoding` classes were removed from `__init__.py`. These classes from early versions had no record of being used. Other dummy classes (REMI, MIDILike and CPWord) remain.
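
The registration order and automatic indexing described above can be pictured with a toy class. This is a minimal sketch, not MidiTok's `Vocabulary`: the class name, the `mask` argument, the attribute names and the token spellings (`PAD`, `SOS`, `EOS`, `MASK`) are all assumptions made for illustration.

```python
class ToyVocabulary:
    """Illustrative stand-in; NOT MidiTok's Vocabulary class."""

    def __init__(self, events, pad=True, sos_eos=False, mask=False):
        self.event_to_token = {}
        self.token_to_event = {}
        # Special tokens come first, in the order PAD, SOS, EOS, MASK.
        if pad:
            self.add_event("PAD")
        if sos_eos:
            self.add_event("SOS")
            self.add_event("EOS")
        if mask:
            self.add_event("MASK")
        # Regular events (given as a list) follow the special tokens.
        for event in events:
            self.add_event(event)

    def add_event(self, event):
        """Register an event; the index is assigned automatically."""
        if event in self.event_to_token:
            raise ValueError(f"{event!r} is already in the vocabulary")
        index = len(self.event_to_token)  # next free index, never user-chosen
        self.event_to_token[event] = index
        self.token_to_event[index] = event
        return index


vocab = ToyVocabulary(["Bar", "Pitch_60"], sos_eos=True, mask=True)
```

In this toy layout a regular event such as `Bar` lands after every special token, which mirrors why indices can shift between versions when the set or order of special tokens changes.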

Compatibility
* You might need to update your code when creating your tokenizer to handle the new `pad` argument.
* Data tokenized with **REMI**, and models trained on it, will be incompatible with v1.3.0 if you used special tokens. The *BAR* token was previously at index 1, and is now added after special tokens.
* If you created a custom tokenizer inheriting from `MIDITokenizer`, make sure to update the calls to `super().__init__` with the new `pad` arg and the renamed `sos_eos` arg (example for MIDILike: [f9cb109](https://github.com/Natooz/MidiTok/commit/f9cb1098df1334fb618550ea5c946657aea05881#diff-ee81c42cbf9a0e5150c860773b29d8d75edbe6774b95837bbcb09272d6408883))
* **If you used both *SOS/EOS* and *MASK* special tokens**, their order (indexes) is now swapped, as *SOS/EOS* are now registered before *MASK*. As these tokens are not used during tokenization, **your previously tokenized datasets remain compatible**, unless you intentionally inserted *SOS*/*EOS*/*MASK* tokens. **Trained models will however be incompatible**, as the indices are swapped. If you want to use v1.3.0 with a previously trained model, you can manually swap the predictions for these tokens.
* No incompatibilities outside of these cases
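
The manual remapping mentioned for *SOS/EOS* and *MASK* could look roughly like the sketch below. The two index layouts are assumptions made for illustration (old order PAD, MASK, SOS, EOS; new order PAD, SOS, EOS, MASK), so read the real indices from your own saved vocabularies first. The helper names are hypothetical and not part of MidiTok.

```python
# ASSUMED index layouts; verify against your own old and new vocabularies.
OLD_IDS = {"PAD": 0, "MASK": 1, "SOS": 2, "EOS": 3}
NEW_IDS = {"PAD": 0, "SOS": 1, "EOS": 2, "MASK": 3}


def build_old_to_new(old_ids, new_ids):
    """Map each old special-token index to its new index."""
    return {old_ids[name]: new_ids[name] for name in old_ids}


def remap_token_ids(token_ids, old_to_new):
    """Translate ids predicted by an old model; other ids pass through."""
    return [old_to_new.get(tok, tok) for tok in token_ids]


def remap_logits(logits, old_to_new):
    """Reorder one logits vector from the old layout to the new one."""
    reordered = list(logits)
    for old, new in old_to_new.items():
        reordered[new] = logits[old]
    return reordered


OLD_TO_NEW = build_old_to_new(OLD_IDS, NEW_IDS)
```

This only covers the special tokens. It does not address the REMI case above, where the *BAR* token also moved.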

**Please reach out if you have any issue / question!** 🙌

1.2.9

Changes
* 212a9436bd223873cf535495bd545f21ad431fe5 **BPE**: Speed boost in `apply_bpe` method, about 1.5 times faster 🚀
* 4b8ccb9b8be25c5be46af65dbfb24763353791ed **BPE**: `tokens_to_events` method is no longer in-place
* be3e244ee5647b71f97055362ba597f45319260d `save_tokens` method now takes `**kwargs` arguments to save additional information in json files
* b690cabfed30733723e0ebc844750c129199b5d7 Fix when computing the `max_tick` attribute of a MIDI that has tracks with no notes
* f1855b6870c3e28fec687b7acd621a2f9be2aec6 MidiTok package version is now saved with tokenizer parameters. This makes it possible to keep track of the version used.
* Lint and coverage improvements ✨
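
As an illustration of forwarding `**kwargs` into a saved json file, here is a minimal sketch using only the standard library. It is not MidiTok's `save_tokens` implementation: the function names, the `"tokens"` key and the file layout are assumptions.

```python
import json
from pathlib import Path


def save_tokens_sketch(tokens, path, **kwargs):
    """Write tokens plus any extra keyword arguments to a json file.

    Hypothetical helper: the "tokens" key is an assumed layout.
    """
    payload = {"tokens": tokens, **kwargs}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_tokens_sketch(path):
    """Read back the dictionary written by save_tokens_sketch."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A call such as `save_tokens_sketch(tokens, "track.json", composer="unknown")` would then store the extra `composer` field next to the tokens.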

Compatibility
* If you explicitly used `tokens_to_events`, you might need to adapt your code, as it is no longer in-place.
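
The difference between an in-place and a non-in-place conversion can be shown with two toy functions. These are hypothetical helpers for illustration only; they do not reproduce the signature of `tokens_to_events`.

```python
def decode_inplace(tokens, id_to_event):
    """Old style: overwrite the caller's list and return nothing."""
    for i, tok in enumerate(tokens):
        tokens[i] = id_to_event[tok]


def decode(tokens, id_to_event):
    """New style: leave the input untouched and return a new list."""
    return [id_to_event[tok] for tok in tokens]


id_to_event = {0: "Bar", 1: "Pitch_60"}  # toy mapping

seq = [0, 1, 1]
decode_inplace(seq, id_to_event)   # seq itself now holds the events

seq2 = [0, 1, 1]
events = decode(seq2, id_to_event)  # the result must be captured
```

Code written for the old behaviour that ignores the return value would silently keep working on unconverted tokens, which is the adaptation the note above refers to.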

1.2.8

Changes
* 82b2a1b16283f5191a40d42d81339f0a00d01ff4 Fix in `MuMIDI` `token_types_errors()`
* 0869c23a2abf2c4bc622d49acf79a5aa104f1621 Fix, BPE tokenizers now update the vocabulary `_token_types_indexes` attribute after being modified
* b3642c1bf02b435e9dda990a0fae5e792dcab7a7 `EOS` key added to `token_types_graph`, prevents crash just in case
* 7d873ca196344581803fd228112e058f7e1e471b MIDI objects converted from tokens now have `max_tick` attribute calculated
* 770d8b837e8ff0e781d621e5d503efc1ccd4db02 0869c23a2abf2c4bc622d49acf79a5aa104f1621 small fixes and typo corrections
* Fixes in tests and GitHub Action integration

Compatibility
* All good!

1.2.7

Changes
* 22fee1dba39630f80d0bc4341ce03bd9328b9692 TimeSignature parameter automatically set to False for incompatible tokenizers, also fixing a bug when it was not provided by the user
* 2e958f1dd761c5eb8ab115c3f2ebe627b010be5a TimeSignature of MIDI set to 4/4 if the original MIDI had none (rare but can happen)
* a46fd561d264cfe152088a616cbb94d9024592e9 unused import removed
* f416ff527198b378d5f5032eb0725ce0a61fe6ce BPE calculation in the `apply_bpe` method sped up by precomputing token successions in a class attribute
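
The idea of precomputing token successions can be sketched as a lookup table built once and reused for every sequence. This is a simplified greedy left-to-right merge, not MidiTok's `apply_bpe` and not a full BPE that replays merges in learned order. All names are hypothetical.

```python
def build_successions(merges):
    """Precompute {first_token: {second_token: merged_token}} once."""
    table = {}
    for (first, second), merged in merges.items():
        table.setdefault(first, {})[second] = merged
    return table


def apply_bpe_sketch(tokens, successions):
    """Greedily merge adjacent tokens using the precomputed table."""
    out = []
    for tok in tokens:
        out.append(tok)
        # Keep merging while the last two tokens form a known succession.
        while len(out) >= 2 and out[-1] in successions.get(out[-2], {}):
            second = out.pop()
            first = out.pop()
            out.append(successions[first][second])
    return out


# Toy merges: (1, 2) -> 10, then (10, 3) -> 11.
SUCCESSIONS = build_successions({(1, 2): 10, (10, 3): 11})
merged = apply_bpe_sketch([1, 2, 3, 1, 2], SUCCESSIONS)
```

Building the table once avoids rescanning the merge list for every pair, which is the kind of saving the entry above describes.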

Compatibility
* All good!

1.2.6

Changes
* 168c8c32230e1a3b714a6ab844b2c6e1825ad0c9 Bugfix in Octuple vocabulary creation, now only creates the selected programs
* bfe987e967a3704a3a5f50538a9134fe95181392 Fix in **MuMIDI** and **Octuple** `token_types_errors` methods, which could crash when analyzing special tokens (PAD, MASK, ...)
* 956738765147d7935088eb9e3e55bc8a4ab37271 Bugfix in **CPWord** decoding (crash with special tokens), and **Octuple** now saves `_sos_eos` and `_mask` attributes in `save_params`

Compatibility
* All good!

1.2.5

Changes
* 67c2926542528913ce820698a874d7324517d890 Introducing **TSD** tokenization (Time Shift Duration). It is similar to **MIDI-Like** but uses `Duration` tokens instead of `Note-Off`, and its main difference with **REMI** is the way it represents time.
* 8af6a6b074a5c38cb5fd68598a21732df1a805a7 `_add_pad_type_to_graph` method has been renamed `_add_special_tokens_to_types_graph`, and now also adds `SOS`, `EOS`, and `MASK` tokens to the graph.
* f755c70036af68c2faa62b408061bcd7fee94f06 and 4b069a2fe4887aa105f969e9e39d9cf66cb5092b Added the `add_bpe_to_tokens_type_graph` method for byte pair encoding, fixing a bug when loading a tokenizer from a config file.

Compatibility
* `_add_pad_type_to_graph` is still supported but will be removed in a future update. Replace it with `_add_special_tokens_to_types_graph` in your code to stay up to date.
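
A common way to keep an old method name working during such a transition is a thin alias that warns and delegates. The sketch below shows that pattern on a toy class. Whether MidiTok emits a warning, and what its graph attribute looks like, are assumptions here.

```python
import warnings


class ToyTokenizer:
    """Illustrative stand-in; NOT MidiTok's MIDITokenizer."""

    def __init__(self):
        # Toy graph: token type -> token types allowed to follow it.
        self.graph = {"Pitch": ["Velocity"], "Velocity": ["Pitch"]}

    def _add_special_tokens_to_types_graph(self):
        """New name: registers PAD, SOS, EOS and MASK in the graph."""
        for special in ("PAD", "SOS", "EOS", "MASK"):
            self.graph.setdefault(special, [])

    def _add_pad_type_to_graph(self):
        """Old name, kept as a deprecated alias of the new method."""
        warnings.warn(
            "_add_pad_type_to_graph is deprecated, use "
            "_add_special_tokens_to_types_graph instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._add_special_tokens_to_types_graph()
```

Subclasses that still call the old name keep working, while the warning points to the call site that needs updating.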
