MidiTok

Latest version: v3.0.5.post1

Page 5 of 11

1.4.2

Changes
* f6225a19df21dfcaee9a1c55a001a9b061aaaf24 Added the option to have a `SEP` special token, that can be used to train models to perform tasks such as "Next sequence prediction"
* bb2451208c47bfddb40a0b055dcaa5a6b8eb3f3e Data augmentation can now receive the `all_offset_combinations` argument, which performs augmentation with every combination of offsets. With the offsets $\left( x\_1 , x\_2 , x\_3 \right)$, it performs a total of $\prod\_i x\_i$ combinations ( $\prod\_i (x\_i \times 2)$ if going both up and down). This is disabled by default to save you from hundreds of augmentations 🤓 (and is not chained with `tokenize_midi_dataset`); by default, augmentations are done on the original input only.
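The combination count above can be checked with a stdlib-only sketch (the offset values below are made up for illustration; they are not MidiTok defaults):

```python
import math
from itertools import product

# Hypothetical per-dimension offsets (e.g. pitch, velocity, duration):
# an offset of x means shifts 1..x are tried in that dimension.
offsets = (2, 3, 2)  # x_1, x_2, x_3

# One direction only: prod_i x_i combinations
one_way = math.prod(offsets)

# Both directions (up and down): prod_i (2 * x_i) combinations
both_ways = math.prod(2 * x for x in offsets)

# Enumerating explicitly with itertools.product gives the same count
combos_up = list(product(*(range(1, x + 1) for x in offsets)))

print(one_way, both_ways, len(combos_up))  # 12 96 12
```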

1.4.1

Changes
* 0e9131d7325eaef1425fb53b99941110696063e3 Bugfix in the `tokenize_midi_dataset` method when directly performing data augmentation: a block was not indented as it should be

1.4.0

This pretty big update brings data augmentation, some bug fixes and optimizations, allowing you to write more elegant code.

Changes
* 8f201e0299b503ce2b6976cbfa1d39660b4c3efe 308fb278dd1df2dee79541b8c85773a619ef5b02 [**Data augmentation**](https://github.com/Natooz/MidiTok/tree/main/miditok/data_augmentation) methods! 🙌 They can be applied to both MIDIs and tokens, to augment data by shifting the pitch, velocity and duration values.
* 1d8e9034c9354a60249de05dc938eb8e880f366e You can perform data augmentation while tokenizing a dataset (`tokenize_midi_dataset` method) with the `data_augment_offsets` argument. This is done at the token level, as it is faster than augmenting MIDI objects.
* 0634adee1f050fb51eed1d73ef39f982573c5d7d **BPE** is now implemented in the main tokenizer class! This means all tokenizers can benefit from it in a much prettier way!
* 0634adee1f050fb51eed1d73ef39f982573c5d7d **`bpe` method renamed to `learn_bpe`**, and it now returns metrics (also shown in the progress bar during learning) on the number of token combinations and the sequence length reduction
* 7b8c9777cb0866a179b64e50c26c6c7cccec5cee Backward compatibility when loading tokenizer config files with BPE from older versions
* 3cea9aa11c238486a71dff82d244a0f16a8a52e9 nturusin: fixes in the training section of the GPT-2 Hugging Face music transformer example notebook
* 65afa6b1aaa35e0df396276f9811f12d10f67ea6 The `tokens_to_midi` and `save_tokens` methods **can now receive tokens as tensors and NumPy arrays. PyTorch, TensorFlow and JAX (NumPy) tensors are supported**. The `convert_tokens_tensors_to_list` decorator will convert them to lists; you can use it on your custom methods.
* aab64aa4159ee27022b359597ece3154dc224513 The `__call__` magic method now automatically routes to `midi_to_tokens` or `tokens_to_midi` depending on what you give it. You can now use tokenizers more elegantly, as `tokenizer(midi_obj)` or `tokenizer(generated_tokens)`.
* e90b20a86283aa1dab2db071f5c6a49b161caa42 Bugfix in `Structured` causing a possible infinite `while` loop with illegal token type successions
* 947af8cfa0c72212a8835e10fe1f804356d13e8a Big refactor of MuMIDI, which now has a fixed vocabulary / type indices. It is easier to handle and use. (thanks gonzaloarca)
* 947af8cfa0c72212a8835e10fe1f804356d13e8a CPWord "Ignore" tokens are all renamed `Ignore_None` by convention, making operations easier in data augmentation and other methods.
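The `__call__` routing described above can be illustrated with a toy, stdlib-only sketch (the `MidiObject` and `ToyTokenizer` classes below are stand-ins for illustration, not MidiTok's actual types):

```python
class MidiObject:
    """Placeholder for a parsed MIDI file object."""

class ToyTokenizer:
    def midi_to_tokens(self, midi):
        return [0, 1, 2]  # dummy token ids

    def tokens_to_midi(self, tokens):
        return MidiObject()  # dummy MIDI object

    def __call__(self, obj):
        # Route on the argument's type, so that tokenizer(midi) tokenizes
        # and tokenizer(tokens) detokenizes.
        if isinstance(obj, MidiObject):
            return self.midi_to_tokens(obj)
        return self.tokens_to_midi(obj)

tokenizer = ToyTokenizer()
tokens = tokenizer(MidiObject())  # routes to midi_to_tokens
midi = tokenizer([0, 1, 2])       # routes to tokens_to_midi
```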

Compatibility
* code using BPE has to be updated: remove `bpe(tokenizer)` and just declare tokenizers normally, and rename the `bpe` method to `learn_bpe`
* MuMIDI tokens and tokenizers will be incompatible with v1.4.0
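The tensor-to-list conversion mentioned in this release can be approximated with a few lines of duck typing; this is only a sketch of the idea (the decorator name, `FakeTensor` and `Demo` below are stand-ins, not MidiTok code):

```python
from functools import wraps

def tokens_to_list(func):
    """Sketch of a tensor-to-list decorator: if the tokens argument
    exposes a .tolist() method (as NumPy arrays and PyTorch tensors do),
    convert it to a plain Python list before calling the method."""
    @wraps(func)
    def wrapper(self, tokens, *args, **kwargs):
        if hasattr(tokens, "tolist"):
            tokens = tokens.tolist()
        return func(self, tokens, *args, **kwargs)
    return wrapper

class FakeTensor:
    """Stand-in for a real tensor type."""
    def __init__(self, data):
        self._data = data
    def tolist(self):
        return list(self._data)

class Demo:
    @tokens_to_list
    def first_token(self, tokens):
        assert isinstance(tokens, list)  # always a list inside the method
        return tokens[0]

first = Demo().first_token(FakeTensor((7, 8, 9)))  # -> 7
```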

1.3.3

Changes
* 4f4e49ef4ebd00d98f0b0c02b3689cfc9bba122b Bugfix in the `len` magic method with multi-vocabulary tokenizers; `len` is now also a property
* 925c7ae2eb60b8c64066bf41c974292db56cbac8 & 5b4f4102691e14aa07b02d2f8c9eb34a484b5221 Bugfix of token types initialization when loading tokenizer from params file
* c873456ad0e3017239a55770daed7b00f39d11cb Removed hyphens from token type names, for better visibility. By convention, token types are all written in CamelCase.
* 5e51e843126af06f2baa5e9ad64fb4c634f787cb New `multi_voc` property
* b3b0cc7c6f1f8cca453d5d50f76c459c49d5c910 `tokenize_dataset`: the progress bar now shows the name of the saving directory

Compatibility
* All good 🙌

1.3.2

Changes
* Fansesi - f92f4aa98c4e407de9ca1d925e47b44833050aba Corrects a bug when using `tokenize_dataset` with `out_dir` as a non-`Path` object (issue 18)
* 27240627e2f65225e4fc398edb5b212cba8f18de Bugfix when using `files_lim` with `bpe`

Compatibility
* All good 🙌

1.3.1

Highlights
This version uniformly cleans up how `save_params` is called, and brings related minor fixes and new features.

Changes
* 3c4adf808c244fcb95f0da476227a588fccf01c6 Tokenizers now take a `unique_track` argument at creation. This parameter specifies whether the tokenizer represents and handles music as a single track / stream of tokens. This is the case for Octuple and MuMIDI, and probably most representations that natively support multitrack music. **If True, the tokens will be saved in JSON files as a single track. This parameter can then help when loading tokenized datasets.**
* 3c4adf808c244fcb95f0da476227a588fccf01c6 `save_params` method: `out_dir` argument renamed to `out_path`
* 3c4adf808c244fcb95f0da476227a588fccf01c6 `save_params` method: `out_path` can now specify the full path and name of the config file saved
* 3c4adf808c244fcb95f0da476227a588fccf01c6 fixes in `save_params` method for MuMIDI
* 3c4adf808c244fcb95f0da476227a588fccf01c6 The current version number is fixed (it was 1.2.9 instead of 1.3.0 for v1.3.0)
* 4be897bbdf8b84e74c5230449d28ed5dd7f1b8d5 The `bpe` method (learning the BPE vocabulary) now has a `print_seq_len_variation` argument, to optionally print the mean sequence length before and after BPE, and the variation in % (default: `True`)
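The new `out_path` behaviour (accepting either a directory or a full file path) can be sketched with a hypothetical stdlib-only helper; `save_params` below is a stand-in for the real method, and the default file name is an assumption for illustration:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def save_params(config: dict, out_path) -> Path:
    """Sketch: if out_path has no file suffix, treat it as a directory
    and write a default 'config.txt' inside it; otherwise treat it as
    the full path and name of the config file."""
    out_path = Path(out_path)
    if out_path.suffix == "":  # directory given
        out_path.mkdir(parents=True, exist_ok=True)
        out_path = out_path / "config.txt"
    else:  # full file path given
        out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(config))
    return out_path

with TemporaryDirectory() as tmp:
    p1 = save_params({"unique_track": True}, tmp)  # directory -> default name
    p2 = save_params({"unique_track": True}, Path(tmp) / "my_tok.txt")
```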

Compatibility
* You might need to update your code when:
* * creating a tokenizer, to handle the new `unique_track` argument.
* * saving a tokenizer's config, to handle the `out_dir` argument renamed to `out_path`.
* Datasets tokenized with BPE will need the `token_to_event` key changed to `vocab` in the associated tokenizer configuration file
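That key rename can be done with a short stdlib script; this is a sketch, not an official migration tool, and the file name and contents below are illustrative:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def migrate_config(path: Path) -> None:
    """Rename the 'token_to_event' key to 'vocab' in a tokenizer
    configuration file, leaving other keys untouched."""
    config = json.loads(path.read_text())
    if "token_to_event" in config:
        config["vocab"] = config.pop("token_to_event")
    path.write_text(json.dumps(config))

with TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.txt"  # illustrative file name
    cfg.write_text(json.dumps({"token_to_event": {"0": "PAD_None"}}))
    migrate_config(cfg)
    migrated = json.loads(cfg.read_text())
```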
