MidiTok

Latest version: v3.0.4

2.0.2

* 63110d7de15a140b6b8b79db794e0034e8ad222e Fix in the `_ids_are_bpe_encoded` method

2.0.1

Changes

* e26b088531befab41d9f8a7f6d7244498b544861 From atsukoba, with help from muthissar: [`REMI+`](https://openreview.net/forum?id=NyR8OZFHw6i) is now implemented! 🎉 This multitrack tokenization can be seen as an extension of `REMI`.
* 29622115f6061579f5de5502bbcea8b05c3712a0 *Chord* tokens can now represent the root note within tokens (previously only the chord quality). Chord parameters have to be specified in the `additional_tokens` argument, with the keys `chord_maps`, `chord_tokens_with_root_note` and `chord_unknown`. You can use the [default values](https://github.com/Natooz/MidiTok/blob/main/miditok/constants.py#L41) as an example (see the sketch after this list).
* e402b0d42f7eb39eeb074d439e839e63bf8a1098 The `_in_as_seq` decorator now automatically checks whether the input ids are BPE-encoded
* 2064ee944494d0d0583418ab6a2670c7861e561a Fix for BPE merges containing spaces, which made trained tokenizers impossible to load
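
Below is a minimal sketch of enabling chord tokens with root notes. The `chord_*` keys follow the changelog above; importing `ADDITIONAL_TOKENS` from `miditok.constants` and the exact constructor usage are assumptions based on the v2.0.x API:

```python
from copy import deepcopy

from miditok import REMI
from miditok.constants import ADDITIONAL_TOKENS  # assumed default config dict

# Start from the library defaults and enable chord tokens with root notes
additional_tokens = deepcopy(ADDITIONAL_TOKENS)
additional_tokens["Chord"] = True                        # enable Chord tokens
additional_tokens["chord_tokens_with_root_note"] = True  # encode the root note in the token
additional_tokens["chord_unknown"] = (3, 6)              # detect unknown chords of 3 to 6 notes

tokenizer = REMI(additional_tokens=additional_tokens)
```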

Compatibility

* Due to 2064ee944494d0d0583418ab6a2670c7861e561a, bytes and merges are shifted relative to v2.0.0. BPE tokenizers trained with v2.0.0 are incompatible and must either be retrained, or have the bytes in their vocabularies and merges shifted. This only applies to BPE.

2.0.0

TL;DR

This major update brings:

* The integration of the Hugging Face [🤗tokenizers](https://github.com/huggingface/tokenizers) library as the Byte Pair Encoding (BPE) backend. **BPE is now 30 to 50 times faster, for both training and encoding!** 🙌 (see the training sketch after this list)
* A new **`TokSequence`** object to represent tokens! This object holds tokens as strings, ids (integers to pass to models), `Event`s, and bytes (used internally for BPE).
* Many internal changes, with methods and variables renamed, which require you to update some of your code (details below).
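
A minimal sketch of training BPE with the new backend, assuming previously tokenized JSON files; the `learn_bpe` parameters and the `save_params` call are assumptions based on the v2.x API, and the paths are placeholders:

```python
from pathlib import Path

from miditok import REMI

tokenizer = REMI()  # default parameters

# Train BPE on previously tokenized files, using the 🤗tokenizers backend
tokenizer.learn_bpe(
    vocab_size=500,  # example target vocabulary size
    tokens_paths=list(Path("dataset_tokens").glob("**/*.json")),  # placeholder
)

tokenizer.save_params("tokenizer_with_bpe.json")  # saves config and vocabulary
```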

Changes
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a The `Vocabulary` class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a New `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a `__getitem__` now handles both ids (int) and tokens (str), with multi-vocab;
* 36bf0f66a392835e492a9f7decf7e382662f30aa Some methods of `MIDITokenizer` meant to be used internally are now protected;
* a2db7b9ed173149c34e9f01e3d873317a185288f New training method with 🤗tokenizers BPE model;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `TokSequence` object, used as the input and output object of the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `complete_sequence` method to automatically fill in the uninitialized attributes of a `TokSequence` (ids, tokens);
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `tokens_to_events` renamed `_ids_to_tokens`, and new recursive id / token / byte conversion methods;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 **Tokens are now saved and loaded with the `ids` key (previously `tokens`)**;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 Tokenization files moved to a dedicated `tokenizations` module;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 `decompose_bpe` method renamed `decode_bpe`;
* d5201287b93fd42121b76b39e08cabf575e9bdcd `tokenize_dataset` now allows applying BPE afterwards.
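
A minimal sketch of the new `TokSequence` workflow; the MIDI path is a placeholder, and `MidiFile` comes from miditoolkit, which MidiTok uses for MIDI I/O:

```python
from miditok import REMI, TokSequence
from miditoolkit import MidiFile

tokenizer = REMI()  # default parameters
midi = MidiFile("path/to/file.mid")  # placeholder path

# midi_to_tokens (also called with tokenizer(midi)) now returns
# a list of TokSequence objects, one per track
tok_sequences = tokenizer(midi)
seq = tok_sequences[0]
print(seq.ids)     # token ids (int), to pass to a model
print(seq.tokens)  # the same tokens as strings

# A TokSequence built from ids alone can be completed in place
partial = TokSequence(ids=seq.ids)
tokenizer.complete_sequence(partial)  # fills in the tokens (and bytes for BPE)
```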

Compatibility

Tokens and tokenizers from v1.4.3 and earlier are compatible; this update does not change the tokenizations themselves.
However, you will need to adapt your saved files to load them, and to update some of your code for the new API:

* Tokens are now saved and loaded with the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with json and rewrite them with the `ids` key instead (see the sketch after this list);
* `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding tokens as strings and their ids (int). It previously returned token ids directly; you can now get them via the `.ids` attribute, as `tokseq.ids`;
* `Vocabulary` class deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now directly integrated in `MIDITokenizer`;
* For all tokenizers, the `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
* `decompose_bpe` method renamed `decode_bpe`.
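
A hypothetical migration sketch for token files saved with v1.4.3 or earlier; the directory name is a placeholder:

```python
import json
from pathlib import Path

# Rewrite token files saved under the old "tokens" key with the new "ids" key
for file_path in Path("dataset_tokens").glob("**/*.json"):  # placeholder directory
    with open(file_path) as f:
        data = json.load(f)
    if "tokens" in data and "ids" not in data:
        data["ids"] = data.pop("tokens")
        with open(file_path, "w") as f:
            json.dump(data, f)
```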

Bug reports

Big changes can bring hidden bugs. We carefully checked that all methods pass the previous tests, while assessing the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to solve them as quickly as possible.

1.4.3

Changes
* 77f7c533f333ef8926e7bc5f9cb11270a3560322 dinhviettoanle (24): fix for a bug that skipped token repetitions with BPE
* **New documentation**: [miditok.readthedocs.io](https://miditok.readthedocs.io/en/latest/). We finally have a proper documentation website! 🙌 It comes with many improvements and fixes in the docstrings.
* 201c9b738d3e29aaa11a4e99a3acfeedb59dd4a4 Legacy `REMIEncoding`, `MIDILikeEncoding` and `CPWordEncoding` classes removed.
* e92a414dfbf4f38e64c256eeab29bad980804d9b `token_types_errors` of the `MIDITokenizer` class now handles basic / common error cases
* Minor code improvements
* 14862046f974ca889598dddbaaef67d45a008a8c Use of `dataclasses`. This means Python 3.6 and earlier are no longer compatible. Python 3.6 was compatible, but not supported (tested), up to v1.4.2.

1.4.2

Changes
* f6225a19df21dfcaee9a1c55a001a9b061aaaf24 Added the option to have a `SEP` special token, which can be used to train models on tasks such as next-sequence prediction
* bb2451208c47bfddb40a0b055dcaa5a6b8eb3f3e Data augmentation can now receive the `all_offset_combinations` argument, which performs augmentation with all combinations of offsets. With offsets $(x_1, x_2, x_3)$, it performs a total of $\prod_i x_i$ combinations ($\prod_i (x_i \times 2)$ if going both up and down). This is disabled by default to save you from hundreds of augmentations 🤓 (and is not chained with `tokenize_midi_dataset`); by default, augmentations are done on the original input only. A worked example follows this list.
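
For illustration (hypothetical numbers): with offsets $(x_1, x_2, x_3) = (2, 2, 2)$, `all_offset_combinations` performs $2 \times 2 \times 2 = 8$ augmentations, and $4 \times 4 \times 4 = 64$ if each offset is applied both up and down.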

1.4.1

Changes
* 0e9131d7325eaef1425fb53b99941110696063e3 Bugfix in the `tokenize_midi_dataset` method when directly performing data augmentation; a block was not indented as it should have been
