Miditok

Latest version: v3.0.5.post1

1.2.4

Changes
* **[Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)** is now available! It works with any tokenizer (except multi-embedding ones such as CP Word or Octuple) as a wrapper, used as bpe(tokenizer_class, params) (see the example in the README)
* 72a0f323c9234d3886f9c4867c44fa270d09db84 The **Vocabulary** class now has an update_token_types_indexes method to create its _token_types_indexes attribute, which can be called after loading a tokenizer with its saved vocabulary (as with BPE)
* d232f4abb4ca6ceb388a8e04062e1fae98873f98 **Structured** now takes additional_tokens as a constructor argument, to align with all other tokenizers
* 4b0dc9ff38021a375213149c005d4f2971a6ea0a Bugfix in **MIDITokenizer** base class for rest and beat range attributes when loading class from params
* eb3612fd194e1eae960756f78a6eae80f2e44e67 save_tokens now saves tokens as a dictionary with *tokens* and *programs* keys so that the distinction is clear
* tqdm is now used (and required) in tokenize_dataset and bpe methods

Compatibility
* Structured now takes additional_tokens as a constructor argument, to align with all other tokenizers
* As of v1.2.4, tokens saved with the save_tokens method are stored as a dictionary, so that tracks and programs can no longer be confused (as they could before). You can still load tokens saved with < v1.2.4 using load_tokens with no consequences, since you then handle how to index into the result yourself.
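
The note on the saved token layout can be sketched as a small reader that accepts both layouts. This is only an illustration: read_saved_tokens is a hypothetical helper, not part of MidiTok, and it assumes the files are plain JSON. For files saved before v1.2.4 it returns the raw content untouched, since the older layout is not described here.

```python
import json
from pathlib import Path


def read_saved_tokens(path):
    """Return (tokens, programs) from a saved token file (hypothetical helper).

    Assumes the file is plain JSON. For the dictionary layout introduced in
    v1.2.4, the "tokens" and "programs" keys are read explicitly. For any
    other layout the raw content is returned with programs set to None,
    leaving the indexing to the caller.
    """
    data = json.loads(Path(path).read_text())
    if isinstance(data, dict):
        return data["tokens"], data.get("programs")
    return data, None
```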

1.2.3

Changes
* 87db4802bb70e056741154bdb58c993c88527868 fix in merge_tracks_per_class, some tracks were omitted when filtering pitch / tessitura

1.2.2

Changes
* bd951ec6406dee5c09521ee0c954a4fd6661c5ae merge_tracks_per_class can now remove notes whose pitch falls outside the recommended range (tessitura) defined by the General MIDI 2 specs. Use the filter_pitches argument.
* 611754d107c645f78fd72f0946d0a5d76de816cf MuMIDI and Octuple now accept custom sets of programs, which reduces their vocabulary size. Use the program argument when constructing the tokenizers.
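
The idea behind pitch filtering can be sketched in a few lines. This is not the library's implementation: drop_out_of_range_notes is a hypothetical helper, notes are represented as plain (pitch, start, end) tuples, and the range below is an illustrative value rather than one taken from the General MIDI 2 specs.

```python
# Illustrative recommended pitch range for one instrument (assumed value).
EXAMPLE_PITCH_RANGE = range(21, 109)


def drop_out_of_range_notes(notes, pitch_range):
    """Keep only the (pitch, start, end) notes whose pitch is in pitch_range."""
    return [note for note in notes if note[0] in pitch_range]


notes = [(10, 0, 4), (60, 0, 4), (108, 4, 8), (120, 8, 12)]
kept = drop_out_of_range_notes(notes, EXAMPLE_PITCH_RANGE)
```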

1.2.1

Changes from 4141e00ae68cb27fd85a630bf6a5ec28516b0251
* The get_midi_programs, remove_duplicated_notes, detect_chords, merge_tracks, merge_same_program_tracks and current_bar_pos methods have been moved from miditok/midi_tokenizer_base.py to miditok/utils.py. You can call them with **miditok.utils.the_method()**
* New method merge_tracks_per_class, which merges the tracks of a MIDI that belong to the same instrument class
* MIDI_INSTRUMENTS pitch range value changed from tuple to range
* INSTRUMENT_CLASSES changed from type Dict[int, Tuple[int, str]] to List[Dict[str, Union[str, range]]] so it fits the format of other constants. The index of the list corresponds to the index of each class.
* INSTRUMENT_CLASSES_RANGES replaced by CLASS_OF_INST to easily get the class of any instrument / track from its program
* Minor cleanup of imports

Compatibility
* See the first point above if you used the utils functions
* See above if you used the MIDI_INSTRUMENTS, INSTRUMENT_CLASSES or INSTRUMENT_CLASSES_RANGES constants
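
A minimal sketch of how the reshaped constants relate to each other, using a shortened stand-in for the real data. The key names "name" and "program_range" and the class_name helper are assumptions made for this example; check the constants in your installed version before relying on them.

```python
# Shortened stand-in: one dict per class, list index = class index.
# The key names are assumptions for this sketch.
INSTRUMENT_CLASSES = [
    {"name": "Piano", "program_range": range(0, 8)},
    {"name": "Chromatic Percussion", "program_range": range(8, 16)},
    {"name": "Organ", "program_range": range(16, 24)},
]

# One entry per program: position = program number, value = class index.
CLASS_OF_INST = [
    class_index
    for class_index, inst_class in enumerate(INSTRUMENT_CLASSES)
    for _ in inst_class["program_range"]
]


def class_name(program):
    """Return the class name of a program (hypothetical helper)."""
    return INSTRUMENT_CLASSES[CLASS_OF_INST[program]]["name"]
```

With this layout, looking up the class of a track is a single list access on its program number.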

1.2.0

Changes
* 7fe9df6ec224fcd034f2de31f87e1c01b44ccffc becea47db12efd464282eecdaf8c0a6aa6a40dca : CP Word, Octuple and MuMIDI tokenizers now have several Vocabulary objects within self.vocab, one for each token type (Pitch, Duration ...). This makes it easy to create several input / output layers of different sizes, matching the vocabulary size of each token type. Example [here](https://github.com/Natooz/MidiTok/issues/13#issuecomment-1126423904)
* 05c1ab97ec39307ba9282f1b261cac61ea17bf5d MIDITokenizer base class now has the magic methods __call__ (a link to midi_to_tokens), __len__ (returns len(self.vocab)) and __getitem__ (returns self.vocab[item], converting a token to an event and vice versa).
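
The three magic methods follow a common Python pattern, sketched here on a toy class. ToyTokenizer and its encode method are hypothetical stand-ins written for this example. They are not MidiTok classes, and a plain pair of dictionaries replaces the Vocabulary object.

```python
class ToyTokenizer:
    """Toy stand-in showing the call / len / getitem pattern."""

    def __init__(self, events):
        # Two lookup tables stand in for the vocabulary.
        self.event_to_token = {event: i for i, event in enumerate(events)}
        self.token_to_event = {i: event for event, i in self.event_to_token.items()}

    def encode(self, events):
        """Stand-in for midi_to_tokens: map events to token ids."""
        return [self.event_to_token[event] for event in events]

    def __call__(self, events):
        # Calling the object is a shortcut to the encoding method.
        return self.encode(events)

    def __len__(self):
        # The length of the tokenizer is the size of its vocabulary.
        return len(self.event_to_token)

    def __getitem__(self, item):
        # A string is treated as an event, anything else as a token id.
        if isinstance(item, str):
            return self.event_to_token[item]
        return self.token_to_event[item]


tokenizer = ToyTokenizer(["Pitch_60", "Duration_1.0"])
```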

Compatibility
* CP Word, Octuple and MuMIDI tokenizations from **< v1.2.0** are no longer compatible; datasets have to be retokenized

Thanks
Special thanks to envilk for his contribution!

1.1.11

Changes
* 13 d930de5f34782d2afd04934035448a9e758e774b Fail check when decoding tokens with Octuple, could lead to errors with wrong TimeSignature tokens
* a39b390a371abb5fd46cf7a010894d5d5a042c3a mask argument is now present for all tokenizer constructors. Masking tokens are then added to vocabularies at initialization.
* af85740538665ec052f505d8d276100ef28c79c4 unused Bar token removed from the vocabulary of Structured

Compatibility
* **Structured**: the Bar token (value 1) has been removed; subsequent token values should be decreased by 1
* The MASK token is now added to the vocabulary when the tokenizer is initialized, so token indexes may be shifted compared with versions < 1.1.11. If you used masking tokens, you should probably re-tokenize your data and retrain your models with v1.1.11.
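
The Structured note above can be sketched as a remapping of previously saved token ids. shift_structured_tokens is a hypothetical helper, not part of MidiTok, and it only covers the removal of the Bar token. It ignores any further shift caused by the MASK token, so re-tokenizing remains the safer route.

```python
REMOVED_BAR_TOKEN = 1  # value of the Bar token removed from Structured


def shift_structured_tokens(tokens):
    """Decrease by 1 every token value above the removed Bar token."""
    remapped = []
    for token in tokens:
        if token == REMOVED_BAR_TOKEN:
            # The Bar token was unused, so it should not appear in old data.
            raise ValueError("unexpected Bar token in the sequence")
        remapped.append(token - 1 if token > REMOVED_BAR_TOKEN else token)
    return remapped
```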
