Miditok

Latest version: v3.0.4

2.1.7

**This release brings the integration of the Hugging Face Hub, along with a few important fixes and improvements!**

What's Changed

* 87 Hugging Face hub integration! You can now push and load MidiTok tokenizers from the Hugging Face hub, using the `.from_pretrained` and `push_to_hub` methods as you would do for your models! Special thanks to Wauplin and julien-c for the help and support! 🤗🤗
* 80 (78 leleogere) Adding a `func_to_get_labels` argument to `DatasetTok`, which can be used to retrieve labels when loading data;
* 81 (74 Chunyuan-Li) Fixing multi-stream decoding with several identical programs + fixes with the encoding / decoding of time signatures for Bar-based tokenizers;
* 84 (77 VDT5702) Fix in `detect_chords` when checking whether to use unknown chords;
* 82 (79 leleogere) `tokenize_midi_dataset` now reproduces the file tree of the source files. This change fixes an issue where files with the same name overwrote each other under the previous method. You can also specify whether to overwrite files in the destination directory or not.
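
The file tree reproduction described in the last item can be pictured with a small standard-library sketch. It is not MidiTok's implementation: the helper names `destination_path` and `plan_outputs`, the `.json` suffix and the directory names are assumptions made for illustration.

```python
from pathlib import Path


def destination_path(src_file: Path, src_root: Path, out_root: Path) -> Path:
    """Map a source file to an output path that keeps its relative sub-directories."""
    relative = src_file.relative_to(src_root)
    # Assumption: tokens are written next to each other as .json files.
    return (out_root / relative).with_suffix(".json")


def plan_outputs(src_files, src_root, out_root, overwrite=False):
    """Return (source, destination) pairs, skipping existing destinations unless overwrite is set."""
    plan = []
    for src in src_files:
        dst = destination_path(Path(src), Path(src_root), Path(out_root))
        if dst.exists() and not overwrite:
            continue  # keep the file that is already in the destination directory
        plan.append((Path(src), dst))
    return plan
```

Because the relative path is preserved, two source files named `prelude.mid` in different sub-directories map to two distinct destinations instead of colliding.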

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.6...v2.1.7

2.1.6

Changelog

* 72 (71) adding `program_change` config option, which inserts `Program` tokens whenever an event is from a different track than the previous one. They mimic the MIDI `ProgramChange` messages. If this parameter is disabled (the default), a `Program` token is placed before each note token (as done in previous versions);
* 72 `MIDILike` decoding optimized;
* 72 deduplicating overlapping pitch bends during preprocess;
* 72 `tokenize_check_equals` test method and more test cases;
* 75 and 76 (73 and 74 by Chunyuan-Li) Fixing time signature encoding / decoding workflows for `Bar`/`Position`-based tokenizers (`REMI`, `CPWord`, `Octuple`, `MMM`);
* 76 `Octuple` is now tested with time signature disabled: as `TimeSig` tokens are only carried with notes, `Octuple` cannot accurately represent time signatures. As a result, if a time signature change occurs and the following bar does not contain any note, the time will be shifted by one or multiple bars, depending on the previous time signature numerator and the time gap between the last and current note. We do not recommend using `Octuple` with MIDIs containing several time signature changes (at least numerator changes);
* 76 `MMM` tokenization workflow speedup.
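
As an illustration of the `program_change` option, here is a minimal sketch of the two placement strategies. It assumes that, with the option disabled, a `Program` token precedes every note token (as stated for `config.use_programs` in the 2.1.3 notes). The function `insert_program_tokens` and the `(program, pitch)` input format are hypothetical and not part of MidiTok.

```python
def insert_program_tokens(notes, program_change=False):
    """Build a token list from (program, pitch) pairs that are already sorted by time.

    program_change=True: emit a Program token only when the program differs
    from the previous note's program, like a MIDI ProgramChange message.
    program_change=False: emit a Program token before every note.
    """
    tokens = []
    previous_program = None
    for program, pitch in notes:
        if not program_change or program != previous_program:
            tokens.append(f"Program_{program}")
        tokens.append(f"Pitch_{pitch}")
        previous_program = program
    return tokens


notes = [(0, 60), (0, 64), (33, 36), (0, 67)]
compact = insert_program_tokens(notes, program_change=True)
verbose = insert_program_tokens(notes, program_change=False)
```

With `program_change` enabled the sequence is shorter whenever consecutive notes share an instrument, at the cost of making each note's program depend on earlier tokens.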

2.1.5

Changelog

* 69 bacea19e70ba596a05fbbcf9f2bf53beb9714540 sort notes in all cases when tokenizing as MIDIs can contain unsorted notes;
* 70 (68) New `one_token_stream_for_programs` parameter allowing to treat all tracks of a MIDI as a single stream of tokens (adding `Program` tokens before `Pitch`/`NoteOn`...). This option is enabled by default, and corresponds to the default code behaviour of the previous versions. Disabling it lets you keep `Program` tokens in the vocabulary (`config.use_programs` enabled) while converting each track independently;
* 70 (68) `TimeShift` and `Rest` tokens can now be created successively during the tokenization, which happens when the largest `TimeShift` / `Rest` value of the tokenizer isn't sufficient;
* 70 (68) Rests are now represented using the same format as `TimeShift`s, and the `config.rest_range` parameter has been renamed `beat_res_rest` for simplicity and flexibility. The default value is `{(0, 1): 8, (1, 2): 4, (2, 12): 2}`;
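
The successive `TimeShift` / `Rest` tokens mentioned above can be understood as a greedy decomposition of a long gap into the values the tokenizer knows. The sketch below only illustrates that idea: `split_duration` is a hypothetical helper, durations are plain integers in an arbitrary unit, and MidiTok's own handling of `beat_res_rest` may differ.

```python
def split_duration(duration, available_values):
    """Greedily split a duration into successive shift values.

    Returns the list of values to emit as successive tokens, plus the
    remainder that no available value can represent.
    """
    shifts = []
    remaining = duration
    for value in sorted(available_values, reverse=True):
        if value <= 0:
            continue  # ignore non-positive values, they would loop forever
        while remaining >= value:
            shifts.append(value)
            remaining -= value
    return shifts, remaining


# A gap of 22 units with a largest known value of 8 needs several tokens.
shifts, leftover = split_duration(22, [1, 2, 4, 8])
```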

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.4...v2.1.5

Thanks to caenopy for reporting the bugs fixed here.

Compatibility

* Tokenizers from previous versions with a `rest_range` parameter will be converted to the new `beat_res_rest` format.

2.1.4

Changelog

* ilya16 2e1978f5c533b0989c2c4929f5e976511e06c6bb Fix in `save_tokens` method, reading `kwargs` in the json file saved;
* 67 Adding sustain pedal and pitch bend tokens for `REMI`, `TSD` and `MIDILike` tokenizers.

Compatibility

* `MMM` now adds additional tokens in the same order as the other tokenizers, meaning previously saved `MMM` tokenizers with these tokens may need to be converted.

2.1.3

This big update brings a few important changes and improvements.

A new common tokenization workflow for all tokenizers.

We now distinguish three types of tokens:
1. Global MIDI tokens, which represent attributes and events affecting the music globally, such as the tempo or time signature;
2. Track tokens, representing values of distinct tracks such as the notes, chords or effects;
3. Time tokens, which serve to structure and place the previous categories of tokens in time.

All tokenizations now follow the pattern:

1. Preprocess the MIDI;
2. Gather global MIDI events (tempo...);
3. Gather track events (notes, chords);
4. If "one token stream", concatenate all global and track events and sort them by time of occurrence. Else, concatenate the global events to each sequence of track events;
5. Deduce the time events for all the sequences of events (only one if "one token stream");
6. Return the tokens, as a combination of list of strings and list of integers (token ids).
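
Step 4 of this pattern can be sketched in a few lines. This is a simplified illustration, not the library's code: events are assumed to be `(time, name)` tuples, `assemble_sequences` is a hypothetical name, and sorting each per-track sequence by time is an assumption about how the concatenation is ordered.

```python
def assemble_sequences(global_events, track_events, one_token_stream):
    """Combine global and track events into the sequences to tokenize.

    global_events: list of (time, name) tuples such as tempo changes.
    track_events: list of per-track lists of (time, name) tuples.
    Returns a list of event sequences, a single one in "one token stream" mode.
    """
    def by_time(event):
        return event[0]

    if one_token_stream:
        merged = list(global_events)
        for events in track_events:
            merged.extend(events)
        # sorted() is stable, so global events stay first at equal times.
        return [sorted(merged, key=by_time)]
    # Otherwise each track gets its own copy of the global events.
    return [sorted(list(global_events) + list(events), key=by_time) for events in track_events]


global_events = [(0, "Tempo_120"), (8, "Tempo_90")]
track_events = [[(0, "Pitch_60"), (4, "Pitch_62")], [(2, "Pitch_36")]]
single = assemble_sequences(global_events, track_events, one_token_stream=True)
per_track = assemble_sequences(global_events, track_events, one_token_stream=False)
```

The time tokens of step 5 would then be deduced from the gaps between consecutive event times in each returned sequence.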

This considerably cleans up the code (DRY, fewer redundant methods), while bringing speedups as the calls to sorting methods have been reduced.

TLDR; other changes

* New submodule `pytorch_data` offering PyTorch `Dataset` objects and a data collator, to be used when training a PyTorch model. Learn more in the documentation of the module;
* `MIDILike`, `CPWord` and `Structured` now natively handle `Program` tokens in a multitrack / `one_token_stream` way;
* Time signature changes are now handled by `TSD`, `MIDILike` and `CPWord`;
* The `time_signature_range` config option is now more flexible / convenient.

Changelog

* 61 new `pytorch_data` submodule, with `DatasetTok` and `DatasetJsonIO` classes. This module is only loaded if `torch` is installed in the python environment;
* 61 `tokenize_midi_dataset()` method now has a `tokenizer_config_file_name` argument, allowing to save the tokenizer config with a custom file name;
* 61 "all-in-one" `DataCollator` object to be used with PyTorch `DataLoader`s;
* 62 `Structured` and `MIDILike` now natively handle `Program` tokens. When setting `config.use_programs` true, a `Program` token will be added before each `Pitch`/`NoteOn`/`NoteOff` token to associate its instrument. MIDIs will also be treated as a single stream of tokens in this case, whereas otherwise each track is converted into independent token sequences;
* 62 `miditok.utils.remove_duplicated_notes` method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
* 62 `miditok.utils.merge_same_program_tracks` is now called in `preprocess_midi` when `config.use_programs` is True;
* 62 Big refactor of the `REMI` codebase, which now has all the features of `REMIPlus`, along with code cleanup and speedups (fewer calls to sorting). The `REMIPlus` class is now basically only a wrapped `REMI` with programs and time signature enabled;
* 62 `TSD` and `MIDILike` now encode and decode time signature changes;
* 63 ilya16 The `Tempo`s can now be created with a logarithmic scale, instead of the default linear scale.
* c53a008cadda0f111058a892c23375edde364077 and 5d1c12e18a35e3e633863f1f675374f28c8f7748 `track_to_tokens` and `tokens_to_track` methods are now partially removed: they are protected in classes that still rely on them, and removed from the others. These methods were made for internal calls and were not recommended for direct use. The `midi_to_tokens` method is recommended instead;
* 65 ilya16 changes `time_signature_range` into a dictionary `{denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}`;
* 65 ilya16 fix in the formula computing the number of ticks per bar.
* 66 Adds an option to `TokenizerConfig` to delete the successive tempo / time signature changes carrying the same value during MIDI preprocessing;
* 66 now using xdist for tests, big speedup on GitHub Actions (ty ilya16 !);
* 66 `CPWord` and `Octuple` now follow the common tokenization workflow;
* 66 As a consequence of the previous point, `OctupleMono` is removed as there were no records of its use. It is now equivalent to `Octuple` without `config.use_programs`;
* 66 `CPWord` now handling time signature changes;
* 66 tests for tempo and time signatures changes are now more robust, exceptions were removed and fixed.
* 5a6378b26d4d8176ca84361c5ecab038d7026f8a `save_tokens` now by default doesn't save programs if `config.use_programs` is False
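
To make the difference between the linear and logarithmic tempo scales (63) concrete, here is a small standard-library sketch. The function `tempo_values` and its parameters are hypothetical and do not mirror the `TokenizerConfig` options; it only shows how the spacing of the values changes.

```python
def tempo_values(low, high, count, log=False):
    """Return `count` tempo values between `low` and `high` (both included).

    Linear scale: constant difference between neighbours.
    Logarithmic scale: constant ratio between neighbours, so slow tempos
    get a finer resolution than fast ones.
    """
    if count < 2:
        raise ValueError("count must be at least 2")
    if log:
        ratio = (high / low) ** (1 / (count - 1))
        return [round(low * ratio**i, 2) for i in range(count)]
    step = (high - low) / (count - 1)
    return [round(low + step * i, 2) for i in range(count)]


linear = tempo_values(40, 160, 3)
logarithmic = tempo_values(40, 160, 3, log=True)
```

With these arguments the middle value is the arithmetic mean of the bounds on the linear scale and their geometric mean on the logarithmic one.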

Compatibility

* Calls to `track_to_tokens` and `tokens_to_track` methods are not supported anymore. If you used these methods, you may replace them with `midi_to_tokens` and `tokens_to_midi` (or just `__call__` the tokenizer) while selecting the appropriate token sequences / tracks;
* `time_signature_range` now needs to be given as a dictionary;
* Due to changes in the order of vocabularies of `Octuple` (as programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped, idx 3 moved to 5.
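
The dictionary format of `time_signature_range` described above, `{denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}`, can be read as follows. This is a sketch under assumptions: `expand_time_signatures` is a hypothetical helper, and treating the `(min, max)` tuple as inclusive on both ends is an assumption rather than something stated in these notes.

```python
def expand_time_signatures(time_signature_range):
    """Expand {denominator: numerators} into a list of (numerator, denominator) pairs.

    Each value is either an explicit list of numerators or a (min, max)
    tuple, assumed here to be inclusive on both ends.
    """
    signatures = []
    for denominator, numerators in time_signature_range.items():
        if isinstance(numerators, tuple):
            low, high = numerators
            numerators = range(low, high + 1)
        signatures.extend((numerator, denominator) for numerator in numerators)
    return signatures


# 3/4 and 4/4 given explicitly, 5/8 to 7/8 given as a range.
signatures = expand_time_signatures({4: [3, 4], 8: (5, 7)})
```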

2.1.2

Thanks to Kapitan11, who spotted bugs when decoding tokens given as ids / integers (59), this update brings a few fixes for them, alongside tests ensuring that the input / output (i/o) formats of the tokenizers are handled correctly in every case.
The documentation on this subject, which was unclear until now, has also been updated.

Changes

* 394dc4d Fix in `MuMIDI` and `Octuple` token encodings that performed the preprocessing steps twice;
* 394dc4d code of [single track tests](tests/test_one_track.py) improved and now covering tempos for most tokenizations;
* 394dc4d `MuMIDI` can now decode tempo tokens;
* 394dc4d `_in_as_seq` decorator now used solely for the `tokens_to_midi()` method, and removed from `tokens_to_track()` which explicitly expects a `TokSequence` object as argument (089fa74);
* 089fa74 `_in_as_seq` decorator now handling all token ids input formats as it should;
* 9fe7639 Fix in `TSD` decoding with multiple input sequences when not in `one_token_stream` mode;
* 9fe7639 Adding i/o input ids tests;
* 8c2349bfb771145c805c8a652392ae8f11ed0756 `unique_track` property renamed to `one_token_stream` as it is more explicit and accurate;
* 8c2349bfb771145c805c8a652392ae8f11ed0756 new `convert_sequence_to_tokseq` method, which can convert any input sequence holding ids (integer), tokens (string) or events (Event) data into a `TokSequence` or list of `TokSequence`s objects, with the appropriate format depending on the tokenizer. This method is used by the `_in_as_seq` decorator;
* 8c2349bfb771145c805c8a652392ae8f11ed0756 new `io_format` tokenizer property, returning the tokenizer's i/o format as a tuple of strings. Their meanings are: *I* for instrument (for non-`one_token_stream` tokenizers), *T* for token, *C* for sub-token class (for multi-voc tokenizers);
* Minor code lint improvements.
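
The meaning of the `io_format` letters can be summarised by a short sketch. It follows the description above but is written as a standalone function with two boolean parameters, which is an assumption made for illustration and not the property's actual signature.

```python
def io_format(one_token_stream, is_multi_voc):
    """Return the expected nesting of token data as a tuple of letters.

    "I": one sequence per instrument / track (non one_token_stream tokenizers).
    "T": the tokens of a sequence.
    "C": one sub-token per vocabulary class (multi-voc tokenizers).
    """
    dims = []
    if not one_token_stream:
        dims.append("I")
    dims.append("T")
    if is_multi_voc:
        dims.append("C")
    return tuple(dims)


single_stream = io_format(one_token_stream=True, is_multi_voc=False)
per_track_multi_voc = io_format(one_token_stream=False, is_multi_voc=True)
```

The length of the tuple is the nesting depth of the ids a tokenizer expects, which is the kind of information a method like `convert_sequence_to_tokseq` needs to interpret raw input.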

Compatibility

* All good 🙌
