Miditok

Latest version: v3.0.4

Page 1 of 11

3.0.4

This release introduces the `PerTok` tokenizer by Lemonaide AI, attribute control tokens, and minor fixes.

Highlights

PerTok: Performance Tokenizer

(associated paper to be released)

PerTok was developed by Julian Lenz (JLenzy) at [Lemonaide AI](https://www.lemonaide.ai) to capture expressive timing in symbolic scores while keeping sequence lengths competitively low. It achieves this by dividing time differences into Macro and Micro categories, introducing a new MicroTime token type. Subtle deviations from the quantized beat are represented with these Timeshift tokens.
Furthermore, PerTok lets you encode an unlimited number of note subdivisions by allowing multiple, overlapping values in the `beat_res` parameter of the `TokenizerConfig`.
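
The macro/micro split described above can be pictured with a small, library-free sketch. It is not PerTok's actual implementation: the function name, the `positions_per_beat` parameter and the rounding rule are assumptions made for illustration only.

```python
from typing import Tuple


def split_macro_micro(tick: int, ticks_per_beat: int, positions_per_beat: int) -> Tuple[int, int]:
    """Split an absolute tick into a grid-aligned tick and a signed micro offset.

    Illustrative only: names and rounding rule are assumptions,
    not PerTok's actual implementation.
    """
    grid = ticks_per_beat // positions_per_beat  # ticks between two grid positions
    macro_index = (tick + grid // 2) // grid  # nearest grid position, ties round up
    macro_tick = macro_index * grid
    return macro_tick, tick - macro_tick


# A note played 10 ticks after the grid position at tick 240
print(split_macro_micro(250, ticks_per_beat=480, positions_per_beat=4))  # (240, 10)
```

The first value would be carried by the coarse (macro) time tokens, and the second by a micro timing token, so expressive deviations survive quantization without requiring a very fine grid.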

The micro timing tokens will be extended to all tokenizers in a future update.

Attribute Control tokens

Attribute controls are additional tokens that let you train a model so that it can be steered during inference, by conditioning it to predict music with specific features.
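
As a rough illustration of the idea, the sketch below measures one feature of a token sequence (notes per bar) and prepends a matching control token. The `CTRL_NoteDensity_*` token name and the helper are hypothetical and do not reflect MidiTok's actual attribute control API; the sketch only assumes tokens that start with `Pitch_` and `Bar_`.

```python
from typing import List


def prepend_density_control(tokens: List[str]) -> List[str]:
    """Prepend a hypothetical note-density control token to a token sequence."""
    n_notes = sum(tok.startswith("Pitch_") for tok in tokens)
    n_bars = max(1, sum(tok.startswith("Bar_") for tok in tokens))  # avoid dividing by zero
    density = n_notes // n_bars  # whole notes per bar, rounded down
    return [f"CTRL_NoteDensity_{density}"] + tokens


tokens = ["Bar_None", "Pitch_60", "Pitch_64", "Bar_None", "Pitch_67", "Pitch_72"]
print(prepend_density_control(tokens)[0])  # CTRL_NoteDensity_2
```

A model trained on sequences prefixed this way can be prompted at inference time with the control token of the desired feature value.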

What's Changed
* updates to Example_HuggingFace_Mistral_Transformer.ipynb by briane412 in https://github.com/Natooz/MidiTok/pull/164
* `_model_name` is now a protected property by Natooz in https://github.com/Natooz/MidiTok/pull/165
* Fixing docs for tokenizer training by Natooz in https://github.com/Natooz/MidiTok/pull/167
* Default `continuing_subword_prefix` when splitting token sequences by Natooz in https://github.com/Natooz/MidiTok/pull/168
* small bug fix in MIDI pretokenization by shenranwang in https://github.com/Natooz/MidiTok/pull/170
* adding `no_preprocess_score` argument when tokenizing by Natooz in https://github.com/Natooz/MidiTok/pull/172
* `TokSequence` summable, `concatenate_track_sequences` arg for MMM by Natooz in https://github.com/Natooz/MidiTok/pull/173
* Docs update by Natooz in https://github.com/Natooz/MidiTok/pull/175
* Fixing split methods for empty files (no tracks and/or no notes) by Natooz in https://github.com/Natooz/MidiTok/pull/177
* Logo now with white outer stroke by Natooz in https://github.com/Natooz/MidiTok/pull/180
* Attribute controls feature by helloWorld199 in https://github.com/Natooz/MidiTok/pull/181
* better distinction between `one_token_stream` and `config.one_token_stream_for_programs` by Natooz in https://github.com/Natooz/MidiTok/pull/182
* making sure MMM token sequences are not concatenated when splitting them per bar/beat in tokenizer_training_iterator.py by Natooz in https://github.com/Natooz/MidiTok/pull/183
* rST Documentation fixes by scottclowe in https://github.com/Natooz/MidiTok/pull/184
* Bump actions/stale from 5.1.1 to 9.0.0 by dependabot in https://github.com/Natooz/MidiTok/pull/185
* Bump actions/download-artifact from 3 to 4 by dependabot in https://github.com/Natooz/MidiTok/pull/186
* Bump codecov/codecov-action from 3.1.0 to 4.5.0 by dependabot in https://github.com/Natooz/MidiTok/pull/187
* Bump actions/upload-artifact from 3 to 4 by dependabot in https://github.com/Natooz/MidiTok/pull/188
* Fixing bugs caused by changes from symusic v0.5.0 by Natooz in https://github.com/Natooz/MidiTok/pull/192
* `use_velocities` and `use_duration` configuration parameters by Natooz in https://github.com/Natooz/MidiTok/pull/193
* collator now handles decoder input ids (seq2seq models) by Natooz in https://github.com/Natooz/MidiTok/pull/194
* PerTok Tokenizer by JLenzy in https://github.com/Natooz/MidiTok/pull/191

New Contributors
* briane412 made their first contribution in https://github.com/Natooz/MidiTok/pull/164
* helloWorld199 made their first contribution in https://github.com/Natooz/MidiTok/pull/181
* scottclowe made their first contribution in https://github.com/Natooz/MidiTok/pull/184
* dependabot made their first contribution in https://github.com/Natooz/MidiTok/pull/185

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v3.0.3...v3.0.4

3.0.3

Highlights

* Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
* The tokenizers can now also be trained with the **WordPiece** and **Unigram** algorithms!
* Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the `encode_ids_split` attribute of the tokenizer config;
* [symusic](https://github.com/Yikai-Liao/symusic) v0.4.3 or higher is now required to comply with the usage of the `clip` method;
* Better handling of file loading errors in `DatasetMIDI` and `DataCollator`;
* Introducing a new `filter_dataset` to clean a dataset of MIDI/abc files before using it;
* `MMM` tokenizer has been cleaned up, and is now fully modular: it now works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
* `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
* `TokSequence` objects produced by a tokenizer can now be split into per-bar or per-beat subsequences;
* Minor fixes, code improvements and cleaning.
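
To make the "bar-wise" splitting mentioned in the list above more concrete, here is a minimal sketch on plain lists of token strings. It assumes REMI-style sequences in which every bar starts with a `Bar_` token; the helper name is hypothetical and this is not MidiTok's own splitting code.

```python
from typing import List


def split_per_bar(tokens: List[str]) -> List[List[str]]:
    """Split a flat token list into one sub-list per bar."""
    bars: List[List[str]] = []
    for tok in tokens:
        # Open a new group at each bar token, or for tokens preceding the first bar
        if tok.startswith("Bar_") or not bars:
            bars.append([])
        bars[-1].append(tok)
    return bars


tokens = ["Bar_None", "Position_0", "Pitch_60", "Bar_None", "Position_0", "Pitch_62"]
for bar in split_per_bar(tokens):
    print(bar)
```

When training is performed on such sub-lists, a learned token can never span a bar line, which is the intent of training "strictly within bars".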

Methods renaming

A few methods and properties were previously named after "bpe" and "midi". To align with the more general usages of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.

<details>
<summary>Methods renamed with deprecation warning:</summary>

* `midi_to_tokens` --> `encode`;
* `tokens_to_midi` --> `decode`;
* `learn_bpe` --> `train`;
* `apply_bpe` --> `encode_token_ids`;
* `decode_bpe` --> `decode_token_ids`;
* `ids_bpe_encoded` --> `are_ids_encoded`;
* `vocab_bpe` --> `vocab_model`;
* `tokenize_midi_dataset` --> `tokenize_dataset`.
</details>

<details>
<summary>Methods renamed without deprecation warning (fewer usages, reduces code clutter):</summary>

* `MIDITokenizer` --> `MusicTokenizer`;
* `augment_midi` --> `augment_score`;
* `augment_midi_dataset` --> `augment_dataset`;
* `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`;
* `split_midis_for_training` --> `split_files_for_training`;
* `split_midi_per_note_density` --> `split_score_per_note_density`;
* `get_midi_programs` --> `get_score_programs`;
* `merge_midis` --> `merge_scores`;
* `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`;
* `split_midi_per_ticks` --> `split_score_per_ticks`;
* `split_midi_per_beats` --> `split_score_per_beats`;
* `split_midi_per_tracks` --> `split_score_per_tracks`;
* `concat_midis` --> `concat_scores`.
</details>

<details>
<summary>Protected internal methods (no deprecation warning, advanced usages):</summary>

* `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`;
* `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`;
* `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`.
</details>

There are no other compatibility issues besides these renamings.
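
When migrating a codebase, the public renamings listed above can be checked mechanically. The following sketch is a hypothetical helper, not part of MidiTok: it scans a source string for the old names using whole-word matching, so that for instance `augment_midi` is not reported inside `augment_midi_dataset`.

```python
import re
from typing import Dict

# Old name -> new name, copied from the renaming lists above (public names only)
RENAMED: Dict[str, str] = {
    "midi_to_tokens": "encode",
    "tokens_to_midi": "decode",
    "learn_bpe": "train",
    "apply_bpe": "encode_token_ids",
    "decode_bpe": "decode_token_ids",
    "ids_bpe_encoded": "are_ids_encoded",
    "vocab_bpe": "vocab_model",
    "tokenize_midi_dataset": "tokenize_dataset",
    "MIDITokenizer": "MusicTokenizer",
    "augment_midi": "augment_score",
    "augment_midi_dataset": "augment_dataset",
    "augment_midi_multiple_offsets": "augment_score_multiple_offsets",
    "split_midis_for_training": "split_files_for_training",
    "split_midi_per_note_density": "split_score_per_note_density",
    "get_midi_programs": "get_score_programs",
    "merge_midis": "merge_scores",
    "get_midi_ticks_per_beat": "get_score_ticks_per_beat",
    "split_midi_per_ticks": "split_score_per_ticks",
    "split_midi_per_beats": "split_score_per_beats",
    "split_midi_per_tracks": "split_score_per_tracks",
    "concat_midis": "concat_scores",
}


def find_old_names(source: str) -> Dict[str, str]:
    """Return the old names found in `source`, mapped to their replacements."""
    found = {}
    for old, new in RENAMED.items():
        # \b treats "_" as a word character, so only whole identifiers match
        if re.search(rf"\b{re.escape(old)}\b", source):
            found[old] = new
    return found


code = "tokens = tokenizer.midi_to_tokens(midi)\ntokenizer.learn_bpe(vocab_size=1000)"
print(find_old_names(code))  # {'midi_to_tokens': 'encode', 'learn_bpe': 'train'}
```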

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v3.0.2...v3.0.3

3.0.2

Tldr

This new version introduces a new `DatasetMIDI` class to use when training PyTorch models. It replaces the previously named `DatasetTok` class, and adds a pre-tokenizing option and better handling of BOS and EOS tokens.
A new `miditok.pytorch_data.split_midis_for_training` method lets you dynamically chunk MIDIs into smaller parts that yield approximately the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
A few new utility methods have been created for these features, e.g. to split, concatenate or merge `symusic.Score` objects.
Thanks to Kinyugo for the discussions and tests that guided the development of these features! (#147)
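
The chunking idea can be sketched as a greedy grouping of consecutive bars. This is a simplified illustration under stated assumptions, not the logic of `split_midis_for_training`: the helper name is hypothetical, and the estimated number of tokens per bar is assumed to be already known (for example, derived from the number of notes in each bar).

```python
from typing import List


def chunk_bars(tokens_per_bar: List[int], max_seq_len: int) -> List[List[int]]:
    """Group consecutive bar indices so each group stays within `max_seq_len` tokens.

    A single bar longer than `max_seq_len` still forms its own chunk.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_len = 0
    for bar_idx, n_tokens in enumerate(tokens_per_bar):
        # Close the current chunk if adding this bar would exceed the budget
        if current and current_len + n_tokens > max_seq_len:
            chunks.append(current)
            current, current_len = [], 0
        current.append(bar_idx)
        current_len += n_tokens
    if current:
        chunks.append(current)
    return chunks


print(chunk_bars([40, 60, 30, 80, 20], max_seq_len=100))  # [[0, 1], [2], [3, 4]]
```

Cutting on bar boundaries keeps each chunk musically coherent, and sizing chunks by estimated token count rather than by a fixed number of bars limits both truncation and padding.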

The update also brings a few minor fixes, and the [docs](https://miditok.readthedocs.io/) have a new theme!

What's Changed

* Fix token_paths to files_paths, and config to model_config by sunsetsobserver in https://github.com/Natooz/MidiTok/pull/145
* Fix issues in Octuple with multiple different-beat time signatures by ilya16 in https://github.com/Natooz/MidiTok/pull/146
* Pitch interval decoding: discarding notes outside the tokenizer pitch range by Natooz in https://github.com/Natooz/MidiTok/pull/149
* Fixing `save_pretrained` to comply with huggingface_hub v0.21 by Natooz in https://github.com/Natooz/MidiTok/pull/150
* ability to overwrite `_create_durations_tuples` in init by JLenzy in https://github.com/Natooz/MidiTok/pull/153
* Refactor of PyTorch data loading classes and methods by Natooz and Kinyugo in https://github.com/Natooz/MidiTok/pull/148
* The docs have a new theme! Using the [furo](https://github.com/pradyunsg/furo) theme.

New Contributors
* sunsetsobserver made their first contribution in https://github.com/Natooz/MidiTok/pull/145
* JLenzy made their first contribution in https://github.com/Natooz/MidiTok/pull/153

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v3.0.1...v3.0.2

3.0.1

What's Changed
* `use_pitchdrum_tokens` option to use dedicated `PitchDrum` tokens for drums tracks
* Fixing time signature preprocessing (time division mismatch) in https://github.com/Natooz/MidiTok/pull/132 (#131 EterDelta)
* Fixing data augmentation example and considering all midi extensions in https://github.com/Natooz/MidiTok/pull/136 (#135 oiabtt)
* decoding: automatically making sure to decode BPE then completing `tokens` in https://github.com/Natooz/MidiTok/pull/138 (#137 oiabtt)
* `load_tokens` now returning `TokSequence` in https://github.com/Natooz/MidiTok/pull/139 (#137 oiabtt)
* convert chord maps back to tuples from list when loading tokenizer from a saved configuration by shenranwang in https://github.com/Natooz/MidiTok/pull/141
* can now use `MIDITokenizer.from_pretrained` similarly to the `AutoTokenizer` in the Hugging Face transformers library in https://github.com/Natooz/MidiTok/pull/142 (discussed in #127 oiabtt)

New Contributors
* shenranwang made their first contribution in https://github.com/Natooz/MidiTok/pull/141

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v3.0.0...v3.0.1

3.0.0

Switch to symusic

This major version marks the switch from the [miditoolkit](https://github.com/YatingMusic/miditoolkit) MIDI reading/writing library to [**symusic**](https://github.com/Yikai-Liao/symusic), and a large optimisation of the MIDI preprocessing steps.

Symusic is a MIDI reading / writing library written in C++ with Python bindings, offering unmatched speeds, [**up to 500 times faster than native Python libraries**](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1895948962). It is based on [minimidi](https://github.com/lzqlzzq/minimidi). The two libraries are created and maintained by Yikai-Liao and lzqlzzq, who have done amazing work, which is still ongoing as many useful features are on the roadmap! 🫶

**Tokenizers from previous versions are compatible with this new version, but there might be some timing variations if you compare how MIDIs are tokenized and tokens decoded.**

Performance boost

These changes result in much faster MIDI loading/writing and tokenization times! **The overall tokenization (loading MIDI and tokenizing it) is** [**between 5 and 12 times faster**](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1896286910) depending on the tokenizer and data. You can find other benchmarks [here](https://github.com/Natooz/MidiTok/issues/112#issuecomment-1895948962).

This huge speed gain makes it possible to drop the previously recommended step of pre-tokenizing MIDI files as JSON tokens, and to **directly tokenize the MIDIs on the fly while training/using a model**! We updated the [usage examples of the docs](https://miditok.readthedocs.io/en/latest/examples.html) accordingly; the code is now simpler.

Other major changes

* When using time signatures, time tokens are now computed in ticks per beat, as opposed to ticks per quarter note as done previously. This change is in line with the definition of time and duration tokens, which was not handled following the MIDI norm for note values other than the quarter note until now (https://github.com/Natooz/MidiTok/pull/124);
* Adding new ruff rules and their fixes to comply, increasing the code quality in https://github.com/Natooz/MidiTok/pull/115;
* MidiTok still supports `miditoolkit.MidiFile` objects, but those will be converted on the fly to a `symusic.Score` object and a deprecation warning will be thrown;
* The data augmentation methods on the token level have been removed, in favour of better data augmentation operating directly on MIDIs, which is now much faster, simplifies processes and handles durations;
* The docs are fixed;
* The tokenization test workflows have been unified and considerably simplified, leading to more robust test assertions. We also increased the number of test cases and configurations, while decreasing the test time.
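
The first item of the list above (time tokens computed in ticks per beat rather than ticks per quarter note) comes down to simple arithmetic. The sketch below assumes that a beat is the note value given by the time signature denominator; the function name is chosen for illustration and is not taken from MidiTok.

```python
def ticks_per_beat(ticks_per_quarter: int, denominator: int) -> int:
    """Number of ticks in one beat, where a beat is the denominator's note value."""
    # A quarter note is 1/4 of a whole note; a beat is 1/denominator of a whole note
    return ticks_per_quarter * 4 // denominator


print(ticks_per_beat(480, 4))  # 480: in 4/4 a beat is a quarter note
print(ticks_per_beat(480, 8))  # 240: in 6/8 a beat is an eighth note
print(ticks_per_beat(480, 2))  # 960: in 2/2 a beat is a half note
```

With a denominator of 4 the two definitions coincide, which is why the change only affects files using other time signature denominators.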

Other minor changes

* Setting special tokens values in TokenizerConf in https://github.com/Natooz/MidiTok/pull/114
* Update README.md by kalyani2003 in https://github.com/Natooz/MidiTok/pull/120
* Readthedocs preview action for PRs in https://github.com/Natooz/MidiTok/pull/125

New Contributors
* kalyani2003 made their first contribution in https://github.com/Natooz/MidiTok/pull/120

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.8...v3.0.0

2.1.8

This new version brings an additional token type: pitch intervals. It allows you to represent pitch intervals for simultaneous and successive notes. You can read more details about how it works in [the docs](https://miditok.readthedocs.io/en/v2.1.7/).
We greatly improved the tests and CI workflow, fixed a few minor bugs and made improvements along the way.
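
As a small illustration of what pitch intervals encode, the sketch below computes intervals in semitones between successive notes and between the notes of a chord. It is not MidiTok's tokenization code: the function names are hypothetical, and measuring chord intervals from the lowest note is an assumption made for this example.

```python
from typing import List


def successive_intervals(pitches: List[int]) -> List[int]:
    """Signed semitone distance between each note and the previous one."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def chord_intervals(chord_pitches: List[int]) -> List[int]:
    """Semitone distance of each chord note above the lowest note (assumed reference)."""
    root = min(chord_pitches)
    return [p - root for p in sorted(chord_pitches) if p != root]


print(successive_intervals([60, 64, 67, 65]))  # [4, 3, -2]
print(chord_intervals([60, 64, 67]))  # [4, 7]
```

Representing notes relative to one another makes the same melody or chord shape produce the same tokens regardless of the key it is played in.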

This new version also **drops support for Python 3.7, and now requires Python 3.8 and newer**. You can read more about the decision and how to make it retro-compatible in the docs.

**We encourage you to update to the latest [miditoolkit](https://github.com/YatingMusic/miditoolkit) version**, which also features some fixes and improvements. The most notable one is a cleanup of the dependencies, and **compatibility with recent numpy versions!**

What's Changed

* Typos fixes in docs by eltociear (89), gfggithubleet (91 and 93), shresthasurav (94), THEFZNKHAN (98 and 99)
* Fixing a bug when learning bpe without special tokens by Natooz in https://github.com/Natooz/MidiTok/pull/92
* Switch lint/isort/format to Ruff by akx in https://github.com/Natooz/MidiTok/pull/105
* Adding pitch interval option by Natooz in https://github.com/Natooz/MidiTok/pull/103
* Switching to pyproject.toml and hatch packaging by Natooz in https://github.com/Natooz/MidiTok/pull/106
* Fix data augment by parneyw in https://github.com/Natooz/MidiTok/pull/109
* dealing with empty midi file by feiyuehchen in https://github.com/Natooz/MidiTok/pull/110
* Better tests + minor improvements by Natooz in https://github.com/Natooz/MidiTok/pull/108

New Contributors

* eltociear made their first contribution in https://github.com/Natooz/MidiTok/pull/89
* gfggithubleet made their first contribution in https://github.com/Natooz/MidiTok/pull/91
* shresthasurav made their first contribution in https://github.com/Natooz/MidiTok/pull/94
* THEFZNKHAN made their first contribution in https://github.com/Natooz/MidiTok/pull/98
* akx made their first contribution in https://github.com/Natooz/MidiTok/pull/105
* parneyw made their first contribution in https://github.com/Natooz/MidiTok/pull/109
* feiyuehchen made their first contribution in https://github.com/Natooz/MidiTok/pull/110

**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v2.1.7...v2.1.8
