Highlights
* Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
* The tokenizers can now also be trained with the **WordPiece** and **Unigram** algorithms!
* Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the `encode_ids_split` attribute of the tokenizer config;
* [symusic](https://github.com/Yikai-Liao/symusic) v0.4.3 or higher is now required to comply with the usage of the `clip` method;
* Better handling of file loading errors in `DatasetMIDI` and `DataCollator`;
* Introducing a new `filter_dataset` to clean a dataset of MIDI/abc files before using it;
* `MMM` tokenizer has been cleaned up, and is now fully modular: it now works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
* `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
* `TokSequence` objects produced by a tokenizer can now be split into per-bar or per-beat subsequences;
* Minor fixes, code improvements and cleaning.
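
To illustrate what "bar-wise" means for training and id encoding, here is a minimal sketch in plain Python. It is not MidiTok's implementation: `split_per_bar` is a hypothetical helper, and the token names assume a REMI-style sequence in which each bar starts with a `Bar_None` token. The idea is that new tokens are only learned from successions of base tokens inside one chunk, so no learned token spans a bar line.

```python
from typing import List


def split_per_bar(tokens: List[str], bar_token: str = "Bar_None") -> List[List[str]]:
    """Split a flat token list into one sub-list per bar (hypothetical helper)."""
    chunks: List[List[str]] = []
    for tok in tokens:
        # A bar token opens a new chunk; so does the very first token
        # when the sequence does not start with a bar token.
        if tok == bar_token or not chunks:
            chunks.append([])
        chunks[-1].append(tok)
    return chunks


# Illustrative REMI-style tokens (assumed names, two bars).
tokens = [
    "Bar_None", "Position_0", "Pitch_60", "Velocity_100", "Duration_1.0.8",
    "Bar_None", "Position_4", "Pitch_64", "Velocity_100", "Duration_1.0.8",
]
bars = split_per_bar(tokens)
print(len(bars))  # number of per-bar chunks
```

With MidiTok itself, this behavior is selected through the `encode_ids_split` attribute of the tokenizer config rather than by splitting sequences manually.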
Methods renaming
A few methods and properties were previously named after "bpe" and "midi". To align with the more general usages of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.
<details>
<summary>Methods renamed with deprecation warning:</summary>
* `midi_to_tokens` --> `encode`;
* `tokens_to_midi` --> `decode`;
* `learn_bpe` --> `train`;
* `apply_bpe` --> `encode_token_ids`;
* `decode_bpe` --> `decode_token_ids`;
* `ids_bpe_encoded` --> `are_ids_encoded`;
* `vocab_bpe` --> `vocab_model`;
* `tokenize_midi_dataset` --> `tokenize_dataset`.
</details>
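
As a rough picture of what "renamed with deprecation warning" means in practice, the sketch below shows the usual Python pattern: the old name stays available, emits a warning, and forwards to the new name. `ToyTokenizer` is a hypothetical stand-in, not a MidiTok class, and the warning class and message MidiTok actually uses are assumptions here.

```python
import warnings


class ToyTokenizer:
    """Hypothetical stand-in used only to illustrate the aliasing pattern."""

    def encode(self, notes):
        # New, preferred name.
        return [f"Pitch_{n}" for n in notes]

    def midi_to_tokens(self, notes):
        # Old name kept as a thin alias that warns, then forwards the call.
        warnings.warn(
            "midi_to_tokens is deprecated, use encode instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.encode(notes)


tok = ToyTokenizer()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    old_result = tok.midi_to_tokens([60, 64])

print(old_result == tok.encode([60, 64]))
print(caught[0].category.__name__)
```

Code that still calls the old names should therefore keep working for now, but is worth migrating before the aliases are removed.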
<details>
<summary>Methods renamed without deprecation warning (less commonly used; skipping the warning keeps the code cleaner):</summary>
* `MIDITokenizer` --> `MusicTokenizer`;
* `augment_midi` --> `augment_score`;
* `augment_midi_dataset` --> `augment_dataset`;
* `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`;
* `split_midis_for_training` --> `split_files_for_training`;
* `split_midi_per_note_density` --> `split_score_per_note_density`;
* `get_midi_programs` --> `get_score_programs`;
* `merge_midis` --> `merge_scores`;
* `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`;
* `split_midi_per_ticks` --> `split_score_per_ticks`;
* `split_midi_per_beats` --> `split_score_per_beats`;
* `split_midi_per_tracks` --> `split_score_per_tracks`;
* `concat_midis` --> `concat_scores`.
</details>
<details>
<summary>Protected internal methods (no deprecation warning, advanced usages):</summary>
* `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`;
* `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`;
* `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`.
</details>
There are no other compatibility issues besides these renamings.
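
For codebases with many call sites, the renamings can be applied mechanically. The following is a minimal sketch of one way to do it, not a tool shipped with MidiTok: `RENAMES` holds only a subset of the names listed above, and `migrate_source` is a hypothetical helper. It uses a single regex pass with word boundaries so that `midi_to_tokens` and `_midi_to_tokens` are treated as distinct identifiers and a replacement is never rewritten a second time.

```python
import re

# Subset of the renamings listed in these notes (old name -> new name).
RENAMES = {
    "midi_to_tokens": "encode",
    "tokens_to_midi": "decode",
    "learn_bpe": "train",
    "apply_bpe": "encode_token_ids",
    "decode_bpe": "decode_token_ids",
    "ids_bpe_encoded": "are_ids_encoded",
    "vocab_bpe": "vocab_model",
    "tokenize_midi_dataset": "tokenize_dataset",
    "MIDITokenizer": "MusicTokenizer",
    "_midi_to_tokens": "_score_to_tokens",
    "_tokens_to_midi": "_tokens_to_score",
}

# \b keeps "midi_to_tokens" from matching inside "_midi_to_tokens",
# because "_" counts as a word character.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(code: str) -> str:
    """Return `code` with old identifiers replaced by their new names."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], code)


old_code = (
    "tok_seq = tokenizer.midi_to_tokens(midi)\n"
    "midi2 = tokenizer.tokens_to_midi(tok_seq)\n"
    "raw = tokenizer._midi_to_tokens(midi)\n"
)
new_code = migrate_source(old_code)
print(new_code)
```

This is a purely textual rewrite, so it also touches matching names in comments, strings, and unrelated code; reviewing the resulting diff before committing is advisable.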
**Full Changelog**: https://github.com/Natooz/MidiTok/compare/v3.0.2...v3.0.3