This fairly big update brings data augmentation methods, along with bug fixes and optimizations, and lets you write more elegant code.
## Changes
* 8f201e0299b503ce2b6976cbfa1d39660b4c3efe 308fb278dd1df2dee79541b8c85773a619ef5b02 [**Data augmentation**](https://github.com/Natooz/MidiTok/tree/main/miditok/data_augmentation) methods! 🙌 They can be applied to both MIDIs and tokens, augmenting data by shifting pitch, velocity and duration values.
* 1d8e9034c9354a60249de05dc938eb8e880f366e You can perform data augmentation while tokenizing a dataset (`tokenize_midi_dataset` method) with the `data_augment_offsets` argument. This is done at the token level, as it is faster than augmenting MIDI objects.
* 0634adee1f050fb51eed1d73ef39f982573c5d7d **BPE** is now implemented in the main tokenizer class! This means all tokenizers can benefit from it in a much prettier way!
* 0634adee1f050fb51eed1d73ef39f982573c5d7d **`bpe` method renamed to `learn_bpe`**; it now returns metrics (also shown in the progress bar during learning) on the number of token combinations and the sequence length reduction
* 7b8c9777cb0866a179b64e50c26c6c7cccec5cee Backward compatibility when loading tokenizer config files with BPE from older versions
* 3cea9aa11c238486a71dff82d244a0f16a8a52e9 nturusin Fixes in the training section of the GPT-2 Hugging Face music transformer example notebook
* 65afa6b1aaa35e0df396276f9811f12d10f67ea6 The `tokens_to_midi` and `save_tokens` methods **can now receive tokens as tensors and numpy arrays. PyTorch, TensorFlow and Jax (numpy) tensors are supported**. The `convert_tokens_tensors_to_list` decorator converts them to lists; you can also use it on your own custom methods.
* aab64aa4159ee27022b359597ece3154dc224513 The `__call__` magic method now automatically routes to `midi_to_tokens` or `tokens_to_midi` depending on what you give it. You can now use tokenizers more elegantly, as `tokenizer(midi_obj)` or `tokenizer(generated_tokens)`.
* e90b20a86283aa1dab2db071f5c6a49b161caa42 Bugfix in `Structured` that could cause an infinite while loop with illegal token type successions
* 947af8cfa0c72212a8835e10fe1f804356d13e8a Big refactor of MuMIDI, which now has a fixed vocabulary / type indices. It is easier to handle and use. (thanks gonzaloarca)
* 947af8cfa0c72212a8835e10fe1f804356d13e8a CPWord "Ignore" tokens are all renamed `Ignore_None` by convention, making operations easier in data augmentation and other methods.
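To illustrate what token-level augmentation looks like, here is a minimal sketch of a pitch shift. It assumes a simplified `Type_Value` string token format (MidiTok's actual representation and offset handling may differ); placeholder tokens such as `Ignore_None` pass through untouched.

```python
def shift_pitch_tokens(tokens, offset):
    """Shift the value of every Pitch token by ``offset`` semitones.

    Illustrative sketch only: assumes tokens are "Type_Value" strings,
    a simplification of MidiTok's actual token representation.
    """
    augmented = []
    for token in tokens:
        token_type, value = token.split("_")
        # Only Pitch tokens carry a pitch value to shift; placeholders
        # like "Ignore_None" and other token types are kept as-is
        if token_type == "Pitch" and value != "None":
            value = str(int(value) + offset)
        augmented.append(f"{token_type}_{value}")
    return augmented
```

For example, `shift_pitch_tokens(["Pitch_60", "Velocity_95"], 2)` returns `["Pitch_62", "Velocity_95"]`.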
## Compatibility
* Code using BPE has to be updated: remove `bpe(tokenizer)` and just declare tokenizers normally, and rename the `bpe` method to `learn_bpe`
* MuMIDI tokens and tokenizers from previous versions are incompatible with v1.4.0
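The tensor-to-list conversion mentioned in the changes above can be sketched as a simple duck-typed decorator. This is only an illustration of the idea, not MidiTok's actual implementation: it relies on the `tolist` method that PyTorch tensors and numpy arrays expose, while the real decorator covers more cases.

```python
from functools import wraps

def convert_tokens_tensors_to_list(fn):
    """Convert tensor-like arguments to plain Python lists before calling.

    Sketch of the idea only: any argument exposing a ``tolist`` method
    (PyTorch tensors and numpy arrays do) is converted first.
    """
    @wraps(fn)
    def wrapper(*args, **kwargs):
        args = [a.tolist() if hasattr(a, "tolist") else a for a in args]
        kwargs = {k: v.tolist() if hasattr(v, "tolist") else v
                  for k, v in kwargs.items()}
        return fn(*args, **kwargs)
    return wrapper

@convert_tokens_tensors_to_list
def count_tokens(tokens):
    # Receives a plain list regardless of the tensor type passed in
    return len(tokens)
```

With this, `count_tokens` accepts a plain list, a numpy array, or a PyTorch tensor interchangeably.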