## Highlight
Version 1.3.0 changes the way the vocabulary, and by extension the tokenizers, handle the special tokens `PAD`, `SOS`, `EOS` and `MASK`, and brings a cleaner way to instantiate these classes.
It may introduce incompatibilities with data and models used with previous MidiTok versions.
## Changes
* b9218bf7cba9695f37f74424325037bb4dda1cb7 The `Vocabulary` class now takes a `pad` argument specifying whether to include a special padding token. This option is set to `True` by default, as it is common to train networks with batches of unequal sequence lengths.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 `Vocabulary` class: the `event_to_token` argument of the constructor is renamed to `events` and must be given as a list of events.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 `Vocabulary` class: when adding a token to the vocabulary, its index is now set automatically. The `index` argument is removed, as it could cause issues and confusion when mapping indexes to models.
* b9218bf7cba9695f37f74424325037bb4dda1cb7 The `Event` class now takes the `value` argument in second position (the argument order changed).
* b9218bf7cba9695f37f74424325037bb4dda1cb7 Fixed a bug when learning BPE that occurred if `files_lim` was higher than the number of available files.
* f9cb1098df1334fb618550ea5c946657aea05881 For all tokenizers, a new constructor argument `pad` specifies whether to use a padding token, and the `sos_eos_tokens` argument is renamed to `sos_eos`.
* f9cb1098df1334fb618550ea5c946657aea05881 When creating a `Vocabulary`, the *SOS* and *EOS* tokens are now registered before the *MASK* token. This change makes the order match that of the special-token arguments in the tokenizers' constructors, and reflects that *SOS* and *EOS* tokens are more commonly used in symbolic music applications.
* 84db19d16efb3db566d468e6c71a912b1193ae05 The dummy `StructuredEncoding`, `MuMIDIEncoding`, `OctupleEncoding` and `OctupleMonoEncoding` classes are removed from `__init__.py`. These classes from early versions had no record of being used. The other dummy classes (REMI, MIDILike and CPWord) remain.
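The new special-token ordering can be illustrated with a minimal sketch (not the MidiTok source; the `build_special_tokens` helper and the `*_None` token names are illustrative assumptions): *PAD* comes first when enabled, then *SOS*/*EOS*, then *MASK*, with indexes assigned automatically in insertion order.

```python
# Illustrative sketch of the v1.3.0 special-token ordering, NOT MidiTok code.
# Tokens are registered in order PAD -> SOS/EOS -> MASK, and indexes are
# assigned automatically (no more manual `index` argument).
def build_special_tokens(pad=True, sos_eos=False, mask=False):
    tokens = []
    if pad:
        tokens.append("PAD_None")      # padding token, index 0 when enabled
    if sos_eos:
        tokens += ["SOS_None", "EOS_None"]  # now registered before MASK
    if mask:
        tokens.append("MASK_None")
    # Index = position in the registration order
    return {tok: idx for idx, tok in enumerate(tokens)}

build_special_tokens(pad=True, sos_eos=True, mask=True)
# -> {"PAD_None": 0, "SOS_None": 1, "EOS_None": 2, "MASK_None": 3}
```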
## Compatibility
* You might need to update your code when creating your tokenizer to handle the new `pad` argument.
* Data tokenized with **REMI**, and models trained on it, will be incompatible with v1.3.0 if you used special tokens: the *BAR* token was previously at index 1 and is now added after the special tokens.
* If you created a custom tokenizer inheriting from `MIDITokenizer`, make sure to update the calls to `super().__init__` with the new `pad` argument and the renamed `sos_eos` argument (example for MIDILike: [f9cb109](https://github.com/Natooz/MidiTok/commit/f9cb1098df1334fb618550ea5c946657aea05881#diff-ee81c42cbf9a0e5150c860773b29d8d75edbe6774b95837bbcb09272d6408883))
* **If you used both *SOS/EOS* and *MASK* special tokens**, their order (indexes) is now swapped, as *SOS/EOS* are now registered before *MASK*. As these tokens are not used during tokenization, **your previously tokenized datasets remain compatible**, unless you intentionally inserted *SOS*/*EOS*/*MASK* tokens. **Trained models will however be incompatible**, as the indices are swapped. If you want to use v1.3.0 with a previously trained model, you can manually swap the predictions of these tokens.
* There are no incompatibilities outside of these cases.
**Please reach out if you have any issue / question!** 🙌