## TL;DR
This major update brings:
* The integration of the Hugging Face [🤗tokenizers](https://github.com/huggingface/tokenizers) library as the Byte Pair Encoding (BPE) backend. **BPE is now 30 to 50 times faster, for both training and encoding!** 🙌
* A new **`TokSequence`** object to represent tokens! This object holds the same sequence in several forms: tokens (strings), ids (integers to pass to models), `Event`s, and bytes (used internally for BPE).
* Many internal changes: several methods and variables have been renamed, which will require you to update some of your code (details below).
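To picture what `TokSequence` carries, here is a minimal sketch of the idea as a standalone dataclass (a hypothetical simplification, not the actual class definition, which lives in miditok and holds more logic):

```python
from dataclasses import dataclass, field

# Hypothetical simplification: one object carries every representation
# of the same token sequence.
@dataclass
class TokSequenceSketch:
    tokens: list = field(default_factory=list)  # human-readable tokens (str)
    ids: list = field(default_factory=list)     # integer ids to feed a model
    events: list = field(default_factory=list)  # Event objects
    bytes: str = ""                             # byte form, used internally for BPE

seq = TokSequenceSketch(tokens=["Pitch_60", "Velocity_96"], ids=[12, 45])
print(seq.ids)  # [12, 45]
```

The point of bundling the forms together is that a tokenizer can fill in whichever representations are missing on demand, rather than forcing you to juggle parallel lists.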
## Changes
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a `Vocabulary` class is being replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a New `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a `__getitem__` now handles both ids (int) and tokens (str), with multi-vocab;
* 36bf0f66a392835e492a9f7decf7e382662f30aa Some methods of `MIDITokenizer` meant to be used internally are now protected;
* a2db7b9ed173149c34e9f01e3d873317a185288f New training method with 🤗tokenizers BPE model;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `TokSequence` object, used as the input and output object of the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `complete_sequence` method, which automatically fills in the uninitialized attributes of a `TokSequence` (ids, tokens);
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `tokens_to_events` renamed `_ids_to_tokens`, and new recursive id / token / byte conversion methods;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 **Tokens are now saved and loaded with the `ids` key (previously `tokens`)**;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 Tokenization files moved to a dedicated `tokenizations` module;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 `decompose_bpe` method renamed `decode_bpe`;
* d5201287b93fd42121b76b39e08cabf575e9bdcd `tokenize_dataset` now allows applying BPE afterwards.
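The new `__getitem__` behavior (accepting both ids and tokens) can be sketched with two mirrored dictionaries. This is an illustrative standalone class with hypothetical names, not the actual `MIDITokenizer` internals:

```python
# Sketch: a lookup object whose __getitem__ serves both directions.
# An int id returns its token; a str token returns its id.
class DualLookup:
    def __init__(self, tokens):
        self._token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self._id_to_token = {i: tok for tok, i in self._token_to_id.items()}

    def __getitem__(self, item):
        if isinstance(item, int):
            return self._id_to_token[item]
        return self._token_to_id[item]

vocab = DualLookup(["PAD_None", "BOS_None", "Pitch_60"])
print(vocab[2])           # Pitch_60
print(vocab["Pitch_60"])  # 2
```

Keeping both dictionaries in sync is cheap and makes conversion in either direction an O(1) lookup, which is handy when debugging model outputs.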
## Compatibility
Tokens and tokenizers from v1.4.3 and earlier remain compatible; this update does not change any of the tokenizations themselves.
However, you will need to adapt your saved token files in order to load them, and to update some of your code for the new changes:
* Tokens are now saved and loaded with the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with the `json` module and rewrite them under the `ids` key instead;
* `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding the tokens both as tokens (str) and as ids (int). It previously returned token ids; you can now get them from the `.ids` attribute, e.g. `tokseq.ids`;
* `Vocabulary` class deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now integrated directly in `MIDITokenizer`;
* For all tokenizers, the `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
* `decompose_bpe` method renamed `decode_bpe`.
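The `ids` key migration above can be done with a small script. This is a sketch assuming each saved file is a JSON object with a top-level `tokens` key; adapt it if your files are structured differently:

```python
import json
from pathlib import Path

def migrate_token_file(path):
    """Rename the legacy 'tokens' key to 'ids' in a saved token file.

    Assumes the file is a JSON object with a top-level 'tokens' key;
    files already using 'ids' are left untouched.
    """
    path = Path(path)
    data = json.loads(path.read_text())
    if "tokens" in data and "ids" not in data:
        data["ids"] = data.pop("tokens")
        path.write_text(json.dumps(data))
    return data
```

You could run this over a directory with `for p in Path("my_tokens").glob("*.json"): migrate_token_file(p)`. Keep a backup of the originals before migrating, as the script rewrites files in place.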
## Bug reports
Big changes like these can hide bugs. We carefully checked that all methods pass the previous tests, while also assessing the robustness of the new ones. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to fix them as quickly as possible.