## TL;DR
This major update brings:
* The integration of the Hugging Face [🤗tokenizers](https://github.com/huggingface/tokenizers) library as the Byte Pair Encoding (BPE) backend. **BPE is now 30 to 50 times faster, for both training and encoding!** 🙌
* A new **`TokSequence`** object to represent tokens! This object holds the same sequence in several forms: tokens (strings), ids (integers to pass to models), `Event`s, and bytes (used internally for BPE).
* Many internal changes: several methods and variables have been renamed, which will require you to update some of your code (details below).
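To picture what `TokSequence` carries, here is a minimal sketch of the idea as a standalone dataclass (a hypothetical simplification, not the actual class definition, which lives in miditok and holds more logic):

```python
from dataclasses import dataclass, field

# Hypothetical simplification: one object carries every representation
# of the same token sequence.
@dataclass
class TokSequenceSketch:
    tokens: list = field(default_factory=list)  # human-readable tokens (str)
    ids: list = field(default_factory=list)     # integer ids to feed a model
    events: list = field(default_factory=list)  # Event objects
    bytes: str = ""                             # byte form, used internally for BPE

seq = TokSequenceSketch(tokens=["Pitch_60", "Velocity_96"], ids=[12, 45])
print(seq.ids)  # [12, 45]
```

The point of bundling the forms together is that a tokenizer can fill in whichever representations are missing on demand, rather than forcing you to juggle parallel lists.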
## Changes
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a `Vocabulary` class is being replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a New `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
* a9b82e4ffb0f77b1541c5a236af4377ea156d77a `__getitem__` now handles both ids (int) and tokens (str), with multi-vocab;
* 36bf0f66a392835e492a9f7decf7e382662f30aa Some methods of `MIDITokenizer` meant to be used internally are now protected;
* a2db7b9ed173149c34e9f01e3d873317a185288f New training method with 🤗tokenizers BPE model;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `TokSequence` object, used as the input and output object of the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `complete_sequence` method, which automatically fills in the uninitialized attributes of a `TokSequence` (ids, tokens);
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 `tokens_to_events` renamed `_ids_to_tokens`, and new recursive id / token / byte conversion methods;
* 9befb8d76c90b24533f8badbff9b368bf80e6da5 **Tokens are now saved and loaded with the `ids` key (previously `tokens`)**;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 Tokenization files moved to a dedicated `tokenizations` module;
* cddd29cc2939fda706f502c9be36ccfa06f5dd20 `decompose_bpe` method renamed `decode_bpe`;
* d5201287b93fd42121b76b39e08cabf575e9bdcd `tokenize_dataset` now allows applying BPE afterwards.
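The new `__getitem__` behavior (accepting both ids and tokens) can be sketched with two mirrored dictionaries. This is an illustrative standalone class with hypothetical names, not the actual `MIDITokenizer` internals:

```python
# Sketch: a lookup object whose __getitem__ serves both directions.
# An int id returns its token; a str token returns its id.
class DualLookup:
    def __init__(self, tokens):
        self._token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self._id_to_token = {i: tok for tok, i in self._token_to_id.items()}

    def __getitem__(self, item):
        if isinstance(item, int):
            return self._id_to_token[item]
        return self._token_to_id[item]

vocab = DualLookup(["PAD_None", "BOS_None", "Pitch_60"])
print(vocab[2])           # Pitch_60
print(vocab["Pitch_60"])  # 2
```

Keeping both dictionaries in sync is cheap and makes conversion in either direction an O(1) lookup, which is handy when debugging model outputs.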
## Compatibility
Tokens and tokenizers from v1.4.3 and earlier remain compatible; this update does not change any of the tokenizations themselves.
However, you will need to adapt your saved token files in order to load them, and to update some of your code for the new changes:
* Tokens are now saved and loaded with the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with the `json` module and rewrite them under the `ids` key instead;
* `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding the tokens both as tokens (str) and as ids (int). It previously returned token ids; you can now get them from the `.ids` attribute, e.g. `tokseq.ids`;
* `Vocabulary` class deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now integrated directly in `MIDITokenizer`;
* For all tokenizers, the `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
* `decompose_bpe` method renamed `decode_bpe`.
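The `ids` key migration above can be done with a small script. This is a sketch assuming each saved file is a JSON object with a top-level `tokens` key; adapt it if your files are structured differently:

```python
import json
from pathlib import Path

def migrate_token_file(path):
    """Rename the legacy 'tokens' key to 'ids' in a saved token file.

    Assumes the file is a JSON object with a top-level 'tokens' key;
    files already using 'ids' are left untouched.
    """
    path = Path(path)
    data = json.loads(path.read_text())
    if "tokens" in data and "ids" not in data:
        data["ids"] = data.pop("tokens")
        path.write_text(json.dumps(data))
    return data
```

You could run this over a directory with `for p in Path("my_tokens").glob("*.json"): migrate_token_file(p)`. Keep a backup of the originals before migrating, as the script rewrites files in place.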
## Bug reports
Big changes like these can hide bugs. We carefully checked that all methods pass the previous tests, while also assessing the robustness of the new ones. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to fix them as quickly as possible.