Sentencepiece

Latest version: v0.2.0


0.1.83

- Use the official Docker image to build tf_sentencepiece ops.
- Support tf 1.14.0 and tf 2.0.0-beta1.

0.1.82

Bug fix: fixed the behavior of the is_unknown method in the Python module.

0.1.81

0.1.9

Features:
- **--byte_fallback:** fall back the UNK token into UTF-8 byte sequences; 256 byte symbols are reserved in advance (https://arxiv.org/pdf/1909.03341.pdf). Note that --character_coverage must be set to less than 1.0, otherwise the byte-fallback tokens may not appear in the training data. See the sketch after this list.
- **BPE-dropout:** Implemented BPE-dropout (https://arxiv.org/abs/1910.13267). The sampling API is now available for the BPE model as well:
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.h#L287
- **--required_chars=chars:** Specify the set of Unicode chars that must be included in the final vocab.
- **--split_digits:** Split all digits (0-9) into separate pieces (disabled by default)
- **Denormalization:** Apply an extra normalization rule after decoding. The rule can be specified as a TSV file via the --denormalization_rule_tsv=file flag. Note that offset information may not always be preserved.
- **--train_extremely_large_corpus:** Train the unigram model from an extremely large corpus (> 10M sentences) to avoid integer overflow. Note that this increases memory usage; 300GB or more of memory may be necessary.
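
A minimal sketch of how the new training flags and the sampling API might be used from the Python module. The corpus path, model type, vocabulary size, and sampling parameters below are illustrative assumptions, not values from the release notes:

```python
import sentencepiece as spm

# Illustrative training run with the new flags; "corpus.txt" and all numeric
# values are assumptions.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=m --model_type=bpe --vocab_size=8000 "
    "--byte_fallback=true "
    "--character_coverage=0.9995 "  # must stay below 1.0 so byte pieces appear in training
    "--split_digits=true"
)

sp = spm.SentencePieceProcessor()
sp.Load("m.model")

# With byte_fallback, characters unseen at training time decompose into the
# reserved <0xNN> byte pieces instead of being mapped to <unk>.
print(sp.EncodeAsPieces("hello☃"))

# Subword sampling: with a BPE model this is BPE-dropout. alpha is the dropout
# probability and nbest_size is ignored for BPE (both values are assumptions).
print(sp.SampleEncodeAsPieces("New York 2021", -1, 0.1))
```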


Performance improvement:
- A 30%-50% performance improvement in the default unigram one-best tokenization.

New API:
- [Python] Added a Python-friendly API. The new API allows feeding any characters to user_defined_symbols during training; the old methods remain available (see the sketch after this list).
https://github.com/google/sentencepiece/tree/master/python#segmentation
- [C++] Added the interface to feed training data via arbitrary iterator object.
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_trainer.h#L40
- [C++] Added an interface to set a pre-tokenizer that specifies word boundaries. This is used as a word-boundary constraint when building the seed vocabulary, and is not used at inference time.
https://github.com/google/sentencepiece/blob/master/src/pretokenizer_for_training.h
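
A minimal sketch of the Python-friendly training API, following the pattern in the linked Python README. The toy corpus, vocabulary size, and user-defined symbols are illustrative assumptions:

```python
import io
import sentencepiece as spm

# Any iterator of sentences can feed the trainer; this toy corpus is an assumption.
sentences = [
    "SentencePiece is an unsupervised text tokenizer.",
    "It implements subword units such as BPE and unigram.",
    "The trainer can now consume an arbitrary sentence iterator.",
    "User defined symbols may contain any characters.",
]

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(sentences),
    model_writer=model,                     # the trained model is written to this stream
    vocab_size=60,                          # kept small for the toy corpus
    user_defined_symbols=["<sep>", "<cls>"],
)

# Load the model directly from the serialized proto and segment some text.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode("SentencePiece is an unsupervised text tokenizer.", out_type=str))
```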

0.1.8

- Feature: removed the dependency on external protobuf.
- Feature: added the (Encode|Decode)AsSerializedProto interface so the Python module can get full access to the SentencePieceText proto, including the byte offsets/alignments (see the sketch below).
- Feature: added the --treat_whitespace_as_suffix option to make the whitespace symbol _ a suffix of each word.
- Feature: added normalization rules to remove control characters in the default nmt_* normalizers.
- Minor fix: simplified the error messages.
- Minor fix: do not emit the full source path in LOG(INFO).
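
A minimal sketch of how the serialized-proto interface might be used from Python. The model filename is an assumption, and parsing the returned bytes assumes a sentencepiece_pb2 module generated from src/sentencepiece.proto is importable (generate it with protoc if your install does not provide one):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("m.model")  # "m.model": an assumed, previously trained model

# Returns the SentencePieceText proto serialized to bytes; it carries the
# pieces, their ids, and their byte offsets into the input text.
serialized = sp.EncodeAsSerializedProto("Hello world.")

# Assumption: sentencepiece_pb2 generated from sentencepiece.proto is available.
from sentencepiece import sentencepiece_pb2

spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(serialized)
for piece in spt.pieces:
    print(piece.piece, piece.id, piece.begin, piece.end)
```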

For more detail: https://github.com/google/sentencepiece/compare/v0.1.7...v0.1.8

0.1.7

- Deprecated: `--mining_sentence_size` and `--training_sentence_size`. All sentences are now loaded by default; `--input_sentence_size` can be specified to limit the number of sentences loaded (see the sketch below).
- Feature: added the `--unk_piece/--bos_piece/--eos_piece/--pad_piece` flags to change the surface representations of these special symbols.
- Bug fix: added the third_party directory as a CMake subdirectory.
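
A minimal sketch of a training invocation using the flags above. The corpus path, vocabulary size, sentence cap, and surface forms are illustrative assumptions:

```python
import sentencepiece as spm

# Illustrative only: "corpus.txt" and the numeric values are assumptions.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=m --vocab_size=8000 "
    "--input_sentence_size=1000000 "  # cap on the number of sentences loaded
    "--unk_piece=<UNK> --bos_piece=<S> --eos_piece=</S> --pad_piece=<PAD>"
)
```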

For more detail:
https://github.com/google/sentencepiece/compare/v0.1.6...v0.1.7
