Somajo

Latest version: v2.4.3

2.0.0

New features and improvements

- New API: Use the new SoMaJo class instead of Tokenizer and
SentenceSplitter. The old API is still supported for now but
issues deprecation warnings.
- Speed-up: Due to a new internal representation of the input text
during processing (as a doubly linked list of Token objects),
tokenization is now two to three times faster.
- Incremental and parallel processing of XML: If a sensible set of
eos_tags is specified, the XML input is processed incrementally
(allowing for arbitrarily large XML input), and processing can
also be parallelized.
- New option --strip-tags to suppress the output of XML tags.
- Support for textual representations of emojis (:smile:,
:stuck_out_tongue_winking_eye:, etc.).
- Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
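
The speed-up item above credits a doubly linked list of Token objects. The following is a minimal sketch of why that representation helps, not SoMaJo's actual implementation: the class names, attributes, and the split method are hypothetical. Splitting a node only rewires its neighbours, so no other element has to be shifted.

```python
from typing import Iterator, Optional


class Token:
    """Hypothetical token node; not SoMaJo's actual Token class."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.prev: Optional["Token"] = None
        self.next: Optional["Token"] = None


class TokenList:
    """Minimal doubly linked list of Token nodes."""

    def __init__(self) -> None:
        self.head: Optional[Token] = None
        self.tail: Optional[Token] = None

    def append(self, token: Token) -> None:
        # Link the new node behind the current tail.
        token.prev, token.next = self.tail, None
        if self.tail is None:
            self.head = token
        else:
            self.tail.next = token
        self.tail = token

    def split(self, token: Token, index: int) -> None:
        """Split one node into two at a character offset by relinking."""
        right = Token(token.text[index:])
        right.prev, right.next = token, token.next
        token.text = token.text[:index]
        if token.next is None:
            self.tail = right
        else:
            token.next.prev = right
        token.next = right

    def __iter__(self) -> Iterator[Token]:
        node = self.head
        while node is not None:
            yield node
            node = node.next


tokens = TokenList()
for chunk in ["Hallo", "Welt!"]:
    tokens.append(Token(chunk))

# Detach the trailing "!" from "Welt!" without touching any other node.
tokens.split(tokens.tail, 4)
texts = [t.text for t in tokens]
```

With a plain Python list, the same split would require inserting into the middle of the list, which moves every later element.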

Breaking changes

- Removed the tokenizer script (deprecated since version 1.5.0,
released in October 2017). Use somajo-tokenizer instead.
- Language codes contain the tokenization guideline: "de_CMC" instead
of "de" and "en_PTB" instead of "en".
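
Code that still passes the old language codes could be migrated with a small lookup. This is a sketch only: the helper name upgrade_language_code is hypothetical and not part of SoMaJo, and the mapping contains just the two codes named in this entry.

```python
# Mapping taken from the 2.0.0 breaking-changes entry.
LEGACY_LANGUAGE_CODES = {"de": "de_CMC", "en": "en_PTB"}


def upgrade_language_code(code: str) -> str:
    """Hypothetical helper: map a pre-2.0.0 code to its 2.0.0 form.

    Codes that are not in the mapping (including codes that are
    already in the new form) are returned unchanged.
    """
    return LEGACY_LANGUAGE_CODES.get(code, code)


upgraded = [upgrade_language_code(c) for c in ["de", "en", "de_CMC"]]
```

The result could then be passed wherever the new API expects a language code.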

1.11.0

- XML sentence splitting: Added hr tag to default sentence breaks
- Recognize Reddit links in shorthand notation
- Improved robustness of XML processing

1.10.7

- Make recognition of gender star case insensitive
- Fix problem with “nasty” character as last character of text unit
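
To illustrate what case-insensitive gender star recognition means, here is a deliberately simplified sketch. The pattern below is an assumption made for illustration and is not the expression SoMaJo uses; it only covers forms ending in "*in" or "*innen".

```python
import re

# Simplified, hypothetical pattern: word, asterisk, "in" or "innen".
# re.IGNORECASE lets the suffix match in upper case as well.
GENDER_STAR = re.compile(r"\w+\*in(?:nen)?\b", re.IGNORECASE)

text = "Die Student*innen und LEHRER*INNEN treffen eine Kolleg*in."
matches = GENDER_STAR.findall(text)
```

Without the IGNORECASE flag, the same pattern would miss the all-caps form.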

1.10.6

- Recognize gender star.
- Improve recognition of lists of numbers, section numbers and IPv4
addresses

1.10.5

- Correctly tokenize flags followed by a variation selector.
- Delete variation selector that occurs on its own.
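
One possible reading of the two items above, sketched under stated assumptions: a variation selector (U+FE0E or U+FE0F) that directly follows another character, such as a flag, stays attached to it, while a selector at the start of the text or after whitespace is dropped. The function name and the regular expression are hypothetical and do not reproduce SoMaJo's internals.

```python
import re

# U+FE0E and U+FE0F are the text and emoji presentation selectors.
# (?<!\S) matches only at the start of the text or after whitespace,
# i.e. where the selector has no base character to attach to.
ORPHAN_SELECTOR = re.compile(r"(?<!\S)[\uFE0E\uFE0F]+")


def drop_orphan_variation_selectors(text: str) -> str:
    """Remove variation selectors that occur on their own."""
    return ORPHAN_SELECTOR.sub("", text)


flag = "\U0001F1E9\U0001F1EA\uFE0F"  # regional indicators D + E, then VS16
cleaned_flag = drop_orphan_variation_selectors(flag)
cleaned_orphan = drop_orphan_variation_selectors("a \uFE0F b")
```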

1.10.4

- Bugfix related to the --version option.
