Semiformal

Latest version: v0.7.0

0.7.0

- Added an emoji token type.
- Removed the “Symbol - other” class (which includes emoji and other characters that should be tokenized on their own) from `possible_word_chars`, the set of characters that can be tokenized as part of a word when they occur next to letters.
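
To make the character class concrete, the sketch below uses Python's standard `unicodedata` module to check whether a character falls in the Unicode general category `So` (Symbol, other), which is assumed here to be the class the changelog entry refers to. The helper name `is_symbol_other` is hypothetical and is not part of semiformal; the library's own classification lives in its C rules.

```python
import unicodedata


def is_symbol_other(ch: str) -> bool:
    """Return True if ch is in Unicode general category So (Symbol, other).

    Hypothetical helper for illustration only; not semiformal's API.
    """
    return unicodedata.category(ch) == "So"


# An emoji, a symbol, a letter, a math symbol, and a hyphen-minus.
samples = ["😀", "©", "a", "+", "-"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat, "Symbol-other" if is_symbol_other(ch) else "other class")
```

Under this reading, a character such as `😀` next to letters would no longer be absorbed into the neighboring word and would instead be emitted as its own token.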

0.4.9

Fast tokenizer/lexer written in C for semiformal Unicode text, using Unicode's TR-29 segmentation rules (https://unicode.org/reports/tr29/) plus rules for some web text types such as URLs and emails.

- Handles many cases like hyphenated words, acronyms, URLs, emails, etc.
- Does not try to disambiguate sentence boundaries and abbreviations or impose many opinions at this stage; it focuses on lexical units. Whitespace is preserved, so the original text can be reconstructed from the tokens and abbreviations can be derived in postprocessing.
- Ideograms and Hangul syllables are simply split into individual characters and syllables, respectively, and should be merged in postprocessing using a modeled approach.
- Usable as a standalone C library through [clib](https://github.com/clibs/clib) or through Python with `pip install semiformal`, which should download wheels without requiring C compilation.
- Updating the tokenizer's pattern-based rules at https://github.com/goodcleanfun/semiformal triggers a new build of the underlying C FSA using [re2c](https://github.com/skvadrik/re2c/).
- Additional tokenizers/lexers can be built as needed using the template repo https://github.com/goodcleanfun/tokenizer, which uses a [copier](https://github.com/copier-org/copier) template.
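
The two postprocessing ideas in the list above, reconstructing the original text and merging per-character ideogram tokens, can be sketched in plain Python. This is a minimal sketch under stated assumptions: tokens are modeled as `(text, token_type)` tuples and the type names `"ideogram"`, `"whitespace"`, and `"word"` are hypothetical; neither is confirmed to match semiformal's actual Python output. The merge step also joins every adjacent run unconditionally, which is a naive stand-in for the modeled approach the description recommends.

```python
from typing import List, Tuple

# Hypothetical token shape: (text, token_type). Not semiformal's real API.
Token = Tuple[str, str]


def reconstruct(tokens: List[Token]) -> str:
    """Rebuild the original text; works because whitespace tokens are kept."""
    return "".join(text for text, _ in tokens)


def merge_runs(tokens: List[Token], run_type: str = "ideogram") -> List[Token]:
    """Naively merge adjacent tokens of run_type into a single token.

    A real pipeline would use a segmentation model to decide which
    neighbors belong together instead of merging every run.
    """
    merged: List[Token] = []
    for text, tok_type in tokens:
        if merged and tok_type == run_type and merged[-1][1] == run_type:
            merged[-1] = (merged[-1][0] + text, tok_type)
        else:
            merged.append((text, tok_type))
    return merged


# Hand-written example tokens (assumed output shape, not produced by the library).
tokens = [("我", "ideogram"), ("们", "ideogram"), (" ", "whitespace"), ("ok", "word")]
merged = merge_runs(tokens)
print(merged)
print(reconstruct(merged) == reconstruct(tokens))
```

Because merging only concatenates neighboring token texts, the reconstructed string is unchanged by it, so the round-trip property survives postprocessing.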
