Fast tokenizer/lexer written in C for semiformal Unicode text, implementing Unicode's TR-29 segmentation rules (https://unicode.org/reports/tr29/) along with patterns for web text constructs such as URLs and emails.
- Handles common cases such as hyphenated words, acronyms, URLs, and emails.
- Does not try to disambiguate sentence boundaries from abbreviations, or impose many opinions at this stage; it focuses purely on lexical units. Whitespace is preserved so the original text can be reconstructed from the tokens, and abbreviations can be derived in postprocessing (see the first sketch after this list).
- Ideograms and Hangul syllables are simply broken into individual characters and syllables, respectively; these should be merged in postprocessing using a modeled approach (see the second sketch after this list).
- Usable as a standalone C library via [clib](https://github.com/clibs/clib), or from Python with `pip install semiformal`, which should download prebuilt wheels without requiring C compilation.
- Updating the tokenizer's pattern-based rules at https://github.com/goodcleanfun/semiformal will trigger a new build of the underlying C FSA using [re2c](https://github.com/skvadrik/re2c/).
- Additional tokenizers/lexers can be built from the template repo https://github.com/goodcleanfun/tokenizer, which is set up as a [copier](https://github.com/copier-org/copier) template.
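
To make the whitespace-preserving contract concrete, here is a minimal Python sketch. The `tokenize` entry point and the `text`/`whitespace` token attributes are assumptions made for illustration only; check the package itself for the actual API.

```python
# Hypothetical usage sketch; the real semiformal Python API may differ.
from semiformal import tokenize  # assumed entry point, for illustration

text = "Dr. Smith emailed john@example.com about e-commerce."
tokens = tokenize(text)

# Whitespace is preserved, so the original text is recoverable by
# concatenating each token with its trailing whitespace:
assert "".join(tok.text + tok.whitespace for tok in tokens) == text

# Periods stay attached to units like "Dr.", so deciding whether a
# period marks an abbreviation or a sentence boundary is left to
# postprocessing.
```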
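The ideogram/Hangul bullet implies a merge step after tokenization. Below is an illustrative greedy merge, not something the library provides, where `score` stands in for a trained model that judges whether two adjacent character tokens belong to the same word:

```python
# Illustrative postprocessing sketch; not part of the semiformal library.

def merge_chars(chars, score, threshold=0.5):
    """Greedily merge adjacent character tokens left to right.

    score(a, b) stands in for a trained model estimating whether
    tokens a and b belong to the same word.
    """
    merged = []
    for ch in chars:
        if merged and score(merged[-1], ch) >= threshold:
            merged[-1] += ch   # extend the current unit
        else:
            merged.append(ch)  # start a new unit
    return merged

def toy_score(a, b):
    # Stand-in for model output: only "한" + "국" cohere here.
    return 1.0 if (a, b) == ("한", "국") else 0.0

print(merge_chars(["한", "국", "어"], toy_score))  # -> ['한국', '어']
```

A real implementation would replace `toy_score` with, e.g., a character n-gram or neural segmentation model.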