- Improvements to tokenization of words containing numbers (e.g. COVID-19-Pandemie, FFP2-Maske).
2.2.3
- Improvements to tokenization: Roman ordinals, abbreviation “Art.” preceding a number, certain units of measurement at the end of a sentence (e.g. km/h).
- New feature: Prune XML tags and their contents from the input before tokenization (via the command line option --prune TAGNAME1 --prune TAGNAME2 … or by passing prune_tags=["TAGNAME1", "TAGNAME2", …] to tokenize_xml or tokenize_xml_file). This can be useful when processing HTML files, e.g. for removing any <script> and <style> tags from the input.
2.1.6
- Recognize more URLs without protocol. - Fix a small bug in implementation of doubly linked lists.