- Prefer the 'fork' method for creating the worker processes for parallel tagging, if it is supported by the operating system. This is much faster than the 'spawn' method that is the default on some non-Linux systems (issue 14).
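The preference described above can be sketched with the standard multiprocessing module. This is a minimal illustration, not the tagger's actual code: the names preferred_context and square are hypothetical, and the only assumption is that 'fork' should be chosen whenever the platform lists it as available.

```python
import multiprocessing


def preferred_context():
    """Return a 'fork' context if the platform supports it, else the default one.

    Hypothetical helper for illustration only.
    """
    if "fork" in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context("fork")
    # Fall back to whatever the platform uses by default (e.g. 'spawn').
    return multiprocessing.get_context()


def square(x):
    """Stand-in for the per-chunk work a worker process would do."""
    return x * x


if __name__ == "__main__":
    ctx = preferred_context()
    # Worker processes are created with the selected start method.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))
```

Using a context object instead of multiprocessing.set_start_method keeps the choice local to the code that creates the pool and leaves the global default untouched.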
1.8.0
- Add option --use-nfkc to the command line interface and option use_nfkc to the constructor of ASPTagger (issue 11). If this option is used, the internal representation of the input data uses Unicode normalization form NFKC. This can be useful for social media input that misuses mathematical symbols for their typographic effects (e.g. „𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘“ instead of „Impfausweis“).
- Add option --sentence-tag to specify an XML tag in the input data that marks sentence boundaries (issue 12). This is particularly useful in combination with the --sentence-tag option of SoMaJo.
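The effect of NFKC on such input can be shown with Python's unicodedata module. This is a small sketch of the normalization itself, not of how the tagger applies it internally; the helper to_bold_fraktur is hypothetical and only exists to build styled sample input from the Mathematical Bold Fraktur block.

```python
import unicodedata


def to_bold_fraktur(text):
    """Map ASCII letters to Mathematical Bold Fraktur (hypothetical helper)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D56C + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D586 + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


styled = to_bold_fraktur("Impfausweis")

# NFKC replaces compatibility characters with their plain equivalents.
normalized = unicodedata.normalize("NFKC", styled)

print(styled != "Impfausweis")  # True: the styled string uses other code points
print(normalized)               # Impfausweis
```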
1.7.3
- Use less memory when loading a model if the ijson library is present and the Python version is at least 3.7 (at least 3.6 for CPython) (issue 9).
- Restructured code for parallel tagging (issue 8).
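The general pattern of using ijson when it is installed and falling back to the json module otherwise might look like the sketch below. The file layout (a top-level JSON array) and the function name iter_records are assumptions for illustration; the actual model format and loading code are not described here.

```python
import json

try:
    import ijson  # optional dependency for streaming JSON parsing
except ImportError:
    ijson = None


def iter_records(path):
    """Yield the elements of a top-level JSON array (hypothetical helper).

    With ijson, elements are parsed one at a time, so the whole document
    never has to be held in memory at once. Without it, the file is
    loaded completely by the json module.
    """
    if ijson is not None:
        with open(path, "rb") as f:
            # "item" is ijson's prefix for the elements of a top-level array.
            yield from ijson.items(f, "item")
    else:
        with open(path, encoding="utf-8") as f:
            yield from json.load(f)
```

Callers can iterate over iter_records(path) in the same way in both cases; only the peak memory use differs.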
1.7.2
- Bugfix: Do not choke on chunks of XML that do not contain actual word tokens (usually at the end of a file).
- Updated regular expressions for emojis, emoticons, numbers and URLs.
1.7.1
- Fixed an XML-related bug in STTS_IBK_postprocessor.
- Fixed a minor bug in emoticon regex.
1.7.0
- Added Reddit links and Reddit-specific emoticons.
- Moved command-line interface to cli.py.
- Helper script for tagging multiple files (somewe-tagger-multifile).
- Postprocessing script for some deterministic tagging decisions in STTS_IBK, e.g. URLs, emoticons, etc. (STTS_IBK_postprocessor).