- Bugfix: Removed trailing space from last token in
paragraph/sentence.
- SoMaJo should be run as 'somajo-tokenizer'. The 'tokenizer' command
is deprecated.
- XML entities (&amp;, &#75;, &#x7f;) are recognized as single tokens.
- Some abbreviations (usw., usf., etc., uvam.) indicate sentence
boundaries if they are followed by a potential sentence start.
- We also print a log message that indicates tokenization speed.
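
To illustrate the XML entity item above, here is a minimal sketch of what
treating entities as single tokens can look like. It is not SoMaJo's
actual implementation: the pattern `XML_ENTITY` and the helper
`split_keeping_entities` are hypothetical names chosen for this example,
and the whitespace split stands in for a real tokenizer.

```python
import re

# Assumed pattern (not taken from SoMaJo): named entities such as &amp;,
# decimal character references such as &#75;, and hexadecimal character
# references such as &#x7f;.
XML_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);")


def split_keeping_entities(text):
    """Split on whitespace, emitting every XML entity as its own token."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in XML_ENTITY.finditer(chunk):
            # Text before the entity becomes a separate token.
            if match.start() > pos:
                tokens.append(chunk[pos:match.start()])
            # The entity itself is never split any further.
            tokens.append(match.group())
            pos = match.end()
        # Text after the last entity, or the whole chunk if none matched.
        if pos < len(chunk):
            tokens.append(chunk[pos:])
    return tokens


print(split_keeping_entities("Tom &amp; Jerry&#75;x &#x7f;"))
```

A plain character-class tokenizer would otherwise break `&#x7f;` into
several pieces at `&`, `#` and `;`; matching the entity first keeps it
intact.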