Matching of items that contain “+” or “&” or are written in camel case has been optimized. The tokenizer now runs roughly three to four times faster.
1.2.0
Two new options have been added: With -s/--paragraph_separator, you can specify how paragraphs are delimited in the input data, i.e. by empty lines or by single newlines. With --parallelization, you can use a pool of worker processes to speed up tokenization.
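To illustrate the difference between the two paragraph delimiters, here is a minimal Python sketch. It is not SoMaJo's own code: the function name split_paragraphs and the mode names "empty_lines" and "single_newlines" are assumptions chosen for this example.

```python
import re


def split_paragraphs(text, separator="empty_lines"):
    """Split raw text into paragraphs (illustrative helper, not SoMaJo's API).

    separator: "empty_lines" treats one or more blank lines as the boundary;
    "single_newlines" treats every newline as the boundary.
    Both mode names are hypothetical labels for this sketch.
    """
    if separator == "empty_lines":
        # A boundary is a newline, optional whitespace, and another newline.
        chunks = re.split(r"\n\s*\n", text)
    elif separator == "single_newlines":
        chunks = text.split("\n")
    else:
        raise ValueError("unknown separator: " + separator)
    # Drop empty chunks and surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph,\nstill first.\n\nSecond paragraph."
by_empty_lines = split_paragraphs(sample, "empty_lines")
by_newlines = split_paragraphs(sample, "single_newlines")
```

With the sample above, the first mode yields two paragraphs and the second yields three, because the line break inside the first paragraph also counts as a boundary.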
1.1.2
The example in the documentation is now self-contained: Sample input has been added and the output is printed.
1.1.1
The link in the Evaluation section of the Readme now points to the complete gold standard data.
1.1.0
SoMaJo can now output additional information about the original spelling of the tokens, i.e. whether a token was followed by whitespace and whether it contained internal whitespace (according to the tokenization guidelines, things like “: )” get normalized to “:)”). To use this feature, pass the -e option to the tokenizer script.
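The kind of information described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SoMaJo's implementation or output format: the TokenInfo class, its field names, and the annotate function are hypothetical, and the token boundaries are supplied by hand as character spans.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TokenInfo:
    # Hypothetical record; field names are assumptions for this sketch.
    text: str                                # normalized token, e.g. ":)"
    space_after: bool                        # was the token followed by whitespace?
    original_spelling: Optional[str] = None  # set only if normalization changed it


def annotate(text: str, spans: List[Tuple[int, int]]) -> List[TokenInfo]:
    """Build token records from (start, end) character spans of the input."""
    tokens = []
    for start, end in spans:
        original = text[start:end]
        # Normalize by removing token-internal whitespace (": )" -> ":)").
        normalized = "".join(original.split())
        tokens.append(
            TokenInfo(
                text=normalized,
                space_after=end < len(text) and text[end].isspace(),
                original_spelling=original if original != normalized else None,
            )
        )
    return tokens


tokens = annotate("Hi : ) there", [(0, 2), (3, 6), (7, 12)])
```

In this example the middle token is stored as “:)” while its original spelling “: )” is kept alongside it, and the last token is marked as not followed by whitespace.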
1.0.3
This version works around a bug in the regex module that caused exponential runtimes on certain inputs.