Tokenizer

Latest version: v3.4.5

Safety actively analyzes 685670 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 4 of 8

2.2.0

Fixed ``correct_spaces()`` to handle compounds such as *Atvinnu-, nýsköpunar- og ferðamálaráðuneytið* and *bensínstöðvar, -dælur og -tankar*.

2.1.0

* Changed handling of periods at end of sentences if they are a part of an abbreviation. Now, the period is kept attached to the abbreviation, not split off into a separate period token, as before.

2.0.7

* Added ``TOK.COMPANY`` token type
* Fixed a few abbreviations
* Renamed parameter ``text`` to ``text_or_gen`` in functions that accept a string or a string iterator

2.0.6

Fixed handling of abbreviations such as *m.v.* (*miðað við*) that should not start a new sentence even if the following word is capitalized.

2.0.5

* Fixed bug where single uppercase letters were erroneously being recognized as abbreviations, causing prepositions such as 'Í' and 'Á' at the beginning of sentences to be misunderstood in ReynirPackage
* Added several abbreviations
* Silently correct connectors *-og* and *-eða* that are affixed to words, splitting them up in the tokenization process; *Tösku-og hanskabúðin* thus becomes a single token with correct spacing: *Tösku- og hanskabúðin*

2.0.4

* Added imperfect abbreviations (*amk.*, *osfrv.*)
* Recognized *klukkan hálf tvö* as a ``TOK.TIME``

Page 4 of 8

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.