Tokenizer

Latest version: v3.4.5

2.0.3

* Fixed bug in `detokenize()` where abbreviations, domains and e-mails containing periods were wrongly split

2.0.2

* Spelled-out day ordinals are no longer included as a part of `TOK.DATEREL` tokens. Thus, *þriðji júní* is now a `TOK.WORD` followed by a `TOK.DATEREL`. *3. júní* continues to be parsed as a single `TOK.DATEREL`.

2.0.1

* Order of abbreviation meanings within the ``token.val`` field made deterministic. Abbreviations are listed in the same order in ``token.val`` as they appear in the ``Abbrev.conf`` file.
* Fixed bug in measurement unit handling
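
The deterministic ordering described for 2.0.1 can be illustrated with a minimal sketch. This is not the library's actual implementation: the ``load_meanings()`` helper, the sample data and the simplified one-pair-per-line config format are all hypothetical, and the real ``Abbrev.conf`` syntax is not reproduced here. The point is only that collecting meanings in a list, rather than a set, preserves file order.

```python
from collections import defaultdict

# Hypothetical, simplified config text: one "abbreviation = meaning" pair
# per line. The real Abbrev.conf format is not reproduced here.
SAMPLE_CONF = """\
# comment lines and blank lines are skipped
x. = first meaning
y. = only meaning
x. = second meaning
x. = first meaning
"""


def load_meanings(conf_text):
    """Map each abbreviation to its meanings, in the order they appear."""
    meanings = defaultdict(list)
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        abbrev, _, meaning = line.partition("=")
        abbrev, meaning = abbrev.strip(), meaning.strip()
        # A list keeps insertion order, so every run yields the meanings
        # in file order; duplicates are dropped without reordering.
        if meaning not in meanings[abbrev]:
            meanings[abbrev].append(meaning)
    return dict(meanings)


meanings = load_meanings(SAMPLE_CONF)
print(meanings["x."])  # ['first meaning', 'second meaning']
```

A set would also remove duplicates, but its iteration order for strings can differ between interpreter runs, which is the kind of nondeterminism an ordered container avoids.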

2.0.0

* Added command line tool
* Added ``split_into_sentences()`` and ``detokenize()`` functions
* Removed ``convert_telno`` option
* Splitting of coalesced tokens made more robust
* Added ``TOK.SSN``, ``TOK.MOLECULE``, ``TOK.USERNAME`` and ``TOK.SERIALNUMBER`` token kinds
* Abbreviations can now have multiple meanings

1.4.1

* Abbreviations of verbs (*dags.*, *f.*, *d.*) now return the verb stem as the associated word.
* Source code formatting improved.
* Preparations for more fine-grained control of tokenizer behavior via configuration flags.

1.4.0

* Added configuration option parameters to the `tokenizer.tokenize()` function, controlling the conversion of numbers and telephone numbers to canonical/Icelandic format, and the handling of 'kludgy' ordinals (*3ji*, *2ja*).
* Added several abbreviations.
* Minor performance enhancements.
* Added a number of test cases.
