Wordfreq


2.3

- Python 3.5 is the oldest maintained version of Python, and we have stopped
claiming support for earlier versions.

- Updated to langcodes 2.0.

- Deprecated the `match_cutoff` parameter, which was intended for situations
where we need to approximately match a language code, but was not usefully
configurable in those situations.
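
A hedged sketch of the deprecation from the caller's side. `get_frequency_dict` is assumed here to be one of the functions that accepted `match_cutoff`; the exact signatures may differ:

```python
from wordfreq import get_frequency_dict

# Assumption: match_cutoff was accepted here for approximate language-code
# matching. Passing it now emits a DeprecationWarning and has no useful
# effect; approximate matching is handled by langcodes instead.
freqs = get_frequency_dict('en', match_cutoff=30)  # warns (assumed behavior)
freqs = get_frequency_dict('en')                   # preferred
```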

2.2.2

Library change:

- Fixed an incompatibility with the newly released `msgpack 1.0`.

2.2.1

Library changes:

- Relaxed the version requirement on the 'regex' dependency, allowing
compatibility with spaCy.

The range of regex versions that wordfreq now allows is from 2017.07.11 to
2018.02.21. No changes to word boundary matching were made between these
versions.

- Fixed calling `msgpack.load` with a deprecated parameter.
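
The deprecated parameter isn't named above. A likely sketch, assuming it was msgpack's `encoding` parameter (deprecated in favor of `raw=False`); the file name is hypothetical:

```python
import msgpack

# Before: msgpack.load(stream, encoding='utf-8') triggered a
# DeprecationWarning on recent msgpack and was removed in msgpack 1.0.
# After: raw=False asks msgpack to decode byte strings to str, which is
# the modern equivalent of encoding='utf-8'.
with open('frequencies.msgpack', 'rb') as stream:  # hypothetical file
    data = msgpack.load(stream, raw=False)
```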

2.2

Library change:

- While the @ sign is usually considered a symbol and not part of a word, there
is a case where it acts like a letter. It's used in one way of writing
gender-neutral words in Spanish and Portuguese, such as "l@s niñ@s". The
tokenizer in wordfreq will now allow words to end with "@" or "@s", so it
can recognize these words.
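
For illustration, a minimal sketch of the new behavior using wordfreq's top-level `tokenize` function; the outputs in the comments are what the description implies, not a verified run:

```python
from wordfreq import tokenize

# On wordfreq >= 2.2, a trailing "@" or "@s" stays attached to the word,
# so gender-neutral Spanish spellings survive tokenization intact.
print(tokenize('l@s niñ@s', 'es'))  # expected: ['l@s', 'niñ@s']
```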

Data changes:

- Updated the data from Exquisite Corpus to filter the ParaCrawl web crawl
better. ParaCrawl provides two metrics (Zipporah and Bicleaner) for the
goodness of its data, and we now filter it to only use texts that get
positive scores on both metrics.

- The input data includes the change to tokenization described above, giving
us word frequencies for words such as "l@s".

2.1

Data changes:

- Updated to the data from the latest Exquisite Corpus, which adds the
ParaCrawl web crawl and updates the OpenSubtitles data to OpenSubtitles 2018
- Added small word list for Latvian
- Added large word list for Czech
- The Dutch large word list once again has 5 data sources

Library changes:

- The output of `word_frequency` is rounded to three significant digits. This
provides friendlier output, and better reflects the precision of the
underlying data anyway (see the sketch after this list).

- The MeCab interface can now look for Korean and Japanese dictionaries
in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them
when they are installed from source.
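
A minimal sketch of the rounding behavior mentioned above; the words and values are illustrative:

```python
from wordfreq import word_frequency

# word_frequency returns a word's frequency as a proportion of the
# language's word list. Since 2.1 the result is rounded to three
# significant digits, so a raw value like 3.1622776601683795e-05 comes
# back as 3.16e-05.
print(word_frequency('the', 'en'))
print(word_frequency('zzzznotaword', 'en'))  # 0.0 for unseen words
```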

2.0.1

Fixed edge cases that inserted spurious token boundaries when Japanese text is
run through `simple_tokenize`, because of a few characters that don't match any
of our "spaceless scripts".

It is not a typical situation for Japanese text to be passed through
`simple_tokenize`, because Japanese text should instead use the
Japanese-specific tokenization in `wordfreq.mecab`.

However, some downstream uses of wordfreq have justifiable reasons to pass all
terms through `simple_tokenize`, even terms that may be in Japanese, and in
those cases we want to detect only the most obvious token boundaries.

In this situation, we no longer try to detect script changes, such as between
kanji and katakana, as token boundaries. This particularly allows us to keep
together Japanese words where ヶ appears between kanji, as well as words that
use the iteration mark 々.
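
A minimal sketch of the fixed behavior, assuming the top-level `simple_tokenize` import; the expected outputs follow from the description above rather than a verified run:

```python
from wordfreq import simple_tokenize

# 時々 ("sometimes") uses the iteration mark 々, and 霞ヶ関 ("Kasumigaseki")
# has ヶ between kanji. On 2.0.1+, neither character is treated as a
# script-change boundary, so each word should come back as a single token.
print(simple_tokenize('時々'))    # expected: ['時々']
print(simple_tokenize('霞ヶ関'))  # expected: ['霞ヶ関']
```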

This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)
