This is the "handle numbers better" release.

Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate for any individual number.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (numbers with lower
leading digits are more frequent) and a special frequency distribution for
4-digit numbers that look like years (2010 is more frequent than 1020).
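
As a rough illustration of the idea, here is a minimal sketch of splitting one shape's total frequency across individual numbers using Benford's law. It is not wordfreq's actual code: `benford_leading_digit_prob` and `share_of_shape` are hypothetical names, the digits after the first are assumed to be uniformly distributed, numbers with a leading zero are simply given a share of 0, and the special handling of year-like numbers is left out.

```python
import math


def benford_leading_digit_prob(digit: int) -> float:
    """Benford's law: P(leading digit = d) = log10(1 + 1/d) for d in 1..9."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


def share_of_shape(number: str) -> float:
    """Hypothetical helper: the fraction of a shape's frequency given to `number`.

    Assumption: the leading digit follows Benford's law and every remaining
    digit is uniform over 0-9. Leading zeros get a share of 0 in this sketch.
    """
    if not (number.isascii() and number.isdigit()) or len(number) < 2:
        raise ValueError("expected a string of two or more ASCII digits")
    lead = int(number[0])
    if lead == 0:
        return 0.0
    return benford_leading_digit_prob(lead) / 10 ** (len(number) - 1)


# Made-up total frequency for the 4-digit shape "0000", for illustration only.
shape_freq = 1e-3
for n in ("1500", "9500"):
    print(n, shape_freq * share_of_shape(n))
```

Under these assumptions the shares for all numbers of one length add up to 1, so the shape's total frequency is preserved while numbers starting with 1 receive more of it than numbers starting with 9.
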

More changes related to digits:

- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
as "0000").
- `lossy_tokenize` no longer replaces digit sequences with 0s. That
replacement now happens internally in the `word_frequency` function, so we can
look at the values of the digits before they're replaced.
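
For readers unfamiliar with the "smashed" form, the following sketch shows what replacing digit sequences with 0s looks like. It is an illustration only: `smash_digits` is a hypothetical name, and the pattern assumes a multi-digit sequence is simply two or more consecutive ASCII digits, which may be simpler than the pattern wordfreq itself uses.

```python
import re

# Assumption: a multi-digit sequence is two or more consecutive ASCII digits.
MULTI_DIGIT = re.compile(r"[0-9]{2,}")


def smash_digits(token: str) -> str:
    """Replace each run of 2+ digits with the same number of zeros (its shape)."""
    return MULTI_DIGIT.sub(lambda m: "0" * len(m.group()), token)


for token in ("2010", "1020", "7", "route66"):
    # The original token is still available here, so the digit values can be
    # inspected before the shape is used as a lookup key.
    print(token, "->", smash_digits(token))
```

Because "2010" and "1020" share one shape, a lookup by shape alone cannot tell them apart, which is why the digit values need to be examined before the replacement happens.
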

Other changes:

- wordfreq is now developed using `poetry` as its package manager, and with
`pyproject.toml` as the source of configuration instead of `setup.py`.
- The minimum version of Python supported is 3.7.
- Type information is exported using `py.typed`.