Word-forms

2.1.0

Message from CubieDev

Hello again!

In the comments of pull request 10, chrisjbryant suggested using python-Levenshtein for computing similarity ratios between strings, and though I've been busy until now, I finally got around to testing it.
This pull request includes some scripts in the Markdown, in case you want to reproduce my tests and results. However, I would recommend focusing on the explanations and outputs rather than on the scripts themselves.

Is it equivalent?

I created the following file in the root folder of the project, so that it could read from `test_values.py`.
```python
#!/usr/bin/env python
# encoding: utf-8

from difflib import SequenceMatcher
from Levenshtein import ratio
import unittest


class TestWordForms(unittest.TestCase):
    """
    Simple TestCase for a specific input to output, one instance generated per test case for use in a TestSuite
    """

    def __init__(self, text_input: str, expected_output: str, description: str = ""):
        super().__init__()
        self.text_input = text_input
        self.expected_output = expected_output
        self.description = description

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def runTest(self):
        self.assertEqual(
            SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
            ratio(self.text_input, self.expected_output),
            self.description,
        )


if __name__ == "__main__":
    from test_values import test_values

    suite = unittest.TestSuite()
    suite.addTests(
        TestWordForms(
            inp,
            value,
            f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    )
    unittest.TextTestRunner().run(suite)
```

In short, this takes all input values from `test_values.py`, and all individual outputs for these inputs, and checks whether the similarity ratio between these two is identical when using `difflib.SequenceMatcher().ratio()` vs `Levenshtein.ratio()`. The output is the following:

```
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 621 tests in 0.080s

OK
```


So, the `Levenshtein.ratio()` is indeed equivalent to `difflib.SequenceMatcher().ratio()` for these test cases.

But is it faster?
Again, I wrote a quick script for this. None of these scripts are included in the actual commits, as they should not be packaged with `word_forms`. The script is:
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from difflib import SequenceMatcher
from Levenshtein import ratio

from test_values import test_values

test_cases = [
    (
        inp,
        value,
    )
    for inp, out in test_values
    for values in out.values()
    for value in values
]

n = 100
ratio_list = []
for str_one, str_two in test_cases:
    diff = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
    leven = timeit(lambda: ratio(str_one, str_two), number=n)
    ratio_list.append(diff / leven)
    print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")
print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")
```

Which outputted:

```
Minimum performance gain (ratio): 21.2509 times as fast
Maximum performance gain (ratio): 194.4625 times as fast
Median performance gain (ratio) : 78.2975 times as fast
```

So, yes, it is much faster. Will it be a noticeable performance increase when implemented in `word_forms`? Well, I made a quick script for that too.

Is it actually noticeably faster in get_word_forms?
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from word_forms.word_forms import get_word_forms
from test_values import test_values

n = 200
speed_list = []
for test_case in test_values:
    str_one = test_case[0]
    speed = timeit(lambda: get_word_forms(str_one), number=n)
    speed_list.append(speed)
    print(f"Took {speed:.8f}s")
print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")
```

Which outputted the following when using `difflib.SequenceMatcher().ratio()` (note: each execution time covers 200 calls of the function):

```
Minimum execution time (seconds): 0.01940580s
Maximum execution time (seconds): 1.16317950s
Median execution time (seconds) : 0.07265300s
```

and the following for `Levenshtein.ratio()`:

```
Minimum execution time (seconds): 0.01647300s
Maximum execution time (seconds): 1.23246420s
Median execution time (seconds) : 0.05827050s
```

When considering the median, there is a noticeable performance increase of roughly 20%, but this is likely not enough for any user to actually notice. Regardless, using `Levenshtein.ratio()` is definitely preferable, unless you desperately want to keep the number of dependencies to a minimum.

---

Thank you for the recommendation, chrisjbryant!

All tests still pass, as expected.

- Tom Aarsen

Message from gutfeeling

1. Added a simple lemmatizer in `word_forms.lemmatizer`. It uses `get_word_forms()` to generate all forms of the word and then picks the shortest form that appears first in the dictionary (i.e. first in alphabetically sorted order); see the sketch below the list.
2. Updated dependencies. `Unipath` has been replaced by Python 3's `pathlib`. The `NLTK` and `inflect` versions have been bumped.
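
For reference, here is a minimal sketch of that selection logic, assuming a `lemmatize()` entry point; the actual API in `word_forms.lemmatizer` may differ:

```python
from word_forms.word_forms import get_word_forms


def lemmatize(word: str) -> str:
    """Pick the shortest generated form; break ties alphabetically."""
    # Collect every form across all parts of speech.
    all_forms = set()
    for forms in get_word_forms(word).values():
        all_forms |= forms
    if not all_forms:
        return word  # assumption: fall back to the input when nothing is found
    # Shortest first, then alphabetical, as described above.
    return min(all_forms, key=lambda form: (len(form), form))
```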

2.0.0

This release is a result of the awesome work done by CubieDev.

Improvement Overview

- 1. Added a Testing Suite to track results. Run it with `python test_word_forms.py`.
- 2. Fixed strange case sensitivity for some nouns, e.g.:
```python
get_word_forms("death") => {
    'n': {'death', 'dying', 'dice', 'Death', 'dyings', 'die', 'deaths', 'Deaths'},
    ...
}
```

- 3. Resolved some of the incorrect pluralisation provided by inflect: no more `"politicss"`.
- 4. Resolved some issues with weird unrelated results, such as:
```python
get_word_forms("verb") => {
    'a': {'verbal'},
    'n': {'wordings', 'wording', 'word', 'verbs', 'verb', 'words'},
    'r': {'verbally'},
    'v': {'verbified',
          'verbifies',
          'verbify',
          'verbifying',
          'word',
          'worded',
          'wording',
          'words'}
}
```

This is now:
```python
get_word_forms("verb") => {
    "a": {"verbal"},
    "n": {"verbs", "verb"},
    "r": {"verbally"},
    "v": {"verbifying", "verbified", "verbify", "verbifies"},
}
```

- 5. Resolved words like `"ran"`, `"am"` and `"was"` not returning any values. The old system returns:
```python
get_word_forms("ran") => {'n': set(), 'a': set(), 'v': set(), 'r': set()}
get_word_forms("run") => {
    'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
    'a': {'running', 'runny'},
    'v': {'running', 'run', 'runs', 'ran'},
    'r': set()
}
```

This is now:
```python
get_word_forms("ran") => {
    'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
    'a': {'running', 'runny'},
    'v': {'running', 'run', 'runs', 'ran'},
    'r': set()
}
get_word_forms("run") => {
    'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
    'a': {'running', 'runny'},
    'v': {'running', 'run', 'runs', 'ran'},
    'r': set()
}
```

- 6. Heavily improved performance of importing the module. The following program
```python
from time import time

t = time()
from word_forms.word_forms import get_word_forms
print(f"It took {time() - t:.4f}s to load the module")
```

used to output

```
It took 10.1868s to load the module
```

and now it outputs

```
It took 2.7437s to load the module
```


---

In addition to these fixes, this pull request solves issues 2 and 3.

---

Detailed changes
I've split up my work into small commits, so that with each commit the reasoning and changes are concise and hopefully easy to follow. I would heavily recommend going through each commit in order to see what I've done, rather than immediately jumping to the overall effect they had on the two main files of your project.

I'll go through each of my commits and tell you my reasoning and what they did.
- 885445c Added a test suite with a bunch of words and their (expected) results. These were manually edited to remove any errors, such as `"politicss"`.
- ead7285 Rather than looping over all synsets, and then over all lemmas, to get all words, I now call `nltk.corpus.words.words()`, which gives you (if I recall correctly) some 240000 words rather than the previous 180000, and in considerably less time. This is responsible for improvement 6; a small illustration follows.
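  For illustration, fetching that word list looks like this (hedged: the corpus must first be downloaded, e.g. via `nltk.download("words")`):
```python
from nltk.corpus import words

# One flat list of English words; no iteration over synsets and lemmas needed.
word_list = words.words()
print(len(word_list))
```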
- 0f6d9b1 Commit title and description speak for themselves.
- 4f848e4 The function `get_related_lemmas()` is more intuitively implemented as a recursive function. I keep track of a list of known lemmas. When the function is called, it will take the `word` parameter, get the lemmas for that word exactly, add those to `known_lemmas` if not already present, and recursively call `get_related_lemmas()` again with lemmas related to the original `word`. Each of these recursive calls will add more entries to `known_lemmas`, which is then returned.
Note that at this point (this will change in a later commit), the function is functionally identical; it just has a different implementation.
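  To illustrate the shape of this (a hedged sketch only; the names and exact WordNet calls are assumptions based on the description, not the commit itself):
```python
from nltk.corpus import wordnet as wn


def get_related_lemmas(word, known_lemmas=None):
    if known_lemmas is None:
        known_lemmas = []
    # Lemmas whose name matches `word` exactly.
    exact_lemmas = [lemma for synset in wn.synsets(word)
                    for lemma in synset.lemmas() if lemma.name() == word]
    for lemma in exact_lemmas:
        if lemma not in known_lemmas:
            known_lemmas.append(lemma)
            # Recurse with lemmas related to the original word.
            for related in lemma.derivationally_related_forms() + lemma.pertainyms():
                get_related_lemmas(related.name(), known_lemmas)
    return known_lemmas
```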
- 8a05104 Commit title and description speak for themselves.
- fe740e6 Slight simplification: using `.copy()` rather than a list comprehension to copy a list.
- 2e8c2a9 Moved away from iterating over a nested list, and using a dictionary instead. Previously, for each verb you iterated over the entire CONJUGATED_VERB_LIST, and then for each nested list checked whether the verb was in that nested list. This is a very slow operation, and it does not need to be. Now, I use a dict whose keys point to an instance of the `Verbs` class, which holds a *set* of strings. To illustrate the change, I'll show the old and new representations:
```python
CONJUGATED_VERB_LIST = [
    ['convoluted', 'convoluting', 'convolute', 'convolutes'],
    ['fawn', 'fawned', 'fawns', 'fawning']
]
```

```python
v1 = Verbs({'convoluted', 'convoluting', 'convolute', 'convolutes'})
v2 = Verbs({'fawn', 'fawned', 'fawns', 'fawning'})
CONJUGATED_VERB_DICT = {
    'convoluted': v1,
    'convoluting': v1,
    'convolute': v1,
    'convolutes': v1,
    'fawn': v2,
    'fawned': v2,
    'fawns': v2,
    'fawning': v2
}
```

In the old system, we need:
```python
for verb in verb_set:
    for word_list in CONJUGATED_VERB_LIST:
        if verb in word_list:
            for word in word_list:
                related_word_dict["v"].add(word)
```

You can count the number of nested loops for yourself.
Now we only need:
```python
for verb in verb_set:
    if verb in CONJUGATED_VERB_DICT:
        related_word_dict["v"] |= CONJUGATED_VERB_DICT[verb].verbs
```

This is considerably faster. Note that `|=` is a set union operation.
- 30c8282 The changes to line 35 are reverted later on. Furthermore, every pertainym is now used in ADJECTIVE_TO_ADVERB, rather than just the first one. Other than that, `word_forms` is slightly optimized so that it no longer needs a list comprehension.
- a9cd983 This is a very interesting commit. It uses something very similar to the `difflib.get_close_matches()` used in `constants.py`: `difflib.SequenceMatcher`. Now, new lemmas found in `get_related_lemmas` will only be considered if they are deemed at least 40% similar to the previous lemma. This will avoid jumps like `"verbal" -> "word"` and `"genetic" -> "origin"`.
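  In spirit, the gate looks like this hedged sketch (the 40% threshold is from the description above; the names are mine, not the commit's):
```python
from difflib import SequenceMatcher


def similar_enough(previous_lemma: str, new_lemma: str, threshold: float = 0.4) -> bool:
    # Only keep a newly found lemma if it is at least 40% similar to the
    # lemma we came from, filtering out jumps like "verbal" -> "word".
    return SequenceMatcher(None, previous_lemma, new_lemma).ratio() >= threshold
```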
- 7f6d5d2 Commit title and description speak for themselves.
- 4bd4cd5 Commit title and description speak for themselves.
- d1c6340 Commit title and description speak for themselves. This is responsible for improvement 2.
- b2a9daf Rather than blindly adding the plural of any noun, we ensure that the pluralised noun does not end with `"css"`, to avoid `"geneticss"` and `"politicss"`. We override this change later, in commit 2ea150e.
- 80d27bf Commit title and description speak for themselves.
- 5f31968 Commit title and description speak for themselves.
- 77b9c25 Commit title and description speak for themselves.
- 0cf0628 Commit title and description speak for themselves.
- 359fc0c Rather than turning words inputted to `get_word_forms` into their singular form, we now use the NLTK `WordNetLemmatizer()` to lemmatize the input word for nouns, verbs, adjectives and adverbs. With this change, we get `{"run"}` as words when the input was `"ran"`, or `{"be", "am"}` if we input `"am"`. Then, for each of these words we call `get_related_lemmas`, and duplicate lemmas are removed. This is responsible for improvement 5.
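  For reference, the NLTK call in question behaves roughly like this (a hedged illustration; the actual filtering in `word_forms` may differ):
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Lemmatize the input once per part of speech: noun, verb, adjective, adverb.
lemmas = {lemmatizer.lemmatize("ran", pos) for pos in ("n", "v", "a", "r")}
print(lemmas)  # the verb reading maps "ran" to "run"
```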
- 18eeba6 Commit title and description speak for themselves.
- 2ea150e Improves on commit b2a9daf. Uses a simple regular expression to detect whether the pluralised noun ends with a consonant followed by "ss"; the commit title and description say the rest. This is responsible for improvement 3.
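  Such a check could look like the following hedged sketch (the exact pattern in the commit may differ):
```python
import re

# A consonant followed by a final "ss" signals a bad plural such as "politicss",
# while legitimate words like "grass" (vowel + "ss") are left alone.
BAD_PLURAL = re.compile(r"[b-df-hj-np-tv-z]ss$")

for plural in ("politicss", "geneticss", "grass", "classes"):
    print(plural, bool(BAD_PLURAL.search(plural)))
```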
- c832b6b Commit title and description speak for themselves.
- 384f298 Commit title and description speak for themselves.
- e71f64f Now that the program output has been improved, some of the examples were outdated. I've updated them, and added the now-relevant examples of `"am"` and `"ran"`. The Contribution section has also been updated to reflect that a bug has now been fixed.
- d11297e Commit title and description speak for themselves.

---

Tests
The modified program passes all provided tests; the original program fails 9 of them. The output of the tests on the original program is provided here:
[original_test_output.txt](https://github.com/gutfeeling/word_forms/files/5171711/original_test_output.txt)

I've tested my changes with Python 3.7.4. I do not know whether the program is still Python 2 compatible.

Potential additions/considerations for you
- Add tests to `setup.py` so they can be run through `setup.py`.
- Check/confirm Python 2 compatibility.
- Consider adding a contributors section to the README.
- Consider checking the README for any remaining inconsistencies with the new version, e.g. warnings about bugs that no longer exist.
- Note that the pluralisation by `inflect` can still cause issues, as it comes up with words like `"runninesses"` as the plural of `"runniness"`; see the snippet below.
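
For instance, using inflect's engine directly:

```python
import inflect

engine = inflect.engine()
# inflect happily pluralises mass nouns that arguably have no plural.
print(engine.plural_noun("runniness"))  # -> "runninesses"
```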

1.0.0
