Corpkit

Latest version: v2.3.8

2.0.13

There were some issues with large XML file processing that have now been resolved.

Have fun!

2.0.8
- Speed increases, especially for feature counting
- Multiprocessing for parsing, very useful when you have access to a big machine
- Improved searching for CoreNLP (looking in all paths), automating download and installation
- Simpler backend implementation of keywords and ngrams
- Better documentation, especially at [ReadTheDocs](http://corpkit.readthedocs.io/en/latest/)
- Code has been refactored and made largely PEP8 compliant, aiding collaboration
- Can now sort by subcorpus name in `interrogation.edit()` method

Very little difference to the API, however!

2.0.0

In this major release, stability and performance have been improved in dozens of ways:
- Python 2/3 compatibility
- Smart multiprocessing
- Useful documentation, _ReadTheDocs_ site generation
- Much smaller repository size
- Compatible with multiple versions of _CoreNLP_
- Increased object orientation generally
- _Nose_ tests
- _Travis CI_ integration
- Faster save/load via _cPickle_
- Countless bugfixes

Levels of abstraction have been added beyond `Corpus` (`Corpora`) and `Interrogation` (`Interrodict`), with useful methods attached to each. Interrogation and concordancing have become two sides of the same coin, rather than separate tasks, helping to build computational workflows that investigate functional linguistic notions of probabilistic grammar and lexis as delicate grammar.

One emerging part of corpkit is the `configurations()` method, which automatically analyses the behaviour of a lexical item or items in the corpus. This will be very useful in automated workflows that seek to identify key participants and processes, and then to generate an overview of how each behaves. A little more work is still needed here, however. Also on the horizon are multilingual support and the use of _spaCy_ ... but perhaps some of this needs to wait until I've made peace with my thesis.

1.87

The main change in this release is a set of decent docstrings, which allow for decent documentation at http://corpkit.readthedocs.org/en/latest/. Since the last release, things have also become more stable. The `Corpus` class and its subclasses are working really nicely: it's now easy to search particular subcorpora, multiprocess, or treat files as subcorpora. The `interrogate` method has also improved a lot, and `conc` has been subsumed within it. All is well.

1.82

This release marks a transition to a class-and-method structure, rather than a collection of functions. Users now instantiate a `Corpus` object with methods for parsing, interrogating and concordancing. Interrogations output `Interrogation` objects, which have methods for editing, plotting, saving, etc.

Another major update is that the `concordance()` method takes the same core arguments as the `interrogate()` method. This means that users can quickly check that their interrogation is counting what they think it is.

There have also been some bugfixes, documentation updates, and that kind of usual stuff.

1.76

This release is designed to reflect a change from purpose-built `interrogator()` search functions to the `search` and `show` arguments, which are much more powerful. Users can construct a `dict` object with one or more dependency criteria to match, and elect to match all criteria or any criterion with `searchmode = 'any'/'all'`.

```python
>>> criteria = {'lemma': ['think', 'feel', 'want'],
...             'pos': r'^V',
...             'function': 'root'}

>>> r = interrogator(corpus, search = criteria, show = ['word'], searchmode = 'all')
>>> list(r.results.columns)[:5]
```

might return:

```python
['think', 'thinking', 'want', 'wants', 'feel']
```

Passing a longer list as the `show` argument determines which attributes appear in the output, and in what order:

```python
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'], searchmode = 'all')
>>> list(r.results.columns)[:3]
```

will produce column names with concatenated function, pos and lemma:

```python
['root/vbp/think', 'root/vbg/thinking', 'root/vb/want']
```
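
As a rough illustration of how a `show` list might map onto column names, the sketch below joins the requested attributes of a single token with `/`, in the order given. The token dictionary, the `SHOW_KEYS` mapping and the `column_name` helper are all assumptions made for this example; they are not corpkit's API or its internal representation.

```python
# Hypothetical mapping from short `show` codes to token attributes.
# This is an assumption for illustration, not corpkit's internals.
SHOW_KEYS = {'w': 'word', 'l': 'lemma', 'p': 'pos', 'f': 'function'}

def column_name(token, show):
    """Join the requested attributes of one token, in the order given.

    Codes missing from SHOW_KEYS (e.g. the full name 'word') are used
    as attribute names directly.
    """
    return '/'.join(token[SHOW_KEYS.get(code, code)].lower() for code in show)

# A made-up token, standing in for one parsed word from a corpus.
token = {'word': 'think', 'lemma': 'think', 'pos': 'VBP', 'function': 'root'}

print(column_name(token, ['f', 'p', 'l']))  # root/vbp/think
print(column_name(token, ['l', 'p']))       # think/vbp
print(column_name(token, ['word']))         # think
```

Reordering or shortening the list changes the column name accordingly, which is the behaviour the release note describes.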

Another improvement is the `exclude` argument, which takes the place of `blacklist`, `function_filter` and `pos_filter`. Alongside `excludemode = 'any'/'all'`, it operates just like `search`, allowing the user to exclude results matching one or more criteria:

```python
>>> excs = {'pos': r'^V', 'word': r'ing$'}
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'],
...                  searchmode = 'all', exclude = excs, excludemode = 'all')
```

would remove any verbal token ending in `ing`. Changing `excludemode` to `'any'` would remove all verbs and all words ending in `ing`.
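
The difference between the two modes can be sketched in plain Python. This is only a minimal model of the `'any'`/`'all'` logic described above, applied to made-up token dictionaries; the `matches` and `kept` helpers are hypothetical and do not reflect how corpkit implements matching.

```python
import re

def matches(token, criteria, mode='all'):
    """Return True if the token satisfies the criteria under the given mode.

    String criteria are treated as regular expressions; lists are treated
    as sets of allowed values. Empty criteria match nothing.
    """
    if not criteria:
        return False

    def hit(key, pattern):
        value = token.get(key, '')
        if isinstance(pattern, (list, tuple)):
            return value in pattern
        return re.search(pattern, value) is not None

    check = all if mode == 'all' else any
    return check(hit(key, pattern) for key, pattern in criteria.items())

def kept(tokens, exclude, excludemode):
    """Return the words of tokens that survive the exclusion criteria."""
    return [t['word'] for t in tokens if not matches(t, exclude, excludemode)]

# Made-up tokens for illustration.
tokens = [
    {'word': 'thinking', 'pos': 'VBG'},
    {'word': 'wants', 'pos': 'VBZ'},
    {'word': 'morning', 'pos': 'NN'},
    {'word': 'idea', 'pos': 'NN'},
]
excs = {'pos': r'^V', 'word': r'ing$'}

print(kept(tokens, excs, 'all'))  # ['wants', 'morning', 'idea']
print(kept(tokens, excs, 'any'))  # ['idea']
```

With `'all'`, only `thinking` is dropped because it is the only token that is both a verb and ends in `ing`; with `'any'`, the verb `wants` and the noun `morning` are dropped as well.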

The release also includes various bugfixes, code cleanup, and some miscellaneous additions, such as a function for turning results into pandas `MultiIndex` DataFrames. Full API documentation is forthcoming.
