**API-breaking changes:**
The `Reader` class has been completely rewritten.
A couple of methods have been removed, and others have been renamed.
For the methods that remain (renamed or not),
the output data structures and the allowed arguments have changed.
Details follow.
Added
* New classmethods of `Reader` for reader instantiation:
- `from_zip`
- `from_dir`
* New classes to better structure CHAT data:
- `Utterance`
- `Token`
- `Gra`
* New `Reader` methods:
- `append_left`, `extend`, `extend_left`, `pop`, `pop_left`
- `tokens` (which gives `Token` objects, essentially the "tagged words" from before)
* In the header dictionary, each participant's info has the new key `"dob"`
for date of birth (if the info is available in the CHAT header).
The corresponding value is a `datetime.date` object.
(The same info was previously exposed as the `Reader` method `date_of_birth`,
now removed.)
* The test suite now covers code snippets in both the docstrings and `.rst` doc files.
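
As an illustration of the new `"dob"` key, here is a minimal sketch of working with a `datetime.date` value taken from a participant's info. The dictionary below is a hand-written stand-in, not output from the package: only the `"dob"` key and its value type come from the notes above, the name, role, and dates are made up, and `age_in_months` is a hypothetical helper.

```python
import datetime

# Hypothetical participant entry. Only the "dob" key holding a
# datetime.date comes from these notes; the values are invented.
participant_info = {
    "name": "Eve",
    "role": "Target_Child",
    "dob": datetime.date(1961, 3, 1),
}


def age_in_months(dob, recording_date):
    """Return the number of whole months between dob and recording_date."""
    months = (recording_date.year - dob.year) * 12
    months += recording_date.month - dob.month
    # The current month is not complete until the birth day-of-month is reached.
    if recording_date.day < dob.day:
        months -= 1
    return months


# "dob" is only present if the CHAT header provides it, so use .get().
dob = participant_info.get("dob")
if dob is not None:
    print(age_in_months(dob, datetime.date(1962, 9, 15)))
```

Using `.get("dob")` rather than indexing keeps the code safe for data files whose headers do not record a date of birth.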
Changed
* CHAT parsing in `Reader` instantiation has been completely rewritten.
The previous private class `_SingleReader` has been removed.
This private class duplicated a lot of the `Reader` code,
which made it hard to make changes.
* The `Reader` rewrite has also greatly sped up the reading and parsing of CHAT data.
* The `by_files` argument, which many `Reader` methods have,
now gives you a simpler list of results, one per data file,
instead of the previous dict that mapped each file path to that file's
result.
* The `participant` argument, which many `Reader` methods have for specifying
which participants' data to include in the output, has been renamed as
`participants` to avoid confusion. There is no change to its behavior of
handling either a single string (e.g., `"CHI"`) or a collection of strings
(e.g., `{"CHI", "MOT"}`).
* The following `Reader` methods have been renamed as indicated,
some for stylistic or Pythonic reasons, others for reasons as given:
- `age` -> `ages`
- `number_of_utterances` -> `n_utterances`
- `number_of_files` -> `n_files`
- `filenames` -> `file_paths`
- `MLU` -> `mlu`
- `MLUm` -> `mlum`
- `MLUw` -> `mluw`
- `TTR` -> `ttr`
- `IPSyn` -> `ipsyn`
- `word_frequency` -> `word_frequencies`
- `from_chat_str` -> `from_strs`
- `from_chat_files` -> `from_files`
- `add` -> `append`.
Since the data files in a `Reader` have a natural ordering (by time of
recording sessions, and therefore commonly by file paths as well),
a reader is list-like rather than an unordered set of data files,
which `add` would suggest.
- `participant_codes` -> `participants`.
Before this version, the methods `participant_codes` (for CHI, MOT, etc) and
`participants` (for, say, Eve, Mother, Investigator, etc) co-existed,
but in practice we mostly only care about CHI, MOT, etc.
So the method `participants` for Eve etc has been removed,
and `participant_codes` has been renamed as `participants`.
* Each participant's info in a header dictionary has these keys renamed:
- `participant_name` -> `name`
- `participant_role` -> `role`
- `SES` -> `ses` (socioeconomic status)
* The class `DependencyGraph` has been made private
(i.e., now `_DependencyGraph` with a leading underscore).
Its functionality hasn't really changed (it's used in the computation of IPSyn).
It may be made more visible again in the future if more functionality
related to grammatical relations is developed in the package.
* Switched to sphinx-rtd-theme as the documentation theme.
* Switched to CircleCI orbs; updated dev requirements' versions.
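
For code written against the previous version, two of the changes above can be bridged with small helpers. The following is a sketch only: `upgrade_participant_info` and `results_by_path` are hypothetical names, the inputs are plain Python objects standing in for real reader output, and the second helper assumes that a per-file results list comes in the same order as the list of file paths.

```python
# Old header key -> new header key, as listed in the notes above.
HEADER_KEY_RENAMES = {
    "participant_name": "name",
    "participant_role": "role",
    "SES": "ses",
}


def upgrade_participant_info(info):
    """Return a copy of an old-style participant dict with the new key names."""
    return {HEADER_KEY_RENAMES.get(key, key): value for key, value in info.items()}


def results_by_path(file_paths, results):
    """Rebuild the old {file path: result} mapping from a per-file list.

    Assumption: results[i] belongs to file_paths[i].
    """
    if len(file_paths) != len(results):
        raise ValueError("file_paths and results must have the same length")
    return dict(zip(file_paths, results))


old_info = {"participant_name": "Eve", "participant_role": "Target_Child", "SES": "MC"}
print(upgrade_participant_info(old_info))
print(results_by_path(["a.cha", "b.cha"], [3, 5]))
```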
Deprecated
* The following `Reader` methods have been deprecated:
- `tagged_sents` (use `tokens` with `by_utterances=True` instead)
- `tagged_words` (use `tokens` with `by_utterances=False` instead)
- `sents` (use `words` with `by_utterances=True` instead)
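
The replacements above differ only in whether results are grouped by utterance. A rough sketch of that relationship, using hand-written lists in place of real reader output and assuming the flat form is simply the per-utterance lists concatenated in order (`flatten` is a hypothetical helper):

```python
from itertools import chain

# Stand-in for a per-utterance result (what `by_utterances=True` suggests):
# one inner list per utterance.
words_by_utterance = [
    ["more", "cookie", "."],
    ["what", "is", "that", "?"],
]


def flatten(per_utterance):
    """Concatenate per-utterance lists into one flat list, preserving order."""
    return list(chain.from_iterable(per_utterance))


# Stand-in for the flat result (what `by_utterances=False` suggests).
flat_words = flatten(words_by_utterance)
print(flat_words)
```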
Removed
* The following methods of the `Reader` class have been removed:
- `abspath`. Use `file_paths` instead.
- `index_to_tiers`. All the unparsed tiers are now available from `utterances`.
- `participant_codes`. It has been renamed as `participants`, replacing the earlier method of that name, which has been removed; see "Changed" above.
- `part_of_speech_tags`
- `update` and `remove`. A reader is a list-like collection of CHAT data files,
not a set (which `update` and `remove` would suggest).
- `search` and `concordance`. To search, use one of
the `words`, `tokens`, and `utterances` methods to walk through a reader's CHAT data
and keep track of elements of interest.
- `date_of_birth`. The info is now available under `headers`, in each participant's
`"dob"` key.
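
With `search` and `concordance` gone, a search is a plain loop over the data. Below is a minimal sketch of a concordance-style lookup. It operates on a hand-written list of utterances (each a list of word strings) standing in for per-utterance output from `words`; `find_word` and its `window` parameter are hypothetical and not part of the package.

```python
def find_word(utterances, target, window=2):
    """Collect every occurrence of target with up to `window` words of context.

    Returns a list of (utterance index, left context, word, right context).
    """
    hits = []
    for utt_index, words in enumerate(utterances):
        for word_index, word in enumerate(words):
            if word == target:
                left = words[max(0, word_index - window):word_index]
                right = words[word_index + 1:word_index + 1 + window]
                hits.append((utt_index, left, word, right))
    return hits


# Stand-in data; a real reader would supply the utterances.
utterances = [
    ["more", "cookie", "."],
    ["I", "want", "more", "juice", "."],
]
for hit in find_word(utterances, "more"):
    print(hit)
```

The same loop shape works for tokens or whole utterances: replace the equality test with whatever condition marks an element of interest.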
Fixed
* Handled `[/-]` in cleaning utterances.
* `[x <number>]` is now treated as a repetition of the previous word/item,
not of the entire utterance.
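
To make the `[x <number>]` reading concrete, here is one possible interpretation sketched as a standalone function. It is not the package's cleaning code, and the notes above do not say whether the package expands the marker or only drops it. `expand_repetitions` is a hypothetical helper that repeats the single word, or the `<...>` group, immediately before the marker.

```python
import re

# The item before the marker: either an angle-bracketed group or one word.
_REPEAT = re.compile(r"(<[^<>]+>|[^\s<>\[\]]+)\s*\[x (\d+)\]")


def expand_repetitions(utterance):
    """Repeat the item preceding each `[x N]` marker N times in total."""

    def _expand(match):
        item = match.group(1).strip("<>").strip()
        count = int(match.group(2))
        return " ".join([item] * count)

    return _REPEAT.sub(_expand, utterance)


# Only "more" is repeated; the rest of the utterance is left alone.
print(expand_repetitions("more [x 3] cookie ."))
print(expand_repetitions("<I want> [x 2] juice ."))
```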