------------------
* Added
- New `extract_` functions:
  * Generic `extract` used by all the others; takes an
    arbitrary regex to extract text.
* `extract_questions` to get question mark statistics, as
well as the text of questions asked.
  * `extract_currency` extracts text containing currency symbols, as
    well as the surrounding text.
  * `extract_intense_words` gets statistics about, and extracts, words
    with any character repeated three or more times, indicating an
    intense feeling (positive or negative).
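The intense-words idea can be sketched with a plain regex; the function name and pattern below are illustrative only, not the package's own implementation:

```python
import re

def words_with_repeats(text_list, min_repeat=3):
    """Return words containing any character repeated min_repeat or more
    times in a row, e.g. 'loooove' (a sketch of the extract_intense_words
    idea, not the library function itself)."""
    # (\w)\1{2,} matches a word character followed by itself 2+ more times
    pattern = re.compile(r'(\w)\1{%d,}' % (min_repeat - 1))
    found = []
    for text in text_list:
        for word in text.split():
            if pattern.search(word):
                found.append(word)
    return found

words_with_repeats(['I loooove this', 'so good', 'nooooo way'])
# ['loooove', 'nooooo']
```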
- New function `word_tokenize`:
* Used by `word_frequency` to get tokens of
1,2,3-word phrases (or more).
  * Splits a list of text into tokens of a specified number of words each.
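A minimal sketch of what multi-word tokenization means here, assuming overlapping word windows (the function name is hypothetical, not the library's `word_tokenize`):

```python
def phrase_tokens(text_list, phrase_len=2):
    """Split each text into overlapping tokens of phrase_len words
    (a sketch of the idea, not the package's implementation)."""
    result = []
    for text in text_list:
        words = text.split()
        result.append([' '.join(words[i:i + phrase_len])
                       for i in range(len(words) - phrase_len + 1)])
    return result

phrase_tokens(['big data analysis'], phrase_len=2)
# [['big data', 'data analysis']]
```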
- New stop-words from the ``spaCy`` package:
**current:** Arabic, Azerbaijani, Danish, Dutch, English, Finnish,
French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
**new:** Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian,
Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu,
Thai, Ukrainian, Urdu, Vietnamese
* Changed
- `word_frequency` takes new parameters:
  * `regex` defaults to words, but can be changed to any pattern,
    for example '\S+' to split on whitespace and keep punctuation
    attached to words.
  * `sep` is no longer an option; the `regex` parameter above can
    be used instead.
  * `num_list` is now optional, and defaults to a count of 1 per text
    if not provided. Useful for counting `abs_freq` only, when numeric
    data are not available.
  * `phrase_len`: the number of words in each split token. Defaults
    to 1 and can be set to 2 or higher. This helps in analyzing phrases
    as opposed to single words.
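Taken together, the new parameters can be sketched as follows; this is a simplified illustration of the weighted-counting idea, assuming defaults as described above, not the actual `word_frequency` implementation:

```python
import re
from collections import Counter

def freq_sketch(text_list, num_list=None, regex=r'\w+', phrase_len=1):
    """Weighted phrase frequencies (a sketch of the word_frequency idea)."""
    if num_list is None:
        # default: every text counts once, so only abs_freq is meaningful
        num_list = [1] * len(text_list)
    counts = Counter()
    for text, num in zip(text_list, num_list):
        words = re.findall(regex, text.lower())
        # build phrase_len-word tokens from consecutive words
        phrases = [' '.join(words[i:i + phrase_len])
                   for i in range(len(words) - phrase_len + 1)]
        for phrase in phrases:
            counts[phrase] += num
    return counts

freq_sketch(['buy shoes', 'buy socks'], num_list=[100, 50])
# 'buy' gets the combined weight of both texts
```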
- Parameters supplied to `serp_goog` now appear at the beginning
  of the result DataFrame.
- `serp_youtube` results now contain `nextPageToken`, making it
  easier to paginate requests.
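The kind of pagination loop that a `nextPageToken`-style cursor enables might look like this; `fetch_page` is a hypothetical stand-in for an actual API request, not a function from this package:

```python
def paginate(fetch_page, max_pages=3):
    """Collect results across pages using a nextPageToken-style cursor.

    fetch_page is a hypothetical callable taking a token (or None for
    the first page) and returning (items, next_token); next_token is
    None when there are no more pages.
    """
    items, token, pages = [], None, 0
    while pages < max_pages:
        batch, token = fetch_page(token)
        items.extend(batch)
        pages += 1
        if not token:  # no further pages
            break
    return items

# usage with a fake two-page response
fake_pages = {None: (['a', 'b'], 't1'), 't1': (['c'], None)}
paginate(lambda tok: fake_pages[tok])
# ['a', 'b', 'c']
```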