were made to provide normalized versions for measures that did not inherently
range from 0 to 1. The other major focus was the addition of 12 tokenizers, in
service of expanding distance measure options.
Changes:
- Support for Python 3.3 was dropped.
- Deprecated functions that merely wrap class methods to maintain API
compatibility, for removal in 0.6.0
- Added methods to ConfusionTable to return:
- its internal representation
- false negative rate
- false omission rate
- positive & negative likelihood ratios
- diagnostic odds ratio
- error rate
- prevalence
- Jaccard index
- D-measure
- Phi coefficient
- joint, actual, & predicted entropies
- mutual information
- proficiency (uncertainty coefficient)
- information gain ratio
- dependency
- lift
- Deprecated f-measure & g-measure from ConfusionTable for removal in
0.6.0
- Added notes to indicate when functions, classes, & methods were added
- Added the following 12 tokenizers:
- QSkipgrams
- CharacterTokenizer
- RegexpTokenizer, WhitespaceTokenizer, & WordpunctTokenizer
- COrVClusterTokenizer, CVClusterTokenizer, & VCClusterTokenizer
- SonoriPyTokenizer & LegaliPyTokenizer
- NLTKTokenizer
- SAPSTokenizer
- Added the UnigramCorpus class & a facility for downloading data, such as
pre-processed/trained data, from storage on GitHub
- Added the Wåhlin phonetic encoding
- Added the following 211 similarity/distance/correlation measures:
- ALINE
- AMPLE
- Anderberg
- Andres & Marzo's Delta
- Average Linkage
- AZZOO
- Baroni-Urbani & Buser I & II
- Batagelj & Bren
- Baulieu I-XV
- Benini I & II
- Bennet
- Bhattacharyya
- BI-SIM
- BLEU
- Block Levenshtein
- Brainerd-Robinson
- Braun-Blanquet
- Canberra
- Chord
- Clement
- Cohen's Kappa
- Cole
- Complete Linkage
- Consonni & Todeschini I-V
- Cormode's LZ
- Covington
- Dennis
- Dice Asymmetric I & II
- Digby
- Dispersion
- Doolittle
- Dunning
- Eyraud
- Fager & McGowan
- Faith
- Fellegi-Sunter
- Fidelity
- Fleiss
- Fleiss-Levin-Paik
- FlexMetric
- Forbes I & II
- Fossum
- FuzzyWuzzy Partial String
- FuzzyWuzzy Token Set
- FuzzyWuzzy Token Sort
- Generalized Fleiss
- Gilbert
- Gilbert & Wells
- Gini I & II
- Goodall
- Goodman & Kruskal's Lambda
- Goodman & Kruskal's Lambda-r
- Goodman & Kruskal's Tau A & B
- Gower & Legendre
- Guttman's Lambda A & B
- Gwet's AC
- Hamann
- Harris & Lahey
- Hassanat
- Hawkins & Dotson
- Hellinger
- Higuera & Mico
- Hurlbert
- Iterative SubString
- Jaccard-NM
- Jensen-Shannon
- Johnson
- Kendall's Tau
- Kent & Foster I & II
- Koppen I & II
- Kuder & Richardson
- Kuhns I-XII
- Kulczynski I & II
- Longest Common Prefix
- Longest Common Suffix
- Lorentzian
- Maarel
- Marking
- Marking Metric
- MASI
- Matusita
- Maxwell & Pilliner
- McConnaughey
- McEwen & Michael
- MetaLevenshtein
- Michelet
- MinHash
- Mountford
- Mean Squared Contingency
- Mutual Information
- NCD with LZSS
- NCD with PAQ9a
- Ozbay
- Pattern
- Pearson's Chi-Squared
- Pearson & Heron II
- Pearson II & III
- Pearson's Phi
- Peirce
- Positional Q-Gram Dice, Jaccard, & Overlap
- Q-Gram
- Quantitative Cosine, Dice, & Jaccard
- Rees-Levenshtein
- Roberts
- Rogers & Tanimoto
- Rogot & Goldberg
- Rouge-L, -S, -SU, & -W
- Russell & Rao
- SAPS
- Scott's Pi
- Shape
- Shapira & Storer I
- Sift4 Extended
- Single Linkage
- Size
- Soft Cosine
- SoftTF-IDF
- Sokal & Michener
- Sokal & Sneath I-V
- Sorgenfrei
- Steffensen
- Stiles
- Stuart's Tau
- Tarantula
- Tarwid
- Tetrachoric
- TF-IDF
- Tichy
- Tulloss's R, S, T, & U
- Unigram Subtuple
- Unknown A-M
- Upholt
- Warrens I-V
- Weighted Jaccard
- Whittaker
- Yates' Chi-Squared
- YJHHR
- Yujian & Bo
- Yule's Q, Q II, & Y
- Four intersection types are now supported for all distance measures that are
based on _TokenDistance. In addition to basic crisp intersections, soft,
fuzzy, and group linkage intersections have been provided.
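
As a rough illustration of the difference between a crisp intersection and a
relaxed one, the sketch below compares the two on character bigrams. It is not
the library's implementation: the helper names (qgrams, positional_sim,
crisp_intersection, soft_intersection_size), the '$' and '#' padding symbols,
the positional similarity, and the 0.5 threshold are all assumptions made for
this example, and the actual soft, fuzzy, and group linkage intersections
differ in detail.

```python
from collections import Counter


def qgrams(word, q=2):
    """Return the multiset of character q-grams of word.

    The '$' and '#' padding symbols are an assumption of this sketch.
    """
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def crisp_intersection(src, tar):
    """Multiset intersection: a token counts only when it matches exactly."""
    return src & tar


def positional_sim(a, b):
    """Share of positions at which two tokens agree (hypothetical similarity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def soft_intersection_size(src, tar, threshold=0.5):
    """Toy soft intersection size.

    Exact matches count 1 each. Each leftover source token is then greedily
    paired with its most similar leftover target token and, if the similarity
    reaches the threshold, contributes that similarity to the total.
    """
    exact = crisp_intersection(src, tar)
    size = float(sum(exact.values()))
    left_src = list((src - exact).elements())
    left_tar = list((tar - exact).elements())
    for token in left_src:
        if not left_tar:
            break
        best = max(left_tar, key=lambda t: positional_sim(token, t))
        score = positional_sim(token, best)
        if score >= threshold:
            size += score
            left_tar.remove(best)  # each target token is used at most once
    return size


src, tar = qgrams("cat"), qgrams("cot")
crisp_size = sum(crisp_intersection(src, tar).values())
soft_size = soft_intersection_size(src, tar)
print(crisp_size)  # 2: only "$c" and "t#" match exactly
print(soft_size)   # 3.0: "ca"/"co" and "at"/"ot" each add 0.5
```

With the crisp intersection, "cat" and "cot" share 2 of their 4 bigrams each;
the toy soft variant credits the two near-matching bigram pairs as well, so any
measure built on the intersection size would score the pair as more similar.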