This is a major release in which the API (in particular for weakly supervised algorithms) was largely reworked to make it more unified and mostly compatible with scikit-learn. For this reason, you might encounter a significant number of `DeprecationWarning` and `ChangedBehaviourWarning` messages. These warnings will disappear in version 0.6.0. The changes are summarized below:
- All algorithms:
- Uniformize initialization for all algorithms: all algorithms that have a 'prior' or an 'init' parameter can now set it in a unified way, choosing among more options ('identity', 'random', etc.) (195)
- Rename `num_dims` to `n_components` for algorithms that have such a parameter. (193)
- The `metric()` method has been renamed to `get_mahalanobis_matrix` (152)
- You can now use the function `score_pairs` to score a batch of pairs of points (it returns the distance between them), or `get_metric` to get a metric function that can be plugged into scikit-learn estimators like any scipy distance.
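
To illustrate what scoring a batch of pairs amounts to, here is a minimal NumPy sketch that computes a Mahalanobis distance for each pair in a 3D array, given a matrix `M`. The helper name `mahalanobis_pair_distances` and the toy data are hypothetical, and this is not the library's implementation; in metric-learn itself, `score_pairs` and `get_mahalanobis_matrix` are the methods to call on a fitted estimator.

```python
import numpy as np


def mahalanobis_pair_distances(pairs, M):
    """Return sqrt((x - x')^T M (x - x')) for each pair.

    pairs: array of shape (n_pairs, 2, n_features)
    M:     positive semi-definite matrix of shape (n_features, n_features)
    """
    diffs = pairs[:, 0, :] - pairs[:, 1, :]
    # Quadratic form diffs[i]^T M diffs[i], evaluated for every pair i.
    squared = np.einsum("ij,jk,ik->i", diffs, M, diffs)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(squared, 0.0))


# Toy data (hypothetical): two pairs of 2D points.
pairs = np.array([
    [[0.0, 0.0], [3.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
])
M = np.eye(2)  # with the identity matrix this reduces to the Euclidean distance
distances = mahalanobis_pair_distances(pairs, M)
```

With `M` set to the identity, the result is the plain Euclidean distance between the two points of each pair, which makes the sketch easy to check by hand.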
- Weakly supervised algorithms
- major API changes (139, 217, 220, 197, 168) allowing greater compatibility with scikit-learn routines:
- To `fit` weakly supervised algorithms, users now have to provide 3D arrays of tuples (and possibly an array of labels `y`). For pairs learners, instead of `X` and `[a, b, c, d]` as before, the input should be an array `pairs` such that `pairs[i] = X[[a[k], b[k]]]` if `y[i] == 1`, or `pairs[i] = X[[c[k], d[k]]]` if `y[i] != 1`, where `k` is some integer (you can obtain such a representation by stacking `a` and `b` horizontally, then `c` and `d`, stacking the results vertically, and taking X[this array of indices]). For quadruplets learners, the input has the same form, except that there is no need for `y` and the 3D array is an array of 4-tuples instead of 2-tuples. The first two elements of each quadruplet are the ones that should be more similar to each other than the last two.
- Alternatively, a "preprocessor" can be used if users prefer to give tuples of indices rather than tuples of plain points, which avoids redundant copies of the data. Custom preprocessors can easily be written for advanced use (e.g., to load and encode images from file paths).
- You can also use `predict` on a given pair or quadruplet, i.e. predict whether the pair is similar or not or, in the case of quadruplets, whether a given new quadruplet is in the right order.
- For pairs, this prediction depends on a threshold that can be set with `set_threshold` and calibrated on some data with `calibrate_threshold`.
- For pairs, a default `score` is defined, which is the AUC (Area under the ROC Curve). For quadruplets, the default `score` is the accuracy (proportion of quadruplets given in the right order).
- All of the above makes the algorithms compatible with scikit-learn for cross-validation, grid search, etc.
- For more information about these changes, see the new documentation.
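
The conversion from the old `X` plus `[a, b, c, d]` input to the new 3D array of pairs can be sketched with plain NumPy as follows. The toy data and variable names are hypothetical, and the label `-1` for dissimilar pairs is an assumption of this sketch (the description above only requires a value different from 1).

```python
import numpy as np

# Toy dataset (hypothetical): 6 points with 2 features each.
X = np.arange(12, dtype=float).reshape(6, 2)

# Old-style constraints: (a[k], b[k]) are similar, (c[k], d[k]) are dissimilar.
a = np.array([0, 1])
b = np.array([2, 3])
c = np.array([0, 4])
d = np.array([5, 1])

# Stack a and b horizontally, then c and d, then stack the two blocks vertically.
similar = np.column_stack([a, b])                 # shape (2, 2)
dissimilar = np.column_stack([c, d])              # shape (2, 2)
pair_indices = np.vstack([similar, dissimilar])   # shape (4, 2)

# Index X with the array of indices to get the 3D array of pairs.
pairs = X[pair_indices]                           # shape (4, 2, 2)

# One label per pair: 1 for similar pairs, -1 (assumed here) for dissimilar ones.
y = np.concatenate([np.ones(len(a)), -np.ones(len(c))])
```

Here `pairs` and `y` have the shape expected by `fit` for pairs learners, while `pair_indices` is the kind of index-based input that is meant to be combined with a preprocessor instead of materializing the points.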
- Supervised algorithms
- deprecation of the `num_labeled` parameter (119)
- ITML_supervised bounds must now be set at initialization and no longer in `fit` (163)
- deprecation of `use_pca` in LMNN (231).
- the random seed for generating constraints must now be set at initialization rather than at fit time (224).
- removed preprocessing the data for RCA (194).
- removed shogun dependency for LMNN (216).
- Improved documentation:
- mathematical formulation of algorithms (178)
- general introduction to metric learning, use cases, different problem formulations (145)
- description of the API in the user guide (208 and 229)
- Bug fixes:
- scikit-learn's fix https://github.com/scikit-learn/scikit-learn/pull/13276 fixed SDML when the matrix to reconstruct is PSD, and the use of skggm fixed it in cases where the matrix is not PSD but we can still converge. The use of skggm is now recommended (i.e. we recommend installing skggm to use SDML).
- For all the algorithms that had a parameter `num_dims` (renamed to `n_components`, see above), its value is now checked to be between 1 and `n_features`, where `n_features` is the number of dimensions of the input space.
- LMNN did not update impostors at each iteration, which could result in problematic cases. Impostors are now recomputed at each iteration, which solves these problems (228).
- The pseudo-inverse is now used in `Covariance` instead of the plain inverse, which makes `Covariance` work even when the covariance matrix is not invertible (e.g. if the data lies in a subspace of smaller dimension). (206)
- There was an error in 101 that caused LMNN to return a wrong gradient (one dot product with `L` was missing). This has been fixed in 201.
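
The reason the pseudo-inverse helps in the `Covariance` fix can be seen on a small NumPy example. This is only a sketch on hypothetical toy data, not the library's code: the points lie on a line, so their covariance matrix has rank 1 and no plain inverse, yet its pseudo-inverse still defines a usable Mahalanobis matrix.

```python
import numpy as np

# Toy data (hypothetical): three 2D points on the line y = 2x,
# so the data spans a 1-dimensional subspace.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])

cov = np.cov(X, rowvar=False)      # [[1, 2], [2, 4]], a singular matrix
rank = np.linalg.matrix_rank(cov)  # smaller than the number of features

# The Moore-Penrose pseudo-inverse exists even though cov has no inverse.
M = np.linalg.pinv(cov)

# Squared Mahalanobis distance between the first and last points using M.
diff = X[2] - X[0]
sq_dist = diff @ M @ diff
```

The pseudo-inverse agrees with the ordinary inverse whenever the latter exists, so using it does not change results for full-rank covariance matrices.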