Out of Distribution and Outlier Detection
1. Detect **out of distribution** examples in a dataset based on its numeric **feature embeddings**
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
```
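For intuition about feature-based scoring, one common idea is to score an example by how far it lies from its nearest neighbors in the training set. The block below is a minimal pure-Python sketch of that general idea, not cleanlab's implementation: the helper `knn_outlier_scores`, the choice `k=2`, and the `1 / (1 + distance)` mapping are all assumptions made for illustration. In this sketch a lower score means a more atypical example.

```python
import math


def knn_outlier_scores(train, queries, k=2):
    """Score each query by its average distance to its k nearest training
    points, mapped into (0, 1] so that a lower score means more atypical.

    Hypothetical helper for illustration only.
    """
    scores = []
    for q in queries:
        # Euclidean distance from the query to every training point, ascending.
        dists = sorted(math.dist(q, t) for t in train)
        avg_knn_dist = sum(dists[:k]) / k
        scores.append(1.0 / (1.0 + avg_knn_dist))
    return scores


# Toy 2-D "embeddings": a tight cluster for training, one near and one far query.
train_feature_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
test_feature_embeddings = [(0.05, 0.05), (5.0, 5.0)]

scores = knn_outlier_scores(train_feature_embeddings, test_feature_embeddings)
# The far-away query (5.0, 5.0) receives the lower of the two scores.
```

When scoring the training points themselves with a helper like this, each point would count itself as a neighbor at distance zero, so a real implementation has to exclude it.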
2. Detect **out of distribution** examples in a dataset based on **predicted class probabilities** from a trained classifier
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
```
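As a rough intuition for why predicted probabilities carry an out-of-distribution signal: a classifier often spreads its probability mass more evenly across classes for inputs unlike its training data. The sketch below measures that spread with normalized entropy. It is an illustration under stated assumptions, not cleanlab's scoring code; `normalized_entropy` is a hypothetical helper and the toy probabilities are made up.

```python
import math


def normalized_entropy(probs):
    """Entropy of one predicted probability vector, divided by log(K) so the
    result lies in [0, 1]. Zero probabilities are skipped, which avoids
    evaluating log(0). Assumes K >= 2 classes."""
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)


test_pred_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction
    [1.0, 0.0, 0.0],     # contains exact zeros
]
uncertainty = [normalized_entropy(p) for p in test_pred_probs]
# The near-uniform row has the highest uncertainty of the three.
```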
Multi-annotator -- support data with multiple labels
3. For data **labeled by multiple annotators** (stored as matrix `multiannotator_labels` whose rows correspond to examples, columns to each annotator’s chosen labels), cleanlab v2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities `pred_probs` from *any* trained classifier
```python
from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
```
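To make the data layout concrete, here is a small sketch of a `multiannotator_labels` matrix together with a plain majority-vote baseline. This is not the method cleanlab uses to estimate consensus labels (which also draws on `pred_probs`); `majority_vote` is a hypothetical helper, and representing a missing annotation with `None` is an assumption of this sketch, so check the cleanlab documentation for the expected missing-value marker.

```python
from collections import Counter

# Rows are examples, columns are annotators. None marks an example that an
# annotator did not label (an assumption of this sketch).
multiannotator_labels = [
    [0, 0, None],
    [1, None, 1],
    [0, 1, 1],
    [None, 2, 2],
]


def majority_vote(row):
    """Most common label among annotators who labeled this example.
    Ties go to the smallest label. Assumes at least one annotation per row."""
    counts = Counter(label for label in row if label is not None)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


consensus_labels = [majority_vote(row) for row in multiannotator_labels]

# Fraction of annotators (among those who labeled the example) agreeing
# with the majority-vote label.
agreement = [
    sum(label == c for label in row if label is not None)
    / sum(label is not None for label in row)
    for row, c in zip(multiannotator_labels, consensus_labels)
]
```

A baseline like this ignores the classifier entirely, which is one reason a model-aware approach can improve on it when annotators disagree or only a few labels are available per example.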
Support Token Classification tasks
4. Cleanlab v2.1 can now find label issues in **token classification** (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:
- `tokens`: List of tokenized sentences whose `i`th element is a list of strings corresponding to tokens of the `i`th sentence in dataset.
Example: `[..., ["I", "love", "cleanlab"], ...]`
- `labels`: List whose `i`th element is a list of integers corresponding to class labels of each token in the `i`th sentence. Example: `[..., [0, 0, 1], ...]`
- `pred_probs`: List whose `i`th element is a np.ndarray of shape `(N_i, K)` corresponding to predicted class probabilities for each token in the `i`th sentence (assuming this sentence contains `N_i` tokens and dataset has `K` possible classes). These should be out-of-sample `pred_probs` obtained from a token classification model via cross-validation.
Example: `[..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]`
Using these, you can easily find and display mislabeled tokens in your data:
```python
from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)
```
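Because the three inputs must line up sentence by sentence and token by token, it can help to sanity-check them before searching for issues. The following sketch does that and then lists tokens whose given label differs from the model's most likely class. Both helpers are hypothetical and written for illustration; the naive disagreement check is not how `find_label_issues` decides what to flag, and nested lists stand in for the `np.ndarray` entries of `pred_probs`.

```python
def check_token_inputs(tokens, labels, pred_probs):
    """Check per-sentence alignment of the three inputs and return K."""
    if not (len(tokens) == len(labels) == len(pred_probs)):
        raise ValueError("tokens, labels, and pred_probs must cover the same sentences")
    num_classes = len(pred_probs[0][0])
    for i, (sent, labs, probs) in enumerate(zip(tokens, labels, pred_probs)):
        if not (len(sent) == len(labs) == len(probs)):
            raise ValueError(f"sentence {i}: token, label, and pred_probs counts differ")
        if any(len(row) != num_classes for row in probs):
            raise ValueError(f"sentence {i}: every pred_probs row needs {num_classes} entries")
        if any(not 0 <= lab < num_classes for lab in labs):
            raise ValueError(f"sentence {i}: label outside 0..{num_classes - 1}")
    return num_classes


def tokens_disagreeing_with_model(labels, pred_probs):
    """Return (sentence_index, token_index) pairs where the given label is
    not the model's most likely class. A naive illustration only."""
    flagged = []
    for i, (labs, probs) in enumerate(zip(labels, pred_probs)):
        for j, (lab, row) in enumerate(zip(labs, probs)):
            predicted = max(range(len(row)), key=lambda c: row[c])
            if predicted != lab:
                flagged.append((i, j))
    return flagged


tokens = [["I", "love", "cleanlab"], ["Boston", "is", "cold"]]
labels = [[0, 0, 1], [0, 0, 0]]
pred_probs = [
    [[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]],
    [[0.1, 0.9], [0.95, 0.05], [0.7, 0.3]],
]

num_classes = check_token_inputs(tokens, labels, pred_probs)
flagged = tokens_disagreeing_with_model(labels, pred_probs)
# Only "Boston" (sentence 1, token 0) is labeled 0 while the model favors class 1.
```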
Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.
5. `CleanLearning` can now operate directly on **non-array dataset** formats like TensorFlow/PyTorch `Datasets` and use **arbitrary Keras models**:
```python
import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(
    make_model,
    model_kwargs={
        "num_features": features_np_array.shape[1],
        "num_classes": len(np.unique(labels_np_array)),
    },
)
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data
```
Change Log
* Fix edgecase divide-by-0 in entropy-score by jwmueller in https://github.com/cleanlab/cleanlab/pull/241
* Fix some typos. by Yulv-git in https://github.com/cleanlab/cleanlab/pull/242
* Updated project urls in setup.py by calebchiam in https://github.com/cleanlab/cleanlab/pull/249
* FeatureReq 33: Added custom sample_weight by rushic24 in https://github.com/cleanlab/cleanlab/pull/248
* Allow users to pass custom weights for ensemble label quality scoring by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/255
* Fix line index of CleanLearning(), some text of links, etc. by Yulv-git in https://github.com/cleanlab/cleanlab/pull/260
* Copy the docs build artifacts to the "stable" folder by weijinglok in https://github.com/cleanlab/cleanlab/pull/231
* Add Negative Log Loss Weighting Scheme for Ensemble Label Quality Score by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/267
* Developed class that allow the use of cleanlab with tensorflow and huggingface models by MattiaSangermano in https://github.com/cleanlab/cleanlab/pull/247
* Add KNN distance OOD scoring function and unit tests by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/268
* Dataset documentation clarifications by jwmueller in https://github.com/cleanlab/cleanlab/pull/270
* Add issue templates by anishathalye in https://github.com/cleanlab/cleanlab/pull/278
* Fix bug. get thresholds broken for multi_label by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/264
* Clarify labels format by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/282
* Drop dependency on SciPy by anishathalye in https://github.com/cleanlab/cleanlab/pull/286
* Make CleanLearning work with pandas and other non-numpy feature objects X by jwmueller in https://github.com/cleanlab/cleanlab/pull/285
* Allow CleanLearning to use validation data in each fold by huiwengoh in https://github.com/cleanlab/cleanlab/pull/295
* Created FAQ Page in the Cleanlab documentation by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/294
* Proper validation of `labels` values/format across package by jwmueller in https://github.com/cleanlab/cleanlab/pull/301
* Add static type checking by anishathalye in https://github.com/cleanlab/cleanlab/pull/306
* error for missing classes, consistency on determining num_classes by jwmueller in https://github.com/cleanlab/cleanlab/pull/308
* Added support to build KNN graph for OOD detection with only training data by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/305
* Standardize naming on K, num_classes and N, num_examples by huiwengoh in https://github.com/cleanlab/cleanlab/pull/312
* Added outlier detection tutorial into docs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/310
* Updating tutorials hyperlink to 2.0.0 release by aravindputrevu in https://github.com/cleanlab/cleanlab/pull/318
* Allow KNN object to be returned by get_outlier_scores, Improved OOD tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/319
* Some FAQ tips on how to improve CleanLearning by jwmueller in https://github.com/cleanlab/cleanlab/pull/324
* Updated tutorials to include quickstart by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/323
* Add y argument as alternative to labels in CleanLearning.fit() by elisno in https://github.com/cleanlab/cleanlab/pull/322
* validation.py: Annotate function args and return values by elisno in https://github.com/cleanlab/cleanlab/pull/317
* Fixed package version issues for audio tutorial by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/325
* Add compatibility for tensorflow and pytorch Dataset objects by jwmueller in https://github.com/cleanlab/cleanlab/pull/311
* Re-order find_label_issues args for better clarity by jwmueller in https://github.com/cleanlab/cleanlab/pull/329
* Comment on missing/rare classes in FAQ by jwmueller in https://github.com/cleanlab/cleanlab/pull/332
* update sphinx to v5 by jwmueller in https://github.com/cleanlab/cleanlab/pull/327
* Allow missing classes in get_label_quality_scores by huiwengoh in https://github.com/cleanlab/cleanlab/pull/334
* Allow missing classes in assert_valid_class_labels by huiwengoh in https://github.com/cleanlab/cleanlab/pull/335
* Changed all docstring instances of np.array to np.ndarray by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/336
* Update Contributing.md with Projects link and getting started instructions by jwmueller in https://github.com/cleanlab/cleanlab/pull/349
* Switch docs links from latest release to stable by elisno in https://github.com/cleanlab/cleanlab/pull/379
* Extending cleanlab to find label errors in token classification datasets by ericwang1997 in https://github.com/cleanlab/cleanlab/pull/347
* Cleanlab functionality for multiannotator data by huiwengoh in https://github.com/cleanlab/cleanlab/pull/333
* Cleanup token classification code by elisno in https://github.com/cleanlab/cleanlab/pull/390
* Fix typing for find_label_issues by elisno in https://github.com/cleanlab/cleanlab/pull/391
* Match token/s in color_sentence by elisno in https://github.com/cleanlab/cleanlab/pull/397
* Escape special regex characters by elisno in https://github.com/cleanlab/cleanlab/pull/404
* Add FAQ question on how to get predicted labels by jwmueller in https://github.com/cleanlab/cleanlab/pull/402
* Implementing get_ood_scores function by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/338
* Add termcolor dependency by huiwengoh in https://github.com/cleanlab/cleanlab/pull/415
* Add token classification tutorial notebook to docs.cleanlab.ai by elisno in https://github.com/cleanlab/cleanlab/pull/411
* Update examples links by huiwengoh in https://github.com/cleanlab/cleanlab/pull/421
* Polish multiannotator docs by huiwengoh in https://github.com/cleanlab/cleanlab/pull/422
* Text tutorial improvements by jwmueller in https://github.com/cleanlab/cleanlab/pull/429
* suppress tensorflow warning logs in tutorials if not properly installed by jwmueller in https://github.com/cleanlab/cleanlab/pull/432
* Add autodoc-typehints extension for sphinx by elisno in https://github.com/cleanlab/cleanlab/pull/412
* Strip input prompts when copying code snippets by elisno in https://github.com/cleanlab/cleanlab/pull/439
* Extend KerasWrapper to Functional API by huiwengoh in https://github.com/cleanlab/cleanlab/pull/434
* Deploy documentation for token classification module by elisno in https://github.com/cleanlab/cleanlab/pull/438
* Updated labels to allow array_like by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/426
* Add keras wrapper to docs by jwmueller in https://github.com/cleanlab/cleanlab/pull/443
* Format all return docstrings and add typing by jwmueller in https://github.com/cleanlab/cleanlab/pull/437
* make num_label_issues = cj calibrated offdiag sum by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/445
* fix bug in hard-coded test. generalize the test by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/448
* Change output of display_issues by elisno in https://github.com/cleanlab/cleanlab/pull/450
* More improvements to token classification code and documentation by jwmueller in https://github.com/cleanlab/cleanlab/pull/452
* Fix details disclosure elements in docs by anishathalye in https://github.com/cleanlab/cleanlab/pull/456
* Add missing backticks and language annotation by anishathalye in https://github.com/cleanlab/cleanlab/pull/461
* Error handling for rare classes in multiannotator data by huiwengoh in https://github.com/cleanlab/cleanlab/pull/455
* Fix docs build in CI by anishathalye in https://github.com/cleanlab/cleanlab/pull/462
* Added support for returning ranked issue idxs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/459
* update readme for v2.1 by jwmueller in https://github.com/cleanlab/cleanlab/pull/457
* Clearer code examples on docs main page by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/430
New Contributors
* Yulv-git made their first contribution in https://github.com/cleanlab/cleanlab/pull/242
* rushic24 made their first contribution in https://github.com/cleanlab/cleanlab/pull/248
* MattiaSangermano made their first contribution in https://github.com/cleanlab/cleanlab/pull/247
* ulya-tkch made their first contribution in https://github.com/cleanlab/cleanlab/pull/293
* huiwengoh made their first contribution in https://github.com/cleanlab/cleanlab/pull/295
* aravindputrevu made their first contribution in https://github.com/cleanlab/cleanlab/pull/318
* elisno made their first contribution in https://github.com/cleanlab/cleanlab/pull/322
* ericwang1997 made their first contribution in https://github.com/cleanlab/cleanlab/pull/340
**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0