Cleanlab

Latest version: v2.7.0

2.2

Finding label issues in multi-label classification is done using the same code and inputs as before (and the same object is returned as before):
```python
from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)
```

For a 3-class multi-label dataset with 4 examples, the inputs might look like this:
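```python
import numpy as np

# Each example lists the classes it belongs to.
labels = [[0], [0, 1], [0, 2], [1]]

# One row per example, one column per class.
pred_probs = np.array(
    [[0.9, 0.1, 0.1],
     [0.9, 0.1, 0.8],
     [0.9, 0.1, 0.6],
     [0.2, 0.8, 0.3]]
)
```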
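
To make the input format concrete, the sketch below converts the list-of-lists `labels` into a binary indicator matrix and compares it with `pred_probs` thresholded at 0.5. The 0.5 threshold and the disagreement check are illustrative assumptions only. They are not the algorithm cleanlab uses to find multi-label issues.

```python
import numpy as np

# Inputs copied from the example above.
labels = [[0], [0, 1], [0, 2], [1]]
pred_probs = np.array(
    [[0.9, 0.1, 0.1],
     [0.9, 0.1, 0.8],
     [0.9, 0.1, 0.6],
     [0.2, 0.8, 0.3]]
)

num_examples, num_classes = pred_probs.shape

# Binary indicator matrix: entry (i, k) is 1 if class k is in labels[i].
indicator = np.zeros((num_examples, num_classes), dtype=int)
for i, classes in enumerate(labels):
    indicator[i, classes] = 1

# Hypothetical heuristic (not cleanlab's algorithm): threshold each class
# probability at 0.5 and list the examples where any class disagrees.
predicted = (pred_probs >= 0.5).astype(int)
disagreeing_examples = np.flatnonzero((predicted != indicator).any(axis=1))
print(disagreeing_examples)
```

With these inputs, only the second example (index 1) disagrees: it is labeled with classes 0 and 1, while the thresholded probabilities favor classes 0 and 2.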



The following code (in which class 1 is missing from the dataset) did not previously work but now runs without problem in cleanlab v2.2.0:
```python
from cleanlab.filter import find_label_issues
import numpy as np

labels = [0, 0, 2, 0, 2]
pred_probs = np.array(
    [[0.8, 0.1, 0.1],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6],
     [0.5, 0.2, 0.3],
     [0.1, 0.1, 0.8]]
)

label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
)
```
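
As a rough illustration of what "missing class" means here, the following sketch uses plain NumPy to find classes that have a column in `pred_probs` but never appear in `labels`. Taking the number of classes from the width of `pred_probs` is an assumption of this sketch, and the self-confidence computation is shown only for orientation. None of this is cleanlab's internal code.

```python
import numpy as np

labels = np.array([0, 0, 2, 0, 2])
pred_probs = np.array(
    [[0.8, 0.1, 0.1],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6],
     [0.5, 0.2, 0.3],
     [0.1, 0.1, 0.8]]
)

# Assumption of this sketch: the class count comes from pred_probs' columns.
num_classes = pred_probs.shape[1]
missing_classes = np.setdiff1d(np.arange(num_classes), np.unique(labels))

# Self-confidence: the predicted probability of each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]
print(missing_classes, self_confidence)
```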


Looking forward

The next major release of this package will introduce a paradigm shift in the way people check their datasets. Today this involves significant manual labor, but software should be able to help! Our research has developed algorithms that can automatically detect many types of common issues that plague real-world ML datasets. The next version of cleanlab will offer an easy-to-use line of code that runs all of our appropriate algorithms to help ensure a given dataset is issue-free and well-suited for supervised learning.

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed [on our GitHub](https://github.com/cleanlab/cleanlab/wiki#ideas-for-contributing-to-cleanlab), or you can jump into the discussions on [Slack](https://cleanlab.ai/slack).

Change Log
* updated label_quality_utils.py and rebuilt the doc by ethanotran in https://github.com/cleanlab/cleanlab/pull/475
* Add workflow for skipping notebooks by huiwengoh in https://github.com/cleanlab/cleanlab/pull/472
* Fix return type in token classification get_label_quality_scores by jwmueller in https://github.com/cleanlab/cleanlab/pull/477
* Adding pylint CI checks by mohitsaxenaknoldus in https://github.com/cleanlab/cleanlab/pull/465
* CI: Build check cleanlab works without optional dependencies by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/470
* Outlier tutorial: move uninteresting code to hidden cell by jwmueller in https://github.com/cleanlab/cleanlab/pull/492
* Update DEVELOPMENT.md with howto add new modules by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/494
* Minor asthetic fix for tutorials.ipynb by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/493
* Update __init__.py to include major files by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/490
* Make type checking pass with mypy 0.981 by anishathalye in https://github.com/cleanlab/cleanlab/pull/488
* Update issues returned by num_label_issues by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/485
* Mypy typechecking fix for count.py by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/500
* Add basic utilities for handling quality scores for multilabel data by elisno in https://github.com/cleanlab/cleanlab/pull/499
* reinvented algorithms for multilabel find_label_issues by aditya1503 in https://github.com/cleanlab/cleanlab/pull/483
* Trying to fix typings by ChinoCodeDemon in https://github.com/cleanlab/cleanlab/pull/502
* Add internal function to properly format labels by huiwengoh in https://github.com/cleanlab/cleanlab/pull/504
* Mention internal format label function in multiannotator docs by huiwengoh in https://github.com/cleanlab/cleanlab/pull/506
* Multilabel code restructuring with aggregation/scorer functions by aditya1503 in https://github.com/cleanlab/cleanlab/pull/509
* Separate word_coloring from token_replacement in color_sentence by elisno in https://github.com/cleanlab/cleanlab/pull/514
* Add support for missing classes by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/511
* Better missing class support for label quality scoring by jwmueller in https://github.com/cleanlab/cleanlab/pull/518
* moving multilabel functions by aditya1503 in https://github.com/cleanlab/cleanlab/pull/515
* restrict typecheck to python v3.10 by jwmueller in https://github.com/cleanlab/cleanlab/pull/521
* support missing classes in multiannotator functions by huiwengoh in https://github.com/cleanlab/cleanlab/pull/519
* fix mypy typing by huiwengoh in https://github.com/cleanlab/cleanlab/pull/524
* Add studio banner by cmauck10 in https://github.com/cleanlab/cleanlab/pull/525
* added missing classes test for multilabel by aditya1503 in https://github.com/cleanlab/cleanlab/pull/523
* Improve tutorials language/formatting by jwmueller in https://github.com/cleanlab/cleanlab/pull/526
* Validate forgetting factor in EMA by elisno in https://github.com/cleanlab/cleanlab/pull/527
* Remove strong worded requirement for out-of-sample pred probs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/520
* Ensure type checks pass with new mypy v0.990 by jwmueller in https://github.com/cleanlab/cleanlab/pull/530
* replace pylint --> flake8 by ilnarkz in https://github.com/cleanlab/cleanlab/pull/531
* Tutorial for multi-label classification by aditya1503 in https://github.com/cleanlab/cleanlab/pull/517
* Fix multilabel_py dimensionality by elisno in https://github.com/cleanlab/cleanlab/pull/535
* cleanlab install on colab for multilabel tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/537
* Refactor MultilabelScorer helper methods and tests by elisno in https://github.com/cleanlab/cleanlab/pull/540
* Make a public method for multilabel quality scores by jwmueller in https://github.com/cleanlab/cleanlab/pull/542
* Improve and standardize documentation in label error detection methods for classification datasets by jwmueller in https://github.com/cleanlab/cleanlab/pull/543
* Move mypy configuration to config file by anishathalye in https://github.com/cleanlab/cleanlab/pull/545
* Fix types to work with latest pandas-stubs by anishathalye in https://github.com/cleanlab/cleanlab/pull/546
* Fix passing of kwargs to get_label_quality_scores by anishathalye in https://github.com/cleanlab/cleanlab/pull/547
* Switch CI cron schedule by anishathalye in https://github.com/cleanlab/cleanlab/pull/548
* Remove unnecessary type: ignore annotations by anishathalye in https://github.com/cleanlab/cleanlab/pull/549
* update readme for v2.2 by jwmueller in https://github.com/cleanlab/cleanlab/pull/551

New Contributors
* ethanotran made their first contribution in https://github.com/cleanlab/cleanlab/pull/475
* mohitsaxenaknoldus made their first contribution in https://github.com/cleanlab/cleanlab/pull/465
* aditya1503 made their first contribution in https://github.com/cleanlab/cleanlab/pull/483
* ChinoCodeDemon made their first contribution in https://github.com/cleanlab/cleanlab/pull/502
* cmauck10 made their first contribution in https://github.com/cleanlab/cleanlab/pull/525
* ilnarkz made their first contribution in https://github.com/cleanlab/cleanlab/pull/531
* Po-He Tseng helped run some early tests of our new multi-label algorithms

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.1.0...v2.2.0

2.2.0

Examples corresponding to [cleanlab's v2.2.0 release](https://github.com/cleanlab/cleanlab/releases/tag/v2.2.0).

2.1

Out of Distribution and Outlier Detection

1. Detect **out of distribution** examples in a dataset based on its numeric **feature embeddings**
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
```
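
The change log for this release mentions a KNN distance scoring function. The sketch below shows the general idea with plain NumPy: an example is treated as more unusual the farther it lies from its nearest training neighbors. The function name `knn_mean_distance`, the choice of `k`, and the toy data are all hypothetical, and cleanlab's own scores may be scaled or oriented differently.

```python
import numpy as np

def knn_mean_distance(train_features, query_features, k=2, exclude_self=False):
    """Mean Euclidean distance from each query row to its k nearest training rows."""
    diffs = query_features[:, None, :] - train_features[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    distances.sort(axis=1)
    # When scoring the training set against itself, skip the zero self-distance.
    start = 1 if exclude_self else 0
    return distances[:, start:start + k].mean(axis=1)

# Toy data: four clustered points plus one far-away point.
train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
test = np.array([[0.05, 0.05], [4.0, 4.0]])

train_scores = knn_mean_distance(train, train, k=2, exclude_self=True)
test_scores = knn_mean_distance(train, test, k=2)
print(train_scores, test_scores)
```

In this toy data the point at `[5.0, 5.0]` receives the largest training distance, and the test point at `[4.0, 4.0]` is farther from its neighbors than the one inside the cluster.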


2. Detect **out of distribution** examples in a dataset based on **predicted class probabilities** from a trained classifier
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
```
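
One common way to turn predicted class probabilities into an uncertainty signal is the entropy of each prediction. The sketch below is a generic NumPy version of that idea; `normalized_entropy` is a hypothetical helper, not a cleanlab function, and it ignores the class labels that `fit_score` accepts.

```python
import numpy as np

def normalized_entropy(pred_probs, eps=1e-12):
    """Entropy of each row of class probabilities, scaled to lie in [0, 1]."""
    # Clip before taking the log so that zero probabilities do not produce -inf.
    clipped = np.clip(pred_probs, eps, 1.0)
    entropy = -(pred_probs * np.log(clipped)).sum(axis=1)
    return entropy / np.log(pred_probs.shape[1])

pred_probs = np.array(
    [[0.98, 0.01, 0.01],      # confident prediction
     [1 / 3, 1 / 3, 1 / 3]]   # maximally uncertain prediction
)
uncertainty = normalized_entropy(pred_probs)
print(uncertainty)
```

The uniform row reaches the maximum value of 1, while the confident row stays close to 0.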


Multi-annotator -- support data with multiple labels

3. For data **labeled by multiple annotators** (stored as matrix `multiannotator_labels` whose rows correspond to examples, columns to each annotator’s chosen labels), cleanlab v2.1 can find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities `pred_probs` from *any* trained classifier:
```python
from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
```
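
For orientation, here is a minimal baseline that derives consensus labels by simple majority vote, together with the fraction of annotators who agree with each consensus label. It is only a point of comparison: it does not use `pred_probs` and is not the method cleanlab implements. The data, the use of NaN for "annotator did not label this example", and the `majority_vote` helper are assumptions of this sketch.

```python
import numpy as np

# Hypothetical data: 4 examples, 3 annotators, NaN where no label was given.
multiannotator_labels = np.array(
    [[0, 0, np.nan],
     [1, np.nan, 1],
     [2, 2, 0],
     [np.nan, 1, 1]]
)
num_classes = 3

def majority_vote(label_matrix, num_classes):
    """Most frequent non-missing label per row (ties go to the lowest class index)."""
    consensus = []
    for row in label_matrix:
        given = row[~np.isnan(row)].astype(int)
        counts = np.bincount(given, minlength=num_classes)
        consensus.append(int(np.argmax(counts)))
    return np.array(consensus)

consensus_labels = majority_vote(multiannotator_labels, num_classes)

# Fraction of annotators who agree with the consensus for each example.
agreement = np.array([
    np.mean(row[~np.isnan(row)] == label)
    for row, label in zip(multiannotator_labels, consensus_labels)
])
print(consensus_labels, agreement)
```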


Support Token Classification tasks

4. Cleanlab v2.1 can now find label issues in **token classification** (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:

- `tokens`: List of tokenized sentences whose `i`th element is a list of strings corresponding to tokens of the `i`th sentence in dataset.
Example: `[..., ["I", "love", "cleanlab"], ...]`
- `labels`: List whose `i`th element is a list of integers corresponding to class labels of each token in the `i`th sentence. Example: `[..., [0, 0, 1], ...]`
- `pred_probs`: List whose `i`th element is a np.ndarray of shape `(N_i, K)` corresponding to predicted class probabilities for each token in the `i`th sentence (assuming this sentence contains `N_i` tokens and dataset has `K` possible classes). These should be out-of-sample `pred_probs` obtained from a token classification model via cross-validation.
Example: `[..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]`

Using these, you can easily find and display mislabeled tokens in your data:
```python
from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)
```
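
The three inputs described above can be checked for consistent shapes before they are passed to cleanlab. The sketch below builds a tiny hypothetical dataset, verifies that each sentence has one label and one probability row per token, and scores each sentence by its least confident token. The second sentence and the min-based sentence score are illustrative assumptions, not a description of how cleanlab ranks issues.

```python
import numpy as np

# Hypothetical two-sentence dataset with K = 2 classes.
tokens = [["I", "love", "cleanlab"], ["Hello", "world"]]
labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]),
    np.array([[0.6, 0.4], [0.9, 0.1]]),
]

# Check that every sentence has one label and one probability row per token.
for sentence_tokens, sentence_labels, sentence_probs in zip(tokens, labels, pred_probs):
    assert len(sentence_tokens) == len(sentence_labels) == sentence_probs.shape[0]

# Per-token self-confidence: predicted probability of the given label.
token_scores = [
    probs[np.arange(len(given)), given]
    for given, probs in zip(labels, pred_probs)
]

# Summarize each sentence by its least confident token.
sentence_scores = [float(scores.min()) for scores in token_scores]
worst_sentence = int(np.argmin(sentence_scores))
print(sentence_scores, worst_sentence)
```

Here the second sentence scores lowest because its token "world" is labeled class 1 while the model assigns that class a probability of only 0.1.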


Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.

5. `CleanLearning` can now operate directly on **non-array dataset** formats like TensorFlow/PyTorch `Datasets` and use **arbitrary Keras models**:

```python
import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data
```



Change Log
* Fix edgecase divide-by-0 in entropy-score by jwmueller in https://github.com/cleanlab/cleanlab/pull/241
* Fix some typos. by Yulv-git in https://github.com/cleanlab/cleanlab/pull/242
* Updated project urls in setup.py by calebchiam in https://github.com/cleanlab/cleanlab/pull/249
* FeatureReq 33: Added custom sample_weight by rushic24 in https://github.com/cleanlab/cleanlab/pull/248
* Allow users to pass custom weights for ensemble label quality scoring by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/255
* Fix line index of CleanLearning(), some text of links, etc. by Yulv-git in https://github.com/cleanlab/cleanlab/pull/260
* Copy the docs build artifacts to the "stable" folder by weijinglok in https://github.com/cleanlab/cleanlab/pull/231
* Add Negative Log Loss Weighting Scheme for Ensemble Label Quality Score by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/267
* Developed class that allow the use of cleanlab with tensorflow and huggingface models by MattiaSangermano in https://github.com/cleanlab/cleanlab/pull/247
* Add KNN distance OOD scoring function and unit tests by JohnsonKuan in https://github.com/cleanlab/cleanlab/pull/268
* Dataset documentation clarifications by jwmueller in https://github.com/cleanlab/cleanlab/pull/270
* Add issue templates by anishathalye in https://github.com/cleanlab/cleanlab/pull/278
* Fix bug. get thresholds broken for multi_label by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/264
* Clarify labels format by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/282
* Drop dependency on SciPy by anishathalye in https://github.com/cleanlab/cleanlab/pull/286
* Make CleanLearning work with pandas and other non-numpy feature objects X by jwmueller in https://github.com/cleanlab/cleanlab/pull/285
* Allow CleanLearning to use validation data in each fold by huiwengoh in https://github.com/cleanlab/cleanlab/pull/295
* Created FAQ Page in the Cleanlab documentation by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/294
* Proper validation of `labels` values/format across package by jwmueller in https://github.com/cleanlab/cleanlab/pull/301
* Add static type checking by anishathalye in https://github.com/cleanlab/cleanlab/pull/306
* error for missing classes, consistency on determining num_classes by jwmueller in https://github.com/cleanlab/cleanlab/pull/308
* Added support to build KNN graph for OOD detection with only training data by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/305
* Standardize naming on K, num_classes and N, num_examples by huiwengoh in https://github.com/cleanlab/cleanlab/pull/312
* Added outlier detection tutorial into docs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/310
* Updating tutorials hyperlink to 2.0.0 release by aravindputrevu in https://github.com/cleanlab/cleanlab/pull/318
* Allow KNN object to be returned by get_outlier_scores, Improved OOD tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/319
* Some FAQ tips on how to improve CleanLearning by jwmueller in https://github.com/cleanlab/cleanlab/pull/324
* Updated tutorials to include quickstart by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/323
* Add y argument as alternative to labels in CleanLearning.fit() by elisno in https://github.com/cleanlab/cleanlab/pull/322
* validation.py: Annotate function args and return values by elisno in https://github.com/cleanlab/cleanlab/pull/317
* Fixed package version issues for audio tutorial by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/325
* Add compatibility for tensorflow and pytorch Dataset objects by jwmueller in https://github.com/cleanlab/cleanlab/pull/311
* Re-order find_label_issues args for better clarity by jwmueller in https://github.com/cleanlab/cleanlab/pull/329
* Comment on missing/rare classes in FAQ by jwmueller in https://github.com/cleanlab/cleanlab/pull/332
* update sphinx to v5 by jwmueller in https://github.com/cleanlab/cleanlab/pull/327
* Allow missing classes in get_label_quality_scores by huiwengoh in https://github.com/cleanlab/cleanlab/pull/334
* Allow missing classes in assert_valid_class_labels by huiwengoh in https://github.com/cleanlab/cleanlab/pull/335
* Changed all docstring instances of np.array to np.ndarray by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/336
* Update Contributing.md with Projects link and getting started instructions by jwmueller in https://github.com/cleanlab/cleanlab/pull/349
* Switch docs links from latest release to stable by elisno in https://github.com/cleanlab/cleanlab/pull/379
* Extending cleanlab to find label errors in token classification datasets by ericwang1997 in https://github.com/cleanlab/cleanlab/pull/347
* Cleanlab functionality for multiannotator data by huiwengoh in https://github.com/cleanlab/cleanlab/pull/333
* Cleanup token classification code by elisno in https://github.com/cleanlab/cleanlab/pull/390
* Fix typing for find_label_issues by elisno in https://github.com/cleanlab/cleanlab/pull/391
* Match token/s in color_sentence by elisno in https://github.com/cleanlab/cleanlab/pull/397
* Escape special regex characters by elisno in https://github.com/cleanlab/cleanlab/pull/404
* Add FAQ question on how to get predicted labels by jwmueller in https://github.com/cleanlab/cleanlab/pull/402
* Implementing get_ood_scores function by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/338
* Add termcolor dependency by huiwengoh in https://github.com/cleanlab/cleanlab/pull/415
* Add token classification tutorial notebook to docs.cleanlab.ai by elisno in https://github.com/cleanlab/cleanlab/pull/411
* Update examples links by huiwengoh in https://github.com/cleanlab/cleanlab/pull/421
* Polish multiannotator docs by huiwengoh in https://github.com/cleanlab/cleanlab/pull/422
* Text tutorial improvements by jwmueller in https://github.com/cleanlab/cleanlab/pull/429
* suppress tensorflow warning logs in tutorials if not properly installed by jwmueller in https://github.com/cleanlab/cleanlab/pull/432
* Add autodoc-typehints extension for sphinx by elisno in https://github.com/cleanlab/cleanlab/pull/412
* Strip input prompts when copying code snippets by elisno in https://github.com/cleanlab/cleanlab/pull/439
* Extend KerasWrapper to Functional API by huiwengoh in https://github.com/cleanlab/cleanlab/pull/434
* Deploy documentation for token classification module by elisno in https://github.com/cleanlab/cleanlab/pull/438
* Updated labels to allow array_like by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/426
* Add keras wrapper to docs by jwmueller in https://github.com/cleanlab/cleanlab/pull/443
* Format all return docstrings and add typing by jwmueller in https://github.com/cleanlab/cleanlab/pull/437
* make num_label_issues = cj calibrated offdiag sum by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/445
* fix bug in hard-coded test. generalize the test by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/448
* Change output of display_issues by elisno in https://github.com/cleanlab/cleanlab/pull/450
* More improvements to token classification code and documentation by jwmueller in https://github.com/cleanlab/cleanlab/pull/452
* Fix details disclosure elements in docs by anishathalye in https://github.com/cleanlab/cleanlab/pull/456
* Add missing backticks and language annotation by anishathalye in https://github.com/cleanlab/cleanlab/pull/461
* Error handling for rare classes in multiannotator data by huiwengoh in https://github.com/cleanlab/cleanlab/pull/455
* Fix docs build in CI by anishathalye in https://github.com/cleanlab/cleanlab/pull/462
* Added support for returning ranked issue idxs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/459
* update readme for v2.1 by jwmueller in https://github.com/cleanlab/cleanlab/pull/457
* Clearer code examples on docs main page by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/430

New Contributors
* Yulv-git made their first contribution in https://github.com/cleanlab/cleanlab/pull/242
* rushic24 made their first contribution in https://github.com/cleanlab/cleanlab/pull/248
* MattiaSangermano made their first contribution in https://github.com/cleanlab/cleanlab/pull/247
* ulya-tkch made their first contribution in https://github.com/cleanlab/cleanlab/pull/293
* huiwengoh made their first contribution in https://github.com/cleanlab/cleanlab/pull/295
* aravindputrevu made their first contribution in https://github.com/cleanlab/cleanlab/pull/318
* elisno made their first contribution in https://github.com/cleanlab/cleanlab/pull/322
* ericwang1997 made their first contribution in https://github.com/cleanlab/cleanlab/pull/340

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.0.0...v2.1.0

2.1.0

Examples corresponding to [Cleanlab's v2.1.0 release](https://github.com/cleanlab/cleanlab/releases/tag/v2.1.0).

2.0.0

Examples corresponding to [Cleanlab's v2.0.0 release](https://github.com/cleanlab/cleanlab/releases/tag/v2.0.0).

1.0.1

Examples corresponding to [Cleanlab's v1.0.1 release](https://github.com/cleanlab/cleanlab/releases/tag/v1.0.1).
