Cleanlab

Latest version: v2.6.5


2.5.0

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in `cleanlab.experimental` that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:
- **regression** (finding errors in numeric data): see `cleanlab.regression` and the "noisy labels in regression" quickstart tutorial.
- **object detection**: see `cleanlab.object_detection` and the "Object Detection" quickstart tutorial.
- **image segmentation**: see `cleanlab.segmentation` and the "Semantic Segmentation" tutorial.
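
To give a feel for what "finding errors in numeric data" can mean, here is a minimal sketch that scores each regression label by how far it sits from a model's prediction. The function name `residual_quality_scores` and the exponential mapping are assumptions made for illustration; they are not taken from `cleanlab.regression`, whose actual methods are described in its documentation.

```python
import math

def residual_quality_scores(labels, predictions):
    """Map each absolute residual to a score in (0, 1]; lower means more suspicious."""
    return [math.exp(-abs(y - y_hat)) for y, y_hat in zip(labels, predictions)]

# Hypothetical given labels and model predictions for four examples.
labels = [1.0, 2.1, 9.0, 4.0]
predictions = [1.1, 2.0, 3.0, 4.2]

scores = residual_quality_scores(labels, predictions)
# The example with the lowest score is the first one to review.
suspect = min(range(len(scores)), key=scores.__getitem__)
print(suspect)  # 2
```

A large residual alone does not prove the label is wrong (the model may simply be poor on that example), so scores like these are best treated as a review order rather than a verdict.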

Cleanlab already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor; check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

Improvements to Datalab

[Datalab](https://cleanlab.ai/blog/datalab/) is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

- Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of [CleanVision](https://cleanlab.ai/blog/cleanvision/).
- Detect label issues even without `pred_probs` from an ML model (you can instead just provide `features`).
- Flag rare classes in imbalanced classification datasets.
- Audit unlabeled datasets.
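
As a rough illustration of what flagging rare classes involves, the sketch below marks any class whose share of the dataset falls below a cutoff. The helper `find_rare_classes`, the `threshold` parameter, and the cutoff rule are all hypothetical and chosen for this example; Datalab's own criterion may differ.

```python
from collections import Counter

def find_rare_classes(labels, threshold=0.1):
    """Return classes whose share of the data is below threshold / num_classes."""
    counts = Counter(labels)
    cutoff = threshold / len(counts)  # assumed rule, for illustration only
    total = len(labels)
    return sorted(c for c, k in counts.items() if k / total < cutoff)

# Hypothetical imbalanced dataset: 60 cats, 39 dogs, 1 fox.
labels = ["cat"] * 60 + ["dog"] * 39 + ["fox"] * 1
print(find_rare_classes(labels))  # ['fox']
```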

Other major improvements

- 50x speedup in the `cleanlab.multiannotator` code for analyzing data labeled by multiple annotators.
- Out-of-Distribution detection based on `pred_probs` via the [GEN algorithm](https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf) which is particularly effective for datasets with tons of classes.
- Many of the methods across the package to find label issues now support a `low_memory` option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
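
For readers curious how an Out-of-Distribution score can be computed from `pred_probs` alone, here is a minimal sketch of a generalized-entropy score in the spirit of the GEN paper linked above. It reflects one reading of that paper and is not cleanlab's implementation; the name `gen_ood_score` and the `gamma` and `top_m` values are assumptions for illustration.

```python
def gen_ood_score(probs, gamma=0.1, top_m=100):
    """Negative generalized entropy over the top_m largest predicted probabilities.

    More negative values indicate a flatter prediction, which this sketch
    treats as more likely to be out-of-distribution.
    """
    top = sorted(probs, reverse=True)[:top_m]
    return -sum((p ** gamma) * ((1.0 - p) ** gamma) for p in top)

confident = [0.98, 0.01, 0.01]   # peaked prediction
uncertain = [0.34, 0.33, 0.33]   # nearly uniform prediction

# The peaked prediction receives the higher (less negative) score.
print(gen_ood_score(confident) > gen_ood_score(uncertain))  # True
```

Restricting the sum to the largest probabilities is what makes this kind of score practical for datasets with many classes, where the long tail of near-zero probabilities would otherwise dominate.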

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed [on our github](https://github.com/cleanlab/cleanlab/wiki#ideas-for-contributing-to-cleanlab) or you can jump into the discussions on [Slack](https://cleanlab.ai/slack). We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

* gordon-lim made their first contribution in https://github.com/cleanlab/cleanlab/pull/746
* tataganesh made their first contribution in https://github.com/cleanlab/cleanlab/pull/751
* vdlad made their first contribution in https://github.com/cleanlab/cleanlab/pull/677
* axl1313 made their first contribution in https://github.com/cleanlab/cleanlab/pull/798
* coding-famer made their first contribution in https://github.com/cleanlab/cleanlab/pull/800


Change Log

* New feature: Label error detection in regression datasets by krmayankb in https://github.com/cleanlab/cleanlab/pull/572; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/830

* New feature: ObjectLab for detecting mislabeled images in object detection datasets by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/676, https://github.com/cleanlab/cleanlab/pull/739, https://github.com/cleanlab/cleanlab/pull/745, https://github.com/cleanlab/cleanlab/pull/770, https://github.com/cleanlab/cleanlab/pull/779, https://github.com/cleanlab/cleanlab/pull/807, https://github.com/cleanlab/cleanlab/pull/833; by aditya1503 in https://github.com/cleanlab/cleanlab/pull/750, https://github.com/cleanlab/cleanlab/pull/804

* New feature: Label error detection in segmentation datasets by vdlad in https://github.com/cleanlab/cleanlab/pull/677; by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/754, https://github.com/cleanlab/cleanlab/pull/756, https://github.com/cleanlab/cleanlab/pull/759, https://github.com/cleanlab/cleanlab/pull/772; by elisno in https://github.com/cleanlab/cleanlab/pull/775

* New feature: CleanVision to detect low-quality images by sanjanag in https://github.com/cleanlab/cleanlab/pull/679, https://github.com/cleanlab/cleanlab/pull/797

* New image quickstart tutorial that uses Datalab by sanjanag in https://github.com/cleanlab/cleanlab/pull/795

* Datalab code refactoring by elisno in https://github.com/cleanlab/cleanlab/pull/803, https://github.com/cleanlab/cleanlab/pull/783, https://github.com/cleanlab/cleanlab/pull/793, https://github.com/cleanlab/cleanlab/pull/729
* Make labels optional in Datalab by elisno in https://github.com/cleanlab/cleanlab/pull/730
* Update near-duplicate sets in Datalab by elisno in https://github.com/cleanlab/cleanlab/pull/781
* Include non-IID detection in set of default Datalab issue types by elisno in https://github.com/cleanlab/cleanlab/pull/723
* Extend Datalab to be able to detect label issues based on features by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/760
* Add imbalance issue type to Datalab by tataganesh in https://github.com/cleanlab/cleanlab/pull/758, https://github.com/cleanlab/cleanlab/pull/828
* Catch specific exception for knn in Datalab issue managers by tataganesh in https://github.com/cleanlab/cleanlab/pull/825
* Make plots smaller for datalab tutorials by tataganesh in https://github.com/cleanlab/cleanlab/pull/751

* 50x speedup and other improvements in multiannotator module by huiwengoh in https://github.com/cleanlab/cleanlab/pull/821, https://github.com/cleanlab/cleanlab/pull/784; by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/827

* ENH: make clipping unnecessary for entropy by DerWeh in https://github.com/cleanlab/cleanlab/pull/703

* Extend default CleanLearning classifier to work for more datasets by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/749
* CleanLearning code improvements by huiwengoh in https://github.com/cleanlab/cleanlab/pull/724; by jwmueller in https://github.com/cleanlab/cleanlab/pull/744
* Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by huiwengoh in https://github.com/cleanlab/cleanlab/pull/761

* Expose low memory option for finding label issues by tataganesh in https://github.com/cleanlab/cleanlab/pull/791, https://github.com/cleanlab/cleanlab/pull/822

* Add GEN OOD-detection algorithm by coding-famer in https://github.com/cleanlab/cleanlab/pull/800

* Unify softmax implementations throughout package by elisno in https://github.com/cleanlab/cleanlab/pull/826

* Better warning handling for off_calibrated_custom in confident joint by gordon-lim in https://github.com/cleanlab/cleanlab/pull/746

* Clearer explanations in documentation/tutorials/readme by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/725; by jwmueller in https://github.com/cleanlab/cleanlab/pull/726, https://github.com/cleanlab/cleanlab/pull/734, https://github.com/cleanlab/cleanlab/pull/741, https://github.com/cleanlab/cleanlab/pull/743, https://github.com/cleanlab/cleanlab/pull/766, https://github.com/cleanlab/cleanlab/pull/832, https://github.com/cleanlab/cleanlab/pull/799, https://github.com/cleanlab/cleanlab/pull/752, https://github.com/cleanlab/cleanlab/pull/841, https://github.com/cleanlab/cleanlab/pull/816, https://github.com/cleanlab/cleanlab/pull/755, https://github.com/cleanlab/cleanlab/pull/731, https://github.com/cleanlab/cleanlab/pull/753, https://github.com/cleanlab/cleanlab/pull/845, https://github.com/cleanlab/cleanlab/pull/835, https://github.com/cleanlab/cleanlab/pull/847

* CI and documentation system updates by anishathalye in https://github.com/cleanlab/cleanlab/pull/742, https://github.com/cleanlab/cleanlab/pull/768, https://github.com/cleanlab/cleanlab/pull/769; by jwmueller in https://github.com/cleanlab/cleanlab/pull/837; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/788, https://github.com/cleanlab/cleanlab/pull/757, https://github.com/cleanlab/cleanlab/pull/738, https://github.com/cleanlab/cleanlab/pull/794; by sanjanag in https://github.com/cleanlab/cleanlab/pull/843; by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/777; by elisno in https://github.com/cleanlab/cleanlab/pull/802; by axl1313 in https://github.com/cleanlab/cleanlab/pull/798

* Improved tests by huiwengoh in https://github.com/cleanlab/cleanlab/pull/778, https://github.com/cleanlab/cleanlab/pull/763

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.4.0...v2.5.0

2.4.0

Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.

Introducing Datalab

Now we've added a unified platform called `Datalab` for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce `pred_probs` (predicted class probabilities) and/or `feature_embeddings` (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc:

```python
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset
```


Follow our [blog](https://cleanlab.ai/blog/) to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue `Datalab` can detect is provided in [this guide](https://docs.cleanlab.ai/master/cleanlab/datalab/guide/issue_type_description.html), but we recommend first starting with the tutorials which show you how easy it is to run on your own dataset.

`Datalab` can be used to do things like find label issues with string class labels (whereas the prior `find_label_issues()` method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! `Datalab` is also using these internally to detect data issues.
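
If you keep using the prior methods with string class labels, you need to convert them to integer class indices yourself. The sketch below shows one way to do that; the helper name `encode_labels` and the alphabetical class ordering are assumptions for this example, and the columns of your `pred_probs` must follow whatever class order you choose.

```python
def encode_labels(string_labels):
    """Map string class names to integer indices 0..K-1 (classes sorted alphabetically)."""
    classes = sorted(set(string_labels))
    index_of = {name: i for i, name in enumerate(classes)}
    return [index_of[name] for name in string_labels], classes

given = ["dog", "cat", "dog", "bird"]
int_labels, classes = encode_labels(given)
print(int_labels)  # [2, 1, 2, 0]
print(classes)     # ['bird', 'cat', 'dog']
```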

Our goal is for `Datalab` to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some `Datalab` APIs may change in subsequent package versions -- as noted in the documentation.
You can easily run the issue checks in `Datalab` together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into `Datalab`. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via [Slack](https://cleanlab.ai/slack).

Revamped Tutorials

We've updated some of our existing tutorials with more interesting datasets and ML models. Regarding the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions to detect issues in these same datasets with `Datalab` instead (see `Datalab Tutorials`). This should help existing users quickly ramp up on using `Datalab` to see how much more powerful this comprehensive data audit can be.

Improvements for Multi-label Classification

To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the `cleanlab.multilabel_classification` module. So please start there rather than specifying the `multi_label=True` flag in certain methods outside of this module, as that option will be deprecated in the future.

Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the `cleanlab.multilabel_classification.dataset` module.

While moving methods to the `cleanlab.multilabel_classification` module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the `cleanlab.multilabel_classification` module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.

Backwards incompatible changes

Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway because of some bugs that have since been fixed). Here are the changes you must make in your code for it to work with newer cleanlab versions:

1) `cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)`

The `multi_label=False/True` argument will be removed in the future from the former method.

2) `cleanlab.dataset.find_overlapping_classes(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)`

The `multi_label=False/True` argument will be removed in the future from the former method. The returned DataFrame is slightly different, please refer to the new method's documentation.

3) `cleanlab.dataset.overall_label_health_score(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.overall_label_health_score(...)`

The `multi_label=False/True` argument will be removed in the future from the former method.

4) `cleanlab.dataset.health_summary(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)`

The `multi_label=False/True` argument will be removed in the future from the former method.

There are no other backwards incompatible changes in the package with this release.

Deprecated workflows

We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:

1) `cleanlab.filter.find_label_issues(..., multi_label=True)` → `cleanlab.multilabel_classification.filter.find_label_issues(...)`

The `multi_label=False/True` argument will be removed in the future from the former method.

2) `from cleanlab.multilabel_classification import get_label_quality_scores` → `from cleanlab.multilabel_classification.rank import get_label_quality_scores`

**Remember**: *All* of the code to work with multi-label data now lives in the `cleanlab.multilabel_classification` module.

Change Log

* readme updates by jwmueller in https://github.com/cleanlab/cleanlab/pull/659, https://github.com/cleanlab/cleanlab/pull/660, https://github.com/cleanlab/cleanlab/pull/713
* CI updates (by sanjanag in https://github.com/cleanlab/cleanlab/pull/701; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/671; by elisno in https://github.com/cleanlab/cleanlab/pull/695, https://github.com/cleanlab/cleanlab/pull/706)
* Documentation updates (by jwmueller in https://github.com/cleanlab/cleanlab/pull/669, https://github.com/cleanlab/cleanlab/pull/710, https://github.com/cleanlab/cleanlab/pull/711, https://github.com/cleanlab/cleanlab/pull/716, https://github.com/cleanlab/cleanlab/pull/719, https://github.com/cleanlab/cleanlab/pull/720; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/714, https://github.com/cleanlab/cleanlab/pull/717; by elisno in https://github.com/cleanlab/cleanlab/pull/678, https://github.com/cleanlab/cleanlab/pull/684)
* Documentation: use default rules for shorter, more readable links by DerWeh in https://github.com/cleanlab/cleanlab/pull/700
* Added installation instructions for package extras by sanjanag in https://github.com/cleanlab/cleanlab/pull/697
* Pass confident joint computed in CleanLearning to filter.find_label_issues by huiwengoh in https://github.com/cleanlab/cleanlab/pull/661
* Add Example codeblock to the docstrings of important functions in the dataset module by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/662, https://github.com/cleanlab/cleanlab/pull/663, https://github.com/cleanlab/cleanlab/pull/668
* Remove batch size check in label_issues_batched by huiwengoh in https://github.com/cleanlab/cleanlab/pull/665
* adding multilabel dataset issue summaries by aditya1503 in https://github.com/cleanlab/cleanlab/pull/657
* move int2onehot, onehot2int to top of multilabel tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/666
* Update softmax to more stable variant by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/667
* Revamp text and tabular tutorial by huiwengoh in https://github.com/cleanlab/cleanlab/pull/673, https://github.com/cleanlab/cleanlab/pull/693
* allow for kwargs in token find_label_issues by jwmueller in https://github.com/cleanlab/cleanlab/pull/686
* Update numpy.typing import and annotations by elisno in https://github.com/cleanlab/cleanlab/pull/688
* Standardize documentation and simplify code for outliers by DerWeh in https://github.com/cleanlab/cleanlab/pull/689
* Extract function for computing OOD scores from distances by elisno in https://github.com/cleanlab/cleanlab/pull/664
* Introduce Datalab by elisno in https://github.com/cleanlab/cleanlab/pull/614
* Introduce NonIID issue type by jecummin in https://github.com/cleanlab/cleanlab/pull/614
* Further Datalab updates by elisno in https://github.com/cleanlab/cleanlab/pull/680, https://github.com/cleanlab/cleanlab/pull/683, https://github.com/cleanlab/cleanlab/pull/687, https://github.com/cleanlab/cleanlab/pull/690, https://github.com/cleanlab/cleanlab/pull/691, https://github.com/cleanlab/cleanlab/pull/699, https://github.com/cleanlab/cleanlab/pull/705, https://github.com/cleanlab/cleanlab/pull/709, https://github.com/cleanlab/cleanlab/pull/712
* Add descriptions of issues that Datalab can detect by elisno in https://github.com/cleanlab/cleanlab/pull/682
* Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by jwmueller in https://github.com/cleanlab/cleanlab/pull/692
* Improve NonIID issue checks by elisno in https://github.com/cleanlab/cleanlab/pull/694, https://github.com/cleanlab/cleanlab/pull/707

New Contributors
* Steven-Yiran made their first contribution in https://github.com/cleanlab/cleanlab/pull/662
* DerWeh made their first contribution in https://github.com/cleanlab/cleanlab/pull/689
* jecummin made their first contribution in https://github.com/cleanlab/cleanlab/pull/614


**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.3.1...v2.4.0

2.3.1

This minor release primarily just improves the user experience when encountering various edge-cases in:
- `find_label_issues` method
- `find_overlapping_classes` method
- cleanlab.multiannotator module

This release is non-breaking when upgrading from v2.3.0. Two noteworthy updates in the `cleanlab.multiannotator` module are:
1. A better tie-breaking algorithm inside `get_majority_vote_label()` that avoids diminishing the frequency of rarer classes (this only plays a role when `pred_probs` are not provided).
2. A better user experience for `get_active_learning_scores()`, which now supports scoring only unlabeled data or only labeled data. More of the arguments can now be `None`.


What's Changed
* Readme updates by jwmueller in https://github.com/cleanlab/cleanlab/pull/645, https://github.com/cleanlab/cleanlab/pull/650, https://github.com/cleanlab/cleanlab/pull/656
* describe activelab in the documentation by jwmueller in https://github.com/cleanlab/cleanlab/pull/648
* Added clipping to address issue 639 by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/647
* Fix for not specifying labels in find_overlapping_issues by huiwengoh in https://github.com/cleanlab/cleanlab/pull/652
* Bug fixes + improvements to multiannotator module by huiwengoh in https://github.com/cleanlab/cleanlab/pull/654
* FAQ question/answer on handling label errors in train vs test data by jwmueller in https://github.com/cleanlab/cleanlab/pull/655

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.3.0...v2.3.1

2.3.0

We have added new functionality for active learning and easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in `cleanlab.experimental` that have been moved).

Active Learning with ActiveLab

For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. [ActiveLab](https://cleanlab.ai/blog/active-learning/) is a new algorithm invented by our team that automatically answers the question: **which new data should I label or which of my current labels should be checked again?** ActiveLab is highly practical — it runs quickly and works with: any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).

Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).


```python
from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
```


The batch of examples with the lowest scores are those that are most informative to collect an additional label for (scores between labeled vs unlabeled pool are directly comparable). You can either have a new annotator label the batch of examples with lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for: standard active learning where you collect at most one label per example (no re-labeling), as well as *active label cleaning* (with no unlabeled pool) where you only want to re-label examples to ensure 100% correct consensus labels (with the least amount of re-labeling).
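
Because the scores from the two pools are directly comparable, picking the next batch amounts to taking the lowest scores across both. Here is a minimal sketch of that selection step; the helper `select_batch` and the score values are hypothetical and not part of cleanlab.

```python
import heapq

def select_batch(scores_labeled_pool, scores_unlabeled_pool, batch_size):
    """Return (pool, index) pairs for the batch_size lowest scores across both pools."""
    candidates = [(s, "labeled", i) for i, s in enumerate(scores_labeled_pool)]
    candidates += [(s, "unlabeled", i) for i, s in enumerate(scores_unlabeled_pool)]
    return [(pool, i) for _, pool, i in heapq.nsmallest(batch_size, candidates)]

# Hypothetical ActiveLab scores for three labeled and two unlabeled examples.
scores_labeled_pool = [0.91, 0.12, 0.55]
scores_unlabeled_pool = [0.40, 0.08]

print(select_batch(scores_labeled_pool, scores_unlabeled_pool, batch_size=3))
# [('unlabeled', 1), ('labeled', 1), ('unlabeled', 0)]
```

In this toy batch, one already-labeled example is sent back for another annotation while two unlabeled examples get their first label.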

Get started running ActiveLab with our [tutorial notebook](https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb) from our repo that has many other [examples](https://github.com/cleanlab/examples/).

KerasWrapper

We've introduced [one-line wrappers](https://docs.cleanlab.ai/master/cleanlab/models/keras.html) for TensorFlow/Keras models that enable you to use TensorFlow models within scikit-learn workflows with features like `Pipeline`, `GridSearchCV` and more. Just change one line of code to make your existing TensorFlow/Keras model compatible with scikit-learn’s rich ecosystem! All you have to do is swap out: `keras.Model` → `KerasWrapperModel`, or `keras.Sequential` → `KerasSequentialWrapper`. Imported from `cleanlab.models.keras`, the wrapper objects have all the same methods as their Keras counterparts, plus you can use them with tons of handy scikit-learn methods.

Resources to get started include:
- [Blogpost](https://cleanlab.ai/blog/transformer-sklearn/) and [Jupyter notebook](https://github.com/cleanlab/examples/blob/master/transformer_sklearn/transformer_sklearn.ipynb) demonstrating how to make a HuggingFace Transformer (BERT model) sklearn-compatible.
- [Jupyter notebook](https://github.com/cleanlab/examples/blob/master/huggingface_keras_imdb/huggingface_keras_imdb.ipynb) showing how to fit these sklearn-compatible models to a TensorFlow Dataset.
- [Revamped tutorial](https://docs.cleanlab.ai/master/tutorials/text.html) on label errors in text classification data, which has been updated to use this new wrapper.

Computational improvements for detecting label issues

Through extensive optimization of our multiprocessing code (thanks to clu0), `find_label_issues` has been made ~10x faster on Linux machines that have many CPU cores.

For massive datasets, `find_label_issues` may require too much memory to run on your machine. We've added new methods in [cleanlab.experimental.label_issues_batched](https://docs.cleanlab.ai/master/cleanlab/experimental/label_issues_batched.html) that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:

```python
import zarr
from cleanlab.experimental.label_issues_batched import find_label_issues_batched

# Open the arrays read-only; they are read from disk in batches rather than loaded whole.
labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)
```

By choosing a sufficiently small `batch_size`, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of the option: `cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")`
This and `filter_by="low_normalized_margin"` are new `find_label_issues()` options added in v2.3, which require less computation and still output accurate estimates of the label errors.
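
To make the two score names concrete, the sketch below computes a self-confidence score and a margin score one mini-batch at a time and ranks examples from most to least suspicious. It is an illustration under stated assumptions, not the library's algorithm: the helper names are hypothetical, the margin is taken to be the given-label probability minus the largest other-class probability, and the step that decides how many examples count as label issues is left out.

```python
def self_confidence(label, probs):
    """Predicted probability of the given label."""
    return probs[label]

def normalized_margin(label, probs):
    """Given-label probability minus the largest other-class probability."""
    best_other = max(p for k, p in enumerate(probs) if k != label)
    return probs[label] - best_other

def rank_in_batches(labels, pred_probs, score_fn, batch_size):
    """Score examples one batch at a time, then rank indices from lowest score up."""
    scored = []
    for start in range(0, len(labels), batch_size):
        stop = start + batch_size
        batch = zip(labels[start:stop], pred_probs[start:stop])
        scored.extend((score_fn(y, p), start + i) for i, (y, p) in enumerate(batch))
    return [idx for _, idx in sorted(scored)]

# Hypothetical data: five examples, three classes.
labels = [0, 0, 2, 0, 2]
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.3, 0.1, 0.6],
    [0.5, 0.2, 0.3],
    [0.1, 0.1, 0.8],
]

print(rank_in_batches(labels, pred_probs, self_confidence, batch_size=2))
# [3, 2, 1, 0, 4]
```

Only the slice for the current batch is read at each step, which is why this style of computation suits arrays that live on disk.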


Other changes to be aware of

- Like all major ML frameworks, we have dropped support for Python 3.6.
- We have moved some particularly useful models (fasttext, keras) from `cleanlab.experimental` -> `cleanlab.models`.

Change Log

* Shorten tutorial titles in docs for readability by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/553
* Swap CI workflow to actions by huiwengoh in https://github.com/cleanlab/cleanlab/pull/560
* Remove .pylintrc by elisno in https://github.com/cleanlab/cleanlab/pull/564
* Tutorial fixes by huiwengoh in https://github.com/cleanlab/cleanlab/pull/565
* Fix typo in CONTRIBUTING.md by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/566
* Multiannotator Active Learning Support by huiwengoh in https://github.com/cleanlab/cleanlab/pull/538
* multiannotator explanation improvements by jwmueller in https://github.com/cleanlab/cleanlab/pull/570
* Specify Sphinx to order functions by source code order by huiwengoh in https://github.com/cleanlab/cleanlab/pull/571
* Fix example in ema docstring by elisno in https://github.com/cleanlab/cleanlab/pull/563, https://github.com/cleanlab/cleanlab/pull/573
* update paper list and applications beyond label error detection in readme by jwmueller in https://github.com/cleanlab/cleanlab/pull/574, https://github.com/cleanlab/cleanlab/pull/580
* Drop Python 3.6 support (by jwmueller in https://github.com/cleanlab/cleanlab/pull/558, https://github.com/cleanlab/cleanlab/pull/577; by anishathalye in https://github.com/cleanlab/cleanlab/pull/562; by krmayankb in https://github.com/cleanlab/cleanlab/pull/578; by sanjanag in https://github.com/cleanlab/cleanlab/pull/579)
* add maximum line length by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/583
* Update github actions by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/589
* Revamp text tutorial by huiwengoh in https://github.com/cleanlab/cleanlab/pull/584
* clarify thresholding in issues_from_scores by jwmueller in https://github.com/cleanlab/cleanlab/pull/582
* Remove temp scaling from single annotator case by huiwengoh in https://github.com/cleanlab/cleanlab/pull/590
* Update docs dependencies by huiwengoh in https://github.com/cleanlab/cleanlab/pull/593
* Use euclidean distance for identifying outliers for lower dimensional features by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/581
* changing copyright year 2017-2022 to 2017-2023 by aditya1503 in https://github.com/cleanlab/cleanlab/pull/594
* Handle missing type parameters for generic type "ndarray" by elisno in https://github.com/cleanlab/cleanlab/pull/587
* Remove temp scaling for single-label case in ensemble method by huiwengoh in https://github.com/cleanlab/cleanlab/pull/597
* Adding type hints for mypy strict compatibility by unna97 in https://github.com/cleanlab/cleanlab/pull/585
* fix typo in outliers.ipynb by eltociear in https://github.com/cleanlab/cleanlab/pull/603
* 10x speedup in find_label_issues on linux via better multiprocessing by clu0 in https://github.com/cleanlab/cleanlab/pull/596
* Update tabular tutorial with better language by cmauck10 in https://github.com/cleanlab/cleanlab/pull/609
* Improve num_label_issues() to reflect most accurate num issues by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/610
* Removed duplicate classifier from setup.py by sanjanag in https://github.com/cleanlab/cleanlab/pull/612
* Add two methods to filter.find_label_issues by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/595
* Fix dictionary type annotation for OutOfDistribution object by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/616
* Fix format compatibility with latest black==23. release by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/620
* Create new cleanlab.models module by huiwengoh in https://github.com/cleanlab/cleanlab/pull/601
* upgrade torch in docs by jwmueller in https://github.com/cleanlab/cleanlab/pull/607
* fix bug: confidences -> confidence by jwmueller in https://github.com/cleanlab/cleanlab/pull/623
* Fixed duplicate issue removal in find_label_issues by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/624
* Method to estimate label issues with limited memory via mini-batches by jwmueller in https://github.com/cleanlab/cleanlab/pull/615, https://github.com/cleanlab/cleanlab/pull/629, https://github.com/cleanlab/cleanlab/pull/632, https://github.com/cleanlab/cleanlab/pull/635
* Fix KerasWrapper summary method by huiwengoh in https://github.com/cleanlab/cleanlab/pull/631
* Clarify rank.py not for multi-label classification by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/626
* Removed $ from shell commands to avoid it being copied by sanjanag in https://github.com/cleanlab/cleanlab/pull/625
* label_issues_batched multiprocessing by clu0 in https://github.com/cleanlab/cleanlab/pull/630, https://github.com/cleanlab/cleanlab/pull/634
* Switch to typing.Self by anishathalye in https://github.com/cleanlab/cleanlab/pull/489
* Documentation improvements by huiwengoh in https://github.com/cleanlab/cleanlab/pull/643
* add 2.3.0 to release versions by jwmueller in https://github.com/cleanlab/cleanlab/pull/644

New Contributors
* krmayankb made their first contribution in https://github.com/cleanlab/cleanlab/pull/578
* sanjanag made their first contribution in https://github.com/cleanlab/cleanlab/pull/579
* unna97 made their first contribution in https://github.com/cleanlab/cleanlab/pull/585
* eltociear made their first contribution in https://github.com/cleanlab/cleanlab/pull/603
* clu0 made their first contribution in https://github.com/cleanlab/cleanlab/pull/596

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.2.0...v2.3.0

2.2

Finding label issues in multi-label classification is done using the same code and inputs as before (and the same object is returned as before):
```python
from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)
```

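where, for a 3-class multi-label dataset with 4 examples, we might have, say:
```python
import numpy as np

# Each example lists the classes (tags) that apply to it.
labels = [[0], [0, 1], [0, 2], [1]]

# One row per example, one column per class; rows need not sum to 1.
pred_probs = np.array(
    [[0.9, 0.1, 0.1],
     [0.9, 0.1, 0.8],
     [0.9, 0.1, 0.6],
     [0.2, 0.8, 0.3]]
)
```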
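
To make the multi-label input format concrete, here is a minimal sketch in plain Python. It is not cleanlab's implementation: `to_binary_matrix` and `per_class_scores` are hypothetical helpers, and taking the minimum per-class score is only one simple way to summarize an example (the library's own scoring and aggregation may differ).

```python
# Illustrative only: these helpers are hypothetical and not part of cleanlab.

def to_binary_matrix(labels, num_classes):
    """Turn a list of per-example class lists into an N x K matrix of 0/1 flags."""
    matrix = []
    for given in labels:
        row = [0] * num_classes
        for k in given:
            row[k] = 1
        matrix.append(row)
    return matrix


def per_class_scores(binary_labels, pred_probs):
    """Probability the model assigns to each given 0/1 annotation."""
    return [
        [p if y == 1 else 1.0 - p for y, p in zip(y_row, p_row)]
        for y_row, p_row in zip(binary_labels, pred_probs)
    ]


labels = [[0], [0, 1], [0, 2], [1]]
pred_probs = [
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.8],
    [0.9, 0.1, 0.6],
    [0.2, 0.8, 0.3],
]

binary_labels = to_binary_matrix(labels, num_classes=3)
scores = per_class_scores(binary_labels, pred_probs)

# Summarize each example by its least plausible annotation (an assumption of this sketch).
worst_score = [min(row) for row in scores]
most_suspicious = min(range(len(worst_score)), key=worst_score.__getitem__)

print(binary_labels)
print(most_suspicious)  # 1
```

In this toy data the second example (index 1) stands out: it is tagged with class 1, which the model considers unlikely, and it lacks class 2, which the model considers likely.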



The following code (in which class 1 is missing from the dataset) did not previously work but now runs without problem in cleanlab v2.2.0:
```python
from cleanlab.filter import find_label_issues
import numpy as np

labels = [0, 0, 2, 0, 2]
pred_probs = np.array(
    [[0.8, 0.1, 0.1],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6],
     [0.5, 0.2, 0.3],
     [0.1, 0.1, 0.8]]
)

label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
)
```
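
If you want to know whether your own data is in this situation before running cleanlab, a quick check along the following lines can help. This is a sketch using only the standard library, with the same toy values as above written as plain lists; the variable names are illustrative and nothing here is a cleanlab API.

```python
from collections import Counter

labels = [0, 0, 2, 0, 2]
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.3, 0.1, 0.6],
    [0.5, 0.2, 0.3],
    [0.1, 0.1, 0.8],
]

num_classes = len(pred_probs[0])  # classes the model outputs probabilities for
label_counts = Counter(labels)    # how often each class appears in the given labels

# Classes that have a pred_probs column but never occur in the labels.
missing_classes = [k for k in range(num_classes) if label_counts[k] == 0]
print(missing_classes)  # [1]
```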


Looking forward

The next major release of this package will introduce a paradigm shift in the way people check their datasets. Today this involves significant manual labor, but software should be able to help! Our research has developed algorithms that can automatically detect many types of common issues that plague real-world ML datasets. The next version of cleanlab will offer an easy-to-use line of code that runs all of our appropriate algorithms to help ensure a given dataset is issue-free and well-suited for supervised learning.

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed [on our github](https://github.com/cleanlab/cleanlab/wiki#ideas-for-contributing-to-cleanlab) or you can jump into the discussions on [Slack](https://cleanlab.ai/slack).

Change Log
* updated label_quality_utils.py and rebuilt the doc by ethanotran in https://github.com/cleanlab/cleanlab/pull/475
* Add workflow for skipping notebooks by huiwengoh in https://github.com/cleanlab/cleanlab/pull/472
* Fix return type in token classification get_label_quality_scores by jwmueller in https://github.com/cleanlab/cleanlab/pull/477
* Adding pylint CI checks by mohitsaxenaknoldus in https://github.com/cleanlab/cleanlab/pull/465
* CI: Build check cleanlab works without optional dependencies by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/470
* Outlier tutorial: move uninteresting code to hidden cell by jwmueller in https://github.com/cleanlab/cleanlab/pull/492
* Update DEVELOPMENT.md with howto add new modules by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/494
* Minor asthetic fix for tutorials.ipynb by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/493
* Update __init__.py to include major files by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/490
* Make type checking pass with mypy 0.981 by anishathalye in https://github.com/cleanlab/cleanlab/pull/488
* Update issues returned by num_label_issues by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/485
* Mypy typechecking fix for count.py by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/500
* Add basic utilities for handling quality scores for multilabel data by elisno in https://github.com/cleanlab/cleanlab/pull/499
* reinvented algorithms for multilabel find_label_issues by aditya1503 in https://github.com/cleanlab/cleanlab/pull/483
* Trying to fix typings by ChinoCodeDemon in https://github.com/cleanlab/cleanlab/pull/502
* Add internal function to properly format labels by huiwengoh in https://github.com/cleanlab/cleanlab/pull/504
* Mention internal format label function in multiannotator docs by huiwengoh in https://github.com/cleanlab/cleanlab/pull/506
* Multilabel code restructuring with aggregation/scorer functions by aditya1503 in https://github.com/cleanlab/cleanlab/pull/509
* Separate word_coloring from token_replacement in color_sentence by elisno in https://github.com/cleanlab/cleanlab/pull/514
* Add support for missing classes by cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/511
* Better missing class support for label quality scoring by jwmueller in https://github.com/cleanlab/cleanlab/pull/518
* moving multilabel functions by aditya1503 in https://github.com/cleanlab/cleanlab/pull/515
* restrict typecheck to python v3.10 by jwmueller in https://github.com/cleanlab/cleanlab/pull/521
* support missing classes in multiannotator functions by huiwengoh in https://github.com/cleanlab/cleanlab/pull/519
* fix mypy typing by huiwengoh in https://github.com/cleanlab/cleanlab/pull/524
* Add studio banner by cmauck10 in https://github.com/cleanlab/cleanlab/pull/525
* added missing classes test for multilabel by aditya1503 in https://github.com/cleanlab/cleanlab/pull/523
* Improve tutorials language/formatting by jwmueller in https://github.com/cleanlab/cleanlab/pull/526
* Validate forgetting factor in EMA by elisno in https://github.com/cleanlab/cleanlab/pull/527
* Remove strong worded requirement for out-of-sample pred probs by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/520
* Ensure type checks pass with new mypy v0.990 by jwmueller in https://github.com/cleanlab/cleanlab/pull/530
* replace pylint --> flake8 by ilnarkz in https://github.com/cleanlab/cleanlab/pull/531
* Tutorial for multi-label classification by aditya1503 in https://github.com/cleanlab/cleanlab/pull/517
* Fix multilabel_py dimensionality by elisno in https://github.com/cleanlab/cleanlab/pull/535
* cleanlab install on colab for multilabel tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/537
* Refactor MultilabelScorer helper methods and tests by elisno in https://github.com/cleanlab/cleanlab/pull/540
* Make a public method for multilabel quality scores by jwmueller in https://github.com/cleanlab/cleanlab/pull/542
* Improve and standardize documentation in label error detection methods for classification datasets by jwmueller in https://github.com/cleanlab/cleanlab/pull/543
* Move mypy configuration to config file by anishathalye in https://github.com/cleanlab/cleanlab/pull/545
* Fix types to work with latest pandas-stubs by anishathalye in https://github.com/cleanlab/cleanlab/pull/546
* Fix passing of kwargs to get_label_quality_scores by anishathalye in https://github.com/cleanlab/cleanlab/pull/547
* Switch CI cron schedule by anishathalye in https://github.com/cleanlab/cleanlab/pull/548
* Remove unnecessary type: ignore annotations by anishathalye in https://github.com/cleanlab/cleanlab/pull/549
* update readme for v2.2 by jwmueller in https://github.com/cleanlab/cleanlab/pull/551

New Contributors
* ethanotran made their first contribution in https://github.com/cleanlab/cleanlab/pull/475
* mohitsaxenaknoldus made their first contribution in https://github.com/cleanlab/cleanlab/pull/465
* aditya1503 made their first contribution in https://github.com/cleanlab/cleanlab/pull/483
* ChinoCodeDemon made their first contribution in https://github.com/cleanlab/cleanlab/pull/502
* cmauck10 made their first contribution in https://github.com/cleanlab/cleanlab/pull/525
* ilnarkz made their first contribution in https://github.com/cleanlab/cleanlab/pull/531
* Po-He Tseng helped run some early tests of our new multi-label algorithms

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.1.0...v2.2.0

2.2.0

Multi-label support for applications like image/document/text tagging

The newest version of cleanlab features a complete overhaul of cleanlab’s multi-label classification functionality:

- We invented new algorithms for detecting label errors in multi-label datasets that are significantly more effective. These methods are formally described and extensively benchmarked in our [research paper](https://arxiv.org/abs/2211.13895).
- We added `cleanlab.multilabel_classification` module for label quality scoring.
- We now offer an easy-to-follow [quickstart tutorial](https://docs.cleanlab.ai/stable/tutorials/) for learning how to apply cleanlab to multi-label datasets.
- We’ve created [example notebooks](https://github.com/cleanlab/examples/tree/master/multilabel_classification) showing how to use cleanlab to clean up image tagging datasets and how to train a state-of-the-art PyTorch neural network for multi-label classification with any image dataset.
- All of this multi-label functionality is now robustly tested via a comprehensive suite of unit tests to ensure it remains performant.

cleanlab now works when your labels have some classes missing relative to your predicted probabilities

The package now works for datasets in which some classes never appear in the given labels but are still present in, say, the `pred_probs` output by a model. This is useful when you:

- Want to use a pretrained model that was fit with additional classes
- Have rare classes and happen to split the data in an unlucky way
- Are doing active learning or other dynamic modeling with data that are iteratively changing
- Are analyzing multi-annotator datasets with `cleanlab.multiannotator` and some annotators occasionally select a really rare class.

Other major improvements

(in addition to too many bugfixes to name):

- Accuracy improvements to the algorithm used to estimate the number of label errors in a dataset via `count.num_label_issues()`. — ulya-tkch
- Introduction of flake8 code linter to ensure the highest standards for our code. — ilnarkz, mohitsaxenaknoldus
- More comprehensive mypy type annotations for cleanlab functions to make our code safer and more understandable. — elisno, ChinoCodeDemon, anishathalye, jwmueller, huiwengoh, ulya-tkch
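
For intuition about what estimating the number of label errors involves, here is a heavily simplified counting sketch in the spirit of confident learning. It is not the estimator used by `count.num_label_issues()`: the function name `estimate_num_issues`, the toy data, and the handling of classes absent from the labels are all assumptions made for this illustration.

```python
def estimate_num_issues(labels, pred_probs):
    """Count examples the model confidently assigns to a class other than the given label."""
    num_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k among examples labeled k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # Assumption of this sketch: a class with no labeled examples can never be "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))

    num_issues = 0
    for y, p in zip(labels, pred_probs):
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class for this example
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            num_issues += 1
    return num_issues


# Toy data: the third example is labeled 0 but looks like class 1 to the model.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.1, 0.9],
]

print(estimate_num_issues(labels, pred_probs))  # 1
```

Using per-class thresholds rather than a single cutoff means a class the model is systematically less sure about is not penalized simply for having lower probabilities.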

Special thanks to Po-He Tseng for helping with early tests of our improved multi-label algorithms and the research behind developing them.
