Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.
Introducing Datalab
Now we've added a unified platform called `Datalab` for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce `pred_probs` (predicted class probabilities) and/or `feature_embeddings` (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc.:
```python
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset
```
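If your model emits raw class scores (logits) rather than probabilities, they need to be normalized before they can serve as `pred_probs`. The following is a minimal sketch in plain Python, not cleanlab's own implementation; the `softmax` helper and the `logits` values are hypothetical and only illustrate the expected shape: one row per datapoint, one column per class, each row summing to 1.

```python
import math


def softmax(logits):
    """Convert one row of raw class scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow in exp() without changing the result.
    shift = max(logits)
    exps = [math.exp(score - shift) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw model outputs for 3 datapoints and 3 classes.
logits = [
    [2.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
    [1000.0, 999.0, 998.0],  # large scores that would overflow a naive exp()
]

# One probability row per datapoint, one column per class.
pred_probs = [softmax(row) for row in logits]
```

In practice you would compute this on NumPy arrays (for example by wrapping the result in `numpy.asarray`) before passing it to `find_issues`.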
Follow our [blog](https://cleanlab.ai/blog/) to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue `Datalab` can detect is provided in [this guide](https://docs.cleanlab.ai/master/cleanlab/datalab/guide/issue_type_description.html), but we recommend starting with the tutorials, which show how easy it is to run `Datalab` on your own dataset.
`Datalab` can do things like find label issues with string class labels (whereas the prior `find_label_issues()` method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! `Datalab` also uses these internally to detect data issues.
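For context, this is roughly the manual encoding step that string labels used to require before calling the earlier methods. It is a minimal sketch with hypothetical label values, not a description of how `Datalab` maps labels internally.

```python
# Hypothetical string labels, as they might appear in a dataset column.
labels = ["cat", "dog", "cat", "bird", "dog"]

# Sort the unique class names so the mapping is deterministic across runs.
class_names = sorted(set(labels))
name_to_index = {name: index for index, name in enumerate(class_names)}

# Integer-encoded labels in the range 0..K-1, where K is the number of classes.
int_labels = [name_to_index[name] for name in labels]

# Reverse lookup, to translate flagged examples back into readable class names.
decoded = [class_names[index] for index in int_labels]
```

When encoding labels by hand like this, the columns of `pred_probs` need to follow the same class ordering as `class_names`.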
Our goal is for `Datalab` to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some `Datalab` APIs may change in subsequent package versions -- as noted in the documentation.
You can easily run the issue checks in `Datalab` together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into `Datalab`. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via [Slack](https://cleanlab.ai/slack).
Revamped Tutorials
We've updated some of our existing tutorials with more interesting datasets and ML models. For the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions that detect issues in these same datasets with `Datalab` instead (see `Datalab Tutorials`). This should help existing users quickly ramp up on `Datalab` and see how much more powerful this comprehensive data audit can be.
Improvements for Multi-label Classification
To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the `cleanlab.multilabel_classification` module. So please start there rather than specifying the `multi_label=True` flag in certain methods outside of this module, as that option will be deprecated in the future.
Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the `cleanlab.multilabel_classification.dataset` module.
While moving methods to the `cleanlab.multilabel_classification` module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the `cleanlab.multilabel_classification` module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.
Backwards incompatible changes
Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway, because of bugs that have since been fixed). Here are the changes you must make in your code for it to work with newer cleanlab versions:
1) `cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.rank_classes_by_multilabel_quality(...)`
The `multi_label=False/True` argument will be removed in the future from the former method.
2) `cleanlab.dataset.find_overlapping_classes(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)`
The `multi_label=False/True` argument will be removed in the future from the former method. The returned DataFrame is slightly different; please refer to the new method's documentation.
3) `cleanlab.dataset.overall_label_health_score(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.overall_multilabel_health_score(...)`
The `multi_label=False/True` argument will be removed in the future from the former method.
4) `cleanlab.dataset.health_summary(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)`
The `multi_label=False/True` argument will be removed in the future from the former method.
There are no other backwards incompatible changes in the package with this release.
Deprecated workflows
We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:
1) `cleanlab.filter.find_label_issues(..., multi_label=True)`
→
`cleanlab.multilabel_classification.filter.find_label_issues(...)`
The `multi_label=False/True` argument will be removed in the future from the former method.
2) `from cleanlab.multilabel_classification import get_label_quality_scores`
→
`from cleanlab.multilabel_classification.rank import get_label_quality_scores`
**Remember**: *All* of the code to work with multi-label data now lives in the `cleanlab.multilabel_classification` module.
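To locate call sites that need migrating, a simple text scan is often enough. The sketch below is a hypothetical helper, not part of cleanlab: it flags lines containing the `multi_label=True` flag or the old top-level import of `get_label_quality_scores`, based on the old usages listed above. It works line by line with regular expressions, so it will miss cases such as `multi_label=some_variable`.

```python
import re

# Patterns for the old usages described in these notes (hypothetical helper, not a cleanlab API).
DEPRECATED_PATTERNS = [
    (
        re.compile(r"\bmulti_label\s*=\s*True\b"),
        "use the equivalent function in cleanlab.multilabel_classification",
    ),
    (
        re.compile(
            r"from\s+cleanlab\.multilabel_classification\s+import\s+get_label_quality_scores\b"
        ),
        "import from cleanlab.multilabel_classification.rank instead",
    ),
]


def find_old_multilabel_usages(source):
    """Return (line_number, line, hint) for each line matching an old usage pattern."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in DEPRECATED_PATTERNS:
            if pattern.search(line):
                findings.append((line_number, line.strip(), hint))
    return findings


# Hypothetical snippet of user code to scan.
example_source = """\
from cleanlab.multilabel_classification import get_label_quality_scores
from cleanlab.filter import find_label_issues
issues = find_label_issues(labels, pred_probs, multi_label=True)
scores = get_label_quality_scores(labels, pred_probs)
"""

findings = find_old_multilabel_usages(example_source)
for line_number, line, hint in findings:
    print(f"line {line_number}: {line}\n    -> {hint}")
```

To scan a real project, you could read each `.py` file and pass its contents to `find_old_multilabel_usages`.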
Change Log
* readme updates by jwmueller in https://github.com/cleanlab/cleanlab/pull/659, https://github.com/cleanlab/cleanlab/pull/660, https://github.com/cleanlab/cleanlab/pull/713
* CI updates (by sanjanag in https://github.com/cleanlab/cleanlab/pull/701; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/671; by elisno in https://github.com/cleanlab/cleanlab/pull/695, https://github.com/cleanlab/cleanlab/pull/706)
* Documentation updates (by jwmueller in https://github.com/cleanlab/cleanlab/pull/669, https://github.com/cleanlab/cleanlab/pull/710, https://github.com/cleanlab/cleanlab/pull/711, https://github.com/cleanlab/cleanlab/pull/716, https://github.com/cleanlab/cleanlab/pull/719, https://github.com/cleanlab/cleanlab/pull/720; by huiwengoh in https://github.com/cleanlab/cleanlab/pull/714, https://github.com/cleanlab/cleanlab/pull/717; by elisno in https://github.com/cleanlab/cleanlab/pull/678, https://github.com/cleanlab/cleanlab/pull/684)
* Documentation: use default rules for shorter, more readable links by DerWeh in https://github.com/cleanlab/cleanlab/pull/700
* Added installation instructions for package extras by sanjanag in https://github.com/cleanlab/cleanlab/pull/697
* Pass confident joint computed in CleanLearning to filter.find_label_issues by huiwengoh in https://github.com/cleanlab/cleanlab/pull/661
* Add Example codeblock to the docstrings of important functions in the dataset module by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/662, https://github.com/cleanlab/cleanlab/pull/663, https://github.com/cleanlab/cleanlab/pull/668
* Remove batch size check in label_issues_batched by huiwengoh in https://github.com/cleanlab/cleanlab/pull/665
* adding multilabel dataset issue summaries by aditya1503 in https://github.com/cleanlab/cleanlab/pull/657
* move int2onehot, onehot2int to top of multilabel tutorial by jwmueller in https://github.com/cleanlab/cleanlab/pull/666
* Update softmax to more stable variant by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/667
* Revamp text and tabular tutorial by huiwengoh in https://github.com/cleanlab/cleanlab/pull/673, https://github.com/cleanlab/cleanlab/pull/693
* allow for kwargs in token find_label_issues by jwmueller in https://github.com/cleanlab/cleanlab/pull/686
* Update numpy.typing import and annotations by elisno in https://github.com/cleanlab/cleanlab/pull/688
* Standardize documentation and simplify code for outliers by DerWeh in https://github.com/cleanlab/cleanlab/pull/689
* Extract function for computing OOD scores from distances by elisno in https://github.com/cleanlab/cleanlab/pull/664
* Introduce Datalab by elisno in https://github.com/cleanlab/cleanlab/pull/614
* Introduce NonIID issue type by jecummin in https://github.com/cleanlab/cleanlab/pull/614
* Further Datalab updates by elisno in https://github.com/cleanlab/cleanlab/pull/680, https://github.com/cleanlab/cleanlab/pull/683, https://github.com/cleanlab/cleanlab/pull/687, https://github.com/cleanlab/cleanlab/pull/690, https://github.com/cleanlab/cleanlab/pull/691, https://github.com/cleanlab/cleanlab/pull/699, https://github.com/cleanlab/cleanlab/pull/705, https://github.com/cleanlab/cleanlab/pull/709, https://github.com/cleanlab/cleanlab/pull/712
* Add descriptions of issues that Datalab can detect by elisno in https://github.com/cleanlab/cleanlab/pull/682
* Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by jwmueller in https://github.com/cleanlab/cleanlab/pull/692
* Improve NonIID issue checks by elisno in https://github.com/cleanlab/cleanlab/pull/694, https://github.com/cleanlab/cleanlab/pull/707
New Contributors
* Steven-Yiran made their first contribution in https://github.com/cleanlab/cleanlab/pull/662
* DerWeh made their first contribution in https://github.com/cleanlab/cleanlab/pull/689
* jecummin made their first contribution in https://github.com/cleanlab/cleanlab/pull/614
**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.3.1...v2.4.0