This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.
Enhancements to Datalab
In this release, Datalab, our dataset analysis platform, detects additional types of issues by default, giving you a more comprehensive analysis of your datasets. Specifically, it can now:
- Identify `null` values in your dataset.
- Detect `class_imbalance`.
- Highlight an `underperforming_group`, which refers to a subset of data points where your model exhibits poorer performance compared to others.
  See our [FAQ](https://docs.cleanlab.ai/master/tutorials/faq.html#How-do-I-specify-pre-computed-data-slices/clusters-when-detecting-the-Underperforming-Group-Issue?) for more information on how to provide pre-defined groups for this issue type.
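To make the idea of an underperforming group concrete, here is a minimal, standard-library-only sketch. It is not how Datalab computes this issue type; the function name `find_worst_group` and its inputs (one group id per example and a flag for whether the model's prediction was correct) are assumptions chosen for illustration.

```python
from collections import defaultdict


def find_worst_group(group_ids, is_correct):
    """Return (worst_group, worst_accuracy, overall_accuracy).

    Hypothetical helper for illustration only: it compares per-group
    accuracy against overall accuracy, which is the intuition behind an
    "underperforming group", not cleanlab's actual scoring.
    """
    # Per group: [number of correct predictions, number of examples].
    totals = defaultdict(lambda: [0, 0])
    for group, ok in zip(group_ids, is_correct):
        totals[group][0] += int(ok)
        totals[group][1] += 1

    accuracy = {g: correct / count for g, (correct, count) in totals.items()}
    overall = sum(int(ok) for ok in is_correct) / len(is_correct)
    worst = min(accuracy, key=accuracy.get)
    return worst, accuracy[worst], overall


groups = ["a", "a", "b", "b", "b"]
correct = [True, True, False, True, False]
worst_group, worst_acc, overall_acc = find_worst_group(groups, correct)
print(worst_group, round(worst_acc, 3), overall_acc)
```

In this toy data, group `b` is the one whose accuracy falls below the overall accuracy, so it is the kind of slice the `underperforming_group` check is meant to surface.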
Additionally, Datalab can now optionally:
- Assess the value of data points in your dataset using KNN-Shapley scores as a measure of `data_valuation`.
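For readers unfamiliar with KNN-Shapley, the sketch below implements the textbook recursion for an unweighted KNN classifier and a single test point. It is an illustration under stated assumptions, not Datalab's implementation: the function name `knn_shapley_single_test` is hypothetical, the training labels are assumed to be pre-sorted from nearest to farthest from the test point, and the way cleanlab aggregates and transforms these values into `data_valuation` scores may differ.

```python
def knn_shapley_single_test(sorted_labels, test_label, k):
    """KNN-Shapley values of training points for one test point.

    `sorted_labels` holds training labels ordered from nearest to
    farthest from the test point (an assumption of this sketch).
    Returns one value per training point, in the same order.
    """
    n = len(sorted_labels)
    match = [1.0 if y == test_label else 0.0 for y in sorted_labels]

    values = [0.0] * n
    # Base case: the farthest point.
    values[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point back to the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of point i by distance
        values[i] = values[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
    return values


values = knn_shapley_single_test(["a", "b", "a"], test_label="a", k=2)
print(values)
```

A useful sanity check is that the values sum to the utility of the full training set, here the fraction of the `k` nearest neighbors whose label matches the test label. Points with negative values are the ones that hurt the prediction for this test point.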
If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!
Expanded Datalab Support for New ML Tasks
With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the `task` parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.
```python
from cleanlab import Datalab

lab = Datalab(..., task="regression")
```
The currently supported `task` values are:
- **classification** (*default*): Includes all previously supported issue-checking capabilities based on `pred_probs`, `features`, or a `knn_graph`, and the new features introduced earlier.
- **regression** (*new*):
  - Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated [regression tutorial](https://docs.cleanlab.ai/master/tutorials/regression.html#5.-Other-ways-to-find-noisy-labels-in-regression-datasets).
  - Find other issues utilizing `features` or a `knn_graph`.
- **multilabel** (*new*):
  - Detect label errors in multilabel classification datasets using `pred_probs` exclusively. Explore the updated capabilities in our [multilabel tutorial](https://docs.cleanlab.ai/master/tutorials/multilabel_classification.html).
  - Find various other types of issues based on `features` or a `knn_graph`.
Improved Object Detection Dataset Exploration
New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our [object detection tutorial](https://docs.cleanlab.ai/master/tutorials/object_detection.html#Exploratory-data-analysis).
Other Major Improvements
- Rescaled Near Duplicate and Outlier Scores:
  - What matters for all cleanlab issue scores is not their absolute magnitude but how they rank the data points from most to least severe instances of the issue. Based on user feedback, we have nonetheless updated the near duplicate and outlier scores to span a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
- Consistency in counting label issues:
  - `cleanlab.dataset.health_summary()` now returns the same number of issues as `cleanlab.classification.find_label_issues()` and `cleanlab.count.num_label_issues()`.
- Improved handling of non-iid issues:
  - The non-iid issue check in Datalab now handles `pred_probs` as input.
- Better reporting in Datalab:
  - `Datalab.report()` is simplified and now highlights only the detected issue types. To view all checked issue types, use `Datalab.report(show_all_issues=True)`.
- Enhanced Handling of Binary Classification Tasks:
  - Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors.
- Experimental Functionality:
  - cleanlab now offers experimental functionality for detecting label issues in **span categorization** tasks with a single class, enhancing its applicability in natural language processing projects.
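The note above about rescaled near duplicate and outlier scores relies on a simple fact: any strictly increasing transformation changes score magnitudes but not the order of the data points. The sketch below illustrates this with made-up scores and a hypothetical transformation (`rescale`); it is not the rescaling cleanlab actually applies.

```python
import math


def ranking(scores):
    """Indices of data points ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


def rescale(score):
    # Hypothetical strictly increasing transformation, chosen only for
    # illustration; cleanlab's actual rescaling is not reproduced here.
    return 1.0 - math.exp(-10.0 * score)


raw_scores = [0.001, 0.2, 0.0005, 0.03]  # made-up example values
rescaled_scores = [rescale(s) for s in raw_scores]

print(ranking(raw_scores))
print(ranking(rescaled_scores))
```

Both calls print the same ordering, so code that only sorts or ranks by these scores behaves the same before and after a rescaling of this kind. Code that compares scores against a hard-coded threshold would need the threshold revisited.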
New Contributors
We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:
* smttsp made their first contribution in https://github.com/cleanlab/cleanlab/pull/867
* abhijitpal1247 made their first contribution in https://github.com/cleanlab/cleanlab/pull/856
* 01PrathamS made their first contribution in https://github.com/cleanlab/cleanlab/pull/893
* mglowacki100 made their first contribution in https://github.com/cleanlab/cleanlab/pull/796
* gibsonliketheguitar made their first contribution in https://github.com/cleanlab/cleanlab/pull/831
* kylegallatin made their first contribution in https://github.com/cleanlab/cleanlab/pull/885
* ryansingman made their first contribution in https://github.com/cleanlab/cleanlab/pull/919
* R-Peleg made their first contribution in https://github.com/cleanlab/cleanlab/pull/948
Thank you for your valuable contributions! If you're interested in contributing, check out our [contributing guide](https://github.com/cleanlab/cleanlab/blob/master/CONTRIBUTING.md) for ways to get involved.
Change Log
Significant changes in this release include:
* Update FAQ section in docs by tataganesh in https://github.com/cleanlab/cleanlab/pull/869; elisno in https://github.com/cleanlab/cleanlab/pull/913
* Improve Object Detection module by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/840, https://github.com/cleanlab/cleanlab/pull/877; aditya1503 in https://github.com/cleanlab/cleanlab/pull/883, https://github.com/cleanlab/cleanlab/pull/969, https://github.com/cleanlab/cleanlab/pull/968
* Clearer documentation/tutorials/readme by jwmueller in https://github.com/cleanlab/cleanlab/pull/851, https://github.com/cleanlab/cleanlab/pull/931, https://github.com/cleanlab/cleanlab/pull/981, https://github.com/cleanlab/cleanlab/pull/983, https://github.com/cleanlab/cleanlab/pull/1001, https://github.com/cleanlab/cleanlab/pull/978, https://github.com/cleanlab/cleanlab/pull/994, https://github.com/cleanlab/cleanlab/pull/1010; 01PrathamS in https://github.com/cleanlab/cleanlab/pull/893; elisno in https://github.com/cleanlab/cleanlab/pull/878, https://github.com/cleanlab/cleanlab/pull/1007, https://github.com/cleanlab/cleanlab/pull/992, https://github.com/cleanlab/cleanlab/pull/1015, https://github.com/cleanlab/cleanlab/pull/1016; huiwengoh in https://github.com/cleanlab/cleanlab/pull/984; sanjanag in https://github.com/cleanlab/cleanlab/pull/936; tataganesh in https://github.com/cleanlab/cleanlab/pull/916; ulya-tkch in https://github.com/cleanlab/cleanlab/pull/954
* CI updates by aditya1503 in https://github.com/cleanlab/cleanlab/pull/864; elisno in https://github.com/cleanlab/cleanlab/pull/879, https://github.com/cleanlab/cleanlab/pull/961, https://github.com/cleanlab/cleanlab/pull/963, https://github.com/cleanlab/cleanlab/pull/965, https://github.com/cleanlab/cleanlab/pull/1008, https://github.com/cleanlab/cleanlab/pull/975, https://github.com/cleanlab/cleanlab/pull/1011, https://github.com/cleanlab/cleanlab/pull/1012, https://github.com/cleanlab/cleanlab/pull/1013, https://github.com/cleanlab/cleanlab/pull/1014; jwmueller in https://github.com/cleanlab/cleanlab/pull/852, https://github.com/cleanlab/cleanlab/pull/865; tataganesh in https://github.com/cleanlab/cleanlab/pull/900; anishathalye in https://github.com/cleanlab/cleanlab/pull/956; sanjanag in https://github.com/cleanlab/cleanlab/pull/1009
* Docs system updates by elisno in https://github.com/cleanlab/cleanlab/pull/880, https://github.com/cleanlab/cleanlab/pull/881, https://github.com/cleanlab/cleanlab/pull/958, https://github.com/cleanlab/cleanlab/pull/959, https://github.com/cleanlab/cleanlab/pull/960, https://github.com/cleanlab/cleanlab/pull/964
* Add Null Issue Manager by abhijitpal1247 in https://github.com/cleanlab/cleanlab/pull/856; tataganesh in https://github.com/cleanlab/cleanlab/pull/927, https://github.com/cleanlab/cleanlab/pull/917
* Add Data Valuation Issue Manager by coding-famer in https://github.com/cleanlab/cleanlab/pull/850, https://github.com/cleanlab/cleanlab/pull/925
* Extend non-iid issue check to run if only pred_probs are provided by abhijitpal1247 in https://github.com/cleanlab/cleanlab/pull/857; tataganesh in https://github.com/cleanlab/cleanlab/pull/896, https://github.com/cleanlab/cleanlab/pull/897
* Add Underperforming Group Issue Manager by tataganesh in https://github.com/cleanlab/cleanlab/pull/838, https://github.com/cleanlab/cleanlab/pull/907; elisno in https://github.com/cleanlab/cleanlab/pull/990
* Add Class Imbalance issue type to Datalab defaults by tataganesh in https://github.com/cleanlab/cleanlab/pull/912, https://github.com/cleanlab/cleanlab/pull/933; jwmueller in https://github.com/cleanlab/cleanlab/pull/924, https://github.com/cleanlab/cleanlab/pull/934; elisno in https://github.com/cleanlab/cleanlab/pull/940
* Add regression task to Datalab by mglowacki100 in https://github.com/cleanlab/cleanlab/pull/796; elisno in https://github.com/cleanlab/cleanlab/pull/902
* Add multilabel task to Datalab by tataganesh in https://github.com/cleanlab/cleanlab/pull/929
* 702 - Shorten Refs of classes and functions in Docs by gibsonliketheguitar in https://github.com/cleanlab/cleanlab/pull/831
* Update near duplicate issues and sets by ryansingman in https://github.com/cleanlab/cleanlab/pull/919; elisno in https://github.com/cleanlab/cleanlab/pull/895
* Rescale near duplicate scores by elisno in https://github.com/cleanlab/cleanlab/pull/943
* Rescale outlier scores by elisno in https://github.com/cleanlab/cleanlab/pull/953
* List comprehension to numpy ops for efficiency by tataganesh in https://github.com/cleanlab/cleanlab/pull/844
* Reduce memory usage of filter.find_label_issues() by kylegallatin in https://github.com/cleanlab/cleanlab/pull/885
* Updates to tests by aditya1503 in https://github.com/cleanlab/cleanlab/pull/945; elisno in https://github.com/cleanlab/cleanlab/pull/985, https://github.com/cleanlab/cleanlab/pull/998
* Refactor Datalab functionality by elisno in https://github.com/cleanlab/cleanlab/pull/971, https://github.com/cleanlab/cleanlab/pull/1006
* Minor fixes for Datalab by elisno in https://github.com/cleanlab/cleanlab/pull/997, https://github.com/cleanlab/cleanlab/pull/999, https://github.com/cleanlab/cleanlab/pull/1000, https://github.com/cleanlab/cleanlab/pull/1003, https://github.com/cleanlab/cleanlab/pull/1005, https://github.com/cleanlab/cleanlab/pull/979
* Drop Python 3.7 support and add Python 3.11 support by elisno in https://github.com/cleanlab/cleanlab/pull/980
* Add a `show_all_issues` optional argument to Datalab.report() by elisno in https://github.com/cleanlab/cleanlab/pull/970
* Single Class Span Classification Support by Steven-Yiran in https://github.com/cleanlab/cleanlab/pull/982
* Ensure near-predicted labels are not flagged as label issues by aditya1503 in https://github.com/cleanlab/cleanlab/pull/950
* PR template added and gitignore improved by smttsp in https://github.com/cleanlab/cleanlab/pull/867
* Update label issue count in dataset.health_summary() by ulya-tkch in https://github.com/cleanlab/cleanlab/pull/875
* Update segmentation.ipynb by R-Peleg in https://github.com/cleanlab/cleanlab/pull/948
* Refactor batching logic in cleanlab.segmentation.filter.find_label_issues by elisno in https://github.com/cleanlab/cleanlab/pull/918
For a full list of changes, enhancements, and fixes, please refer to the [Full Changelog](https://github.com/cleanlab/cleanlab/compare/v2.5.0...v2.6.0).