Release notes
These are the release notes for the initial release of the Evaluate library.
Goals
Goals of the Evaluate library:
- reproducibility: reporting and reproducing results is easy
- ease-of-use: access to a wide range of evaluation tools with a unified interface
- diversity: provide a wide range of evaluation tools covering metrics, comparisons, and measurements
- multimodal: models and datasets of many modalities can be evaluated
- community-driven: anybody can add custom evaluations by hosting them on the Hugging Face Hub
Release overview:
- `evaluate.load()`: The `load()` function is the main entry point into Evaluate and lets you load evaluation modules from a local folder, the `evaluate` repository, or the Hugging Face Hub. It downloads, caches, and loads the evaluation module and returns an `evaluate.EvaluationModule` (example below).
- `evaluate.save()`: With `save()` a user can store evaluation results in a JSON file. In addition to the results from an `evaluate.EvaluationModule`, it can store additional parameters and automatically records the timestamp, git commit hash, library version, and Python path. One can either provide a directory for the results, in which case a file name is created automatically, or an explicit file name for the result (example below).
- `evaluate.push_to_hub()`: The `push_to_hub()` function lets you push the results of a model evaluation to the model card on the Hugging Face Hub. The model, dataset, and metric are specified so that they can be linked on the Hub (example below).
- `evaluate.EvaluationModule`: The `EvaluationModule` class is the base class for all evaluation modules. There are three module types: metrics (to evaluate models), comparisons (to compare models), and measurements (to analyze datasets). Inputs can either be added incrementally with `add` (single input) or `add_batch` (batch of inputs), followed by a final `compute` call to compute the scores, or all inputs can be passed to `compute` directly (example below). Under the hood, Apache Arrow is used to store and load the input data for computing the scores.
- `evaluate.EvaluationModuleInfo`: The `EvaluationModuleInfo` class stores an evaluation module's attributes (example below):
- `description`: A short description of the evaluation module.
- `citation`: A BibTeX string for citation when available.
- `features`: A `Features` object defining the input format. The inputs provided to `add`, `add_batch`, and `compute` are tested against these types and an error is thrown in case of a mismatch.
- `inputs_description`: This is equivalent to the module's docstring.
- `homepage`: The homepage of the module.
- `license`: The license of the module.
- `codebase_urls`: Links to the code behind the module.
- `reference_urls`: Additional reference URLs.
- `evaluate.evaluator`: The `evaluator` provides automated evaluation and only requires a model, a dataset, and a metric, in contrast to the metrics in `EvaluationModule`, which require model predictions. It has three main components: a model wrapped in a pipeline, a dataset, and a metric, and it returns the computed evaluation scores. Besides the three main components, it may also require two mappings: one to select the input and label columns of the dataset and one to align the pipeline labels with the dataset labels. This is an experimental feature; currently, only text classification is supported (example below).
- `evaluate-cli`: The community can add custom evaluation modules by adding the necessary module script to a Space on the Hugging Face Hub. The `evaluate-cli` tool simplifies this process by creating the Space, populating a template, and pushing it to the Hub. It also provides instructions for customizing the template and integrating custom logic.
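Examples:

Loading an evaluation module by name. The module names below are examples of modules hosted on the Hub, and `module_type` selects between metrics, comparisons, and measurements:

```python
import evaluate

# Load a metric; the module script is downloaded from the Hub and cached on first use.
accuracy = evaluate.load("accuracy")

# Measurements (and comparisons) are loaded the same way, with `module_type`
# disambiguating modules that share a name across types.
word_length = evaluate.load("word_length", module_type="measurement")
```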
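Saving results with `evaluate.save()`; the directory path and the extra `experiment` keyword below are illustrative:

```python
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])

# Passing a directory generates a result file name automatically; an explicit
# "*.json" path works as well. Extra keyword arguments are stored alongside
# the results and the automatically collected metadata.
evaluate.save("./results/", experiment="initial-release-demo", **result)
```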
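Pushing an evaluation result to a model card with `evaluate.push_to_hub()`. The keyword names below reflect our reading of the function's signature and should be checked against its docstring; the model, dataset, and metric identifiers are placeholders:

```python
import evaluate

evaluate.push_to_hub(
    model_id="username/my-finetuned-model",  # placeholder model repository
    task_type="text-classification",
    dataset_type="imdb",
    dataset_name="IMDb",
    dataset_split="test",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=0.92,                       # placeholder score
)
```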
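Computing scores with an `EvaluationModule`, either in a single call or incrementally:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# All inputs at once:
print(accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1]))

# Or incrementally, e.g. inside an evaluation loop, using `add_batch`
# (single examples can be added with `add`), followed by a final `compute`:
for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    accuracy.add_batch(references=refs, predictions=preds)
print(accuracy.compute())
```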
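Inspecting the `EvaluationModuleInfo` attributes of a loaded module, accessed here through the properties the module exposes:

```python
import evaluate

accuracy = evaluate.load("accuracy")

print(accuracy.description)         # short description of the module
print(accuracy.citation)            # BibTeX citation, when available
print(accuracy.features)            # expected input format
print(accuracy.inputs_description)  # equivalent to the module's docstring
```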
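A sketch of the `evaluator` for text classification; the model checkpoint and dataset are examples, and the column and label arguments correspond to the two mappings described above:

```python
from datasets import load_dataset
import evaluate

# Only text classification is supported in this release.
task_evaluator = evaluate.evaluator("text-classification")

# Small example slice of a sentiment classification dataset.
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))

results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    metric="accuracy",
    input_column="text",    # dataset column fed to the pipeline
    label_column="label",   # dataset column holding the references
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # pipeline labels -> dataset labels
)
print(results)
```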
Main contributors:
lvwerra, sashavor, NimaBoscarino, ola13, osanseviero, lhoestq, lewtun, douwekiela