✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the `LLM` interface so now `LLM`s using an external platform that offers a batch service can be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get 50% cost reductions.
https://github.com/user-attachments/assets/9a559ae1-099b-47a4-9f92-37a3171dfbff
Improved cache for maximum outputs reusability
We all know that running `LLM` is costly and most of the times we want to reuse as much as we can the outputs generated with them. Before this release, `distilabel` cache mechanism enabled to recover a pipeline execution that was stopped before finishing and to re-create the `Distiset` generated by one that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the `Step`s are cached and therefore can be reused in other pipelines executions even if the pipeline has changed:
![image](https://github.com/user-attachments/assets/03d6c110-e98a-463e-8876-62c3733d3ef0)
In addition, we've added a `use_cache` attribute in the `Step`s that allows toggling the use of the cache at step level.
Steps can generated artifacts
In some cases, `Step` produces some additional artifacts that are used to generate its outputs. These artifacts can take some time to be generated and they could be reused in the future. That’s why we’ve added a new method called `Step.save_artifact` that can be called within the step to store artifacts generated by it. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
python
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt
if TYPE_CHECKING:
from distilabel.steps import StepOutput
class CountTextCharacters(GlobalStep):
property
def inputs(self) -> List[str]:
return ["text"]
property
def outputs(self) -> List[str]:
return ["text_character_count"]
def process(self, inputs: StepInput) -> "StepOutput": type: ignore
character_counts = []
for input in inputs:
text_character_count = len(input["text"])
input["text_character_count"] = text_character_count
character_counts.append(text_character_count)
Generate plot with the distribution of text character counts
plt.figure(figsize=(10, 6))
plt.hist(character_counts, bins=30, edgecolor="black")
plt.title("Distribution of Text Character Counts")
plt.xlabel("Character Count")
plt.ylabel("Frequency")
Save the plot as an artifact of the step
self.save_artifact(
name="text_character_count_distribution",
write_function=lambda path: plt.savefig(path / "figure.png"),
metadata={"type": "image", "library": "matplotlib"},
)
plt.close()
yield inputs
New `Tasks`: `CLAIR`, `APIGEN` and many more!
* New [CLAIR](https://github.com/argilla-io/distilabel/pull/926) task: *CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A `preferred` A’ is much more contrastive and precise.*
* New tasks to replicate [APIGen](https://github.com/argilla-io/distilabel/pull/925) framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper: [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
* New [URIAL](https://github.com/argilla-io/distilabel/pull/921) task that allows using non-instruct models to generate a response for an instruction.
* New [TextClassification](https://github.com/argilla-io/distilabel/pull/948) task to make zero-shot text classification based on a predefined but highly customizable prompt.
* [TextClustering](https://github.com/argilla-io/distilabel/pull/948), to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
* Updated [TextGeneration](https://github.com/argilla-io/distilabel/pull/974) to simplify customization of tasks that don’t require further post-processing.
New Steps to sample data in your pipelines and remove duplicates
* New [DataSampler](https://github.com/argilla-io/distilabel/pull/925) step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
* New [EmbeddingDedup](https://github.com/argilla-io/distilabel/pull/946) step to remove duplicates based on embeddings and a distance metric.
* New [MinHashDedup](https://github.com/argilla-io/distilabel/pull/937) step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
* New [TruncateTextColumns](https://github.com/argilla-io/distilabel/pull/902) to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
* New [CombineOutputs](https://github.com/argilla-io/distilabel/pull/939) to combine the outputs of two or more steps into a single output.
Generate text embeddings using `vLLM`
* Now you can generate embeddings using [vLLMEmbeddings](https://github.com/argilla-io/distilabel/pull/920)!
Extra things
* Easily visualize the tasks’ prompts using [Task.print](https://github.com/argilla-io/distilabel/pull/934) method.
* New [use\_default\_structured\_outputs](https://github.com/argilla-io/distilabel/pull/868) flag in tasks to automatically use structured generation in some tasks that can benefit from it.
What's Changed
* Make `ClientvLLM.model_name` a `cached_property` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/862
* Pass dataset to dry_run method by plaguss in https://github.com/argilla-io/distilabel/pull/863
* Add default structured output for `GenerateSentencePair` task by plaguss in https://github.com/argilla-io/distilabel/pull/868
* Complexity scorer default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/870
* Quality scorer default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/873
* Ultrafeedback default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/876
* Remove use of `default_chat_template` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/888
* Temporary fix for installing `llama-cpp-python` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/886
* Fix unit tests after release of `transformers==4.44.0` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/891
* Fix default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/892
* Send as many batches as possible to input queues by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/895
* Exclude `repo_id` from `LoadDataFromFileSystem` by plaguss in https://github.com/argilla-io/distilabel/pull/898
* Fix loader to read from a glob pattern by plaguss in https://github.com/argilla-io/distilabel/pull/877
* Add `save_artifact` method to `_Step` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/871
* Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by plaguss in https://github.com/argilla-io/distilabel/pull/903
* New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by plaguss in https://github.com/argilla-io/distilabel/pull/902
* Update `inputs` and `outputs` interface to allow returning dict indicating optionality by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/883
* Update mistrallm by plaguss in https://github.com/argilla-io/distilabel/pull/904
* Deepseek prover by plaguss in https://github.com/argilla-io/distilabel/pull/907
* Update `RewardModelScore.inputs` property by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/908
* Add tutorial - generate data for training embeddings and reranking models by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/893
* Fix load data from disk by plaguss in https://github.com/argilla-io/distilabel/pull/910
* docs: minor fixes by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/913
* Add `URIAL` task by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/921
* Add `vLLMEmbeddings` by plaguss in https://github.com/argilla-io/distilabel/pull/920
* docs: add tutorials preference and clean by sdiazlor in https://github.com/argilla-io/distilabel/pull/917
* Fix `StructuredGeneration` examples and internal check by plaguss in https://github.com/argilla-io/distilabel/pull/912
* Generate deterministic pipeline name when it's not given by plaguss in https://github.com/argilla-io/distilabel/pull/878
* Add custom errors by plaguss in https://github.com/argilla-io/distilabel/pull/911
* Docs/tutorials fix by sdiazlor in https://github.com/argilla-io/distilabel/pull/922
* Add `revision` runtime parameter to `LoadDataFromHub` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/928
* Add plausible as replacement for GA by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/929
* Add minhash related steps to deduplicate texts by plaguss in https://github.com/argilla-io/distilabel/pull/931
* docs: API reference review by sdiazlor in https://github.com/argilla-io/distilabel/pull/932
* Refactor of MinHash to work with a single class and fix the shelve backend by plaguss in https://github.com/argilla-io/distilabel/pull/937
* Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/936
* Add `CombineOutputs` step by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/939
* fix: regex expression in POSITIVE_NEGATIVE by sdiazlor in https://github.com/argilla-io/distilabel/pull/940
* Offline batch generation by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/923
* Fix applying input mapping when mapping overrides another column by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/938
* Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/941
* Fix empty load stage when two `GlobalStep`s are chained by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/945
* Add `system_prompt` attribute to `TextGeneration` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/950
* Add step to deduplicate records based on embeddings by plaguss in https://github.com/argilla-io/distilabel/pull/946
* Updated setup_logging to use UTF-8 in FileHandler by dameikle in https://github.com/argilla-io/distilabel/pull/952
* Add more generation parameters to `vLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/955
* Fix `Magpie` generating different columns names depending on `LLM` output by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/965
* Docs/962 docs create a smoother transition from index installation quickstart by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/968
* Add `logging_handlers` argument by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/969
* [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by plaguss in https://github.com/argilla-io/distilabel/pull/973
* Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks by plaguss in https://github.com/argilla-io/distilabel/pull/948
* [FEATURE] Simplify customizing the `TextGeneration` task with custom prompts by plaguss in https://github.com/argilla-io/distilabel/pull/974
* Update `system_prompt` attribute for adding probabilities in `MagpieBase` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/981
* Fix unloading steps with more than 1 replica by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/982
* docs: 960 docs add a glossary concept section by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/970
* Fix missing `system_prompt_key` column in `Magpie` tasks by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/983
* docs: update component gallery by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/987
* fix missing batch when last batch arrive early by zye1996 in https://github.com/argilla-io/distilabel/pull/989
* Fine personas socialai tutorial by plaguss in https://github.com/argilla-io/distilabel/pull/992
* feat: add basic draw implementation to pipline by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/966
* Fix schema inference structured generation by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/994
* [DOCS] Add developer documentation section in the docs by plaguss in https://github.com/argilla-io/distilabel/pull/999
* Fix `vllm` installation in CI by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1009
* fix metadata writeout when llm error by zye1996 in https://github.com/argilla-io/distilabel/pull/1003
* Add example of custom text generation step in quickstart by plaguss in https://github.com/argilla-io/distilabel/pull/984
* feat: 985 feature argillalabeller task by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/986
* Fix`llvmlite` install with `uv` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1018
* fix: failing tests argilla labeller by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1017
* fix inpute when output_mapping is not empty by zye1996 in https://github.com/argilla-io/distilabel/pull/1015
* Add Tasks to replicate `APIGen` by plaguss in https://github.com/argilla-io/distilabel/pull/925
* Pretty print by plaguss in https://github.com/argilla-io/distilabel/pull/934
* Add `CLAIR` task by plaguss in https://github.com/argilla-io/distilabel/pull/926
* Add cache at `Step` level by plaguss in https://github.com/argilla-io/distilabel/pull/766
* Fix `IndexError` when overriding inputs and `group_generations=False` by plaguss in https://github.com/argilla-io/distilabel/pull/1022
* Update `Pipeline cache` docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1023
* `1.4.0` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1024
New Contributors
* dameikle made their first contribution in https://github.com/argilla-io/distilabel/pull/952
* zye1996 made their first contribution in https://github.com/argilla-io/distilabel/pull/989
**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.3.2...1.4.0