Distilabel

Latest version: v1.5.3

Safety actively analyzes 724020 Python packages for vulnerabilities to keep your Python projects secure.

Page 2 of 6

1.4.0

✨ Release highlights

Offline Batch Generation and OpenAI Batch API

We’ve updated the `LLM` interface so now `LLM`s using an external platform that offers a batch service can be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get 50% cost reductions.

https://github.com/user-attachments/assets/9a559ae1-099b-47a4-9f92-37a3171dfbff

Improved cache for maximum outputs reusability

We all know that running `LLM` is costly and most of the times we want to reuse as much as we can the outputs generated with them. Before this release, `distilabel` cache mechanism enabled to recover a pipeline execution that was stopped before finishing and to re-create the `Distiset` generated by one that finished its execution and was re-executed.

In this release, we've greatly improved the cache so the outputs of all the `Step`s are cached and therefore can be reused in other pipelines executions even if the pipeline has changed:

![image](https://github.com/user-attachments/assets/03d6c110-e98a-463e-8876-62c3733d3ef0)

In addition, we've added a `use_cache` attribute in the `Step`s that allows toggling the use of the cache at step level.

Steps can generated artifacts

In some cases, `Step` produces some additional artifacts that are used to generate its outputs. These artifacts can take some time to be generated and they could be reused in the future. That’s why we’ve added a new method called `Step.save_artifact` that can be called within the step to store artifacts generated by it. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.

python
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt

if TYPE_CHECKING:
from distilabel.steps import StepOutput

class CountTextCharacters(GlobalStep):
property
def inputs(self) -> List[str]:
return ["text"]

property
def outputs(self) -> List[str]:
return ["text_character_count"]

def process(self, inputs: StepInput) -> "StepOutput": type: ignore
character_counts = []

for input in inputs:
text_character_count = len(input["text"])
input["text_character_count"] = text_character_count
character_counts.append(text_character_count)

Generate plot with the distribution of text character counts
plt.figure(figsize=(10, 6))
plt.hist(character_counts, bins=30, edgecolor="black")
plt.title("Distribution of Text Character Counts")
plt.xlabel("Character Count")
plt.ylabel("Frequency")

Save the plot as an artifact of the step
self.save_artifact(
name="text_character_count_distribution",
write_function=lambda path: plt.savefig(path / "figure.png"),
metadata={"type": "image", "library": "matplotlib"},
)

plt.close()

yield inputs

New `Tasks`: `CLAIR`, `APIGEN` and many more!

* New [CLAIR](https://github.com/argilla-io/distilabel/pull/926) task: *CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A `preferred` A’ is much more contrastive and precise.*
* New tasks to replicate [APIGen](https://github.com/argilla-io/distilabel/pull/925) framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper: [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
* New [URIAL](https://github.com/argilla-io/distilabel/pull/921) task that allows using non-instruct models to generate a response for an instruction.
* New [TextClassification](https://github.com/argilla-io/distilabel/pull/948) task to make zero-shot text classification based on a predefined but highly customizable prompt.
* [TextClustering](https://github.com/argilla-io/distilabel/pull/948), to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
* Updated [TextGeneration](https://github.com/argilla-io/distilabel/pull/974) to simplify customization of tasks that don’t require further post-processing.

New Steps to sample data in your pipelines and remove duplicates

* New [DataSampler](https://github.com/argilla-io/distilabel/pull/925) step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
* New [EmbeddingDedup](https://github.com/argilla-io/distilabel/pull/946) step to remove duplicates based on embeddings and a distance metric.
* New [MinHashDedup](https://github.com/argilla-io/distilabel/pull/937) step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
* New [TruncateTextColumns](https://github.com/argilla-io/distilabel/pull/902) to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
* New [CombineOutputs](https://github.com/argilla-io/distilabel/pull/939) to combine the outputs of two or more steps into a single output.

Generate text embeddings using `vLLM`

* Now you can generate embeddings using [vLLMEmbeddings](https://github.com/argilla-io/distilabel/pull/920)!

Extra things

* Easily visualize the tasks’ prompts using [Task.print](https://github.com/argilla-io/distilabel/pull/934) method.
* New [use\_default\_structured\_outputs](https://github.com/argilla-io/distilabel/pull/868) flag in tasks to automatically use structured generation in some tasks that can benefit from it.

What's Changed
* Make `ClientvLLM.model_name` a `cached_property` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/862
* Pass dataset to dry_run method by plaguss in https://github.com/argilla-io/distilabel/pull/863
* Add default structured output for `GenerateSentencePair` task by plaguss in https://github.com/argilla-io/distilabel/pull/868
* Complexity scorer default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/870
* Quality scorer default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/873
* Ultrafeedback default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/876
* Remove use of `default_chat_template` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/888
* Temporary fix for installing `llama-cpp-python` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/886
* Fix unit tests after release of `transformers==4.44.0` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/891
* Fix default structured output by plaguss in https://github.com/argilla-io/distilabel/pull/892
* Send as many batches as possible to input queues by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/895
* Exclude `repo_id` from `LoadDataFromFileSystem` by plaguss in https://github.com/argilla-io/distilabel/pull/898
* Fix loader to read from a glob pattern by plaguss in https://github.com/argilla-io/distilabel/pull/877
* Add `save_artifact` method to `_Step` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/871
* Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by plaguss in https://github.com/argilla-io/distilabel/pull/903
* New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by plaguss in https://github.com/argilla-io/distilabel/pull/902
* Update `inputs` and `outputs` interface to allow returning dict indicating optionality by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/883
* Update mistrallm by plaguss in https://github.com/argilla-io/distilabel/pull/904
* Deepseek prover by plaguss in https://github.com/argilla-io/distilabel/pull/907
* Update `RewardModelScore.inputs` property by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/908
* Add tutorial - generate data for training embeddings and reranking models by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/893
* Fix load data from disk by plaguss in https://github.com/argilla-io/distilabel/pull/910
* docs: minor fixes by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/913
* Add `URIAL` task by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/921
* Add `vLLMEmbeddings` by plaguss in https://github.com/argilla-io/distilabel/pull/920
* docs: add tutorials preference and clean by sdiazlor in https://github.com/argilla-io/distilabel/pull/917
* Fix `StructuredGeneration` examples and internal check by plaguss in https://github.com/argilla-io/distilabel/pull/912
* Generate deterministic pipeline name when it's not given by plaguss in https://github.com/argilla-io/distilabel/pull/878
* Add custom errors by plaguss in https://github.com/argilla-io/distilabel/pull/911
* Docs/tutorials fix by sdiazlor in https://github.com/argilla-io/distilabel/pull/922
* Add `revision` runtime parameter to `LoadDataFromHub` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/928
* Add plausible as replacement for GA by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/929
* Add minhash related steps to deduplicate texts by plaguss in https://github.com/argilla-io/distilabel/pull/931
* docs: API reference review by sdiazlor in https://github.com/argilla-io/distilabel/pull/932
* Refactor of MinHash to work with a single class and fix the shelve backend by plaguss in https://github.com/argilla-io/distilabel/pull/937
* Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/936
* Add `CombineOutputs` step by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/939
* fix: regex expression in POSITIVE_NEGATIVE by sdiazlor in https://github.com/argilla-io/distilabel/pull/940
* Offline batch generation by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/923
* Fix applying input mapping when mapping overrides another column by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/938
* Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/941
* Fix empty load stage when two `GlobalStep`s are chained by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/945
* Add `system_prompt` attribute to `TextGeneration` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/950
* Add step to deduplicate records based on embeddings by plaguss in https://github.com/argilla-io/distilabel/pull/946
* Updated setup_logging to use UTF-8 in FileHandler by dameikle in https://github.com/argilla-io/distilabel/pull/952
* Add more generation parameters to `vLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/955
* Fix `Magpie` generating different columns names depending on `LLM` output by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/965
* Docs/962 docs create a smoother transition from index installation quickstart by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/968
* Add `logging_handlers` argument by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/969
* [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by plaguss in https://github.com/argilla-io/distilabel/pull/973
* Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks by plaguss in https://github.com/argilla-io/distilabel/pull/948
* [FEATURE] Simplify customizing the `TextGeneration` task with custom prompts by plaguss in https://github.com/argilla-io/distilabel/pull/974
* Update `system_prompt` attribute for adding probabilities in `MagpieBase` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/981
* Fix unloading steps with more than 1 replica by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/982
* docs: 960 docs add a glossary concept section by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/970
* Fix missing `system_prompt_key` column in `Magpie` tasks by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/983
* docs: update component gallery by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/987
* fix missing batch when last batch arrive early by zye1996 in https://github.com/argilla-io/distilabel/pull/989
* Fine personas socialai tutorial by plaguss in https://github.com/argilla-io/distilabel/pull/992
* feat: add basic draw implementation to pipline by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/966
* Fix schema inference structured generation by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/994
* [DOCS] Add developer documentation section in the docs by plaguss in https://github.com/argilla-io/distilabel/pull/999
* Fix `vllm` installation in CI by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1009
* fix metadata writeout when llm error by zye1996 in https://github.com/argilla-io/distilabel/pull/1003
* Add example of custom text generation step in quickstart by plaguss in https://github.com/argilla-io/distilabel/pull/984
* feat: 985 feature argillalabeller task by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/986
* Fix`llvmlite` install with `uv` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1018
* fix: failing tests argilla labeller by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1017
* fix inpute when output_mapping is not empty by zye1996 in https://github.com/argilla-io/distilabel/pull/1015
* Add Tasks to replicate `APIGen` by plaguss in https://github.com/argilla-io/distilabel/pull/925
* Pretty print by plaguss in https://github.com/argilla-io/distilabel/pull/934
* Add `CLAIR` task by plaguss in https://github.com/argilla-io/distilabel/pull/926
* Add cache at `Step` level by plaguss in https://github.com/argilla-io/distilabel/pull/766
* Fix `IndexError` when overriding inputs and `group_generations=False` by plaguss in https://github.com/argilla-io/distilabel/pull/1022
* Update `Pipeline cache` docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1023
* `1.4.0` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1024

New Contributors
* dameikle made their first contribution in https://github.com/argilla-io/distilabel/pull/952
* zye1996 made their first contribution in https://github.com/argilla-io/distilabel/pull/989

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.3.2...1.4.0

1.3.2

What's Changed
* Deepseek prover task by plaguss in https://github.com/argilla-io/distilabel/pull/733
* Do not cancel in progress docs workflows by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/919
* Fix creating Ray placement groups for vLLM by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/918
* Fix passing `base_url` in `model_id` in `InferenceEndpointsLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/924

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.3.1...1.3.2

1.3.1

What's Changed
* Create new `distilabel.constants` module to store constants and avoid circular imports by plaguss in https://github.com/argilla-io/distilabel/pull/861
* Add OpenAI request timeout by ashim-mahara in https://github.com/argilla-io/distilabel/pull/858

New Contributors
* ashim-mahara made their first contribution in https://github.com/argilla-io/distilabel/pull/858

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.3.0...1.3.1

1.3.0

What's Changed
* Add new step `CombineKeys` by plaguss in https://github.com/argilla-io/distilabel/pull/747
* Refactor naming columns steps combinecolumns combinekeys expandcolumns by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/758
* Drop remove deprecated `LoadHubDataset` by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/759
* Add `requirements` list for `Pipeline` by plaguss in https://github.com/argilla-io/distilabel/pull/720
* Add `StepResources` and step replicas in `Pipeline` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/750
* Add load stages by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/760
* Update min required version to `python==3.9` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/770
* Optionally include the pipeline script in the hub when pushing your distiset by plaguss in https://github.com/argilla-io/distilabel/pull/762
* Add `docs-pr.yml` and `docs-pr-close.yml` workflows by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/774
* Add `RayPipeline` class by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/769
* Fixed closed PR workflow by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/776
* Add `Magpie` and `MagpieGenerator` tasks by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/778
* Fix some issues related to `Magpie` task by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/783
* Add `end_with_user` and `include_system_prompt` flags to `Magpie` tasks and handle `None`s. by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/784
* Add workflow concurrency group for publishing docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/796
* Add `_desired_num_gpus` attribute to `CudaDevicePlacementMixin` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/795
* Compatibility with `vLLM` with `tensor_parallel_size` argument by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/805
* Update default names in `GroupColumns` by plaguss in https://github.com/argilla-io/distilabel/pull/808
* Request batches to `GeneratorStep` if only step in pipeline by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/828
* Add default name for a pipeline by plaguss in https://github.com/argilla-io/distilabel/pull/809
* Update distilabel phrasing based on PR hugging face hub by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/821
* Some more `Magpie` improvements by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/833
* Add `Embeddings` base class, `SentenceTransformerEmbeddings` class, `EmbeddingGeneration` and `FaissNearestNeighbour` steps by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/830
* Create file per hostname in `CudaDevicePlacementMixin` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/814
* Create a `GeneratorStep` from a dataset using a helper function by plaguss in https://github.com/argilla-io/distilabel/pull/812
* Do not take into account `disable_cuda_device_placement` for pipeline signature by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/838
* Add `RewardModelScore` step by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/840
* Fix `LoadDataFromHub` attribute `_dataset` had `ellipsis` by default instead of `None` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/841
* Create `PlacementGroup` for steps using `vLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/842
* Update `argilla` integration to use `argilla_sdk` v2 by alvarobartt in https://github.com/argilla-io/distilabel/pull/705
* Make `overall-rating` the default aspect for `UltraFeedback` task by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/843
* fix typo index.md by franperic in https://github.com/argilla-io/distilabel/pull/844
* Use `CudaDevicePlacementMixin` in `RewardModelScore` step by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/845
* Gather GPUs per Ray node to create placement groups by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/848
* Fix typo in docs by plaguss in https://github.com/argilla-io/distilabel/pull/850
* Add `xfail` routing batch function tests by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/852
* Fix creating placement group when `pipeline_parallel_size>1` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/851
* docs: 846 docs include google analytics by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/847
* Add `ClientvLLM` class by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/854
* Add hard-negative flag to include similar challenging negatives on triplets by plaguss in https://github.com/argilla-io/distilabel/pull/856
* Add bibtex references in the docstrings to be shown in the README by plaguss in https://github.com/argilla-io/distilabel/pull/855
* distilabel `1.3.0` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/857

New Contributors
* franperic made their first contribution in https://github.com/argilla-io/distilabel/pull/844

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.4...1.3.0

1.2.4

What's Changed
* Update `InferenceEndpointsLLM` to use `chat_completion` method by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/815

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.3...1.2.4

1.2.3

What's Changed
* Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue 785) by Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/786
* Correct variable name in dataset push example (in ultrafeedback.md file) (Issue 787) by Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/791
* docs: update script for issue dashboard by sdiazlor in https://github.com/argilla-io/distilabel/pull/775
* Fix 404 model not found for private Serverless IE by dvsrepo in https://github.com/argilla-io/distilabel/pull/806

New Contributors
* Hassaan-Qaisar made their first contribution in https://github.com/argilla-io/distilabel/pull/786

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.2...1.2.3

Page 2 of 6

Releases

Has known vulnerabilities

Previous Next

Distilabel

Page 2 of 6

1.4.0

1.3.2

1.3.1

1.3.0

1.2.4

1.2.3

Page 2 of 6

Links

Releases