Distilabel

Latest version: v1.4.1


1.2.3

What's Changed
* Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue 785) by Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/786
* Correct variable name in dataset push example (in ultrafeedback.md file) (Issue 787) by Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/791
* docs: update script for issue dashboard by sdiazlor in https://github.com/argilla-io/distilabel/pull/775
* Fix 404 model not found for private Serverless IE by dvsrepo in https://github.com/argilla-io/distilabel/pull/806

New Contributors
* Hassaan-Qaisar made their first contribution in https://github.com/argilla-io/distilabel/pull/786

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.2...1.2.3

1.2.2

What's Changed
* Fix passing `input` to `format_output` function by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/781


**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.1...1.2.2

1.2.1

What's Changed
* Fix docs for distiset.save_to_disk kwargs by fpreiss in https://github.com/argilla-io/distilabel/pull/745
* docs: change references by sdiazlor in https://github.com/argilla-io/distilabel/pull/754
* Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/764

New Contributors
* fpreiss made their first contribution in https://github.com/argilla-io/distilabel/pull/745

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.2.0...1.2.1

1.2.0

✨ Release highlights

Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task

* [`instructor`](https://github.com/jxnl/instructor) has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:

<details>
<summary>Structured generation with `instructor` example</summary>

```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., default_factory=list)
    edges: List[Edge] = Field(..., default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b",
            # The pydantic model defines the shape of the structured output.
            structured_output={"schema": KnowledgeGraph},
        ),
    )
    load_dataset >> text_generation
```

</details>
* `InferenceEndpointsLLM` now supports structured generation
* New [`StructuredGeneration`](https://distilabel.argilla.io/latest/components-gallery/tasks/structuredgeneration/) task that allows defining the schema of the structured generation per input row.
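
To make "per input row" concrete, the sketch below builds two input rows that each carry their own JSON schema. The column names `instruction` and `structured_output`, and the `format`/`schema` keys, are assumptions modelled on the `structured_output` dictionaries used elsewhere in these notes; check the task's documentation for the exact input columns. Only the standard library is used, so nothing here calls distilabel.

```python
import json

# Hypothetical per-row schemas: each row asks for a differently shaped JSON object.
character_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

city_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Assumed column names: `instruction` and `structured_output`.
rows = [
    {
        "instruction": "Create an RPG character",
        "structured_output": {"format": "json", "schema": character_schema},
    },
    {
        "instruction": "Describe a large European city",
        "structured_output": {"format": "json", "schema": city_schema},
    },
]

# Each row is plain JSON-serializable data, so it can live in a dataset column.
print(json.dumps(rows[0]["structured_output"], indent=2))
```

Rows shaped like this could be fed to a loading step such as `LoadDataFromDicts`, so that every instruction is paired with the schema its output should follow.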

New tasks for generating datasets for training embedding models

[`sentence-transformers`](https://sbert.net/) v3 was recently released and we couldn't resist the urge of adding a few new tasks to allow creating datasets for training embedding models!

* New [`GenerateSentencePair`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/) task that generates a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task can create different kinds of data depending on the `action` to perform with respect to the `anchor`: paraphrase it, generate a semantically similar sentence, generate a query, or generate an answer.
* Implemented [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368) and added the following tasks derived from the paper:
    * [`EmbeddingTaskGenerator`](https://distilabel.argilla.io/latest/components-gallery/tasks/embeddingtaskgenerator/) which allows generating new embedding-related tasks using an `LLM`.
    * [`GenerateTextRetrievalData`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatetextretrievaldata/) which allows creating text retrieval data with an `LLM`.
    * [`GenerateShortTextMatchingData`](https://distilabel.argilla.io/latest/components-gallery/tasks/generateshorttextmatchingdata/) which allows creating short texts matching the input data.
    * [`GenerateLongTextMatchingData`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatelongtextmatchingdata/) which allows creating long texts matching the input data.
    * [`GenerateTextClassificationData`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatetextclassificationdata/) which allows creating text classification data from the input data.
    * [`MonolingualTripletGenerator`](https://distilabel.argilla.io/latest/components-gallery/tasks/monolingualtripletgenerator/) which allows creating monolingual triplets from the input data.
    * [`BitextRetrievalGenerator`](https://distilabel.argilla.io/latest/components-gallery/tasks/bitextretrievalgenerator) which allows creating bitext retrieval data from the input data.

New `Step`s for loading data from different sources and saving/loading `Distiset` to disk

We've added a few new steps for loading data from different sources:

* [`LoadDataFromDisk`](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromdisk/) allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
* [`LoadDataFromFileSystem`](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromfilesystem/) allows loading a `datasets.Dataset` from a file system.

Thanks to rasdani for helping us test these new steps!

In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that saves the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.

<details>
<summary>`save_to_disk` example</summary>

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    # The target directory is passed positionally.
    distiset.save_to_disk("my-distiset")
```

</details>

`MixtureOfAgentsLLM` implementation

We've added a new `LLM` called [`MixtureOfAgentsLLM`](https://distilabel.argilla.io/latest/components-gallery/llms/mixtureofagentsllm/) derived from the paper [Mixture-of-Agents Enhances Large Language Model Capabilities](https://arxiv.org/abs/2406.04692). This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.

<details>
<summary>`MixtureOfAgentsLLM` example</summary>

```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```

</details>
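
The idea behind the class can be sketched without any real model. In the toy version below, the "LLMs" are plain Python callables (all names are hypothetical), the proposers answer in every round while seeing the previous round's proposals, and the aggregator produces the final answer from the last round. This is a simplified reading of the layered scheme described in the paper, not distilabel's implementation.

```python
from typing import Callable, List

# A stand-in for an LLM: takes a prompt plus earlier proposals, returns text.
ToyLLM = Callable[[str, List[str]], str]


def make_proposer(name: str) -> ToyLLM:
    """Build a fake proposer that reports how many proposals it was shown."""

    def propose(prompt: str, previous: List[str]) -> str:
        return f"{name} answer to '{prompt}' (saw {len(previous)} proposals)"

    return propose


def aggregator(prompt: str, proposals: List[str]) -> str:
    """Fake aggregator: joins the proposals instead of synthesizing them."""
    return " | ".join(proposals)


def mixture_of_agents(
    prompt: str,
    proposers: List[ToyLLM],
    aggregate: ToyLLM,
    rounds: int,
) -> str:
    proposals: List[str] = []
    for _ in range(rounds):
        # Every proposer sees the full set of proposals from the previous round.
        proposals = [propose(prompt, proposals) for propose in proposers]
    # The aggregator turns the last round of proposals into one output.
    return aggregate(prompt, proposals)


proposers = [make_proposer("A"), make_proposer("B"), make_proposer("C")]
output = mixture_of_agents("Hi", proposers, aggregator, rounds=2)
print(output)
```

With `rounds=2`, the proposers in the second round each see the three first-round proposals, which is the mechanism the `rounds` argument is meant to control.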

Saving cache and passing batches to `GlobalStep`s optimizations

* The cache logic of the `_BatchManager` has been improved to incrementally update the cache making the process much faster.
* The data of the input batches of the `GlobalStep`s is now passed to the step through the file system, as this is faster than passing it through the queue. This is possible thanks to the new integration of `fsspec`, which [can be configured](https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/fs_to_pass_data/) to use a file system or cloud storage as the backend for passing the data of the batches.
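
The second point can be pictured with a standard-library sketch: rather than putting the batch data itself on the queue, the producer writes it to a file and enqueues only the path. The function names and the JSON file format are illustrative assumptions; distilabel's actual implementation relies on `fsspec` and is not reproduced here.

```python
import json
import queue
import tempfile
from pathlib import Path
from typing import Any, Dict, List

Batch = List[Dict[str, Any]]


def send_batch(batch: Batch, directory: Path, q: "queue.Queue[str]", seq_no: int) -> None:
    """Write the batch to disk and enqueue only its (small) path."""
    path = directory / f"batch_{seq_no}.json"
    path.write_text(json.dumps(batch))
    q.put(str(path))


def receive_batch(q: "queue.Queue[str]") -> Batch:
    """Read the path from the queue and load the batch from disk."""
    path = Path(q.get())
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    q: "queue.Queue[str]" = queue.Queue()
    batch = [{"instruction": f"question {i}"} for i in range(3)]
    send_batch(batch, Path(tmp), q, seq_no=0)
    received = receive_batch(q)

print(received == batch)
```

The queue message stays the same size however large the batch grows, which is the reason this kind of hand-off can beat serializing the rows through the queue.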

`BasePipeline` and `_BatchManager` refactor

The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.

Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark

`distilabel` can be easily used to create an `LLM` benchmark. To showcase this, we decided to implement [Arena Hard](https://github.com/lm-sys/arena-hard-auto) as an example: [Benchmarking with `distilabel`: Arena Hard](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/#benchmarking-with-distilabel-arena-hard)

📚 Improved documentation structure

We have updated the documentation structure to make it clearer and more self-explanatory, as well as more visually appealing 😏.

![image](https://github.com/argilla-io/distilabel/assets/29572918/611973d5-77cc-4a50-a414-adfaf1d821a1)

What's Changed
* Add `prometheus.md` by alvarobartt in https://github.com/argilla-io/distilabel/pull/656
* Reduce time required to execute `_cache` method by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/672
* [DOCS] Update theme styles and images by leiyre in https://github.com/argilla-io/distilabel/pull/667
* Fix circular import due to DISTILABEL_METADATA_KEY by plaguss in https://github.com/argilla-io/distilabel/pull/675
* Add `CITATION.cff` by alvarobartt in https://github.com/argilla-io/distilabel/pull/677
* Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by alvarobartt in https://github.com/argilla-io/distilabel/pull/676
* Add functionality to load/save distisets to/from disk by plaguss in https://github.com/argilla-io/distilabel/pull/673
* Integration instructor by plaguss in https://github.com/argilla-io/distilabel/pull/654
* Fix docs of saving/loading distiset from disk by plaguss in https://github.com/argilla-io/distilabel/pull/679
* Pass data of batches using file system by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/678
* Add `python==3.12` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/615
* Add `codspeed` benchmarks by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/674
* Add `StructuredGeneration` task and support for `grammar` in `InferenceEndpointsLLM` by alvarobartt in https://github.com/argilla-io/distilabel/pull/680
* Fix `InferenceEndpointsLLM` not using cached token by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/690
* Add `GenerateSentencePair` task by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/689
* Fix prepend batches by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/696
* Fix `EvolQuality._apply_random_mutation` not properly injecting `response` in template by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/703
* [FEATURE] Include new `GeneratorStep` classes to load datasets from different formats by plaguss in https://github.com/argilla-io/distilabel/pull/691
* Add citation readme by plaguss in https://github.com/argilla-io/distilabel/pull/712
* Move navigation to top bar by plaguss in https://github.com/argilla-io/distilabel/pull/708
* Fix `install_dependencies.sh` by alvarobartt in https://github.com/argilla-io/distilabel/pull/713
* Add context to guide the generate sentence pair task if informed by plaguss in https://github.com/argilla-io/distilabel/pull/706
* Add examples to the LLMs to be shown in the components gallery by plaguss in https://github.com/argilla-io/distilabel/pull/714
* Gather HF_TOKEN internally when calling `Distiset.push_to_hub` if token is None. by plaguss in https://github.com/argilla-io/distilabel/pull/707
* Implement "Improving Text Embeddings with LLMs" by alvarobartt in https://github.com/argilla-io/distilabel/pull/683
* Add `ArenaHard` benchmark and `ArenaHardResults` step by alvarobartt in https://github.com/argilla-io/distilabel/pull/670
* Refactor `Pipeline` and `BasePipeline` classes by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/704
* Fix AzureOpenAILLM load method setting the correct path to mock the internal class by plaguss in https://github.com/argilla-io/distilabel/pull/725
* Components examples steps by plaguss in https://github.com/argilla-io/distilabel/pull/715
* Add examples for tasks in the components gallery by plaguss in https://github.com/argilla-io/distilabel/pull/724
* [FEATURE] Refactor of structured generation and use schemas defined in a dataset by plaguss in https://github.com/argilla-io/distilabel/pull/688
* Update docs document phrasing and funnel by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/718
* docs: 728 docs api reference tasktyping cannot be imported during doc build by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/729
* docs: 730 docs add an index to the guide overview by davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/731
* Add `MixtureOfAgentsLLM` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/735
* Add `examples/arena_hard.py` and remove from `distilabel` core by alvarobartt in https://github.com/argilla-io/distilabel/pull/741
* Add serving LLM section in the docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/742
* `distilabel` v1.2.0 by alvarobartt in https://github.com/argilla-io/distilabel/pull/659


New Contributors
* leiyre made their first contribution in https://github.com/argilla-io/distilabel/pull/667

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.1.1...1.2.0

1.1.1

What's Changed
* Fix crash when using vLLM without structured generation by cg123 in https://github.com/argilla-io/distilabel/pull/658
* Fix error on `Pipeline.dry_run` without `parameters` by plaguss in https://github.com/argilla-io/distilabel/pull/655

New Contributors
* cg123 made their first contribution in https://github.com/argilla-io/distilabel/pull/658

**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.1.0...1.1.1

1.1.0

Two new tasks implemented!

`Genstruct` task (https://github.com/argilla-io/distilabel/pull/600)

You can now use `Genstruct` task as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    # Connect the loader to the task so the documents reach `Genstruct`.
    load_hub_dataset >> task
```


`PrometheusEval` task (https://github.com/argilla-io/distilabel/pull/610)

A new `PrometheusEval` task, based on the recently published paper ["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/abs/2405.01535):

```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
```


Connect the steps in the pipeline with `>>` (https://github.com/argilla-io/distilabel/pull/490)

You can now connect your steps using Python's right shift operator (`>>`):

```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    # A list on the right-hand side fans out to several steps at once.
    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
```


Routing batch function (https://github.com/argilla-io/distilabel/pull/595)

Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a `sample_n_steps` routing batch function, which makes it easier to replicate the setup of the original UltraFeedback paper:

```python
import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration

@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    # Pick two of the downstream steps for each batch.
    return random.sample(steps, 2)

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
```
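
Stripped of the pipeline machinery, a routing batch function is just a function from the list of downstream step names to the subset that should receive the current batch. The sketch below uses only the standard library to show the sampling behaviour across a few batches; the step names and the `make_sampler` helper are hypothetical, and distilabel's decorator is deliberately left out.

```python
import random
from typing import Callable, List


def make_sampler(n: int, seed: int = 0) -> Callable[[List[str]], List[str]]:
    """Return a function that picks `n` distinct downstream steps per batch."""
    rng = random.Random(seed)  # seeded only to make the demo repeatable

    def route(steps: List[str]) -> List[str]:
        return rng.sample(steps, n)

    return route


downstream = ["text_generation_a", "text_generation_b", "text_generation_c"]
route = make_sampler(n=2)

# Each batch is routed independently, so different batches can reach different steps.
routes = [route(downstream) for _ in range(4)]
for batch_no, chosen in enumerate(routes):
    print(batch_no, chosen)
```

Because the choice is made per batch, every downstream step ends up processing only part of the data, and a later step has to merge their outputs, as `CombineColumns` does in the pipeline above.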


Generate structured outputs using `outlines` (https://github.com/argilla-io/distilabel/pull/601)

You can generate `JSON` or `regex` using `TransformersLLM`, `LlamaCppLLM` or `vLLM` thanks to the integration with [`outlines`](https://github.com/outlines-dev/outlines):

```python
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"

class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon

with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
```


New `GroqLLM` (https://github.com/argilla-io/distilabel/pull/583)

New integration with [groq](https://console.groq.com/docs/quickstart). Special mention to kcentric, who did the initial work prior to the refactor for 1.0.0.

```python
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
```


Easily test your pipeline doing a `dry_run` (https://github.com/argilla-io/distilabel/pull/635)

```python
with Pipeline(...) as pipeline:
    ...

distiset = pipeline.dry_run(
    parameters=...,  # The same argument as `Pipeline.run`
    batch_size=1,  # Optional, will be set to 1 by default.
)
```

```
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
                    INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
```


**`Pipeline.log` file is dumped to the Hugging Face repository ([568](https://github.com/argilla-io/distilabel/pull/568))**

From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file is automatically uploaded to your dataset repository together with the `pipeline.yaml`, to keep track of the execution.

New `distilabel_metadata` column to store internal data (https://github.com/argilla-io/distilabel/pull/586)

You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is handy for keeping the raw output from an LLM: if the task post-processes that output via `format_output`, the original output is preserved so that nothing is lost.

You can include the metadata at the task level as:

```python
TextGeneration(..., add_raw_output=True)  # or False
```

And directly determine whether you want this column in your final `Distiset`:

```python
with Pipeline(..., enable_metadata=True):  # or False
    ...
```

This way you can choose to drop the column altogether.

All the changes in this PR

* Allow nested connect calls and overload rshift method to connect steps by plaguss in https://github.com/argilla-io/distilabel/pull/490
* Fix `llm_blender` installation by alvarobartt in https://github.com/argilla-io/distilabel/pull/557
* Warn user about unknown runtime parameters by plaguss in https://github.com/argilla-io/distilabel/pull/555
* Add missing `model_name`, update docstrings, and add `*.jinja2` templates to `Task` subclasses by alvarobartt in https://github.com/argilla-io/distilabel/pull/560
* Split `ChatGeneration` from `TextGeneration` by alvarobartt in https://github.com/argilla-io/distilabel/pull/558
* Set `extra="forbid"` in `{_Step,LLM}.model_config` by alvarobartt in https://github.com/argilla-io/distilabel/pull/577
* Infer step name by plaguss in https://github.com/argilla-io/distilabel/pull/575
* Change the context of subprocesses depending on the platform by plaguss in https://github.com/argilla-io/distilabel/pull/578
* Dump logs within a file in .cache/distilabel/pipelines dir by plaguss in https://github.com/argilla-io/distilabel/pull/568
* Fix empty batches causing misalignment when branching by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/590
* Add `GroqLLM` by alvarobartt in https://github.com/argilla-io/distilabel/pull/583
* Add `Format{Chat,Text}Generation{DPO,SFT}` by alvarobartt in https://github.com/argilla-io/distilabel/pull/584
* Fix `title` in `RatingQuestion` of `PreferenceToArgilla` by alvarobartt in https://github.com/argilla-io/distilabel/pull/597
* Set `streaming=False` and add `num_examples` to `LoadHubDataset` by plaguss in https://github.com/argilla-io/distilabel/pull/565
* Make `pipeline` argument of `Step` optional by plaguss in https://github.com/argilla-io/distilabel/pull/566
* Extend `LLM` kwargs to align with counterparts by alvarobartt in https://github.com/argilla-io/distilabel/pull/594
* Add `Genstruct` task by alvarobartt in https://github.com/argilla-io/distilabel/pull/600
* Fix `num_examples` to be optional in `LoadHubDataset` by plaguss in https://github.com/argilla-io/distilabel/pull/603
* Fix `list_files_in_dir` returning unsorted files by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/609
* Add `PrometheusEval` task by alvarobartt in https://github.com/argilla-io/distilabel/pull/610
* Update `ValueError` on missing inputs message by alvarobartt in https://github.com/argilla-io/distilabel/pull/617
* Add `routing_batch_function` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/595
* Fix `pipeline.log` inconsistency & include LLM info in signature by plaguss in https://github.com/argilla-io/distilabel/pull/598
* Add custom `rubrics` attribute to `PrometheusEval` by alvarobartt in https://github.com/argilla-io/distilabel/pull/621
* Update `UltraFeedback` paper replication to use `routing_batch_function` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/620
* Add `distilabel_metadata` column to the datasets to include general data by plaguss in https://github.com/argilla-io/distilabel/pull/586
* Add the option of passing the multiprocessing context via env var by plaguss in https://github.com/argilla-io/distilabel/pull/604
* Add name of the pipeline to group the hashed folders by it by plaguss in https://github.com/argilla-io/distilabel/pull/626
* Add `routing_batch_function` serialization by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/628
* Excluding model path in serialization of llamacpp by ignacioct in https://github.com/argilla-io/distilabel/pull/633
* Fix problem with sorting method in `list_files_in_dir` function by plaguss in https://github.com/argilla-io/distilabel/pull/622
* Add `dry_run` method to the pipelines to run with a single example. by plaguss in https://github.com/argilla-io/distilabel/pull/635
* [FEATURE] Add structured outputs using `outlines` by plaguss in https://github.com/argilla-io/distilabel/pull/601
* Force pipeline stop after 2 SIGINT signals caught by plaguss in https://github.com/argilla-io/distilabel/pull/630
* Refactor and update `docs` by alvarobartt in https://github.com/argilla-io/distilabel/pull/634
* Export components info & components gallery in docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/640
* Documentation updates by plaguss in https://github.com/argilla-io/distilabel/pull/646
* Refactor docs 1.1.0 by plaguss in https://github.com/argilla-io/distilabel/pull/650
* Fix routing batch function deadlocks and unordered batches by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/649


**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.0.3...1.1.0
