Two new tasks implemented!
`Genstruct` task (https://github.com/argilla-io/distilabel/pull/600)
You can now use the `Genstruct` task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from raw documents:
```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    # Each row provides the `title` and `content` of a raw document.
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            # Pass the prompt through untouched instead of wrapping it in chat markup.
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    # Connect the loader to the task.
    load_hub_dataset >> task
```
`PrometheusEval` task (https://github.com/argilla-io/distilabel/pull/610)
A new `PrometheusEval` task has been added, based on the recently published paper ["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/abs/2405.01535):
```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    # Rename the dataset columns to the inputs the task expects.
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
```
Connect the steps in the pipeline with `>>` (https://github.com/argilla-io/distilabel/pull/490)
You can now connect your steps using Python's right-shift operator (`>>`), which also accepts a list of steps to fan out to:
```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)

    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    # Fan out to both tasks, then merge their outputs.
    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
```
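For readers curious how an operator like `>>` can wire steps together, here is a minimal, library-independent sketch. The `Node` class and its `downstream` attribute are hypothetical names used only for illustration; this is not distilabel's implementation, just a demonstration of Python's `__rshift__` and `__rrshift__` hooks.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for a pipeline step (not distilabel's class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: list[Node] = []

    def connect(self, other: Node) -> None:
        self.downstream.append(other)

    def __rshift__(self, other: Node | list[Node]) -> Node | list[Node]:
        # Handles `node >> other` and `node >> [a, b]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.connect(target)
        # Return the right-hand side so that chains keep flowing left to right.
        return other

    def __rrshift__(self, other: list[Node]) -> Node:
        # Handles `[a, b] >> node`: Python falls back to the right operand
        # because `list` does not define `__rshift__`.
        for source in other:
            source.connect(self)
        return self


load = Node("load")
a, b = Node("a"), Node("b")
combine = Node("combine")

load >> [a, b] >> combine
```

After the last line, `load` feeds both `a` and `b`, and each of them feeds `combine`, which mirrors the fan-out and fan-in shape of the pipeline above.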
Routing batch function (https://github.com/argilla-io/distilabel/pull/595)
Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a specific subset of downstream steps. In addition, we have included a `sample_n_steps` routing batch function, which makes it easier to replicate the setup of the original UltraFeedback paper:
```python
import random

from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration


# Receives the names of the downstream steps and returns the ones
# that should get the current batch.
@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    # Each batch goes to two of the three tasks, chosen at random.
    load_dataset >> sample_two_steps >> tasks >> combine_generations
```
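The effect of a sampling routing function can be illustrated without the library. The sketch below is a plain-Python approximation: `make_sample_n` and `route` are hypothetical helpers, not distilabel's `sample_n_steps`, and the step names are placeholders. It only shows how each batch ends up at a random subset of downstream steps.

```python
import random
from typing import Callable, Optional


def make_sample_n(n: int, seed: Optional[int] = None) -> Callable[[list[str]], list[str]]:
    """Return a function that picks `n` distinct step names per call."""
    rng = random.Random(seed)  # private RNG so a seeded run is reproducible

    def sample(steps: list[str]) -> list[str]:
        return rng.sample(steps, n)

    return sample


def route(
    batches: list[list[dict]],
    steps: list[str],
    choose: Callable[[list[str]], list[str]],
) -> dict[str, list[int]]:
    """Record which batch indices each downstream step would receive."""
    assignment: dict[str, list[int]] = {step: [] for step in steps}
    for index, _batch in enumerate(batches):
        for step in choose(steps):
            assignment[step].append(index)
    return assignment


steps = ["gpt-4", "mistral-large", "gemini-pro"]  # placeholder step names
batches = [[{"instruction": f"prompt {i}"}] for i in range(4)]
assignment = route(batches, steps, make_sample_n(2, seed=0))
```

Every batch index appears under exactly two step names in `assignment`, so with three candidate models each prompt is answered by two of them.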
Generate structured outputs using `outlines` (https://github.com/argilla-io/distilabel/pull/601)
You can generate outputs constrained to a `JSON` schema or a `regex` using `TransformersLLM`, `LlamaCppLLM` or `vLLM`, thanks to the integration with [`outlines`](https://github.com/outlines-dev/outlines):
```python
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


# Schema that every generated character must satisfy.
class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )

    load_dataset >> text_generation
```
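To see what the `Character` schema above requires of a generation, the following sketch re-checks a JSON string using only the standard library. The `validate_character` helper and the sample string are hypothetical and written for this illustration; in a real project, pydantic's `Character.model_validate_json` would perform the equivalent check.

```python
import json

# Allowed values, mirroring the `Weapon` and `Armor` enums above.
WEAPONS = ("sword", "axe", "mace", "spear", "bow", "crossbow")
ARMORS = ("leather", "chainmail", "plate", "mithril")


def validate_character(raw: str) -> dict:
    """Parse `raw` and check it against the constraints of `Character`."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    name = data.get("name")
    if not isinstance(name, str) or len(name) > 30:
        raise ValueError("name must be a string of at most 30 characters")

    age = data.get("age")
    # `bool` is a subclass of `int`, so it is excluded explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 1 < age < 3000:
        raise ValueError("age must be an integer with 1 < age < 3000")

    if data.get("armor") not in ARMORS:
        raise ValueError("unknown armor")
    if data.get("weapon") not in WEAPONS:
        raise ValueError("unknown weapon")
    return data


character = validate_character(
    '{"name": "Gimli", "age": 139, "armor": "chainmail", "weapon": "axe"}'
)
```

With constrained generation the model is steered toward outputs of this shape during decoding, so a check like this is mainly useful as a safety net or in tests.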
New `GroqLLM` (https://github.com/argilla-io/distilabel/pull/583)
New integration with [Groq](https://console.groq.com/docs/quickstart). Special mention to kcentric, who did the initial work prior to the refactor for 1.0.0.
```python
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
```
Easily test your pipeline doing a `dry_run` (https://github.com/argilla-io/distilabel/pull/635)
```python
with Pipeline(...) as pipeline:
    ...

distiset = pipeline.dry_run(
    parameters=...,  # The same argument as `Pipeline.run`
    batch_size=1,  # Optional, will be set to 1 by default.
)
```
```
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
                    INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
```
`pipeline.log` file is dumped to the Hugging Face repository (https://github.com/argilla-io/distilabel/pull/568)
From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file is automatically uploaded to your dataset repository alongside `pipeline.yaml`, so you can keep track of the execution.
New `distilabel_metadata` column to store internal data (https://github.com/argilla-io/distilabel/pull/586)
You can now optionally add a metadata column to your dataset. This column may store other things in the future, but for the moment it is handy for keeping the raw output of an LLM: if a task post-processes the output via `format_output`, the original text is preserved so nothing is lost.
You can include the metadata at the task level as:
```python
TextGeneration(..., add_raw_output=True|False)
```
You can also decide at the pipeline level whether this column appears in your final `Distiset`:
```python
with Pipeline(..., enable_metadata=True|False):
    ...
```
This way you can drop the column from the final dataset altogether.
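To make the motivation concrete, the sketch below mimics a task whose `format_output` post-processes the LLM text while the untouched text is kept in a metadata column. It is plain Python: the `process` helper, the toy `format_output`, and the `raw_output` key are illustrative assumptions, not distilabel's internals.

```python
from typing import Any


def format_output(raw: str) -> dict[str, Any]:
    """Toy post-processing: keep only the text after an 'Answer:' marker."""
    _, _, answer = raw.partition("Answer:")
    return {"generation": answer.strip() or None}


def process(raw: str, add_raw_output: bool = False) -> dict[str, Any]:
    """Build an output row, optionally preserving the raw LLM text."""
    row = format_output(raw)
    if add_raw_output:
        # Keep the original text so nothing is lost if parsing drops content.
        row["distilabel_metadata"] = {"raw_output": raw}
    return row


raw = "Let me think step by step... Answer: 42"
with_metadata = process(raw, add_raw_output=True)
without_metadata = process(raw)
```

Here `with_metadata` carries both the parsed `generation` and the full original text, while `without_metadata` only has the parsed value.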
All the changes in this release
* Allow nested connect calls and overload rshift method to connect steps by plaguss in https://github.com/argilla-io/distilabel/pull/490
* Fix `llm_blender` installation by alvarobartt in https://github.com/argilla-io/distilabel/pull/557
* Warn user about unknown runtime parameters by plaguss in https://github.com/argilla-io/distilabel/pull/555
* Add missing `model_name`, update docstrings, and add `*.jinja2` templates to `Task` subclasses by alvarobartt in https://github.com/argilla-io/distilabel/pull/560
* Split `ChatGeneration` from `TextGeneration` by alvarobartt in https://github.com/argilla-io/distilabel/pull/558
* Set `extra="forbid"` in `{_Step,LLM}.model_config` by alvarobartt in https://github.com/argilla-io/distilabel/pull/577
* Infer step name by plaguss in https://github.com/argilla-io/distilabel/pull/575
* Change the context of subprocesses depending on the platform by plaguss in https://github.com/argilla-io/distilabel/pull/578
* Dump logs within a file in .cache/distilabel/pipelines dir by plaguss in https://github.com/argilla-io/distilabel/pull/568
* Fix empty batches causing misalignment when branching by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/590
* Add `GroqLLM` by alvarobartt in https://github.com/argilla-io/distilabel/pull/583
* Add `Format{Chat,Text}Generation{DPO,SFT}` by alvarobartt in https://github.com/argilla-io/distilabel/pull/584
* Fix `title` in `RatingQuestion` of `PreferenceToArgilla` by alvarobartt in https://github.com/argilla-io/distilabel/pull/597
* Set `streaming=False` and add `num_examples` to `LoadHubDataset` by plaguss in https://github.com/argilla-io/distilabel/pull/565
* Make `pipeline` argument of `Step` optional by plaguss in https://github.com/argilla-io/distilabel/pull/566
* Extend `LLM` kwargs to align with counterparts by alvarobartt in https://github.com/argilla-io/distilabel/pull/594
* Add `Genstruct` task by alvarobartt in https://github.com/argilla-io/distilabel/pull/600
* Fix `num_examples` to be optional in `LoadHubDataset` by plaguss in https://github.com/argilla-io/distilabel/pull/603
* Fix `list_files_in_dir` returning unsorted files by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/609
* Add `PrometheusEval` task by alvarobartt in https://github.com/argilla-io/distilabel/pull/610
* Update `ValueError` on missing inputs message by alvarobartt in https://github.com/argilla-io/distilabel/pull/617
* Add `routing_batch_function` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/595
* Fix `pipeline.log` inconsistency & include LLM info in signature by plaguss in https://github.com/argilla-io/distilabel/pull/598
* Add custom `rubrics` attribute to `PrometheusEval` by alvarobartt in https://github.com/argilla-io/distilabel/pull/621
* Update `UltraFeedback` paper replication to use `routing_batch_function` by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/620
* Add `distilabel_metadata` column to the datasets to include general data by plaguss in https://github.com/argilla-io/distilabel/pull/586
* Add the option of passing the multiprocessing context via env var by plaguss in https://github.com/argilla-io/distilabel/pull/604
* Add name of the pipeline to group the hashed folders by it by plaguss in https://github.com/argilla-io/distilabel/pull/626
* Add `routing_batch_function` serialization by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/628
* Excluding model path in serialization of llamacpp by ignacioct in https://github.com/argilla-io/distilabel/pull/633
* Fix problem with sorting method in `list_files_in_dir` function by plaguss in https://github.com/argilla-io/distilabel/pull/622
* Add `dry_run` method to the pipelines to run with a single example by plaguss in https://github.com/argilla-io/distilabel/pull/635
* [FEATURE] Add structured outputs using `outlines` by plaguss in https://github.com/argilla-io/distilabel/pull/601
* Force pipeline stop after 2 SIGINT signals caught by plaguss in https://github.com/argilla-io/distilabel/pull/630
* Refactor and update `docs` by alvarobartt in https://github.com/argilla-io/distilabel/pull/634
* Export components info & components gallery in docs by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/640
* Documentation updates by plaguss in https://github.com/argilla-io/distilabel/pull/646
* Refactor docs 1.1.0 by plaguss in https://github.com/argilla-io/distilabel/pull/650
* Fix routing batch function deadlocks and unordered batches by gabrielmbmb in https://github.com/argilla-io/distilabel/pull/649
**Full Changelog**: https://github.com/argilla-io/distilabel/compare/1.0.3...1.1.0