New type handling capabilities
The most significant change in this release is the introduction of type serializers to unitxt.
Type serializers are in charge of taking a specific type of data structure, such as Table or Dialog, and serializing it to a textual representation.
Now you can define tasks in unitxt that have complex types such as Table or Dialog and define serializers that handle their transformation to text.
This lets you control the representation of different types from the recipe API:
```python
from unitxt import load_dataset
from unitxt.struct_data_operators import SerializeTableAsMarkdown

# Serialize tables as Markdown, shuffling rows reproducibly with a fixed seed
serializer = SerializeTableAsMarkdown(shuffle_rows=True, seed=0)
dataset = load_dataset(card="cards.wikitq", template_card_index=0, serializer=serializer)
```
If you want to serialize the table differently, you can swap in any of the many available [table serializers](https://github.com/IBM/unitxt/blob/80b284fec1954bdf48638d9442c75808cd79a4c5/src/unitxt/struct_data_operators.py#L103-L203).
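To make concrete what a table serializer with seeded row shuffling does, here is a minimal standalone sketch. It is not unitxt's implementation: the function `table_to_markdown` and the sample table are hypothetical, and only the parameter names `shuffle_rows` and `seed` are borrowed from the example above.

```python
import random
from typing import Any, Dict, List


def table_to_markdown(
    table: Dict[str, List[Any]], shuffle_rows: bool = False, seed: int = 0
) -> str:
    """Render a {"header": [...], "rows": [[...], ...]} table as Markdown text."""
    rows = list(table["rows"])  # copy so the caller's table is not mutated
    if shuffle_rows:
        # A dedicated Random instance keeps the order reproducible for a given seed.
        random.Random(seed).shuffle(rows)
    lines = [
        "| " + " | ".join(str(h) for h in table["header"]) + " |",
        "| " + " | ".join("---" for _ in table["header"]) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Made-up sample table, for illustration only.
example = {"header": ["name", "score"], "rows": [["a", 1], ["b", 2], ["c", 3]]}
print(table_to_markdown(example))
print(table_to_markdown(example, shuffle_rows=True, seed=0))
```

Calling the function twice with the same seed yields the same row order, which is the property that makes shuffled serializations comparable across runs.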
Defining a New Type
If you wish to define a new type with custom serializers, you can do so using Python's `typing` library:
```python
from typing import Any, List, TypedDict

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]
```
Once your type is ready, register it with unitxt's type handling in the code you are running:
```python
from unitxt.type_utils import register_type

register_type(Table)
```
Now your type can be used anywhere across unitxt (e.g., in task definitions or serializers).
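One point worth keeping in mind when defining such a type: a `TypedDict` is an ordinary `dict` at runtime, and Python itself does not validate its contents. The sketch below illustrates this with a hand-rolled structural check. The helper `looks_like_table` is hypothetical and is not part of unitxt, and its requirement that every row match the header length is an extra assumption beyond the type definition.

```python
from typing import Any, List, TypedDict


class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]


def looks_like_table(value: Any) -> bool:
    """Hand-rolled structural check; TypedDict performs no runtime validation."""
    if not isinstance(value, dict) or set(value) != {"header", "rows"}:
        return False
    header, rows = value["header"], value["rows"]
    if not isinstance(header, list) or not all(isinstance(h, str) for h in header):
        return False
    # Extra assumption: each row must have one cell per header column.
    return isinstance(rows, list) and all(
        isinstance(row, list) and len(row) == len(header) for row in rows
    )


table: Table = {"header": ["name", "score"], "rows": [["a", 1], ["b", 2]]}
print(type(table) is dict)
print(looks_like_table(table))
print(looks_like_table({"header": ["name"], "rows": [["a", 1]]}))
```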
Defining a Serializer For a Type
To define a serializer for your custom type, or for any combination of `typing` types, subclass `SingleTypeSerializer`:
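```python
from typing import Any, Dict

# Module path assumed; Table is the TypedDict defined above
from unitxt.serializers import SingleTypeSerializer

class MySerializer(SingleTypeSerializer):
    serialized_type = Table

    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        # Your code to turn a value of type Table into a string goes here
        raise NotImplementedError
```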
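As an illustration of what the body of `serialize` might look like, the sketch below fills it in for the `Table` type. So that it runs without unitxt installed, it uses a hypothetical stand-in base class (`StandInSerializer`) that mirrors only the interface shown above; the class `KeyValueTableSerializer` and its output format are likewise invented for this example.

```python
from typing import Any, Dict, List, TypedDict


class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]


class StandInSerializer:
    """Hypothetical stand-in for SingleTypeSerializer, so the sketch runs on its own."""

    serialized_type: Any = None

    def serialize(self, value: Any, instance: Dict[str, Any]) -> str:
        raise NotImplementedError


class KeyValueTableSerializer(StandInSerializer):
    serialized_type = Table

    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        # One line per row, each cell written as "column: value".
        lines = []
        for row in value["rows"]:
            pairs = [f"{col}: {cell}" for col, cell in zip(value["header"], row)]
            lines.append(", ".join(pairs))
        return "\n".join(lines)


table: Table = {"header": ["name", "score"], "rows": [["a", 1], ["b", 2]]}
print(KeyValueTableSerializer().serialize(table, instance={}))
```

The `instance` argument is unused here; it is kept in the signature because the interface above passes it, presumably so a serializer can consult other fields of the example being processed.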
Multi-Modality
You can now process image-text-to-text or image-audio-to-text datasets in unitxt.
For example, to load the doc-vqa dataset:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
)
```
Since unitxt already has data augmentation mechanisms, it is natural to use them for images as well. For example, to convert your images to grey scale:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
    augmentor="augmentors.image.grey_scale",  # just like the text augmentors
)
```
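For readers unfamiliar with the transformation itself, the following sketch shows one common way to compute grey scale from RGB pixels. It is not unitxt's augmentor code: the function `to_grey_scale` is hypothetical, and it uses the ITU-R 601-2 luma weights (the ones Pillow documents for its `L` mode) with integer floor division as a simplification.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grey_scale(pixels: List[List[Pixel]]) -> List[List[int]]:
    """Map each (R, G, B) pixel to a single luma value in the range 0..255."""
    return [
        [(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
        for row in pixels
    ]


# A 2x2 image: white, black, pure red, pure blue.
image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
print(to_grey_scale(image))  # [[255, 0], [76, 29]]
```

Green carries the largest weight because the eye is most sensitive to it, which is why pure red and pure blue both map to fairly dark grey values.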
To score a model on this dataset, you can use:
```python
from unitxt import evaluate
from unitxt.inference import HFLlavaInferenceEngine
from unitxt.text_utils import print_dict

inference_model = HFLlavaInferenceEngine(
    model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
)

# Run inference on a small sample of the test split
test_dataset = dataset["test"].select(range(5))
predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

print_dict(
    evaluated_dataset[0],
    keys_to_print=["source", "media", "references", "processed_prediction", "score"],
)
```
Multi-modality support in unitxt builds on the type handling introduced in the previous section, with two new types: Image and Audio.
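Once the evaluated records are in hand, you may want a single summary number rather than per-instance printouts. The sketch below averages instance-level scores over mock records. The nested layout `record["score"]["instance"]["score"]` is an assumption about the evaluation output that should be checked against the actual records, and the helper name and all values are made up for illustration.

```python
from statistics import mean
from typing import Any, Dict, List

# Mock records standing in for evaluation output; layout and values are assumptions.
mock_evaluated: List[Dict[str, Any]] = [
    {"processed_prediction": "x", "references": ["x"], "score": {"instance": {"score": 1.0}}},
    {"processed_prediction": "y", "references": ["z"], "score": {"instance": {"score": 0.0}}},
    {"processed_prediction": "w", "references": ["w v"], "score": {"instance": {"score": 0.5}}},
]


def mean_instance_score(records: List[Dict[str, Any]]) -> float:
    """Average the per-instance main score across all evaluated records."""
    return mean(record["score"]["instance"]["score"] for record in records)


print(mean_instance_score(mock_evaluated))  # 0.5
```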
What's Changed
* add revision option to hf loader by OfirArviv in https://github.com/IBM/unitxt/pull/1189
* Support dataset field in nested JSON files by antonpibm in https://github.com/IBM/unitxt/pull/1188
* Add TURL Table column type annotation task card by csrajmohan in https://github.com/IBM/unitxt/pull/1186
* Update operators.py - copy edits (grammar, consistency, clarity) by welisheva22 in https://github.com/IBM/unitxt/pull/1187
* Numeric nlg postproc by ShirApp in https://github.com/IBM/unitxt/pull/1185
* Add support for Literal, TypedDict and NewType for unitxt type checking by elronbandel in https://github.com/IBM/unitxt/pull/1191
* Scarebleu metric: remove mecab_ko and mecab_ko_dic from metric requir… by eladven in https://github.com/IBM/unitxt/pull/1197
* Add rag dataset + openai format dialog operator by OfirArviv in https://github.com/IBM/unitxt/pull/1192
* Update README.md by elronbandel in https://github.com/IBM/unitxt/pull/1198
* add decorator with init warning by MikolajCharchut in https://github.com/IBM/unitxt/pull/1200
* Add mock inference mode setting and allow testing without gen ai key by elronbandel in https://github.com/IBM/unitxt/pull/1204
* Fix using OpenAiInferenceEngine for LLMAsJudge by yifanmai in https://github.com/IBM/unitxt/pull/1194
* Add TogetherAiInferenceEngine by yifanmai in https://github.com/IBM/unitxt/pull/1203
* Fix OpenAiInferenceEngine by yifanmai in https://github.com/IBM/unitxt/pull/1193
* Add serializers to templates and reorganize and unite all templates by elronbandel in https://github.com/IBM/unitxt/pull/1195
* Add demos to task_data by elronbandel in https://github.com/IBM/unitxt/pull/1206
* Move test_context_correctness by matanor in https://github.com/IBM/unitxt/pull/1207
* Add image-text to text datasets by elronbandel in https://github.com/IBM/unitxt/pull/1211
* Refactor augmentors to be more scaleable + add image aumgentors by elronbandel in https://github.com/IBM/unitxt/pull/1212
* Fix grey scale augmentor and add to image example by elronbandel in https://github.com/IBM/unitxt/pull/1213
* Add images to UI by elronbandel in https://github.com/IBM/unitxt/pull/1216
* add unified decorator for warnings and unit tests by MikolajCharchut in https://github.com/IBM/unitxt/pull/1209
* Add templates list option to standard recipe by elronbandel in https://github.com/IBM/unitxt/pull/1219
* Use read token for huggingface datasets reading by elronbandel in https://github.com/IBM/unitxt/pull/1223
* add Llava-next system prompt by OfirArviv in https://github.com/IBM/unitxt/pull/1221
* Improve performance for huggingface tokenizer based format by elronbandel in https://github.com/IBM/unitxt/pull/1224
* Fix compute expression to use the instance variables as globals by elronbandel in https://github.com/IBM/unitxt/pull/1217
* Add generic inference engine to allow dynamic selection by the user by eladven in https://github.com/IBM/unitxt/pull/1226
* A suggested PR for issue 1106: More meaningful error message when catalog consistency fails by dafnapension in https://github.com/IBM/unitxt/pull/1201
* Add random templates for bluebench by perlitz in https://github.com/IBM/unitxt/pull/1222
* A suggested PR for issue 1214: fixed a bug in score_prefix for grouped instance scores by dafnapension in https://github.com/IBM/unitxt/pull/1228
* Add control over serizliers from recipe + improve serializers construction + allow seed for table shuffling serizliers by elronbandel in https://github.com/IBM/unitxt/pull/1229
* Fix table tasks to use default table serializers by elronbandel in https://github.com/IBM/unitxt/pull/1230
* Add concurency_limit parameter to WMLInferenceEngine by elronbandel in https://github.com/IBM/unitxt/pull/1231
* Add wml and generic based llmaj metric by perlitz in https://github.com/IBM/unitxt/pull/1227
* Update version to 1.13.0 by elronbandel in https://github.com/IBM/unitxt/pull/1232
New Contributors
* MikolajCharchut made their first contribution in https://github.com/IBM/unitxt/pull/1200
**Full Changelog**: https://github.com/IBM/unitxt/compare/1.12.4...1.13.0