The main improvements in this version focus on **caching strategies, dataset loading, and speed optimizations**.
Hugging Face Datasets Caching Policy
We have **completely revised our caching policy** and how we handle Hugging Face datasets in order to improve performance.
1. **Hugging Face datasets are now cached by default.**
The `LoadHF` loader now caches downloaded datasets in the HF cache directory (typically `~/.cache/huggingface/datasets`).
- To disable this caching mechanism, use:
```python
import unitxt

unitxt.settings.disable_hf_datasets_cache = True
```
2. **All Hugging Face datasets are first downloaded and then processed.**
- The entire dataset is downloaded before processing, which is faster for most datasets. However, if you need to process a very large dataset and the HF dataset supports streaming, you can load it **in streaming mode**:
```python
from unitxt.loaders import LoadHF

# `path` identifies the dataset on the Hugging Face Hub
LoadHF(path="my-dataset", streaming=True)
```
3. **To enable streaming mode by default for all Hugging Face datasets, use:**
```python
import unitxt

unitxt.settings.stream_hf_datasets_by_default = True
```
While the **new defaults (full download & caching)** may make the **initial dataset load slower**, subsequent loads will be **significantly faster**.
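The trade-off between full download with caching and streaming can be illustrated with a small standard-library sketch. This is not Unitxt or Hugging Face code: `fetch_remote`, `load_rows`, and `REMOTE_ROWS` are hypothetical names, and the "remote" source is simulated in memory.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator

# Hypothetical stand-in for a remote dataset; a real loader would use the network.
REMOTE_ROWS = [{"id": i, "text": f"example {i}"} for i in range(5)]
fetch_count = 0  # number of simulated remote fetches that actually started


def fetch_remote() -> Iterator[dict]:
    """Simulate reading rows from a remote source, one row at a time."""
    global fetch_count
    fetch_count += 1
    yield from REMOTE_ROWS


def load_rows(cache_dir: Path, streaming: bool = False, use_cache: bool = True):
    """Toy loader: stream lazily, or download everything and optionally cache it."""
    if streaming:
        return fetch_remote()  # lazy iterator, nothing is written to disk
    cache_file = cache_dir / "dataset.jsonl"
    if use_cache and cache_file.exists():
        with cache_file.open() as f:
            return [json.loads(line) for line in f]
    rows = list(fetch_remote())  # full download
    if use_cache:
        cache_file.write_text("".join(json.dumps(r) + "\n" for r in rows))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = load_rows(cache_dir)   # downloads and writes the cache
    second = load_rows(cache_dir)  # served from the cache file
    fetches_after_cached_loads = fetch_count
    streamed = load_rows(cache_dir, streaming=True)
    first_streamed_row = next(iter(streamed))  # pulls a single row only
```

In this toy version the second non-streaming load never touches the simulated remote source, while the streaming call returns an iterator that yields rows on demand and leaves nothing on disk.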
Unitxt Datasets Caching Policy
By default, when loading datasets with `unitxt.load_dataset`, the dataset is **prepared from scratch** each time you call the function.
This ensures that any changes made to the card definition are reflected in the output.
- This process may take a few seconds, and for **large datasets**, repeated loading can accumulate overhead.
- If you are using fixed datasets from the catalog, you can **enable caching** of the prepared Unitxt datasets so they are not rebuilt on every call.
The prepared datasets are stored in the Hugging Face cache (typically `~/.cache/huggingface/datasets`).
```python
from unitxt import load_dataset

ds = load_dataset(card="my_card", use_cache=True)
```
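To make the trade-off concrete, here is a minimal sketch of opt-in caching keyed by a hash of the definition's `repr`, so that an edited definition gets a new cache entry. It is an illustration only, not Unitxt's implementation: `ToyCard`, `prepare`, `cache_key`, and `load_toy_dataset` are hypothetical names, and the cache is an in-memory dict rather than files on disk.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyCard:
    """Hypothetical stand-in for a card definition; not a Unitxt class."""
    source: str
    template: str


_prepared: Dict[str, List[str]] = {}  # in-memory stand-in for an on-disk cache
prepare_calls = 0  # counts how often the expensive preparation runs


def prepare(card: ToyCard) -> List[str]:
    """Simulate the expensive preparation step."""
    global prepare_calls
    prepare_calls += 1
    return [card.template.format(text=t) for t in ("a", "b")]


def cache_key(card: ToyCard) -> str:
    """Derive a key from the definition, so edits produce a different key."""
    return hashlib.sha256(repr(card).encode("utf-8")).hexdigest()


def load_toy_dataset(card: ToyCard, use_cache: bool = False) -> List[str]:
    """Prepare from scratch by default; reuse a prior result when use_cache is set."""
    if not use_cache:
        return prepare(card)
    key = cache_key(card)
    if key not in _prepared:
        _prepared[key] = prepare(card)
    return _prepared[key]


card = ToyCard(source="src", template="Q: {text}")
load_toy_dataset(card)
load_toy_dataset(card)
calls_without_cache = prepare_calls  # prepared twice

load_toy_dataset(card, use_cache=True)
load_toy_dataset(card, use_cache=True)
calls_with_cache = prepare_calls  # only one additional preparation

edited = ToyCard(source="src", template="Question: {text}")
edited_rows = load_toy_dataset(edited, use_cache=True)  # new key, prepared again
```

Keying on the definition is what keeps cached results from going stale in this sketch: the edited card hashes to a different key and is prepared again instead of returning the old rows.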
Faster Unitxt Dataset Preparation
To improve dataset loading speed, we have optimized how Unitxt datasets are prepared.
Background:
Unitxt datasets are converted to Hugging Face datasets because Hugging Face datasets store data on disk and keep only the necessary parts in memory (via **PyArrow**). This enables efficient handling of large datasets without excessive memory usage.
Previously, `unitxt.load_dataset` used **built-in Hugging Face methods** for dataset preparation, which included **unnecessary type handling and verification**, slowing down the process.
Key improvements:
- We now **create the Hugging Face dataset directly**, reducing preparation time by **almost 50%**.
- With this optimization, **Unitxt datasets are now faster than ever!**
What's Changed
* End of year summary blog post by elronbandel in https://github.com/IBM/unitxt/pull/1530
* Updated documentation and examples of LLM-as-Judge by tejaswini in https://github.com/IBM/unitxt/pull/1532
* Eval assist documentation by tejaswini in https://github.com/IBM/unitxt/pull/1537
* Update notification banner styles and add 2024 summary blog link by elronbandel in https://github.com/IBM/unitxt/pull/1538
* Add more granite llm as judge artifacts by martinscooper in https://github.com/IBM/unitxt/pull/1516
* Fix Australian legal qa dataset by elronbandel in https://github.com/IBM/unitxt/pull/1542
* Set use 1 shot for wikitq in tables_benchmark by yifanmai in https://github.com/IBM/unitxt/pull/1541
* Bugfix: indexed row major serialization fails with None cell values by yifanmai in https://github.com/IBM/unitxt/pull/1540
* Solve issue of expired token in Unitxt Assistant by eladven in https://github.com/IBM/unitxt/pull/1543
* Add Replicate inference support by elronbandel in https://github.com/IBM/unitxt/pull/1544
* add a filter to wikitq by ShirApp in https://github.com/IBM/unitxt/pull/1547
* Add text2sql tasks by perlitz in https://github.com/IBM/unitxt/pull/1414
* Add deduplicate operator by elronbandel in https://github.com/IBM/unitxt/pull/1549
* Fix the authentication problem by eladven in https://github.com/IBM/unitxt/pull/1550
* Attach assistant answers to their origins with url link by elronbandel in https://github.com/IBM/unitxt/pull/1528
* Add mtrag benchmark by elronbandel in https://github.com/IBM/unitxt/pull/1548
* Update end of year summary blog by elronbandel in https://github.com/IBM/unitxt/pull/1552
* Add data classification policy to CrossProviderInferenceEngine initialization based on selected model by elronbandel in https://github.com/IBM/unitxt/pull/1539
* Fix recently broken rag metrics by elronbandel in https://github.com/IBM/unitxt/pull/1554
* Renamed criterias in LLM-as-a-Judge metrics to criteria - Breaking change by tejaswini in https://github.com/IBM/unitxt/pull/1545
* Finqa hash to top by elronbandel in https://github.com/IBM/unitxt/pull/1555
* Refactor safety metric to be faster and updated by elronbandel in https://github.com/IBM/unitxt/pull/1484
* Improve assistant by elronbandel in https://github.com/IBM/unitxt/pull/1556
* Feature/add global mmlu cards by eliyahabba in https://github.com/IBM/unitxt/pull/1561
* Add quality dataset by eliyahabba in https://github.com/IBM/unitxt/pull/1563
* Add CollateInstanceByField operator to group data by specific field by sarathsgvr in https://github.com/IBM/unitxt/pull/1546
* Fix prompts table benchmark by ShirApp in https://github.com/IBM/unitxt/pull/1565
* Create new IntersectCorrespondingFields operator by pklpriv in https://github.com/IBM/unitxt/pull/1531
* Add granite documents format by elronbandel in https://github.com/IBM/unitxt/pull/1566
* Revisit huggingface cache policy - BREAKING CHANGE by elronbandel in https://github.com/IBM/unitxt/pull/1564
* Add global mmlu lite sensitivity cards by eliyahabba in https://github.com/IBM/unitxt/pull/1568
* Add schema-linking by KyleErwin in https://github.com/IBM/unitxt/pull/1533
* fix the printout of empty strings in the yaml cards of the catalog by dafnapension in https://github.com/IBM/unitxt/pull/1567
* Use repr instead of to_json for unitxt dataset caching by elronbandel in https://github.com/IBM/unitxt/pull/1570
* Added key value extraction evaluation and example with images by yoavkatz in https://github.com/IBM/unitxt/pull/1529
New Contributors
* tejaswini made their first contribution in https://github.com/IBM/unitxt/pull/1532
* KyleErwin made their first contribution in https://github.com/IBM/unitxt/pull/1533
**Full Changelog**: https://github.com/IBM/unitxt/compare/1.17.0...1.18.0