Py-data-juicer

Latest version: v1.3.0

1.3.0

The Big Change 🚀
Refactor of dataset builder and executor, see https://github.com/modelscope/data-juicer/pull/537, cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Removed the Executor's hard-coded format support: it is no longer restricted to local JSON formats; the input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.
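
To make the data-mixture item above more concrete, here is a minimal sketch of weighted mixing in plain Python. It is not Data-Juicer's implementation or its configuration syntax; `mix_datasets`, the `(samples, weight)` pairs, and the `seed` argument are hypothetical names chosen for illustration.

```python
import random


def mix_datasets(sources, total, seed=0):
    """Draw `total` samples from several datasets in proportion to their weights.

    `sources` is a list of (samples, weight) pairs. Both the function and this
    input layout are illustrative assumptions, not a Data-Juicer API.
    """
    rng = random.Random(seed)
    weights = [weight for _, weight in sources]
    mixed = []
    for _ in range(total):
        # Pick a source according to its weight, then one sample from it.
        samples, _ = rng.choices(sources, weights=weights, k=1)[0]
        mixed.append(rng.choice(samples))
    return mixed


local_data = [{"text": "local-1"}, {"text": "local-2"}]
remote_data = [{"text": "remote-1"}]
mixture = mix_datasets([(local_data, 0.7), (remote_data, 0.3)], total=10, seed=42)
```

Sampling with replacement keeps the sketch short; a real mixer might instead cap each source at its own size.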

Others 💡
🔊 New audio processing operator: audio_add_gaussian_noise ([PR 622](https://github.com/modelscope/data-juicer/pull/622)), liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency ([PR 625](https://github.com/modelscope/data-juicer/pull/625))
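
As a rough illustration of what an operator such as `audio_add_gaussian_noise` does conceptually, the sketch below adds zero-mean Gaussian noise to a waveform at a target signal-to-noise ratio. It uses only the Python standard library and is not the operator's actual code; the function name, the `snr_db` parameter, and the `seed` parameter are assumptions made for this example.

```python
import math
import random


def add_gaussian_noise(waveform, snr_db=20.0, seed=None):
    """Return a copy of `waveform` with Gaussian noise at roughly `snr_db` dB SNR.

    Hypothetical helper for illustration only.
    """
    if not waveform:
        return []
    rng = random.Random(seed)
    signal_power = sum(x * x for x in waveform) / len(waveform)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in waveform]


# A 440 Hz sine sampled at 16 kHz for 0.1 s, then corrupted at 10 dB SNR.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_gaussian_noise(clean, snr_db=10.0, seed=0)
```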

1.2.2

Major Updates
- 🧪 Add documentation for the API service. Parameters are transmitted using `json.dumps` to support API calls for arbitrary registered functions and classes. 613
- 🚀 Add unit tests for the analysis module and utils module to increase test coverage. 604 616
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., *16%* gain on [MathVision](https://mathllm.github.io/mathvision/#leaderboard) using only *400 samples*). See more details in [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499).
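
The `json.dumps`-based parameter transmission mentioned in the first item can be pictured with the following sketch. It only demonstrates the general idea of JSON-encoding each argument into a query string and decoding it before dispatch; the registry, `encode_params`, and `dispatch` are hypothetical and do not reflect the actual API service.

```python
import json
from urllib.parse import parse_qs, urlencode

REGISTRY = {}  # hypothetical name -> callable table


def register(fn):
    """Register a callable under its own name (illustrative only)."""
    REGISTRY[fn.__name__] = fn
    return fn


@register
def scale(values, factor=1):
    return [v * factor for v in values]


def encode_params(**kwargs):
    # JSON-encode each value so lists, dicts, numbers and booleans keep their types.
    return urlencode({key: json.dumps(value) for key, value in kwargs.items()})


def dispatch(name, query):
    # Decode every query parameter back into a Python object, then call the target.
    kwargs = {key: json.loads(vals[0]) for key, vals in parse_qs(query).items()}
    return REGISTRY[name](**kwargs)


query = encode_params(values=[1, 2, 3], factor=2)
result = dispatch("scale", query)
```

Encoding every value with JSON, rather than passing raw strings, is what lets one generic endpoint serve functions with differently typed arguments.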

New OPs
- `llm_quality_score_filter`: Filter that keeps samples with a high quality score estimated by an LLM; supports both API calls and local vLLM calls. 606 614 620
- `llm_difficulty_score_filter`: Filter that keeps samples with a high difficulty score estimated by an LLM; supports both API calls and local vLLM calls. 606 614 620

Others
- Fix config in LLaVa pretrain recipe. 610
- Update news for MindGYM and fix doc. 615
- Fix decode error through UTF-8 decoding. 618

1.2.1

Major Updates
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) DJ has been integrated in [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by [Apache Arrow](https://github.com/apache/arrow/pull/45084).
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) Our work on contrastive data synthesis, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
- Unit test optimization:
  - Split unit tests into partial and regression tests: a partial test is triggered by a PR and only runs the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. 598
  - Use the built-in `unittest.skip` and remove `SKIPPED_TESTS`. 586
  - Upload test coverage reports to GitHub artifacts. 586
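
For readers unfamiliar with the standard-library decorator referenced above, this is a small, self-contained sketch of `unittest.skip`. The test class and skip reason are invented for the example and are not taken from the Data-Juicer test suite.

```python
import io
import unittest


class ExampleOpTest(unittest.TestCase):
    def test_fast_path(self):
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skip("needs a GPU model that is unavailable in CI")
    def test_gpu_path(self):
        self.fail("never executed because the test is skipped")


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleOpTest)
    # Write the runner's report to an in-memory buffer to keep the output quiet.
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_suite()
```

Marking the skip next to the test keeps the reason visible in the test report, which a separate skip list cannot do.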

New OPs

- `image_remove_background_mapper`: remove the background of images. 589

Others
- add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. 585
- only build doc for py3.10. 586
- move dependency on `ray` to minimal requirements. 586 594 595
- allow executor and other tool functions to consume a loaded dataset in addition to the config file. 596 597
- fix undefined `fileno` bug of the logger. 594

Acknowledgement
- liuyuhanalex helps simplify the code logic of OP fusion, add a new OP `image_remove_background_mapper`, and fix some minor bugs. 581 585 589
- co63oc helps to fix typos in code and documents. 582 583 588 591 593
- danielhjz helps to fix the implicit memory leak problem in `image_nsfw_filter`. 590

1.2.0

What's New
* 📚 The DJ doc is refactored and improved, e.g., *RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, bad links*
* 🔎 More unit-tests added.
* 🎛 The data pre-split and export are improved.
* 🔮 A new data selection method, DaaR, is proposed. See [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380).

Detailed PRs
* fix export error when export_stats columns is null in https://github.com/modelscope/data-juicer/pull/557
* Resplit input dataset in ray mode in https://github.com/modelscope/data-juicer/pull/549
* Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https://github.com/modelscope/data-juicer/pull/561
* Resolve most skipped unit-tests in https://github.com/modelscope/data-juicer/pull/559
* fix translation error in https://github.com/modelscope/data-juicer/pull/562
* Add unittest for ray text dedup in https://github.com/modelscope/data-juicer/pull/540
* [Typo]correct a small typo in https://github.com/modelscope/data-juicer/pull/563
* update the 2.0 paper link & the DaaR news in https://github.com/modelscope/data-juicer/pull/566
* Fix typos in https://github.com/modelscope/data-juicer/pull/571
* Optimization for sdxl_prompt2prompt_mapper dependency importing in https://github.com/modelscope/data-juicer/pull/570
* Fix typos in https://github.com/modelscope/data-juicer/pull/572

Acknowledgment
* liuyuhanalex and co63oc made their first PRs

**Full Changelog**: https://github.com/modelscope/data-juicer/compare/v1.1.0...v1.2.0

1.1.0

Major Updates
- 🧪 Users can now run ray-based distributed data processing by following the newly added docs. 523
- 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. 542
- 💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. 526
- 🚀 Append a summary of warnings and errors at the end of each run so that they remain noticeable under the sample fault-tolerance mechanism. 534
- 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. 527
- 🛝 Add usability tags for OPs:
  - `alpha` tag for OPs in which only the basic OP implementations are finished;
  - `beta` tag for OPs in which unittests are added based on the `alpha` version;
  - `stable` tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the `beta` version.
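
The end-of-run summary of warnings and errors described in this list could look roughly like the sketch below. It uses Python's standard `logging` module, which may differ from the logging library the project actually uses, and the `SummaryHandler` class is a hypothetical name introduced for this example.

```python
import logging
from collections import Counter


class SummaryHandler(logging.Handler):
    """Collect WARNING-and-above records so they can be reported at the end of a run."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.counts = Counter()
        self.messages = []

    def emit(self, record):
        self.counts[record.levelname] += 1
        self.messages.append(f"{record.levelname}: {record.getMessage()}")

    def summary(self):
        header = ", ".join(f"{n} {level}" for level, n in sorted(self.counts.items()))
        return "\n".join([f"Run summary: {header or 'no warnings or errors'}", *self.messages])


logger = logging.getLogger("dj_summary_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output self-contained
handler = SummaryHandler()
logger.addHandler(handler)

logger.info("processing started")            # below WARNING, so not collected
logger.warning("sample 17 has empty text")
logger.error("sample 42 failed in an OP")
report = handler.summary()
```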

New OPs
- `image_segment_mapper`: Perform segment-anything on images and return the bounding boxes. 550
- `mllm_mapper`: Mapper to use MLLMs to generate texts for images. 550
- `sdxl_prompt2prompt_mapper`: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. 550
- `sentence_augmentation_mapper`: Augment sentences using LLMs. 550
- `text_pair_similarity_filter`: Filter samples according to the similarity score between the text pair. 550

Bug Fixed
- Add a global `skip_op_error` parameter that enables fault tolerance when running the Data-Juicer analyzer and executor; fault tolerance stays disabled for unit tests. 528
- Fix model force download bug. 529
- Fix `IndexError` if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. 536
- Fix missing field meta tag on ray mode. 538
- Update `max_tokens` or `max_new_tokens` for vllm-based OPs to avoid overly short generations. 544
- Fix bug in the role playing data generation demo. 545
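
The effect of a switch like `skip_op_error` can be sketched as follows. Only the parameter name comes from the release notes; `run_op`, its return value, and the decision to drop failing samples are assumptions for illustration, and the real behavior in Data-Juicer may differ.

```python
def run_op(op, samples, skip_op_error=True):
    """Apply `op` to every sample.

    With skip_op_error=True a failing sample is recorded and skipped;
    with skip_op_error=False the first exception propagates, which is
    the stricter behavior one would want in unit tests.
    """
    kept, failures = [], []
    for index, sample in enumerate(samples):
        try:
            kept.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                raise
            failures.append((index, repr(exc)))
    return kept, failures


def word_count(sample):
    # Raises AttributeError when "text" is None.
    return {**sample, "num_words": len(sample["text"].split())}


samples = [{"text": "hello world"}, {"text": None}, {"text": "ok"}]
kept, failures = run_op(word_count, samples)
```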

Others
- Enhance unit test for API calling OPs. 528
- Remove sandbox requirements installation from Dockerfile. 530
- Update the `datasource` related APIs to be compatible with the latest version of Ray. 532
- Limit the generated qa num for each text in `generate_qa_from_text_mapper`. 541
- Update docs for preparing DJ2.0 release. 542
- Update a quick cdn link for arch figure. 543
- Add a video demo for role playing data generation. 545
- Optimize op doc for global textual search. 552
- Use a faster and more stable translator than Google Translate for automatic OP doc building. 554

Acknowledgement
* Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. 550

1.0.3

Major Updates
- 💥 Support **Ray-based MinHashLSH deduplicator**, which implements a **multi-process Union-Find set** based on Ray Actor and the [BTS algorithm](https://ieeexplore.ieee.org/document/10598116) to complete equivalence class merging. #502
- 💥 Support **post-tuning dataset formats** in LLaMA-Factory and ModelScope-Swift.
  - Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. 514
  - Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (`meta`, `stats`) 514 518
  - Provide several format conversion tools for converting to Data-Juicer format and vice versa. 514
- 🚀 Add **10 more post-tuning OPs** to process post-tuning datasets better. They are listed in detail in the New OPs section below. 513
- 🚀 Support **Ray Actor mode** for **GPU-based OPs**. 511
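
The equivalence class merging mentioned for the MinHashLSH deduplicator rests on the Union-Find (disjoint-set) structure. The sketch below is a plain single-process version for illustration only: it does not use Ray Actors or the BTS algorithm, and `UnionFind` and `deduplicate` are hypothetical names rather than Data-Juicer classes.

```python
class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so each class
            # is always represented by its smallest member.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def deduplicate(ids, duplicate_pairs):
    """Keep one representative (the smallest id) per group of near-duplicates."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in ids if uf.find(i) == i]


# Pairs as they might come out of LSH bucketing: 1~2, 2~3 and 5~4 are duplicates.
survivors = deduplicate(range(6), [(1, 2), (2, 3), (5, 4)])
```

Here `survivors` is `[0, 1, 4]`: documents 2 and 3 collapse into 1, and 5 collapses into 4.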

New OPs
Post-tuning OPs for fine-grained analysis of dialog data. 513
Mapper
- `dialog_intent_detection_mapper`: Mapper to generate user's intent labels in feedback dialog data.
- `dialog_sentiment_detection_mapper`: Mapper to generate user's sentiment labels in feedback dialog data.
- `dialog_sentiment_intensity_mapper`: Mapper to predict user's sentiment intensity (from -5 to 5 in the default prompt) in feedback dialog data.
- `dialog_topic_detection_mapper`: Mapper to generate user's topic labels in feedback dialog data.
- `query_intent_detection_mapper`: Mapper to predict user's intent label in a query.
- `query_sentiment_detection_mapper`: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- `query_topic_detection_mapper`: Mapper to predict user's topic label in a query.
Aggregator
- `meta_tags_aggregator`: Merge similar meta tags into one tag.
Selector
- `tags_specified_field_selector`: Select samples based on the tags of a specified field.
Grouper
- `naive_reverse_grouper`: Split batched samples into individual samples.

Bug Fixed
- Fix the wrong argument passing in `generate_qa_from_example_mapper`. 517
- Update the out-of-date Dingding QR code on the main page. 513

Acknowledgement
* jackylee-ch made their first contribution to help fix several invalid links in the document. 521

**Full Changelog**: https://github.com/modelscope/data-juicer/compare/v1.0.2...v1.0.3
