Py-data-juicer

Latest version: v1.2.1

1.2.1

Major Updates
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) DJ has been integrated in [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by [Apache Arrow](https://github.com/apache/arrow/pull/45084).
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) Our work on contrastive data synthesis, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
- Unit test optimization:
  - Split unit tests into partial and regression tests: the partial test is triggered by a PR and runs only the test cases that correspond to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
  - Use the primitive `unittest.skip` and remove `SKIPPED_TESTS`. #586
  - Upload test coverage reports to GitHub artifacts. #586
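
As a rough illustration of the `unittest.skip` approach mentioned above, the sketch below marks one test as skipped with the standard-library decorator instead of keeping a separate skip list. The test class, test names, and skip reason are hypothetical and are not taken from the Data-Juicer test suite.

```python
import unittest


class ImageOpTest(unittest.TestCase):  # hypothetical test case, for illustration only
    @unittest.skip("requires a large model download")  # illustrative reason
    def test_heavy_model(self):
        self.fail("never runs because the test is skipped")

    def test_lightweight(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ImageOpTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    # Skipped tests are reported together with the reason given to the decorator.
    for test, reason in result.skipped:
        print(f"skipped {test.id()}: {reason}")
```

Keeping the skip reason next to the test makes it visible in the test report, which a separate list of skipped test names does not do by itself.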

New OPs

- `image_remove_background_mapper`: Remove the background of images. #589

Others
- Add the missing `LOADED_AUDIOS` to `ALL_INTER_VARS` to enable OP fusion and context sharing. #585
- Only build the doc for py3.10. #586
- Move the dependency on `ray` to the minimal requirements. #586 #594 #595
- Allow the executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
- Fix the undefined `fileno` bug of the logger. #594

Acknowledgement
- liuyuhanalex helped simplify the code logic of OP fusion, added a new OP `image_remove_background_mapper`, and fixed some minor bugs. #581 #585 #589
- co63oc helped fix typos in code and documents. #582 #583 #588 #591 #593
- danielhjz helped fix the implicit memory leak problem in `image_nsfw_filter`. #590

1.2.0

What's New
* 📚 The DJ doc is refactored and improved, e.g., *RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, bad links*
* 🔎 More unit tests added.
* 🎛 The data pre-split and export are improved.
* 🔮 A new data selection method, DaaR, is proposed. See [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380).

Detailed PRs
* fix export error when export_stats columns is null in https://github.com/modelscope/data-juicer/pull/557
* Resplit input dataset in ray mode in https://github.com/modelscope/data-juicer/pull/549
* Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https://github.com/modelscope/data-juicer/pull/561
* Resolve most skipped unit tests in https://github.com/modelscope/data-juicer/pull/559
* fix translation error in https://github.com/modelscope/data-juicer/pull/562
* Add unittest for ray text dedup in https://github.com/modelscope/data-juicer/pull/540
* [Typo]correct a small typo in https://github.com/modelscope/data-juicer/pull/563
* update the 2.0 paper link & the DaaR news in https://github.com/modelscope/data-juicer/pull/566
* Fix typos in https://github.com/modelscope/data-juicer/pull/571
* Optimization for `sdxl_prompt2prompt_mapper` dependency importing in https://github.com/modelscope/data-juicer/pull/570
* Fix typos in https://github.com/modelscope/data-juicer/pull/572

Acknowledgment
* liuyuhanalex and co63oc made their first PRs

**Full Changelog**: https://github.com/modelscope/data-juicer/compare/v1.1.0...v1.2.0

1.1.0

Major Updates
- 🧪 Users can now run Ray-based distributed data processing under the guidance of the added docs. #523
- 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
- 💥 Change Task mode to Actor mode for Ray deduplication, allowing users to use these operators without installing Redis. #526
- 🚀 Append a log summary of warnings and errors at the end of a run so that they remain recognizable under the sample fault tolerance mechanism. #534
- 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
- 🛠 Add usability tags for OPs:
  - `alpha` tag for OPs in which only the basic OP implementations are finished;
  - `beta` tag for OPs in which unit tests are added based on the `alpha` version;
  - `stable` tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the `beta` version.


New OPs
- `image_segment_mapper`: Perform segment-anything on images and return the bounding boxes. #550
- `mllm_mapper`: Mapper to use MLLMs to generate texts for images. #550
- `sdxl_prompt2prompt_mapper`: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
- `sentence_augmentation_mapper`: Augment sentences using LLMs. #550
- `text_pair_similarity_filter`: Filter samples according to the similarity score between the text pair. #550

Bug Fixed
- Add a global `skip_op_error` param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling fault tolerance for unit tests. #528
- Fix the forced model download bug. #529
- Fix the `IndexError` raised when the number of samples in the result dataset is less than the number of workers while saving the dataset to disk. #536
- Fix the missing field meta tag in Ray mode. #538
- Update `max_tokens` or `max_new_tokens` for vllm-based OPs to avoid overly short generations. #544
- Fix a bug in the role-playing data generation demo. #545

Others
- Enhance unit tests for API-calling OPs. #528
- Remove sandbox requirements installation from the Dockerfile. #530
- Update the `datasource`-related APIs to be compatible with the latest version of Ray. #532
- Limit the number of generated QA pairs for each text in `generate_qa_from_text_mapper`. #541
- Update docs in preparation for the DJ2.0 release. #542
- Update to a faster CDN link for the architecture figure. #543
- Add a video demo for role-playing data generation. #545
- Optimize the OP doc for global textual search. #552
- Use a more stable and faster translator than Google Translate for automatic OP doc building. #554

Acknowledgement
* Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

1.0.3

Major Updates
- 💥 Support a **Ray-based MinHashLSH deduplicator**, which implements a **multi-process Union-Find set** based on Ray Actor and the [BTS algorithm](https://ieeexplore.ieee.org/document/10598116) to complete equivalence class merging. #502
- 💥 Support **post-tuning dataset formats** in LLaMA-Factory and ModelScope-Swift.
  - Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
  - Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (`meta`, `stats`) #514 #518
  - Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
- 🚀 Add **10 more post-tuning OPs** to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
- 🚀 Support **Ray Actor mode** for **GPU-based OPs**. #511
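
To make the "equivalence class merging" step of the deduplicator more concrete, here is a minimal single-process Union-Find sketch. It is not the Ray Actor or BTS-based implementation from the release; the class and function names are hypothetical, and the input is assumed to be pairs of sample IDs that an LSH step has already flagged as near-duplicates.

```python
from collections import defaultdict


class UnionFind:
    """Plain in-memory Union-Find with path compression (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Keep the smaller ID as the root so the representative is deterministic.
        if root_a < root_b:
            self.parent[root_b] = root_a
        else:
            self.parent[root_a] = root_b


def duplicate_clusters(pairs):
    """Merge duplicate pairs into equivalence classes of sample IDs."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for sample_id in list(uf.parent):
        groups[uf.find(sample_id)].append(sample_id)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print(duplicate_clusters([(1, 2), (2, 3), (5, 6)]))
```

In a deduplication pass, one sample per cluster (here, the smallest ID) would typically be kept and the rest dropped; a distributed version has to shard this structure across workers, which is the part this sketch leaves out.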

New OPs
Post-tuning OPs for fine-grained analysis of dialog data. #513
Mapper
- `dialog_intent_detection_mapper`: Mapper to generate user's intent labels in feed back dialog data.
- `dialog_sentiment_detection_mapper`: Mapper to generate user's sentiment labels in feed back dialog data.
- `dialog_sentiment_intensity_mapper`: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
- `dialog_topic_detection_mapper`: Mapper to generate user's topic labels in feed back dialog data.
- `query_intent_detection_mapper`: Mapper to predict user's intent label in a query.
- `query_sentiment_detection_mapper`: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- `query_topic_detection_mapper`: Mapper to predict user's topic label in a query.
Aggregator
- `meta_tags_aggregator`: Merge similar meta tags into one tag.
Selector
- `tags_specified_field_selector`: Select samples based on the tags of a specified field.
Grouper
- `naive_reverse_grouper`: Split a batched sample into individual samples.
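
The idea behind `naive_reverse_grouper` (turning one batched sample back into individual samples) can be sketched in plain Python. This is only an illustration of the data transformation, assuming a batched sample is a dict that maps each field name to a list of per-sample values; the function name is hypothetical and this is not the OP's actual code.

```python
def split_batched_sample(batched):
    """Turn {"field": [v1, v2, ...]} into [{"field": v1}, {"field": v2}, ...]."""
    keys = list(batched)
    lengths = {len(batched[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError("all fields of a batched sample must have the same length")
    columns = (batched[key] for key in keys)
    return [dict(zip(keys, values)) for values in zip(*columns)]


if __name__ == "__main__":
    batched = {"text": ["a", "b"], "meta": [{"id": 1}, {"id": 2}]}
    for sample in split_batched_sample(batched):
        print(sample)
```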

Bug Fixed
- Fix the wrong argument passing in `generate_qa_from_example_mapper`. #517
- Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement
* jackylee-ch made their first contribution to help fix several invalid links in the document. #521

**Full Changelog**: https://github.com/modelscope/data-juicer/compare/v1.0.2...v1.0.3

1.0.2

Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators
* `extract_support_text_mapper`, `relation_identity_mapper`, `python_file_mapper`, https://github.com/modelscope/data-juicer/pull/500
* `naive_grouper`, `key_value_grouper`, https://github.com/modelscope/data-juicer/pull/500
* `nested_aggregator`, `entity_attribute_aggregator`, `most_relavant_entities_aggregator`, https://github.com/modelscope/data-juicer/pull/500
* `video_extract_frames_mapper`, https://github.com/modelscope/data-juicer/pull/507

Performance
* Optimize ray mode performance, https://github.com/modelscope/data-juicer/pull/442
* Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
* DJ Ray mode supports streaming loading of `jsonl` files, https://github.com/modelscope/data-juicer/pull/515

Usability and Analysis
* support dj-install in recipe-level, https://github.com/modelscope/data-juicer/pull/508
* support dj-analyze with --auto mode, https://github.com/modelscope/data-juicer/pull/512
* support op-wise insight auto mining, https://github.com/modelscope/data-juicer/pull/516


Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

1.0.1

Major Updates
+ 🚀 Support automatically ordering operators from fastest to slowest based on their execution speed, and automatically setting the operator batch size according to the execution speed. #464
+ 🚀 **[UnitTest]** Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to an internal wandb server. #483
+ 💥 Added some useful OPs, including one for constructing DPO training data and a lightweight user-customizable OP interface. See more details below. #491 #492 #493

OPs
Text OPs
+ `pair_preference_mapper`: Mapper to construct preference answers for QA pairs. #491
Script OPs
+ `python_lambda_mapper`: Mapper for executing customized Python lambda functions on data samples. #492
+ `python_file_mapper`: Mapper for executing customized Python functions on data samples. #493
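
As a hedged sketch of what a lambda-based mapper such as `python_lambda_mapper` might do conceptually, the example below parses a lambda given as a string, checks that it really is a single lambda expression, and applies it to each sample. The helper names are hypothetical and the actual OP's parameters and validation may differ; evaluating user-supplied code like this is only appropriate for trusted input.

```python
import ast


def make_lambda(lambda_str):
    """Compile a string such as "lambda s: ..." into a callable."""
    tree = ast.parse(lambda_str, mode="eval")
    # Reject anything that is not a single lambda expression.
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    return eval(compile(tree, "<lambda_str>", "eval"))


def apply_to_samples(lambda_str, samples):
    """Apply the lambda to every sample dict and return the new samples."""
    func = make_lambda(lambda_str)
    return [func(sample) for sample in samples]


if __name__ == "__main__":
    samples = [{"text": "abc"}, {"text": "hello"}]
    print(apply_to_samples("lambda s: {**s, 'length': len(s['text'])}", samples))
```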

Bugs Fixed
- Add an argument to control whether to enable `Monitor` for data processing. It's True by default. #483
- Set the multiprocessing start method of the monitor to "spawn" on Windows and "fork" on other systems. #483
- Update the transformers version to >=4.47.0 to avoid the "shape mismatch" bug from the older version 4.46.3. #483
- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
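
The start-method rule for the monitor described above ("spawn" on Windows, "fork" elsewhere) can be sketched with the standard `multiprocessing` module. The function names below are hypothetical and this is not the project's code; it only shows one way to pick a context per platform.

```python
import multiprocessing as mp
import platform


def monitor_start_method():
    """Return "spawn" on Windows (where "fork" is unavailable), else "fork"."""
    return "spawn" if platform.system() == "Windows" else "fork"


def get_monitor_context():
    """Return a multiprocessing context that uses the chosen start method."""
    return mp.get_context(monitor_start_method())


if __name__ == "__main__":
    ctx = get_monitor_context()
    print(ctx.get_start_method())
```

Using a dedicated context object, rather than calling `multiprocessing.set_start_method` globally, keeps the choice local to the monitor and leaves the rest of the program's multiprocessing configuration untouched.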

Others
- Pin the PyAV version to prevent inconsistent updates. #504
- Skip some unit tests for audio OPs to avoid lazy_loader failure during multiprocessing. #503
- Remove unnecessary UNFORKABLE marks for some OPs. #491
- Refine the docker image building: add a new self-hosted runner for docker image building, optimize the building logic for automatic docker image building on release, and change the default full image to a GPU-version image. #494 #501

Acknowledgment
Here we thank public contributors for their PRs and issues to make Data-Juicer better!
