Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Improved distributed-mode performance and usability with more automatic features.
DJ-Operators
* `extract_support_text_mapper`, `relation_identity_mapper`, `python_file_mapper`, https://github.com/modelscope/data-juicer/pull/500
* `naive_grouper`, `key_value_grouper`, https://github.com/modelscope/data-juicer/pull/500
* `nested_aggregator`, `entity_attribute_aggregator`, `most_relavant_entities_aggregator`, https://github.com/modelscope/data-juicer/pull/500
* `video_extract_frames_mapper`, https://github.com/modelscope/data-juicer/pull/507
Performance
* Optimize Ray mode performance, https://github.com/modelscope/data-juicer/pull/442
* Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
* DJ Ray mode supports streaming loading of `jsonl` files, https://github.com/modelscope/data-juicer/pull/515
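As a rough illustration of what streaming loading of `jsonl` means in general, the sketch below reads records line by line and groups them into batches, so the whole file never has to sit in memory. It uses only the Python standard library and is not Data-Juicer's or Ray's implementation; the names `iter_jsonl` and `iter_batches` are hypothetical and chosen for this example only.

```python
import io
import json
from typing import Iterator, List, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def iter_batches(records: Iterator[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a record stream into lists of at most batch_size items."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch


# In-memory stand-in for a jsonl file; a real file handle from open() works the same way.
sample = io.StringIO('{"text": "a"}\n\n{"text": "b"}\n{"text": "c"}\n')
batches = list(iter_batches(iter_jsonl(sample), batch_size=2))
```

Because both helpers are generators, only one batch is held in memory at a time, which is the property that matters for large inputs.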
Usability and Analysis
* Support `dj-install` at the recipe level, https://github.com/modelscope/data-juicer/pull/508
* Support `dj-analyze` with `--auto` mode, https://github.com/modelscope/data-juicer/pull/512
* Support OP-wise automatic insight mining, https://github.com/modelscope/data-juicer/pull/516
Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!