The Big Change
Refactor of dataset builder and executor, see https://github.com/modelscope/data-juicer/pull/537, cyruszhang
YAML explicitly defines the different dataset sources; local and remote sources are defined separately.
More flexible parameterized control: supports source-specific parameters, validations, and extensible configurations.
Removed the Executor's hardcoded format support: it is no longer restricted to local JSON formats, and the input format is determined dynamically via formatters/downloaders.
Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
Added data format validation to ensure consistency and correctness.
Expanded data source support:
a. ModelScope integration.
b. ArXiv dataset import (download, decompress, ingest).
c. Wikipedia dataset support (download, decompress, ingest).
d. Common Crawl integration (download, decompress, ingest).
Backward compatibility with the existing dataset_path command-line syntax.
Support for data mixtures to combine multiple datasets dynamically.
Support for empty formatters/generated datasets without pre-defined config files.
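
To make the ideas of source-specific validation and data mixtures above more concrete, here is a minimal Python sketch. It is not Data-Juicer's implementation or API: the `REQUIRED_KEYS` table, the `validate_source` and `mix` helpers, and the `type`/`source`/`path` keys are all hypothetical names chosen for illustration. It assumes each source is described by a small dict and that a mixture is drawn by sampling sources in proportion to their weights.

```python
import random
from typing import Any, Dict, List

# Hypothetical required keys per source type (an assumption for this sketch,
# not Data-Juicer's actual schema).
REQUIRED_KEYS = {
    "local": {"path"},
    "remote": {"source", "path"},
}


def validate_source(cfg: Dict[str, Any]) -> None:
    """Raise ValueError if a source config lacks a key required for its type."""
    src_type = cfg.get("type")
    if src_type not in REQUIRED_KEYS:
        raise ValueError(f"unknown source type: {src_type!r}")
    missing = REQUIRED_KEYS[src_type] - set(cfg)
    if missing:
        raise ValueError(f"{src_type} source is missing keys: {sorted(missing)}")


def mix(datasets: List[List[dict]], weights: List[float], n: int,
        seed: int = 0) -> List[dict]:
    """Draw n samples, choosing the source dataset in proportion to its weight."""
    if len(datasets) != len(weights):
        raise ValueError("datasets and weights must have the same length")
    rng = random.Random(seed)
    # Pick which dataset each sample comes from, then pick a row from it.
    picks = rng.choices(range(len(datasets)), weights=weights, k=n)
    return [rng.choice(datasets[i]) for i in picks]


if __name__ == "__main__":
    validate_source({"type": "local", "path": "data/demo.jsonl"})
    web = [{"text": "web-1"}, {"text": "web-2"}]
    wiki = [{"text": "wiki-1"}]
    for row in mix([web, wiki], weights=[0.7, 0.3], n=5):
        print(row["text"])
```

Keeping validation per source type separate from mixing means a new source type only needs a new entry in the table, which mirrors the extensibility goal described above.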
Others
New audio processing operator: audio_add_gaussian_noise ([PR 622](https://github.com/modelscope/data-juicer/pull/622)), liuyuhanalex
Added a dynamic coverage rate badge to the README for transparency ([PR 625](https://github.com/modelscope/data-juicer/pull/625))
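
For readers unfamiliar with the technique behind the audio_add_gaussian_noise operator listed above, the following is a rough sketch of adding Gaussian noise to a waveform. It is not the operator's code: the `add_gaussian_noise` function and its `amplitude` and `seed` parameters are hypothetical, and the waveform is assumed to be a plain list of float samples.

```python
import random
from typing import List, Optional


def add_gaussian_noise(samples: List[float], amplitude: float,
                       seed: Optional[int] = None) -> List[float]:
    """Return a copy of samples with zero-mean Gaussian noise added.

    amplitude is the standard deviation of the noise (hypothetical parameter).
    """
    if amplitude < 0:
        raise ValueError("amplitude must be non-negative")
    rng = random.Random(seed)
    # Each sample gets an independent draw from N(0, amplitude^2).
    return [s + rng.gauss(0.0, amplitude) for s in samples]


if __name__ == "__main__":
    clean = [0.0, 0.5, -0.5, 0.25]
    print(add_gaussian_noise(clean, amplitude=0.01, seed=42))
```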