Major Updates
+ Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify the existing invoking interfaces while introducing new ones. On top of this, we add a fault-tolerance strategy for data processing that helps users find the actual reasons for processing failures. 359 366
+ **[Experimental]** Data-Juicer Sandbox toolkit is now available! Users can co-develop datasets and models with the highly customizable Sandbox to obtain better performance. For more details, please refer to the [docs](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md). #273 291 312 332 364
+ A basic API server based on FastAPI is now available in Data-Juicer! Users can now invoke the capabilities of OPs through an API service. 468
+ Support adaptive resource management:
- Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. 270 329 354
- Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. 429
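
As a rough illustration of the first item only, the sketch below caps the number of worker processes by how many model copies fit into free GPU memory. It is not Data-Juicer's actual logic: the function name `adaptive_num_proc` and all of its parameters are hypothetical, and the memory figures are made-up inputs.

```python
def adaptive_num_proc(free_gpu_mem_gb, mem_required_gb, cpu_count, requested=None):
    """Hypothetical helper: bound the worker count by GPU memory and CPU cores."""
    if mem_required_gb <= 0:
        raise ValueError("mem_required_gb must be positive")
    # How many copies of the model fit into the free GPU memory.
    by_memory = int(free_gpu_mem_gb // mem_required_gb)
    # Never exceed the CPU core count; fall back to one worker if nothing fits.
    limit = max(1, min(by_memory, cpu_count))
    if requested is None:
        return limit
    # Respect a user request, but only up to the computed limit.
    return max(1, min(requested, limit))


# Assumed example: 24 GB free, each model copy needs 5 GB, 16 CPU cores.
num_proc = adaptive_num_proc(24, 5, 16)
```

A user-specified value is treated here as an upper bound rather than a guarantee, which is one possible design choice among several.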
+ We presented a [tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) on _Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases_ at KDD'24. #310
+ A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below.
+ A playground for Data-Juicer is now open for users to try out. 277 368
OPs
Text
+ `ray_document_deduplicator`: supports Ray-based distributed exact-match deduplication for text-only datasets. 263
+ Support sentencepiece tokenizer for MinHash deduplicators. 269
+ `generate_qa_from_text_mapper`: generates question and answer pairs from input texts. 333 454
+ `generate_qa_from_examples_mapper`: generates question and answer pairs based on examples. 338 454
+ `optimize_qa_mapper`: optimizes the question-answer pairs in question-answering samples. 338 454
+ `optimize_query_mapper`: optimizes the query in question-answering samples. 338 454
+ `optimize_response_mapper`: optimizes the response in question-answering samples. 454
+ `calibrate_qa_mapper`: calibrates question-answer pairs based on reference text. 463
+ `calibrate_query_mapper`: calibrates the query in question-answer pairs based on reference text. 463
+ `calibrate_response_mapper`: calibrates the response in question-answer pairs based on reference text. 463
+ `text_chunk_mapper`: splits input text into chunks. 481
+ `extract_entity_attribute_mapper`: extracts attributes for given entities from the text. 481
+ `extract_entity_relation_mapper`: extracts entities and relations in the text for knowledge graph. 481
+ `extract_event_mapper`: extracts events and relevant characters in the text. 481
+ `extract_keyword_mapper`: generates keywords for the text. 481
+ `extract_nickname_mapper`: extracts nickname relationships in the text. 481
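
For readers unfamiliar with the exact-match deduplication mentioned for `ray_document_deduplicator` above, here is a minimal single-process sketch of the general technique: hash each text and keep only the first sample per hash. It does not show the Ray-based distributed implementation, and the function name `exact_dedup` and the `text_key` parameter are assumptions made for illustration.

```python
import hashlib


def exact_dedup(samples, text_key="text"):
    """Keep the first sample for each distinct text (illustrative only)."""
    seen = set()
    kept = []
    for sample in samples:
        # Identical texts always produce identical digests.
        digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


docs = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
unique_docs = exact_dedup(docs)
```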
Image
+ `image_face_blur_mapper`: blurs faces detected in images. 249
+ `image_nsfw_filter`: keeps samples containing images with NSFW scores below the threshold. 252
+ `image_watermark_filter`: keeps samples containing images with predicted watermark probabilities below the threshold. 256
+ `ray_image_deduplicator`: supports Ray-based distributed exact-match deduplication for image or image-text datasets. 263
+ `image_pair_similarity_filter`: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. 393
+ `image_tagging_mapper`: generates image tags from the input images. 423
+ `image_face_count_filter`: keeps samples containing images with face counts within the specified range. 446
Video
+ `video_face_blur_mapper`: blurs faces detected in videos. 253
+ `video_remove_watermark_mapper`: removes the watermarks in given regions from the videos. 236
+ `video_nsfw_filter`: keeps samples containing videos with NSFW scores below the threshold. 252
+ `video_watermark_filter`: keeps samples containing videos with predicted watermark probabilities below the threshold. 256
+ `ray_video_deduplicator`: supports Ray-based distributed exact-match deduplication for video or video-text datasets. 263
+ `video_tagging_from_frames_filter`: keeps samples containing videos with given tags. 260
+ `video_captioning_from_frames_mapper`: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. 257
+ `video_captioning_from_summarizer_mapper`: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). 250
+ `video_motion_score_raft_filter`: keeps samples with video motion scores (based on RAFT model) within a specific range. 478
+ Enhance the `video_motion_score_filter` to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. 361
Misc.
+ Switch face detection used in 3 OPs (`image_face_ratio_filter`, `image_face_blur_mapper`, `video_face_blur_mapper`) from `dlib` to `OpenCV` to avoid dependency problems. 320
+ Deduplicators for multimodal datasets are allowed to consider text information as well. 313
+ Support batched processing for some OPs. 406 435
Others (Engine, Job Control and Tools)
+ Support more multimodal (video) dataset conversion tools: MSR-VTT 248
+ Support distributed processing script for Slurm. 242
+ Support MinHash-LSH deduplication tools based on Spark. 290
+ Enable GPU usage for Ray executor. 274
+ Add debug mode for Data-Juicer. 303
+ Add video generation tools for several metrics. 273 312
+ Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. 304
+ Add sampled frames from videos for video OPs to support OP fusion. 271
+ Allow saving stats for each OP separately by specifying export paths for them. 309
+ Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. 317
+ Support `turbo` mode, which disables some processing-unrelated functions to maximize the processing speed and reduce resource utilization. 402
+ Update type annotations from `jsonargparse` to `Pydantic`. 422
+ Add a Monitor module to monitor the resource utilization during data processing for each OP. 429
+ Allow lazy importing of third-party libraries and installing dependencies that are not yet installed. 414 443
+ Allow batched processing for all OPs based on the single-sample version of the `compute_stats`/`process` methods, so they do not need to be rewritten into a batched version manually. 448
+ Enable unit test coverage report. 460
+ Support invoking API models for interaction with OpenAI-compatible APIs. 463 479
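
The idea of deriving batched processing from a single-sample method, as mentioned above, can be sketched as follows. This is only an illustration under assumptions: it supposes a batch is a dict mapping column names to equal-length lists, and the names `batched_from_single` and `lowercase_text` are hypothetical, not Data-Juicer APIs.

```python
def batched_from_single(process_single):
    """Wrap a single-sample function so it accepts a dict-of-lists batch."""

    def process_batched(batch):
        keys = list(batch.keys())
        num_samples = len(batch[keys[0]]) if keys else 0
        if num_samples == 0:
            # Preserve the columns of an empty batch.
            return {key: [] for key in keys}
        output = {}
        for i in range(num_samples):
            # Rebuild one sample from the i-th entry of every column.
            sample = {key: batch[key][i] for key in keys}
            result = process_single(sample)
            # Collect the processed sample back into columns.
            for key, value in result.items():
                output.setdefault(key, []).append(value)
        return output

    return process_batched


def lowercase_text(sample):
    # Hypothetical single-sample OP logic.
    return {**sample, "text": sample["text"].lower()}


result = batched_from_single(lowercase_text)({"text": ["A", "Bc"], "id": [1, 2]})
```

The wrapper trades some speed for convenience, since each sample is still processed one at a time inside the batch.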
Document Updates
+ Refine documentation system based on Sphinx. 245
+ Regular document updates. 234 246
+ Update the class importing and document building logics for better automation. 299
+ Reorganize the operator documents for better reading. 472
Bugs Fixed
+ Fix the bug of non-existent videos returned by the video splitting function given a short duration. 243
+ Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. 247
+ Fix some problems in demos. 244
+ Fix "Undefined punctuation_pattern" error in two OPs. 301
+ Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. 287
+ Fix the bug that type hint checking for config files did not work. 302
+ Fix the bug that parameters in the base classes could not be parsed in some OPs. 311
+ Fix the memory leak in video OPs. 374
+ Fix the bug that two OPs (`video_aesthetics_filter` and `image_diffusion_mapper`) could not make use of GPUs. 389
+ Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. 391
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
+ chg0901 helps to fix typos in documents. 237
+ lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. 289
+ shiweijiezero helps fix the bugs in updating the data keys. 300
+ seanzhang-zhichen helps to support multiple patterns for `replace_content_mapper`. 319
+ simplaj helps to fix a bug of a non-predefined attribute for `video_captioning_from_summarizer_mapper`. 343
+ zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. 352 381 456 461
+ 2108038773 helps to add `trust_remote_code` argument for some public models on HuggingFace. 382 385
+ TobyJasper helps to fix typos in documents and contributes a new OP `image_face_count_filter`. 392 452
+ co63oc helps to fix some typos in documents and code. 427