Py-data-juicer

Latest version: v1.3.0

1.0.2

Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators
* `extract_support_text_mapper`, `relation_identity_mapper`, `python_file_mapper`, https://github.com/modelscope/data-juicer/pull/500
* `naive_grouper`, `key_value_grouper`, https://github.com/modelscope/data-juicer/pull/500
* `nested_aggregator`, `entity_attribute_aggregator`, `most_relavant_entities_aggregator`, https://github.com/modelscope/data-juicer/pull/500
* `video_extract_frames_mapper`, https://github.com/modelscope/data-juicer/pull/507

Performance
* Optimize ray mode performance, https://github.com/modelscope/data-juicer/pull/442
* Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
* DJ Ray mode supports streaming loading of `jsonl` files, https://github.com/modelscope/data-juicer/pull/515
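
The idea behind streaming `jsonl` loading can be sketched in plain Python: samples are parsed line by line and yielded lazily instead of reading the whole file into memory. This is only an illustration of the concept; the helper name `stream_jsonl` is hypothetical and this is not how the Ray mode implements it.

```python
import json
import os
import tempfile
from typing import Iterator


def stream_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed sample per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Tiny demo on a temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"text": "a"}\n\n{"text": "b"}\n')

samples = list(stream_jsonl(tmp.name))
os.remove(tmp.name)
print(samples)
```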

Usability and Analysis
* Support `dj-install` at the recipe level, https://github.com/modelscope/data-juicer/pull/508
* Support `dj-analyze` with `--auto` mode, https://github.com/modelscope/data-juicer/pull/512
* Support op-wise automatic insight mining, https://github.com/modelscope/data-juicer/pull/516

Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

1.0.1

Major Updates
+ 🚀 Supports automatically reordering operators from fastest to slowest based on their execution speed, and automatically setting each operator's batch size according to that speed. 464
+ 🚀 **[UnitTest]** Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to internal wandb server. 483
+ 💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ 491 492 493
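
The speed-based operator ordering mentioned in the first item can be pictured roughly as a sort over measured throughput. The sketch below is an assumption-laden illustration: `reorder_ops_by_speed`, the operator names, and the throughput numbers are all hypothetical, and it ignores any constraints Data-Juicer may apply on which operators are allowed to move.

```python
from typing import Dict, List


def reorder_ops_by_speed(ops: List[str], samples_per_sec: Dict[str, float]) -> List[str]:
    """Return ops sorted from fastest to slowest.

    Ops without a measurement are treated as speed 0 and end up last;
    sorted() is stable, so ties keep their original relative order.
    """
    return sorted(ops, key=lambda name: -samples_per_sec.get(name, 0.0))


# Hypothetical operators and measured throughputs.
ops = ["slow_filter", "fast_filter", "medium_filter"]
speeds = {"slow_filter": 10.0, "fast_filter": 1000.0, "medium_filter": 100.0}

ordered = reorder_ops_by_speed(ops, speeds)
print(ordered)
```

Running cheap filters first means the expensive ones see fewer samples, which is where the saving comes from.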

OPs
Text OPs
+ `pair_preference_mapper`: Mapper to construct preference answers for QA pairs. 491
Script OPs
+ `python_lambda_mapper`: Mapper for executing customized Python lambda functions on data samples. 492
+ `python_file_mapper`: Mapper for executing customized Python functions on data samples. 493
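
As a rough picture of what a lambda-based mapper does, the sketch below turns a lambda given as a string into a callable and applies it to samples. It is not the OP's actual code: `build_lambda` is a hypothetical helper, and because `eval` runs arbitrary code, such a string should only come from a trusted config.

```python
import ast
from typing import Callable


def build_lambda(lambda_str: str) -> Callable[[dict], dict]:
    """Parse a string and return it as a function, accepting only a single lambda."""
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    return eval(compile(tree, "<lambda_str>", "eval"))


# Hypothetical sample and lambda: upper-case the text field.
fn = build_lambda("lambda sample: {**sample, 'text': sample['text'].upper()}")
mapped = [fn(s) for s in [{"text": "hello"}]]
print(mapped)
```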

Bugs Fixed
- Add an argument to control whether to open `Monitor` for data processing. It's True by default. 483
- For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. 483
- Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. 483
- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. 504
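
The start-method rule from the second fix above ("spawn" on Windows, "fork" elsewhere) can be expressed with the standard library alone. This is a minimal sketch; `pick_start_method` is a hypothetical helper name, not a Data-Juicer function.

```python
import multiprocessing as mp
import platform


def pick_start_method() -> str:
    """Use "spawn" on Windows (where "fork" is unavailable) and "fork" otherwise."""
    return "spawn" if platform.system() == "Windows" else "fork"


# A context object scopes the choice without changing the global default.
ctx = mp.get_context(pick_start_method())
print(ctx.get_start_method())
```

Processes for the monitor would then be created through `ctx.Process(...)` rather than `mp.Process(...)`.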

Others
- Pin the PyAV version to prevent inconsistent updates. 504
- Skip some unit tests for audio OPs to avoid lazy_loader failure during multiprocessing. 503
- Remove unnecessary UNFORKABLE marks for some OPs. 491
- Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. 494 501

Acknowledgment
Here we thank public contributors for their PRs and issues to make Data-Juicer better!

1.0.0

Major Updates
+ 🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. 359 366
+ 🧪 **[Experimental]** Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the [docs](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md). #273 291 312 332 364
+ 🚀 Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. 468
+ 🚀 Support adaptive resource management:
  - Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. 270 329 354
  - Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. 429
+ 💥 We presented a [tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) of _Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases_ on KDD'24. #310
+ 💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
+ 🛝 A playground for Data-Juicer is opened for user trial. 277 368

OPs
Text
+ `ray_document_deduplicator`: supports Ray-based distributed exact-match deduplication for text-only datasets. 263
+ Support sentencepiece tokenizer for MinHash deduplicators. 269
+ `generate_qa_from_text_mapper`: generates question and answer pairs from input texts. 333 454
+ `generate_qa_from_examples_mapper`: generates question and answer pairs based on examples. 338 454
+ `optimize_qa_mapper`: optimizes the question-answer pairs in question-answering samples. 338 454
+ `optimize_query_mapper`: optimizes the query in question-answering samples. 338 454
+ `optimize_response_mapper`: optimizes the response in question-answering samples. 454
+ `calibrate_qa_mapper`: calibrates question-answer pairs based on reference text. 463
+ `calibrate_query_mapper`: calibrates query in question-answer pairs based on reference text. 463
+ `calibrate_response_mapper`: calibrates response in question-answer pairs based on reference text. 463
+ `text_chunk_mapper`: splits input text to chunks. 481
+ `extract_entity_attribute_mapper`: extracts attributes for given entities from the text. 481
+ `extract_entity_relation_mapper`: extracts entities and relations in the text for knowledge graph. 481
+ `extract_event_mapper`: extracts events and relevant characters in the text. 481
+ `extract_keyword_mapper`: generates keywords for the text. 481
+ `extract_nickname_mapper`: extracts nickname relationships in the text. 481
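
Exact-match deduplication, as performed by `ray_document_deduplicator` in distributed form, boils down to keeping the first sample for each content hash. The single-process sketch below shows only that core idea; the function name `dedup_exact` and the choice of SHA-256 are assumptions, not the OP's implementation.

```python
import hashlib
from typing import Iterable, List


def dedup_exact(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each text, comparing by content hash."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


unique = dedup_exact(["a", "b", "a"])
print(unique)
```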

Image
+ `image_face_blur_mapper`: blurs faces detected in images. 249
+ `image_nsfw_filter`: keeps samples containing images with NSFW scores below the threshold. 252
+ `image_watermark_filter`: keeps samples containing images with predicted watermark probabilities below the threshold. 256
+ `ray_image_deduplicator`: supports Ray-based distributed exact-match deduplication for image or image-text datasets. 263
+ `image_pair_similarity_filter`: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. 393
+ `image_tagging_mapper`: generates image tags from the input images. 423
+ `image_face_count_filter`: keeps samples containing images with face counts within the specified range. 446

Video
+ `video_face_blur_mapper`: blurs faces detected in videos. 253
+ `video_remove_watermark_mapper`: removes the watermarks in given regions from the videos. 236
+ `video_nsfw_filter`: keeps samples containing videos with NSFW scores below the threshold. 252
+ `video_watermark_filter`: keeps samples containing videos with predicted watermark probabilities below the threshold. 256
+ `ray_video_deduplicator`: supports Ray-based distributed exact-match deduplication for video or video-text datasets. 263
+ `video_tagging_from_frames_filter`: keeps samples containing videos with given tags. 260
+ `video_captioning_from_frames_mapper`: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. 257
+ `video_captioning_from_summarizer_mapper`: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). 250
+ `video_motion_score_raft_filter`: keeps samples with video motion scores (based on RAFT model) within a specific range. 478
+ Enhance the `video_motion_score_filter` to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. 361

Misc.
+ Switch face detection used in 3 OPs (`image_face_ratio_filter`, `image_face_blur_mapper`, `video_face_blur_mapper`) from `dlib` to `OpenCV` to avoid dependency problems. 320
+ Deduplicators for multimodal datasets are allowed to consider text information as well. 313
+ Support batched processing for some OPs. 406 435

Others (Engine, Job Control and Tools)
+ Support more multimodal (video) dataset conversion tools: MSR-VTT 248
+ Support distributed processing script for Slurm. 242
+ Support Minhash-LSH deduplication tools based on Spark. 290
+ Enable GPU usage for Ray executor. 274
+ Add debug mode for Data-Juicer. 303
+ Add video generation tools for several metrics. 273 312
+ Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. 304
+ Add sampled frames from videos for video OPs to support OP fusion. 271
+ Allow saving stats for each OP separately by specifying an export path for each of them. 309
+ Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. 317
+ Support `turbo` mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. 402
+ Update type annotations from `jsonargparse` to `Pydantic`. 422
+ Add a Monitor module to monitor the resource utilization during data processing for each OP. 429
+ Allow lazy importing for third-party libraries and installing dependencies if they are not installed. 414 443
+ Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. 448
+ Enable unit test coverage report. 460
+ Support invoking API models for interaction with OpenAI-compatible APIs. 463 479

Document Updates
+ Refine documentation system based on Sphinx. 245
+ Regular document updates. 234 246
+ Update the class importing and document building logics for better automation. 299
+ Reorganize the operator documents for better reading. 472

Bugs Fixed
+ Fix the bug of non-existent videos returned by the video splitting function given a short duration. 243
+ Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. 247
+ Fix some problems in demos. 244
+ Fix "Undefined punctuation_pattern" error in two OPs. 301
+ Fix error propagation: exceptions and errors are now re-raised to the upper level and the status code is returned to the system correctly. 287
+ Fix the bug that type hint checking for config files did not work. 302
+ Fix the bug that parameters in the base classes cannot be parsed in some OPs. 311
+ Fix the memory leaking of video OPs. 374
+ Fix the bug of two OPs (`video_aesthetics_filter` and `image_diffusion_mapper`) that can not make use of GPUs. 389
+ Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. 391

Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!

+ chg0901 helps to fix typos in documents. 237
+ lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. 289
+ shiweijiezero helps fix the bugs in updating the data keys. 300
+ seanzhang-zhichen helps to support multiple patterns for `replace_content_mapper`. 319
+ simplaj helps to fix a bug of a non-predefined attribute for `video_captioning_from_summarizer_mapper`. 343
+ zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. 352 381 456 461
+ 2108038773 helps to add `trust_remote_code` argument for some public models on HuggingFace. 382 385
+ TobyJasper helps to fix typos in documents and contribute a new OP `image_face_count_filter`. 392 452
+ co63oc helps to fix some typos in documents and code. 427

0.2.0

New Features
- 🚀 We introduce [**DJ-SORA**](https://github.com/alibaba/data-juicer/blob/main/docs/DJ_SORA.md) to provide open large-scale, high-quality datasets for SORA-like models. #227
- 🚀 We introduce hundreds of dedicated video, image, audio, text, and other **multi-modal** data processing [**operators**](https://github.com/alibaba/data-juicer/blob/main/docs/Operators.md) and **tools**.
- 💥 Our paper has been accepted by **SIGMOD'24 industrial track**! 211
- 💥 "**BetterMixture**" — Our second data-centric LLM competition has kicked off and is about to end soon. 174

New OPs
Multimodal

- `video_frames_text_similarity_filter`: keeps samples whose similarities between sampled video frame images and text are within a specific range. 227
- `video_tagging_from_frames_mapper`: generates video tags from frames extracted from the video. 227
- `video_tagging_from_audio_mapper`: generates video tags from audio streams extracted from videos. 227
- `video_captioning_from_video_mapper`: generates captions from frame images extracted from video to augment datasets. 227
- `video_captioning_from_audio_mapper`: captions a video according to its audio streams. 227
- `image_captioning_mapper`: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. 131 191 227
- `image_captioning_from_gpt4v_mapper`: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. 214 227
- `image_diffusion_mapper`: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. 200
Video
Filter

- `video_duration_filter`: keeps samples whose videos' durations are within a specified range. 227
- `video_aspect_ratio_filter`: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. 227
- `video_resolution_filter`: filters samples according to the resolution of videos in them. 227
- `video_ocr_area_ratio_filter`: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. 227
- `video_aesthetics_filter`: filters samples according to the aesthetics score of frame images extracted from videos. 227
- `video_motion_score_filter`: keeps samples with video motion scores within a specific range. 227
Mapper

- `video_split_by_scene_mapper`: splits videos into scene clips. 227
- `video_split_by_duration_mapper`: splits videos by specified duration interval. 227
- `video_split_by_key_frame_mapper`: splits videos by their keyframes. 227
- `video_resize_aspect_ratio_mapper`: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. 227
- `video_resize_resolution_mapper`: maps videos to ones with a given resolution range. 227
- `video_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to video data more conveniently. 227
Deduplicator

- `video_deduplicator`: deduplicates samples at document-level using exact matching of videos between documents. 227
Audio

- `audio_duration_filter`: keeps samples whose audios' durations are within a specified range. 177
- `audio_size_filter`: keeps samples whose audios' sizes are within a specified range. 184
- `audio_nmf_snr_filter`: keeps samples whose audios' Signal Noise Ratios (computed based on Non-Negative Matrix Factorization algorithm) are within a specified range. 189
- `audio_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to audio data more conveniently. 227
Image

- `image_blur_mapper`: adds random noises to images to blur them. 180
- `image_aesthetics_filter`: filters samples according to the aesthetics scores of images. 227
Document Updates

- "Bad" Data Exhibition [EN](https://github.com/alibaba/data-juicer/blob/main/docs/BadDataExhibition.md) [ZH](https://github.com/alibaba/data-juicer/blob/main/docs/BadDataExhibition_ZH.md): shows how Data-Juicer finds those "bad" data and what they look like.
- Awesome LLM Data [EN](https://github.com/alibaba/data-juicer/blob/main/docs/awesome_llm_data.md): a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement [EN](https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide.md) [ZH](https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide_ZH.md): adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 220
- OP Insight Visualization Demo [code](https://github.com/alibaba/data-juicer/blob/main/demos/data_visualization_op_insight): adds a demo to visualize how each OP works.
Bugs Fixed

- Fix stats computation error in the ray mode due to the inappropriate initialization method. 173
- Fix the bug that some images will be lost when converting their paths to absolute paths. 178
- Fix the dependency problems of OPs that depend on other OPs. 181
- Fix the bug that the `predict.py` tool gets stuck on the help page. 183
- Fix `face_area_filter`: constrains the detection coordinates within the image. 202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. 195
- Fix or update invalid links in Data-Juicer. 201 219
Others

- Optimize the model management module. 196 227
- Optimize the unit test actions. 195 196 216 227
- Optimize the multiprocessing strategy; model inference efficiency can also be improved thanks to GPU support. 203 217 222 227
- Update the docker image with JDK. 208
- Support more multimodal (video) dataset conversion tools: 227
  - InternVid: 234M video-caption data
  - Youku-mPLUG: 36TB video-caption data
  - Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. 227
- Support running data-juicer process jobs on Aliyun PAI-DLC. 227
- Better support for multi-machine distributed data processing in Ray mode. 227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!

- liuyanyi helps to fix a bug in quality classifier tools. 183
- co63oc helps to fix some typos. 215
- liuyanyi helps to provide the solution to add JDK in the docker image. 182 208
- zhenqincn helps to add more papers to the Awesome LLM Data doc. 226

0.1.3

New Features
- Data-Juicer now supports Python3.7-3.10!
  - We released a pybind version of the simhash-py library named `simhash-pybind` to solve the Python version limitation problem.
  - We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
- Multimodal dataset analysis and processing are now supported. 64 91 95 106
  - A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
  - Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ......
  - Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
- Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. 65 140
- Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP `replace_content_mapper`. 143
- Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. 160
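
A 3-sigma rule, as mentioned for the Auto-HPO tools, is commonly read as keeping values within three standard deviations of the mean. The sketch below computes such a range for a stat; the helper name `three_sigma_range`, the use of the population standard deviation, and the sample values are assumptions, and the release notes do not say how the tool itself applies the rule.

```python
import statistics
from typing import Sequence, Tuple


def three_sigma_range(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - 3*std, mean + 3*std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample stat values, e.g. text lengths.
low, high = three_sigma_range([1, 2, 3, 4, 5])
print(low, high)
```

The resulting bounds could then serve as the min/max hyperparameters of a range-based filter.
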
New OPs
Text
- `chinese_convert_mapper`: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by [opencc](https://github.com/BYVoid/OpenCC)) #51
- `remove_non_chinese_character_mapper`: removes non-Chinese characters in text samples. 51
- `text_action_filter`: keeps samples containing action verbs in their texts. 122
- `text_entity_dependency_filter`: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. 122
- `replace_content_mapper`: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. 143
- `remove_repeat_sentences_mapper`: removes repeated sentences in the text. 149
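
What `replace_content_mapper` is described as doing, replacing every regex match with a designated string, corresponds to `re.sub` in plain Python. The helper below is a hypothetical stand-in to show the behavior, including the empty-string default that the older cleaning mappers were limited to; the patterns are illustrative only.

```python
import re


def replace_content(text: str, pattern: str, repl: str = "") -> str:
    """Replace every match of `pattern` in `text` with `repl` (default: remove it)."""
    return re.sub(pattern, repl, text)


masked = replace_content("mail me at a@b.com", r"\S+@\S+", "<EMAIL>")
removed = replace_content("ip 1.2.3.4 end", r"\d+\.\d+\.\d+\.\d+")
print(masked)
print(removed)
```
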
Image
- `image_shape_filter`: keeps samples containing images with widths and heights within the specified ranges. 74
- `image_aspect_ratio_filter`: keeps samples containing images with aspect ratios (w/h) within the specified range. 64
- `image_size_filter`: keeps samples containing images whose sizes in bytes are within the specified range. 73
- `face_area_filter`: keeps samples containing images with face area ratios within the specified range. 110
- `image_deduplicator`: deduplicates samples at document-level using exact matching of images between documents. 72
Multimodal
- `image_text_similarity_filter`: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. 69
- `image_text_matching_filter`: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. 100
- `phrase_grounding_recall_filter`: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. 139
Bugs fixed
- Pin `pandas==2.0.0` and `fsspec==2023.3.0` to avoid unexpected errors from third-party dependencies. 38 42
- Fix the bug when OPs `nlpaug_en_mapper` and `nlpcda_zh_mapper` generate indefinite numbers of augmented samples. 76
- Fix the bug that `maximum_line_length_filter` might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. 147
- Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. 155 157
- Fix the bug of commandline arguments parsing error in some cases. 108 165
- Store simhash value as string type to avoid errors from PyArrow. 168 170
Others
- Dependency importing optimization: only require and import some dependencies when using. 35 82
- Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. 42 54
- Optimize the cache directory selection logic. 43
- Support limiting the number of samples when mixing datasets. 86
- Avoid extra unnecessary model preparation when enabling tokenization in some OPs. 99
- OP `language_id_score_filter` supports keeping samples in multiple languages now. 125 151
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
- JONGSKY helps to remove some unnecessary code. 85
- xuruidong helps to fix several broken links in the README doc. 142

0.1.2

New OPs
- `nlpaug_en_mapper`: simple data augmentation using [nlpaug](https://github.com/makcedward/nlpaug) library for English corpus. #17
- `nlpcda_zh_mapper`: simple data augmentation using [nlpcda](https://github.com/425776024/nlpcda) library for Chinese corpus. #17
- `token_num_filter`: filter out samples by the number of tokens in them. HF tokenizers are supported. 24

New features
- OP Fusion 14
  - Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
- Cache management 19
  - Cache management now works for Data-Juicer thanks to the new serialization method.
  - Cache compression is supported: it automatically compresses caches when they are no longer in use and decompresses them when needed, which saves up to 50% of disk space.
- Distributed data processing with Ray is supported now. 21
- Config sys optimization:
  - Only keep `text_keys` and remove previous misleading arg `text_key(s)_to_process/load`. 13
  - A new argument `export_in_parallel` is added to control whether to export the result datasets in parallel. 17
  - Display the config table after config parsing is ready. 17
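
The OP Fusion item above rests on one idea: filters that need the same intermediate result (a "contextual variable") compute it once per sample and share it. The sketch below illustrates that with a shared per-sample `context` dict; all function names and the example filters are hypothetical and do not mirror Data-Juicer's internal interfaces.

```python
from typing import Callable, List

tokenize_calls = 0  # counts how often the shared intermediate is computed


def get_words(sample: dict, context: dict) -> List[str]:
    """Split the text once per sample and cache the result in the shared context."""
    global tokenize_calls
    if "words" not in context:
        tokenize_calls += 1
        context["words"] = sample["text"].split()
    return context["words"]


def min_words_filter(sample: dict, context: dict) -> bool:
    return len(get_words(sample, context)) >= 3


def max_word_len_filter(sample: dict, context: dict) -> bool:
    return max(len(w) for w in get_words(sample, context)) <= 10


def fused_filter(sample: dict, filters: List[Callable[[dict, dict], bool]]) -> bool:
    """Run all filters on one sample with a single shared context."""
    context: dict = {}
    return all(f(sample, context) for f in filters)


samples = [
    {"text": "a short clean sentence"},
    {"text": "too short"},
    {"text": "has an extraordinarily long word"},
]
kept = [s for s in samples if fused_filter(s, [min_words_filter, max_word_len_filter])]
print(kept, tokenize_calls)
```

Without the shared context, each filter would split the text again, so the number of tokenization passes would grow with the number of filters instead of staying at one per sample.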

Others
- Replace original string constants with constant enums. 13
- Expand the checkpoint protection range to cover the exporting process. 14
- Remove extra intermediate variables storage in `document_simhash_deduplicator` to save more memory. 14
- Docs updates. 15 16
- PyPI package is available. You can now install data-juicer with `pip install py-data-juicer`. 23
- Docker building is available now. The official docker image for Docker Hub is in progress. 23
- Deploy the unit tests for Data-Juicer. 29
