Py-data-juicer

Latest version: v0.2.0


0.2.0

New Features
- 🚀 We introduce [**DJ-SORA**](https://github.com/alibaba/data-juicer/blob/main/docs/DJ_SORA.md) to provide open large-scale, high-quality datasets for SORA-like models. #227
- 🚀 We introduce hundreds of dedicated video, image, audio, text, and other **multi-modal** data processing [**operators**](https://github.com/alibaba/data-juicer/blob/main/docs/Operators.md) and **tools**.
- 💥 Our paper has been accepted by **SIGMOD'24 industrial track**! #211
- 💥 "**BetterMixture**", our second data-centric LLM competition, has kicked off and will end soon. #174

New OPs
Multimodal

- `video_frames_text_similarity_filter`: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
- `video_tagging_from_frames_mapper`: generates video tags from frames extracted from the video. #227
- `video_tagging_from_audio_mapper`: generates video tags from audio streams extracted from videos. #227
- `video_captioning_from_video_mapper`: generates captions from frame images extracted from video to augment datasets. #227
- `video_captioning_from_audio_mapper`: captions a video according to its audio streams. #227
- `image_captioning_mapper`: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
- `image_captioning_from_gpt4v_mapper`: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
- `image_diffusion_mapper`: generates and augments images based on the Stable Diffusion model and the original images and texts. This OP will increase the number of samples in the dataset. #200
Video
Filter

- `video_duration_filter`: keeps samples whose videos' durations are within a specified range. #227
- `video_aspect_ratio_filter`: filters samples according to the aspect ratios of their videos (the ratio of width to height, r = w/h). #227
- `video_resolution_filter`: filters samples according to the resolution of their videos. #227
- `video_ocr_area_ratio_filter`: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
- `video_aesthetics_filter`: filters samples according to the aesthetics score of frame images extracted from videos. #227
- `video_motion_score_filter`: keeps samples with video motion scores within a specific range. #227
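
Most of the filters above follow the same pattern: compute one statistic per sample and keep the sample only if the statistic lies within a configured range. The sketch below illustrates that pattern for the aspect ratio r = w/h. It is not Data-Juicer's implementation; the function names, the sample tuples, and the chosen bounds are all hypothetical.

```python
from typing import List, Tuple


def aspect_ratio(width: int, height: int) -> float:
    """Return r = w / h for a frame of the given size."""
    if height <= 0:
        raise ValueError("height must be positive")
    return width / height


def keep_in_range(value: float, min_value: float, max_value: float) -> bool:
    """Keep a sample only if its statistic lies in [min_value, max_value]."""
    return min_value <= value <= max_value


# Hypothetical samples: (video id, width, height).
videos: List[Tuple[str, int, int]] = [
    ("a", 1920, 1080),  # landscape
    ("b", 1080, 1920),  # portrait
    ("c", 640, 480),    # landscape
]

# Keep only videos whose aspect ratio is between 1.0 and 2.0 (assumed bounds).
kept = [vid for vid, w, h in videos
        if keep_in_range(aspect_ratio(w, h), 1.0, 2.0)]
```

With these assumed bounds, only the two landscape videos pass the filter.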
Mapper

- `video_split_by_scene_mapper`: splits videos into scene clips. #227
- `video_split_by_duration_mapper`: splits videos by a specified duration interval. #227
- `video_split_by_key_frame_mapper`: splits videos by their keyframes. #227
- `video_resize_aspect_ratio_mapper`: resizes the aspect ratios of videos (the ratio of width to height, r = w/h) to a specified range. #227
- `video_resize_resolution_mapper`: maps videos to ones with a given resolution range. #227
- `video_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to video data more conveniently. #227
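
To make the idea of splitting by a duration interval concrete, here is a small sketch that only computes the clip boundaries in seconds. It does not touch any video data, and the function name and the choice to keep a shorter trailing clip are assumptions made for illustration, not the behavior of the OP itself.

```python
from typing import List, Tuple


def split_points(total: float, interval: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs covering [0, total] in steps of `interval`.

    Assumption: a trailing clip shorter than `interval` is kept.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    clips = []
    start = 0.0
    while start < total:
        end = min(start + interval, total)
        clips.append((start, end))
        start = end
    return clips


# A 25-second video cut into 10-second clips.
clips = split_points(25.0, 10.0)
```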
Deduplicator

- `video_deduplicator`: deduplicates samples at document-level using exact matching of videos between documents. #227
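
Exact-match deduplication can be sketched by hashing each sample's raw bytes and keeping only the first sample seen for each digest. The snippet below is an illustration under that assumption; the hash function, the function name, and the in-memory byte payloads are hypothetical and may differ from what the OP actually does.

```python
import hashlib
from typing import Iterable, List, Tuple


def dedup_exact(samples: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return ids of samples whose payload bytes have not been seen before."""
    seen = set()
    kept = []
    for sample_id, payload in samples:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(digest)
        kept.append(sample_id)
    return kept


# Hypothetical samples: "c" carries the same bytes as "a".
unique_ids = dedup_exact([("a", b"clip-1"), ("b", b"clip-2"), ("c", b"clip-1")])
```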
Audio

- `audio_duration_filter`: keeps samples whose audios' durations are within a specified range. #177
- `audio_size_filter`: keeps samples whose audios' sizes are within a specified range. #184
- `audio_nmf_snr_filter`: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
- `audio_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to audio data more conveniently. #227
Image

- `image_blur_mapper`: adds random noises to images to blur them. #180
- `image_aesthetics_filter`: filters samples according to the aesthetics scores of images. #227
Document Updates

- "Bad" Data Exhibition [EN](https://github.com/alibaba/data-juicer/blob/main/docs/BadDataExhibition.md) [ZH](https://github.com/alibaba/data-juicer/blob/main/docs/BadDataExhibition_ZH.md): shows how Data-Juicer finds "bad" data and what they look like.
- Awesome LLM Data [EN](https://github.com/alibaba/data-juicer/blob/main/docs/awesome_llm_data.md): a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement [EN](https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide.md) [ZH](https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide_ZH.md): adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
- OP Insight Visualization Demo [code](https://github.com/alibaba/data-juicer/blob/main/demos/data_visualization_op_insight): adds a demo to visualize how each OP works.
Bugs Fixed

- Fix a stats computation error in Ray mode caused by an inappropriate initialization method. #173
- Fix the bug that some images are lost when converting their paths to absolute paths. #178
- Fix the dependency problems of OPs that depend on other OPs. #181
- Fix the bug that the `predict.py` tool gets stuck on the help page. #183
- Fix `face_area_filter`: constrains the detection coordinates within the image. #202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
- Fix or update invalid links in Data-Juicer. #201 #219
Others

- Optimize the model management module. #196 #227
- Optimize the unit test actions. #195 #196 #216 #227
- Optimize the multiprocessing strategy; model inference efficiency can also be increased thanks to GPU support. #203 #217 #222 #227
- Update the docker image with JDK. #208
- Support more multimodal (video) dataset conversion tools: #227
  - InternVid: 234M video-caption data
  - Youku-mPLUG: 36TB video-caption data
  - Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. #227
- Support running data-juicer process jobs on Aliyun PAI-DLC. #227
- Better support for multi-machine distributed data processing in Ray mode. #227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!

- liuyanyi helps to fix a bug in quality classifier tools. #183
- co63oc helps to fix some typos. #215
- liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
- zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

0.1.3

New Features
- Data-Juicer now supports Python 3.7-3.10!
  - We released a pybind version of the simhash-py library named `simhash-pybind` to solve the Python version limitation problem.
  - We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
- Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
  - A novel intermediate multimodal sample format: it uses special tokens to split text chunks and represent non-text information.
  - Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, and more.
  - Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
- Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
- Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version is implemented as a new OP `replace_content_mapper`. #143
- Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or the distribution difference between different datasets. #160
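
As a rough illustration of the 3-sigma rule mentioned above, the sketch below derives lower and upper thresholds for a filter from the distribution of one per-sample statistic: everything outside mean ± 3 standard deviations would be treated as an outlier. How the Auto-HPO tools apply the rule internally is not described here, so the function name, the use of the population standard deviation, and the sample values are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def three_sigma_bounds(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - 3*std, mean + 3*std) for the given statistic values."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample statistic values (e.g. text lengths).
stats = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lower, upper = three_sigma_bounds(stats)  # mean is 5.0, std is 2.0
```

The resulting pair could then serve as the minimum and maximum arguments of a range filter.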
New OPs
Text
- `chinese_convert_mapper`: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by [opencc](https://github.com/BYVoid/OpenCC)) #51
- `remove_non_chinese_character_mapper`: removes non-Chinese characters in text samples. #51
- `text_action_filter`: keeps samples containing action verbs in their texts. #122
- `text_entity_dependency_filter`: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
- `replace_content_mapper`: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
- `remove_repeat_sentences_mapper`: removes repeated sentences in the text. #149
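
The behavior described for `replace_content_mapper` can be approximated with Python's standard `re.sub`. This is a sketch of the idea only, not the OP's code: the function name `replace_content`, the simplified email pattern, and the `<EMAIL>` placeholder are hypothetical.

```python
import re


def replace_content(text: str, pattern: str, repl: str = "") -> str:
    """Replace every match of `pattern` in `text` with `repl`."""
    return re.sub(pattern, repl, text)


# A deliberately simple email pattern, for illustration only.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

cleaned = replace_content("Contact bob@example.com today.", EMAIL, "<EMAIL>")
removed = replace_content("Contact bob@example.com today.", EMAIL)
```

Passing an empty replacement string reproduces the earlier remove-only behavior.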
Image
- `image_shape_filter`: keeps samples containing images with widths and heights within the specified ranges. #74
- `image_aspect_ratio_filter`: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
- `image_size_filter`: keeps samples containing images whose sizes in bytes are within the specified range. #73
- `face_area_filter`: keeps samples containing images with face area ratios within the specified range. #110
- `image_deduplicator`: deduplicates samples at document-level using exact matching of images between documents. #72
Multimodal
- `image_text_similarity_filter`: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
- `image_text_matching_filter`: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
- `phrase_grounding_recall_filter`: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
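
The cosine-similarity criterion used by `image_text_similarity_filter` can be sketched with plain Python lists standing in for the image and text feature vectors. No CLIP model is involved here; the toy vectors, the function names, and the range bounds are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def keep_sample(image_vec: Sequence[float], text_vec: Sequence[float],
                min_score: float, max_score: float) -> bool:
    """Keep the sample if its similarity lies in [min_score, max_score]."""
    return min_score <= cosine_similarity(image_vec, text_vec) <= max_score


# Toy feature vectors standing in for model embeddings.
aligned = keep_sample([1.0, 1.0], [1.0, 0.0], 0.5, 1.0)     # similarity ~0.707
unrelated = keep_sample([1.0, 0.0], [0.0, 1.0], 0.5, 1.0)   # similarity 0.0
```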
Bugs fixed
- Pin `pandas==2.0.0` and `fsspec==2023.3.0` to avoid unexpected errors from third-party dependencies. #38 #42
- Fix the bug that OPs `nlpaug_en_mapper` and `nlpcda_zh_mapper` generate indefinite numbers of augmented samples. #76
- Fix the bug that `maximum_line_length_filter` might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
- Fix the bug of the missing attribute `dataset_dir` when the input dataset path is remote or a mixture of several datasets. #155 #157
- Fix command-line argument parsing errors in some cases. #108 #165
- Store simhash value as string type to avoid errors from PyArrow. #168 #170
Others
- Dependency importing optimization: require and import some dependencies only when they are used. #35 #82
- Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
- Optimize the cache directory selection logic. #43
- Support limiting the number of samples when mixing datasets. #86
- Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
- OP `language_id_score_filter` now supports keeping samples in multiple languages. #125 #151
Acknowledgement
Here we thank public contributors for their PRs to make Data-Juicer better!
- JONGSKY helps to remove some unnecessary code. #85
- xuruidong helps to fix several broken links in the README doc. #142

0.1.2

New OPs
- `nlpaug_en_mapper`: simple data augmentation using [nlpaug](https://github.com/makcedward/nlpaug) library for English corpus. #17
- `nlpcda_zh_mapper`: simple data augmentation using [nlpcda](https://github.com/425776024/nlpcda) library for Chinese corpus. #17
- `token_num_filter`: filters out samples by their number of tokens. HF tokenizers are supported. #24

New features
- OP Fusion #14
  - Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
- Cache management #19
  - Cache management now works in Data-Juicer thanks to a new serialization method.
  - Cache compression is supported: it automatically compresses caches that are no longer needed and decompresses them when required, saving up to 50% of disk space.
- Distributed data processing with Ray is now supported. #21
- Config system optimization:
  - Keep only `text_keys` and remove the previous, misleading arg `text_key(s)_to_process/load`. #13
  - A new argument `export_in_parallel` is added to control whether to export the result datasets in parallel. #17
  - Display the config table after config parsing is ready. #17
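
The saving from OP fusion comes from computing a shared intermediate result once per sample instead of once per filter. The sketch below shows that idea with two toy filters that both need the sample's word list. It is not Data-Juicer's fusion code: every class, function, and variable name here is hypothetical.

```python
from typing import Dict, List

tokenize_calls = 0  # counts how often the shared step actually runs


def get_words(sample: Dict, context: Dict) -> List[str]:
    """Split the text once per sample and cache the result in `context`."""
    global tokenize_calls
    if "words" not in context:
        tokenize_calls += 1
        context["words"] = sample["text"].split()
    return context["words"]


class MinWordsFilter:
    def __init__(self, min_words: int):
        self.min_words = min_words

    def keep(self, sample: Dict, context: Dict) -> bool:
        return len(get_words(sample, context)) >= self.min_words


class MaxAvgWordLengthFilter:
    def __init__(self, max_avg_len: float):
        self.max_avg_len = max_avg_len

    def keep(self, sample: Dict, context: Dict) -> bool:
        words = get_words(sample, context)
        if not words:
            return False
        return sum(map(len, words)) / len(words) <= self.max_avg_len


class FusedFilter:
    """Run several filters on one sample with a single shared context."""

    def __init__(self, filters):
        self.filters = filters

    def keep(self, sample: Dict) -> bool:
        context: Dict = {}  # fresh context per sample
        return all(f.keep(sample, context) for f in self.filters)


fused = FusedFilter([MinWordsFilter(3), MaxAvgWordLengthFilter(6.0)])
result = fused.keep({"text": "data juicer fuses filters"})
```

Both filters pass for this sample, and the text is split only once even though two filters consume the word list.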

Others
- Replace original string constants with constant enums. #13
- Expand the checkpoint protection range to cover the exporting process. #14
- Remove extra intermediate variables storage in `document_simhash_deduplicator` to save more memory. #14
- Docs updates. #15 #16
- The PyPI package is available. You can now install data-juicer with `pip install py-data-juicer`. #23
- Docker building is available now. The official docker image for Docker Hub is in progress. #23
- Deploy the unit tests for Data-Juicer. #29

0.1.0

Summarization - Table of Contents
=================

* Data-Juicer: A Data-Centric Text Processing System for Large Language Models
* Table of Contents
* Features
* Prerequisites
* Installation
* Quick Start
* Data Processing
* Data Analysis
* Data Visualization
* Build Up Config Files
* Preprocess raw data (Optional)
* Documentation | 文档
* Data Recipes
* Demos
* License
* Contributing
* References

Features

- **Broad Range of Operators**: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.

- **Specialized Toolkits**: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.

- **Systematic & Reusable**: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.

- **Data-in-the-loop**: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.

- **Comprehensive Processing Recipes**: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.

- **User-Friendly Experience**: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.

- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

- **Enhanced Efficiency**: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
