Py-data-juicer

Latest version: v1.0.3


Page 2 of 2

0.1.2

New OPs
- `nlpaug_en_mapper`: simple data augmentation using [nlpaug](https://github.com/makcedward/nlpaug) library for English corpus. #17
- `nlpcda_zh_mapper`: simple data augmentation using [nlpcda](https://github.com/425776024/nlpcda) library for Chinese corpus. #17
- `token_num_filter`: filter out samples by the number of tokens in them. Hugging Face tokenizers are supported. #24
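
To illustrate the idea behind a token-count filter, here is a minimal sketch. It is not Data-Juicer's implementation: the function `filter_by_token_num`, its parameters `min_num`, `max_num`, and `text_key`, and the whitespace tokenizer are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Iterable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def filter_by_token_num(
    samples: Iterable[Dict[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    min_num: int = 1,
    max_num: int = 1000,
    text_key: str = "text",
) -> List[Dict[str, str]]:
    """Keep only samples whose token count lies in [min_num, max_num]."""
    kept = []
    for sample in samples:
        num_tokens = len(tokenize(sample[text_key]))
        if min_num <= num_tokens <= max_num:
            kept.append(sample)
    return kept


samples = [
    {"text": "hi"},                           # 1 token: too short
    {"text": "a short but usable sentence"},  # 5 tokens: kept
    {"text": "word " * 50},                   # 50 tokens: too long
]
kept = filter_by_token_num(samples, min_num=3, max_num=20)
print(kept)
```

Because the tokenizer is passed in as a callable, any function that maps a string to a list of tokens could be swapped in, for example the `tokenize` method of a Hugging Face tokenizer.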

New features
- OP Fusion #14
  - Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
- Cache management #19
  - Cache management now works for Data-Juicer thanks to the new serialization method.
  - Cache compression is supported: caches are automatically compressed when they are not in use and decompressed when needed, saving up to 50% of disk space.
- Distributed data processing with Ray is now supported. #21
- Config system optimization:
  - Keep only `text_keys` and remove the previous misleading arg `text_key(s)_to_process/load`. #13
  - A new argument `export_in_parallel` controls whether the result datasets are exported in parallel. #17
  - The config table is displayed once config parsing is complete. #17
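
The OP fusion idea (filters that need the same intermediate value compute it once and share it) can be sketched as follows. This is an illustrative sketch only, not Data-Juicer's code: `get_words`, `word_num_filter`, `word_repetition_filter`, `fuse`, and the `context` dictionary are hypothetical names, and the thresholds are arbitrary.

```python
from typing import Callable, Dict, List

split_calls = 0  # counts how often the shared intermediate is computed


def get_words(text: str, context: Dict[str, List[str]]) -> List[str]:
    """Return the word list, computing it only if the context lacks it."""
    global split_calls
    if "words" not in context:
        split_calls += 1
        context["words"] = text.split()
    return context["words"]


def word_num_filter(text: str, context: dict, min_num: int = 3) -> bool:
    """Keep texts with at least min_num words."""
    return len(get_words(text, context)) >= min_num


def word_repetition_filter(text: str, context: dict, max_ratio: float = 0.5) -> bool:
    """Keep texts whose share of repeated words is at most max_ratio."""
    words = get_words(text, context)
    if not words:
        return False
    repeated = len(words) - len(set(words))
    return repeated / len(words) <= max_ratio


def fuse(filters: List[Callable[[str, dict], bool]]) -> Callable[[str], bool]:
    """Combine filters into one OP that shares a per-sample context."""
    def fused(text: str) -> bool:
        context: dict = {}  # fresh context for each sample
        return all(f(text, context) for f in filters)
    return fused


fused_filter = fuse([word_num_filter, word_repetition_filter])
texts = ["the cat sat on the mat", "spam spam spam spam", "too short"]
kept = [t for t in texts if fused_filter(t)]
print(kept, split_calls)
```

Run separately, the two filters would each split every text. In the fused version the split happens once per sample, which is where the time saving in this sketch comes from.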

Others
- Replace original string constants with constant enums. #13
- Expand the checkpoint protection range to cover the exporting process. #14
- Remove the storage of extra intermediate variables in `document_simhash_deduplicator` to save more memory. #14
- Docs updates. #15 #16
- The PyPI package is available. You can now install Data-Juicer with `pip install py-data-juicer`. #23
- Docker building is now available. The official image for Docker Hub is in progress. #23
- Deploy the unit tests for Data-Juicer. #29

0.1.0

Summarization - Table of Contents
=================

* Data-Juicer: A Data-Centric Text Processing System for Large Language Models
* Table of Contents
* Features
* Prerequisites
* Installation
* Quick Start
* Data Processing
* Data Analysis
* Data Visualization
* Build Up Config Files
* Preprocess raw data (Optional)
* Documentation | 文档
* Data Recipes
* Demos
* License
* Contributing
* References

Features

- **Broad Range of Operators**: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.

- **Specialized Toolkits**: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.

- **Systematic & Reusable**: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.

- **Data-in-the-loop**: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.

- **Comprehensive Processing Recipes**: Offering tens of pre-built data processing recipes for pre-training, SFT, English (en), Chinese (zh), and other scenarios.

- **User-Friendly Experience**: Designed for simplicity, with comprehensive documentation, easy start guides, and demo configs. Configuration is intuitive: simply add OPs to or remove OPs from existing configs.

- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

- **Enhanced Efficiency**: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.

