Summarization - Table of Contents
=================
* Data-Juicer: A Data-Centric Text Processing System for Large Language Models
* Table of Contents
* Features
* Prerequisites
* Installation
* Quick Start
* Data Processing
* Data Analysis
* Data Visualization
* Build Up Config Files
* Preprocess raw data (Optional)
* Documentation | 文档
* Data Recipes
* Demos
* License
* Contributing
* References

Features
--------
- **Broad Range of Operators**: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
- **Specialized Toolkits**: Feature-rich toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that extend your dataset handling capabilities.
- **Systematic & Reusable**: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
- **Data-in-the-loop**: Allowing detailed data analyses with automated report generation for a deeper understanding of your dataset. Coupled with real-time, multi-dimensional automatic evaluation, it supports a feedback loop at multiple stages of the LLM development process.
- **Comprehensive Processing Recipes**: Offering dozens of pre-built data processing recipes for pre-training, supervised fine-tuning (SFT), English (en), Chinese (zh), and other scenarios.
- **User-Friendly Experience**: Designed for simplicity, with comprehensive documentation, easy start guides, demo configs, and intuitive configuration in which OPs can simply be added to or removed from existing configs.
- **Flexible & Extensible**: Accommodating most data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customized data processing.
- **Enhanced Efficiency**: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
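
To make the operator families above more concrete, here is a minimal, framework-free sketch of how a Mapper, a Filter, and a Deduplicator might be chained over text samples. It does not use Data-Juicer's actual API: every name in it (`whitespace_mapper`, `min_length_filter`, `exact_deduplicator`, `run_pipeline`, and the `text` field) is a hypothetical stand-in chosen for illustration only.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, str]


# NOTE: hypothetical stand-ins for the OP families named above,
# not Data-Juicer's real classes or function signatures.
def whitespace_mapper(sample: Sample) -> Sample:
    """Mapper: return an edited copy of the sample (collapse whitespace runs)."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_chars: int) -> Callable[[Sample], bool]:
    """Filter: build a predicate that keeps samples with enough characters."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"]) >= min_chars
    return keep


def exact_deduplicator(samples: Iterable[Sample]) -> List[Sample]:
    """Deduplicator: keep the first sample seen for each distinct text hash."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def run_pipeline(samples: List[Sample], min_chars: int = 10) -> List[Sample]:
    """Apply the three OPs in order: map, then filter, then deduplicate."""
    keep = min_length_filter(min_chars)
    mapped = [whitespace_mapper(s) for s in samples]
    kept = [s for s in mapped if keep(s)]
    return exact_deduplicator(kept)


raw = [
    {"text": "Data   quality matters  for LLMs."},
    {"text": "Data quality matters for LLMs."},
    {"text": "too short"},
]
cleaned = run_pipeline(raw)
print(cleaned)
```

Because each OP is an independent function over samples, adding or removing a step only changes the body of `run_pipeline`. This mirrors, in spirit only, the idea of combining OPs flexibly.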