I am excited to announce the first release of Synda, a Python package designed for creating synthetic data generation pipelines. Synda is built with a focus on speed and simplicity while maintaining flexibility through configuration.
⚠️ **Note**: This is an early development release and is not recommended for production use.
Key Features
Pipeline Architecture
- Configurable pipeline system using YAML configuration files
- Three-stage architecture: Input → Pipeline → Output
- Support for CSV input/output with customizable parameters
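
To make the Input → Pipeline → Output flow concrete, here is a minimal sketch in plain Python. It does not use Synda's actual API or configuration schema: the `config` dictionary stands in for a parsed YAML file, and every key and name in it (`run_pipeline`, `STEP_REGISTRY`, `uppercase`, `drop_short`, `min_length`) is hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical step functions: each takes a list of strings and returns a new list.
def uppercase(rows: list[str]) -> list[str]:
    return [row.upper() for row in rows]

def drop_short(rows: list[str], min_length: int = 1) -> list[str]:
    return [row for row in rows if len(row) >= min_length]

# Maps the step name used in the configuration to the function that runs it.
STEP_REGISTRY: dict[str, Callable[..., list[str]]] = {
    "uppercase": uppercase,
    "drop_short": drop_short,
}

def run_pipeline(config: dict) -> list[str]:
    # Input stage: load the rows named in the configuration.
    rows = list(config["input"]["rows"])
    # Pipeline stage: apply each configured step in order.
    for step in config["pipeline"]:
        func = STEP_REGISTRY[step["type"]]
        rows = func(rows, **step.get("parameters", {}))
    # Output stage: here the rows are simply returned to the caller.
    return rows

# Stand-in for a YAML file that has already been parsed into a dictionary.
config = {
    "input": {"rows": ["alpha", "be", "gamma"]},
    "pipeline": [
        {"type": "drop_short", "parameters": {"min_length": 3}},
        {"type": "uppercase"},
    ],
}

result = run_pipeline(config)
print(result)
```

Keeping the steps in a registry keyed by name is one common way to let a configuration file decide which operations run and in what order, without changing code.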
Core Components
Data Input
- CSV file support with configurable parameters
- Target column specification
- Custom separator support
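
As an illustration of what a target column and a custom separator mean in practice, the following sketch reads one column from semicolon-separated data using only Python's standard `csv` module. It is not Synda's implementation; the function name `read_target_column` and the sample data are assumptions made for this example.

```python
import csv
import io

def read_target_column(source, target_column: str, separator: str = ",") -> list[str]:
    """Return the values of one named column from CSV text (hypothetical helper)."""
    reader = csv.DictReader(source, delimiter=separator)
    # Accessing fieldnames reads the header row, so a bad column name fails early.
    if reader.fieldnames is None or target_column not in reader.fieldnames:
        raise KeyError(f"column {target_column!r} not found in header")
    return [row[target_column] for row in reader]

# In-memory sample standing in for a CSV file on disk.
sample = io.StringIO("id;content\n1;First document\n2;Second document\n")
texts = read_target_column(sample, target_column="content", separator=";")
print(texts)
```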
Pipeline Steps
1. **Split Operation**
   - Chunk-based data splitting
   - Configurable chunk sizes
2. **Generation**
   - LLM-based content generation
   - Template system for generation instructions
   - OpenAI integration (supports models like gpt-4o-mini)
3. **Ablation**
   - LLM-based binary judgment system
   - Multiple criteria support
   - Consensus-based filtering
   - Quality control through customizable criteria
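
The three step types above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not Synda's code: the helper names (`split_into_chunks`, `build_prompt`, `passes_consensus`), the `$chunk` placeholder syntax, and the consensus modes are all hypothetical, and `fake_judge` is a deterministic stand-in for the LLM call that would return a yes/no verdict.

```python
from string import Template
from typing import Callable

def split_into_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_prompt(template: str, chunk: str) -> str:
    """Fill a generation instruction template with one chunk."""
    return Template(template).substitute(chunk=chunk)

def fake_judge(text: str, criterion: str) -> bool:
    """Stand-in for an LLM that answers yes/no for a single criterion."""
    if criterion == "is not empty":
        return bool(text.strip())
    if criterion == "has at least 5 characters":
        return len(text.strip()) >= 5
    raise ValueError(f"unknown criterion: {criterion}")

def passes_consensus(
    text: str,
    criteria: list[str],
    judge: Callable[[str, str], bool],
    mode: str = "all",
) -> bool:
    """Combine one binary verdict per criterion into a keep/drop decision."""
    votes = [judge(text, criterion) for criterion in criteria]
    if mode == "all":
        return all(votes)
    if mode == "any":
        return any(votes)
    if mode == "majority":
        return sum(votes) > len(votes) / 2
    raise ValueError(f"unknown consensus mode: {mode}")

chunks = split_into_chunks("Synthetic data pipelines", size=10)
prompts = [build_prompt("Write a question about: $chunk", c) for c in chunks]

# For brevity the chunks themselves play the role of generated candidates.
criteria = ["is not empty", "has at least 5 characters"]
kept = [c for c in chunks if passes_consensus(c, criteria, fake_judge, mode="all")]
print(kept)
```

In a real pipeline the judge would call a model once per criterion, so the consensus mode controls how strict the filter is: `all` drops a candidate on a single failed criterion, while `any` keeps it as long as one criterion passes.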
Provider Management
- CLI command for adding model providers
- Secure API key management
- Initial support for OpenAI integration