> **LLM Dataset Extractor from GitHub Repos** | AI & NLP-ready text pipelines
Gittxt: Get text from Git repositories in AI-ready formats.
---
What is Gittxt?
**Gittxt** is a developer-focused CLI tool that extracts AI-ready text from **Git repositories**. Whether you're preparing datasets for **AI models**, **NLP pipelines**, or **LLM fine-tuning**, Gittxt automates the tedious task of repository scanning and text conversion.
Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:
- Preparing **training data for LLMs** (e.g., ChatGPT, Claude, Mistral)
- **Documentation extraction** for knowledge bases
- **Code summarization** pipelines
- **Repository analysis** for machine learning workflows
---
Features
- **Dynamic File-Type Filtering** (`--file-types=code,docs,images,csv,media,all`)
- **Automatic Tree Generation** with clean filtering (excludes `.git/`, `__pycache__`, etc.)
- **Multiple Output Formats**: TXT, JSON, Markdown
- **Optional ZIP Packaging** for non-text assets
- **CLI-friendly Progress Bars**
- **Built-in Summary Reports** (`--summary`)
- **Interactive & CI-ready Modes** (`--non-interactive`)
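
As an illustration of how a `--file-types` style filter could work, the sketch below maps file extensions to the categories named in the feature list. The extension map and the `classify` and `filter_by_types` helpers are assumptions made for this example and are not taken from Gittxt's source.

```python
from pathlib import PurePosixPath

# Hypothetical extension map for illustration; Gittxt's real rules may differ.
CATEGORY_BY_EXTENSION = {
    ".py": "code", ".js": "code", ".ts": "code",
    ".md": "docs", ".rst": "docs", ".txt": "docs",
    ".png": "images", ".jpg": "images", ".svg": "images",
    ".csv": "csv",
    ".mp4": "media", ".mp3": "media",
}


def classify(path: str) -> str:
    """Return the category for a path, or 'other' if the extension is unknown."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "other")


def filter_by_types(paths, file_types: str):
    """Keep paths whose category is in a comma-separated list such as 'code,docs'."""
    wanted = {t.strip() for t in file_types.split(",") if t.strip()}
    if "all" in wanted:
        return list(paths)
    return [p for p in paths if classify(p) in wanted]


files = ["src/app.py", "README.md", "assets/logo.png", "data/rows.csv"]
print(filter_by_types(files, "code,docs"))
```

Treating `all` as a short-circuit keeps the filter cheap when no narrowing is requested.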
---
Installation
Using Poetry
```bash
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install
```
Using pip (stable)
```bash
pip install gittxt
```
---
Quickstart Example
```bash
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary
```
This will:
- Scan a GitHub repository
- Extract code & docs files
- Write `.txt` and `.json` outputs
- Show a summary report
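
For scripted use, the same scan can be assembled from Python. This is a minimal sketch: `build_scan_command` is a hypothetical helper written for this example, and it only uses flags that appear in this README. It builds the argument list without running anything, so it can be inspected or logged first.

```python
import shlex


def build_scan_command(repo, output_formats=("txt", "json"),
                       file_types=("code", "docs"),
                       summary=True, non_interactive=False):
    """Assemble a `gittxt scan` argument list (hypothetical helper)."""
    cmd = [
        "gittxt", "scan", repo,
        "--output-format", ",".join(output_formats),
        "--file-types", ",".join(file_types),
    ]
    if summary:
        cmd.append("--summary")
    if non_interactive:
        cmd.append("--non-interactive")
    return cmd


cmd = build_scan_command("https://github.com/sandy-sp/gittxt.git")
print(shlex.join(cmd))

# To execute the scan (requires gittxt on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when repository paths contain spaces.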
---
CLI Usage
```bash
gittxt scan [REPOS]... [OPTIONS]

Options:
  --include TEXT         Include patterns (e.g., *.py)
  --exclude TEXT         Exclude patterns (e.g., tests/, node_modules)
  --size-limit INTEGER   Max file size in bytes
  --branch TEXT          Specify branch (for GitHub URLs)
  --file-types TEXT      code, docs, images, csv, media, all
  --output-format TEXT   txt, json, md, or comma-separated list
  --output-dir PATH      Custom output directory
  --summary              Show post-scan summary
  --non-interactive      Skip prompts for CI/CD workflows
  --progress             Enable scan progress bars
  --debug                Enable debug logs
  --help                 Show this message and exit
```
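
To make the interplay of `--include`, `--exclude`, and `--size-limit` concrete, here is a rough sketch of one plausible selection rule. The precedence (exclude first, then include, then size) and the `is_selected` helper are assumptions for illustration; Gittxt's actual matching may behave differently.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_selected(path, size, include=(), exclude=(), size_limit=None):
    """Decide whether a file would be kept (hypothetical rule set)."""
    parts = PurePosixPath(path).parts
    name = parts[-1]

    # Exclusions win: a pattern may name a directory ("tests/", "node_modules")
    # or be a filename glob ("*.min.js").
    for pattern in exclude:
        directory = pattern.rstrip("/")
        if directory in parts[:-1] or fnmatch(name, pattern):
            return False

    # If include patterns are given, the filename must match at least one.
    if include and not any(fnmatch(name, p) for p in include):
        return False

    # Drop files larger than the limit (in bytes).
    if size_limit is not None and size > size_limit:
        return False

    return True


print(is_selected("src/app.py", 1200, include=["*.py"], exclude=["tests/"]))
print(is_selected("tests/test_app.py", 800, include=["*.py"], exclude=["tests/"]))
```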
---
Output Structure
```
<output_dir>/
├── text/
│   └── repo-name.txt
├── json/
│   └── repo-name.json
├── md/
│   └── repo-name.md
└── zips/
    └── repo-name_bundle.zip   # Optional ZIP for assets (images, csv, etc.)
```
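
Downstream scripts often need to locate these files. The sketch below derives the expected paths from the layout shown above; `expected_outputs` and `bundle_path` are hypothetical helpers, and the naming is assumed to follow the `repo-name` pattern exactly as pictured.

```python
from pathlib import Path

# Subdirectory and file extension per output format, as shown in the layout.
FORMAT_LAYOUT = {
    "txt": ("text", ".txt"),
    "json": ("json", ".json"),
    "md": ("md", ".md"),
}


def expected_outputs(output_dir, repo_name, formats):
    """Map each requested format to the path where its output should appear."""
    base = Path(output_dir)
    return {
        fmt: base / FORMAT_LAYOUT[fmt][0] / f"{repo_name}{FORMAT_LAYOUT[fmt][1]}"
        for fmt in formats
    }


def bundle_path(output_dir, repo_name):
    """Path of the optional ZIP bundle for non-text assets."""
    return Path(output_dir) / "zips" / f"{repo_name}_bundle.zip"


for fmt, path in expected_outputs("out", "gittxt", ["txt", "json"]).items():
    print(fmt, path)
```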
---
How It Works
1. Clone GitHub/local repo (supports branch/subdir URLs)
2. Dynamically generate directory tree (excluding `.git`, `__pycache__`, etc.)
3. Filter files based on type (code, docs, csv, media)
4. Generate formatted outputs (TXT, JSON, MD)
5. Package assets (optional ZIP for non-text)
6. Clean up temporary files (cache-free design)
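
Step 2 can be sketched with the standard library. This is an illustrative approximation, not Gittxt's code: the `build_tree` helper and the two-entry exclusion set are assumptions (the README's "etc." suggests the real list is longer).

```python
import os

# Directory names taken from the text above; the real exclusion list is likely longer.
EXCLUDED_DIRS = {".git", "__pycache__"}


def build_tree(root):
    """Return an indented listing of root, skipping excluded directories."""
    root = os.path.abspath(root)
    lines = []
    for current, dirs, files in os.walk(root):
        # Pruning in place stops os.walk from descending into excluded directories.
        dirs[:] = sorted(d for d in dirs if d not in EXCLUDED_DIRS)
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth:
            lines.append("    " * (depth - 1) + os.path.basename(current) + "/")
        for name in sorted(files):
            lines.append("    " * depth + name)
    return lines


# Example usage (prints the tree of a checked-out repository):
#   print("\n".join(build_tree("path/to/repo")))
```

Assigning to `dirs[:]` rather than rebinding `dirs` matters here: `os.walk` only honors the pruning when the list is modified in place.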
---
Example Summary Output
Summary Report:
- Total files processed: 45
- Output formats: txt, json
- File type breakdown: {'code': 31, 'docs': 14}
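
A report of this shape could be produced by counting file categories. The sketch below is a guess at the logic, not the tool's implementation: `summarize` and its dictionary keys are hypothetical names chosen for this example.

```python
from collections import Counter


def summarize(categories, output_formats):
    """Build a summary from one category label per processed file."""
    breakdown = Counter(categories)
    return {
        "total_files": sum(breakdown.values()),
        "output_formats": list(output_formats),
        "file_type_breakdown": dict(breakdown),
    }


# Sample input: 31 code files and 14 docs files.
report = summarize(["code"] * 31 + ["docs"] * 14, ["txt", "json"])
print("Summary Report:")
print(f"- Total files processed: {report['total_files']}")
print(f"- Output formats: {', '.join(report['output_formats'])}")
print(f"- File type breakdown: {report['file_type_breakdown']}")
```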
---
Security Policy
Please report security issues to: **sandeep.paidipati@gmail.com**
[View Security Policy](docs/SECURITY.md)
---
Contributing
We welcome community contributions!
- [Contributing Guidelines](docs/CONTRIBUTING.md)
- [Code of Conduct](docs/CODE_OF_CONDUCT.md)
- [Open an Issue](https://github.com/sandy-sp/gittxt/issues/new/choose)
---
Roadmap
- FastAPI-powered web UI
- AI-powered summaries (GPT/OpenAI integration)
- Support YAML/CSV as additional output formats
- Async file scanning (speed boost)
---
License
MIT License © [Sandeep Paidipati](https://github.com/sandy-sp)
---