RepoQA for Long-Context Code Understanding
Introduction
RepoQA is a benchmark for evaluating the long-context code understanding ability of LLMs.
* **Multi-Lingual**: RepoQA now supports repositories from 5 programming languages:
  * Python
  * C++
  * TypeScript
  * Rust
  * Java
* **Application-driven**: RepoQA aims to evaluate LLMs on long-context tasks that reflect real-life use. Before RepoQA, long-context evaluations mainly relied on synthetic tasks that probe weak spots in an LLM's long-context handling, such as *"Needle in the Code"* by [CodeQwen](https://qwenlm.github.io/blog/codeqwen1.5/) and *"Needle in a Haystack"*.
* The first RepoQA task we propose is [🔍 Searching Needle Function](https://evalplus.github.io/repoqa.html#task-snf):
  * 500 sub-tasks = 5 programming languages x 10 repos x 10 needles
  * Asks the model to find the function (the *needle function*) that matches a precise natural-language description
![](https://evalplus.github.io/assets/RepoQA-CTX.svg)
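The sub-task count above follows from a simple grid. As a minimal sketch (not RepoQA's actual data loader or scorer), the snippet below enumerates the 5 x 10 x 10 grid with placeholder repository and needle indices, and shows one hypothetical way to decide whether a model's answer matches the needle function using `difflib` text similarity. The names `enumerate_subtasks` and `is_match`, and the 0.8 threshold, are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# The 5 languages listed above; repo and needle indices are placeholders.
LANGUAGES = ["python", "cpp", "typescript", "rust", "java"]
REPOS_PER_LANGUAGE = 10
NEEDLES_PER_REPO = 10


def enumerate_subtasks():
    """Return every (language, repo_index, needle_index) triple in the grid."""
    return list(
        product(LANGUAGES, range(REPOS_PER_LANGUAGE), range(NEEDLES_PER_REPO))
    )


def is_match(model_output: str, needle_source: str, threshold: float = 0.8) -> bool:
    """Hypothetical check: accept the answer if it is textually close to the needle.

    The similarity measure and threshold are illustrative assumptions,
    not the scoring rule used by the benchmark.
    """
    ratio = SequenceMatcher(None, model_output.strip(), needle_source.strip()).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(len(enumerate_subtasks()))  # 5 * 10 * 10 = 500
    needle = "def add(a, b):\n    return a + b"
    print(is_match(needle, needle))
    print(is_match("x = 1", needle))
```

A text-similarity check like this tolerates small formatting differences in the model's output while still rejecting answers that point at a different function.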
RepoQA is easy to use
* Supports the following backends:
  * OpenAI
  * Anthropic
  * vLLM
  * HuggingFace transformers
  * Google Generative AI API (Gemini)
* 🚀 Evaluation can be done in one command
* 🏆 A leaderboard: https://evalplus.github.io/repoqa.html
Quick examples
```shell
pip install repoqa
repoqa.search_needle_function --model "gpt-4-turbo" --backend openai
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
```
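To evaluate several models in one go, a small driver script can assemble the command lines. The sketch below only builds and prints them (a dry run), using a subset of the quick examples above and only the flags shown there. The `RUNS` table and the helper name `build_command` are illustrative assumptions, not part of RepoQA.

```python
import shlex

# (model, backend, extra flags), taken from the quick examples above.
RUNS = [
    ("claude-3-haiku-20240307", "anthropic", []),
    ("Qwen/CodeQwen1.5-7B-Chat", "vllm", []),
    ("Qwen/CodeQwen1.5-7B-Chat", "hf", ["--trust-remote-code"]),
]


def build_command(model, backend, extra_flags=()):
    """Build the argument list for one evaluation run (hypothetical helper)."""
    return [
        "repoqa.search_needle_function",
        "--model", model,
        "--backend", backend,
        *extra_flags,
    ]


if __name__ == "__main__":
    for model, backend, extra in RUNS:
        # shlex.join quotes each argument so the printed line is safe to paste into a shell.
        print(shlex.join(build_command(model, backend, extra)))
```

To actually launch a run, each argument list could be passed to `subprocess.run(..., check=True)` instead of being printed.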
Resources
* PyPI: https://pypi.org/project/repoqa/0.1.0/
* Homepage: https://evalplus.github.io/repoqa.html
* Dataset release: https://github.com/evalplus/repoqa_release