Benchmark methodology
This page explains what we measure, how we measure it, and why we chose these metrics.
Overview
SigMap is evaluated on 90 real-world coding tasks across 18 open-source repositories spanning 8 programming languages. The benchmark answers: "Does SigMap help developers finish coding tasks with fewer retries?"
Test set: 90 tasks across 18 repos
The benchmark includes 5 tasks per repository, distributed across multiple languages and project types:
| Language | Repos | Example projects | Tasks |
|---|---|---|---|
| Python | 2 | Flask, FastAPI | 10 |
| JavaScript | 4 | Express, Axios, Fastify, Vue | 20 |
| Java | 3 | Spring, OkHttp, Akka | 15 |
| Go | 1 | Gin | 5 |
| Ruby | 1 | Rails | 5 |
| Rust | 1 | rust-analyzer | 5 |
| C++ | 1 | abseil-cpp | 5 |
| PHP | 1 | Laravel | 5 |
| Other | 4 | Serilog (C#), Riverpod (Dart), Vapor (Swift), Svelte (TS/JS) | 20 |
| Total | 18 | — | 90 |
Task selection criteria
Each task was designed to be:
- Representative — common developer questions in real projects
- Challenging — requires understanding module architecture, not just keyword search
- Answerable — solution files exist in the repository
- Language-diverse — covers different syntax, structure, and naming patterns
Example tasks:
- "Where is the auth middleware implemented?" (locate)
- "How do I configure rate limiting?" (explain)
- "Fix the memory leak in connection pooling" (debug)
- "Refactor the request handler to support streaming" (modify)
What we measure
1. Retrieval accuracy (Hit@5)
Question: Does SigMap find the right file in the top 5 results?
Why: Retrieval is a prerequisite. If the right file isn't in context, the AI can't answer correctly.
Metric: Hit@5 — the right file appears in the top 5 ranked results
Baseline: Random selection = ~13.6% (1 correct file out of ~90 files in typical repo)
SigMap score: 80.0% — 5.9× better than random
2. Task success proxy (correct rank)
Question: Was the right file ranked first?
Why: Rank 1 usually means one prompt. Rank 2-5 means follow-up. No hit = multiple retries.
Metric: Files ranked in positions:
- Correct (rank 1) — likely single prompt
- Partial (ranks 2-5) — likely follow-up
- Wrong (not in top 5) — likely multiple retries
SigMap breakdown:
- Correct: 52.2% of tasks
- Partial: 26.7% of tasks
- Wrong: 21.1% of tasks
3. Prompt reduction
Question: Do developers need fewer prompts with SigMap context?
Why: Real measure of usability. Fewer retries = faster answers.
Metric: Average prompts per task
- Without SigMap: 2.84 prompts/task (cold start, no context)
- With SigMap: 1.68 prompts/task
- Reduction: 41.0%
4. Token reduction
Question: How much context do we actually need?
Why: Token limits constrain what we can include. SigMap keeps answers grounded by sending less.
Metric: Token count of final context
- Full repo signatures: 12.8M tokens (before SigMap)
- SigMap output: 241K tokens (after ranking/filtering)
- Reduction: 96.8% average, 40–98% per repo
Impact: Without SigMap, 13 of 18 repos overflow GPT-4o's 128K context window. With SigMap, all 18 fit.
5. Answer usefulness (v6.9+)
Question: Do answers retrieved with SigMap actually help developers?
Why: Correct retrieval doesn't guarantee helpful answers. We measure whether context + retrieval enables correct problem-solving.
Metric: Usefulness tier
- Fully useful — context enabled correct answer
- Partially useful — context partially helped but needed clarification
- Not useful — context didn't help or was misleading
Reproducibility
All benchmarks are reproducible:
- Task set: 90 tasks committed in
benchmarks/tasks/ - Repositories: 18 open-source projects cloned from GitHub
- Scripts: Benchmark runners in
scripts/run-*.mjs - Raw data: Available in Zenodo archive
- Dashboard: Self-contained HTML report in
benchmarks/reports/benchmark-report.html
Run the benchmarks yourself:
node scripts/run-retrieval-benchmark.mjs
node scripts/run-quality-benchmark.mjs
node scripts/run-task-benchmark.mjs
node scripts/run-benchmark-matrix.mjs --savePer-repo variation
Metrics vary significantly by repository type:
| Dimension | Range | Meaning |
|---|---|---|
| Hit@5 by language | 60–100% | Python/Java typically higher, JS lower due to naming variance |
| Prompt reduction | 17–65% | Larger repos need more context reduction |
| Token reduction | 40–98% | Enterprise frameworks reduce more than utilities |
See per-repo breakdown for details.
Confidence and limitations
What this benchmark covers:
- File retrieval in well-structured open-source projects
- Common developer tasks across 8 languages
- Context quality under token constraints
What this benchmark does NOT cover:
- Enterprise proprietary codebases (different structure, naming)
- Real-time user study of answer quality
- Performance (latency, memory usage)
- Specialized languages (Lisp, Haskell, Niche domains)
Treat these metrics as a guide, not a guarantee. Your results may vary based on:
- Codebase size and complexity
- Project naming consistency
- Language-specific patterns
- Domain-specific terminology
Questions about methodology? Open an issue on GitHub.
Want to contribute tasks? See benchmark suite repository.