Quality benchmark
Official v6.10.10 benchmark snapshot
Benchmark ID: sigmap-v6.10-main · Date: 2026-05-22 (with R language)
| Metric | Value |
|---|---|
| Hit@5 | 80% vs 13.6% baseline |
| Retrieval lift | 5.9× |
| Prompt reduction | 41.4% (2.84 → 1.67) |
| Task success proxy | 53.3% |
| Overall token reduction | 96.5% |
| GPT-4o overflow (without → with) | 16/21 → 0/21 |
Token reduction is the mechanism. This benchmark shows the operational consequence:
- does the repo fit inside model limits?
- how much code would be hidden without SigMap?
- what does that mean for API cost?
Latest saved run: 2026-05-22 (v6.10.10)
Headline numbers
| Metric | Without SigMap | With SigMap |
|---|---|---|
| GPT-4o overflow repos | 16 / 21 | 0 / 21 |
| Hidden files | 5,200+ | 0 |
| Grounded symbols surfaced | 0 | 16,500+ |
| GPT-4o monthly input savings | — | $10,500+ |
1. Context window fit
Raw repository content overflows GPT-4o's 128K window in 16 of 21 benchmark repos. It overflows Claude's 200K window in many of 21 repos.
That means a tool has to omit or truncate content before the model answers. SigMap avoids this by staying inside the budgeted context envelope.
| Repo class | Without SigMap | With SigMap |
|---|---|---|
| GPT-4o fits | 5 / 21 | 21 / 21 |
| Claude 200K fits | 9 / 21 | 21 / 21 |
| Gemini 1M fits | 14 / 21 | 21 / 21 |
2. Hidden-file risk
Across the benchmark repos, 5,200+ files would be hidden from the model in the raw-flow scenario.
This is the clearest explanation for why "just send the repo" is unreliable:
- some files never reach the model
- which files get dropped depends on the tool
- the omission is easy to miss until the answer is already wrong
SigMap changes that by surfacing compact signatures for the project structure ahead of time.
3. Grounded symbols
The latest saved run surfaced 16,500+ grounded symbols across the benchmark repos. That is the structural map the model can actually reason over.
Without SigMap, the same benchmark set leaves symbols effectively dark or unreachable to the model.
4. Cost impact
At 10 calls per day across the benchmark set:
| Model | Saved / day | Saved / month |
|---|---|---|
| GPT-4o | $350+ | $10,500+ |
| Claude Sonnet | $400+ | $12,000+ |
This is why the benchmark story is not just "smaller output." It directly affects the latency and cost profile of daily AI-assisted work.
Reproduce
node scripts/run-benchmark.mjs --save --skip-clone
node scripts/run-quality-benchmark.mjs --save
node scripts/run-benchmark-matrix.mjs --save --skip-cloneOpen the HTML dashboard for the full saved snapshot:
open benchmarks/reports/benchmark-report.html