Quality benchmark

Official v6.10.10 benchmark snapshot

Benchmark ID: sigmap-v6.10-main  ·  Date: 2026-05-22 (with R language)

| Metric | Value |
| --- | --- |
| Hit@5 | 80% vs 13.6% baseline |
| Retrieval lift | 5.9× |
| Prompt reduction | 41.4% (2.84 → 1.67) |
| Task success proxy | 53.3% |
| Overall token reduction | 96.5% |
| GPT-4o overflow (without → with) | 16/21 → 0/21 |
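As a rough illustration of how the first two rows relate, the sketch below computes a Hit@k rate and the lift over a baseline. The `results` shape, the function names, and the sample file paths are hypothetical and are not taken from the benchmark scripts; only the 80% and 13.6% figures come from the table above.

```javascript
// Hit@k: fraction of queries whose expected file appears in the top-k results.
// The { expected, ranked } shape is an assumption made for this sketch.
function hitAtK(results, k) {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => r.ranked.slice(0, k).includes(r.expected));
  return hits.length / results.length;
}

// Retrieval lift: measured hit rate divided by the baseline hit rate.
function retrievalLift(hitRate, baselineRate) {
  return hitRate / baselineRate;
}

// Two made-up queries: the first is a hit, the second is a miss.
const sample = [
  { expected: "src/auth.js", ranked: ["src/auth.js", "src/db.js"] },
  { expected: "src/cache.js", ranked: ["src/db.js", "src/api.js"] },
];

console.log(hitAtK(sample, 5)); // 0.5
console.log(retrievalLift(0.8, 0.136).toFixed(1)); // "5.9"
```

With the reported rates, 80% divided by 13.6% gives roughly 5.9, which matches the lift in the table.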

Token reduction is the mechanism. This benchmark shows the operational consequence:

  • does the repo fit inside model limits?
  • how much code would be hidden without SigMap?
  • what does that mean for API cost?

Latest saved run: 2026-05-22 (v6.10.10)

Headline numbers

| Metric | Without SigMap | With SigMap |
| --- | --- | --- |
| GPT-4o overflow repos | 16 / 21 | 0 / 21 |
| Hidden files | 5,200+ | 0 |
| Grounded symbols surfaced | 0 | 16,500+ |
| GPT-4o monthly input savings | | $10,500+ |

1. Context window fit

Raw repository content overflows GPT-4o's 128K window in 16 of the 21 benchmark repos. It overflows Claude's 200K window in 12 of the 21.

That means a tool has to omit or truncate content before the model answers. SigMap avoids this by staying inside the budgeted context envelope.

| Repo class | Without SigMap | With SigMap |
| --- | --- | --- |
| GPT-4o fits | 5 / 21 | 21 / 21 |
| Claude 200K fits | 9 / 21 | 21 / 21 |
| Gemini 1M fits | 14 / 21 | 21 / 21 |
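The fit check behind a table like this can be sketched as a simple comparison of each repo's token count against each window. The window sizes below are round-number approximations of the 128K, 200K, and 1M limits named in this section, and the per-repo token counts are invented for illustration; they are not benchmark data.

```javascript
// Context window sizes in tokens, approximated as round numbers (assumption).
const WINDOWS = {
  "GPT-4o": 128_000,
  "Claude 200K": 200_000,
  "Gemini 1M": 1_000_000,
};

// Count how many repos fit inside each model's window.
function countFits(repoTokenCounts, windows) {
  const fits = {};
  for (const [model, limit] of Object.entries(windows)) {
    fits[model] = repoTokenCounts.filter((tokens) => tokens <= limit).length;
  }
  return fits;
}

// Hypothetical raw token counts for four repos.
const rawTokens = [90_000, 150_000, 450_000, 2_300_000];

console.log(countFits(rawTokens, WINDOWS));
// { 'GPT-4o': 1, 'Claude 200K': 2, 'Gemini 1M': 3 }
```

Running the same check on the reduced, budgeted output instead of the raw content is what would move every row to 21 / 21.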

2. Hidden-file risk

Across the benchmark repos, 5,200+ files would be hidden from the model in the raw-flow scenario.

This is the clearest explanation for why "just send the repo" is unreliable:

  • some files never reach the model
  • which files get dropped depends on the tool
  • the omission is easy to miss until the answer is already wrong

SigMap changes that by surfacing compact signatures for the project structure ahead of time.
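To make the omission concrete, here is a minimal sketch of a greedy packer that adds files until a token budget is spent and reports what was left out. The `packFiles` function, the file list, and the token counts are hypothetical; real tools use their own selection strategies, which is exactly why the dropped set varies.

```javascript
// Greedy packer: include each file if it still fits in the budget, otherwise hide it.
function packFiles(files, budgetTokens) {
  const included = [];
  const hidden = [];
  let used = 0;
  for (const file of files) {
    if (used + file.tokens <= budgetTokens) {
      included.push(file.path);
      used += file.tokens;
    } else {
      hidden.push(file.path);
    }
  }
  return { included, hidden, used };
}

// Made-up repo contents totalling 170,000 tokens against a 128,000-token budget.
const files = [
  { path: "src/index.js", tokens: 40_000 },
  { path: "src/router.js", tokens: 70_000 },
  { path: "src/auth.js", tokens: 50_000 },
  { path: "src/util.js", tokens: 10_000 },
];

const forward = packFiles(files, 128_000);
const reversed = packFiles([...files].reverse(), 128_000);

console.log(forward.hidden); // [ 'src/auth.js' ]
console.log(reversed.hidden); // [ 'src/router.js' ]
```

The same repo and the same budget hide a different file depending only on traversal order, and nothing in the packed output signals that a file is missing.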

3. Grounded symbols

The latest saved run surfaced 16,500+ grounded symbols across the benchmark repos. That is the structural map the model can actually reason over.

Without SigMap, the same benchmark set leaves those symbols effectively invisible to the model.
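For a sense of what a compact signature looks like, the sketch below pulls function names and parameters out of a source string with a deliberately naive regular expression. This is not SigMap's extractor; the `extractSignatures` function and the sample source are assumptions made for illustration, and the pattern only handles plain `function` declarations that start a line.

```javascript
// Naive pattern for `function name(params)` declarations at the start of a line.
// It ignores arrow functions, methods, and nested parentheses in default values.
const FUNCTION_DECL = /^(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*\(([^)]*)\)/gm;

// Return one "name(param, ...)" string per matched declaration, without bodies.
function extractSignatures(source) {
  const signatures = [];
  for (const match of source.matchAll(FUNCTION_DECL)) {
    const params = match[2]
      .split(",")
      .map((p) => p.trim())
      .filter(Boolean);
    signatures.push(`${match[1]}(${params.join(", ")})`);
  }
  return signatures;
}

const source = `
export async function login(user, password) {
  return check(user, password);
}
function logout(session) {
  session.destroy();
}
`;

console.log(extractSignatures(source));
// [ 'login(user, password)', 'logout(session)' ]
```

The function bodies are dropped and only the names and parameter lists remain, which is why a signature map can be far smaller than the raw source while still naming every symbol.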

4. Cost impact

At 10 calls per day across the benchmark set:

| Model | Saved / day | Saved / month |
| --- | --- | --- |
| GPT-4o | $350+ | $10,500+ |
| Claude Sonnet | $400+ | $12,000+ |

This is why the benchmark story is not just "smaller output." It directly affects the latency and cost profile of daily AI-assisted work.
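The monthly figures follow from the daily ones if a month is taken as 30 days, which is an assumption made here rather than something the benchmark states. The sketch below uses the lower-bound daily savings from the table; the function and variable names are illustrative.

```javascript
// Assumed month length used to scale daily savings to monthly savings.
const DAYS_PER_MONTH = 30;

// Lower-bound daily savings in USD, as listed in the cost table.
const dailySavingsUsd = { "GPT-4o": 350, "Claude Sonnet": 400 };

// Scale each model's daily savings to a monthly figure.
function monthlySavings(daily, days = DAYS_PER_MONTH) {
  const monthly = {};
  for (const [model, perDay] of Object.entries(daily)) {
    monthly[model] = perDay * days;
  }
  return monthly;
}

console.log(monthlySavings(dailySavingsUsd));
// { 'GPT-4o': 10500, 'Claude Sonnet': 12000 }
```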

Reproduce

```bash
node scripts/run-benchmark.mjs --save --skip-clone
node scripts/run-quality-benchmark.mjs --save
node scripts/run-benchmark-matrix.mjs --save --skip-clone
```

Open the HTML dashboard for the full saved snapshot:

```bash
open benchmarks/reports/benchmark-report.html
```

MIT License