
Retrieval benchmark

Latest saved run: 2026-04-17 (v5.3.0)

Result: SigMap finds the right file in the top 5 far more often than chance: 80.0% hit@5 vs a 13.6% random baseline across 90 tasks on 18 real repos.

Why this benchmark matters

When a coding assistant misses the key file, everything downstream gets worse:

  • more retries
  • more clarifying questions
  • more wrong-context answers

This benchmark isolates that first question: did the right file appear in context?
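That "did the right file appear?" check is a standard hit@k test. A minimal sketch of the metric (function and argument names are illustrative, not the benchmark's actual API):

```python
def hit_at_5(ranked_files, target_file):
    """Return True if the target file appears among the top 5
    retrieved files. `ranked_files` is a list of paths ordered
    by retrieval score, best first."""
    return target_file in ranked_files[:5]
```

A task scores a hit even if the target is ranked fifth; rank within the top 5 only matters for the quality tiers below.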

Headline numbers

| Metric | Without SigMap | With SigMap |
| --- | --- | --- |
| Average hit@5 | 13.6% | 80.0% |
| Lift | | 5.9x |
| Correct (rank 1) | ~1% | 52.2% |
| Partial (ranks 2–5) | ~13% | 26.7% |
| Wrong (not in top 5) | ~86% | 21.1% |

Quality tiers from the saved run

| Tier | Tasks | Share |
| --- | --- | --- |
| Correct | 47 / 90 | 52.2% |
| Partial | 24 / 90 | 26.7% |
| Wrong | 19 / 90 | 21.1% |
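The tiers follow directly from the target file's rank in the retrieved list. A sketch of the mapping and the aggregation (the cutoffs come from the "Correct (rank 1)" / "Partial (ranks 2–5)" rows above; the helper names are illustrative):

```python
from collections import Counter

def tier(rank):
    """Map a target file's 1-based rank to a quality tier.
    `rank` is None when the file is absent from the top 5."""
    if rank == 1:
        return "correct"
    if rank is not None and 2 <= rank <= 5:
        return "partial"
    return "wrong"

# Aggregating per-task ranks reproduces the shares in the table.
# These ranks are synthetic stand-ins matching the saved run's tallies:
ranks = [1] * 47 + [3] * 24 + [None] * 19   # 90 tasks
shares = Counter(tier(r) for r in ranks)
```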

Per-repo results

| Repo | Random hit@5 | SigMap hit@5 | Lift | Correct / Partial / Wrong |
| --- | --- | --- | --- | --- |
| express | 83.3% | 80% | 1.0x | 2 / 2 / 1 |
| flask | 26.3% | 100% | 3.8x | 5 / 0 / 0 |
| gin | 4.7% | 80% | 17.0x | 3 / 1 / 1 |
| spring-petclinic | 38.5% | 60% | 1.6x | 3 / 0 / 2 |
| rails | 0.4% | 60% | 150.0x | 2 / 1 / 2 |
| axios | 20.0% | 60% | 3.0x | 2 / 1 / 2 |
| rust-analyzer | 0.8% | 100% | 125.0x | 4 / 1 / 0 |
| abseil-cpp | 0.7% | 100% | 142.9x | 3 / 2 / 0 |
| serilog | 5.1% | 40% | 7.8x | 0 / 2 / 3 |
| riverpod | 1.1% | 100% | 90.9x | 4 / 1 / 0 |
| okhttp | 27.8% | 100% | 3.6x | 5 / 0 / 0 |
| laravel | 0.3% | 100% | 333.3x | 2 / 3 / 0 |
| akka | 2.4% | 100% | 41.7x | 3 / 2 / 0 |
| vapor | 3.8% | 40% | 10.5x | 1 / 1 / 3 |
| vue-core | 2.2% | 100% | 45.5x | 1 / 4 / 0 |
| svelte | 1.4% | 60% | 42.9x | 0 / 3 / 2 |
| fastify | 16.1% | 60% | 3.7x | 3 / 0 / 2 |
| fastapi | 10.4% | 80% | 7.7x | 4 / 0 / 1 |
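One plausible reading of the "Random hit@5" column (an assumption, not confirmed by the source): it is the chance that 5 files drawn uniformly without replacement from a repo include the single target file, which works out to 5/n for a repo of n files. A sketch of that baseline and the lift column:

```python
def random_hit_at_5(num_files, k=5):
    """Chance that k files sampled uniformly without replacement
    from `num_files` candidates include the one target file: k / n.
    (Assumed derivation of the Random hit@5 column.)"""
    return min(k / num_files, 1.0)

def lift(sigmap_rate, random_rate):
    """Lift column: SigMap hit rate divided by the random baseline."""
    return sigmap_rate / random_rate
```

Under this reading, tiny baselines correspond to large repos; e.g. laravel's 0.3% baseline would imply on the order of 5 / 0.003 ≈ 1,700 candidate files, and lift(1.0, 0.003) ≈ 333x matches the table.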

What the benchmark does not measure

This benchmark does not score answer wording, correctness of prose, or stylistic quality. It measures a narrower prerequisite:

whether the right source file is present in the ranked context.

That is why it pairs well with the judge and task benchmarks.

Reproduce

```bash
node scripts/run-retrieval-benchmark.mjs --save
node scripts/run-retrieval-benchmark.mjs --json
```

For the full multi-benchmark dashboard:

```bash
node scripts/run-benchmark-matrix.mjs --save --skip-clone
```

MIT License