# Retrieval benchmark

Result: SigMap finds the right file in the top 5 far more often than chance — 84.4% hit@5 vs 13.6% random across 90 tasks.

When you ask an LLM a coding question, the answer quality depends entirely on whether the right files are in context. This benchmark measures exactly that — without running any LLM.

Method: For each of 18 real repos (5 tasks each, 90 total), we ask: does SigMap's output include the correct file in its top-5 ranked results? We compare against the probability of finding that file if 5 files were picked uniformly at random.

No LLM API was used. All scores are retrieval-rank arithmetic.

Reproduce:

```bash
node scripts/run-retrieval-benchmark.mjs --save --skip-run
# Or with fresh gen-context runs (takes a few minutes):
node scripts/run-retrieval-benchmark.mjs --save
```

## Results

| Repo | Files | Sigs | Random hit@5 | SigMap hit@5 | Lift | Correct | Partial | Wrong |
|---|---|---|---|---|---|---|---|---|
| express | 6 | 68 | 83% | 80% | — | 2/5 | 2/5 | 1/5 |
| flask | 19 | 20 | 26% | 100% | 3.8× | 5/5 | 0/5 | 0/5 |
| gin | 107 | 76 | 5% | 100% | 21× | 3/5 | 2/5 | 0/5 |
| spring-petclinic | 13 | 29 | 39% | 60% | 1.6× | 3/5 | 0/5 | 2/5 |
| rails | 1,179 | 110 | 0.4% | 80% | 189× | 2/5 | 2/5 | 1/5 |
| axios | 25 | 29 | 20% | 60% | — | 2/5 | 1/5 | 2/5 |
| rust-analyzer | 635 | 50 | 0.8% | 100% | 127× | 4/5 | 1/5 | 0/5 |
| abseil-cpp | 700 | 38 | 0.7% | 100% | 140× | 4/5 | 1/5 | 0/5 |
| serilog | 99 | 100 | 5% | 80% | 15.8× | 2/5 | 2/5 | 1/5 |
| riverpod | 446 | 43 | 1.1% | 100% | 89× | 4/5 | 1/5 | 0/5 |
| okhttp | 18 | 18 | 28% | 100% | 3.6× | 5/5 | 0/5 | 0/5 |
| laravel | 1,533 | 113 | 0.3% | 100% | 307× | 2/5 | 3/5 | 0/5 |
| akka | 211 | 64 | 2.4% | 100% | 42× | 3/5 | 2/5 | 0/5 |
| vapor | 131 | 134 | 3.8% | 60% | 15.7× | 1/5 | 2/5 | 2/5 |
| vue-core | 232 | 121 | 2.2% | 100% | 46× | 2/5 | 3/5 | 0/5 |
| svelte | 370 | 63 | 1.4% | 60% | 44× | 1/5 | 2/5 | 2/5 |
| fastify | 31 | 28 | 16% | 60% | 3.7× | 3/5 | 0/5 | 2/5 |
| fastapi | 48 | 32 | 10% | 80% | 7.7× | 3/5 | 1/5 | 1/5 |
| **Average** | | | **13.6%** | **84.4%** | **6.2×** | **51/90** | **25/90** | **14/90** |

9 of 18 repos hit 100% (all 5 tasks found in the top 5). Only 14 of 90 tasks produced a wrong result.


## Source file coverage per project

The default `maxTokens: 6000` budget fits only a fraction of the source files in large repos. This table shows how many files SigMap includes versus drops, measured with scoped `srcDirs` (source code only, no tests or examples).

| Repo | Language | srcDirs | Total files | Included | Dropped | Coverage | Grade | Hit@5 |
|---|---|---|---|---|---|---|---|---|
| express | JavaScript | lib/ | 6 | 6 | 0 | 100% | A | 80% |
| okhttp | Kotlin | 3 dirs | 19 | 18 | 1 | 95% | A | 100% |
| fastify | JavaScript | lib/ | 31 | 28 | 3 | 90% | A | 60% |
| serilog | C# | src/Serilog/ | 115 | 100 | 15 | 87% | B | 80% |
| flask | Python | src/flask/ | 26 | 20 | 6 | 77% | B | 100% |
| gin | Go | . | 130 | 76 | 54 | 58% | C | 100% |
| vapor | Swift | Sources/ | 250 | 134 | 116 | 54% | C | 60% |
| fastapi | Python | fastapi/ | 53 | 32 | 21 | 60% | C | 80% |
| riverpod | Dart | packages/ | 254 | 114 | 140 | 45% | D | 100% |
| axios | TypeScript | lib/ | 67 | 29 | 38 | 43% | D | 60% |
| vue-core | Vue | packages/ | 307 | 102 | 205 | 33% | D | 100% |
| spring-petclinic | Java | src/ | 84 | 25 | 59 | 30% | D | 60% |
| svelte | Svelte | packages/svelte/src, src/ | 410 | 63 | 347 | 15% | D | 60% |
| akka | Scala | 4 dirs | 398 | 64 | 334 | 16% | D | 100% |
| rails | Ruby | 7 dirs | 1,442 | 113 | 1,329 | 8% | D | 80% |
| laravel | PHP | src/Illuminate/ | 1,842 | 113 | 1,729 | 6% | D | 100% |
| rust-analyzer | Rust | crates/ | 2,007 | 50 | 1,957 | 2% | D | 100% |
| abseil-cpp | C++ | absl/ | 1,542 | 38 | 1,504 | 2% | D | 100% |

Grade key: A ≥90% · B ≥75% · C ≥50% · D <50% (default 6K token budget)

## Key finding: low coverage ≠ low retrieval quality

The most striking result is that repos with the lowest file coverage still achieve the highest hit@5:

- rust-analyzer — 2% file coverage (50 of 2,007 files) → 100% hit@5
- abseil-cpp — 2% file coverage (38 of 1,542 files) → 100% hit@5
- laravel — 6% file coverage (113 of 1,842 files) → 100% hit@5
- rails — 8% file coverage (113 of 1,442 files) → 80% hit@5
- akka — 16% file coverage (64 of 398 files) → 100% hit@5

This works because SigMap's token budget drop order prioritises recently-changed and high-signal files first. The files that answer real coding tasks tend to be the hot, actively-developed files — exactly the ones the budget keeps.
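That selection step can be sketched as a greedy rank-and-fill over the token budget. This is a minimal illustration only, not SigMap's actual implementation — the inputs (`daysSinceChange`, `exportCount`) and the scoring formula are hypothetical stand-ins for its real recency/signal heuristics:

```javascript
// Illustrative budget-driven file selection: score every file, sort by
// score, then keep files in rank order until the token budget is spent.
// The score favours recently-changed, high-signal files, so those are
// the last to be dropped.
function selectWithinBudget(files, maxTokens) {
  const ranked = files
    .map((f) => ({
      ...f,
      // Hypothetical signal: more exports, changed more recently → higher.
      score: f.exportCount / (1 + f.daysSinceChange),
    }))
    .sort((a, b) => b.score - a.score);

  const kept = [];
  let used = 0;
  for (const f of ranked) {
    if (used + f.tokens > maxTokens) continue; // doesn't fit → dropped
    kept.push(f);
    used += f.tokens;
  }
  return kept;
}
```

Files that no longer fit are skipped rather than truncated, so one oversized file cannot evict several small, hot ones.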

The pattern breaks when task files are structurally peripheral (config, rarely-touched utilities). That is what causes the 60% hit@5 on svelte (15% coverage, 347 files dropped) and spring-petclinic (30% coverage, some task files dropped).

## How to increase coverage for large repos

### 1. Raise `maxTokens` in your config

```json
{ "maxTokens": 12000 }
```

Most frontier models (Claude, GPT-4o, Gemini) comfortably handle 12K–24K tokens of context. Doubling the budget roughly doubles file coverage on large repos.

### 2. Use the per-module strategy for monorepos

For rails, laravel, vue-core, and akka — each a monorepo with distinct sub-packages:

```json
{ "strategy": "per-module" }
```

This writes one `context-<module>.md` per `srcDir` instead of one combined file, so each module gets its full budget.
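The grouping behind this strategy can be sketched as follows — a hypothetical helper, assuming each file belongs to the first `srcDir` that prefixes its path:

```javascript
// Illustrative per-module bucketing: files are grouped by the srcDir
// they live under, so each group can be rendered to its own
// context-<module>.md with a full token budget. Not SigMap's real code.
function groupByModule(paths, srcDirs) {
  const groups = new Map(srcDirs.map((dir) => [dir, []]));
  for (const path of paths) {
    const dir = srcDirs.find((d) => path === d || path.startsWith(d + '/'));
    if (dir) groups.get(dir).push(path); // files outside every srcDir are ignored
  }
  return groups;
}
```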

### 3. Set explicit `srcDirs`

Without a config, SigMap auto-detects source dirs. Adding `srcDirs` prevents test files and generated code from consuming the budget:

```json
{ "srcDirs": ["src", "lib"] }
```

### 4. Use the hot-cold strategy

```json
{ "strategy": "hot-cold" }
```

Recently-changed files go into the hot context (always injected). Older files go into a cold context served on demand via MCP. This is the highest-coverage option for very large repos.
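A rough sketch of the hot/cold partition, assuming the split is driven by file modification time — the 30-day window, option names, and `mtimeMs` field are illustrative assumptions, not SigMap's documented defaults:

```javascript
// Illustrative hot/cold split: files changed within the window go to the
// always-injected hot context; everything else goes to the cold context
// that is served on demand.
function splitHotCold(files, { hotWindowDays = 30, now = Date.now() } = {}) {
  const cutoff = now - hotWindowDays * 24 * 60 * 60 * 1000;
  const hot = [];
  const cold = [];
  for (const f of files) {
    (f.mtimeMs >= cutoff ? hot : cold).push(f);
  }
  return { hot, cold };
}
```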


## Before vs after (quality tiers)

Without SigMap, the context provided to the LLM is either truncated at the token limit or assembled from an unordered file list — equivalent to random selection for large repos.

Context quality — all 90 tasks across 18 repos:

```
Without SigMap (random selection):
Correct  ███░░░░░░░░░░░░░░░░░░░░░  14%  —  12/90 tasks
Partial  ████░░░░░░░░░░░░░░░░░░░░  17%  —  15/90 tasks
Wrong    █████████████████░░░░░░░  70%  —  63/90 tasks

With SigMap:
Correct  ██████████████░░░░░░░░░░  57%  —  51/90 tasks
Partial  ███████░░░░░░░░░░░░░░░░░  28%  —  25/90 tasks
Wrong    ████░░░░░░░░░░░░░░░░░░░░  16%  —  14/90 tasks
```

Wrong context drops from 70% → 16%. Correct context jumps from 14% → 57%.

For large repos (rails 1,179 files; laravel 1,533; rust-analyzer 635; abseil-cpp 700):

- Without SigMap: random hit@5 is 0.3–0.8% — effectively zero
- With SigMap: 80–100% hit@5 across all four

## What the tiers mean

| Tier | Definition | What it means for the LLM |
|---|---|---|
| Correct | Target file is the rank-1 result | LLM receives the most relevant context immediately |
| Partial | Target file in ranks 2–5 | Context present, but mixed with less-relevant files |
| Wrong | Target file not in top-5 | LLM operates without the key file in context |
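The tier assignment is plain rank arithmetic. A minimal version, using a 1-based `rank` and `null` for a file absent from the results:

```javascript
// Map a file's rank in the results to its quality tier:
// rank 1 → Correct, ranks 2–5 → Partial, anything else → Wrong.
function tierForRank(rank) {
  if (rank === 1) return 'Correct';
  if (rank >= 2 && rank <= 5) return 'Partial';
  return 'Wrong'; // rank > 5, or null/undefined when the file was not returned
}
```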

## Methodology

- Tasks: 5 per repo × 18 repos = 90 tasks. Each task is a natural-language query with one or more `expected_files` (real files from the cloned repo).
- Random baseline: `min(1, 5/fileCount)` — the probability that a uniformly random 5-file selection contains the target file.
- SigMap hit@5: does the SigMap retrieval ranker return the expected file within its top-5 ranked results?
- No LLM API used. Scores are purely rank-position arithmetic against ground-truth file labels.
- Task files are at `benchmarks/tasks/<repo>.jsonl` — readable and verifiable.
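The baseline and lift arithmetic follow directly from these definitions; for instance, rails' 189× lift in the results table falls out of its 1,179-file count and 80% hit@5:

```javascript
// Probability that a uniformly random 5-file selection contains one
// given target file out of fileCount files.
function randomHitAt5(fileCount) {
  return Math.min(1, 5 / fileCount);
}

// Lift: observed hit rate over the random baseline.
function lift(observedHitRate, fileCount) {
  return observedHitRate / randomHitAt5(fileCount);
}
```

With fewer than five files the baseline saturates at 1, and for very small repos it stays near 1 — which is why express (6 files, 83% baseline) shows no meaningful lift.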

## Summary

| Metric | Without SigMap | With SigMap |
|---|---|---|
| Average hit@5 | 13.6% | 84.4% |
| Lift | — | 6.2× |
| Wrong context (top-5 miss) | 70% | 16% |
| Correct context (rank-1 hit) | 14% | 57% |
| 100% hit@5 repos | 0/18 | 9/18 |

## Reproduce

```bash
# Uses existing generated output (fast, ~1s)
node scripts/run-retrieval-benchmark.mjs --skip-run

# Re-runs gen-context on all 18 repos first (~2 min)
node scripts/run-retrieval-benchmark.mjs

# Save results to benchmarks/reports/retrieval.json
node scripts/run-retrieval-benchmark.mjs --save --skip-run

# JSON output for scripting
node scripts/run-retrieval-benchmark.mjs --json --skip-run
```

Made in Amsterdam, Netherlands 🇳🇱

MIT License