Skip to content

Benchmark overview

Official v6.10.10 benchmark snapshot (21 repos, including R language)

Benchmark ID: sigmap-v6.10-main  ·  Date: 2026-05-22

MetricValue
Hit@5 (18 core repos)80% vs 13.6% baseline
Token reduction (21 repos)96.5%
Retrieval lift5.9×
Prompt reduction41.4% (2.84 → 1.67)
Task success proxy53.3%
GPT-4o overflow (without → with)16/21 → 0/21

This is the landing page for the public benchmark story. It answers four different questions:

If you want to prove...Open
SigMap reduces context size dramaticallyToken reduction
SigMap finds the right file more oftenRetrieval benchmark
SigMap reduces retries and wrong-context answersTask benchmark
SigMap keeps large repos inside model limitsQuality benchmark

Official v6.10.10 snapshot (with R language support)

Latest saved benchmark run: 2026-05-22 (v6.10.10)

MetricResult
Token reduction repos21 (including R: ggplot2, dplyr, shiny)
Retrieval benchmark repos18 (core languages)
Total tasks90
Average token reduction (all 21)96.5%
Retrieval hit@5 (18 core)80%
Graph-boosted hit@580%
Random baseline hit@513.6%
Prompt reduction41.4% (2.84 → 1.67 prompts)
GPT-4o overflow repos without SigMap16 / 21
GPT-4o monthly input savings at 10 calls/day$9,899.90

What each benchmark proves

1. Token reduction (21 repositories)

  • Raw source across benchmark set: 13,499,894 tokens (21 repos)
  • Final SigMap output: ~470,000 tokens
  • Overall reduction: 96.5%
  • New in v6.10.10: R language support verified
    • ggplot2: 94.3% reduction (381.5K → 21.7K tokens)
    • dplyr: 93.4% reduction (145.1K → 9.5K tokens)
    • shiny: 96.2% reduction (264.6K → 10.0K tokens)

2. Retrieval quality

  • SigMap hit@5: 80%
  • Graph-boosted hit@5: 80% (+0.0pp with dependency graph)
  • Random baseline: 13.6%
  • Lift: 5.9x

This is the best benchmark when the question is: "Does SigMap actually put the right file in context?"

3. Task outcomes

  • Correct: 48 / 90 (53.3%)
  • Partial: 24 / 90 (26.7%)
  • Wrong: 18 / 90 (20%)
  • Average prompts: 2.84 → 1.67

This is the best benchmark when the question is: "Does the developer need fewer retries to finish the job?"

4. Quality and overflow

  • 16/21 repos overflow GPT-4o's 128K context window without SigMap
  • R repos add to overflow risk: ggplot2 and shiny both overflow without SigMap
  • 5,200+ files would be hidden from the model in the raw-flow scenario
  • 16,500+ symbols are surfaced in SigMap output across all benchmark repos
  • With SigMap: 0/21 repos overflow — all repos fit within 128K context

This is the best benchmark when the question is: "Why does token reduction matter operationally?"

Open the HTML dashboard

The easiest way to inspect the latest benchmark run is the self-contained HTML report:

bash
node scripts/run-benchmark-matrix.mjs --save --skip-clone
open benchmarks/reports/benchmark-report.html

That generates synchronized JSON plus a dashboard for token, retrieval, quality, and task metrics together.

Reproduce the full benchmark set

bash
node scripts/run-benchmark.mjs --save --skip-clone
node scripts/run-retrieval-benchmark.mjs --save
node scripts/run-quality-benchmark.mjs --save
node scripts/run-task-benchmark.mjs --save
node scripts/run-benchmark-matrix.mjs --save --skip-clone

The matrix run writes:

  • benchmarks/reports/token-reduction.json
  • benchmarks/reports/retrieval.json
  • benchmarks/reports/quality.json
  • benchmarks/reports/task-benchmark.json
  • benchmarks/reports/benchmark-matrix.json
  • benchmarks/reports/benchmark-report.html

Benchmark resources

MIT License