
Task benchmark

Official v6.10.10 benchmark snapshot

Benchmark ID: sigmap-v6.10-main  ·  Date: 2026-05-22 (with R language)

| Metric | Value |
| --- | --- |
| Hit@5 | 80% vs 13.6% baseline |
| Graph-boosted hit@5 | 80% |
| Retrieval lift | 5.9× |
| Prompt reduction | 41.4% (2.84 → 1.67) |
| Task success proxy | 53.3% |
| Token reduction (21 repos) | 96.5% |
| GPT-4o overflow (without → with) | 16/21 → 0/21 |

Latest saved run: 2026-05-22 (v6.10.10). Now includes R language support (ggplot2, dplyr, shiny).

This page answers the question people care about most:

Does SigMap help the developer finish the task with fewer retries?

Headline result

| Metric | Without SigMap | With SigMap |
| --- | --- | --- |
| Task success proxy | 10% | 53.3% |
| Prompts per task | 2.84 | 1.67 |
| Prompt reduction | | 41.4% |
| Retrieval hit@5 | 13.6% | 80% |
| Token reduction | | 96.5% |
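
The lift and reduction figures in the table are simple ratios of the "without" and "with" columns. The sketch below shows that arithmetic; the function names `lift` and `reductionPct` are illustrative and not part of SigMap. Recomputing the prompt reduction from the rounded averages gives about 41.2%, slightly below the published 41.4%, presumably because the published figure was derived from unrounded values.

```javascript
// Illustrative helpers (hypothetical names, not SigMap APIs).

// How many times larger the improved value is than the baseline.
function lift(baseline, improved) {
  return improved / baseline;
}

// Percentage drop from `before` to `after`.
function reductionPct(before, after) {
  return (1 - after / before) * 100;
}

// Retrieval hit@5: 13.6% baseline vs 80% with SigMap.
console.log(lift(13.6, 80).toFixed(1)); // "5.9"

// Prompts per task: 2.84 without vs 1.67 with (rounded inputs).
console.log(reductionPct(2.84, 1.67).toFixed(1)); // "41.2"
```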

Why the task benchmark exists

Retrieval is a prerequisite, but not the whole story. Developers feel the difference as:

  • fewer prompt retries
  • fewer "can you share more files?" loops
  • fewer answers grounded in the wrong module

The task benchmark models that outcome from where the right file lands in the ranked results:

  • rank 1 hit → likely one prompt
  • rank 2–5 hit → likely follow-up prompt
  • miss → likely multiple retries
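
The mapping above can be sketched as a small scoring function. This is a minimal illustration, not the benchmark's actual code: the per-tier prompt costs (1, 2, 3) are assumptions chosen to match "one prompt", "follow-up prompt", and "multiple retries", and the names `PROMPTS_PER_TIER`, `tierForRank`, and `averagePrompts` are hypothetical.

```javascript
// Assumed prompt cost per tier; the real benchmark may weight tiers differently.
const PROMPTS_PER_TIER = { correct: 1, partial: 2, wrong: 3 };

// Map the 1-based rank of the right file (null if it is not in the top 5) to a tier.
function tierForRank(rank) {
  if (rank === 1) return "correct";
  if (rank !== null && rank >= 2 && rank <= 5) return "partial";
  return "wrong";
}

// Average modelled prompts per task over a list of ranks.
function averagePrompts(ranks, costs = PROMPTS_PER_TIER) {
  if (ranks.length === 0) return 0;
  const total = ranks.reduce((sum, rank) => sum + costs[tierForRank(rank)], 0);
  return total / ranks.length;
}

// Five made-up tasks: three rank-1 hits, one rank-3 hit, one miss.
console.log(averagePrompts([1, 1, 3, null, 1])); // (1 + 1 + 2 + 3 + 1) / 5 = 1.6
```

Keeping the costs in one table makes it easy to test how sensitive the prompt estimate is to the assumed retry count for misses.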

Current saved score card

| Tier | Meaning | Tasks | Share |
| --- | --- | --- | --- |
| Correct | Right file was ranked first | 47 | 52.2% |
| Partial | Right file was present but not first | 24 | 26.7% |
| Wrong | Right file never surfaced in top 5 | 19 | 21.1% |
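
The share column follows directly from the task counts, which sum to 90. A minimal sketch of that calculation, using the counts from the score card (the `summarize` helper is an illustrative name, not a SigMap function):

```javascript
// Tier counts copied from the score card.
const scoreCard = { correct: 47, partial: 24, wrong: 19 };

// Compute the total and each tier's share, rounded to one decimal place.
function summarize(counts) {
  const total = counts.correct + counts.partial + counts.wrong;
  const pct = (n) => Math.round((n / total) * 1000) / 10;
  return {
    total,
    correctShare: pct(counts.correct),
    partialShare: pct(counts.partial),
    wrongShare: pct(counts.wrong),
  };
}

console.log(summarize(scoreCard));
// { total: 90, correctShare: 52.2, partialShare: 26.7, wrongShare: 21.1 }
```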

Prompt model summary

| Metric | Value |
| --- | --- |
| Average prompts without SigMap | 2.84 |
| Average prompts with SigMap | 1.66 |
| Reduction | 40.6% |
| Average hit@5 lift | 5.8× across repo baselines |

What changed in the v5 story

The earlier SigMap story was mostly "smaller context." The v5 story is more useful:

  • use ask to build the focused context
  • use validate to make sure coverage is healthy
  • use judge to check whether the answer was actually grounded
  • use learning when the same files repeatedly help or hurt

That makes the benchmark more than a marketing claim. It maps onto the actual daily workflow.

Benchmark snapshot by repo

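| Repo | Prompt reduction | Correct / Partial / Wrong |
| --- | --- | --- |
| flask | 64.8% | 5 / 0 / 0 |
| gin | 43.7% | 3 / 1 / 1 |
| rails | 47.2% | 2 / 1 / 2 |
| rust-analyzer | 64.8% | 4 / 1 / 0 |
| serilog | 26.1% | 0 / 2 / 3 |
| laravel | 64.7% | 2 / 3 / 0 |
| vapor | 17.7% | 1 / 1 / 3 |
| fastapi | 48.9% | 4 / 0 / 1 |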

These rows show why the task benchmark matters. Some repos have great retrieval lift but still need workflow help around validation and judge-based trust.

Reproduce

```bash
node scripts/run-task-benchmark.mjs --save
node scripts/run-task-benchmark.mjs --json
```

For the full multi-benchmark dashboard:

```bash
node scripts/run-benchmark-matrix.mjs --save --skip-clone
# `open` is the macOS command; on Linux, xdg-open does the same job
open benchmarks/reports/benchmark-report.html
```

MIT License