Task benchmark

Official v6.10.10 benchmark snapshot

Benchmark ID: sigmap-v6.10-main · Date: 2026-05-22 (with R language)

Metric	Value
Hit@5	80% vs 13.6% baseline
Graph-boosted hit@5	80%
Retrieval lift	5.9×
Prompt reduction	41.4% (2.84 → 1.67)
Task success proxy	53.3%
Token reduction (21 repos)	96.5%
GPT-4o overflow (without → with)	16/21 → 0/21

Latest saved run: 2026-05-22 (v6.10.10) — Now includes R language support (ggplot2, dplyr, shiny)

This page answers the question people care about most:

does SigMap help the developer finish the task with fewer retries?

Headline result

Metric	Without SigMap	With SigMap
Task success proxy	10%	53.3%
Prompts per task	2.84	1.67
Prompt reduction	—	41.4%
Retrieval hit@5	13.6%	80%
Token reduction	—	96.5%

Why the task benchmark exists

Retrieval is a prerequisite, but not the whole story. Developers feel the difference as:

fewer prompt retries
fewer "can you share more files?" loops
fewer answers grounded in the wrong module

The task benchmark models that outcome from the ranked file quality tiers:

rank 1 hit → likely one prompt
rank 2–5 hit → likely follow-up prompt
miss → likely multiple retries

Current saved score card

Tier	Meaning	Tasks	Share
Correct	Right file was ranked first	47	52.2%
Partial	Right file was present but not first	24	26.7%
Wrong	Right file never surfaced in top 5	19	21.1%

Prompt model summary

Metric	Value
Average prompts without SigMap	2.84
Average prompts with SigMap	1.66
Reduction	40.6%
Average hit@5 lift	5.8x across repo baselines

What changed in the v5 story

The earlier SigMap story was mostly "smaller context." The v5 story is more useful:

use ask to build the focused context
use validate to make sure coverage is healthy
use judge to check whether the answer was actually grounded
use learning when the same files repeatedly help or hurt

That makes the benchmark more than a marketing claim. It maps onto the actual daily workflow.

Benchmark snapshot by repo

Repo	Prompt reduction	Correct / Partial / Wrong
flask	64.8%	5 / 0 / 0
gin	43.7%	3 / 1 / 1
rails	47.2%	2 / 1 / 2
rust-analyzer	64.8%	4 / 1 / 0
serilog	26.1%	0 / 2 / 3
laravel	64.7%	2 / 3 / 0
vapor	17.7%	1 / 1 / 3
fastapi	48.9%	4 / 0 / 1

These rows show why the task benchmark matters. Some repos have great retrieval lift but still need workflow help around validation and judge-based trust.

Reproduce

bash

node scripts/run-task-benchmark.mjs --save
node scripts/run-task-benchmark.mjs --json

For the full multi-benchmark dashboard:

bash

node scripts/run-benchmark-matrix.mjs --save --skip-clone
open benchmarks/reports/benchmark-report.html

Task benchmark ​

Headline result ​

Why the task benchmark exists ​

Current saved score card ​

Prompt model summary ​

What changed in the v5 story ​

Benchmark snapshot by repo ​

Reproduce ​