
# Quality benchmark

Token reduction is the mechanism. This page measures what it means in practice — context window fit, signature coverage, and API cost.

No LLM API key was used. All metrics are computed from:

  • raw token counts measured by `node gen-context.js --report --json` on each repo
  • published model context-window sizes from official documentation
  • published API pricing from OpenAI and Anthropic pricing pages

Reproduce with:

```bash
node scripts/run-quality-benchmark.mjs --save
```

## 1. Context window fit

A model can only process what fits in its context window. When raw source exceeds the limit, content must be truncated or selectively omitted — the LLM works with an incomplete view of the codebase.

The table below compares each repo's measured raw token count against published context limits. SigMap output always fits because the token budget is capped (default: 6,000 tokens).

| Repo | Raw tokens | GPT-4o 128K | Claude 200K | Gemini 1M | SigMap |
|---|---|---|---|---|---|
| express | 15.5K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| flask | 84.8K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| gin | 172.8K | EXCEEDS +35% | FITS ✓ | FITS ✓ | FITS ✓ |
| spring-petclinic | 77.0K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| rails | 1.5M | EXCEEDS ×12 | EXCEEDS ×7.5 | EXCEEDS +49% | FITS ✓ |
| axios | 31.7K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| rust-analyzer | 3.5M | EXCEEDS ×27 | EXCEEDS ×17 | EXCEEDS ×3.5 | FITS ✓ |
| abseil-cpp | 2.3M | EXCEEDS ×18 | EXCEEDS ×11 | EXCEEDS ×2.3 | FITS ✓ |
| serilog | 113.7K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| riverpod | 682.7K | EXCEEDS ×5.3 | EXCEEDS ×3.4 | FITS ✓ | FITS ✓ |
| okhttp | 31.3K | FITS ✓ | FITS ✓ | FITS ✓ | FITS ✓ |
| laravel | 1.7M | EXCEEDS ×13 | EXCEEDS ×8.5 | EXCEEDS +68% | FITS ✓ |
| akka | 790.5K | EXCEEDS ×6.2 | EXCEEDS ×4.0 | FITS ✓ | FITS ✓ |
| vapor | 171.2K | EXCEEDS +34% | FITS ✓ | FITS ✓ | FITS ✓ |
| vue-core | 404.2K | EXCEEDS ×3.2 | EXCEEDS ×2.0 | FITS ✓ | FITS ✓ |
| svelte | 438.2K | EXCEEDS ×3.4 | EXCEEDS ×2.2 | FITS ✓ | FITS ✓ |

10/16 repos exceed GPT-4o's 128K limit. 9/16 exceed Claude's 200K limit. With SigMap: 0/16 exceed any limit.

### What "EXCEEDS" means technically

When raw content is larger than the context window, it cannot be sent as-is. Tooling (IDEs, agents, API clients) must decide what to truncate or omit before the request is made. The LLM itself never sees the overflowing content. What gets omitted depends on the tool — there is no universal behaviour.
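The FITS/EXCEEDS labels follow directly from the ratio of raw tokens to window size. A minimal sketch — the formatting rule (percentage under 2×, multiplier above) is inferred from the table, not taken from the benchmark script:

```javascript
// Classify a raw token count against a model's context window,
// reproducing the FITS / EXCEEDS labels used in the table above.
function classify(rawTokens, windowTokens) {
  if (rawTokens <= windowTokens) return "FITS";
  const ratio = rawTokens / windowTokens;
  return ratio >= 2
    ? `EXCEEDS ×${ratio.toFixed(1)}` // e.g. riverpod vs GPT-4o: ×5.3
    : `EXCEEDS +${Math.round((ratio - 1) * 100)}%`; // e.g. gin vs GPT-4o: +35%
}
```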


## 2. Signature coverage

SigMap extracts function and class signatures from source files and writes them into a compact context file. This table measures two things that are directly countable:

  • Signatures in context (SigMap) — lines in the SigMap output file that are function/class/interface declarations, counted exactly
  • Source files not in context (no SigMap) — files that would be truncated when raw content exceeds the GPT-4o 128K limit, assuming files are included in full and sequentially

What "not in context" means: if a repo's raw source exceeds the GPT-4o 128K window and you attempt to include all files, files beyond the limit are cut. SigMap avoids this entirely because its output is always within the token budget.

| Repo | Signatures in SigMap output | Source files not in context (raw, GPT-4o limit) |
|---|---|---|
| express | 11 | 0 of 6 |
| flask | 209 | 0 of 19 |
| gin | 450 | 28 of 107 |
| spring-petclinic | 13 | 0 of 13 |
| rails | 648 | 1,079 of 1,179 |
| axios | 53 | 0 of 25 |
| rust-analyzer | 395 | 612 of 635 |
| abseil-cpp | 350 | 662 of 700 |
| serilog | 301 | 0 of 99 |
| riverpod | 672 | 363 of 446 |
| okhttp | 115 | 0 of 18 |
| laravel | 578 | 1,417 of 1,533 |
| akka | 508 | 177 of 211 |
| vapor | 364 | 34 of 131 |
| vue-core | 205 | 159 of 232 |
| svelte | 195 | 262 of 370 |

Total: 5,067 signatures extractable into context with SigMap. 4,793 source files not in context without it (raw, GPT-4o limit).
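The signature count is a line-level scan of the generated context file. A sketch of that scan — the declaration patterns here are illustrative assumptions about what a signature line looks like, not SigMap's actual matcher:

```javascript
// Patterns for lines that count as signatures. These three are
// illustrative examples only; the real tool covers more languages.
const DECLARATION_PATTERNS = [
  /^\s*(export\s+)?(async\s+)?function\s+\w+/, // function declarations
  /^\s*(export\s+)?(abstract\s+)?class\s+\w+/, // class declarations
  /^\s*(export\s+)?interface\s+\w+/,           // interface declarations
];

// Count lines in the context file that match any declaration pattern.
function countSignatures(contextFileText) {
  return contextFileText
    .split("\n")
    .filter((line) => DECLARATION_PATTERNS.some((re) => re.test(line)))
    .length;
}
```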

### Methodology notes

  • Signature count is exact: output file lines matching function/class/interface declaration patterns
  • "Files not in context" assumes worst-case: all files concatenated sequentially, truncated at 128K tokens. Real tools may use different file selection strategies.
  • SigMap output size for these repos ranges from 201 to 8,800 tokens, each within its configured budget (the default is 6,000 tokens; some repos are configured with a higher budget).
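The worst-case "files not in context" count can be sketched as a sequential fill of the window. The per-file token counts are assumed inputs (e.g. from the `--report --json` output):

```javascript
// Worst case: files are concatenated in order; count those that no
// longer fit in full once the running total passes the window limit.
function filesNotInContext(fileTokenCounts, windowTokens = 128_000) {
  let used = 0;
  let cut = 0;
  for (const tokens of fileTokenCounts) {
    used += tokens;
    if (used > windowTokens) cut += 1; // this file is truncated or omitted
  }
  return cut;
}
```

Real tools may reorder or rank files, so this is a lower bound on sophistication, not a model of any particular IDE or agent.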

## 3. API input-token cost

This is the most directly computable metric: fewer tokens sent = lower API bill. Numbers use measured rawTokens and finalTokens from the token reduction benchmark, multiplied by published per-token prices. No modelling involved.

Pricing sources: the OpenAI and Anthropic pricing pages. GPT-4o: $2.50/1M input tokens (regular), $1.25/1M (cached). Claude Sonnet: $3.00/1M (regular), $0.30/1M (cached). Baseline assumption: 10 API calls/day per repo; adjust to your actual usage.
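Each per-repo figure reduces to one multiplication. A sketch, using express at GPT-4o regular pricing as the worked example:

```javascript
// Daily input-token cost:
// (tokens per call / 1M) × price per 1M input tokens × calls per day.
function costPerDay(tokensPerCall, pricePerMillionUSD, callsPerDay) {
  return (tokensPerCall / 1_000_000) * pricePerMillionUSD * callsPerDay;
}

// express: 15.5K raw tokens, $2.50/1M, 10 calls/day
costPerDay(15_500, 2.5, 10); // ≈ $0.39/day
```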

### GPT-4o

| Repo | Raw cost/day | SigMap cost/day | Saved/day | Saved/month |
|---|---|---|---|---|
| express | $0.39 | $0.005 | $0.38 | $11.44 |
| flask | $2.12 | $0.08 | $2.04 | $61.08 |
| gin | $4.32 | $0.14 | $4.18 | $125.31 |
| spring-petclinic | $1.92 | $0.02 | $1.91 | $57.24 |
| rails | $37.36 | $0.18 | $37.18 | $1,115.36 |
| axios | $0.79 | $0.04 | $0.75 | $22.61 |
| rust-analyzer | $88.06 | $0.15 | $87.92 | $2,637.46 |
| abseil-cpp | $57.95 | $0.16 | $57.79 | $1,733.78 |
| serilog | $2.84 | $0.15 | $2.70 | $80.93 |
| riverpod | $17.07 | $0.16 | $16.91 | $507.16 |
| okhttp | $0.78 | $0.04 | $0.75 | $22.40 |
| laravel | $41.96 | $0.18 | $41.78 | $1,253.54 |
| akka | $19.76 | $0.18 | $19.59 | $587.60 |
| vapor | $4.28 | $0.16 | $4.12 | $123.58 |
| vue-core | $10.10 | $0.22 | $9.88 | $296.50 |
| svelte | $10.95 | $0.20 | $10.75 | $322.62 |
| **TOTAL** | | | **$298.54/day** | **$8,958/month** |

Claude Sonnet: $358/day · $10,750/month saved at regular pricing. At cached pricing: $35.83/day saved.


## Summary

| Metric | Source | Without SigMap | With SigMap |
|---|---|---|---|
| Repos exceeding GPT-4o 128K | Measured | 10/16 | 0/16 |
| Repos exceeding Claude 200K | Measured | 9/16 | 0/16 |
| Source files not in context (GPT-4o limit) | Measured | 4,793 | 0 |
| Signatures extractable into context | Measured (SigMap output) | 0 | 5,067 |
| GPT-4o input cost (10 calls/day, all repos) | Computed from measured tokens × pricing | ~$299/day | ~$0.43/day |

## Reproduce these numbers

```bash
# Run token reduction benchmark first (clones repos if needed)
node scripts/run-benchmark.mjs --save

# Then run quality analysis (no LLM API needed)
node scripts/run-quality-benchmark.mjs --save

# Results written to:
#   benchmarks/reports/token-reduction.json
#   benchmarks/reports/quality.json
```


MIT License