How I built SigMap
Most AI coding tools fail for one reason:
They read the wrong files.
I built SigMap to fix that — without embeddings, vector databases, or external services.
What SigMap does
Before the story, the outcome — so you know what you're reading toward:
| Metric | Result |
|---|---|
| File retrieval accuracy | 78.9% hit@5 (vs 13.6% baseline — 5.8× lift) |
| Token reduction | 40–98% across 18 real open-source repos |
| Task success rate | 52.2% (vs 10% without context) |
| Prompts per task | 1.69 (down from 2.84) |
| Dependencies | Zero |
```bash
npx sigmap                           # generates context for your whole codebase
sigmap ask "where is auth handled?"  # ranked files, 450 tokens, right answer
```

No API key. No infrastructure. Works offline. Under 10 seconds.
The problem I was hitting every day
Working on real projects in Amsterdam, I'd open a chat with Claude or GPT-4, paste in some code, ask a question, and hit one of two outcomes:
- Token limit mid-response. Start over.
- Confident answer referencing functions that didn't exist. Start over.
Every session followed the same arc:
1. Open chat
2. Paste what feels like enough context
3. Get a wrong answer (or truncated)
4. Paste more context
5. Hit token limit
6. Start new session
7. Repeat

This was the LLM context problem, and it was costing real time and money.
The numbers behind the frustration
The average codebase I worked with had ~972K raw source tokens. A single "paste everything" approach costs $4.86 per GPT-4o query at the implied rate of $5 per million input tokens. At 50 queries/day that's roughly $7,288/month, before retries.
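The arithmetic behind those figures, for the record (the per-token rate is inferred from $4.86 ÷ 972K tokens, not taken from a price sheet):

```ts
// Back-of-envelope cost of "paste everything", using the numbers above.
// Assumption: ~$5 per 1M input tokens, inferred from the $4.86/query figure.
const tokensPerQuery = 972_000;
const dollarsPerMillionInputTokens = 5;

const costPerQuery = (tokensPerQuery / 1_000_000) * dollarsPerMillionInputTokens; // ≈ $4.86
const costPerMonth = costPerQuery * 50 /* queries/day */ * 30;                    // ≈ $7,290 before retries
```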
The models weren't the problem. The input was.
The aha moment
I stopped asking "how do I get the LLM to answer better?" and started asking:
What is the minimum information the LLM needs to give the right answer?
That reframe changed everything.
Example — codebase understanding query:
Query: "Where is authentication handled?"
Without SigMap:
→ AI scans random files
→ Misses middleware, entry points, token utilities
→ Hallucinates functions, gives generic answer
→ Requires 2–3 follow-up prompts
With SigMap:
→ Ranks auth-related files by signature match
→ Surfaces auth/service.ts, auth/token.ts, middleware/auth.ts
→ AI answers using actual implementation, first try
→ 1.69 prompts on average vs 2.84This is the difference between guessing and grounding.
The insight: LLMs don't need your code. They need your structure.
A function signature tells an LLM almost everything it needs to navigate a codebase:
```ts
// This 1 line replaces 80 lines of implementation body
export function loginUser(email: string, password: string): Promise<Token>
```

Signatures only, right files, right answer:

```text
Full repo (972K tokens)            → hallucinated functions, $4.86/query
Top 5 file signatures (450 tokens) → grounded answer, $0.002/query
```

Building the first version
The first version was a 200-line Node.js script — a regex walking a TypeScript project, pulling out function and class declarations:
```ts
// v0.1 — the whole thing, roughly
import { readFileSync } from 'fs';

const src = readFileSync(process.argv[2], 'utf8');
const sigs: string[] = [];

for (const line of src.split('\n')) {
  if (line.match(/^export\s+(function|class|interface)/)) {
    sigs.push(line.trim());
  }
}
```

It worked. Barely. But I ran it on a project I was debugging, pasted the output into Claude, and got the right answer in one shot. No retries. No overflow.
That was the proof. The rest was engineering.
Why not embeddings?
The first thing everyone asks. I tried it.
| | Embeddings | SigMap |
|---|---|---|
| Infrastructure required | Vector DB | None |
| API key needed | Yes | No |
| Works offline | No | Yes |
| Deterministic results | No | Yes |
| Debuggable ranking | No | Yes |
| Drift over time | Yes (reindex) | No |
Embeddings gave me semantic search. SigMap gives me deterministic AI code retrieval — same input, same output, every time. You can read the score. You can explain why a file ranked where it did. When something goes wrong, you fix it in minutes instead of retraining.
The less obvious answer was TF-IDF, a decades-old information retrieval technique that weights query terms by how often they appear in a file and how rare they are across the rest of the codebase. On function and class identifiers (not prose), it turns out to be remarkably effective:

```text
score(file, query) = Σ over query terms of  tf(term, file) × idf(term, all_files)
```

When you search for "authentication", the file with `authenticateUser`, `validateToken`, `hashPassword` scores massively higher than `formatDate` or `parseQuery`. No embeddings. No model. Pure math.
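A minimal sketch of that scoring in plain TypeScript. This is illustrative only: the tokenizer, prefix matching, and type names are assumptions, not SigMap's actual internals.

```ts
// Toy TF-IDF ranker over identifier tokens (illustrative, not SigMap's real code).
type FileTokens = { path: string; tokens: string[] };

// Split identifiers like "authenticateUser" into ["authenticate", "user"].
const tokenize = (s: string): string[] =>
  s.replace(/([a-z])([A-Z])/g, '$1 $2').toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Crude stemming: "authentication" matches "authenticate" via prefix overlap.
const matches = (token: string, term: string) =>
  token.startsWith(term) || term.startsWith(token);

function rank(files: FileTokens[], query: string): { path: string; score: number }[] {
  const terms = tokenize(query);
  const idf = (term: string) => {
    const docsWithTerm = files.filter(f => f.tokens.some(t => matches(t, term))).length;
    return Math.log((files.length + 1) / (docsWithTerm + 1)); // smoothed IDF
  };
  return files
    .map(f => ({
      path: f.path,
      score: terms.reduce(
        (sum, term) => sum + f.tokens.filter(t => matches(t, term)).length * idf(term),
        0
      ),
    }))
    // Tie-break on path so identical input always produces an identical ranking.
    .sort((a, b) => b.score - a.score || a.path.localeCompare(b.path));
}
```

The explicit tie-break is what the next section means by determinism: the same query over the same files yields the same list, every run.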
Why determinism matters for LLM context
Non-deterministic retrieval fails silently — you get different files on different runs, can't reproduce a bug, can't explain why the AI answered wrong. Deterministic retrieval fails loudly: you see exactly which file scored, and why, and you can fix it.
Refining with the LLMs themselves
Here's where it gets interesting.
I used Claude to review the extractors. Not vaguely — I'd paste the extractor code, paste a file that wasn't parsed correctly, and ask: "What edge cases does this miss?"
The feedback was immediate:
- Generic types: `function foo<T extends Bar>()` — not captured
- Arrow function exports: `export const handler = (req, res) =>` — invisible
- Stacked modifiers: `public static async` — class methods missed
- Python decorators — broke indentation-based detection
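A sketch of where those fixes pointed, assuming a regex-based extractor like the v0.1 script above. The patterns cover two of the reported gaps and are illustrative, not SigMap's actual extractor:

```ts
// Patterns for two of the reported gaps; illustrative, not SigMap's actual extractor.
const signaturePatterns: RegExp[] = [
  // export function/class/interface, now tolerating generic parameters: foo<T extends Bar>(
  /^export\s+(?:abstract\s+)?(?:async\s+)?(?:function|class|interface)\s+[A-Za-z_$][\w$]*(?:<[^>]*>)?/,
  // arrow function exports: export const handler = (req, res) =>
  /^export\s+const\s+[A-Za-z_$][\w$]*\s*=\s*(?:async\s+)?\([^)]*\)\s*=>/,
];

const isSignature = (line: string): boolean =>
  signaturePatterns.some(pattern => pattern.test(line));
```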
Each session: list of fixes. Implement. Re-run on real codebases. Paste diff back. Ask again.
This loop — build → test → LLM review → fix → repeat — compressed months of edge cases into days.
The meta-pattern
I was using an LLM to improve a tool designed to make LLMs work better. Better extractors → better context → better LLM reviews → better extractors. A genuine compounding loop.
Proof block
```text
Benchmark:        sigmap-v6.6-main
Date:             2026-05-03
Hit@5:            80.0% (baseline 13.6% — 5.9× lift)
Prompt reduction: 41.0%
Task success:     52.2% (baseline 10%)
Prompts per task: 1.68 (baseline 2.84)
Token reduction:  40–98% (avg 96.8% across 18 repos)
```

Measured on 90 coding tasks across 18 real public repos. No LLM API. Fully reproducible.
Benchmark resources:
- Full methodology →
- Benchmark suite (GitHub) — scripts and tasks
- Archived dataset (Zenodo) — for independent verification
The feedback loop that runs everything
After release, I built sigmap judge — a groundedness scorer that checks whether an AI answer is traceable to the signatures SigMap surfaced:
```bash
sigmap judge --response answer.txt --context .context/query-context.md
# Score: 0.301 ✅ PASS — response references your actual code
```

I used this as a feedback signal for every subsequent improvement. The benchmark became a mirror:
```text
v5.0   hit@5: 66.7%   task success: 38%   ← baseline
v5.8   hit@5: 76.7%   task success: 47%   ← intent detection added
v6.0   hit@5: 80.0%   task success: 52%   ← graph boosting added
```

The full loop:
```text
Build a feature
  ↓
Run benchmark suite on 18 real repos
  ↓
Ask LLM: "What edge cases does this miss?"
  ↓
Fix the edge cases
  ↓
Re-run benchmark
  ↓
Numbers improved?  → Ship
Numbers flat?      → Rethink
Numbers regressed? → Revert
  ↓
Repeat
```

Good measurement is worth more than good ideas. You can generate ideas endlessly; you can only validate them as fast as your feedback loop closes.
One concrete win: graph boosting
TF-IDF finds the direct answer files. But it misses dependencies — the types, utilities, and helpers those files import. The LLM gets the right function but not the context around it.
Fix: after scoring, build a dependency graph from import statements. 1-hop neighbours of top-scoring files get a +0.4 bonus:
```text
auth/service.ts   score: 1.8                       ← directly matches query
auth/token.ts     score: 0.4 + 0.4 (graph) = 0.8   ← imported by service.ts
utils/crypto.ts   score: 0.1 + 0.4 (graph) = 0.5   ← imported by token.ts
```

Two hours to build. Benchmark moved from 76.7% → 80.0% hit@5. That's 3.3 percentage points across 90 tasks — real queries where the right file now appears when it didn't before.
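A minimal sketch of that boost, assuming per-file scores from a TF-IDF pass and a naive import parser. The function names, the graph representation, and the import resolution are simplified assumptions; only the +0.4 bonus for 1-hop neighbours comes from the description above.

```ts
// 1-hop graph boost: files imported by the top-scoring files get +0.4.
// Illustrative; SigMap's import resolution and graph handling may differ.
type Scored = { path: string; score: number };

// Naive import extraction (relative ES imports only); feeds importGraph: path → imported files.
const importsOf = (source: string): string[] =>
  [...source.matchAll(/import\s+[^'"]*['"](\.[^'"]+)['"]/g)].map(m => m[1]);

function graphBoost(scores: Scored[], importGraph: Map<string, string[]>, topN = 5): Scored[] {
  const top = [...scores].sort((a, b) => b.score - a.score).slice(0, topN);
  const neighbours = new Set(top.flatMap(f => importGraph.get(f.path) ?? []));
  return scores
    .map(f => (neighbours.has(f.path) ? { ...f, score: f.score + 0.4 } : f))
    .sort((a, b) => b.score - a.score || a.path.localeCompare(b.path));
}
```

In the example above, auth/token.ts is a 1-hop neighbour of auth/service.ts, so it picks up the boost even though its own TF-IDF score is low.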
What I learned
1. The right abstraction beats raw power. 450 tokens of signatures outperforms 972K of raw source. Less, structured correctly, wins.
2. Determinism is underrated. Every AI-adjacent tool reached for embeddings and semantic search. TF-IDF on identifiers is transparent, debuggable, and reproducible. When it fails, you see exactly why.
3. Feedback loops compound. The investment in sigmap judge and the benchmark paid back in every subsequent release. Measure right, improve fast.
4. Zero dependencies forces clarity. Every time I wanted a library, I had to solve it in plain Node.js. The extractor is a hand-written parser. The ranker is linear algebra. The cache is a JSON file. None of it is clever — it's all just direct.
5. The real metric is prompts per task. Not hit@5, not token reduction — how many prompts until the user got what they needed. That's the number that matters. Everything else is a proxy.
Where it is now
| Feature | What it does |
|---|---|
| Signature extraction | 29 languages, zero dependencies, milliseconds |
| Deterministic retrieval | TF-IDF + graph boost, 78.9% hit@5 |
| Groundedness scoring | Catch hallucinations before they reach production |
| Learned weights | Files that helped rank higher next time |
| MCP server | 9 on-demand tools for Claude Code, Cursor, Windsurf |
| Incremental cache | Re-runs skip unchanged files |
Zero npm dependencies. Works offline. Runs on npx.
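For the incremental cache row in the table above, the shape of the idea is roughly this; a sketch under assumptions (the cache file name, hash choice, and format are mine, not SigMap's):

```ts
// Content-hash cache: unchanged files skip re-extraction on the next run.
// Illustrative only; cache path and format are assumptions, not SigMap's actual cache.
import { createHash } from 'crypto';
import { existsSync, readFileSync, writeFileSync } from 'fs';

type Cache = Record<string, { hash: string; signatures: string[] }>;
const CACHE_PATH = '.sigmap-cache.json'; // hypothetical path

const cache: Cache = existsSync(CACHE_PATH)
  ? JSON.parse(readFileSync(CACHE_PATH, 'utf8'))
  : {};

function signaturesFor(path: string, extract: (src: string) => string[]): string[] {
  const src = readFileSync(path, 'utf8');
  const hash = createHash('sha1').update(src).digest('hex');
  if (cache[path]?.hash === hash) return cache[path].signatures; // unchanged → skip re-parse
  const signatures = extract(src);
  cache[path] = { hash, signatures };
  writeFileSync(CACHE_PATH, JSON.stringify(cache, null, 2));
  return signatures;
}
```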
Why I open-sourced it
The LLM context problem (AI needs codebase understanding, but sending everything is broken) is one every developer using AI tools hits daily. The solution shouldn't require a subscription, a vector DB, or an account.
If SigMap saves you one hour a week, it was worth building.
Try it on your repo
```bash
npx sigmap
sigmap ask "where is auth handled?"
sigmap judge --response answer.txt --context .context/query-context.md
```

No install. No API key. No config. Run it on your repo and see which files it finds.
The difference between guessing and grounding is about 10 seconds.
Built in Amsterdam. Made to work everywhere.