Reliability layer

LLM skills are non-deterministic. The reliability layer (v0.2.0) handles that systematically at every probabilistic boundary, while leaving deterministic skills completely untouched. It lives in src/orchestrator.ts.

1. Confidence routing

A probabilistic skill returns a confidence (0.0–1.0). The orchestrator routes it:

| Band | Range | Action |
| --- | --- | --- |
| high | `≥ 0.85` | proceed |
| review | `≥ 0.65` and `< 0.85` | proceed, flag in the trace |
| low | `< 0.65` | retry |
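The routing table can be sketched as a small pure function. This is a minimal illustration, not the code in src/orchestrator.ts: the names `routeConfidence`, `Band`, and `Action` are hypothetical, and only the two thresholds and the three actions come from the table.

```typescript
// Hypothetical names; the thresholds (0.85, 0.65) are the ones in the table.
type Band = "high" | "review" | "low";
type Action = "proceed" | "proceed-and-flag" | "retry";

interface Route {
  band: Band;
  action: Action;
}

const HIGH_THRESHOLD = 0.85;
const REVIEW_THRESHOLD = 0.65;

// Map a skill's self-reported confidence onto a band and an action.
function routeConfidence(confidence: number): Route {
  if (confidence >= HIGH_THRESHOLD) {
    return { band: "high", action: "proceed" };
  }
  if (confidence >= REVIEW_THRESHOLD) {
    return { band: "review", action: "proceed-and-flag" };
  }
  return { band: "low", action: "retry" };
}
```

Under this sketch a confidence of exactly 0.85 lands in the high band and exactly 0.65 lands in review, which matches the `≥` and `<` bounds in the table.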

2. Auto-judge

After a probabilistic skill produces output, the orchestrator automatically runs the boundary judge on the skill's judge_blocks — there is no manual judge step in the pipeline. A judge score below the skill's confidence_threshold (default 0.80) is a failure.
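As a rough sketch of the pass/fail rule, the check below compares a judge score against the threshold. Everything except `judge_blocks`, `confidence_threshold`, and the 0.80 default is an assumption: the `Judge` signature, the `autoJudge` name, and in particular `verbatimJudge`, which is a toy stand-in that scores the fraction of blocks found verbatim in the source and is not the project's real judge.

```typescript
// Hypothetical shapes for illustration only.
interface SkillOutput {
  summary: string;
  judge_blocks: string[];
}

interface JudgeVerdict {
  score: number;
  passed: boolean;
  reason?: string;
}

type Judge = (blocks: string[], source: string) => number;

// Toy judge (assumption): fraction of blocks that appear verbatim in the source.
// An empty block list scores 0 because there is nothing to ground.
const verbatimJudge: Judge = (blocks, source) => {
  if (blocks.length === 0) return 0;
  const grounded = blocks.filter((block) => source.includes(block)).length;
  return grounded / blocks.length;
};

// A score below the threshold is a failure; a score at or above it passes.
function autoJudge(
  output: SkillOutput,
  source: string,
  judge: Judge,
  confidenceThreshold = 0.8,
): JudgeVerdict {
  const score = judge(output.judge_blocks, source);
  if (score >= confidenceThreshold) {
    return { score, passed: true };
  }
  return {
    score,
    passed: false,
    reason: `judge score ${score.toFixed(2)} below ${confidenceThreshold.toFixed(2)}`,
  };
}
```

Passing the judge in as a function keeps the threshold rule the same whether the score comes from a model or from an offline heuristic.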

3. Retry with negative context

When a probabilistic skill fails — low confidence, a failed assertion, or a judge rejection — the orchestrator re-invokes it with negative context: the previous summary plus the failure reason.

RetryContext = { attempt, previous_summary, failure_reason }

The retry budget is the skill's retries (default 2). If all attempts fail, the run halts and exposes the full diagnostic.
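The shape above might be typed and rendered into a prompt roughly as follows. The field names come from `RetryContext`; the field types, the helper names `buildRetryContext`, `renderNegativeContext`, and `maxAttempts`, the prompt wording, and the reading that `attempt` is the 1-based number of the attempt about to run are all assumptions.

```typescript
// Field names follow RetryContext above; the types are assumed.
interface RetryContext {
  attempt: number;          // assumed: 1-based number of the attempt about to run
  previous_summary: string; // what the failed attempt reported
  failure_reason: string;   // why the orchestrator rejected it
}

// Build the context for the attempt that follows a failed one.
function buildRetryContext(
  failedAttempt: number,
  previousSummary: string,
  failureReason: string,
): RetryContext {
  return {
    attempt: failedAttempt + 1,
    previous_summary: previousSummary,
    failure_reason: failureReason,
  };
}

// Hypothetical rendering of the negative context as prompt text.
function renderNegativeContext(ctx: RetryContext): string {
  return [
    `This is attempt ${ctx.attempt}.`,
    `The previous attempt produced: ${ctx.previous_summary}`,
    `It was rejected because: ${ctx.failure_reason}`,
    "Do not repeat that mistake.",
  ].join("\n");
}

// Assumed reading of the budget: one initial attempt plus `retries` re-invocations.
function maxAttempts(retries = 2): number {
  return 1 + retries;
}
```

With the default budget this reading allows three attempts in total before the run halts.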

Putting it together

The per-skill loop:

run skill (with retry context if this is a retry)
  → apply STATE writes + checkpoint
  → [probabilistic] confidence routing
  → assertions (base-assert)
  → [probabilistic] auto-judge on judge_blocks
  → pass?  record success, continue
  → fail?  retries left → re-invoke with negative context
           exhausted    → halt and expose

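Deterministic skills skip the two [probabilistic] steps and are never retried: they run, apply STATE writes, and check assertions, so the layer adds zero overhead to them.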
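The loop above can be sketched in TypeScript as a single synchronous function. This is an illustration under stated assumptions, not the orchestrator's actual code: the `Skill`, `Hooks`, and `Outcome` shapes and the `runSkill` name are hypothetical, a missing confidence is treated as 0, and the checks short-circuit on the first failure in the order the loop lists them.

```typescript
interface RetryContext {
  attempt: number;
  previous_summary: string;
  failure_reason: string;
}

interface SkillResult {
  summary: string;
  confidence?: number;
  judge_blocks?: string[];
}

// Hypothetical skill shape; retries and confidence_threshold use the documented defaults.
interface Skill {
  name: string;
  kind: "deterministic" | "probabilistic";
  retries?: number;
  confidence_threshold?: number;
  run(retry?: RetryContext): SkillResult;
  check(result: SkillResult): string | null; // assertion failure reason, or null
}

interface Hooks {
  applyState(result: SkillResult): void;  // STATE writes + checkpoint
  judge(blocks: string[]): number;        // boundary judge score in [0, 1]
}

type Outcome =
  | { status: "success"; attempts: number; flagged: boolean }
  | { status: "halted"; attempts: number; failures: string[] };

function runSkill(skill: Skill, hooks: Hooks): Outcome {
  const probabilistic = skill.kind === "probabilistic";
  // Deterministic skills get exactly one attempt; they are never retried.
  const maxAttempts = probabilistic ? 1 + (skill.retries ?? 2) : 1;
  const threshold = skill.confidence_threshold ?? 0.8;
  const failures: string[] = [];
  let retry: RetryContext | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = skill.run(retry);
    hooks.applyState(result);

    let failure: string | null = null;
    let flagged = false;

    if (probabilistic) {
      const confidence = result.confidence ?? 0; // assumption: missing means 0
      if (confidence < 0.65) {
        failure = `confidence ${confidence.toFixed(2)} below 0.65`;
      } else {
        flagged = confidence < 0.85; // review band: proceed but flag
      }
    }
    if (failure === null) {
      failure = skill.check(result);
    }
    if (failure === null && probabilistic) {
      const score = hooks.judge(result.judge_blocks ?? []);
      if (score < threshold) {
        failure = `judge score ${score.toFixed(2)} below ${threshold.toFixed(2)}`;
      }
    }

    if (failure === null) {
      return { status: "success", attempts: attempt, flagged };
    }
    failures.push(failure);
    retry = {
      attempt: attempt + 1,
      previous_summary: result.summary,
      failure_reason: failure,
    };
  }
  // Budget exhausted: halt and expose every failure reason.
  return { status: "halted", attempts: maxAttempts, failures };
}
```

In this sketch a deterministic skill makes one `run`, one `applyState`, and one `check` call, and the judge hook is never invoked for it.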

Seeing it work

Each --inject mode drives one path:

```bash
npm start -- --inject lowconf
```

↻ extract-highlights probabilistic selected 4 highlights              0ms  $0.0000
  └─ confidence 0.60 below 0.65
✓ extract-highlights probabilistic selected 3 highlights (recovered on attempt 2)  0ms  $0.0000
  └─ judge: 1.00
STATUS: SUCCESS   total: 1ms   cost: $0.0000   retries: 1

| Mode | Trigger | Outcome |
| --- | --- | --- |
| `lowconf` | confidence `0.60` < `0.65` | routing retries → recovers |
| `hallucination` | judge catches an ungrounded highlight | judge retries → recovers |
| `persistent` | every attempt stays ungrounded | retries exhausted → HALTS |
| `coverage` | deterministic assertion fails | HALTS immediately (no retry) |

Golden anchors

Probabilistic skills declare golden_anchors: worked input/output pairs that show what acceptable output looks like. They are threaded into the judge prompt so the model has a concrete reference for what "grounded" means (the offline heuristic judge ignores them).
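One way anchors could be threaded into a judge prompt is sketched below. The `GoldenAnchor` shape, the `buildJudgePrompt` name, and the prompt wording are all assumptions made for illustration; the real prompt format is not described here.

```typescript
// Hypothetical anchor shape: one worked input/output pair.
interface GoldenAnchor {
  input: string;
  output: string;
}

// Assemble a judge prompt; anchors are included only when the skill declares some.
function buildJudgePrompt(
  source: string,
  judgeBlocks: string[],
  anchors: GoldenAnchor[],
): string {
  const examples = anchors
    .map((a, i) => `Example ${i + 1}\nInput: ${a.input}\nAcceptable output: ${a.output}`)
    .join("\n\n");
  const blocks = judgeBlocks.map((b, i) => `${i + 1}. ${b}`).join("\n");

  const sections = [
    "Score from 0.0 to 1.0 how well the blocks are grounded in the source.",
    examples.length > 0 ? `Reference examples of acceptable output:\n\n${examples}` : "",
    `Source:\n${source}`,
    `Blocks to judge:\n${blocks}`,
  ];
  return sections.filter((s) => s.length > 0).join("\n\n");
}
```

Placing the examples before the source keeps the reference material separate from the content being judged.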

Measured

The layer is benchmarked by npm run bench, which runs the chain across a matrix of documents × inject-modes with the offline heuristic judge (deterministic, so the numbers are reproducible) and reads the metrics back from the real NDJSON traces:

| Metric | Result |
| --- | --- |
| Judge catch rate (fabricated highlights caught) | 100% (6/6) |
| Retry recovery rate | 66.7% (6/9) |
| Avg attempts at the probabilistic boundary | 2.0 |
| Deterministic zero-overhead | true |
| Scenario success rate | 69.2% (9/13) |

Recovery and success rates intentionally include the designed-to-fail scenarios — the persistent mode (never recoverable) and the thin-document coverage halt — so they are lower bounds, not best-case figures. The headline results are that the judge caught every fabricated highlight and that deterministic skills were never retried.

```bash
npm run bench           # print the table
npm run bench -- --save # also write metrics into version.json
```
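To illustrate how rates like these could be derived from NDJSON traces, here is a minimal sketch. The trace schema (`scenario`, `status`, `retries` as one JSON object per line) and the metric definitions are assumptions and may differ from what the bench actually records: retry recovery is taken as successful scenarios among those that retried at least once, and scenario success as successful scenarios among all of them.

```typescript
// Hypothetical per-scenario trace record; the real trace schema may differ.
interface ScenarioTrace {
  scenario: string;
  status: "SUCCESS" | "HALTED";
  retries: number;
}

// NDJSON is one JSON document per line; blank lines are skipped.
function parseNdjson(text: string): ScenarioTrace[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as ScenarioTrace);
}

// Format a ratio the way the table does, e.g. "66.7% (6/9)".
function formatRate(hits: number, total: number): string {
  if (total === 0) return "n/a (0/0)";
  return `${((hits / total) * 100).toFixed(1)}% (${hits}/${total})`;
}

function summarizeTraces(traces: ScenarioTrace[]): {
  retryRecovery: string;
  scenarioSuccess: string;
} {
  const retried = traces.filter((t) => t.retries > 0);
  const recovered = retried.filter((t) => t.status === "SUCCESS");
  const succeeded = traces.filter((t) => t.status === "SUCCESS");
  return {
    retryRecovery: formatRate(recovered.length, retried.length),
    scenarioSuccess: formatRate(succeeded.length, traces.length),
  };
}
```

Reporting the raw counts next to each percentage, as the table does, makes it clear which scenarios are in the denominator.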

See it in a second skill

summarize is a probabilistic extractive summarizer that exercises this whole layer: it returns a confidence and verbatim judge_blocks, and on a retry it takes fewer, stronger sentences so a low-confidence first attempt recovers. Watch it live with npx tsx examples/summarize.ts.
