Reliability layer

LLM skills are non-deterministic. The reliability layer (v0.2.0) handles that systematically at every probabilistic boundary, while leaving deterministic skills completely untouched. It lives in src/orchestrator.ts.

1. Confidence routing

A probabilistic skill returns a confidence (0.0–1.0). The orchestrator routes it:

| Band | Range | Action |
| --- | --- | --- |
| high | ≥ 0.85 | proceed |
| review | 0.65 – 0.85 | proceed, flag in the trace |
| low | < 0.65 | retry |
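
The routing rule can be sketched as a small pure function. This is a minimal illustration, not the code in src/orchestrator.ts: the names `routeConfidence`, `Band`, and `Action` are hypothetical, and it assumes that exactly 0.85 counts as high and exactly 0.65 counts as review, which is how the ranges above read.

```typescript
// Hypothetical names; only the thresholds and actions come from the routing table.
type Band = "high" | "review" | "low";
type Action = "proceed" | "proceed-flagged" | "retry";

const HIGH_THRESHOLD = 0.85;
const LOW_THRESHOLD = 0.65;

function routeConfidence(confidence: number): { band: Band; action: Action } {
  // Reject values outside the documented 0.0 to 1.0 range.
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
  }
  if (confidence >= HIGH_THRESHOLD) return { band: "high", action: "proceed" };
  // Assumption: the review band includes 0.65 and excludes 0.85.
  if (confidence >= LOW_THRESHOLD) return { band: "review", action: "proceed-flagged" };
  return { band: "low", action: "retry" };
}
```

With these assumptions, `routeConfidence(0.6)` lands in the low band and asks for a retry, which is the case the `lowconf` inject mode exercises.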

2. Auto-judge

After a probabilistic skill produces output, the orchestrator automatically runs the boundary judge on the skill's judge_blocks — there is no manual judge step in the pipeline. A judge score below the skill's confidence_threshold (default 0.80) is a failure.
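
A rough sketch of the pass/fail decision follows. Only `judge_blocks` and `confidence_threshold` come from the text above; `SkillOutput`, `JudgeVerdict`, `autoJudge`, and the toy `verbatimJudge` are hypothetical. The toy judge scores the fraction of blocks found verbatim in the source and is a stand-in, not the project's boundary judge. A real judge that calls a model would likely be asynchronous.

```typescript
// Hypothetical shapes for illustration only.
interface SkillOutput {
  summary: string;
  judge_blocks: string[];
}

interface JudgeVerdict {
  score: number; // 0.0 to 1.0
  passed: boolean;
  reason?: string; // set only when the judge rejects the output
}

type Judge = (blocks: string[]) => number;

const DEFAULT_CONFIDENCE_THRESHOLD = 0.8;

// Toy stand-in judge: fraction of blocks that appear verbatim in the source text.
function verbatimJudge(source: string): Judge {
  return (blocks) =>
    blocks.length === 0 ? 0 : blocks.filter((b) => source.includes(b)).length / blocks.length;
}

function autoJudge(
  output: SkillOutput,
  judge: Judge,
  confidenceThreshold: number = DEFAULT_CONFIDENCE_THRESHOLD,
): JudgeVerdict {
  const score = judge(output.judge_blocks);
  // A score below the threshold is a failure; a score equal to it passes.
  if (score >= confidenceThreshold) return { score, passed: true };
  return {
    score,
    passed: false,
    reason: `judge score ${score.toFixed(2)} below ${confidenceThreshold.toFixed(2)}`,
  };
}
```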

3. Retry with negative context

When a probabilistic skill fails — low confidence, a failed assertion, or a judge rejection — the orchestrator re-invokes it with negative context: the previous summary plus the failure reason.

RetryContext = { attempt, previous_summary, failure_reason }

The retry budget is the skill's retries (default 2). If all attempts fail, the run halts and exposes the full diagnostic.
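
The retry behaviour might look like the loop below. It is a sketch under two assumptions the text does not settle: a budget of `retries` allows `retries + 1` attempts in total, and `attempt` in the retry context is the number of the attempt about to run. `Attempt`, `Skill`, and `runWithRetries` are hypothetical names.

```typescript
interface RetryContext {
  attempt: number;
  previous_summary: string;
  failure_reason: string;
}

// Hypothetical result shape: an undefined failure means the attempt passed every check.
interface Attempt {
  summary: string;
  failure?: string;
}

type Skill = (retry?: RetryContext) => Attempt;

function runWithRetries(
  skill: Skill,
  retries: number = 2,
): { ok: boolean; attempts: number; history: Attempt[] } {
  const history: Attempt[] = [];
  let retry: RetryContext | undefined;

  // Assumption: one initial attempt plus up to `retries` re-invocations.
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const result = skill(retry);
    history.push(result);
    if (result.failure === undefined) return { ok: true, attempts: attempt, history };

    // Negative context for the next attempt: what was produced and why it failed.
    retry = {
      attempt: attempt + 1,
      previous_summary: result.summary,
      failure_reason: result.failure,
    };
  }
  // Budget exhausted: the caller halts and exposes `history` as the diagnostic.
  return { ok: false, attempts: retries + 1, history };
}
```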

Putting it together

The per-skill loop:

run skill (with retry context if this is a retry)
  → apply STATE writes + checkpoint
  → [probabilistic] confidence routing
  → assertions (base-assert)
  → [probabilistic] auto-judge on judge_blocks
  → pass?  record success, continue
  → fail?  retries left → re-invoke with negative context
           exhausted    → halt and expose

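Deterministic skills run only the steps not marked [probabilistic]: the run itself, the STATE writes and checkpoint, and the assertions. They pay no reliability overhead.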
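
The branch at the end of the loop can be sketched as one decision function. The thresholds come from the routing table and the default `confidence_threshold`; everything else (`SkillKind`, `AttemptResult`, `Decision`, `decide`) is a hypothetical shape, and treating a missing confidence or judge score as 0 is an assumption. Checks are evaluated in the order the loop lists them.

```typescript
type SkillKind = "deterministic" | "probabilistic";

// Hypothetical record of what one attempt produced.
interface AttemptResult {
  confidence?: number; // probabilistic skills only
  assertionsPassed: boolean; // outcome of base-assert
  judgeScore?: number; // probabilistic skills only
}

type Decision =
  | { next: "continue"; flagged: boolean }
  | { next: "retry"; reason: string }
  | { next: "halt"; reason: string };

function decide(
  kind: SkillKind,
  result: AttemptResult,
  retriesLeft: number,
  confidenceThreshold: number = 0.8,
): Decision {
  // Only probabilistic skills with budget left are re-invoked; everything else halts.
  const fail = (reason: string): Decision =>
    kind === "probabilistic" && retriesLeft > 0
      ? { next: "retry", reason }
      : { next: "halt", reason };

  let flagged = false;
  if (kind === "probabilistic") {
    const confidence = result.confidence ?? 0; // assumption: missing counts as 0
    if (confidence < 0.65) return fail(`confidence ${confidence.toFixed(2)} below 0.65`);
    flagged = confidence < 0.85; // review band: proceed, but flag in the trace
  }

  if (!result.assertionsPassed) return fail("assertion failed");

  if (kind === "probabilistic") {
    const score = result.judgeScore ?? 0; // assumption: missing counts as 0
    if (score < confidenceThreshold) {
      return fail(`judge score ${score.toFixed(2)} below ${confidenceThreshold.toFixed(2)}`);
    }
  }
  return { next: "continue", flagged };
}
```

In this sketch a deterministic skill never reaches the confidence or judge branches, and a failed assertion on a deterministic skill halts regardless of `retriesLeft`.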

Seeing it work

Each --inject mode drives one path:

```bash
npm start -- --inject lowconf
```

```
↻ extract-highlights probabilistic selected 4 highlights              0ms  $0.0000
  └─ confidence 0.60 below 0.65
✓ extract-highlights probabilistic selected 3 highlights (recovered on attempt 2)  0ms  $0.0000
  └─ judge: 1.00
STATUS: SUCCESS   total: 1ms   cost: $0.0000   retries: 1
```

| Mode | Trigger | Outcome |
| --- | --- | --- |
| lowconf | confidence 0.60 < 0.65 | routing retries → recovers |
| hallucination | judge catches an ungrounded highlight | judge retries → recovers |
| persistent | every attempt stays ungrounded | retries exhausted → HALTS |
| coverage | deterministic assertion fails | HALTS immediately (no retry) |

Golden anchors

Probabilistic skills declare golden_anchors — worked input/output examples of acceptable output. They are threaded into the judge prompt so the model has a concrete reference for what "grounded" looks like (the offline heuristic ignores them).
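
One way anchors could be threaded into a judge prompt is sketched below. The prompt wording, the `GoldenAnchor` shape, and `buildJudgePrompt` are all hypothetical; the source only states that anchors are input/output examples included in the judge prompt.

```typescript
// Hypothetical shape: one worked example of acceptable output.
interface GoldenAnchor {
  input: string;
  output: string;
}

function buildJudgePrompt(
  sourceText: string,
  judgeBlocks: string[],
  anchors: GoldenAnchor[],
): string {
  // Render each anchor as a numbered reference example.
  const examples = anchors
    .map((a, i) => `Example ${i + 1}\nInput: ${a.input}\nAcceptable output: ${a.output}`)
    .join("\n\n");
  const candidates = judgeBlocks.map((b, i) => `${i + 1}. ${b}`).join("\n");

  return [
    "Score from 0.0 to 1.0 how well the candidate blocks are grounded in the source.",
    examples.length > 0 ? `Reference examples of acceptable output:\n\n${examples}` : "",
    `Source:\n${sourceText}`,
    `Candidate blocks:\n${candidates}`,
  ]
    .filter((part) => part.length > 0) // drop the examples section when no anchors exist
    .join("\n\n");
}
```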

Measured

The layer is benchmarked by npm run bench, which runs the chain across a matrix of documents × inject-modes with the offline heuristic judge (deterministic, so the numbers are reproducible) and reads the metrics back from the real NDJSON traces:

| Metric | Result |
| --- | --- |
| Judge catch rate (fabricated highlights caught) | 100% (6/6) |
| Retry recovery rate | 66.7% (6/9) |
| Avg attempts at the probabilistic boundary | 2.0 |
| Deterministic zero-overhead | true |
| Scenario success rate | 69.2% (9/13) |

Recovery and success rates intentionally include the designed-to-fail scenarios — the persistent mode (never recoverable) and the thin-document coverage halt — so they are lower bounds, not best-case figures. The headline results are that the judge caught every fabricated highlight and that deterministic skills were never retried.
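
The two rates can be reproduced from per-scenario records along the following lines. The trace schema here (`scenario`, `status`, `retries`) is invented for illustration and is not the project's NDJSON format. The definition of recovery as "succeeded among scenarios that retried at least once" is one reading that is consistent with the reported counts, not a confirmed formula.

```typescript
// Hypothetical per-scenario record; the real trace fields may differ.
interface ScenarioRecord {
  scenario: string;
  status: "SUCCESS" | "HALTED";
  retries: number;
}

// NDJSON is one JSON object per line; blank lines are skipped.
function parseNdjson(text: string): ScenarioRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as ScenarioRecord);
}

function formatRate(hits: number, total: number): string {
  if (total === 0) return "n/a (0/0)";
  const pct = (hits / total) * 100;
  // Whole percentages print without decimals, others with one decimal place.
  const shown = Number.isInteger(pct) ? pct.toFixed(0) : pct.toFixed(1);
  return `${shown}% (${hits}/${total})`;
}

function summarizeRuns(records: ScenarioRecord[]): { recovery: string; success: string } {
  const retried = records.filter((r) => r.retries > 0);
  const recovered = retried.filter((r) => r.status === "SUCCESS");
  const succeeded = records.filter((r) => r.status === "SUCCESS");
  return {
    recovery: formatRate(recovered.length, retried.length),
    success: formatRate(succeeded.length, records.length),
  };
}
```

As an arithmetic check on the table, `formatRate(6, 9)` gives `66.7% (6/9)` and `formatRate(9, 13)` gives `69.2% (9/13)`.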

```bash
npm run bench           # print the table
npm run bench -- --save # also write metrics into version.json
```

See it in a second skill

summarize is a probabilistic extractive summarizer that exercises this whole layer: it returns a confidence and verbatim judge_blocks, and on a retry it takes fewer, stronger sentences so a low-confidence first attempt recovers. Watch it live with npx tsx examples/summarize.ts.
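
To make the "fewer, stronger sentences" idea concrete, here is a toy extractive summarizer. It is not the code in examples/summarize.ts: the word-frequency scoring, the confidence formula, and the `maxSentences` parameter are all assumptions chosen so that dropping the weakest sentence on a retry cannot lower the confidence.

```typescript
interface RetryContext {
  attempt: number;
  previous_summary: string;
  failure_reason: string;
}

interface SummarizeResult {
  summary: string;
  confidence: number; // 0.0 to 1.0
  judge_blocks: string[]; // sentences copied verbatim from the input
}

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function summarize(text: string, retry?: RetryContext, maxSentences: number = 3): SummarizeResult {
  // Naive sentence split on ., ! and ? (text without a terminator yields no sentences).
  const sentences = (text.match(/[^.!?]+[.!?]+/g) ?? []).map((s) => s.trim()).filter((s) => s.length > 0);
  if (sentences.length === 0) return { summary: "", confidence: 0, judge_blocks: [] };

  // Assumed scoring: mean document frequency of the words in each sentence.
  const freq = new Map<string, number>();
  for (const w of tokenize(text)) freq.set(w, (freq.get(w) ?? 0) + 1);
  const scored = sentences.map((sentence, index) => {
    const ws = tokenize(sentence);
    const score = ws.length === 0 ? 0 : ws.reduce((sum, w) => sum + (freq.get(w) ?? 0), 0) / ws.length;
    return { sentence, index, score };
  });

  // On a retry, keep one sentence fewer (never fewer than one).
  const keep = Math.max(1, retry ? maxSentences - 1 : maxSentences);
  const top = [...scored].sort((a, b) => b.score - a.score).slice(0, keep);
  const best = top[0].score;

  // Assumed confidence: mean selected score relative to the best sentence's score.
  const mean = top.reduce((sum, t) => sum + t.score, 0) / top.length;
  const confidence = best > 0 ? mean / best : 0;

  // Restore document order so the summary reads naturally.
  const judge_blocks = [...top].sort((a, b) => a.index - b.index).map((t) => t.sentence);
  return { summary: judge_blocks.join(" "), confidence, judge_blocks };
}
```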

MIT License