Harness engineering 01/17

Presentation deck

Shipping a self-validating research pipeline

How a coding agent produces structured, evidence-backed company research at scale — and why the website never calls an LLM.

Yingting Huang · May 2026

What it buys you 02/17

What harness engineering buys you

Roughly 9,500 lines of harness — scripts, schemas, internal docs — produced 46 complete reports, 9,400 cited sources, and 14,400 atomic claims. Fully automated. Fully reproducible.

Every claim resolves to a quoted excerpt and a canonical URL.

The agent writes the content; the website never calls an LLM.

Static Astro build over typed YAML — no SSR, no inference at runtime.

Mental model 03/17

Don't ask the agent to write a report

Ask it to run a pipeline. That distinction does most of the work.

Pipeline 04/17

Eight chapters, one tight loop

flowchart TB
  Input[user input] --> Ctx[load runtime context]
  Ctx --> Run[create report run]
  Run --> Loop{for each<br/>of 8 chapters}
  Loop --> Gather[gather sources]
  Gather --> Draft[draft chapter YAML]
  Draft --> Gate[chapter gate]
  Gate -->|fail| Draft
  Gate -->|pass| Loop
  Loop --> Final[finalize:<br/>meta · ledger ·<br/>cross-chapter]
  Final --> Out[YAML artifacts]
  Out --> Site[Astro static build]

Live render of the article's pipeline diagram.

Every step is bound to a specific command, schema, and failure mode.
The instruction surface is under a hundred lines; policy lives in config.
The agent never has to remember the rules — it re-reads them every loop.
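
As a rough illustration of the loop in the diagram, here is a minimal Python sketch. It assumes the gather, draft, and gate steps can be modelled as plain callables; `run_pipeline`, `GateResult`, and `max_retries` are hypothetical names for this example, not the harness's real interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def run_pipeline(
    chapters: list[str],
    gather: Callable[[str], list[str]],
    draft: Callable[[str, list[str], list[str]], dict],
    gate: Callable[[str, dict], GateResult],
    max_retries: int = 3,  # assumed budget; the deck does not state one
) -> dict[str, dict]:
    """Draft each chapter and redraft until its gate passes."""
    accepted: dict[str, dict] = {}
    for chapter in chapters:
        sources = gather(chapter)
        feedback: list[str] = []
        for _attempt in range(max_retries + 1):
            candidate = draft(chapter, sources, feedback)
            result = gate(chapter, candidate)
            if result.passed:
                accepted[chapter] = candidate
                break
            # The next draft sees exactly what the gate reported.
            feedback = result.failures
        else:
            raise RuntimeError(
                f"{chapter}: gate still failing after {max_retries} retries"
            )
    return accepted
```

The point of the shape is that a chapter is only accepted by the gate, never by the drafting step itself.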

World model 05/17

YAML is the world model

Every report is a typed YAML dataset, not a markdown document.

Why typed YAML 06/17

Why typed YAML pays off

A schema defines the envelope, ID grammar, controlled vocabularies, figure contracts, and a shared regex for inline claim refs.

The agent cannot bullshit silently — every enum miss fails the gate with a one-line fix.
The renderer needs no defensive layer: figure shapes are guaranteed valid at validation time.
All 46 reports are isomorphic, so sector indexes and 'top-rated' rollups are trivial.
A CI step replays every historical report through the current schema — no quiet corruption.
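
A minimal sketch of what "every enum miss fails the gate with a one-line fix" could look like, written in Python over an already-parsed chapter. The ID grammar (`C` plus three digits), the `SOURCE_TIERS` vocabulary, and the field names are assumptions made for illustration; the deck does not show the real schema.

```python
import re

# Hypothetical ID grammar and controlled vocabulary.
CLAIM_ID_RE = re.compile(r"C\d{3}")
SOURCE_TIERS = {"primary", "secondary", "tertiary"}


def validate_chapter(chapter: dict) -> list[str]:
    """Return one-line problems; an empty list means the chapter conforms."""
    problems: list[str] = []
    known_sources = set()
    for source in chapter.get("sources", []):
        known_sources.add(source.get("id"))
        if source.get("tier") not in SOURCE_TIERS:
            problems.append(
                f"source {source.get('id')}: tier {source.get('tier')!r} "
                f"must be one of {sorted(SOURCE_TIERS)}"
            )
    for claim in chapter.get("claims", []):
        claim_id = str(claim.get("id", ""))
        if not CLAIM_ID_RE.fullmatch(claim_id):
            problems.append(f"claim {claim_id!r}: id must look like C001")
        if claim.get("source") not in known_sources:
            problems.append(f"claim {claim_id}: source is not declared")
        if not claim.get("excerpt"):
            problems.append(f"claim {claim_id}: add a quoted excerpt")
    return problems
```

Replaying every historical report through a function like this in CI is what would surface a schema change that silently invalidates old data.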

Constitution 07/17

Workflow is configuration, not prompt

One config file is the agent's brief, the validator's rulebook, and the policy layer — all at once.

What the config owns 08/17

What that single config file owns

Edit one line; the next agent loop sees the new rule. The Markdown the agent reads contains zero policy.

Chapter briefs: descriptions, planned tables and figures, evidence strategy, quality bar.
Per-chapter and per-report gates: minimum sources, distinct domains, paywall caps.
Agent policies: research rules, hard rules, retry policy, treatment of volatile facts.
The agent reads the projection of these rules — there is no second copy to drift against.
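
To make the gate idea concrete, here is a small Python sketch of source gates driven entirely by a config mapping. The key names and thresholds in `GATES` are invented for this example; only the three gate kinds (minimum sources, distinct domains, paywall caps) come from the slide.

```python
from urllib.parse import urlparse

# Hypothetical thresholds; in the harness these would live in the config file.
GATES = {"min_sources": 3, "min_distinct_domains": 2, "max_paywalled_share": 0.5}


def check_source_gates(sources: list[dict], gates: dict = GATES) -> list[str]:
    """Compare a chapter's sources against configured gates."""
    failures: list[str] = []
    domains = {
        urlparse(s["url"]).netloc.lower().removeprefix("www.") for s in sources
    }
    paywalled = sum(1 for s in sources if s.get("paywalled"))

    missing = gates["min_sources"] - len(sources)
    if missing > 0:
        failures.append(f"add {missing} more sources")
    if len(domains) < gates["min_distinct_domains"]:
        failures.append(
            f"cite at least {gates['min_distinct_domains']} distinct domains"
        )
    if sources and paywalled / len(sources) > gates["max_paywalled_share"]:
        failures.append("replace some paywalled sources with open ones")
    return failures
```

Because the thresholds are data, loosening or tightening a gate is a one-line config edit and needs no change to the checking code.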

Tool design 09/17

Why I rewrote the URL fetcher

Off-the-shelf 'fetch a URL' tools handle blogs. They don't handle SEC filings, paywalled news, anti-bot shields, or PDFs.

Fetch tool 10/17

One default command, a few escape hatches

An agent's attention is a token budget. Every parallel path is a chance to choose wrong.

Browser fingerprint rotation across TLS / HTTP-2 identity profiles, picked per host.
Per-host fast-paths plus reader-proxy and Wayback Machine fallbacks when origins refuse.
First-class PDF handling: magic-byte detect, pipe through extractor, return clean text.
Boilerplate stripping by default; a flag for product and pricing pages where chrome is the content.
Disk cache with TTL and structured JSON output, not a blob to re-parse.
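
Two of these ideas, magic-byte PDF detection and a TTL disk cache that returns structured JSON, fit in a short Python sketch. This is not the author's fetcher: `cached_fetch`, `sniff_kind`, and the record fields are hypothetical, the network call is injected as a `fetch` callable, and PDF text extraction is left out.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable


def sniff_kind(body: bytes) -> str:
    """Classify by content, not by URL suffix: PDFs start with '%PDF-'."""
    return "pdf" if body.startswith(b"%PDF-") else "html"


def cached_fetch(
    url: str,
    fetch: Callable[[str], bytes],
    cache_dir: Path,
    ttl_seconds: float = 3600,
    now: Callable[[], float] = time.time,
) -> dict:
    """Return a structured record, reusing the cached copy while it is fresh."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")
    if path.exists():
        record = json.loads(path.read_text(encoding="utf-8"))
        if now() - record["fetched_at"] < ttl_seconds:
            record["cache"] = "hit"
            return record
    body = fetch(url)
    kind = sniff_kind(body)
    record = {
        "url": url,
        "kind": kind,
        "bytes": len(body),
        # A real tool would pipe PDFs through an extractor here.
        "text": body.decode("utf-8", "replace") if kind == "html" else None,
        "fetched_at": now(),
    }
    path.write_text(json.dumps(record), encoding="utf-8")
    record["cache"] = "miss"
    return record
```

Returning a dictionary with named fields means the caller never has to re-parse a blob to learn what kind of document came back.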

Harness 11/17

Putting the agent in a well

An LLM is a strong but easily distracted executor. Surround it with deterministic code that makes its mistakes legible and its corrections cheap.

Feedback channel 12/17

The chapter gate is the agent's CI

flowchart LR
  Draft[chapter draft] --> Check[chapter gate]
  Check --> Fail{pass?}
  Fail -->|no| Out[structured output:<br/>failures · warnings ·<br/>cascade-suppressed ·<br/>retry order]
  Out --> Fix[agent fixes top<br/>of retry order]
  Fix --> Draft
  Fail -->|yes| Next[next chapter]

Live render of the harness's draft-and-gate loop.

The validator returns structured entries, not prose.
The agent switches on a stable enum, never NLP-parses a message.
Failures are sorted upstream-first so fixes don't re-break what just resolved.
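
A sketch of "structured entries, not prose" in Python, assuming the gate emits JSON. The two dimension names appear elsewhere in this deck; the JSON field names (`dimension`, `objectId`, `fix`) and the helper functions are hypothetical.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Dimension(str, Enum):
    TABLE_SHAPE = "tableShape"
    CLAIM_REF_MISSING = "claimRefMissing"


@dataclass(frozen=True)
class Failure:
    dimension: Dimension
    object_id: str
    fix: str


def parse_failures(payload: str) -> list[Failure]:
    """Decode gate output; an unknown dimension raises ValueError."""
    return [
        Failure(Dimension(item["dimension"]), item["objectId"], item["fix"])
        for item in json.loads(payload)["failures"]
    ]


def plan_fix(failure: Failure) -> str:
    """Branch on the enum value, never on the wording of a message."""
    if failure.dimension is Dimension.TABLE_SHAPE:
        return f"rewrite the rows of {failure.object_id}"
    if failure.dimension is Dimension.CLAIM_REF_MISSING:
        return f"attach a claim ref in {failure.object_id}"
    raise ValueError(f"unhandled dimension: {failure.dimension}")
```

If the validator rewords a message, nothing downstream breaks, because the message text is never what the agent switches on.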

Four levers 13/17

Four levers turn the loop into CI

Stable failure dimensions: about 50 enum values such as tableShape or claimRefMissing, so the agent never has to parse prose.
One-line fixes per dimension, back-filled with specifics: 'add 2 more sources, one primary-tier'.
Cascade suppression: mark root causes; hide derivative failures until the root cause is fixed, then retest them automatically.
Retry precedence: upstream causes first, then corroboration, depth, and references.
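
Cascade suppression and retry precedence can be sketched together in a few lines of Python. The `stage` and `caused_by` fields and the `PRECEDENCE` buckets are assumptions for this example; they follow the order named on the slide but are not the harness's real data model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical buckets, in the order the slide names them.
PRECEDENCE = ["upstream", "corroboration", "depth", "references"]


@dataclass
class Failure:
    dimension: str
    object_id: str
    stage: str                       # one of PRECEDENCE
    caused_by: Optional[str] = None  # object_id of the root cause, if any


def retry_order(failures: list[Failure]) -> tuple[list[Failure], list[Failure]]:
    """Split failures into (visible, suppressed) and order the visible ones."""
    failing_objects = {f.object_id for f in failures}
    visible: list[Failure] = []
    suppressed: list[Failure] = []
    for failure in failures:
        # A derivative failure stays hidden while its root cause is still open.
        if failure.caused_by in failing_objects:
            suppressed.append(failure)
        else:
            visible.append(failure)
    visible.sort(key=lambda f: PRECEDENCE.index(f.stage))
    return visible, suppressed
```

Once the root cause stops failing, its object ID drops out of `failing_objects`, so anything it caused becomes visible again on the next gate run without extra bookkeeping.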

High leverage 14/17

Smaller, high-leverage additions

Object-level aggregation: 'table T102 has 2 problems' beats two free-floating entries.
Global hints when a dimension fires on three or more objects — fix the pattern, not the symptom.
Acknowledged warnings: dismiss with a written justification instead of inventing content.
Multiple output formats: text for humans, grep-friendly for shells, JSON for programs.
Fail-safe retry budget: each retry must strictly reduce the failure count, or the loop stops.
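
Three of the additions above (object-level aggregation, global hints, and the retry budget) reduce to small pure functions. The Python sketch below assumes failures arrive as `(object_id, dimension)` pairs; the function names and the default `hint_threshold` parameter are illustrative choices, with the threshold of three taken from the slide.

```python
from collections import Counter, defaultdict


def aggregate(
    failures: list[tuple[str, str]], hint_threshold: int = 3
) -> tuple[dict[str, list[str]], list[str]]:
    """Group failures by object and flag dimensions that hit many objects."""
    by_object: dict[str, list[str]] = defaultdict(list)
    for object_id, dimension in failures:
        by_object[object_id].append(dimension)

    # Count each dimension once per object so one noisy object cannot
    # trigger a global hint on its own.
    objects_per_dimension: Counter = Counter()
    for dimensions in by_object.values():
        objects_per_dimension.update(set(dimensions))

    hints = sorted(
        d for d, n in objects_per_dimension.items() if n >= hint_threshold
    )
    return dict(by_object), hints


def within_budget(failure_counts: list[int]) -> bool:
    """True while every retry strictly lowered the failure count."""
    return all(
        later < earlier
        for earlier, later in zip(failure_counts, failure_counts[1:])
    )
```

A run whose counts go 5, 3, 3 stops at the third attempt, which keeps a stuck agent from looping on the same failure.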

Render edge 15/17

The website is intentionally boring

Static HTML, no database, no SSR, no runtime LLM. The semantic content is decided at build time.

Render pipeline 16/17

Renderer and validator share their contracts

flowchart LR
  YAML[YAML reports] --> Loader[Astro content loader]
  Loader --> Pages[run pages · sectors ·<br/>top-rated · search]
  Pages --> Renderers[diligence renderer ·<br/>figure renderer ·<br/>table renderer]
  Contracts[shared figure<br/>contracts] --> Renderers
  Contracts --> Validator[chapter gate]
  Renderers --> HTML[static HTML +<br/>print stylesheet]

Live render of the site's intentionally boring pipeline.

The renderer contract file is also imported by the validator — they cannot drift.
Inline claim refs share their regex with the validator; one toggle hides every reference.
Native window.print() plus CSS Paged Media gives clean A4 PDFs without Puppeteer.
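
The "cannot drift" property comes from both sides importing one definition. The sketch below shows the idea in Python rather than in the site's own Astro code, and the inline syntax `[[C001]]` is an assumed placeholder, since the deck does not show the real claim-ref format.

```python
import re

# Single shared definition; hypothetical syntax for an inline claim ref.
CLAIM_REF_RE = re.compile(r"\[\[(C\d{3})\]\]")
# Same pattern plus leading whitespace, used only when hiding references.
_HIDE_RE = re.compile(r"\s*" + CLAIM_REF_RE.pattern)


def find_unknown_refs(text: str, known_ids: set[str]) -> list[str]:
    """Validator side: list refs that point at no declared claim."""
    return [cid for cid in CLAIM_REF_RE.findall(text) if cid not in known_ids]


def render_inline(text: str, show_refs: bool = True) -> str:
    """Renderer side: link each ref, or strip them all with one toggle."""
    if not show_refs:
        return _HIDE_RE.sub("", text)
    return CLAIM_REF_RE.sub(
        lambda m: f'<sup><a href="#{m.group(1)}">{m.group(1)}</a></sup>', text
    )
```

Anything the validator accepts as a reference is, by construction, something the renderer will recognise, because there is only one pattern to change.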

Takeaways 17/17

Three sentences to take away

YAML is the world model, not a rendering format. Every bug has a single home.
Workflow is configuration, not prompt. The agent reads a projection of the config, never a second copy of the rules.
The validator is the agent's feedback channel: stable dimensions, one-line fixes, cascade suppression, retry precedence.
Roughly 9,500 lines of harness, 46 reports, 14,400 cited claims. Invest in the well; the agent does the digging.