Harness engineering 01/17

Presentation deck

Shipping a self-validating research pipeline

How a coding agent produces structured, evidence-backed company research at scale — and why the website never calls an LLM.

Yingting Huang · May 2026

What it buys you 02/17

What harness engineering buys you

Roughly 9,500 lines of harness — scripts, schemas, internal docs — produced 46 complete reports, 9,400 cited sources, and 14,400 atomic claims. Fully automated. Fully reproducible.

Every claim resolves to a quoted excerpt and a canonical URL.

The agent writes the content; the website never calls an LLM.

Static Astro build over typed YAML — no SSR, no inference at runtime.

Mental model 03/17

Don't ask the agent to write a report

Ask it to run a pipeline. That distinction does most of the work.

Pipeline 04/17

Eight chapters, one tight loop

flowchart TB
  Input[user input] --> Ctx[load runtime context]
  Ctx --> Run[create report run]
  Run --> Loop{for each<br/>of 8 chapters}
  Loop --> Gather[gather sources]
  Gather --> Draft[draft chapter YAML]
  Draft --> Gate[chapter gate]
  Gate -->|fail| Draft
  Gate -->|pass| Loop
  Loop --> Final[finalize:<br/>meta · ledger ·<br/>cross-chapter]
  Final --> Out[YAML artifacts]
  Out --> Site[Astro static build]
Live render of the article's pipeline diagram.
  • Every step is bound to a specific command, schema, and failure mode.
  • The instruction surface is under a hundred lines; policy lives in config.
  • The agent never has to remember the rules — it re-reads them every loop.
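The draft-and-gate loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the project's code: `run_pipeline`, `GateResult`, the stub functions, and the `max_attempts` cap are all assumed names and values, and the gather and finalize steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable

CHAPTER_COUNT = 8  # the deck's pipeline has eight chapters


@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def run_pipeline(
    draft_chapter: Callable[[int, list[str]], dict],
    chapter_gate: Callable[[dict], GateResult],
    max_attempts: int = 5,  # assumed cap; the deck does not state one
) -> list[dict]:
    """Draft every chapter, re-drafting until its gate passes."""
    accepted: list[dict] = []
    for chapter_no in range(1, CHAPTER_COUNT + 1):
        feedback: list[str] = []
        for _ in range(max_attempts):
            draft = draft_chapter(chapter_no, feedback)
            result = chapter_gate(draft)
            if result.passed:
                accepted.append(draft)
                break
            feedback = result.failures  # the next draft sees the gate output
        else:
            raise RuntimeError(f"chapter {chapter_no} never passed its gate")
    return accepted


# Stub agent and gate, for illustration only.
def stub_draft(chapter_no: int, feedback: list[str]) -> dict:
    return {"chapter": chapter_no, "sources": 3 if feedback else 1}


def stub_gate(draft: dict) -> GateResult:
    if draft["sources"] >= 2:
        return GateResult(passed=True)
    return GateResult(passed=False, failures=["add more sources"])


chapters = run_pipeline(stub_draft, stub_gate)
```

The agent is passed in as a plain callable, so the loop itself stays deterministic and the only thing that varies between attempts is the feedback handed back from the gate.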

World model 05/17

YAML is the world model

Every report is a typed YAML dataset, not a markdown document.

Why typed YAML 06/17

Why typed YAML pays off

A schema defines the envelope, ID grammar, controlled vocabularies, figure contracts, and a shared regex for inline claim refs.

  • The agent cannot bullshit silently — every enum miss fails the gate with a one-line fix.
  • The renderer needs no defensive layer: figure shapes are guaranteed valid at validation time.
  • All 46 reports are isomorphic, so sector indexes and 'top-rated' rollups are trivial.
  • A CI step replays every historical report through the current schema — no quiet corruption.
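A rough sketch of what schema checks of this kind might look like on an already-parsed chapter. The envelope keys, the `C001`-style ID grammar, and the tier vocabulary are assumptions made for illustration; the deck does not show the real schema. The excerpt and URL checks follow the deck's statement that every claim resolves to a quoted excerpt and a canonical URL.

```python
import re

# All names below are illustrative assumptions, not the project's real schema.
REQUIRED_KEYS = {"id", "title", "claims"}            # the "envelope"
CLAIM_ID = re.compile(r"C\d{3}")                     # assumed ID grammar
SOURCE_TIERS = {"primary", "secondary", "tertiary"}  # assumed vocabulary


def validate_chapter(chapter: dict) -> list[str]:
    """Return one-line failures; an empty list means the chapter is valid."""
    missing = sorted(REQUIRED_KEYS - set(chapter))
    failures = [f"missing key: {key}" for key in missing]
    for claim in chapter.get("claims", []):
        cid = str(claim.get("id", ""))
        if not CLAIM_ID.fullmatch(cid):
            failures.append(f"claim id {cid!r} must look like C001")
        tier = claim.get("tier")
        if tier not in SOURCE_TIERS:
            allowed = ", ".join(sorted(SOURCE_TIERS))
            failures.append(f"{cid}: tier {tier!r} is not one of {allowed}")
        for field_name in ("excerpt", "url"):  # every claim needs both
            if not claim.get(field_name):
                failures.append(f"{cid}: add a non-empty {field_name}")
    return failures
```

Because the function is pure and takes plain dictionaries, the same call could be replayed over every historical report in CI.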

Constitution 07/17

Workflow is configuration, not prompt

One config file is the agent's brief, the validator's rulebook, and the policy layer — all at once.

What the config owns 08/17

What that single config file owns

Edit one line; the next agent loop sees the new rule. The Markdown the agent reads contains zero policy.

  • Chapter briefs: descriptions, planned tables and figures, evidence strategy, quality bar.
  • Per-chapter and per-report gates: minimum sources, distinct domains, paywall caps.
  • Agent policies: research rules, hard rules, retry policy, treatment of volatile facts.
  • The agent reads the projection of these rules — there is no second copy to drift against.
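To make the idea of config-owned gates concrete, here is a small sketch in which the thresholds live in data and the checking code holds no policy of its own. The key names (`min_sources`, `min_distinct_domains`, `max_paywalled`) and the numbers are hypothetical; the deck only names the kinds of gate.

```python
from urllib.parse import urlparse

# Hypothetical config fragment; the key names and thresholds are assumptions.
CONFIG = {
    "chapter_gates": {"min_sources": 4, "min_distinct_domains": 3, "max_paywalled": 1}
}


def check_source_gates(sources: list[dict], config: dict) -> list[str]:
    """Compare a chapter's sources against thresholds read from the config."""
    gates = config["chapter_gates"]
    failures = []
    shortfall = gates["min_sources"] - len(sources)
    if shortfall > 0:
        failures.append(f"add {shortfall} more source(s)")
    domains = {urlparse(s["url"]).hostname for s in sources}
    if len(domains) < gates["min_distinct_domains"]:
        needed = gates["min_distinct_domains"]
        failures.append(f"sources span {len(domains)} domain(s); need {needed}")
    paywalled = sum(1 for s in sources if s.get("paywalled", False))
    if paywalled > gates["max_paywalled"]:
        cap = gates["max_paywalled"]
        failures.append(f"{paywalled} paywalled sources; cap is {cap}")
    return failures


sources = [
    {"url": "https://example.com/annual-report", "paywalled": False},
    {"url": "https://example.com/press", "paywalled": True},
    {"url": "https://example.org/analysis", "paywalled": True},
]
strict = check_source_gates(sources, CONFIG)
relaxed_config = {
    "chapter_gates": {"min_sources": 3, "min_distinct_domains": 2, "max_paywalled": 2}
}
relaxed = check_source_gates(sources, relaxed_config)
```

The same three sources fail every gate under the first config and pass under the second, without any change to the checking code.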

Tool design 09/17

Why I rewrote the URL fetcher

Off-the-shelf 'fetch a URL' tools handle blogs. They don't handle SEC filings, paywalled news, anti-bot shields, or PDFs.

Fetch tool 10/17

One default command, a few escape hatches

An agent's attention is a token budget. Every parallel path is a chance to choose wrong.

  • Browser fingerprint rotation across TLS / HTTP-2 identity profiles, picked per host.
  • Per-host fast-paths plus reader-proxy and Wayback Machine fallbacks when origins refuse.
  • First-class PDF handling: magic-byte detect, pipe through extractor, return clean text.
  • Boilerplate stripping by default; a flag for product and pricing pages where chrome is the content.
  • Disk cache with TTL and structured JSON output, not a blob to re-parse.
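Two of the ideas above, magic-byte PDF detection and a TTL disk cache that returns structured JSON, fit in a short sketch. The function names, the record fields, and the one-hour default TTL are assumptions; the network layer is injected as a callable so the example runs offline, and text extraction is left out.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def detect_kind(body: bytes) -> str:
    """PDF files start with the magic bytes %PDF-; check those, not headers."""
    return "pdf" if body.startswith(b"%PDF-") else "html"


def cached_fetch(
    url: str,
    fetch: Callable[[str], bytes],
    cache_dir: Path,
    ttl_seconds: float = 3600.0,  # assumed default
) -> dict:
    """Return a structured record, reusing the cached copy while it is fresh."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        record = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - record["fetched_at"] < ttl_seconds:
            return {**record, "cache": "hit"}
    body = fetch(url)
    record = {
        "url": url,
        "kind": detect_kind(body),
        "bytes": len(body),
        "fetched_at": time.time(),
    }
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record), encoding="utf-8")
    return {**record, "cache": "miss"}


calls: list[str] = []


def fake_fetch(url: str) -> bytes:  # stands in for the real network layer
    calls.append(url)
    return b"%PDF-1.7 example"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/filing.pdf", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.com/filing.pdf", fake_fetch, Path(tmp))
```

The second call is served from disk, so the fake network function runs only once.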

Harness 11/17

Putting the agent in a well

An LLM is a strong but easily distracted executor. Surround it with deterministic code that makes its mistakes legible and its corrections cheap.

Feedback channel 12/17

The chapter gate is the agent's CI

flowchart LR
  Draft[chapter draft] --> Check[chapter gate]
  Check --> Pass{pass?}
  Pass -->|no| Out[structured output:<br/>failures · warnings ·<br/>cascade-suppressed ·<br/>retry order]
  Out --> Fix[agent fixes top<br/>of retry order]
  Fix --> Draft
  Pass -->|yes| Next[next chapter]
Live render of the harness's draft-and-gate loop.
  • The validator returns structured entries, not prose.
  • The agent switches on a stable enum, never NLP-parses a message.
  • Failures are sorted upstream-first so fixes don't re-break what just resolved.
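A sketch of what "switching on a stable enum" could look like on the consuming side. `tableShape` and `claimRefMissing` are dimension names from the deck; `sourceCount`, the entry keys (`dimension`, `object`, `fix`), and the action strings are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    TABLE_SHAPE = "tableShape"             # named in the deck
    CLAIM_REF_MISSING = "claimRefMissing"  # named in the deck
    SOURCE_COUNT = "sourceCount"           # hypothetical


@dataclass(frozen=True)
class Failure:
    dimension: Dimension
    object_id: str
    fix: str  # the one-line fix supplied by the validator


def parse_entry(entry: dict) -> Failure:
    """Turn one structured validator entry into a typed failure."""
    return Failure(Dimension(entry["dimension"]), entry["object"], entry["fix"])


def next_action(failure: Failure) -> str:
    """Branch on the enum; the message text is never inspected."""
    if failure.dimension is Dimension.TABLE_SHAPE:
        return f"rebuild table {failure.object_id}"
    if failure.dimension is Dimension.CLAIM_REF_MISSING:
        return f"attach a claim ref in {failure.object_id}"
    return f"gather sources for {failure.object_id}"
```

An unknown dimension string raises `ValueError` in `parse_entry`, so a renamed or misspelled dimension surfaces immediately instead of being silently ignored.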

Four levers 13/17

Four levers turn the loop into CI

  • Stable failure dimensions: about 50 enum values such as tableShape or claimRefMissing, so nothing is NLP-parsed.
  • One-line fixes per dimension, back-filled with specifics: 'add 2 more sources, one primary-tier'.
  • Cascade suppression: mark root causes; hide derivative failures until they retest automatically.
  • Retry precedence: upstream causes first, then corroboration, depth, and references.
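Cascade suppression and retry precedence can be illustrated together. In this sketch a failure carries an optional pointer to the object that caused it; derivative failures are hidden while their root cause is still failing, and what remains is sorted in the order the slide gives. The `stage` and `caused_by` fields and the `sourceMissing` dimension are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Precedence follows the order listed on the slide.
PRECEDENCE = {"upstream": 0, "corroboration": 1, "depth": 2, "references": 3}


@dataclass(frozen=True)
class Failure:
    dimension: str
    object_id: str
    stage: str                       # one of PRECEDENCE's keys
    caused_by: Optional[str] = None  # object_id of the root cause, if any


def retry_order(failures: list[Failure]) -> tuple[list[Failure], list[Failure]]:
    """Split failures into (visible, suppressed); visible is sorted for retry."""
    root_ids = {f.object_id for f in failures if f.caused_by is None}
    visible, suppressed = [], []
    for failure in failures:
        if failure.caused_by in root_ids:
            suppressed.append(failure)  # retested once its root cause is fixed
        else:
            visible.append(failure)
    visible.sort(key=lambda f: PRECEDENCE[f.stage])  # stable sort keeps ties in order
    return visible, suppressed


failures = [
    Failure("claimRefMissing", "C014", "references", caused_by="S003"),
    Failure("tableShape", "T102", "depth"),
    Failure("sourceMissing", "S003", "upstream"),  # hypothetical dimension
]
visible, suppressed = retry_order(failures)
```

Here the missing source `S003` is worked on first, the table second, and the claim that depends on `S003` stays hidden until the source is fixed and the gate reruns.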

High leverage 14/17

Smaller, high-leverage additions

  • Object-level aggregation: 'table T102 has 2 problems' beats two free-floating entries.
  • Global hints when a dimension fires on three or more objects — fix the pattern, not the symptom.
  • Acknowledged warnings: dismiss with a written justification instead of inventing content.
  • Multiple output formats: text for humans, grep-friendly for shells, JSON for programs.
  • Fail-safe retry budget: each retry must strictly reduce the failure count or stop.
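Three of these additions are small enough to sketch directly: grouping failures by object, raising a global hint when one dimension hits three or more objects, and a retry budget that requires strict progress. Failures are modeled here as `(object_id, dimension)` pairs; that shape, the function names, and the `unitMissing` dimension are assumptions.

```python
from collections import defaultdict


def aggregate_by_object(failures: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group dimensions under the object they belong to."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for object_id, dimension in failures:
        grouped[object_id].append(dimension)
    return dict(grouped)


def global_hints(failures: list[tuple[str, str]], threshold: int = 3) -> list[str]:
    """Dimensions that fire on `threshold` or more distinct objects."""
    objects: dict[str, set[str]] = defaultdict(set)
    for object_id, dimension in failures:
        objects[dimension].add(object_id)
    return sorted(d for d, ids in objects.items() if len(ids) >= threshold)


def within_budget(failure_counts: list[int]) -> bool:
    """True while every retry strictly reduced the failure count."""
    pairs = zip(failure_counts, failure_counts[1:])
    return all(later < earlier for earlier, later in pairs)


failures = [
    ("T102", "tableShape"),
    ("T102", "unitMissing"),  # hypothetical dimension
    ("T103", "unitMissing"),
    ("F201", "unitMissing"),
]
by_object = aggregate_by_object(failures)
hints = global_hints(failures)
```

With these inputs, `T102` is reported once with two problems, and `unitMissing` is flagged as a pattern because it touches three objects.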

Render edge 15/17

The website is intentionally boring

Static HTML, no database, no SSR, no runtime LLM. The semantic content is decided at build time.

Render pipeline 16/17

Renderer and validator share their contracts

flowchart LR
  YAML[YAML reports] --> Loader[Astro content loader]
  Loader --> Pages[run pages · sectors ·<br/>top-rated · search]
  Pages --> Renderers[diligence renderer ·<br/>figure renderer ·<br/>table renderer]
  Contracts[shared figure<br/>contracts] --> Renderers
  Contracts --> Validator[chapter gate]
  Renderers --> HTML[static HTML +<br/>print stylesheet]
Live render of the site's intentionally boring pipeline.
  • The renderer contract file is also imported by the validator — they cannot drift.
  • Inline claim refs share their regex with the validator; one toggle hides every reference.
  • Native window.print() plus CSS Paged Media gives clean A4 PDFs without Puppeteer.
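The "shared regex" point can be shown with one pattern used by both a validator function and a renderer function. This is a Python sketch of the idea only: the real site is built with Astro, and the `[C001]` reference syntax, the function names, and the `<sup>` markup are all assumptions.

```python
import re

# One definition, imported by both sides in a real project.
CLAIM_REF = re.compile(r"\[(C\d{3})\]")  # hypothetical inline ref syntax


def missing_refs(text: str, known_ids: set[str]) -> list[str]:
    """Validator side: refs in the text that point at no known claim."""
    return [ref for ref in CLAIM_REF.findall(text) if ref not in known_ids]


def render(text: str, show_refs: bool = True) -> str:
    """Renderer side: mark up refs, or strip them when the toggle is off."""
    if show_refs:
        return CLAIM_REF.sub(lambda m: f"<sup>{m.group(1)}</sup>", text)
    # Also remove the whitespace before a ref so no gap is left behind.
    return re.sub(r"\s*" + CLAIM_REF.pattern, "", text)


sentence = "Revenue grew 12% [C001]."
shown = render(sentence)
hidden = render(sentence, show_refs=False)
```

Because both functions read `CLAIM_REF`, a change to the reference syntax cannot reach the renderer without also reaching the validator.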

Takeaways 17/17

Four points to take away

  1. YAML is the world model, not a rendering format. Every bug has a single home.
  2. Workflow is configuration, not prompt. The agent reads a projection of the config, never a second copy of the rules.
  3. The validator is the agent's feedback channel: stable dimensions, one-line fixes, cascade suppression, retry precedence.
  4. Roughly 9,500 lines of harness, 46 reports, 14,400 atomic claims. Invest in the well; the agent does the digging.