Last month, my trading agent ran for 18 hours unsupervised.

It opened positions at market open, adjusted stops at three checkpoints, and filed a structured journal entry at each one. It handled a partial fill, a cancelled order, and a circuit breaker event. It sent me an iMessage when it needed a decision and stayed quiet when it didn't.

It didn't hallucinate. It didn't spiral. It didn't destroy my portfolio.

Here's what made that possible: it wasn't the model. It wasn't even the prompt.

It was the harness.

What a Harness Actually Is

Ask ten AI developers what a harness is and nine will say 'a testing framework.' That's not what I mean.

A harness — in the agentic sense — is the verification and safety layer that sits around your agent. It's the difference between an agent that can do something and one that reliably does it, survives edge cases, and fails safely when it breaks.

"A harness is not the agent. It's the environment the agent runs in." (Anthropic Engineering)

Almost nobody teaches this. There are a hundred courses on prompt engineering and, as far as I can tell, none on harness design. And yet every serious production agent system lives or dies by it.

[Figure: the environment the agent runs in, not the agent itself]

The 5 Components of a Real Harness

The 5-component stack:

  • Snapshot + rollback: before every action
  • Dry-run mode: before going live
  • Schema validation: every LLM output
  • Structured audit logging: every decision
  • Circuit breakers: 7 in my live system

1. Snapshot and rollback.

Before the agent acts, take a snapshot of state. If the action fails, roll back cleanly. In my trading system, every checkpoint writes a state file before touching positions. If the script crashes mid-run, the next checkpoint reads the last clean state and continues. No phantom positions, no double-fills.
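
Here is a minimal sketch of that pattern in Python. It is illustrative, not the actual trading system's code: the function names (save_snapshot, load_snapshot, run_with_rollback) are hypothetical, and it assumes the state fits in a single JSON file. It only rolls back local state; anything already sent to a broker would need its own reconciliation.

```python
import copy
import json
import os
import tempfile
from pathlib import Path


def save_snapshot(state: dict, path: Path) -> None:
    """Write state atomically: temp file first, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a half-written snapshot.
    os.replace(tmp_name, path)


def load_snapshot(path: Path, default: dict) -> dict:
    """Return the last clean state, or a copy of the default if no snapshot exists."""
    if not path.exists():
        return copy.deepcopy(default)
    return json.loads(path.read_text())


def run_with_rollback(state: dict, action, path: Path) -> dict:
    """Snapshot, run the action on a copy, and return the snapshot if the action fails."""
    save_snapshot(state, path)
    try:
        new_state = action(copy.deepcopy(state))
    except Exception:
        return load_snapshot(path, state)  # roll back to the last clean state
    save_snapshot(new_state, path)
    return new_state
```

The action works on a deep copy, so a failure halfway through cannot leave the caller holding a partially mutated state.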

2. Dry-run mode.

Every production action should have a dry-run equivalent. My system has DRY_RUN=1 as a gate on every order submission. I run dry-run for a full trading day before flipping live. The agent behaves identically — logs everything, makes every decision — but hits a wall before it reaches the brokerage API.
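
A rough sketch of such a gate, under some assumptions: DRY_RUN is read from the environment as in the text, but the submit_order wrapper, the send callable standing in for the brokerage call, and the choice to treat anything other than an explicit DRY_RUN=0 as dry-run are all hypothetical choices made for illustration.

```python
import os


def is_dry_run(env=None) -> bool:
    """Dry-run unless DRY_RUN is explicitly "0" (fail-safe default; an assumption)."""
    env = os.environ if env is None else env
    return env.get("DRY_RUN", "1") != "0"


def submit_order(order: dict, send, log: list, env=None) -> dict:
    """Log the decision either way; only call the broker when not in dry-run."""
    log.append({"event": "order_decided", "order": order})
    if is_dry_run(env):
        # Same decision path, same logging, but stop before the irreversible call.
        log.append({"event": "order_skipped_dry_run", "order": order})
        return {"status": "dry_run", "order": order}
    result = send(order)  # `send` is a placeholder for the real brokerage API call
    log.append({"event": "order_sent", "order": order, "result": result})
    return result
```

Putting the gate inside the single function that submits orders means there is one place to audit, rather than a flag check scattered across every call site.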

3. Schema validation.

If your agent returns JSON, validate the schema before you act on it. An agent that returns {"action": "buy", "size": null} should fail loudly at the validation layer, not silently at the execution layer. Silent failures are the worst kind — they succeed by doing nothing.
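
As a sketch of what failing loudly can look like, here is a hand-rolled validator using only the standard library. The allowed actions, the rule that size must be a positive finite number, and the names parse_decision and ValidationError are assumptions for illustration; a schema library would do the same job.

```python
import json
import math

ALLOWED_ACTIONS = {"buy", "sell"}  # hypothetical action set


class ValidationError(ValueError):
    """Raised when the model's output does not match the expected shape."""


def parse_decision(raw: str) -> dict:
    """Parse and validate a decision, raising before anything reaches execution."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValidationError("expected a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValidationError(f"unknown action: {action!r}")

    size = data.get("size")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise ValidationError(f"size must be a number, got {size!r}")
    if isinstance(size, float) and not math.isfinite(size):
        raise ValidationError(f"size must be finite, got {size!r}")
    if size <= 0:
        raise ValidationError(f"size must be positive, got {size!r}")

    return {"action": action, "size": size}
```

With this in place, {"action": "buy", "size": null} raises ValidationError at the boundary instead of flowing into order submission.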

4. Structured audit logging.

Every agent decision should be logged with: timestamp, inputs considered, decision made, action taken, outcome observed. Not for debugging — for trust. The only way to let an agent run unsupervised is to be able to reconstruct exactly what it did and why. Without this, you're flying blind.
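
One simple way to get there is an append-only file with one JSON object per line. The sketch below assumes that format and uses the five fields listed above; log_decision and read_log are hypothetical helper names, not part of any library.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_decision(path: Path, inputs: dict, decision: str, action: str, outcome: str) -> dict:
    """Append one structured record per decision, one JSON object per line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "decision": decision,
        "action": action,
        "outcome": outcome,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_log(path: Path) -> list:
    """Load every record back, in order, to reconstruct what the agent did."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is a complete record, the log stays readable even if the process dies partway through a run, and it can be filtered with ordinary text tools.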

5. Circuit breakers.

Know in advance what constitutes an unacceptable state. Consecutive losses above a threshold. An API that stops responding. A position that crosses a hard stop. Build the circuit breaker before you need it. My system has seven of them. They've all fired at least once.
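
To make one of these concrete, here is a sketch of a consecutive-loss breaker. The threshold of three, the class name LossBreaker, and the decision to stay tripped until someone resets it by hand are assumptions for illustration, not details of the seven breakers in the live system.

```python
from dataclasses import dataclass


class CircuitOpen(RuntimeError):
    """Raised when the agent tries to act while a breaker is tripped."""


@dataclass
class LossBreaker:
    max_consecutive_losses: int = 3  # hypothetical threshold
    consecutive_losses: int = 0
    tripped: bool = False

    def record(self, pnl: float) -> None:
        """Update the loss streak; trip once it reaches the limit."""
        if pnl < 0:
            self.consecutive_losses += 1
        else:
            self.consecutive_losses = 0
        if self.consecutive_losses >= self.max_consecutive_losses:
            self.tripped = True

    def check(self) -> None:
        """Call before every action; refuses to proceed once tripped."""
        if self.tripped:
            raise CircuitOpen(
                f"{self.consecutive_losses} consecutive losses, manual reset required"
            )

    def reset(self) -> None:
        """Explicit human reset; a winning trade alone does not reopen the breaker."""
        self.consecutive_losses = 0
        self.tripped = False
```

The agent calls check() before each action and record() after each closed trade, so the breaker decides whether the agent may act at all rather than advising it.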

[Figure: infrastructure for trust, the audit trail that makes unsupervised agents possible]

Why This Is a Career Skill, Not Just a Best Practice

Here is the talent gap that's about to matter enormously.

Every company is deploying agents. Most of those agents will fail in production — not because the model is bad, but because the harness doesn't exist. The agent works in the demo. It works on the test cases. It breaks at 3am on a Tuesday when an API returns a malformed response and nobody is watching.

"The engineer who can build reliable harnesses — who thinks in circuit breakers and rollbacks before they think in prompts — is the engineer every serious team needs right now."

This skill has no job title yet. It will.

The closest frame I have for it: harness engineering is what makes the difference between an agent that runs and an agent that runs in production. That gap is not small. That gap is the entire value proposition of deploying agents at all.

Where to Start

Build a harness for the smallest agent you have. Start by adding just one of these:

  • A dry-run flag that does everything except the irreversible action.
  • A state file that records what happened before the next run starts.
  • A schema check on every LLM output before you act on it.

You don't need all five components on day one. You need one, running, in production, before the agent breaks.

Because it will break.

"The question is whether you're ready when it does."