13-Agent Sprint Vol.06 // 2026
Issue 05.02 AI × Engineering

Thirteen Agents.

13 features · 1 session · One thesis overturned

Sachin Rai
· May 2026 · 9 min read

The Saturday I gave up sequential.

It started with a task list. Seven steps for the trading system's next version — expanding analytics, hardening the signal grader, adding new diagnostic tools. Work I'd been deferring for weeks because I couldn't find a 10-hour block to do it linearly. The right way. The way I always had.

I stared at the list for a few minutes and noticed something. Every task was independent. Separate file. Separate test. Separate commit. There was no reason task four needed task two to exist first. The dependency graph was completely flat. So I asked a different question: what if each task ran in parallel?

Decomposed into its smallest independent, file-sized units, the list came to thirteen features. Thirteen Claude Code agents. One per feature. Each given a scoped brief: here is the file you're touching, here is what it should do, here is how to verify it. All thirteen launched inside 10 minutes. Total elapsed time from "go" to all agents reporting done: under 90 minutes.
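For concreteness, here is a minimal sketch of what that launch can look like, assuming the Claude Code CLI is installed and supports a non-interactive prompt mode (claude -p "<brief>"); the brief text, task names, and file paths are illustrative, not the actual ones from the sprint.

```python
# Minimal sketch: launch one headless agent per scoped brief, all at once.
# Assumes the Claude Code CLI is on PATH and supports a non-interactive
# prompt mode (claude -p "<brief>"); briefs and paths are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor

BRIEFS = {
    "regime_tagged_stats": (
        "Touch only analytics/regime_stats.py. Slice every metric by market "
        "regime. Verify with: pytest tests/test_regime_stats.py"
    ),
    "counterfactual_replay": (
        "Touch only analysis/replay.py. Given a rule, re-run it over the full "
        "trade log and report the delta. Verify with: pytest tests/test_replay.py"
    ),
    # ... eleven more briefs, one per feature
}

def run_agent(name, brief):
    # Run one agent to completion and report its exit status.
    result = subprocess.run(["claude", "-p", brief], capture_output=True, text=True)
    return name, result.returncode

with ThreadPoolExecutor(max_workers=len(BRIEFS)) as pool:
    for name, code in pool.map(lambda item: run_agent(*item), BRIEFS.items()):
        status = "done" if code == 0 else f"failed (exit {code})"
        print(f"{name}: {status}")
```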

The same work done sequentially would have taken 10 to 14 hours. The ratio is roughly 10×. But the bigger lesson wasn't speed. It was what the parallel architecture revealed that a sequential run never could have.

The discipline isn't parallelising everything. It's developing the instinct for which problems are actually parallelisable. When the dependency graph is flat — and it's flat more often than you expect — sequential is just a habit, not a constraint.

What 13 agents shipped.

Each agent had a narrow mandate. Here's what came out the other end — 17 new files, 8 modified files, 3 new scheduled jobs — described in terms that translate beyond trading:

# · Feature · What it tracks
01 · Regime-tagged stats · Every metric sliced by market regime — bull, bear, sideways
02 · Score buckets · Separates trades by conviction tier, not just signal name
03 · Skipped-signal tracking · Logs what didn't trade — the negative space of the post-mortem
04 · Post-exit drift · Measures where price went after the trade closed, not just before
05 · Rule citations · Every executed trade tags which strategy rule it triggered
06 · Per-signal benchmark · Alpha decomposed per signal vs. market — not at portfolio level
07 · Slippage tracker · Measures fill quality — the silent tax on every execution
08 · Correlation matrix · Which signals fire together — clustering the portfolio's actual bets
09 · Regime-shift detection · Flags when underlying market conditions change underneath you
10 · Rule effectiveness · Do rules added 6 months ago still pull their weight today?
11 · Semantic autopsy search · Failures re-findable by theme, not just by date
12 · Counterfactual replay · What would have happened if you'd gated on a different rule?
13 · Quarterly retrospective · Calendar-driven review — not triggered by heat-of-the-moment losses

Each agent ate roughly the same context budget, running in parallel at a flat per-token cost. The marginal cost of agent 13 was not appreciably higher than that of agent 1. That arithmetic matters: in sequential work, every extra task adds real elapsed time. In parallel work, every extra task adds cost but not wall-clock delay — as long as the independence holds.

The agent that broke its sibling.

Feature 01 was regime-tagged stats. Once it landed, the first read was encouraging: the signal grader was generating a hypothesis. When you gate a specific signal to run only during certain market regimes, the apparent lift looks meaningful. Roughly speaking: the signal performed noticeably better in some regimes than others, enough that restricting it to those regimes seemed like a clear improvement.

Feature 12 was counterfactual replay. Its job was different: given any rule applied retroactively to the full historical trade log, what would the outcome have been? Not a forward simulation — a backward one. Take every trade the system ever made, filter it through the new rule, and measure the difference.
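A minimal sketch of that replay, assuming the trade log lives in a pandas DataFrame; the column names (signal, regime, pnl) and the example gating rule are illustrative, not the system's real schema.

```python
# Sketch of a counterfactual replay: apply a candidate rule retroactively
# to the full trade log and measure what it would have changed.
# Column names (signal, regime, pnl) are illustrative, not the real schema.
import pandas as pd

def replay(trades: pd.DataFrame, rule) -> dict:
    # Keep only the trades the rule would have allowed, then compare.
    kept = trades[trades.apply(rule, axis=1)]
    return {
        "actual_pnl": trades["pnl"].sum(),
        "counterfactual_pnl": kept["pnl"].sum(),
        "trades_removed": len(trades) - len(kept),
        "kept_sample_size": len(kept),  # a thin sample here is the red flag
    }

# The Feature 01 hypothesis expressed as a rule: only take "breakout"
# trades when the regime is "bull"; leave everything else unchanged.
gate = lambda t: t["signal"] != "breakout" or t["regime"] == "bull"
# report = replay(trade_log, gate)  # trade_log is the full historical log
```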

When counterfactual replay was run against the regime-gating hypothesis from Feature 01, it overturned it. The apparent lift dissolved. The sample size on the gated subset was thin enough that what looked like a pattern in a slice of the data was noise when you ran the replay across the full history. The improvement was an artefact of the slice, not a property of the signal.

This was the most important moment of the day — and not because of the trading conclusion. It was important because the parallel-agent architecture had, in the same session, built its own falsifier. Agent 12 ran the same morning as Agent 01. If the sprint had been sequential, counterfactual replay might have shipped three weeks later. The hypothesis would have been acted on first.

The rule this taught me: In any parallel sprint, design at least one agent as a falsifier, not a confirmer. Its job is to attack the outputs of the other agents. If you build 13 confirmers, you get 13 reasons to be wrong simultaneously. One falsifier changes the architecture of what the whole sprint can learn.

What went up alpha-wise. (And what didn't.)

The honest version of this post requires publishing the Q2 dry-run results, not just the features shipped. So here they are.

The equity curve over the quarter was approximately flat. The market index, over the same period, returned in the high single digits. On net, the system underperformed the index by approximately ten percentage points over the quarter. I am publishing that number.

Sprint execution — time comparison
Sequential time (estimate) ~10–14 hours
Parallel time (actual) ~90 minutes
Ratio ~10×

Most trading-system writeups omit the part where the system loses to the index. The dashboard looked fine — it always looks fine when you're measuring what you chose to measure. But the counterfactual replay and the per-signal benchmark together revealed the gap clearly. The system was generating activity. It was not generating alpha above what passive exposure would have delivered.

One signal that previously generated a five-figure dollar amount in absolute returns was simultaneously losing approximately five percentage points of alpha versus the market benchmark over the same period. Same trade. Two dashboards. Two different conclusions about whether to keep running it. Absolute dollars and alpha are not the same metric. A rising market lifts all boats, including the leaky ones. The per-signal benchmark is what separates genuine edge from the tide.
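A tiny worked example of the two dashboards, with illustrative numbers chosen only to roughly match the figures above:

```python
# Same signal, two dashboards: absolute dollars vs. alpha over the benchmark.
# Numbers are illustrative, not the actual Q2 figures.
capital = 100_000
signal_return = 0.12      # the signal made 12% on its allocated capital
benchmark_return = 0.17   # the index returned 17% over the same window

absolute_dollars = capital * signal_return   # +$12,000: looks like a keeper
alpha = signal_return - benchmark_return     # -5 points: losing to the tide

print(f"absolute: ${absolute_dollars:,.0f}, alpha: {alpha:+.1%}")
# absolute: $12,000, alpha: -5.0%
```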

This connects directly to the last post on win rate: the dashboard you celebrate is not necessarily the dashboard the market is scoring you on. Run both. The gap between them is where the honest answer lives.

[Figure: abstract data visualisation]
// absolute performance and relative alpha are two different dashboards. the gap between them is the honest answer.

Score 9–10 traded twice as well as score 7–8.

Feature 02 — score buckets — was the second most load-bearing finding of the day. The signal grader assigns a conviction score to every potential trade. High is 9–10. Medium-high is 7–8. The same signal name can appear across both bands.

When the session's score-bucket analysis landed, the number was stark: trades scored 9–10 hit a win rate roughly double that of trades scored 7–8 from the same signal. Not the same signal performing differently in different regimes. The same signal, same conditions, same market — different conviction score, dramatically different outcome.

The signal name was lying. The conviction score was the actual carrier of edge. The average was hiding a bimodal distribution. If you optimise on the average, you trade both halves equally. The lower half erodes the gains from the upper half, and you end up with a mediocre aggregate metric and no understanding of why.
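A minimal sketch of the bucket split, again assuming a pandas trade log; the rows and columns (signal, score, win) are illustrative.

```python
# Sketch: win rate split by conviction bucket instead of by signal name.
# The trade rows and columns (signal, score, win) are illustrative.
import pandas as pd

def bucket(score):
    return "9-10" if score >= 9 else "7-8" if score >= 7 else "below 7"

trades = pd.DataFrame({
    "signal": ["breakout"] * 6,
    "score":  [9.5, 9.0, 10.0, 7.5, 7.0, 8.0],
    "win":    [1, 1, 1, 0, 1, 0],
})

by_bucket = (
    trades.assign(bucket=trades["score"].map(bucket))
          .groupby("bucket")["win"]
          .mean()
)
print(by_bucket)  # the aggregate win rate hides exactly this split
```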

The translation to sales is immediate. Take any lead source that looks average in aggregate — say, inbound from a specific content channel. Now split it by how the rep scored the lead at first touch: high confidence, moderate confidence, low confidence. The pattern is almost always the same. High-confidence MQLs (marketing-qualified leads) from that source close at a multiple of low-confidence MQLs. The source name is not the variable. The conviction at first touch is the variable.

The same principle applies in engineering. If you're tracking test signal reliability, the aggregate pass rate hides the distribution. High-confidence test signals — the ones that fire deterministically on known inputs — behave completely differently from low-confidence smoke signals that fire on edge-case conditions. Aggregating them produces a number that is neither one thing nor the other. Stop looking at average. Look at the conviction-weighted average. The distribution is the answer.
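As a toy example, with made-up pass rates and confidence weights, the plain average and the conviction-weighted average of the same four signals tell different stories:

```python
# Toy example: plain average vs. conviction-weighted average of pass rates.
# Pass rates and confidence weights are made up for illustration.
pass_rate  = [0.98, 0.95, 0.60, 0.55]  # deterministic tests vs. edge-case smoke signals
confidence = [1.0,  1.0,  0.3,  0.3]   # how much you actually trust each signal

plain = sum(pass_rate) / len(pass_rate)
weighted = sum(p * c for p, c in zip(pass_rate, confidence)) / sum(confidence)

print(f"plain: {plain:.2f}, conviction-weighted: {weighted:.2f}")
# plain: 0.77 (neither one thing nor the other), weighted: 0.88
```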

Why this changes my ceiling.

The takeaway from the 13-agent sprint is not "the trading system got better." The system did improve — but the more durable finding is what happened to my sense of what fits in a Saturday.

Parallel agent sprints are a primitive, the same way a for-loop is a primitive or a batch job is a primitive. Once you've used one and seen the ratio, your intuition about how much work maps to a unit of time gets recalibrated. A Saturday is no longer a 6-hour block. It's a 6-hour block multiplied by however many independent tasks you can decompose the work into.

Three rules I'd give for running one yourself:

  1. Independence test first. If two features touch the same file, or if one agent's output is another agent's input, they cannot run as true parallel agents. Dependency equals sequential. Map the dependency graph before you launch anything; a minimal version of that check is sketched after this list. Most tasks are more independent than you think — but some that look independent are not, and conflating the two is where parallel sprints fail.
  2. At least one falsifier. Do not let all agents argue for the same conclusion. Designate one agent whose explicit brief is to attack the hypotheses that the other agents are likely to generate. In a 12-feature sprint, the 13th agent should be the one that asks whether any of the first 12 were wrong. The falsifier's job is to prevent the sprint from producing a coherent but incorrect worldview.
  3. Budget for re-merging. Parallel agents create 13 branches in your head even if you don't make them in git. When all agents report done, you owe yourself a 30-minute consolidation pass: read every output, identify any contradictions, and synthesise them into a coherent picture before acting on any one finding. Skipping the merge pass turns 13 correct findings into 13 unrelated facts with no through-line.
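Here is a minimal version of the rule-1 independence check, assuming each planned agent declares up front which files it will touch; the task names and file paths are illustrative.

```python
# Minimal independence check before launching anything: if two briefs
# declare the same file, those tasks cannot run as parallel agents.
# Task names and file paths are illustrative.
from collections import defaultdict

touches = {
    "regime_tagged_stats":   {"analytics/regime_stats.py"},
    "score_buckets":         {"analytics/score_buckets.py"},
    "counterfactual_replay": {"analysis/replay.py"},
    # ... one entry per planned agent
}

owners = defaultdict(list)
for task, files in touches.items():
    for path in files:
        owners[path].append(task)

conflicts = {path: tasks for path, tasks in owners.items() if len(tasks) > 1}
if conflicts:
    print("Not parallelisable as-is; run these sequentially:", conflicts)
else:
    print("Flat dependency graph: safe to launch in parallel.")
```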

The honest takeaway.

The system underperformed the index this quarter. The per-session dashboard said it was working. The per-signal benchmark and the counterfactual replay said the dashboard was lying. I would rather have the falsifier than the dashboard. I would rather spend 90 minutes with 13 agents than 14 hours in a linear queue.

Parallel sprints change what a weekend can produce. But they only work on problems that are genuinely decomposable — and most engineers and builders overclaim independence before they've actually drawn the dependency graph. The discipline is in the decomposition, not the parallelism itself.

The constraint on what you can ship in a day is no longer time. It's how cleanly you can decompose the problem into independent units. That's a skill. It gets better with practice. And once you've done it once, sequential starts to feel like a habit you're carrying for no reason.

"The agent that overturned my own hypothesis was the most expensive one I shipped this quarter. It was also the one that justified all twelve others."

Next post

Five Cheap Fixes, 7.5 → 9.5.

How a 90-minute lint pass took my second-brain vault from passable to ship-ready. The five fixes anyone running Obsidian + Claude can apply this week.
