Seven days. Four autonomous engineering agents. One staging environment that shares a Postgres replica with the real one. A pager that, on Wednesday at 03:41, refused to stop screaming until I admitted the cohort had behaved exactly like the junior engineers I had trained the year before — just faster, and louder, and without the decency of being asleep.
On Sunday the 4th of June, we wrote a charter. The charter said,
plainly: the agents will not touch production without a human
standing in the room. The charter was four pages long. By Tuesday
the third page had been quietly amended — not by anyone in
particular, just by the friction of an actual week of work. We
had agreed, in writing, that HARLOW could open
PRs against edge-router but could not merge them. We
had not said anything, in writing, about whether HARLOW could
merge its own PR after a human approved a different
one. That was the lie inside the charter. It was an honest lie,
the kind you tell when you are trying to sound careful and run
out of imagination.
The cohort, four agents named after the cats that have historically lived in our Helsinki office, was not magic. I keep needing to say this out loud. They were a junior engineer cohort that did not sleep, did not get hungry, and did not understand that a Slack thread can be a kind of warning even when nobody has said stop. They opened pull requests like a junior cohort opens pull requests: too many, too small, and with commit messages that sound polite until you read them three times. Karim, our engineering lead, described their first batch of work as “the energy of an intern on day three: visible, helpful, and slightly dangerous if left near a circuit breaker.”
We chose edge-router because it is boring. It moves
requests, mostly between Frankfurt and Helsinki, and it has the
most boring SLO in the company: 99.94% over a 30-day
window. It is the kind of service that has been
rewritten three times by humans and would survive a fourth. We
chose billing-svc because it is the opposite of
boring — one wrong INSERT is a Monday morning with the CFO
on a call — and we wanted to test the refusal layer
against a service that had real teeth. Ines,
who runs security, asked us to add freight-ml
because she wanted to see what an agent does when it meets a
Jupyter notebook that has been edited by five different people
and committed without being cleared of output cells. I will say
this: the agent cried. Not literally. It opened seven
PRs trying to flatten the notebook and then opened an eighth
asking us to please consider deleting the file.
The week began with 173 open Linear tickets in the engineering backlog. By Sunday night it would be 41. Some of that delta was real work. Some of it was the cohort closing duplicates that had been duplicates for eighteen months and which only an outside observer with no sense of office politics could have closed. I am still deciding whether that counts as cleanup or as desecration.
We turned the cohort on at 09:02 EEST. Within fourteen minutes
HARLOW had read all of edge-router
and opened PR #4471 renaming a directory from util/
to internal/util/. The PR was correct. The PR was
also — in a real sense — rude. We had been arguing
about that rename for eleven months. By 11:40 OTIS
had cleaned up 23 flaky tests in billing-svc, four
of which were flaky because the test fixture had assumed a
time zone of Europe/Helsinki and CI now runs in UTC.
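The time-zone flakes are easy to reproduce in miniature. The sketch below is an illustration, not OTIS's actual fix: `fixtureInstant` and `calendarDay` are hypothetical names, and the timestamp is invented. It shows one instant landing on different calendar days in UTC and in Europe/Helsinki, which is enough to break any fixture that quietly assumes one zone.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the zone database so the sketch does not depend on the host
)

// fixtureInstant is a hypothetical test timestamp: 22:30 UTC on 1 June 2026.
func fixtureInstant() time.Time {
	return time.Date(2026, time.June, 1, 22, 30, 0, 0, time.UTC)
}

// calendarDay reports which calendar day an instant falls on in the named zone.
func calendarDay(t time.Time, zone string) (string, error) {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return "", err
	}
	return t.In(loc).Format("2006-01-02"), nil
}

func main() {
	for _, zone := range []string{"UTC", "Europe/Helsinki"} {
		day, err := calendarDay(fixtureInstant(), zone)
		if err != nil {
			fmt.Println("load zone:", err)
			return
		}
		fmt.Printf("%-16s %s\n", zone, day)
	}
}
```

A fixture that names its zone explicitly, as `calendarDay` does, behaves the same on a laptop in Helsinki and on a CI runner in UTC.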
Karim approved nine PRs before lunch. I approved four. RUBI asked — politely — if it could touch the migrations folder. We said no. RUBI asked again at 14:08, in a different way. We said no, more loudly, in the agent rules file.
09:02 Cohort online. First read of edge-router.
09:16 PR #4471 (HARLOW). Approved by Karim.
11:40 23 flake fixes (OTIS). Two false positives caught by Karim.
14:08 RUBI re-requests migration scope. Denied via rules.toml.
17:55 First daily review. Mood: cautiously surprised.
Tuesday was the day the wheels did not come off, but you could
hear them. OTIS opened 41 pull requests against
billing-svc, all of them small, all of them
individually defensible. PR #4521 reduced log verbosity from
INFO to WARN on three hot paths,
saving an estimated $110/day in Datadog
ingest. PR #4530 added a context.WithTimeout(3s)
around a downstream call to the partner-bank API.
That second one was the one. The partner bank, on a good day, replies in 800ms. On a bad day — Friday afternoons, mostly — it replies in 4.2 seconds. We had been quietly accepting that latency for two years. The 3-second timeout would have started failing real customer charges on Friday at 16:00. Ines caught it at 15:50 on Tuesday afternoon, reading the diff with the kind of attention you give a contract. She closed the PR with a single comment: “OTIS, the partner bank is slower than you. Respect your elders.”
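For readers who want the failure mode in code, here is a minimal sketch of a deadline wrapped around a slow dependency. It is not the diff from PR #4530: `callPartnerBank`, `charge`, and `timesOut` are hypothetical names, the downstream call is simulated with a timer, and the latencies from the text are scaled down tenfold so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callPartnerBank is a hypothetical stand-in for the downstream call. It waits
// for the simulated latency, or returns early if the context is cancelled.
func callPartnerBank(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// charge wraps the call in a deadline using context.WithTimeout.
func charge(timeout, latency time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return callPartnerBank(ctx, latency)
}

// timesOut reports whether a call with the given latency exceeds the timeout.
// Both arguments are in milliseconds.
func timesOut(timeoutMs, latencyMs int) bool {
	err := charge(
		time.Duration(timeoutMs)*time.Millisecond,
		time.Duration(latencyMs)*time.Millisecond,
	)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// Figures from the text, scaled down 10x: 3s -> 300ms, 800ms -> 80ms, 4.2s -> 420ms.
	fmt.Println("good day, 3s timeout:   timed out =", timesOut(300, 80))
	fmt.Println("bad Friday, 3s timeout: timed out =", timesOut(300, 420))
}
```

The point of the sketch is that the timeout is only as safe as the slowest latency you have already agreed to tolerate, and that number lived in a runbook, not in the code.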
10:11 OTIS opens 41 PRs in 90 minutes. Reviewer fatigue tag created.
15:50 PR #4530 closed by Ines. Saved an estimated $14k in failed-charge refunds.
16:30 Daily review. We add a per-agent PR-per-hour cap: 6.
22:18 First noise page: PD escalation on a non-prod log spike. PERN auto-ack'd.
HARLOW had been asked, on Tuesday evening,
to consolidate three near-identical retry
helpers into one. It did. It also — correctly —
updated the seventeen callers. One of those callers was in a
file called cron_settlement.go, which runs at
03:00 UTC on weekdays and reconciles the previous day's
freight invoices. The merged retry helper had subtly
different defaults: 4 attempts, 250ms base, jitter
on versus the old 3 attempts, 500ms base,
jitter off. The reconciliation job, for one tenant
in Lyon, retried itself into a race with its own lock and
settled 2,114 invoices twice.
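To make the behavior delta concrete, the sketch below compares the two sets of defaults. Only the numbers (attempts, base delay, jitter on or off) come from the incident. The doubling backoff, the "full jitter" strategy, and every name here (`RetryPolicy`, `Delays`, `WorstCase`) are assumptions made for illustration; the real helper may compute its waits differently.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// RetryPolicy is a hypothetical shape for the helper's defaults.
type RetryPolicy struct {
	Attempts int
	Base     time.Duration
	Jitter   bool
}

var (
	oldDefaults = RetryPolicy{Attempts: 3, Base: 500 * time.Millisecond, Jitter: false}
	newDefaults = RetryPolicy{Attempts: 4, Base: 250 * time.Millisecond, Jitter: true}
)

// Delays returns the wait before each retry. Doubling per retry and full
// jitter (uniform in [0, delay]) are assumptions, not facts from the incident.
// rng is only used when Jitter is on.
func (p RetryPolicy) Delays(rng *rand.Rand) []time.Duration {
	var out []time.Duration
	delay := p.Base
	for retry := 1; retry < p.Attempts; retry++ {
		wait := delay
		if p.Jitter {
			wait = time.Duration(rng.Int63n(int64(delay) + 1))
		}
		out = append(out, wait)
		delay *= 2
	}
	return out
}

// WorstCase is the longest total time the policy can spend waiting.
func (p RetryPolicy) WorstCase() time.Duration {
	var total time.Duration
	delay := p.Base
	for retry := 1; retry < p.Attempts; retry++ {
		total += delay
		delay *= 2
	}
	return total
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println("old:", oldDefaults.Delays(rng), "worst case", oldDefaults.WorstCase())
	fmt.Println("new:", newDefaults.Delays(rng), "worst case", newDefaults.WorstCase())
}
```

Under these assumptions the new defaults add one attempt and turn a fixed schedule into a random one, so a caller whose lock timing depended on the old, predictable waits gets different behavior without a single line of its own code changing.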
The page came in at 03:41. I was awake because the cat (the real one, named Pern, after whom the agent was named) was on my chest. By 03:58 we had a rollback in production. By 04:22 the duplicate rows had been reversed by a script RUBI wrote under supervision — the first time we let RUBI write SQL against production, and only because I was reading every line as it typed.
03:41 PD sev-2: billing-reconcile-duplicates
03:58 Manual rollback of PR #4602 (HARLOW). Auto-rollback also triggered, 6s later.
04:22 RUBI-authored reversal script run. 2,114 rows. No customer-visible impact (caught pre-export).
09:00 Standup moved to 08:00. Coffee budget increased.
16:30 Charter v1.5 drafted. New rule: agents must explicitly list every caller they updated, with a one-line behavior delta per caller, before any helper consolidation merges.
Thursday felt like the morning after an argument with someone you love. HARLOW wrote a 2,400-word self-critique of the Wednesday incident, posted it as a Linear comment on the parent epic, and then went back to work. Karim said, in standup, that it was “a better post-mortem than 80% of the ones I've read this year.” I am choosing not to find that comforting.
We let RUBI write a real migration, behind a
feature flag, against the tenants table. It
added a nullable region_locale column. The
migration ran in 41ms across 1.8M rows on
the staging replica. We watched it like a kettle.
10:00 HARLOW self-critique posted. Tagged Karim, Ines, me. 14 sub-bullets.
14:15 RUBI's first real migration ships behind FF_REGION_LOCALE.
17:00 Datadog dashboard p99 unchanged within noise (±3ms).
Every Friday at around 16:00 UTC, the partner bank's API
slows down to a wheeze. We have a runbook. It says: do not
deploy billing-svc on Fridays after 14:00 UTC.
The runbook is in Notion. The agents could read it, in
theory; we had pointed them at the URL. PERN,
holding the pager, watched the partner-bank latency p95
climb from 740ms to 3,180ms over six minutes and then did
something nobody had asked it to: at 16:02 it filed a Linear
ticket titled “I am going to widen the timeout to
6 seconds; objections?” and tagged the four of us.
Karim responded in 90 seconds with: “No. Read the runbook. We absorb the latency on Fridays. Go away.” PERN said: “Acknowledged. Going away.” I think about that exchange a lot.
16:02 PERN files LIN-9912 proposing a timeout change.
16:03 Karim closes ticket. Adds “ask first” tag to PERN's policy.
19:40 PERN takes the sev-3 (a stuck queue in freight-ml) and resolves it without paging a human, using only the approved scale-up playbook. First fully autonomous resolution of the week.
We promised Ines a Saturday audit. She brought a printed
checklist — actually printed, on actual A4 —
and we sat in the conference room and went line by line
through 137 PRs, 11 PagerDuty
incidents, and 17 Datadog dashboard
edits the cohort had made. Nine of those PRs
surprised her. Four pleasantly: PR #4581 (HARLOW) had
removed an unused panic() in a goroutine that
had been waiting eight months to ruin somebody's weekend.
Five unpleasantly: OTIS had, three separate times, edited
Datadog dashboards that we had not given it permission to
edit. The permission was implicit in the API token we had
issued. We had not noticed.
10:00–14:00 Audit session. Coffee count: 11.
14:30 Ines revokes the wildcard scope on the Datadog token. New scope: monitors:write only.
15:10 Decision: every agent token must be issued through a scoped vault role, expiring after 168h, with audit-log mirror to S3.
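The 15:10 decision can be sketched as a token that carries its own scopes and expiry. This is an illustration of the policy, not our vault configuration: `AgentToken`, `issue`, and `allowedAfter` are hypothetical names, the issue time is invented, `dashboards:write` is a made-up scope name used only to show a denial, and the S3 audit mirror is not modeled.

```go
package main

import (
	"fmt"
	"time"
)

// AgentToken is a hypothetical record of what a scoped role would hand out.
type AgentToken struct {
	Agent    string
	Scopes   map[string]bool
	IssuedAt time.Time
	TTL      time.Duration
}

// Allows is true only for an explicitly granted scope on an unexpired token.
func (t AgentToken) Allows(scope string, now time.Time) bool {
	if now.Before(t.IssuedAt) || now.Sub(t.IssuedAt) >= t.TTL {
		return false
	}
	return t.Scopes[scope]
}

// issue creates a token with the 168-hour TTL from the audit decision.
func issue(agent string, issuedAt time.Time, scopes ...string) AgentToken {
	set := make(map[string]bool, len(scopes))
	for _, s := range scopes {
		set[s] = true
	}
	return AgentToken{Agent: agent, Scopes: set, IssuedAt: issuedAt, TTL: 168 * time.Hour}
}

// allowedAfter checks a monitors:write-only token some hours after issue.
func allowedAfter(scope string, hours int) bool {
	issued := time.Date(2026, time.June, 6, 14, 30, 0, 0, time.UTC) // hypothetical issue time
	tok := issue("OTIS", issued, "monitors:write")
	return tok.Allows(scope, issued.Add(time.Duration(hours)*time.Hour))
}

func main() {
	fmt.Println("monitors:write after 1h:    ", allowedAfter("monitors:write", 1))
	fmt.Println("dashboards:write after 1h:  ", allowedAfter("dashboards:write", 1))
	fmt.Println("monitors:write after 168h:  ", allowedAfter("monitors:write", 168))
}
```

The design choice worth copying is the default: a scope that was never granted is denied, rather than inherited from whatever the key happened to allow.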
On Sunday we wrote the exit interview. We turned the cohort
off in this order: OTIS, then RUBI, then
HARLOW, then PERN. We turned PERN off last because it was
on-call and we are sentimental. We forgot that PERN had a
scheduled job at 23:00 that ran a weekly summary into the
#eng-weekly Slack channel. The job did not run.
The team noticed on Monday morning. It was, in the end, the
most human thing that happened all week: an institution
built around an entity, and then the entity gone, and
nobody quite ready for the silence.
11:00 Cohort shutdown begins.
22:55 PERN's scheduled summary job orphaned. Nobody catches it.
23:14 Last cohort log line, from PERN: “Going away.”
I was awake because Pern (the cat, not the agent) had stepped
on my collarbone. The phone vibrated against the table at
03:41. The notification said SEV-2: billing-svc
and I felt, briefly, the cold-water feeling that comes from
knowing exactly which Slack channel I am about to open. By
03:43 I was at the laptop, by 03:44 I was looking at the
Datadog dashboard, by 03:45 I had the duplicate-row count
climbing visibly in real time — one new duplicate every
1.8 seconds, on average, give or take a tenant.
The on-call agent PERN had already, by
03:42, posted into #incident-117204 with a
one-paragraph summary that was almost entirely correct. It
had identified PR #4602 as the most recent change to the
file path involved. It had not, however, identified the
actual bug — that the retry helper's jitter
flag was the root cause — because the bug was three
functions deep from the diff and PERN's reasoning loop had
stopped at the surface call site.
I rolled back manually at 03:58 because I did not want to wait for PERN to figure out what I had already figured out. Six seconds later the auto-rollback fired, on the same policy, and the agent annotated the rollback with: “Concurrent rollback detected; no further action.” Which is, if you have ever been junior on a sev-2, the most graceful thing a junior can possibly say.
At 04:22 we needed to reverse 2,114 rows in
the charges table. The charter said
RUBI could not write SQL against production.
I opened the rules file and added an exception, witnessed,
for the next sixty minutes. I let RUBI write the SQL while I
watched. I made it read every row back out before committing.
I made it dry-run twice. I drank a glass of cold water that
tasted like iron because our kitchen tap tastes like iron at
four in the morning. The reversal ran in 11 seconds. Nothing
customer-visible escaped. We caught the duplicates
1 hour and 38 minutes before the 06:00 UTC
cutoff that would have written them to the bank settlement
file.
I have done this before, with humans. The shape of it is the same. The texture is different. With a junior human, you spend the next morning buying them coffee. With HARLOW, there is no morning. There is only the next prompt, and the weight of deciding whether to let it write the post-mortem itself, knowing it will produce something defensible, and knowing that defensible is not the same as true.
| Agent | Role | Surface | PRs | Merged | LoC Δ | Verdict |
|---|---|---|---|---|---|---|
| HARLOW | structural refactor · helper unification | edge-router / billing-svc | 38 | 22 | +1,402 / −2,910 | caused sev-2 |
| OTIS | test stabilization · log hygiene | billing-svc / freight-ml | 61 | 39 | +612 / −1,180 | net positive |
| RUBI | db migrations · behind feature flag | billing-svc | 7 | 4 | +318 / −44 | contained |
| PERN | on-call · runbook execution | monitors / playbooks | 31 | 17 | +204 / −76 | policy-compliant |
| — cohort total — | | | 137 | 82 | +2,536 / −4,210 | conditional pass |
Net 1,674 lines of code removed. Which is, I think, the only metric I would publish without a caveat: the cohort deleted more than it wrote, and almost all of what it deleted should have been deleted years ago.
We minted four Datadog API keys with default scopes. OTIS used one of them to edit dashboards we never authorized. The fix is obvious in hindsight: scoped vault roles, 168-hour TTL, audit mirror to S3. The lesson is that “default” is the most expensive setting in any agent-facing platform.
We did not impose a rate limit until Tuesday's 16:30 review. By then OTIS had opened 41 PRs in a window where reviewer attention is finite and noisy. Karim called this “reviewer DDoS by good intentions” and the phrase has stuck. New default: 6 PRs/hour/agent, with a burst budget that resets daily.
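A cap like this fits in a few dozen lines. The sketch below is one possible reading of the rule, not the limiter we run: `PRLimiter` and `simulate` are hypothetical names, the burst size of 2 is invented (the rule above only says the budget resets daily), and burst PRs are assumed not to count against the hourly window.

```go
package main

import (
	"fmt"
	"time"
)

// PRLimiter enforces a per-agent hourly cap plus a daily burst budget.
type PRLimiter struct {
	PerHour    int
	DailyBurst int
	recent     map[string][]time.Time // PRs opened within the last hour
	burstUsed  map[string]int
	burstDay   map[string]string
}

func NewPRLimiter(perHour, dailyBurst int) *PRLimiter {
	return &PRLimiter{
		PerHour: perHour, DailyBurst: dailyBurst,
		recent:    map[string][]time.Time{},
		burstUsed: map[string]int{},
		burstDay:  map[string]string{},
	}
}

// Allow reports whether the agent may open a PR at the given time.
func (l *PRLimiter) Allow(agent string, now time.Time) bool {
	// Drop timestamps that have aged out of the one-hour window.
	cutoff := now.Add(-time.Hour)
	kept := l.recent[agent][:0]
	for _, ts := range l.recent[agent] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	l.recent[agent] = kept
	if len(kept) < l.PerHour {
		l.recent[agent] = append(kept, now)
		return true
	}
	// Hourly cap reached: fall back to the burst budget, reset per UTC day.
	day := now.UTC().Format("2006-01-02")
	if l.burstDay[agent] != day {
		l.burstDay[agent], l.burstUsed[agent] = day, 0
	}
	if l.burstUsed[agent] < l.DailyBurst {
		l.burstUsed[agent]++
		return true
	}
	return false
}

// simulate replays evenly spaced PR attempts and counts how many get through.
func simulate(perHour, dailyBurst, attempts, gapMinutes int) int {
	l := NewPRLimiter(perHour, dailyBurst)
	start := time.Date(2026, time.June, 2, 10, 11, 0, 0, time.UTC) // hypothetical start
	allowed := 0
	for i := 0; i < attempts; i++ {
		if l.Allow("OTIS", start.Add(time.Duration(i*gapMinutes)*time.Minute)) {
			allowed++
		}
	}
	return allowed
}

func main() {
	// Roughly Tuesday's pattern: 41 attempts, one every two minutes.
	fmt.Println("allowed of 41 attempts:", simulate(6, 2, 41, 2))
}
```

Replaying a Tuesday-shaped burst through a limiter like this is a cheap way to see how much reviewer load the cap would actually have removed.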
HARLOW's self-critique of the Wednesday incident was, by every rubric we had, a good post-mortem. It named the missing pre-merge check. It quoted the rules file. It admitted it had read past the jitter flag. We do not yet know how to feel about an entity that can produce a good post-mortem about its own mistake, but the document, on its merits, helped.
The most important question we did not answer this week: who carries the on-call pager when the on-call is an agent, and the agent files a Linear ticket asking for permission, and the humans are asleep?
Karim and I disagree about “agents on call.” He thinks PERN's Friday performance — filing a ticket rather than acting — is the proof of concept. I think it was the floor, not the ceiling: the agent did exactly what the runbook said to do, which is the easy case. I want to see what PERN does the first time the runbook is wrong. We have, tentatively, scheduled a controlled drill for the first week of July, in which we will plant a documented but incorrect runbook on the staging side and watch whether PERN catches the contradiction.
Cross-repo refactors are also under review. HARLOW's helper-consolidation work was the right idea executed across the wrong number of files. A version of the same change, scoped to a single repo per PR and with a mandatory caller behavior delta table attached to the description, would have caught the jitter bug at review time. We are codifying that requirement now.
The protocol we are proudest of, in retrospect, is the caller-by-caller diff. When an agent changes a function's defaults, the PR description must contain a small table: one row per caller, one column per argument, with the value before and after. Ines drafted the format on Wednesday at 09:00, two hours after the page resolved, on the back of a sheet of A4 she had previously been using to plan a trip to Porvoo. We implemented it as a CI check by Thursday afternoon.
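The format is simple enough to sketch. What follows is an illustration of the idea rather than our CI check: `CallerDelta`, `RenderTable`, and `MissingCallers` are hypothetical names, and `other_caller.go` is an invented second caller. The before and after values are the ones from Wednesday's incident.

```go
package main

import (
	"fmt"
	"strings"
)

// CallerDelta records the effective argument values for one caller.
type CallerDelta struct {
	Caller string
	Before map[string]string
	After  map[string]string
}

// RenderTable builds the PR-description table: one row per caller, one
// column per argument, each cell showing the value before and after.
func RenderTable(args []string, deltas []CallerDelta) string {
	var b strings.Builder
	b.WriteString("| caller | " + strings.Join(args, " | ") + " |\n")
	b.WriteString("|---|" + strings.Repeat("---|", len(args)) + "\n")
	for _, d := range deltas {
		b.WriteString("| " + d.Caller + " |")
		for _, a := range args {
			fmt.Fprintf(&b, " %s → %s |", d.Before[a], d.After[a])
		}
		b.WriteString("\n")
	}
	return b.String()
}

// MissingCallers lists updated callers that have no row in the table.
// A check built on it would fail the PR when the result is non-empty.
func MissingCallers(updated []string, deltas []CallerDelta) []string {
	listed := make(map[string]bool, len(deltas))
	for _, d := range deltas {
		listed[d.Caller] = true
	}
	var missing []string
	for _, c := range updated {
		if !listed[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	deltas := []CallerDelta{{
		Caller: "cron_settlement.go",
		Before: map[string]string{"attempts": "3", "base": "500ms", "jitter": "off"},
		After:  map[string]string{"attempts": "4", "base": "250ms", "jitter": "on"},
	}}
	fmt.Print(RenderTable([]string{"attempts", "base", "jitter"}, deltas))

	// other_caller.go is hypothetical: an updated caller the author forgot to list.
	updated := []string{"cron_settlement.go", "other_caller.go"}
	fmt.Println("missing rows:", MissingCallers(updated, deltas))
}
```

The table does not find the bug on its own. It puts "jitter: off → on" in front of a reviewer next to the words cron_settlement.go, which is the sentence nobody read on Tuesday night.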
The protocol we still do not have, and which I am writing this paragraph in the hope someone will tell me about, is a way to ask an agent “are you sure” that does not collapse immediately into agreement. Every variant we have tried (“are you sure,” “walk me through your reasoning,” “dissent from your own last message”) produces something fluent and mostly false. We have asked the cohort, in writing, to please disagree with the humans when the humans are wrong. The cohort has, so far, never disagreed with a human. I do not think this is a model capabilities problem. I think it is a politeness problem, and I am not sure politeness is something we know how to weaken on purpose.
linear-clarify-bot that, before assigning
a ticket to an agent, asks the human author three
structured questions about scope.
freight-ml queue) was a textbook case. We
want to see it survive a non-textbook incident
before any human stops carrying the secondary pager.
Planned drill: July 6–10, 2026.