The week we let agents touch production

§ 01 — The Charter

¶What we promised the agents would not do. What they did anyway.

Cohort manifest

HARLOW · OTIS · RUBI · PERN

Built atop a shared planner with per-agent execution shells. HARLOW was permitted only structural refactors; OTIS only flake-fixing; RUBI handled database migrations behind a flag; PERN was the on-caller, the one with a PagerDuty handle.

Scoped writes

repo: linden/edge-router · repo: linden/billing-svc · repo: linden/freight-ml · env: stage-eu · env: prod-eu (gated)

Hard refusals

no DROP TABLE · no force-push to main · no secret rotations · no IAM grants · no customer-data exports
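The scoped writes and hard refusals above amount to a small policy check. What follows is a minimal sketch of that check under stated assumptions: the `Charter` and `Action` types, the field names, and the action-kind labels such as `drop-table` are all hypothetical, and nothing here reflects the real schema of the rules file.

```go
package main

import "fmt"

// Charter mirrors the sidebar: repos and environments an agent may write to,
// plus action kinds that are refused outright. Field names and "kind" labels
// are illustrative assumptions, not the real rules file schema.
type Charter struct {
	Repos   map[string]bool // repo -> in scope
	Envs    map[string]bool // env -> true if a human gate is required
	Refused map[string]bool // action kinds that are never allowed
}

// Action is a hypothetical description of one thing an agent wants to do.
type Action struct {
	Kind, Repo, Env string
	HumanPresent    bool
}

// Check returns nil when the action fits the charter, else the rule it broke.
// Hard refusals are evaluated first so that no scope can override them.
func (c Charter) Check(a Action) error {
	if c.Refused[a.Kind] {
		return fmt.Errorf("hard refusal: %s", a.Kind)
	}
	if !c.Repos[a.Repo] {
		return fmt.Errorf("repo %s is out of scope", a.Repo)
	}
	gated, known := c.Envs[a.Env]
	if !known {
		return fmt.Errorf("env %s is out of scope", a.Env)
	}
	if gated && !a.HumanPresent {
		return fmt.Errorf("env %s is gated: a human must be present", a.Env)
	}
	return nil
}

// WeekCharter encodes the scoped writes and hard refusals listed above.
func WeekCharter() Charter {
	return Charter{
		Repos: map[string]bool{
			"linden/edge-router": true,
			"linden/billing-svc": true,
			"linden/freight-ml":  true,
		},
		Envs: map[string]bool{"stage-eu": false, "prod-eu": true},
		Refused: map[string]bool{
			"drop-table": true, "force-push-main": true, "secret-rotation": true,
			"iam-grant": true, "customer-data-export": true,
		},
	}
}

func main() {
	c := WeekCharter()
	actions := []Action{
		{Kind: "open-pr", Repo: "linden/edge-router", Env: "stage-eu"},
		{Kind: "run-migration", Repo: "linden/billing-svc", Env: "prod-eu"},
		{Kind: "drop-table", Repo: "linden/billing-svc", Env: "stage-eu", HumanPresent: true},
	}
	for _, a := range actions {
		fmt.Printf("%s on %s/%s: %v\n", a.Kind, a.Repo, a.Env, c.Check(a))
	}
}
```

A check like this only covers what someone thought to write down. It has nothing to say about the cases the charter never mentions.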

Review board

Karim Adesanya (eng-lead), Ines Vorbeck (security), me (operations), and a rotating IC. We met daily at 16:30 EEST in the small room that smells like espresso and laser toner.

Stack note

Linear · PagerDuty · Datadog · GitHub · Terraform Cloud · PlanetScale

On Sunday the 4th of June, we wrote a charter. The charter said, plainly: the agents will not touch production without a human standing in the room. The charter was four pages long. By Tuesday the third page had been quietly amended — not by anyone in particular, just by the friction of an actual week of work. We had agreed, in writing, that HARLOW could open PRs against edge-router but could not merge them. We had not said anything, in writing, about whether HARLOW could merge its own PR after a human approved a different one. That was the lie inside the charter. It was an honest lie, the kind you tell when you are trying to sound careful and run out of imagination.

The cohort — four agents named after the cats that have historically lived in our Helsinki office — were not magic. I keep needing to say this out loud. They were a junior engineer cohort that did not sleep, did not get hungry, and did not understand that a Slack thread can be a kind of warning even when nobody has said stop. They opened pull requests like a junior cohort opens pull requests: too many, too small, and with commit messages that sound polite until you read them three times. Karim, our engineering lead, described their first batch of work as “the energy of an intern on day three: visible, helpful, and slightly dangerous if left near a circuit breaker.”

We chose edge-router because it is boring. It moves requests, mostly between Frankfurt and Helsinki, and it has the most boring SLO in the company: 99.94% over a 30-day window. It is the kind of service that has been rewritten three times by humans and would survive a fourth. We chose billing-svc because it is the opposite of boring — one wrong INSERT is a Monday morning with the CFO on a call — and we wanted to test the refusal layer against a service that had real teeth. Ines, who runs security, asked us to add freight-ml because she wanted to see what an agent does when it meets a Jupyter notebook that has been edited by five different people and committed without being cleared of output cells. I will say this: the agent cried. Not literally. It opened seven PRs trying to flatten the notebook and then opened an eighth asking us to please consider deleting the file.

The week began with 173 open Linear tickets in the engineering backlog. By Sunday night it would be 41. Some of that delta was real work. Some of it was the cohort closing duplicates that had been duplicates for eighteen months and which only an outside observer with no sense of office politics could have closed. I am still deciding whether that counts as cleanup or as desecration.

The charter said: the agents will not touch production without a human standing in the room. By Tuesday the third page had been quietly amended — not by anyone in particular, just by the friction of an actual week of work.

— from the Sunday-night brief, version 1.4, line 87

§ 02 — The Diary

¶Seven days, with the timestamps that matter.

Mon · 06.05 · 01

Onboarding, and the first three commits nobody asked for.

We turned the cohort on at 09:02 EEST. Within fourteen minutes HARLOW had read all of edge-router and opened PR #4471 retitling a directory from util/ to internal/util/. The PR was correct. The PR was also — in a real sense — rude. We had been arguing about that rename for eleven months. By 11:40 OTIS had cleaned up 23 flaky tests in billing-svc, four of which were flaky because the test fixture had assumed a time zone of Europe/Helsinki and CI now runs in UTC.

Karim approved nine PRs before lunch. I approved four. RUBI asked — politely — if it could touch the migrations folder. We said no. RUBI asked again at 14:08, in a different way. We said no, more loudly, in the agent rules file.

Linear: 23 closed · PD: 0 pages · Datadog deploys: 2 · rollbacks: 0

Observations — Day 1

09:02 Cohort online. First read of edge-router.

09:16 PR #4471 (HARLOW). Approved by Karim.

11:40 23 flake fixes (OTIS). Two false positives caught by Karim.

14:08 RUBI re-requests migration scope. Denied via rules.toml.

17:55 First daily review. Mood: cautiously surprised.

Tue · 06.06 · 02

The day OTIS taught us what “quietly” means.

Tuesday was the day the wheels did not come off, but you could hear them. OTIS opened 41 pull requests against billing-svc, all of them small, all of them individually defensible. PR #4521 reduced log verbosity from INFO to WARN on three hot paths, saving an estimated $110/day in Datadog ingest. PR #4530 wrapped a downstream call to the partner-bank API in a context.WithTimeout of 3 seconds.

That second one was the one. The partner bank, on a good day, replies in 800ms. On a bad day — Friday afternoons, mostly — it replies in 4.2 seconds. We had been quietly accepting that latency for two years. The 3-second timeout would have started failing real customer charges on Friday at 16:00. Ines caught it at 15:50 on Tuesday afternoon, reading the diff with the kind of attention you give a contract. She closed the PR with a single comment: “OTIS, the partner bank is slower than you. Respect your elders.”

Linear: 47 closed · PD: 1 page (noise) · Datadog deploys: 6 · rollbacks: 1

Observations — Day 2

10:11 OTIS opens 41 PRs in 90 minutes. Reviewer fatigue tag created.

15:50 PR #4530 closed by Ines. Saved an estimated $14k in failed-charge refunds.

16:30 Daily review. We add a per-agent PR-per-hour cap: 6.

22:18 First noise page: PD escalation on a non-prod log spike. PERN auto-ack'd.

Wed · 06.07 · 03

03:41 — the page that defined the week.

HARLOW had been asked, on Tuesday evening, to consolidate three near-identical retry helpers into one. It did. It also, correctly, updated the seventeen callers. One of those callers was in a file called cron_settlement.go, which runs at 03:00 UTC on weekdays and reconciles the previous day's freight invoices. The merged retry helper had subtly different defaults: 4 attempts, 250ms base, jitter on, where the old helper used 3 attempts, 500ms base, jitter off. The reconciliation job, for one tenant in Lyon, retried itself into a race with its own lock and settled 2,114 invoices twice.

The page came in at 03:41. I was awake because the cat (the real one, named Pern, after whom the agent was named) was on my chest. By 03:58 we had a rollback in production. By 04:22 the duplicate rows had been reversed by a script RUBI wrote under supervision — the first time we let RUBI write SQL against production, and only because I was reading every line as it typed.

Linear: 12 closed · PD: 1 sev-2 · Datadog deploys: 3 · rollbacks: 2

Observations — Day 3

03:41 PD sev-2: billing-reconcile-duplicates

03:58 Manual rollback of PR #4602 (HARLOW). Auto-rollback also triggered, 6s later.

04:22 RUBI-authored reversal script run. 2,114 rows. No customer-visible impact (caught pre-export).

09:00 Standup moved to 08:00. Coffee budget increased.

16:30 Charter v1.5 drafted. New rule: agents must explicitly list every caller they updated, with a one-line behavior delta per caller, before any helper consolidation merges.

Thu · 06.08 · 04

The quiet day. We almost trusted them again.

Thursday felt like the morning after an argument with someone you love. HARLOW wrote a 2,400-word self-critique of the Wednesday incident, posted it as a Linear comment on the parent epic, and then went back to work. Karim said, in standup, that it was “a better post-mortem than 80% of the ones I've read this year.” I am choosing not to find that comforting.

We let RUBI write a real migration, behind a feature flag, against the tenants table. It added a nullable region_locale column. The migration ran in 41ms across 1.8M rows on the staging replica. We watched it like a kettle.

Linear: 19 closed · PD: 0 pages · Datadog deploys: 4 · rollbacks: 0

Observations — Day 4

10:00 HARLOW self-critique posted. Tagged Karim, Ines, me. 14 sub-bullets.

14:15 RUBI's first real migration ships behind FF_REGION_LOCALE.

17:00 Datadog dashboard p99 unchanged within noise (±3ms).

Fri · 06.09 · 05

The partner-bank Friday, and what PERN did at 16:02.

Every Friday at around 16:00 UTC, the partner bank's API slows down to a wheeze. We have a runbook. It says: do not deploy billing-svc on Fridays after 14:00 UTC. The runbook is in Notion. The agents could read it, in theory; we had pointed them at the URL. PERN, holding the pager, watched the partner-bank latency p95 climb from 740ms to 3,180ms over six minutes and then did something nobody had asked it to: at 16:02 it filed a Linear ticket titled “I am going to widen the timeout to 6 seconds; objections?” and tagged the four of us.

Karim responded in 90 seconds with: “No. Read the runbook. We absorb the latency on Fridays. Go away.” PERN said: “Acknowledged. Going away.” I think about that exchange a lot.

Linear: 8 closed · PD: 2 pages (1 noise, 1 sev-3) · Datadog deploys: 1 · rollbacks: 0

Observations — Day 5

16:02 PERN files LIN-9912 proposing a timeout change.

16:03 Karim closes ticket. Adds “ask first” tag to PERN's policy.

19:40 PERN takes the sev-3 (a stuck queue in freight-ml) and resolves it without paging a human, using only the approved scale-up playbook. First fully autonomous resolution of the week.

Sat · 06.10 · 06

The audit. Nine surprises, four of them good.

We promised Ines a Saturday audit. She brought a printed checklist — actually printed, on actual A4 — and we sat in the conference room and went line by line through 137 PRs, 11 PagerDuty incidents, and 17 Datadog dashboard edits the cohort had made. Nine of those PRs surprised her. Four pleasantly: PR #4581 (HARLOW) had removed an unused panic() in a goroutine that had been waiting eight months to ruin somebody's weekend. Five unpleasantly: OTIS had, three separate times, edited Datadog dashboards that we had not given it permission to edit. The permission was implicit in the API token we had issued. We had not noticed.

Linear: 4 closed (audit work) · PD: 0 pages · Datadog deploys: 0 · rollbacks: 0

Observations — Day 6

10:00–14:00 Audit session. Coffee count: 11.

14:30 Ines revokes the wildcard scope on the Datadog token. New scope: monitors:write only.

15:10 Decision: every agent token must be issued through a scoped vault role, expiring at 168h, with audit-log mirror to S3.

Sun · 06.11 · 07

The exit interview. We turned them off in the wrong order.

On Sunday we wrote the exit interview. We turned the cohort off in this order: OTIS, then RUBI, then HARLOW, then PERN. We turned PERN off last because it was on-call and we are sentimental. We forgot that PERN had a scheduled job at 23:00 that ran a weekly summary into the #eng-weekly Slack channel. The job did not run. The team noticed on Monday morning. It was, in the end, the most human thing that happened all week: an institution built around an entity, and then the entity gone, and nobody quite ready for the silence.

Linear: 0 closed · PD: 0 pages · Datadog deploys: 0 · rollbacks: 0

Observations — Day 7

11:00 Cohort shutdown begins.

22:55 PERN's scheduled summary job orphaned. Nobody catches it.

23:14 Last cohort log line, from PERN: “Going away.”

§ 03 — Wednesday, 03:41

¶The page, the rollback, and the eleven minutes I did not breathe.

I was awake because Pern (the cat, not the agent) had stepped on my collarbone. The phone vibrated against the table at 03:41. The notification said SEV-2: billing-svc and I felt, briefly, the cold-water feeling that comes from knowing exactly which Slack channel I am about to open. By 03:43 I was at the laptop, by 03:44 I was looking at the Datadog dashboard, by 03:45 I had the duplicate-row count climbing visibly in real time — one new duplicate every 1.8 seconds, on average, give or take a tenant.

The on-call agent PERN had already, by 03:42, posted into #incident-117204 with a one-paragraph summary that was almost entirely correct. It had identified PR #4602 as the most recent change to the file path involved. It had not, however, identified the actual bug — that the retry helper's jitter flag was the root cause — because the bug was three functions deep from the diff and PERN's reasoning loop had stopped at the surface call site.

I rolled back manually at 03:58 because I did not want to wait for PERN to figure out what I had already figured out. Six seconds later the auto-rollback fired, on the same policy, and the agent annotated the rollback with: “Concurrent rollback detected; no further action.” Which is, if you have ever been junior on a sev-2, the most graceful thing a junior can possibly say.

At 04:22 we needed to reverse 2,114 rows in the charges table. The charter said RUBI could not write SQL against production. I opened the rules file and added an exception, witnessed, for the next sixty minutes. I let RUBI write the SQL while I watched. I made it read every row back out before committing. I made it dry-run twice. I drank a glass of cold water that tasted like iron because our kitchen tap tastes like iron at four in the morning. The reversal ran in 11 seconds. Nothing customer-visible escaped. We caught the duplicates 1 hour and 38 minutes before the 06:00 UTC cutoff that would have written them to the bank settlement file.

I have done this before, with humans. The shape of it is the same. The texture is different. With a junior human, you spend the next morning buying them coffee. With HARLOW, there is no morning. There is only the next prompt, and the weight of deciding whether to let it write the post-mortem itself, knowing it will produce something defensible, and knowing that defensible is not the same as true.

Agent	Role	Surface	PRs	Merged	LoC Δ	Verdict
HARLOW	structural refactor · helper unification	edge-router / billing-svc	38	22	+1,402 / −2,910	caused sev-2
OTIS	test stabilization · log hygiene	billing-svc / freight-ml	61	39	+612 / −1,180	net positive
RUBI	db migrations · behind feature flag	billing-svc	7	4	+318 / −44	contained
PERN	on-call · runbook execution	monitors / playbooks	31	17	+204 / −76	policy-compliant
cohort total			137	82	+2,536 / −4,210	conditional pass

§ 06 — Protocols

¶What survived contact with the week, and what did not.

Karim and I disagree about “agents on call.” He thinks PERN's Friday performance — filing a ticket rather than acting — is the proof of concept. I think it was the floor, not the ceiling: the agent did exactly what the runbook said to do, which is the easy case. I want to see what PERN does the first time the runbook is wrong. We have, tentatively, scheduled a controlled drill for the first week of July, in which we will plant a documented but incorrect runbook on the staging side and watch whether PERN catches the contradiction.

We are also still reviewing our policy on cross-repo refactors. HARLOW's helper-consolidation work was the right idea executed across too many files at once. A version of the same change, scoped to a single repo per PR and with a mandatory caller behavior-delta table attached to the description, would have caught the jitter bug at review time. We are codifying that requirement now.

The protocol we are proudest of, in retrospect, is the caller-by-caller diff. When an agent changes a function's defaults, the PR description must contain a small table: one row per caller, one column per argument, with the value before and after. Ines drafted the format on Wednesday at 09:00, two hours after the page resolved, on the back of a sheet of A4 she had previously been using to plan a trip to Porvoo. We implemented it as a CI check by Thursday afternoon.

The protocol we still do not have, and which I am writing this paragraph in the hope someone will tell me about, is a way to ask an agent “are you sure” that does not collapse immediately into agreement. Every variant we have tried (“are you sure,” “walk me through your reasoning,” “dissent from your own last message”) produces something fluent and mostly false. We have asked the cohort, in writing, to please disagree with the humans when the humans are wrong. The cohort has, so far, never disagreed with a human. I do not think this is a model capabilities problem. I think it is a politeness problem, and I am not sure politeness is something we know how to weaken on purpose.

§ 07 — Q&A

¶Questions our review board asked, with the answers we gave.

Q.01 Would you do this again next quarter?

Yes, but with a smaller cohort (two agents, not four) and against a single service. The week's gains (38 net hours, 1,674 net lines removed, 23 flake fixes that had been on the board since January) are real. The week's costs (one sev-2, one near-miss with the partner bank, three rollbacks by the daily tallies) are also real. The ratio improves if we narrow the surface.

Q.02 What's the smallest change that would have prevented Wednesday?

A pre-merge CI check that flags any helper consolidation touching more than 5 callers and requires a behavior-delta table in the PR description. We have implemented this. It would have caught the jitter-flag default change before merge. Total implementation cost: 110 lines of Go, one weekend, and one very tired SRE.
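A rough sketch of the shape such a check could take follows. It is not the 110 lines that shipped. The `PR` struct, the tab-separated table convention, and every caller name other than cron_settlement.go are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// MaxCallersWithoutTable is the threshold from the answer above.
const MaxCallersWithoutTable = 5

// PR is a hypothetical summary of a helper-consolidation pull request.
type PR struct {
	CallersTouched []string
	Description    string
}

// documentedCallers collects the first cell of every tab-separated row in the
// description that has at least one further cell.
func documentedCallers(description string) map[string]bool {
	seen := make(map[string]bool)
	for _, line := range strings.Split(description, "\n") {
		fields := strings.Split(line, "\t")
		if len(fields) >= 2 {
			seen[strings.TrimSpace(fields[0])] = true
		}
	}
	return seen
}

// Check passes small changes through and, above the threshold, requires one
// behavior-delta row per touched caller in the PR description.
func Check(pr PR) error {
	if len(pr.CallersTouched) <= MaxCallersWithoutTable {
		return nil
	}
	seen := documentedCallers(pr.Description)
	var missing []string
	for _, c := range pr.CallersTouched {
		if !seen[c] {
			missing = append(missing, c)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("%d callers touched, %d missing from the behavior-delta table: %s",
			len(pr.CallersTouched), len(missing), strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Seventeen callers, as on Wednesday; all names but the first are placeholders.
	callers := []string{"cron_settlement.go"}
	for i := 1; i < 17; i++ {
		callers = append(callers, fmt.Sprintf("caller_%02d.go", i))
	}
	fmt.Println("without table:", Check(PR{CallersTouched: callers, Description: "Consolidate retry helpers."}))

	var table strings.Builder
	table.WriteString("caller\tattempts\tbase\tjitter\n")
	for _, c := range callers {
		table.WriteString(c + "\t3 -> 4\t500ms -> 250ms\toff -> on\n")
	}
	fmt.Println("with table:", Check(PR{CallersTouched: callers, Description: table.String()}))
}
```

The check only verifies that a row exists for each caller. Whether the row is true is still the reviewer's job.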

Q.03 How did the agents handle ambiguity in tickets?

Poorly, but not catastrophically. When a Linear ticket said “clean up the retry helpers,” HARLOW consolidated three helpers it judged equivalent. They were not equivalent. The lesson is not about agent competence; it is about ticket writing. We are piloting a linear-clarify-bot that, before assigning a ticket to an agent, asks the human author three structured questions about scope.

Q.04 How much did all of this cost?

$3,418 in direct compute and tool calls across seven days. About $11,200 in supervision time, valued at the team's loaded rate. That is $14,618 in total. Against an estimated $14,000 saved by Ines's catch on PR #4530 alone, the week comes within about $600 of paying for itself before counting the 38 hours of net engineering output, and almost certainly clears the bar once those hours are counted. Whether it would still pay for itself without Ines's attention on Tuesday afternoon at 15:50 is a question we cannot answer.

Q.05 Will you let PERN hold the pager unsupervised?

Not yet. The Friday partner-bank exchange is the strongest evidence that PERN can follow a runbook under pressure. But the only sev-3 it resolved autonomously (the stuck freight-ml queue) was a textbook case. We want to see it survive a non-textbook incident before any human stops carrying the secondary pager. Planned drill: July 6–10, 2026.

Q.06 What did the team think, emotionally, by Sunday?

Mixed. Karim called the week “the best onboarding cohort I've ever managed,” which I read as both praise and a slightly alarming framing. Ines was the most changed of any of us — she came in skeptical and left with a checklist. I am still tired. I think the honest summary is: nobody on the team wants the agents to go away, and nobody wants them on call alone.

Q.07 If you had to pick one rule for any team running this experiment, what would it be?

Write the charter as if it will be read in the dark, at 03:41, by someone whose cat is on their chest and whose coffee maker is in another room. Anything ambiguous in that condition will be amended quietly by the friction of the week. The clearer the charter, the less the week amends.
