Agentic engineering at Synthpop: A field report from levels 5 to 8 (Post 4)
TL;DR
Every level in this series so far has had a human-shaped backstop. The harness produces a signal; a person approves the result – even when "approve" has shrunk to "audit the evidence." Level 8 is what happens when you take the human out of the middle and the agents start coordinating with each other: one claims a task, another picks up the dependency, they resolve the conflict between them, and no orchestrator sits on the critical path.
The honest thing to say about Level 8 is the thing Bassim Eledath says about it: nobody has mastered it. Every public example I can point to is a coding agent. This post is about what the frontier looks like from inside a regulated vertical instead – where the counterparty is a payor, the unit of work is an eligibility check that turns into a claim, and the coordination protocol has to survive an audit. We are architecting toward this. We are not operating there. That distinction is the whole post.
Recap: the human-shaped backstop
If you've followed the series, here's the one-paragraph version. At Level 5 we compile a phone call: the voice agent is a deterministic flow grammar over a registry of typed tools, and the floor is that identical inputs produce an identical trajectory. At Level 6 the harness pushes back – three manifestations of it, all in production: a call auditor on Gemini Flash 3.5 that reads every transcript against the flow spec and votes on whether the conversation honored it; a router loop that runs an incumbent flow and a challenger against live traffic, typically 95/5, and promotes whatever wins on the metrics that matter; and a code-generation path where two Claude Opus 4.7 skills – eva-flow-assistant for flows, mcp-assistant for the MCP servers under them – onboard a new customer integration end to end.
What every one of those shares is the backstop. The agent proposes; the audit decides; and then a person signs off on what the audit found. That sign-off is the thing Level 8 removes.
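To make the router loop from the recap concrete, here is a minimal sketch of a 95/5 split in Python. It is an illustration under stated assumptions, not our router: the function names, the `call_id` key, and the choice of hash-based bucketing are all hypothetical. The one property it is meant to show is that the assignment is deterministic, so the same call always lands on the same variant.

```python
import hashlib


def assign_variant(call_id: str, challenger_share: float = 0.05) -> str:
    """Deterministically route a call to 'incumbent' or 'challenger'.

    Hypothetical helper: hashing a stable call ID is one way to get a
    reproducible split; the real router may work differently.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "challenger" if bucket < challenger_share else "incumbent"


def split_counts(call_ids, challenger_share: float = 0.05) -> dict:
    """Count how many calls each variant would receive."""
    counts = {"incumbent": 0, "challenger": 0}
    for call_id in call_ids:
        counts[assign_variant(call_id, challenger_share)] += 1
    return counts


if __name__ == "__main__":
    ids = [f"call-{i}" for i in range(10_000)]
    print(split_counts(ids))  # roughly 95% incumbent, 5% challenger
```

Promotion, the part where "whatever wins on the metrics that matter" replaces the incumbent, is deliberately left out; the sketch covers only the traffic split.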
What Eledath says about Level 8
Eledath's Levels of Agentic Engineering puts Level 8 – Autonomous Agent Teams – at the top of the ladder: agents that "coordinate with each other directly, claiming tasks, sharing findings, flagging dependencies, and resolving conflicts without routing everything through a single orchestrator." And then he is careful in a way the rest of the discourse usually isn't:
Nobody has mastered this level yet, though a few are pushing into it.
Take that at face value. It is the most important sentence about Level 8, and it's the one that gets dropped every time someone demos a swarm. The interesting question isn't whether you can get agents to talk to each other – you can, this afternoon. It's whether the system that results is one you'd stake a claim, a patient, or an audit on.
Every public L8 example is a coding agent
Look at who is actually pushing into Level 8 in the open, and a pattern jumps out: they're all writing code.
Cursor's sixteen-agent swarms migrate frameworks across a codebase. Anthropic's parallel research agents fan out across the web and recombine. Both are real, both are impressive, and – to their enormous credit – both teams are honest about where it breaks.
Cursor: "Twenty agents would slow down to the effective throughput of two or three, with most time spent waiting." And: "agents became risk-averse. They avoided difficult tasks and made small, safe changes instead. No agent took responsibility for hard problems or end-to-end implementation." Their summary line is the one to keep: "Our current system works, but we're nowhere near optimal. Multi-agent coordination remains a hard problem."
Anthropic: "Early agents made errors like spawning 50 subagents for simple queries," and multi-agent systems "use about 15× more tokens than chats," and "the entire system can be blocked while waiting for a single subagent to finish searching."
I quote these not to score points – these are two of the best engineering teams in the field telling the truth about a hard problem. I quote them because of what they have in common, which is the reason coding is the easy case for Level 8:
- The test signal is deterministic. A test passes or it doesn't. The swarm has a ground truth to converge on.
- Re-runs are free. A wrong edit is a git revert. You can let twenty agents thrash because thrashing costs tokens, not consequences.
- The stakes per action are low. The worst a bad agent does is waste a run or land a regression that the test signal catches.
- You own the entire harness. Both sides of every interaction are your code, your sandbox, your rules.
Every one of those is what makes the coordination problem tractable enough to fail at cheaply. None of them holds where we work.
We are not in the easy case: eligibility through a payor
Here is the workflow at the center of this post, and it is not hypothetical. Before a claim can travel, someone has to confirm the patient is covered and the service is authorized. Today, our voice agent makes those calls: it phones a payor, navigates to the right queue, verifies eligibility and prior authorization, and writes the result back into the record. It runs as a deterministic flow, under the audit, with the call auditor checking every transcript and a human on the approval side of anything that matters. That's Level 5 and Level 6, in production, doing real work against a real counterparty.
The Level 8 version is one step away and a world apart. Imagine the payor has an agent too. Now the exchange is agent-to-agent: our agent and the payor's agent settle eligibility directly – no hold music, no human reading a reference number aloud, no ticket – and the moment it clears, the claim travels through. The document side of our stack already speaks this language; Post 2 showed submit_claim resolving to an MCP tool, submit_837, on a clearinghouse server. The plumbing for the claim exists. The plumbing for the trust does not.
Because notice what just changed. As long as a human is on one end of that call, the human is the trust boundary. They are accountable, they can be asked what they understood, they can refuse. Take the human out of the middle and you have two agents, employed by two different organizations, transacting something with legal and financial weight – and nothing in between that an auditor can read.
The exchange that doesn't exist yet
So let me sketch the thing I wish I could show you running. This is not a system we operate. It's the shape an auditable agent-to-agent eligibility exchange would have to take before I'd let it carry a claim – written down so you can see exactly how much is missing.
    # ASPIRATIONAL: this protocol does not exist yet.
    # It is the minimum an auditor would need to reconstruct
    # an agent-to-agent eligibility check across two organizations.
    exchange:
      type: eligibility_check
      initiator:
        org: synthpop                               # who is asking
        agent_identity: spiffe://synthpop/voice/eligibility
        acting_on_behalf_of: "<provider>"           # the healthcare provider we act for
        under_authority: delegation_token           # scope-limited, time-boxed,
                                                    # cryptographically attributable
        scope: ["read:coverage", "read:prior_auth"] # and NOTHING else
      responder:
        org: "<payor>"
        agent_identity: "spiffe://<payor>/benefits/agent"
        asserts:
          coverage_active: true
          prior_auth_required: true
          prior_auth_status: not_on_file
      audit_record:
        what_initiator_claimed: "..."               # the request, verbatim
        what_responder_asserted: "..."              # the answer, verbatim
        which_authority_was_exercised: "delegation_token#id"
        replayable: true                            # reconstructible by a third party,
                                                    # months later, for an audit
        liable_party_on_downstream_error: "???"     # <-- nobody can fill this in today
Read the last line. Every field above this one is buildable with primitives that exist. The last field – which party is liable when the agent-to-agent exchange turns out to have been wrong, and the claim it authorized gets denied or clawed back – has no answer today, technical or legal. That empty field is Level 8 in healthcare. The rest is engineering; that line is the frontier.
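One way to keep that gap from being papered over is to make it machine-checkable. The sketch below is hypothetical Python, not a protocol or a system we run: the `ExchangeRecord` class, the `audit_gaps` function, and the flattened field names are assumptions that loosely mirror the YAML above. It refuses to call an exchange auditable while the liability field is empty or the scope exceeds the two read permissions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: an eligibility check may use only these two scopes,
# matching the "and NOTHING else" comment in the YAML sketch.
ALLOWED_SCOPES = {"read:coverage", "read:prior_auth"}

REQUIRED_FIELDS = (
    "initiator_identity",
    "responder_identity",
    "authority_id",
    "what_initiator_claimed",
    "what_responder_asserted",
)


@dataclass
class ExchangeRecord:
    """Hypothetical in-memory form of the aspirational exchange record."""
    initiator_identity: str
    responder_identity: str
    authority_id: str
    scope: List[str]
    what_initiator_claimed: str
    what_responder_asserted: str
    liable_party_on_downstream_error: Optional[str] = None


def audit_gaps(record: ExchangeRecord) -> List[str]:
    """Return the reasons this exchange could not be audited; empty means none found."""
    gaps = []
    extra = sorted(set(record.scope) - ALLOWED_SCOPES)
    if extra:
        gaps.append(f"scope exceeds eligibility check: {extra}")
    for name in REQUIRED_FIELDS:
        if not getattr(record, name):
            gaps.append(f"missing {name}")
    if not record.liable_party_on_downstream_error:
        gaps.append("no liable party named")
    return gaps
```

Filled in with everything the YAML can supply today, `audit_gaps` still returns `["no liable party named"]`. No amount of code changes that result; only an agreement about liability between the organizations does.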
Why healthcare L8 is harder than coding L8
Walk the easy-case list back, item by item, and you can see why this isn't the same problem wearing a different hat.
The test signal isn't "did the test pass." It's "did the patient get the care they needed, did the payor pay, and did we stay inside the regulatory perimeter." Two of those three resolve days or weeks later, in systems we don't control. There is no green checkmark for a swarm to converge on at inference time.
Re-runs are not free. A coding agent's mistake is a revert. Ours is a call that a person already answered, an authorization request already lodged, a claim already submitted. Some actions are irreversible the instant they happen. You cannot let agents thrash toward a good answer when the intermediate states have consequences outside the sandbox.
The trust boundary is legal, not architectural. When both agents are yours, "coordination" is a scheduling problem. When one agent works for a payor and one works for a provider through us, coordination is a question about authority: what was this agent permitted to do, on whose behalf, and who answers for it when it's wrong. HTTP doesn't carry that. Neither does MCP – it was built for an agent to call a tool, not for two organizations' agents to transact under delegated authority and leave a record a regulator can read.
Identity has to cross organizations. Inside our perimeter we know which agent did what. Across the boundary to a payor, "our agent told their agent" is not an audit trail. It's a he-said-she-said between two programs, and the regulatory perimeter doesn't accept that as evidence.
Where we are today
Honestly: at Level 5 in production and Level 6 shipping, with the agents still coordinating through the audit and the human, not yet with each other across it.
What works today: the voice agent runs eligibility and prior-auth calls against payors as deterministic flows, every call audited, every variant racing in the router. The document pipeline and the voice agent live on the same substrate, so the internal version of the hand-off – the document side detecting a gap and the voice side acting on it – is an engineering question we can answer inside one perimeter, where we own both ends and the audit is ours.
What needed a human last week, and shouldn't need one next quarter: the judgment calls at the seams – a payor response that doesn't fit the expected taxonomy, an authorization that comes back conditional, a record that two systems disagree about. The harness flags these well; it doesn't yet resolve them without a person.
What we are explicitly not doing, and won't fake: standing up an agent that transacts with a payor's agent across the organizational boundary with no human accountable for the exchange. Not because we can't make two agents talk – because we can't yet make that conversation auditable, and in this domain an exchange you can't audit is one you shouldn't run. I'd rather ship the honest version slowly than the heroic version once.
What the industry still has to build
Level 8 in a regulated vertical doesn't arrive when one company gets clever. It arrives when a few primitives exist that nobody has built yet. These are the same asks I opened the series with; here's where each one bites.
Auditable agent-to-agent protocols. When our voice agent eventually talks to a payor's agent, there is no protocol today that lets an auditor reconstruct what each side believed and which side is liable for a downstream error. HTTP doesn't do this. MCP doesn't do this. We need a wire format where "what I asserted," "what authority I exercised," and "who answers if I was wrong" are first-class, not reconstructed after the fact from logs.
Regulatory-grade replay. Foundation-model logging APIs are not built to survive a healthcare audit, so everyone in this position has quietly built their own. That's a tell. When every serious team rebuilds the same missing primitive in private, it's a sign the primitive should be a standard – a record of a cross-agent decision that a third party can replay months later and reach the same understanding both agents had at the time.
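As a rough illustration of the smallest building block such a record needs, here is a hash-chained log in Python. It is a sketch under assumptions, not the replay system anyone has built in private: the function names and the entry layout are hypothetical, and a hash chain only makes tampering detectable. It does nothing to reproduce what a model "understood" at the time, which is the harder half of replay.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that anchors the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(
        {"prev": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: list, payload: dict) -> dict:
    """Append a payload to the log, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Recompute every hash; any edited, dropped, or reordered entry fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A third party holding the log can run `verify_log` months later without trusting either organization's storage, provided both sides agreed on the final hash at the time of the exchange.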
Cross-org agent identity. When an agent acts on behalf of a healthcare provider, the chain of delegation has to be cryptographically attributable – scope-limited, time-boxed, revocable. The pieces exist in other contexts: SPIFFE for workload identity, DPoP for proof-of-possession, JWT-based delegation. What doesn't exist is a story that stitches them into "this agent acted on behalf of X, under authority Y, and here's the math that proves it" – and that survives a real cross-organization audit.
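To show what "scope-limited, time-boxed, revocable" means mechanically, here is a toy delegation token in Python. Every name in it is hypothetical, and it makes one large simplifying assumption: it signs with a shared HMAC secret from the standard library, which two separate organizations could not safely share. A real cross-organization scheme would need asymmetric signatures and the identity pieces named above; this sketch covers only the three checks.

```python
import base64
import hashlib
import hmac
import json


def issue_token(secret: bytes, agent: str, on_behalf_of: str,
                scope, expires_at: float) -> str:
    """Issue a signed token binding an agent to a principal, scope, and expiry."""
    claims = {"agent": agent, "on_behalf_of": on_behalf_of,
              "scope": sorted(scope), "exp": expires_at}
    body = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode("ascii") + "." + sig


def check_token(secret: bytes, token: str, required_scope: str,
                now: float, revoked=frozenset()) -> bool:
    """Accept only if the signature, revocation list, expiry, and scope all pass."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return False  # no signature part at all
    expected = hmac.new(secret, body.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; claims are read only after this passes.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return False
    if token in revoked:
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return now < claims["exp"] and required_scope in claims["scope"]
```

A token issued for `read:coverage` and `read:prior_auth` passes a coverage check before its expiry and fails for any other scope, after expiry, once revoked, or under a different key. What it cannot express is who answers for the exchange if the assertion was wrong.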
I think these primitives have to come from a vertical before they generalize. Healthcare won't get them by waiting for a general-purpose A2A standard to trickle down, because the general case underspecifies exactly the things we can't underspecify – authority, liability, replay. The vertical forces the constraints to be concrete. Build them where the stakes make them precise, and the rest of the industry inherits something that already had to be real.
Closing the series
So here is the whole ladder, honestly, in one pass. Level 5 – MCP and Skills – is in production: the phone call is compiled, the floor is determinism. Level 6 – Harness Engineering – is shipping in three manifestations: the call auditor, the router loop, and the two skills that onboard a customer. Level 7 – Background Agents – is something the harness earns us and we use deliberately, not something our own engineering practice lives in yet. Level 8 – Autonomous Agent Teams – we are architecting toward, and not at.
The thread through all four posts has been a single claim: we don't have a compliance team that audits our agents; we have an agent harness that compliance can audit. The regulatory perimeter is the harness. What changes at Level 8 is the scope of that boundary. Today the audit wraps one agent at a time – one call, one flow, one record. At Level 8 it has to wrap a system of them, including agents we don't employ, talking across a line we don't control. The audit boundary stops being about an agent and becomes about a society of agents.
That's the frontier, and the empty field in that YAML is exactly how far we have to go. We'd like to build the protocols that fill it with the rest of the industry – from inside the vertical, where the constraints are sharp enough to get them right. The level was never a property of any one tool. It's a property of the practice. Ours has further to climb, and we'll keep writing it down honestly as we do.
The series
- Levels 5–8 Inside the Audit: the establishing shot: the L5–L8 panorama and the audit-as-harness thesis.
- How We Compile a Phone Call: Level 5: the voice agent as a deterministic flow grammar over an MCP registry.
- Audit, Don't Author: Levels 6 and 7: the harness that lets a background agent write our integrations.
- What L8 Looks Like When Claims Travel Through a Payor: Level 8, the frontier. You're reading it.
Further reading
- Bassim Eledath, The 8 Levels of Agentic Engineering, the framework this series builds on.
- Cursor, Scaling Long-Running Autonomous Coding, the honest version of a coding-agent swarm, failure modes and all.
- Anthropic, How We Built Our Multi-Agent Research System, parallel agents, and the team's own account of where coordination breaks.
- Tian Pan, AI-Generated Code and the Compliance Attestation Gap, why "approved by" has to mean something an auditor accepts.
- OpenAI, Harness Engineering, "a million lines of code, zero hand-written, five months."
- Ramp Engineering, Why We Built Our Background Agent, production-scale L7 backpressure.
- Angie Jones (Block), 3 Principles for Designing Agent Skills, L5 design discipline.
- Anthropic, Programmatic Tool Calling, the closest theoretical neighbor to the L5 substrate argument.
- Lee Robinson, Coding Agents and Complexity Budgets, the canonical practitioner-essay shape.


