Agentic engineering at Synthpop: A field report from levels 5 to 8 (Post 1)
TL;DR
When the mcp-assistant skill in our integration pipeline produces a Python handler that the harness lets through, the failure trajectory eventually includes a phone call to a clinic. That's a different shape of feedback loop than a failed pytest run.
I'm Davor, VP of Agentic Engineering at Synthpop. We're a twenty-person engineering team building a healthcare patient-journey orchestration platform – about a thousand voice calls a day, over a million PDF documents processed last quarter. We operate inside the audits that come with healthcare, and trying to ship reliable agents inside that frame has changed how we think about Bassim Eledath's Levels of Agentic Engineering.
This is the first of four posts. The thesis: we don't have a compliance team that audits our agents. We have an agent harness that compliance can audit. The rest of this post is the why; the rest of the series is the how.
Who we are
Synthpop coordinates the back-office decisions that determine whether a patient's order gets filled – across intake, eligibility checks, prior authorization, and the dozens of small handoffs between providers, payors, and durable medical equipment suppliers that turn a clinical referral into something a patient actually receives. Most of those handoffs were faxes and phone calls a decade ago. Many of them still are.
We're roughly twenty engineers. We ship two products that share a substrate: a voice agent that handles the phone calls (about a thousand a day) and a PDF pipeline that handles the paperwork (over a million documents last quarter). Separate runtimes, same substrate – a typed flow grammar over a registry of MCP tools. As of April 2026, we're the first healthcare AI agent partner in Google's Gemini Enterprise Agent Gallery, running on Vertex AI.
What's distinctive about how we build is downstream of one constraint. Synthpop's customers are healthcare organizations. We're not in a regulated industry abstractly; we're inside actual audits – customer audits, regulatory audits, our own internal compliance reviews – that come with handling the kind of data that flows through our system. That shapes more or less every engineering decision past the prototype stage. The patterns that work for Vercel or Ramp or Block when they wire agents into their build pipelines don't transfer cleanly to a runtime where every change has to be defensible to an auditor.
The ladder we're climbing
If you've read Bassim Eledath's Levels of Agentic Engineering, the next few paragraphs will be quick. If you haven't, the link is the prerequisite for the rest of this series – Eledath's framework is the lens, and I'm not going to restate it.
Briefly: eight levels of engineering practice maturity, from L1 (Tab Complete) through L8 (Autonomous Agent Teams). The framework is about how a team uses agents to build software, not about how autonomous the resulting product is. That distinction is important, and the rest of this post leans on it.
Here's where Synthpop is, honestly:
- L5 (MCP and Skills) – in production. Our voice agent and PDF pipeline both run on the same substrate: a typed flow grammar – Pydantic schemas over YAML – compiled against a registry of MCP tools. The flow file is the auditable artifact. Post 2 goes deep on this.
- L6 (Harness Engineering) – shipping. Two Claude skills do agent-authored integration work today. eva-flow-assistant writes new flow definitions when we onboard a customer. mcp-assistant writes the MCP server that integrates the customer's specific systems. Both ship under a harness that validates outputs before they reach production. Post 3 goes deep on this.
- In our own engineering, L4–L5 daily. The team uses Claude Code with internal codified skills – Eledath's L4 (Compounding Engineering) plus L5 (MCP and Skills). We aren't at L7 (Background Agents) for our own development work yet. We'll get there.
- L8 (Autonomous Agent Teams) – architecting toward, not at. Eledath himself says of L8: "Nobody has mastered this level yet, though a few are pushing into it." We're pushing. Post 4 is the field report from the edge.
This series follows in a genre – engineering-practice posts about agents at production scale. The current canon includes Ramp's writeup of Inspect, Block's skills marketplace (Angie Jones's 3 Principles for Designing Agent Skills is the recommended entry point), OpenAI's Harness Engineering, and Lee Robinson's Coding Agents and Complexity Budgets. The white space we're trying to occupy is engineering-practice posts from a regulated vertical.
The audit is the harness
Eledath's framing of L6 – Harness Engineering – is the most useful single concept in the framework, and it's the load-bearing concept for the rest of this series. The definition: "Building the entire environment, tooling, and feedback loops that let agents do reliable work without you intervening." The unlock isn't a smarter agent. The unlock is backpressure – automated feedback (type checks, tests, linters, contract tests, replay) that catches the agent's mistakes before they ship. Once you have enough backpressure, the agent can run without you.
The canonical L6 examples – Ramp's Inspect, OpenAI's harness work, Block's internal skills marketplace – share a quiet assumption: backpressure machinery can be wired into the agent's environment freely. You pick the sandbox, you pick the test harness, you pick the observability stack, you pick what the agent sees. Want browser sandboxing? Spin one up. Want a fresh database fixture per task? Trivial. Want the agent to read your Sentry and your Datadog and your LaunchDarkly? Wire it in.
For most of the L6 canon, "freely" is the right word. For us, "freely" is the word that breaks.
Three concrete constraints, each of which would shift an L6 design at Ramp or Block or Vercel into a constrained design at Synthpop:
- The sandbox can't touch the data we're processing. The natural L6 move when an agent needs to validate against real-world conditions is to give it the real-world environment. We can't. The realistic environment includes data that, in our domain, has to stay inside a specific perimeter. The agent and the data have to live in separate trust domains by construction – not as a defensive afterthought.
- The verification logs can't leave that perimeter. A standard L6 backpressure stack streams structured logs to a centralized observability tool. Some of those logs reference the data the agent saw. We can't ship those references out of our environment, which means the observability story has to be built inside the perimeter – not bolted on from outside.
- The human signing off on an agent-authored change has to understand it. This is the SOC 2 CC8.1 question, which Tian Pan called the Compliance Attestation Gap earlier this year. "Approved by" historically meant a human read the diff. When the diff is 1,800 lines of generated MCP server code, that definition fractures. Either you make the change small enough to read, or you make the harness signal legible enough to attest to. We chose the second.
Stack these three constraints together and the structure they describe isn't a coding agent inside a generic sandbox. The structure they describe is a particular shape of harness – a harness whose boundaries are defined by what the audit allows, whose feedback loops emit artifacts the audit can replay, and whose human-in-the-loop check is reframed as an audit of the harness itself rather than an audit of every individual change.
Which is to say: the regulatory perimeter isn't something the harness has to work around. The regulatory perimeter is the harness.
We don't have a compliance team that audits our agents. We have an agent harness that compliance can audit.
That's the thesis of this post and of the series. It's a thesis that, as far as I can tell, no engineering blog in 2026 is writing from inside. Vercel can't. Ramp can't. Block can't. They don't operate inside a healthcare regulatory perimeter, and the engineering content of operating inside one is exactly what the L6 conversation is missing right now.
The next three posts are what that thesis looks like when you actually have to ship something. Post 2 is the L5 substrate that makes the harness possible. Post 3 is the L6 harness in detail – how mcp-assistant and eva-flow-assistant ship code without a human reading every line. Post 4 is the L8 frontier, where two products start coordinating with each other and the audit boundary stops being about one agent and starts being about a system of them.
What L5 looks like under the audit
Post 2 – How We Compile a Phone Call: Level 5 Engineering for Voice in a Regulated Domain – is about the substrate.
Briefly: every Synthpop voice flow is a YAML file constrained by a Pydantic schema. The schema declares a closed set of step types – currently 16, ranging from primitives (init_llm, tool_call, execute_tool, check, end_call), through composite control flow (if, foreach, while, repeat, switch_tool, switch_case, break, continue), through speech-control (context_update, generate_content), to variable management (set_vars). The schema also declares the typed inputs of every tool, the explicit transitions between steps, the fallback paths for verification failures, and the max-failure thresholds on each check before the flow escalates. Variables are declared upfront in a flow_vars block. A flow that parses against the schema is, by construction, a flow whose space of conversational trajectories is finite, enumerable, and reviewable.
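To make the "closed set" idea concrete, here is a minimal, stdlib-only sketch of the kind of check a schema like this enables. It is not the eva validate implementation, and it uses plain dicts rather than Pydantic. The step-type names come from the list above; the dict layout, the `uses` key, and the `validate_flow` function are assumptions for illustration only.

```python
# The 16 step types named in the post, treated as a closed set.
STEP_TYPES = frozenset({
    "init_llm", "tool_call", "execute_tool", "check", "end_call",
    "if", "foreach", "while", "repeat", "switch_tool", "switch_case",
    "break", "continue",
    "context_update", "generate_content",
    "set_vars",
})


def validate_flow(flow: dict) -> list[str]:
    """Return a list of problems; an empty list means the flow is in-schema."""
    errors: list[str] = []
    declared_vars = set(flow.get("flow_vars", {}))
    steps = flow.get("steps", {})
    for name, step in steps.items():
        step_type = step.get("type")
        if step_type not in STEP_TYPES:
            errors.append(f"{name}: unknown step type {step_type!r}")
        # Legacy-style explicit transition; the target must be a declared step.
        target = step.get("next_step")
        if target is not None and target not in steps:
            errors.append(f"{name}: transition to undeclared step {target!r}")
        # Hypothetical `uses` key: variables a step reads must be declared upfront.
        for var in step.get("uses", []):
            if var not in declared_vars:
                errors.append(f"{name}: undeclared variable {var!r}")
    return errors


example_flow = {
    "flow_vars": {"patient_dob": None},
    "steps": {
        "initialize_llm": {"type": "init_llm", "next_step": "verify_dob"},
        "verify_dob": {"type": "check", "uses": ["patient_dob"], "next_step": "end"},
        "end": {"type": "end_call"},
    },
}

broken_flow = {
    "flow_vars": {},
    "steps": {"start": {"type": "improvise", "next_step": "nowhere"}},
}

print(validate_flow(example_flow))
print(validate_flow(broken_flow))
```

Because the step types and transition targets are both finite and declared, a check like this can run before any call is placed, which is what makes the trajectory space reviewable.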
The validator runs in CI on every flow change as eva validate. Two flow grammars coexist in the codebase right now – a legacy steps: format with explicit next_step transitions, and a modern do: format with inline composite control flow (then:, else:, do: blocks). New flows ship in the modern form; migrating older ones is one of Post 2's threads, and the same skill that authors flows also does the migrations.
The flow file is also the artifact a compliance reviewer signs off on. That's not a thing we bolted onto the architecture for compliance reasons; it's the thing that makes the architecture work at all. "What can this voice agent do during a call?" has a single-file answer. "What can this voice agent not do during a call?" has a single-file answer too – the static analyzer over the YAML proves it.
The MCP tool registry is the other half. The LLM never decides what to do; it decides what argument values to fill into a tool whose existence and signature were declared ahead of time. Most public voice-AI conversations talk about determinism as a property the model has to learn. We treat it as a property the substrate enforces. The model is allowed to be intelligent inside a step; it is not allowed to invent a transition between steps.
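A rough sketch of what "the model fills in arguments, the substrate owns everything else" can look like in code. This is an illustration under assumptions, not the scope validator itself: the registry layout, the argument names for check_dob, and both function names are hypothetical.

```python
# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOL_REGISTRY = {
    "check_dob": {"dob": str},
    "confirm_order_placement": {"mask_amt": int, "filters_gross": int},
    "end_call": {},
}


def undeclared_tools(steps: dict, registry: dict) -> set:
    """Static check: tools a flow offers that the registry never declared."""
    offered = set()
    for step in steps.values():
        offered.update(step.get("tools_offered", []))
    return offered - set(registry)


def validate_call(tool: str, args: dict, registry: dict) -> list:
    """Runtime check: compare a model-proposed call with the declared signature."""
    if tool not in registry:
        return [f"undeclared tool {tool!r}"]
    signature = registry[tool]
    errors = []
    for name in sorted(set(signature) - set(args)):
        errors.append(f"missing argument {name!r}")
    for name in sorted(set(args) - set(signature)):
        errors.append(f"unexpected argument {name!r}")
    for name, expected in signature.items():
        if name in args and not isinstance(args[name], expected):
            errors.append(f"{name!r} should be {expected.__name__}")
    return errors


steps = {
    "verify_dob": {"tools_offered": ["check_dob", "end_call"]},
    "rogue_step": {"tools_offered": ["wire_money"]},
}
print(undeclared_tools(steps, TOOL_REGISTRY))
print(validate_call("confirm_order_placement",
                    {"mask_amt": 1, "filters_gross": "two"}, TOOL_REGISTRY))
```

The model's only degree of freedom in this sketch is the `args` dict; which tools exist and what they accept is fixed before the call starts.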
A runtime concept worth flagging here, expanded in Post 3: flows can be wrapped in routers – thin dispatchers that A/B test flow variants against production traffic. The flow-authoring agent proposes a new variant; the router runs main and challenger in parallel (typically 95% to main, 5% to challenger); every call from either variant emits an audit record of the shape shown in the trace later in this post; if the challenger outperforms the main on the metrics that matter, the router rebalances and promotes it. The loop runs constantly. The agent proposes; the audit decides. The routers carry their own discipline: variants must share the same export-attributes schema, so the audit signal stays comparable across A/B groups.
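The split and the shared-schema rule described above can be sketched in a few lines. This is a minimal illustration, assuming a deterministic hash-based split keyed on a call ID; the function names, the basis-point parameter, and the schema dict shape are all hypothetical, and the promotion logic is left out because the post does not specify the metrics.

```python
import hashlib


def pick_variant(call_id: str, challenger_bps: int = 500) -> str:
    """Deterministically route a call; 500 basis points is a 5% challenger share."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so the same call ID always lands in the same arm.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "challenger" if bucket < challenger_bps else "main"


def schemas_comparable(main_attrs: dict, challenger_attrs: dict) -> bool:
    """Variants may only be compared if they export identical attribute schemas."""
    return main_attrs == challenger_attrs


main_attrs = {"call_successful": "bool", "max_failures_observed": "int"}
challenger_attrs = {"call_successful": "bool", "max_failures_observed": "int"}

if schemas_comparable(main_attrs, challenger_attrs):
    print(pick_variant("call-0001"))
```

Hashing the call ID, rather than drawing a random number per call, means a replayed call is assigned to the same arm, which keeps the assignment itself reconstructible after the fact.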
Post 2 also goes deep on the things you'd never guess from the YAML – like the stale speech_complete trap that turns out to be the central concurrency problem of voice agents, and the explicit speech-timing primitives (spoken_exact, wait_for_speech) we built to defuse it.
What L6 looks like under the audit
Post 3 – Audit, Don't Author: How a Background Agent Writes Our Healthcare Integrations – is about the harness.
L6 in our system shows up in three places. Two of them surface elsewhere in this post: the call auditor that votes on every conversation in the audit record (shown in the trace under "A flow, end to end"), and the continuous-improvement loop where the flow agent proposes variants and the router picks winners against production traffic. The third – the one most of the L6 canon focuses on – is agent-authored code under a verifying harness. Post 3 deep-dives that third manifestation; the first two get their full treatment there too.
Every new Synthpop customer is also a new EMR, a new payor portal, a new clearinghouse, a new payment processor, a new set of fax gateways. Each of those needs an MCP server. Hand-writing a new MCP server every time a customer onboards doesn't scale. So we don't.
Two Claude Code skills do agent-authored integration work today, both running on Claude Opus 4.7. mcp-assistant takes a customer's API documentation as input and emits a working MCP server. eva-flow-assistant is the matching skill that composes those MCP tools into flow definitions – and it does more than write new flows; it also improves, debugs, migrates, and locally tests them, against a Flow Review Checklist that compresses about ten engineers' worth of accumulated "don't do this" lessons into a single artifact the skill applies to every flow it touches.
Both skills run inside a pipeline that applies the following checks before any output reaches production:
- eva validate against the flow schema
- synthetic traffic against the generated code (mock-tool implementations of the customer's real systems, registered through an @register_tool decorator and required to pass ruff check, mypy, and 100% test coverage)
- contract tests against documented examples
- schema validation at every tool boundary
- static analysis of the flow against the declared tool registry (the scope validator)
- replay against similar historical integrations
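One way to picture a gate like this is as a set of named checks whose combined report is the thing a human attests to. The sketch below assumes that shape; `CheckResult`, `run_gate`, and the stand-in checks are hypothetical and are not the pipeline's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_gate(artifact: dict,
             checks: dict[str, Callable[[dict], tuple[bool, str]]]
             ) -> tuple[bool, list[CheckResult]]:
    """Run every check (no early exit) so the report shows the full signal."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check(artifact)
        except Exception as exc:
            # A check that crashes counts as a failure, never as a pass.
            passed, detail = False, f"check raised {exc!r}"
        results.append(CheckResult(name, passed, detail))
    return all(r.passed for r in results), results


# Stand-in checks; real ones would shell out to validators, test runners, replay.
def schema_check(artifact: dict) -> tuple[bool, str]:
    return isinstance(artifact.get("tools"), list), "tools must be a list"


def contract_check(artifact: dict) -> tuple[bool, str]:
    failures = artifact.get("contract_failures", 1)
    return failures == 0, f"{failures} contract failures"


def replay_check(artifact: dict) -> tuple[bool, str]:
    return artifact["replay_mismatches"] == 0, "replay against history"


artifact = {"tools": ["check_eligibility"], "contract_failures": 2}
approved, report = run_gate(artifact, {
    "schema": schema_check,
    "contract": contract_check,
    "replay": replay_check,
})
print(approved)
for result in report:
    print(result)
```

Running every check instead of stopping at the first failure is a deliberate choice in this sketch: the reviewer is auditing the signal, so a partial report would hide information.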
Notice what's not in that list: a human reading every line. The point of L6 isn't that the agent writes the code. The point is that the harness signal is rigorous enough to be the approval, with a human auditing the harness signal rather than re-deriving the change. "Approved by" still means something. It just means something slightly different than it did before.
Post 3 is what that pipeline looks like in detail, what we did when the harness wasn't enough, and a substantive sub-thread on how the same skill migrates legacy flows to modern format – a workflow that turned out to be the canary for whether a meta-agent can do judgment work, not just code-generation.
What L8 looks like under the audit
Post 4 – What L8 Looks Like When Claims Travel Through a Payor – is about the frontier.
Eledath calls L8 the active frontier; he's right that nobody has mastered it. The public examples in 2026 are all coding agents: Cursor's 16-agent swarms, Anthropic's parallel research agents. Impressive work, with the honest caveats their own teams attach (lock contention, risk-aversion, coordination failures).
Coding is the easy case for L8. The test signal is deterministic. The re-runs are free. The stakes per individual action are low. We are not in the easy case. We're architecting toward a state where the PDF pipeline can detect a missing form and trigger the voice agent to call the clinic, without a human in between and without a central orchestrator on the critical path. The stakes per action are claim dollars and patient outcomes. A re-run isn't free.
Post 4 is what we have working, what we don't, and what we think the rest of the industry needs to build with us.
A flow, end to end
The trace below is synthetic in its particulars, but the flow name (cpap_resupply), the step structure, the step types, and the transition shapes are real (from one of our demonstration flows). The audit-record fields are a representative shape of what the runtime emits; the patient details are template placeholders.
The flow is triggered when our system identifies a patient due for a CPAP resupply check. The voice agent dials the patient, walks identity verification, confirms resupply consent, captures order details, validates contact info and doctor-visit recency, checks inpatient-care status, and writes back a structured outcome. The trace below is one full execution, with the middle steps elided.
flow: cpap_resupply
flow_version: 1.0
trigger: scheduled-resupply-check
input: { patient_first_name: ●●●, patient_last_name: ●●●,
         patient_dob: ●●●, contact.address: ●●●, company_name: ●●● }
step: initialize_llm (init_llm)
  model: openai (voice=sage)
  transition: → introduce_and_confirm_user
step: introduce_and_confirm_user (tool_call)
  tools_offered: [user_confirmed_identity, transfer_user, end_call]
  tool_called: user_confirmed_identity
  transition: → verify_dob
step: verify_dob (check)
  check_tool: check_dob
  expected: {{ patient_dob }}
  llm_comparison: match=true
  failures: 0 / 2
  transition: on_success → confirm_resupply
step: confirm_resupply (tool_call)
  tools_offered: [user_confirmed_resupply, user_denied_resupply,
                  transfer_user, end_call]
  tool_called: user_confirmed_resupply
  transition: → confirm_order_placement
step: confirm_order_placement (tool_call)
  tool_called: confirm_order_placement
  tool_args: { mask_amt: 1, filters_gross: 2,
               heated_humidifier: 1, suplemental_oxygen: 2 }
  loop_iterations: 2 (patient updated quantities once)
  tool_called: user_confirmed_order
  transition: → confirm_contact_info
[steps confirm_contact_info, doctor_visit, inpatient_care_check,
 close_call elided – same shape; all transitions in-schema]
step: end (end_call)
audit:
  flow_hash: sha256:9c4f...
  flow_version: 1.0
  span_id: 0a1f7e...
  duration_ms: 412091
  llm_calls: 23
  llm_total_tokens: 18420
  tool_calls: 14
  steps_executed: 9 of 9
  no_undeclared_transitions: true
  no_undeclared_tools: true
  max_failures_observed: 0
  transcript: [attached, full call text]
  call_auditor_verdict: compliant
  call_successful: true
  call_auditor_model: gemini-flash-3.5
  human_review_required: false
Three fields in that record carry most of the auditor's weight: no_undeclared_transitions and no_undeclared_tools are the mechanical ones – the static analyzer over the flow file proves them at build time; the runtime confirms they held. call_auditor_verdict is the third, and it's not mechanical. The full call transcript is attached to the audit record, and a separate agent – we call it the call auditor – reads the transcript against the flow spec the call was supposed to honor and answers one question: did the conversation actually behave the way the flow promised? Its verdict goes into the record. We currently run the call auditor on Gemini Flash 3.5 – a different model family from the Claude Opus 4.7 we use to author flows. The choice is consistent with the rest of our stack: our broader platform runs on Vertex AI as part of the Gemini Enterprise partnership above. Gemini family on the verification side, Claude Opus on the authoring side. Different agent, different model, fresh look at the same transcript. No single model failure mode dominates the audit signal.
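As an illustration of how those three fields could combine into a review decision, here is a small fail-closed sketch. The rule itself is an assumption: the post does not state how human_review_required is derived, and `needs_human_review` is a hypothetical name.

```python
def needs_human_review(record: dict) -> bool:
    """Fail closed: anything short of a clean record is escalated to a human."""
    mechanical_ok = (
        record.get("no_undeclared_transitions") is True
        and record.get("no_undeclared_tools") is True
    )
    auditor_ok = record.get("call_auditor_verdict") == "compliant"
    return not (mechanical_ok and auditor_ok)


clean_record = {
    "no_undeclared_transitions": True,
    "no_undeclared_tools": True,
    "call_auditor_verdict": "compliant",
}
# A record with a missing field is treated the same as a failed one.
incomplete_record = {
    "no_undeclared_transitions": True,
    "call_auditor_verdict": "compliant",
}

print(needs_human_review(clean_record))
print(needs_human_review(incomplete_record))
```

The point of the fail-closed default is that a missing or malformed field never reads as a pass.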
What you don't see in the trace itself, because it's the whole point: the LLM didn't decide to ask for the DOB before the address. It didn't decide that address verification should be offered as a fallback if DOB verification fails – that branch is declared in the flow definition. It didn't decide that confirm_order_placement could be re-entered when the patient changed an order quantity. It chose tool arguments – the words it said, the values it captured – inside steps. Every transition between steps came from the substrate.
One thing that broke
Same caveat as the trace: synthetic, but a real class of failure, anchored in something the team has actually had to fix.
Last quarter, mcp-assistant generated a handler for a new payor's eligibility verification endpoint. Pydantic validation passed. Contract tests against the documented examples passed. The flow shipped. For the first six weeks in production it ran cleanly.
It turned out the payor's documented response taxonomy was incomplete. The payor's real-world response payload included a status ("verified-with-conditions") that the documentation we'd been given didn't list. The generated handler's switch statement mapped any non-error status onto the same downstream outcome as a clean verification. The conditions – which in some cases included a modified benefit limit – got dropped on the floor. Coverage flows proceeded downstream as if the conditions didn't exist. A small number of fulfillment decisions were made against the wrong limit.
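The failure class is easy to reproduce in miniature. The sketch below is a reconstruction for illustration, not the generated handler: apart from "verified-with-conditions", the status names, the outcome labels, and the payload shape are hypothetical.

```python
class UnmappedStatus(Exception):
    """Raised when a payor returns a status the handler has no explicit mapping for."""


def handle_permissive(payload: dict) -> str:
    # The buggy shape: every non-error status falls through to a clean verification.
    if payload["status"] == "error":
        return "retry"
    return "proceed"


# Explicit mapping: each status the handler accepts is listed by name.
STATUS_TO_OUTCOME = {
    "verified": "proceed",
    "denied": "stop",
    "error": "retry",
}


def handle_strict(payload: dict) -> str:
    status = payload["status"]
    if status not in STATUS_TO_OUTCOME:
        # Surface the unknown status instead of silently dropping its conditions.
        raise UnmappedStatus(status)
    return STATUS_TO_OUTCOME[status]


payload = {"status": "verified-with-conditions",
           "conditions": {"benefit_limit": 2}}

print(handle_permissive(payload))
try:
    handle_strict(payload)
except UnmappedStatus as exc:
    print(f"unmapped status: {exc}")
```

Both versions pass contract tests built only from documented examples, since the undocumented status never appears in them; that is why the remedy had to come from the harness rather than from the handler alone.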
A quarterly internal audit caught the pattern: claim denials from this payor spiked relative to baseline. Triage traced the denials back to the generated handler.
The fix wasn't better code generation. The fix was a new backpressure check. mcp-assistant now has to demonstrate, against a corpus of historical traffic from each payor class, that every response it observed in production maps to a downstream outcome the schema can defend. Documented examples are the floor, not the ceiling. The harness now refuses any MCP server whose handler maps to a default branch on more than a configurable threshold of real traffic.
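A threshold check of this kind can be sketched as a count over a traffic corpus. This is a minimal version under assumptions: the corpus is reduced to a list of status strings, the 1% default threshold is arbitrary, and `default_branch_report` is a hypothetical name rather than the harness's own.

```python
from collections import Counter


def default_branch_report(observed_statuses, mapped_statuses, max_share=0.01):
    """Refuse a handler if too much real traffic would hit its default branch."""
    counts = Counter(observed_statuses)
    total = sum(counts.values())
    if total == 0:
        # No traffic means no evidence; treat that as an error, not a pass.
        raise ValueError("empty traffic corpus: nothing to verify against")
    unmapped = {s: n for s, n in counts.items() if s not in mapped_statuses}
    share = sum(unmapped.values()) / total
    return {
        "accepted": share <= max_share,
        "default_share": share,
        "unmapped": unmapped,
    }


corpus = ["verified"] * 97 + ["verified-with-conditions"] * 3
report = default_branch_report(corpus, {"verified", "denied", "error"})
print(report)
```

Returning the unmapped statuses alongside the verdict matters as much as the verdict: the list is what tells a reviewer which part of the payor's documentation was incomplete.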
The general lesson is the post's thesis in miniature. A model improvement wouldn't have caught this. The audit caught it. Then the harness changed to make sure the next audit didn't have to.
What we'd like the rest of the industry to build with us
Three things, in order of how badly we miss them.
Auditable agent-to-agent protocols. When Synthpop's voice agent eventually talks to a payor's voice agent – and to whoever else's voice agent gets to a healthcare-grade L8 – there is no protocol today that lets an auditor reconstruct what each side believed, what each side committed to, and which side is liable for a downstream error. We have HTTP. We have tool-use schemas. We have MCP. None of those were designed with agent-to-agent audit replay as a first-class concern. A protocol-level standard for attestable agent interactions would be load-bearing infrastructure for the L7–L8 conversation.
Regulatory-grade replay. Most foundation-model providers ship logging APIs. None are regulatory-grade in our sense – they don't survive the kind of audit that a healthcare deployment goes through. We've built our own; everyone in our position has. The fact that everyone has built their own says we should be building this together.
Cross-org agent identity. When an agent is acting on behalf of a covered entity, "who acted" only has a sharp answer if the chain of delegation is cryptographically attributable. There are pieces – SPIFFE, DPoP, JWT delegation – but no story that survives the way a real cross-org audit thinks.
We think these primitives have to come from a vertical (probably this one) before they generalize. The vertical forces the constraints to be specific. If you're working on any of the above, our inbox is open.


