Agentic engineering at Synthpop: A field report from levels 5 to 8 (Post 3)

Davor Runje
June 16, 2026

TL;DR

A new customer integration at Synthpop is increasingly written by an agent, not a person – and the reason we can let it ship is not that the agent is smart. It's that it runs inside a harness that won't let unverified work through. Eledath calls Level 6 "Harness Engineering", and its core is backpressure: automated feedback that catches the agent's mistakes before they reach production. Most of the L6 canon is about coding agents in an IDE. Ours lives inside an audit perimeter, so it took three shapes – a runtime verifier on every call, a continuous experiment loop on the substrate, and the code-generation pipeline that onboards new clients. This post walks through all three. The throughline: audit, don't author.

Recap: the artifact, and the agent that writes it

If you read Post 2, here's the one sentence to carry over: a voice agent at Synthpop is a typed flow – a small language over a registry of MCP tools, type-checked by eva validate before it ever runs. That artifact, and the fact that you can prove things about it ahead of the call, is Level 5. The substrate.

This post is about a different level operating on the same artifact, and the distinction is the one most people get wrong. The validator from Post 2 is not, by itself, a harness – run by a human, it's just a type-checker. It becomes Level 6 the moment an agent is the one authoring the flow and the verifier is looping back on it: generate, validate, read the errors, fix, validate again, until the artifact holds and the harness clears it to ship. The flow language didn't change between Post 2 and Post 3. What changed is that there's now an agent in the loop, and backpressure pushing on it.
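
The generate, validate, fix cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not Synthpop's implementation: `backpressure_loop`, `LoopResult`, and the three callables are hypothetical names, and in practice the `validate` callable would wrap a run of eva validate, whose invocation and output format are not shown in this post.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoopResult:
    artifact: str
    attempts: int
    cleared: bool


def backpressure_loop(
    generate: Callable[[], str],
    validate: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_attempts: int = 5,
) -> LoopResult:
    """Generate an artifact, then validate and fix until no errors remain."""
    artifact = generate()
    for attempt in range(1, max_attempts + 1):
        errors = validate(artifact)
        if not errors:
            # The verifier has nothing left to push back on: cleared to ship.
            return LoopResult(artifact, attempt, cleared=True)
        artifact = fix(artifact, errors)
    # Budget exhausted: return the artifact, but never mark it as cleared.
    return LoopResult(artifact, max_attempts, cleared=False)


# Toy stand-ins (assumptions): a validator that demands an on_error handler,
# and a "fix" that appends one.
def toy_validate(flow: str) -> List[str]:
    return [] if "on_error" in flow else ["remote call has no on_error handler"]


result = backpressure_loop(
    generate=lambda: "execute_tool: lookup_order",
    validate=toy_validate,
    fix=lambda flow, errors: flow + " on_error: retry_step",
)
```

The property that matters is in the last line of the function: an artifact that runs out of attempts comes back uncleared, so the loop can fail but cannot ship unverified work.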

That shift – from a human running a type-checker to an agent running under automated feedback – is what this post is about. So: when we onboard a new healthcare client – new EMR, new payor portal, new fax gateway – who writes the tools, and who writes the flows that compose them? Increasingly, not a person. Two Claude skills do, under a harness. This post is how.

What Level 6 actually means

Eledath: "Building the entire environment, tooling, and feedback loops that let agents do reliable work without you intervening." The core concept is backpressure – type systems, tests, linters, contract checks, replay – automated verifiers that catch the agent's mistakes before they ship.

The L6 examples in the wild (OpenAI's Harness Engineering, Ramp's Inspect, Block's skills marketplace) are about coding agents. They share an assumption: backpressure machinery can be wired into the agent's environment freely – any sandbox, any logger, any verifier. We can't; we live inside an audit perimeter. So L6 at Synthpop took a different shape.

L6 shows up in three places

L6 isn't one thing at Synthpop. It shows up in three distinct production patterns:

  • Manifestation 1 – Runtime verification. A separate agent (the call auditor, currently running on Gemini Flash 3.5) reads every call's transcript against the flow spec and votes on whether the conversation honored what the flow promised. Cross-model verification by design – different model family from the Opus 4.7 that authors the flow. The verdict lands in the audit record you saw in Post 1, alongside the mechanical assertions about undeclared transitions and undeclared tools.
  • Manifestation 2 – Runtime experimentation. The flow-authoring agent (eva-flow-assistant) proposes new variants of existing flows. Routers run main + challenger against production traffic; the call auditor and the export-attributes metrics decide which variant wins; winners get promoted. The loop runs constantly. The agent proposes. The audit decides.
  • Manifestation 3 – Code-generation. Two Claude Code skills onboard new customer integrations: mcp-assistant writes the MCP servers, eva-flow-assistant writes the flows that compose those MCP tools. Both ship under a harness that validates outputs before any code reaches production. This is the L6 case the canon writes about. We added the first two manifestations because we live inside an audit perimeter – the canon doesn't.

Most of this post is manifestation 3 – it has the most surface area and the most distance from the canon's assumptions. But the three together are the L6 story at Synthpop, and the shape of the third depends on the first two.

Manifestation 1: Runtime verification – the call auditor

Start with the layer closest to the conversation. Every call leaves an audit record, and three independent things vote on whether the call was sound.

The first two are mechanical and were settled before the phone rang: a static analyzer proved at build time that the flow never transitions to an undeclared step and never calls an undeclared tool, and the runtime confirms those two facts held for this specific call. They are booleans, and they cannot be argued with.
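
To make the two mechanical checks concrete, here is a small sketch of what a build-time pass over a flow graph might look like. The dictionary shape (`next`, `tool`) and the `check_flow` name are assumptions made for illustration; the real flow grammar and analyzer are not reproduced here.

```python
from typing import Dict, List, Set


def check_flow(steps: Dict[str, dict], declared_tools: Set[str]) -> List[str]:
    """Return one error per undeclared transition or undeclared tool call."""
    errors: List[str] = []
    for name, step in steps.items():
        # Every transition target must be a step that exists in the flow.
        for target in step.get("next", []):
            if target not in steps:
                errors.append(f"{name}: transition to undeclared step '{target}'")
        # Every tool a step calls must be in the declared tool registry.
        tool = step.get("tool")
        if tool is not None and tool not in declared_tools:
            errors.append(f"{name}: call to undeclared tool '{tool}'")
    return errors


# Hypothetical flow: 'lookup' jumps to a step nobody declared.
flow_steps = {
    "greet": {"next": ["lookup"]},
    "lookup": {"tool": "find_order", "next": ["confirm"]},
}
flow_errors = check_flow(flow_steps, declared_tools={"find_order"})
```

An empty error list is the boolean the paragraph above describes: either the graph is closed over its declared steps and tools, or it is not.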

The third is the interesting one. A separate agent – the call auditor – reads the transcript of what was actually said against the flow's spec, and votes on whether the conversation honored what the flow promised. Mechanical checks can prove the agent stayed inside the graph; they cannot prove it said the right things for the right reasons. That requires reading the conversation, which requires a model.

The design choice that makes this audit-grade: the call auditor runs on a different model family than the one that authors and executes the flow. Today that's Gemini Flash auditing flows that run on Anthropic models – coherent with the rest of the stack, since Synthpop runs on Vertex AI and is the first healthcare AI agent partner in Google's Gemini Enterprise Agent Gallery. You do not want the same model's blind spots on both sides of the verification. No single model failure mode gets to dominate the audit signal.

So one record carries three layers: a build-time proof, a runtime confirmation, and a cross-model semantic verdict. (You saw the metrics half of that record in Post 2 – the flow declares what a good call means, as a typed export_attributes list.) That record is L6 backpressure operating per-call, at runtime – the simplest of the three manifestations to state, and the one the other two lean on.
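
One way to picture that record is as a small typed structure. The field names below are hypothetical, and the rule that all three layers must pass is an assumption made for the sketch; this post does not say how the three votes are combined.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HONORED = "honored"
    VIOLATED = "violated"


@dataclass(frozen=True)
class AuditRecord:
    call_id: str
    build_time_proof: bool       # static analyzer: no undeclared steps or tools
    runtime_confirmation: bool   # the same two facts held for this specific call
    auditor_verdict: Verdict     # semantic read of the transcript against the spec
    author_model_family: str
    auditor_model_family: str

    def is_sound(self) -> bool:
        # Assumption: the semantic verdict only counts when it is cross-model.
        cross_model = self.author_model_family != self.auditor_model_family
        return (
            self.build_time_proof
            and self.runtime_confirmation
            and cross_model
            and self.auditor_verdict is Verdict.HONORED
        )


record = AuditRecord(
    call_id="call-001",  # illustrative id
    build_time_proof=True,
    runtime_confirmation=True,
    auditor_verdict=Verdict.HONORED,
    author_model_family="anthropic",
    auditor_model_family="gemini",
)
```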

Manifestation 2: Runtime experimentation – the router loop

The second manifestation puts that per-call verdict to work. A router is a thin wrapper that A/B-tests flow variants against live traffic:

organization: Synthpop
routers:
  cgm_resupply:
    description: Compare the production flow against a shorter-intro challenger.
    variants:
      - flow: cgm_resupply_v1     # incumbent
        version: latest
        weight: 95
      - flow: cgm_resupply_v2     # challenger
        version: latest
        weight: 5


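That is the whole shape of a router: named flow variants, each pinned to a version and given a share of traffic. Here the incumbent is weighted 95 to the challenger's 5; the call auditor's verdicts and the export_attributes metrics decide which variant wins, and winners get promoted. The variants themselves are ordinary flows from Post 2: a named flow, a declared set of variables, and an ordered list of typed steps. There are sixteen step types in all – among them init_llm, context_update, generate_content, execute_tool, check, if, switch_tool, foreach, and repeat – and a flow is nothing but a tree of them.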
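
The weighted split in the router config above can be sketched as a weighted random choice. This is an illustration only: `pick_variant` is a hypothetical name, and the production router may assign traffic differently (for example with sticky per-caller assignment), which this post does not specify.

```python
import random
from collections import Counter
from typing import List, Tuple

# Flow names and weights copied from the router config above.
VARIANTS: List[Tuple[str, int]] = [
    ("cgm_resupply_v1", 95),  # incumbent
    ("cgm_resupply_v2", 5),   # challenger
]


def pick_variant(variants: List[Tuple[str, int]], rng: random.Random) -> str:
    """Pick one flow variant with probability proportional to its weight."""
    flows = [flow for flow, _ in variants]
    weights = [weight for _, weight in variants]
    return rng.choices(flows, weights=weights, k=1)[0]


# Simulate 10,000 calls with a fixed seed to inspect the split.
rng = random.Random(0)
counts = Counter(pick_variant(VARIANTS, rng) for _ in range(10_000))
challenger_share = counts["cgm_resupply_v2"] / 10_000
```

Over enough calls the challenger's share settles near 5%, which keeps the blast radius of an unproven variant small while the audit signal accumulates.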

The language is deliberate about the line it draws. It constrains the things that must be reviewable: the ordering of turns, the legal transitions between them, the boundary of which tools exist, what happens when a tool fails. It leaves open the things the model is actually good at: which tool to call next within the allowed set, what arguments to fill, what words to say. The model chooses slot values. It never chooses destinations. The set of places the conversation can go is fixed at authoring time and proven at build time – the model moves within it, not against it.

That single split – open on language, closed on control flow – is the entire Level 5 idea applied to voice.

Manifestation 3: Code-generation – onboarding a new client

This is the manifestation the L6 canon writes about, and it has the most surface area. Every new healthcare client is a new integration problem: a different EMR, a different payor portal, a different fax gateway, a different clearinghouse, a different subset of EDI. Each one needs MCP tools that talk to those systems and flows that orchestrate them. Hand-writing all of that, per client, is the throughput bottleneck – and the bottleneck was never model quality. It was engineering hours.

Two Claude skills do that work now, and they compose. mcp-assistant writes the MCP server – the Python tools that integrate the client's specific systems. eva-flow-assistant writes the flows that compose those tools into the voice and document pipelines. Both run on Claude Opus. The input is a pile of a client's API documentation – often messy: PDFs, half-broken Swagger, scraped HTML, a Postman collection someone exported in 2022. The output is a working MCP server, the flows that wire it in, and a test suite.

The reason we can let that output ship is the harness around it. Here is what "verifying its own work" actually consists of – and most of it is not novel AI, it's boring, ruthless software engineering pointed at an agent's output.

The skill knows when not to use the model

The single most consequential piece of backpressure isn't a test. It's a design rule the skill applies to every step it writes. The canonical mistake an agent makes authoring these flows is reaching for the model when it doesn't need to – wrapping a deterministic backend call in a conversational tool_call because that's the most general tool in the box. General, but slow, nondeterministic, and unreviewable. So the skill is forced through a decision gate before it writes any tool-using step:

What is this step doing?                              Use:
──────────────────────────────────────────────────────────────────
Calling a backend API with known arguments            execute_tool      (no model)
Routing on what the user said                         switch_tool       (model picks, no args)
Routing on a variable value                           if / switch_case  (no model at all)
Just saying something                                 context_update + generate_content
Conversationally collecting arguments from a caller   tool_call         (and only then)


Every avoided tool_call is a place where the conversation became deterministic instead of improvised. The gate is, in effect, a rule that says use the least powerful construct that does the job – enforced at authoring time, before the flow exists, not caught downstream in a review.
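
The gate can be read as a lookup table plus a refusal. The sketch below encodes the table above directly; the `Intent` enum and the `gate` function are hypothetical names introduced for illustration, not the skill's actual code.

```python
from enum import Enum
from typing import Optional


class Intent(Enum):
    BACKEND_CALL_KNOWN_ARGS = "calling a backend API with known arguments"
    ROUTE_ON_UTTERANCE = "routing on what the user said"
    ROUTE_ON_VARIABLE = "routing on a variable value"
    SAY_SOMETHING = "just saying something"
    COLLECT_ARGS = "conversationally collecting arguments from a caller"


# One row per line of the decision table.
ALLOWED = {
    Intent.BACKEND_CALL_KNOWN_ARGS: {"execute_tool"},
    Intent.ROUTE_ON_UTTERANCE: {"switch_tool"},
    Intent.ROUTE_ON_VARIABLE: {"if", "switch_case"},
    Intent.SAY_SOMETHING: {"context_update", "generate_content"},
    Intent.COLLECT_ARGS: {"tool_call"},
}


def gate(intent: Intent, construct: str) -> Optional[str]:
    """Return None if the construct fits the intent, else a rejection message."""
    allowed = ALLOWED[intent]
    if construct in allowed:
        return None
    expected = ", ".join(sorted(allowed))
    return f"'{construct}' rejected for {intent.value}; expected: {expected}"


# The canonical mistake: a conversational tool_call around a deterministic call.
rejection = gate(Intent.BACKEND_CALL_KNOWN_ARGS, "tool_call")
```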

The checklist is the team's scar tissue, made executable

The skill carries a ~50-item Flow Review Checklist and runs it against every flow it touches. It is not abstract best-practice; it's the accumulated list of things that have gone wrong on real calls, written down so an agent can't repeat them. A few representative items:

  • Every context_update with spoken content has a generate_content immediately after it – otherwise the agent goes silent.
  • Farewell and critical-confirmation messages wait for spoken_exact, so they finish before the flow advances.
  • No on_error handler uses an empty list [] – that fails validation; omit it or name a recovery step.
  • Tool options whose execution_message is followed by a proactive step set wait_for_speech, so filler doesn't swallow the intended line.

These are exactly the voice-concurrency traps from Post 2, turned into a verifier. The checklist is L6 backpressure made human-readable.
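
Two of those items are simple enough to show as executable checks. The step representation below (a list of dicts with `type`, `spoken`, and `on_error` keys) is an assumption made for this sketch and is not the real flow grammar; `review_steps` is likewise a hypothetical name.

```python
from typing import List


def review_steps(steps: List[dict]) -> List[str]:
    """Apply two checklist items to a flat list of steps and return findings."""
    findings: List[str] = []
    for i, step in enumerate(steps):
        following = steps[i + 1] if i + 1 < len(steps) else None
        # Item: spoken context_update must be followed by generate_content.
        if step.get("type") == "context_update" and step.get("spoken"):
            if following is None or following.get("type") != "generate_content":
                findings.append(
                    f"step {i}: context_update with spoken content "
                    "is not followed by generate_content"
                )
        # Item: an empty on_error list fails validation.
        if "on_error" in step and step["on_error"] == []:
            findings.append(
                f"step {i}: on_error is an empty list; "
                "omit it or name a recovery step"
            )
    return findings


sample_steps = [
    {"type": "context_update", "spoken": "Thanks for calling."},
    {"type": "execute_tool", "tool": "find_order", "on_error": []},
]
findings = review_steps(sample_steps)
```

Each finding names the step and the rule it broke, which is what lets the agent read the output and fix the flow without a human translating the complaint.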

And then the ordinary gates, applied without mercy

  • eva validate, in a loop. The skill runs the validator after every change, reads the errors, fixes them, and runs again until the artifact holds. That generate → validate → fix loop is the backpressure; the validator from Post 2 is the thing pushing back.
  • Mock tools, fully tested. Before the real integration exists, the skill writes mock tools that return representative data so the flow can be exercised end-to-end. Any Python it writes – mock or real – must pass ruff, pass mypy, and hit 100% test coverage. Those aren't aspirations; they're gates.
  • Typed boundaries and error handling. Every tool is a typed contract; remote calls get on_error handlers, with MRO-aware matching so a ConnectionError handler catches a transport failure the author never explicitly named.
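
MRO-aware matching is easy to show in plain Python. The sketch below walks the raised exception's method resolution order and returns the first handler registered for any class along it. How the flow engine implements this is not shown in this post; `match_handler` and the recovery step names are hypothetical.

```python
from typing import Dict, Optional, Type


def match_handler(
    exc: BaseException,
    handlers: Dict[Type[BaseException], str],
) -> Optional[str]:
    """Return the recovery step of the most specific matching handler."""
    # __mro__ lists the exception's own class first, then its ancestors,
    # so the first hit is the most specific handler that applies.
    for cls in type(exc).__mro__:
        if cls in handlers:
            return handlers[cls]
    return None


# Hypothetical on_error table: exception class -> recovery step name.
on_error = {
    ConnectionError: "retry_later",
    Exception: "transfer_to_human",
}

# ConnectionResetError subclasses ConnectionError, so it is caught even
# though the author never named it explicitly.
transport_step = match_handler(ConnectionResetError(), on_error)
fallback_step = match_handler(ValueError("bad payload"), on_error)
```

Here `transport_step` resolves to the ConnectionError handler and `fallback_step` falls through to the general one, which is the behavior the bullet above relies on.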

The point of the stack is one sentence: the skill doesn't ship something we didn't verify – it ships something the harness verified.

Can the agent refactor, not just generate?

There's a sharper test of an authoring agent than "can it write new code," and the skill faces it constantly: migration. Two flow grammars coexist in our codebase – a legacy steps: format with explicit next_step jumps, and the modern do: format with inline composite control flow. Moving a flow between them is not mechanical translation. The skill has to map the flow graph, recognize which next_step branches should become if / switch_tool / repeat composites, audit every tool_call against the decision gate above, and – the genuinely hard part – notice when a change to a shared tool would ripple into other flows that use it, and coordinate accordingly. Switching a tool from swallowing errors to raising them, so on_error can catch them, might be correct for the flow in front of you and breaking for one three directories over. Generation is the easy half. Refactoring legacy code without breaking its neighbors is the half that tells you whether the harness is real.

From writing code to auditing intent

When the backpressure is good enough, the human role changes shape. The reviewer is no longer reading every line of a few thousand lines of generated MCP server and pretending that constitutes understanding. They are auditing the harness signal and the agent's stated intent: did the contract tests pass, did the validator clear it, did the checklist run, does the change do what the agent said it would?

This is where a real compliance question surfaces, and it's worth naming rather than dodging. SOC 2 CC8.1 wants change management with an approver who understood the change. What does "understood" mean when the change is thousands of lines an agent wrote? Our answer is that "approved by" stops meaning "read every line" – which was always a polite fiction for large diffs – and starts meaning "reviewed the evidence the harness produced." The harness signal is the artifact the approver actually understood. That's only honest if the harness is genuinely load-bearing, which is the whole reason the bulk of this post is about backpressure and not about prompts.

When the harness isn't enough

The honest part. A harness gives you confidence proportional to what it actually checks, and there are failures it doesn't catch – including ones we know about and haven't fully fixed.

Here's a concrete one, documented in the skill's own reference material so the agent steers around it. When a tool buried inside a repeat loop – say a cancel_order step nested in a switch_tool option inside the loop body – both sets the loop's until condition to true and declares a next_step, the loop can exit on the satisfied condition and ignore the jump. Two correct-looking intentions, one of which silently wins. It's the kind of bug no type-checker catches, because nothing is mistyped: the flow is well-formed and does the wrong thing.

What we did is the realistic thing, not the heroic thing. We didn't claim it away. The failure mode is written into the skill's knowledge with its workaround – set state via sets_vars that the normal exit path reads, so the behavior is correct regardless of which exit route the engine takes – so every flow the agent authors from here on avoids the trap by construction. That's what "the harness wasn't enough" looks like in practice: you find the class of failure, you encode it, and the next thousand flows inherit the lesson. The harness is never finished. It's the thing you improve every time it lets something through.

Why this is L6 and not "just code generation"

L6 isn't about who writes the code. It's about whether the code can be shipped without a human reading every line of it. Backpressure is the test. We are not in the business of generating code; we are in the business of generating verifiably correct code.

Background Agents (L7): why this runs while we sleep

One more level, briefly, because it's a property of how the skills run rather than what they verify. The integration work doesn't happen in a chat window with an engineer watching tokens stream. It runs as background jobs – kicked off, left to work, surfacing for review when the harness has a verdict. That's Eledath's Level 7: background agents. And it's only safe because of L6 – you can let an agent run unattended exactly to the degree that the backpressure around it is trustworthy. L7 is what L6 buys you. Ramp's Inspect makes the same point from the coding-agent side: the background agent is downstream of the harness, not a substitute for it.

What it cost, and what it bought

The honest accounting, minus the figures I can't put on this page. Building the harness was more expensive than building a code-generation agent would have been – and that's the point. Most of the work went into the verifiers, not the generator. What it bought is the one thing that actually constrains a healthcare-integration business: throughput that isn't capped by the number of integration engineers in the room, without giving up the property that every shipped change is defensible to an auditor. We didn't make onboarding a client free. We made it something a harness can vouch for.

What's next

Everything so far has one human-shaped backstop: the harness produces a signal, and a person still approves the result – even if "approve" now means "audit the evidence." The frontier is what happens when the agents stop routing through that backstop and start coordinating with each other – when the document agent hands work to the voice agent directly, no ticket, no human in the middle, and the audit boundary stops being about one agent and becomes about a system of them. That's Eledath's Level 8. It's where claims travel through a payor, and by his own account nobody has mastered it. The last post is the field report from that edge: What L8 Looks Like When Claims Travel Through a Payor.