Agentic engineering at Synthpop: A field report from levels 5 to 8 (Post 1)

Davor Runje
June 2, 2026
TL;DR

When the mcp-assistant skill in our integration pipeline produces a Python handler that the harness lets through, the failure trajectory eventually includes a phone call to a clinic. That's a different shape of feedback loop than a failed pytest run.

I'm Davor, VP of Agentic Engineering at Synthpop. We're a twenty-person engineering team building a healthcare patient-journey orchestration platform – about a thousand voice calls a day, over a million PDF documents processed last quarter. We operate inside the audits that come with healthcare, and trying to ship reliable agents inside that frame has changed how we think about Bassim Eledath's Levels of Agentic Engineering.

This is the first of four posts. The thesis: we don't have a compliance team that audits our agents. We have an agent harness that compliance can audit. The rest of this post is the why; the rest of the series is the how.

Who we are

Synthpop coordinates the back-office decisions that determine whether a patient's order gets filled – across intake, eligibility checks, prior authorization, and the dozens of small handoffs between providers, payers, and durable medical equipment suppliers that turn a clinical referral into something a patient actually receives. Most of those handoffs were faxes and phone calls a decade ago. Many of them still are.

We're roughly twenty engineers. We ship two products that share a substrate: a voice agent that handles the phone calls (about a thousand a day) and a PDF pipeline that handles the paperwork (over a million documents last quarter). Separate runtimes, same substrate – a typed flow grammar over a registry of MCP tools. As of April 2026, we're the first healthcare AI agent partner in Google's Gemini Enterprise Agent Gallery, running on Vertex AI.

What's distinctive about how we build is downstream of one constraint. Synthpop's customers are healthcare organizations. We're not in a regulated industry abstractly; we're inside actual audits – customer audits, regulatory audits, our own internal compliance reviews – that come with handling the kind of data that flows through our system. That shapes more or less every engineering decision past the prototype stage. The patterns that work for Vercel or Ramp or Block when they wire agents into their build pipelines don't transfer cleanly to a runtime where every change has to be defensible to an auditor.

The ladder we're climbing

If you've read Bassim Eledath's Levels of Agentic Engineering, the next few paragraphs will be quick. If you haven't, read it first: it is the prerequisite for the rest of this series. Eledath's framework is the lens, and I'm not going to restate it in full.

Briefly: eight levels of engineering practice maturity, from L1 (Tab Complete) through L8 (Autonomous Agent Teams). The framework is about how a team uses agents to build software, not about how autonomous the resulting product is. That distinction is important, and the rest of this post leans on it.

Here's where Synthpop is, honestly:

  • L5 (MCP and Skills) – in production. Our voice agent and PDF pipeline both run on the same substrate: a typed flow grammar – Pydantic schemas over YAML – compiled against a registry of MCP tools. The flow file is the auditable artifact. Post 2 goes deep on this.
  • L6 (Harness Engineering) – shipping. Two Claude skills do agent-authored integration work today. eva-flow-assistant writes new flow definitions when we onboard a customer. mcp-assistant writes the MCP server that integrates the customer's specific systems. Both ship under a harness that validates outputs before they reach production. Post 3 goes deep on this.
  • L4–L5 – daily, in our own engineering. The team uses Claude Code with internal codified skills – Eledath's L4 (Compounding Engineering) plus L5 (MCP and Skills). We aren't at L7 (Background Agents) for our own development work yet. We'll get there.
  • L8 (Autonomous Agent Teams) – architecting toward, not at. Eledath himself says of L8: "Nobody has mastered this level yet, though a few are pushing into it." We're pushing. Post 4 is the field report from the edge.
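
To make "a typed flow grammar compiled against a registry of MCP tools" concrete, here is a minimal sketch. It is not Synthpop's grammar: the registry contents, the Step fields, and check_flow are all hypothetical, stdlib dataclasses stand in for Pydantic models, and a plain dict stands in for parsed YAML so the example has no dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical registry: tool name -> argument names the tool requires.
TOOL_REGISTRY: Dict[str, Set[str]] = {
    "check_eligibility": {"patient_id", "payer_id"},
    "send_fax": {"recipient", "document_id"},
}


@dataclass(frozen=True)
class Step:
    """One node in a flow; stands in for a Pydantic model."""
    id: str
    tool: str
    args: Dict[str, str] = field(default_factory=dict)
    next: Optional[str] = None


def check_flow(raw: dict, registry: Dict[str, Set[str]] = TOOL_REGISTRY) -> List[str]:
    """Return every problem found; an empty list means the flow compiles."""
    try:
        steps = [Step(**item) for item in raw["steps"]]
    except (KeyError, TypeError) as exc:
        # Missing "steps" key, or a step with unknown or missing fields.
        return [f"schema error: {exc}"]

    known_ids = {step.id for step in steps}
    problems: List[str] = []
    for step in steps:
        if step.tool not in registry:
            problems.append(f"{step.id}: unknown tool '{step.tool}'")
        else:
            missing = registry[step.tool] - set(step.args)
            if missing:
                problems.append(f"{step.id}: missing args {sorted(missing)}")
        if step.next is not None and step.next not in known_ids:
            problems.append(f"{step.id}: next step '{step.next}' is not defined")
    return problems
```

The point of the shape is that a flow referencing a tool outside the registry, or wired to a step that does not exist, is rejected before anything runs, and the flow file stays the single artifact a reviewer reads.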

This series belongs to a genre: engineering-practice posts about agents at production scale. The current canon includes Ramp's writeup of Inspect, Block's skills marketplace (Angie Jones's 3 Principles for Designing Agent Skills is the recommended entry point), OpenAI's Harness Engineering, and Lee Robinson's Coding Agents and Complexity Budgets. The white space we're trying to occupy is engineering-practice posts from a regulated vertical.

The audit is the harness

Eledath's framing of L6 – Harness Engineering – is the most useful single concept in the framework, and it's the load-bearing concept for the rest of this series. The definition: "Building the entire environment, tooling, and feedback loops that let agents do reliable work without you intervening." The unlock isn't a smarter agent. The unlock is backpressure – automated feedback (type checks, tests, linters, contract tests, replay) that catches the agent's mistakes before they ship. Once you have enough backpressure, the agent can run without you.
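
A minimal sketch of what backpressure means mechanically, assuming the agent's output is a string of Python source. The two checks, the CheckResult type, and the function names are illustrative stand-ins for a real stack of type checks, tests, linters, contract tests, and replay. They are not a description of any particular harness.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def syntax_check(source: str) -> CheckResult:
    """Fail if the generated source does not parse."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return CheckResult("syntax", False, f"line {exc.lineno}: {exc.msg}")
    return CheckResult("syntax", True)


def bare_except_check(source: str) -> CheckResult:
    """Lint stand-in: reject `except:` clauses that swallow every error."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return CheckResult("bare-except", False, "source does not parse")
    lines = [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
    if lines:
        return CheckResult("bare-except", False, f"bare except at line(s) {sorted(lines)}")
    return CheckResult("bare-except", True)


def run_gate(source: str, checks: List[Callable[[str], CheckResult]]) -> List[CheckResult]:
    """Run every check; the change ships only if all of them pass."""
    return [check(source) for check in checks]


def feedback_for_agent(results: List[CheckResult]) -> str:
    """Failures rendered as text the agent can act on in its next attempt."""
    return "\n".join(f"{r.name}: {r.detail}" for r in results if not r.passed)
```

The loop is the important part: a failed gate returns feedback to the agent instead of paging a human, and only a clean pass moves the change forward.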

The canonical L6 examples – Ramp's Inspect, OpenAI's harness work, Block's internal skills marketplace – share a quiet assumption: backpressure machinery can be wired into the agent's environment freely. You pick the sandbox, you pick the test harness, you pick the observability stack, you pick what the agent sees. Want browser sandboxing? Spin one up. Want a fresh database fixture per task? Trivial. Want the agent to read your Sentry and your Datadog and your LaunchDarkly? Wire it in.

For most of the L6 canon, "freely" is the right word. For us, "freely" is the word that breaks.

Three concrete constraints follow. Each one turns what would be a free L6 design choice at Ramp, Block, or Vercel into a constrained one at Synthpop:

  • The sandbox can't touch the data we're processing. The natural L6 move when an agent needs to validate against real-world conditions is to give it the real-world environment. We can't. The realistic environment includes data that, in our domain, has to stay inside a specific perimeter. The agent and the data have to live in separate trust domains by construction – not as a defensive afterthought.
  • The verification logs can't leave that perimeter. A standard L6 backpressure stack streams structured logs to a centralized observability tool. Some of those logs reference the data the agent saw. We can't ship those references out of our environment, which means the observability story has to be built inside the perimeter – not bolted on from outside.
  • The human signing off on an agent-authored change has to understand it. This is the SOC 2 CC8.1 question, which Tian Pan called the Compliance Attestation Gap earlier this year. "Approved by" historically meant a human read the diff. When the diff is 1,800 lines of generated MCP server code, that definition fractures. Either you make the change small enough to read, or you make the harness signal legible enough to attest to. We chose the second.
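
One way to picture "harness signal legible enough to attest to" is a short record a reviewer signs instead of the diff. The sketch below is an assumption, not our actual format: the field names, attestation_record, and render are hypothetical. The record carries a hash of the generated artifact and the check outcomes, and none of the artifact's contents, so it also respects the constraint that references to processed data stay inside the perimeter.

```python
import hashlib
import json
from typing import Dict


def attestation_record(artifact: bytes, checks: Dict[str, bool], harness_version: str) -> dict:
    """Summarize an agent-authored change without embedding its contents."""
    return {
        # The hash ties the sign-off to one exact artifact.
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        "artifact_lines": len(artifact.splitlines()),
        "harness_version": harness_version,
        "checks": dict(sorted(checks.items())),
        # An empty check set must never count as a pass.
        "all_passed": bool(checks) and all(checks.values()),
    }


def render(record: dict) -> str:
    """Stable, human-readable form for the reviewer and the audit trail."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A reviewer presented with this record is attesting to two things: which checks ran under which harness version, and that they passed for exactly this artifact. Auditing the harness then means auditing what those checks actually verify.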

Stack these three constraints together and the structure they describe isn't a coding agent inside a generic sandbox. The structure they describe is a particular shape of harness – a harness whose boundaries are defined by what the audit allows, whose feedback loops emit artifacts the audit can replay, and whose human-in-the-loop check is reframed as an audit of the harness itself rather than an audit of every individual change.

Which is to say: the regulatory perimeter isn't something the harness has to work around. The regulatory perimeter is the harness.

We don't have a compliance team that audits our agents. We have an agent harness that compliance can audit.

That's the thesis of this post and of the series. It's a thesis that, as far as I can tell, no engineering blog in 2026 is writing from inside. Vercel can't. Ramp can't. Block can't. They don't operate inside a healthcare regulatory perimeter, and the engineering content of operating inside one is exactly what the L6 conversation is missing right now.

The next three posts are what that thesis looks like when you actually have to ship something. Post 2 is the L5 substrate that makes the harness possible. Post 3 is the L6 harness in detail – how mcp-assistant and eva-flow-assistant ship code without a human reading every line. Post 4 is the L8 frontier, where two products start coordinating with each other and the audit boundary stops being about one agent and starts being about a system of them.

Stay tuned.