Agentic engineering at Synthpop: A field report from levels 5 to 8 (Post 2)

Davor Runje
June 7, 2026

TL;DR

A voice agent is not a model. It is a skill registry with a phone number.

What rings the clinic is not a large language model improvising over a system prompt. It is a typed flow – a small language with a grammar, a type system, and a compiler – running over a registry of tools, that a reviewer signs off on before the first call is ever placed. Determinism, in the only sense that survives a regulated domain, is not something we coax out of the model. It is the output of that substrate. Eledath calls this Level 5, MCP and Skills. We call it the floor.

The question hiding in every voice-AI pitch

"Deterministic" is in every voice-AI pitch. It is almost never defined. So let me define it tightly, because the definition is the whole argument: given identical inputs and identical tool responses, the conversation takes the same trajectory every time – same branches, same tools called, same boundaries respected.

Notice what that rules out. It rules out a model that decides, turn by turn, what to do next – because two runs of that model over the same call are not guaranteed to produce the same trajectory, and in a domain that gets audited, "usually the same" is not a property you can sign your name under. The interesting engineering question for voice is not "how good is the model." It is "what is the model allowed to decide, and what has already been decided for it."
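
One way to make that definition testable is a replay check: run the same flow twice against the same recorded outcomes and compare the trajectories. The sketch below is a toy under stated assumptions. The transition table, the step names, and `run_flow` are hypothetical illustrations, not Synthpop's engine.

```python
# Hypothetical transition table: step -> {outcome: next step}.
TRANSITIONS: dict[str, dict[str, str]] = {
    "lookup": {"found": "verify_dob", "missing": "transfer"},
    "verify_dob": {"match": "confirm_order", "mismatch": "transfer"},
    "confirm_order": {},  # terminal
    "transfer": {},       # terminal
}


def run_flow(start: str, outcomes: dict[str, str]) -> list[str]:
    """Walk the table using recorded outcomes and return the trajectory."""
    trajectory = [start]
    step = start
    while TRANSITIONS[step]:
        # The recorded outcome selects among edges that already exist.
        step = TRANSITIONS[step][outcomes[step]]
        trajectory.append(step)
    return trajectory


recorded = {"lookup": "found", "verify_dob": "match"}
first = run_flow("lookup", recorded)
second = run_flow("lookup", recorded)
print(first == second)
```

The point of the toy is where the randomness is allowed to live: outcomes can vary between calls, but for a fixed set of outcomes the path through the table cannot.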

What Level 5 actually means

Eledath's Level 5 is MCP and Skills: the unit of capability stops being a prompt and becomes a tool with a contract. A skill is something the agent can invoke – typed inputs, typed outputs, a name – not a paragraph of instructions you hope the model honors.
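
To make "a tool with a contract" concrete, here is a minimal sketch of a skill as a name plus typed inputs and outputs, checked on every invocation. The `Skill` class and the `check_zip` example are hypothetical, written for illustration; the post does not show how the real registry represents contracts.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Skill:
    """A tool with a contract: a name, typed inputs, typed outputs."""
    name: str
    inputs: dict[str, type]
    outputs: dict[str, type]
    handler: Callable[..., dict[str, Any]]

    def invoke(self, **kwargs: Any) -> dict[str, Any]:
        # Reject calls that do not match the declared input contract.
        if set(kwargs) != set(self.inputs):
            raise TypeError(f"{self.name}: expected arguments {sorted(self.inputs)}")
        for key, expected in self.inputs.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        result = self.handler(**kwargs)
        # Check the output side of the contract as well.
        for key, expected in self.outputs.items():
            if not isinstance(result.get(key), expected):
                raise TypeError(f"{self.name}: output {key} must be {expected.__name__}")
        return result


# Hypothetical skill, for illustration only.
check_zip = Skill(
    name="check_zip",
    inputs={"zipcode": str},
    outputs={"valid": bool},
    handler=lambda zipcode: {"valid": len(zipcode) == 5 and zipcode.isdigit()},
)
print(check_zip.invoke(zipcode="94110"))
```

A paragraph of instructions cannot fail this way. A contract can, and it fails before anything is spoken.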

This matters more for voice than for chat, and the reason is mundane and decisive: chat has retries; voice does not. In a chat product, a wrong turn is a regenerate button. On a phone call, the wrong turn has already been spoken to a person, and you cannot un-say it. There is no undo on a sentence a patient has already heard. Everything downstream – why we compile instead of prompt, why the flow is a reviewable artifact – follows from the fact that the medium is live and one-directional.

The default architecture is Level 2 in disguise

The standard voice-agent shape is speech-to-text into a model holding a system prompt, a few tool calls from the model, and text-to-speech on the way out. In Eledath's vocabulary that is an Agent IDE – Level 2 – bolted to a phone line. Features emerge from the prompt; reliability is a hope. It demos beautifully. We built one. It is the easiest thing in the world to stand up and the hardest thing in the world to get signed off, because nobody can tell you, ahead of the call, what it will say.

In a regulated domain that gap is the whole problem. The reviewer's question is never "is the model good." It is "show me, before this rings, every path this conversation can take and every tool it can touch." A system prompt cannot answer that question. A language can.

The substrate: a language over a skill registry

So the flow is a language. Concretely: a textual language with a grammar, a type system, and a compiler that emits a runtime artifact – that is what we mean by DSL, literally, not as a dressed-up word for "config file." Under the hood the grammar is a set of Pydantic models with semantic checks on top; here is the top of a real flow that handles a continuous-glucose-monitor resupply call:

flows:
  cgm_resupply:
    definition:
      title: CGM Resupply Flow
      version: "1.0"
      flow_vars:
        # every variable the flow can touch, declared up front
        - patient_first_name
        - patient_dob
        - patient_zipcode
        - order_items
      do:
        # steps run in order; control flow is explicit
        - step_type: init_llm
          step_name: initialize_llm
          model_type: openai
          model: gpt-realtime-2025-08-28
          instructions: |
            You are a clinical administrator named Doug at Synthpop,
            answering a medical resupply call. Ask the user for
            information and confirm it using the tools provided. Do not
            advance until instructed.
          # ... tone, pacing, and guardrail instructions trimmed
        # ... remaining steps


That is the whole shape: a named flow, a declared set of variables, and an ordered list of typed steps. There are sixteen step types in all – among them init_llm, context_update, generate_content, execute_tool, check, if, switch_tool, foreach, and repeat – and a flow is nothing but a tree of them.
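
As a rough picture of what "a grammar as typed models" can look like, the sketch below mirrors the flow's shape with stdlib dataclasses. It is an assumption-laden stand-in: the real grammar is Pydantic, only the nine step types named above are listed (the other seven are not shown in this post), and the `children` field and the `collect_items` step name are hypothetical.

```python
from dataclasses import dataclass, field

# Only the step types named in this post; the full language has sixteen.
KNOWN_STEP_TYPES = {
    "init_llm", "context_update", "generate_content", "execute_tool",
    "check", "if", "switch_tool", "foreach", "repeat",
}


@dataclass
class Step:
    step_type: str
    step_name: str = ""
    children: list["Step"] = field(default_factory=list)  # a flow is a tree

    def __post_init__(self) -> None:
        # An unknown step type is rejected when the flow is loaded.
        if self.step_type not in KNOWN_STEP_TYPES:
            raise ValueError(f"unknown step_type {self.step_type!r}")


@dataclass
class Flow:
    title: str
    version: str
    flow_vars: list[str]
    do: list[Step]


def count_steps(steps: list[Step]) -> int:
    """Walk the tree and count every step, nested ones included."""
    return sum(1 + count_steps(s.children) for s in steps)


flow = Flow(
    title="CGM Resupply Flow",
    version="1.0",
    flow_vars=["patient_first_name", "patient_dob"],
    do=[
        Step("init_llm", "initialize_llm"),
        Step("repeat", "collect_items", children=[Step("generate_content")]),
    ],
)
print(count_steps(flow.do))
```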

The language is deliberate about the line it draws. It constrains the things that must be reviewable: the ordering of turns, the legal transitions between them, the boundary of which tools exist, what happens when a tool fails. It leaves open the things the model is actually good at: which tool to call next within the allowed set, what arguments to fill, what words to say. The model chooses slot values. It never chooses destinations. The set of places the conversation can go is fixed at authoring time and proven at build time – the model moves within it, not against it.

That single split – open on language, closed on control flow – is the entire Level 5 idea applied to voice.

Skills are tools with healthcare semantics

What goes into the registry is the integration surface of the business. An EMR integration – Niko Health, say – is one set of tools. Eligibility verification is another. A fax gateway is another. Each is a typed contract, and a flow can only invoke tools that are in the registry. There is no escape hatch where the model reaches for a capability nobody declared. If it isn't a registered skill, it does not exist as far as the call is concerned.
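
The "no escape hatch" property is easy to state in code. A minimal sketch, assuming a registry keyed by tool name; `ToolRegistry` and the `check_eligibility` stand-in are hypothetical names, not the real integration.

```python
from typing import Any, Callable

Tool = Callable[..., dict[str, Any]]


class ToolRegistry:
    """Closed set of tools: anything not registered cannot be called."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, name: str, fn: Tool) -> None:
        if name in self._tools:
            raise ValueError(f"tool {name!r} already registered")
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> dict[str, Any]:
        # No fallback and no dynamic lookup: an unknown name is an error.
        if name not in self._tools:
            raise LookupError(f"tool {name!r} is not in the registry")
        return self._tools[name](**kwargs)


registry = ToolRegistry()
# Hypothetical stand-in for an eligibility integration.
registry.register(
    "check_eligibility",
    lambda member_id: {"eligible": member_id.startswith("A")},
)
print(registry.call("check_eligibility", member_id="A123"))
```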

A tool in the registry is one of exactly three things, and the distinction is the whole point:

# stub: a signal to the flow engine, no code behind it
- tool_name: end_call

# registered: a Python function in the tool registry
- tool_name: patient_lookup
  from_tool: patient_lookup_cgm_demo

# MCP: a tool on an external MCP server
- tool_name: submit_claim
  from_mcp: { server: clearinghouse, tool_id: submit_837 }
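
The three cases can be told apart from the declaration alone. The sketch below classifies a parsed declaration by which keys it carries. It assumes that `from_tool` and `from_mcp` are mutually exclusive and that an MCP reference needs both `server` and `tool_id`; neither rule is stated in this post, and `classify_tool` is a hypothetical helper.

```python
def classify_tool(decl: dict) -> str:
    """Return 'stub', 'registered', or 'mcp' for one tool declaration."""
    if "tool_name" not in decl:
        raise ValueError("tool declaration needs a tool_name")
    has_tool = "from_tool" in decl
    has_mcp = "from_mcp" in decl
    # Assumption: a tool has at most one backing implementation.
    if has_tool and has_mcp:
        raise ValueError(f"{decl['tool_name']}: from_tool and from_mcp are exclusive")
    if has_mcp:
        missing = {"server", "tool_id"} - set(decl["from_mcp"])
        if missing:
            raise ValueError(f"{decl['tool_name']}: from_mcp missing {sorted(missing)}")
        return "mcp"
    return "registered" if has_tool else "stub"


declarations = [
    {"tool_name": "end_call"},
    {"tool_name": "patient_lookup", "from_tool": "patient_lookup_cgm_demo"},
    {"tool_name": "submit_claim",
     "from_mcp": {"server": "clearinghouse", "tool_id": "submit_837"}},
]
kinds = [classify_tool(d) for d in declarations]
print(kinds)
```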


When the flow needs to do something deterministic – look a patient up by phone number – it calls the tool silently, with arguments it already has, and captures the result into declared variables. No model improvisation, no conversational round-trip:

- step_type: execute_tool
  step_name: patient_lookup
  tool:
    tool_name: patient_lookup
    from_tool: patient_lookup_cgm_demo
    fixed_args:
      phone_number: "{{ patient_phone }}"
    sets_vars:
      patient_first_name: $result.patient.Name.First
      patient_dob: $result.patient.DateOfBirthday
      patient_found: $result.patient_found
  # any failure routes to a human, by spec
  on_error:
    "*": inform_transfer

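What an engine has to do for a step like this can be sketched in a few lines: call the tool with fixed arguments, copy values out of the result along the declared paths, and route any failure to the declared target. This is an illustration under assumptions, not the real engine. It assumes template arguments such as `{{ patient_phone }}` are already rendered, and that `on_error` keys other than `"*"` would be exception class names, which this post does not confirm. `fake_lookup` and its sample data are made up.

```python
from typing import Any, Callable, Optional


def resolve_path(result: dict[str, Any], path: str) -> Any:
    """Resolve '$result.a.b' against a nested dict."""
    parts = path.split(".")
    if parts[0] != "$result":
        raise ValueError(f"path must start with $result: {path}")
    value: Any = result
    for key in parts[1:]:
        value = value[key]
    return value


def execute_tool(tool: Callable[..., dict], fixed_args: dict, sets_vars: dict,
                 on_error: dict, flow_vars: dict) -> Optional[str]:
    """Run a tool silently; return the error target step, or None on success."""
    try:
        result = tool(**fixed_args)
        updates = {var: resolve_path(result, path) for var, path in sets_vars.items()}
    except Exception as exc:
        # Assumed lookup order: a specific error name first, then "*".
        target = on_error.get(type(exc).__name__, on_error.get("*"))
        if target is None:
            raise
        return target
    flow_vars.update(updates)  # commit only when every path resolved
    return None


def fake_lookup(phone_number: str) -> dict:
    # Made-up record shaped like the paths in the step above.
    return {"patient_found": True,
            "patient": {"Name": {"First": "Ana"}, "DateOfBirthday": "1980-02-03"}}


flow_vars: dict[str, Any] = {}
target = execute_tool(
    fake_lookup,
    fixed_args={"phone_number": "555-0100"},
    sets_vars={"patient_first_name": "$result.patient.Name.First",
               "patient_found": "$result.patient_found"},
    on_error={"*": "inform_transfer"},
    flow_vars=flow_vars,
)
print(target, flow_vars)
```

The variables are committed only after every path resolves, so a malformed tool response routes to the error target without leaving the flow half-updated.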

And when the model does get to decide something, watch what it is allowed to decide. This is the identity-verification step – the model collects a date of birth from the caller, but it does not choose where the conversation goes next:

- step_type: check
  step_name: verify_dob
  check_tool:
    tool_name: check_dob
    tool_inputs:
      dob: str
    expected_values:
      # compared against the record on file
      dob: "{{ patient_dob }}"
    llm_comparison:
      comparison_prompt: >
        Return whether the inputted date of birth refers to the same
        real-world person as the expected date. Allow format variations.
  max_failures: 2
  next_step:
    # the only places this step can lead
    on_success: verify_zipcode
    on_fail: inform_transfer
    transfer_call: inform_transfer


The model fills dob and judges a fuzzy match. It does not get to invent a fourth destination. The model chooses slot values; it never chooses control flow. Every place this conversation can go is enumerated in next_step, fixed at authoring time.
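
The same split can be written as a function whose return value is always drawn from a fixed table. A minimal sketch, with two assumptions labeled: `max_failures` is read as "the step fails on the Nth miss", and `same_date` is a deterministic stand-in for the `llm_comparison` judgment so the example runs without a model.

```python
from typing import Callable

# The only destinations this step can produce.
NEXT_STEP = {"on_success": "verify_zipcode", "on_fail": "inform_transfer"}


def same_date(spoken: str, on_file: str) -> bool:
    # Stand-in for the fuzzy comparison: compare digits only,
    # so "1980/02/03" matches "1980-02-03".
    def digits(text: str) -> str:
        return "".join(ch for ch in text if ch.isdigit())
    return digits(spoken) == digits(on_file)


def run_check(answers: list[str], expected: str,
              same: Callable[[str, str], bool], max_failures: int) -> str:
    """Return the next step name; it is always a value from NEXT_STEP."""
    failures = 0
    for answer in answers:
        if same(answer, expected):
            return NEXT_STEP["on_success"]
        failures += 1
        if failures >= max_failures:
            break
    # Too many misses, or the caller stopped answering.
    return NEXT_STEP["on_fail"]


print(run_check(["1981-02-03", "1980/02/03"], "1980-02-03", same_date, 2))
```

Whatever the comparator says, the function can only return one of the enumerated targets. The judgment is open; the routing is not.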

Voice has a concurrency problem text agents don't

Here is the part nobody warns you about, and the part that convinced us a prompt was never going to be enough.

Saying something on a phone call is not one action. It is two: you tell the model what to say, and then you tell it to say it. Get the second half wrong and the call goes silent with a live person on the line – the most expensive kind of bug, because it is invisible in every test that doesn't involve a human waiting. We ended up naming this the atomic speech unit, because treating "decide" and "speak" as separable steps was the only way to reason about it. In the language it is two adjacent steps:


- step_type: context_update
  # (1) silently hand the model the line
  content: |
    [SPEAK NOW]: "Thanks for calling Synthpop Health — I'm looking up your account now."
- step_type: generate_content
  # (2) tell it to actually speak
  wait:
    for: spoken_exact
    # block until THIS utterance finishes, not a stale one
    timeout: 15


It gets worse, in ways specific to a live medium. Speech is asynchronous – you ask the agent to talk, and some time later a signal comes back saying "done speaking." If a later step is waiting for its own turn to finish and instead catches an earlier turn's stale "done" signal, the flow sails right past a sentence that is still being spoken. We call that the stale-completion trap – and the for: spoken_exact in the snippet above is the fix: it tags the request and waits only for its own completion event, never an older one left in the queue. A second failure: the model, prompted to deliver a specific line, says "Sure, got it" first – and the real message gets queued behind the filler and lost. None of these are model-quality failures. The model is fine. They are timing failures, born of the fact that there is a real-time audio stream and a human who expects to interrupt.
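
The stale-completion trap is easy to reproduce with a toy channel. The asyncio sketch below tags each utterance with an id and shows the difference between taking whatever completion is in the queue and waiting for one's own. `SpeechChannel` and `wait_spoken_exact` are hypothetical names chosen to echo the snippet above; the real runtime is not shown in this post.

```python
import asyncio
import itertools


class SpeechChannel:
    """Toy audio channel that reports completions asynchronously."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.completions: asyncio.Queue = asyncio.Queue()

    def speak(self, text: str, seconds: float) -> int:
        """Start speaking; return the id that tags this utterance."""
        utterance_id = next(self._ids)
        # Simulate "done speaking" arriving some time later.
        asyncio.get_running_loop().call_later(
            seconds, self.completions.put_nowait, utterance_id
        )
        return utterance_id

    async def wait_spoken_exact(self, utterance_id: int, timeout: float) -> None:
        """Block until this utterance finishes, discarding stale signals."""
        async def _wait() -> None:
            while await self.completions.get() != utterance_id:
                pass  # an older utterance's "done": ignore it

        await asyncio.wait_for(_wait(), timeout)


async def demo() -> tuple:
    channel = SpeechChannel()
    channel.speak("Thanks for calling.", seconds=0.01)   # id 1, nobody waits
    await asyncio.sleep(0.05)                            # its "done" is now stale
    current = channel.speak("I'm looking up your account.", seconds=0.1)
    naive = channel.completions.get_nowait()   # a naive wait would accept this
    channel.completions.put_nowait(naive)      # put it back for the exact wait
    await channel.wait_spoken_exact(current, timeout=15)
    return naive, current


print(asyncio.run(demo()))
```

The naive read returns the first utterance's id while the second is still being spoken. The exact wait skips it and returns only when the second utterance reports done, or raises on timeout.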

You cannot engineer around a timing failure you can't attach a step to. This is the deepest reason the flow is an explicit artifact and not a prompt: a prompt has no place to hang "wait for this exact utterance to finish before advancing." A language does. And a related discipline falls out of the same fact – voice agents need to say less per utterance than chatbots. The patient expects to interrupt. One idea per turn. A chatbot can wall-of-text; a phone call that tries it gets talked over.

Why a language, and not a graph

The obvious objection: fine, you need structure – so why a language? Why not a visual flow builder, or a state-machine library? Four reasons, and they are all the same reason wearing different clothes.

  1. It is diff-able and reviewable. A flow is a text file. A reviewer can read it and comment on a line in a pull request. The audit happens on the artifact, before the call – not on a recording afterward.
  2. It is type-checkable ahead of time. A static analyzer can prove – not test, prove – that a flow never transitions to a state that doesn't exist and never calls a tool that isn't in the registry. That proof runs at build time, every build.
  3. It is versionable. Every customer integration ships as a tagged release. You can say exactly which version of which flow handled a given call, because the version is stamped on the call.
  4. It is composable. Flows call other flows; tools are shared across the voice product and the document product, because it's the same registry.

Point 2 is not aspirational. There is a single command – eva validate – that the skill authoring these flows runs after every change, and that the build runs as a gate. It loads the flows, resolves inheritance, and refuses anything that doesn't hold together. The errors are blunt and structural:

$ eva validate --directory call_flows
ERROR  Step 'confirm_address' references non-existent step 'confirm_adress'
ERROR  Step 'rate_main_1' cannot jump to 'offer_extras': target is inside a composite block
ERROR  set_vars step sets undeclared variable 'order_total'
ERROR  Duplicate step name 'verify_dob'


A flow that can transition to a step that doesn't exist does not compile. A flow that sets a variable nobody declared does not compile. The proof that the conversation stays inside its declared graph runs before the phone rings – not as a test that might happen to exercise the path, but as a property of the artifact.
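
Three of those checks fit in a short function. This is a sketch of the idea, not of eva validate itself: it assumes a flat list of steps (no composite blocks, no inheritance), the `validate` function is hypothetical, and the message wording only imitates the output above.

```python
def validate(flow_vars: list[str], steps: list[dict]) -> list[str]:
    """Return structural errors; an empty list means the flow holds together."""
    errors = []
    seen = set()
    for step in steps:
        name = step["step_name"]
        if name in seen:
            errors.append(f"Duplicate step name '{name}'")
        seen.add(name)
    declared = set(flow_vars)
    for step in steps:
        # Every transition target must be a step that exists.
        for target in step.get("next_step", {}).values():
            if target not in seen:
                errors.append(
                    f"Step '{step['step_name']}' references non-existent step '{target}'"
                )
        # Every variable a step sets must be declared up front.
        for var in step.get("sets_vars", {}):
            if var not in declared:
                errors.append(
                    f"Step '{step['step_name']}' sets undeclared variable '{var}'"
                )
    return errors


steps = [
    {"step_name": "verify_dob", "next_step": {"on_success": "confirm_adress"}},
    {"step_name": "confirm_address", "sets_vars": {"order_total": "$result.total"}},
    {"step_name": "verify_dob"},
]
errors = validate(["patient_dob"], steps)
for line in errors:
    print("ERROR ", line)
```

Because the check walks every declared transition rather than the ones a test happens to exercise, a misspelled target is caught even on a path no call has ever taken.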

Every one of those is a property of text with a grammar. None of them is a property of UI state in a visual builder. That is the whole case.

What this is not

Because the neighbors matter, and this audience knows them:

  • Not Anthropic's Programmatic Tool Calling. That generates orchestration code at inference time; ours is authored and reviewed ahead of time. The difference is the point – inference-time generation is exactly the thing a reviewer can't see before the call.
  • Not NeMo Guardrails / Colang. Colang is a guardrail layer on top of someone else's agent. Ours is the primary substrate, not a fence around one.
  • Not LangGraph. That is Python state machines – code that is the program. Our flow is a reviewable artifact a non-author, including compliance, can read.
  • Not BAML or TypeChat. Those constrain the schema of a single model call. We constrain a multi-turn conversation.
  • Not a visual builder. A diagram is UI state. It is not a language with a grammar you can type-check or diff.

We are not claiming nobody else has structure. We are claiming that which structure you pick decides whether a reviewer can sign off before the phone rings.

The audit record: where Level 5 hands off to Level 6

Every call leaves a record, and the record is half the reason the substrate exists. Three layers sit in it. A static analyzer proved at build time that the flow never leaves its declared transitions or its declared tools; the runtime confirms it didn't. And then a separate agent – the call auditor – reads the transcript of what was actually said against the spec of what the flow promised, and votes on whether the conversation honored it.

The first two layers aren't bolted on afterward – the flow itself declares what a good call means, in the same file, as a typed list of extractors:

export_attributes:
  - name: identity_verified
    type: in
    path: summary.context_vars.verification_status
    values: [verified, confirmed]
  - name: order_confirmed
    type: exists
    path: summary.context_vars.order_result.order_id
  - name: call_successful
    # the canonical metric the audit reads
    type: all
    items:
      - { type: flag, name: identity_verified }
      - { type: flag, name: order_confirmed }


"Did this call succeed" is not a dashboard someone built later. It is part of the flow's definition, versioned with it.

That auditor runs on a different model family than the one that authors and runs the flow – today, Gemini Flash auditing flows that run on Anthropic models. The cross-model split is deliberate, and it is a healthcare-grade choice: you do not want the same model's blind spots on both sides of the verification. No single model's failure mode gets to dominate the audit signal.

That record is where Level 5 stops and Level 6 begins – automated feedback on the substrate itself, which is the next post's subject. For now the point is narrower: the substrate doesn't just run the call. It leaves behind something the audit can read.

We tried the easy way first

The honest accounting, and the most useful thing I can tell you: the prompt-first voice bot took days to reach a working demo. The compiler-first substrate was materially slower to the first call that actually rang. And the compiler-first one is the only one of the two that ever survived a review – because it is the only one that could answer the reviewer's question, what can this say and what can it touch, before it said anything to anyone.

What changes for an operations team is subtle and large: compliance becomes a reviewer of code, not a reviewer of recordings. The review moves from after the call – sampled, retrospective, unfalsifiable – to before the call: complete, and provable. You stop auditing what the agent did and start auditing what the agent can do. That is the unlock. Everything else is in service of it.

What's next

One caution before we go, because it's the hinge of the whole series. The level is a property of the practice, not of any one tool. Everything in this post – the flow language, the type system, eva validate proving the graph holds before the phone rings – is Level 5: the substrate, and what you can prove about it. A validator you run by hand is a type-checker, not a harness. None of this, by itself, is Level 6.

What makes it Level 6 is what happens around the artifact. When we onboard a new healthcare client – new EMR, new payer portal, new fax gateway – somebody has to write the tools and the flows. Increasingly, not a person. An agent does – and it runs the very eva validate you just saw on every change, reads the errors, fixes them, and runs again until the artifact holds, then hands the result to a harness that decides whether it ships. Same language, same validator. The difference is that there's now an agent in the loop and backpressure pushing on it. That loop is the harness – and the harness is the next post: Audit, Don't author.