A fixed-fee build with a harness, evals, and a one-click rollback — payback proven in a paid Discovery Assessment before a line of production code shipped.
This is a representative composite engagement, not a named client. The architecture and controls are real and reusable; engagement-level outcomes are modeled from the Discovery baseline and labeled illustrative. All industry benchmarks are cited and real.
A national P&C insurer was drowning in first-notice-of-loss intake: manual data entry, slow routing, and complex-claim backlogs. We scoped the work in a paid two-week Discovery Assessment that modeled payback before anyone committed to a build, then delivered a production-grade triage agent on a fixed fee — harness, eval suite, human-in-the-loop, and a tested rollback path. The agent classifies, extracts, and routes incoming claims; a senior adjuster owns every consequential decision. This is a representative composite engagement: the architecture and controls are real and reusable; the named outcomes are modeled and labeled illustrative, anchored to cited industry benchmarks.
A national property-and-casualty insurer runs first-notice-of-loss (FNOL) intake across web forms, a call center, and broker submissions. Every claim lands as semi-structured text — a description of what happened, a policy number, photos, sometimes a PDF — and a human has to read it, pull the structured fields, decide severity, and route it to the right queue. At this insurer's volume that was thousands of intakes a day and a growing complex-claim backlog.
The pressure to automate was not speculative. 78% of organizations now use AI in at least one function, up from 55% a year earlier (Stanford HAI, 2025 AI Index). In Canada specifically, AI adoption in finance & insurance reached 30.6% in Q2 2025 (Statistics Canada). The carrier's competitors were moving, and McKinsey put a number on the prize: generative AI could unlock USD 50-70 billion in value for the insurance industry, concentrated in customer operations and claims (McKinsey, 2024).
The CISO and COO had a shared, sensible fear: a half-finished agent that mis-routes claims, leaks PII, or quietly degrades is worse than the manual process it replaces. They had seen the pilot graveyard. They wanted a build that could survive an audit, not a demo.
The blunt truth about agentic projects is that most of them die. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls (Gartner, June 2025). And the failure isn't usually the model: 95% of enterprise generative-AI pilots fail to deliver measurable ROI, with poor integration and misaligned priorities, not weak models, as the root cause (MIT NANDA, 2025).
For this insurer, three concrete failure modes mattered:
There was also a governance gap underneath all of it. Only 21% of organizations report a mature governance model for agentic AI even as adoption accelerates (Deloitte, 2025). Buying a tool would not close that gap; it would move it.
We don't start with a build, and we didn't here. The engagement opened with a paid two-week Discovery Assessment: a fixed-fee, fixed-scope diagnostic whose job is to decide whether the project is worth doing at all, and to model the payback before anyone commits capital.
This sequencing is deliberate. The single strongest correlate of AI delivering real EBIT impact is fundamental workflow redesign, not bolting AI onto a legacy process (McKinsey, 2025). Discovery is where that redesign gets decided — cheaply, on paper, before it gets expensive in code.
The insurer approved the production build on a fixed fee, because the scope was now genuinely fixed. No time-and-materials open-endedness, no subcontractor army — a small senior pod that had already done the diagnostic.
The agent is not a single prompt. It is a controlled system with four production properties the buyer's CISO signed off on.
The agent runs inside a deterministic harness, not free-form. Each incoming claim flows through fixed stages — ingest, classify, extract, score severity, route — with typed inputs and outputs at every step and structured logging of every decision and its confidence. The LLM does the reading and judgment; the harness owns control flow, retries, timeouts, and the audit trail. This is what makes the system auditable rather than a black box.
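The staged harness described above can be sketched roughly as follows. This is a minimal illustration, not the engagement's implementation: the `Claim` and `StageResult` types, the `stub_model` stand-in for the LLM call, the logger name, and the retry count are all assumptions made for the example. Ingest, routing, and timeouts are left out for brevity, so only the three model-backed stages appear.

```python
import json
import logging
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("fnol_harness")  # hypothetical logger name


@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str


@dataclass(frozen=True)
class StageResult:
    stage: str
    output: Dict[str, str]
    confidence: float


# The harness, not the model, fixes the order of the model-backed stages.
MODEL_STAGES = ("classify", "extract", "score_severity")


def stub_model(stage: str, claim: Claim) -> StageResult:
    """Stand-in for an LLM call (assumption): canned, typed output per stage."""
    canned = {
        "classify": ({"claim_type": "auto_glass"}, 0.93),
        "extract": ({"policy_number": "P-0001"}, 0.88),
        "score_severity": ({"severity": "low"}, 0.91),
    }
    output, confidence = canned[stage]
    return StageResult(stage, output, confidence)


def run_pipeline(
    claim: Claim,
    model: Callable[[str, Claim], StageResult] = stub_model,
    max_retries: int = 2,
) -> List[StageResult]:
    """Run each stage in order, retry failures, and log every decision."""
    trail: List[StageResult] = []
    for stage in MODEL_STAGES:
        for attempt in range(max_retries + 1):
            try:
                result = model(stage, claim)
                break
            except Exception as exc:
                log.info(json.dumps({"claim_id": claim.claim_id, "stage": stage,
                                     "attempt": attempt + 1, "error": str(exc)}))
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
        # One structured log line per decision, including its confidence.
        log.info(json.dumps({"claim_id": claim.claim_id, **asdict(result)}))
        trail.append(result)
    return trail
```

Calling `run_pipeline(Claim("c-1", "Windshield cracked"))` returns one `StageResult` per stage and emits one JSON log line per decision. The model only ever fills in a typed result, while ordering, retries, and the audit trail stay in ordinary code.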
We built a labeled golden set from real, anonymized historical claims spanning the full type and severity distribution — including the rare, hard cases. Every change to a prompt, model version, or routing rule runs against that suite before it ships. The accuracy bar is a gate, not a hope. This directly addresses the hallucination risk that even domain-specialized tools carry: purpose-built legal AI tools still hallucinated on 17-34%+ of hard queries despite vendor 'hallucination-free' claims (Stanford HAI / RegLab, 2025). We assume the model is wrong some of the time and measure exactly how often.
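A release gate of this kind might look like the sketch below. The golden-set rows, the claim-type labels, the 0.95 bar, and the `keyword_classifier` stand-in are all invented for illustration; the source does not state the actual threshold or labels. Accuracy is checked per class so that rare claim types cannot hide behind a strong average.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical golden set: (claim text, expected claim type).
GOLDEN_SET: List[Tuple[str, str]] = [
    ("Windshield cracked by gravel on the highway", "auto_glass"),
    ("Rear-ended at a stop light, driver reports neck pain", "bodily_injury"),
    ("Basement flooded after a pipe burst", "property_water"),
    ("Chipped windshield in a parking lot", "auto_glass"),
]


def per_class_accuracy(
    predict: Callable[[str], str], golden: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Accuracy for each expected label in the golden set."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for text, expected in golden:
        totals[expected] += 1
        hits[expected] += int(predict(text) == expected)
    return {label: hits[label] / totals[label] for label in totals}


def passes_gate(scores: Dict[str, float], min_accuracy: float = 0.95) -> bool:
    """A change ships only if every class clears the bar (bar is an assumption)."""
    return all(acc >= min_accuracy for acc in scores.values())


def keyword_classifier(text: str) -> str:
    """Toy stand-in for the candidate prompt/model under test."""
    lowered = text.lower()
    if "windshield" in lowered:
        return "auto_glass"
    if "pain" in lowered or "injur" in lowered:
        return "bodily_injury"
    return "property_water"


scores = per_class_accuracy(keyword_classifier, GOLDEN_SET)
ship_it = passes_gate(scores)
```

In a CI job, a `False` from `passes_gate` would fail the build, which is what turns the accuracy bar into a gate.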
The agent has a confidence threshold and a hard-coded escalation rule: anything below threshold, and anything classified as bodily-injury or fraud-flagged, goes to a senior adjuster untouched. The agent accelerates the routine majority; it never decides the consequential minority alone. This mirrors the evidence on where automation pays: GenAI assistance lifted support-agent throughput 14% on average and 34% for novice workers (Brynjolfsson, Li & Raymond, NBER 2023) — the gain is in augmenting people on volume work, not replacing judgment.
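The escalation rule reads naturally as a few lines of deterministic code that sit outside the model. The sketch below assumes a threshold of 0.85 and the queue and field names shown; none of these values come from the engagement.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # hypothetical value, tuned against the eval suite
ALWAYS_ESCALATE_TYPES = {"bodily_injury"}  # hypothetical label name
SENIOR_QUEUE = "senior_adjuster"


@dataclass(frozen=True)
class Triage:
    claim_type: str
    confidence: float
    fraud_flag: bool
    proposed_queue: str  # where the agent would send the claim


def route(triage: Triage) -> str:
    """Return the queue for a claim; hard rules are checked before confidence."""
    if triage.claim_type in ALWAYS_ESCALATE_TYPES or triage.fraud_flag:
        return SENIOR_QUEUE  # consequential claims always go to a human
    if triage.confidence < CONFIDENCE_THRESHOLD:
        return SENIOR_QUEUE  # the agent is unsure, so a human decides
    return triage.proposed_queue
```

Because the hard rules run first, a bodily-injury or fraud-flagged claim escalates even when the model reports high confidence.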
There is a one-command rollback to the previous known-good configuration, and a kill switch that reverts the queue to fully manual intake. Both were tested in a staging cutover drill, not just documented. The answer to "what happens at 2 a.m." is now a runbook.
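One way to picture the rollback and kill switch is a small state object that tracks known-good configurations and an intake mode. Everything named here (`Deployment`, `promote`, the config keys) is hypothetical; a real deployment would persist this state and wire it to the queueing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Deployment:
    # Known-good configurations, oldest first; the last entry is live.
    known_good: List[Dict[str, str]] = field(default_factory=list)
    manual_only: bool = False

    def promote(self, config: Dict[str, str]) -> None:
        """Record a configuration that has passed the eval gate."""
        self.known_good.append(dict(config))

    def active(self) -> Dict[str, str]:
        return self.known_good[-1]

    def rollback(self) -> Dict[str, str]:
        """Drop the live config and restore the previous known-good one."""
        if len(self.known_good) < 2:
            raise RuntimeError("no earlier known-good configuration to restore")
        self.known_good.pop()
        return self.active()

    def kill(self) -> None:
        """Kill switch: send all intake back to the manual queue."""
        self.manual_only = True

    def intake_mode(self) -> str:
        return "manual" if self.manual_only else "agent"
```

A staging drill then amounts to promoting two configurations, calling `rollback()`, and checking that `active()` returns the first one and that `kill()` flips `intake_mode()` to `"manual"`.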
The whole design maps to NIST's AI RMF Generative AI Profile, which names confabulation as one of twelve generative-AI risk categories — giving the insurer a recognized framework to show regulators and internal audit.
A production triage agent handling FNOL intake on a defined claim-type scope, with:
What we deliberately did not ship: autonomous decisions on high-severity claims, any write to the core policy system of record without human confirmation, and any PII flowing to a model context without access controls and logging. Scope discipline is a feature.
Crucially, the build does not end the relationship with a model frozen in time. 91% of ML models degrade over time (Vela et al., Scientific Reports, 2022), and claim language, fraud patterns, and product mix all drift. The eval suite plus the operator dashboard are what make ongoing monitoring — and, if the insurer chooses, a managed retainer — a small, defined task rather than a re-build.
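A simple form of ongoing monitoring is to track how often the agent's routing agrees with the adjuster's final decision over a rolling window and to flag a drop below the baseline measured at launch. The sketch below is one possible shape, assuming the window size, tolerance, and minimum sample count shown; the source does not describe the actual dashboard metrics.

```python
from collections import deque
from typing import Deque, Optional


class DriftMonitor:
    """Rolling agreement rate between agent routing and adjuster decisions."""

    def __init__(self, baseline: float, tolerance: float = 0.05,
                 window: int = 200, min_samples: int = 50) -> None:
        self.baseline = baseline        # agreement rate measured at launch
        self.tolerance = tolerance      # allowed drop before alerting
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self.outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, agent_queue: str, adjuster_queue: str) -> None:
        self.outcomes.append(agent_queue == adjuster_queue)

    def agreement(self) -> Optional[float]:
        """Agreement over the window, or None until enough samples exist."""
        if len(self.outcomes) < self.min_samples:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifting(self) -> bool:
        rate = self.agreement()
        return rate is not None and rate < self.baseline - self.tolerance
```

When `drifting()` turns true, the natural next step is to rerun the eval suite on recent claims rather than to retrain or rebuild immediately.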
Every claim flows through typed, logged stages; anything below the confidence threshold or flagged bodily-injury/fraud escalates to a human untouched.
A framing note before the numbers: this is a representative composite engagement. The architecture and controls above are real and reusable. The engagement-level outcomes below are modeled from the Discovery baseline and labeled illustrative — they are what a build of this shape and volume is designed to produce, not an audited result for a named client. Industry benchmarks are cited and real.
The model the CFO checked in Discovery, carried into production:
The industry context for why this matters: only 39% of organizations attribute any enterprise-wide EBIT impact to AI, and just ~6% are high performers (McKinsey, 2025). The difference is not the model; it's the workflow redesign and the discipline to ship something that survives contact with production. That is what the harness, evals, human-in-the-loop, and rollback buy.
| Benchmark (source) | Value (%) |
|---|---|
| GenAI pilots with no measurable ROI (MIT NANDA 2025) | 95 |
| Agentic projects predicted to be canceled by end of 2027 (Gartner 2025) | 40 |
| Orgs with mature agentic governance (Deloitte 2025) | 21 |
| Orgs attributing enterprise EBIT impact to AI (McKinsey 2025) | 39 |
The failure modes a harness, eval suite, and rollback are designed to avoid. All figures cited and real (percent).
If you run claims, underwriting, or any high-volume intake and you're weighing an agent build, four things from this engagement transfer directly:
We do this on a fixed fee, with senior, AI-literate engineers, no platform lock-in, and a Discovery Assessment first. No subcontractor armies. No six-month roadmap that dies in committee. A scoped build that ships, with the controls to keep it shipping.
| Reported outcome | Value |
|---|---|
| AI models deployed in claims | 80 |
| Days cut from complex-case liability assessment | 23 |
| Routing accuracy improvement (%) | 30 |
| Complaint reduction (%) | 65 |
A real, named carrier proves the claims-automation thesis (Aviva, via McKinsey 2025) — context for the modeled outcomes of this composite engagement. Externally reported public reference — not a Maverin engagement.
The model is the easy 20%. The harness, evals, and rollback are the 80% that decides whether this survives an audit and a 2 a.m. incident.
Weighing a claims or intake agent? Start with a paid two-week Discovery Assessment — we model the payback before you commit to a build. Talk to Maverin.