Insurance (national P&C insurer) · Agentic Automation

Claims-Intake Triage, Automated: A Production-Grade Agent Engagement for a National P&C Insurer

A fixed-fee build with a harness, evals, and a one-command rollback, with payback proven in a paid Discovery Assessment before a line of production code shipped.

February 27, 2026 · 9 min read · Use case

This is a representative composite engagement, not a named client. The architecture and controls are real and reusable; engagement-level outcomes are modeled from the Discovery baseline and labeled illustrative. All industry benchmarks are cited and real.

TL;DR

A national P&C insurer was drowning in first-notice-of-loss intake: manual data entry, slow routing, and complex-claim backlogs. We scoped the work in a paid two-week Discovery Assessment that modeled payback before anyone committed to a build, then delivered a production-grade triage agent on a fixed fee — harness, eval suite, human-in-the-loop, and a tested rollback path. The agent classifies, extracts, and routes incoming claims; a senior adjuster owns every consequential decision. This is a representative composite engagement: the architecture and controls are real and reusable; the named outcomes are modeled and labeled illustrative, anchored to cited industry benchmarks.

01 Context

A national property-and-casualty insurer runs first-notice-of-loss (FNOL) intake across web forms, a call center, and broker submissions. Every claim lands as semi-structured text — a description of what happened, a policy number, photos, sometimes a PDF — and a human has to read it, pull the structured fields, decide severity, and route it to the right queue. At this insurer's volume that was thousands of intakes a day and a growing complex-claim backlog.

The pressure to automate was not speculative. 78% of organizations now use AI in at least one function, up from 55% a year earlier (Stanford HAI, 2025 AI Index). In Canada specifically, AI adoption in finance & insurance reached 30.6% in Q2 2025 (Statistics Canada). The carrier's competitors were moving, and McKinsey put a number on the prize: generative AI could unlock USD 50-70 billion in value for the insurance industry, concentrated in customer operations and claims (McKinsey, 2024).

The CISO and COO had a shared, sensible fear: a half-finished agent that mis-routes claims, leaks PII, or quietly degrades is worse than the manual process it replaces. They had seen the pilot graveyard. They wanted a build that could survive an audit, not a demo.

02 The problem

The blunt truth about agentic projects is that most of them die. Over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls (Gartner, June 2025). And the failure isn't usually the model — 95% of enterprise generative-AI pilots fail to deliver measurable ROI, with poor integration and misaligned priorities, not weak models, as the root cause (MIT NANDA, 2025).

For this insurer, three concrete failure modes mattered:

Silent mis-routing. A claim sent to the wrong queue is an SLA breach and, in a bodily-injury or fraud-flagged case, a regulatory and reputational problem. An agent that is 92% right and silent about the other 8% is dangerous.
No way to prove it works. Leadership would not greenlight production on a vibe. They needed a measurable accuracy bar, held over time, on real claim distributions — not a one-off demo score.
No way to turn it off. "What happens at 2 a.m. when it starts behaving badly?" had no answer in the early proposals they'd seen. Without a tested rollback, the agent was an unbounded liability.

There was also a governance gap underneath all of it. Only 21% of organizations report a mature governance model for agentic AI even as adoption accelerates (Deloitte, 2025). Buying a tool would not close that gap; it would move it.

03 The approach

We never start with a build. Here we started with a paid two-week Discovery Assessment: a fixed-fee, fixed-scope diagnostic whose job is to decide whether the project is worth doing at all, and to model the payback before anyone commits capital.

What Discovery actually produced

A baseline, measured not asserted. We instrumented a sample of real intakes: minutes-per-claim of manual handling, current routing-accuracy rate, and the cost of the rework loop when a claim is mis-routed. You cannot model payback against a number you guessed.
A scoped automation boundary. Not "automate claims." Specifically: classify claim type, extract the structured FNOL fields, score severity, and route, with everything below the confidence threshold or flagged as bodily-injury/fraud sent to a human, untouched. We drew the line where the risk-adjusted value was clearly positive and stopped there.
A payback model the CFO could check. Hours saved per claim multiplied by validated volume, net of the fixed engagement fee and the ongoing run cost. Inference economics helped: the cost of querying a GPT-3.5-class model fell over 280x — from USD 20.00 to USD 0.07 per million tokens between late 2022 and late 2024 (Stanford HAI, 2025), which made per-claim run cost a rounding error against adjuster time.
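The payback arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the model used in Discovery: the class, the function names, and every number in `example` are hypothetical placeholders chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaybackInputs:
    """Inputs to the payback model. All values used below are made-up placeholders."""
    claims_per_month: int            # validated in-scope intake volume
    minutes_saved_per_claim: float   # measured baseline minus assisted handling time
    adjuster_cost_per_hour: float    # fully loaded hourly cost
    run_cost_per_claim: float        # inference plus hosting, per claim
    monthly_ops_cost: float          # monitoring, eval upkeep, on-call
    fixed_fee: float                 # one-time engagement fee


def monthly_net_saving(p: PaybackInputs) -> float:
    """Gross labor saving minus ongoing run and operations cost, per month."""
    hours_saved = p.claims_per_month * p.minutes_saved_per_claim / 60
    gross = hours_saved * p.adjuster_cost_per_hour
    run = p.claims_per_month * p.run_cost_per_claim
    return gross - run - p.monthly_ops_cost


def payback_months(p: PaybackInputs) -> Optional[float]:
    """Months until the fixed fee is recovered; None if net saving is not positive."""
    net = monthly_net_saving(p)
    return p.fixed_fee / net if net > 0 else None


# Hypothetical inputs, for illustration only.
example = PaybackInputs(
    claims_per_month=60_000,
    minutes_saved_per_claim=4.0,
    adjuster_cost_per_hour=50.0,
    run_cost_per_claim=0.01,
    monthly_ops_cost=9_400.0,
    fixed_fee=380_000.0,
)
print(payback_months(example))
```

With these placeholder inputs the per-claim run cost is a tiny fraction of the labor saving, which is the shape of the argument the inference-price figure supports. A `None` result is the "payback doesn't model" outcome that Discovery exists to surface early.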

This sequencing is deliberate. The single strongest correlate of AI delivering real EBIT impact is fundamental workflow redesign, not bolting AI onto a legacy process (McKinsey, 2025). Discovery is where that redesign gets decided — cheaply, on paper, before it gets expensive in code.

The insurer approved the production build on a fixed fee, because the scope was now genuinely fixed. No time-and-materials open-endedness, no subcontractor army — a small senior pod that had already done the diagnostic.

04 Architecture & controls

The agent is not a single prompt. It is a controlled system with four production properties the buyer's CISO signed off on.

1. The harness

The agent runs inside a deterministic harness, not free-form. Each incoming claim flows through fixed stages — ingest, classify, extract, score severity, route — with typed inputs and outputs at every step and structured logging of every decision and its confidence. The LLM does the reading and judgment; the harness owns control flow, retries, timeouts, and the audit trail. This is what makes the system auditable rather than a black box.
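A minimal sketch of what such a harness could look like follows, assuming a simple in-process pipeline. The names (`ClaimState`, `run_pipeline`, the `manual_intake` queue) are hypothetical, the two stage functions are offline stand-ins for model calls, and timeouts are omitted for brevity.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("triage_harness")  # hypothetical logger name


@dataclass
class ClaimState:
    """Typed record handed from stage to stage (field names are illustrative)."""
    claim_id: str
    raw_text: str
    claim_type: Optional[str] = None
    confidence: float = 1.0
    queue: Optional[str] = None
    audit: List[dict] = field(default_factory=list)


Stage = Callable[[ClaimState], ClaimState]


def run_pipeline(state: ClaimState, stages: List[Tuple[str, Stage]],
                 max_retries: int = 2) -> ClaimState:
    """Run stages in fixed order; the harness, not the model, owns control flow."""
    for name, stage in stages:
        for attempt in range(1, max_retries + 2):
            try:
                state = stage(state)
            except Exception as exc:  # a real harness would catch narrower errors
                log.warning("claim=%s stage=%s attempt=%d error=%s",
                            state.claim_id, name, attempt, exc)
                if attempt > max_retries:
                    # Out of retries: fail safe to the manual queue and stop.
                    state.queue = "manual_intake"
                    state.audit.append({"stage": name, "status": "failed",
                                        "attempts": attempt})
                    return state
            else:
                state.audit.append({"stage": name, "status": "ok",
                                    "attempts": attempt,
                                    "confidence": state.confidence})
                log.info("claim=%s stage=%s confidence=%.2f",
                         state.claim_id, name, state.confidence)
                break
    return state


def classify(state: ClaimState) -> ClaimState:
    # Stand-in for the model call: a keyword rule so the sketch runs offline.
    state.claim_type = "auto" if "vehicle" in state.raw_text.lower() else "property"
    state.confidence = min(state.confidence, 0.93)
    return state


def route(state: ClaimState) -> ClaimState:
    state.queue = f"{state.claim_type}_standard"
    return state


result = run_pipeline(ClaimState("c1", "Rear-ended vehicle at a stop light"),
                      [("classify", classify), ("route", route)])
```

The design choice worth noting is that a stage failure never leaves a claim in limbo: once retries are exhausted the claim lands in the manual queue with an audit entry explaining why.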

2. The eval suite

We built a labeled golden set from real, anonymized historical claims spanning the full type and severity distribution — including the rare, hard cases. Every change to a prompt, model version, or routing rule runs against that suite before it ships. The accuracy bar is a gate, not a hope. This directly addresses the hallucination risk that even domain-specialized tools carry: purpose-built legal AI tools still hallucinated on 17-34%+ of hard queries despite vendor 'hallucination-free' claims (Stanford HAI / RegLab, 2025). We assume the model is wrong some of the time and measure exactly how often.
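One way to express "the accuracy bar is a gate" in code is sketched below. It is an assumption-laden illustration: `GoldenCase`, `evaluate`, and `release_gate` are hypothetical names, the thresholds are placeholders, and scoring per segment is our suggested way to keep rare, hard cases from being averaged away.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class GoldenCase:
    text: str             # anonymized claim text
    expected_queue: str   # label agreed by adjusters
    segment: str          # e.g. claim type or severity band (must not be "overall")


def evaluate(predict: Callable[[str], str],
             cases: Iterable[GoldenCase]) -> Dict[str, float]:
    """Routing accuracy per segment, plus an 'overall' figure."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for case in cases:
        correct = predict(case.text) == case.expected_queue
        for key in (case.segment, "overall"):
            totals[key] = totals.get(key, 0) + 1
            hits[key] = hits.get(key, 0) + int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def release_gate(scores: Dict[str, float], overall_bar: float,
                 segment_bar: float) -> List[str]:
    """Return the list of failed bars; an empty list means the change may ship."""
    failures = []
    for key, accuracy in scores.items():
        bar = overall_bar if key == "overall" else segment_bar
        if accuracy < bar:
            failures.append(f"{key}: {accuracy:.3f} < {bar:.3f}")
    return failures
```

In a deploy pipeline, a non-empty result from `release_gate` would fail the build, so a prompt, model, or routing-rule change that regresses any segment cannot ship.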

3. Human-in-the-loop, by design

The agent has a confidence threshold and a hard-coded escalation rule: anything below threshold, and anything classified as bodily-injury or fraud-flagged, goes to a senior adjuster untouched. The agent accelerates the routine majority; it never decides the consequential minority alone. This mirrors the evidence on where automation pays: GenAI assistance lifted support-agent throughput 14% on average and 34% for novice workers (Brynjolfsson, Li & Raymond, NBER 2023) — the gain is in augmenting people on volume work, not replacing judgment.
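The escalation rule is small enough to show in full. The sketch below assumes hypothetical names (`Triage`, `decide_route`, the `senior_adjuster` queue) and a placeholder threshold of 0.90; the real value would be set from the eval suite.

```python
from dataclasses import dataclass
from typing import Tuple

ALWAYS_ESCALATE = frozenset({"bodily_injury"})  # categories that never auto-route
CONFIDENCE_THRESHOLD = 0.90                     # placeholder value, not from the engagement


@dataclass(frozen=True)
class Triage:
    claim_type: str
    confidence: float
    fraud_flag: bool
    proposed_queue: str


def decide_route(t: Triage, threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[str, str]:
    """Return (queue, reason). Category rules run first so confidence cannot override them."""
    if t.claim_type in ALWAYS_ESCALATE:
        return "senior_adjuster", "category"
    if t.fraud_flag:
        return "senior_adjuster", "fraud_flag"
    if t.confidence < threshold:
        return "senior_adjuster", "low_confidence"
    return t.proposed_queue, "auto"
```

Checking category and fraud flags before confidence is deliberate: a bodily-injury claim classified with 0.99 confidence still goes to a senior adjuster.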

4. Tested rollback

There is a one-command rollback to the previous known-good configuration, and a kill switch that reverts the queue to fully manual intake. Both were tested in a staging cutover drill, not just documented. The answer to "what happens at 2 a.m." is now a runbook.
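As a rough sketch of the two controls, assuming configurations are versioned and promoted only after passing the eval gate: `AgentConfig`, `ConfigRegistry`, and the field values are hypothetical, and a production version would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentConfig:
    version: str
    model: str
    prompt_version: str
    confidence_threshold: float


class ConfigRegistry:
    """Keeps known-good configurations, newest last, plus a manual-intake kill switch."""

    def __init__(self, initial: AgentConfig) -> None:
        self._history: List[AgentConfig] = [initial]
        self.manual_only = False

    @property
    def active(self) -> AgentConfig:
        return self._history[-1]

    def promote(self, config: AgentConfig) -> None:
        """Make a new configuration active (call only after it passes the eval gate)."""
        self._history.append(config)

    def rollback(self) -> AgentConfig:
        """Revert to the previous known-good configuration."""
        if len(self._history) < 2:
            raise RuntimeError("no previous known-good configuration")
        self._history.pop()
        return self.active

    def kill(self) -> None:
        """Send all intake to the manual queue until explicitly re-enabled."""
        self.manual_only = True


def intake_queue(registry: ConfigRegistry, agent_queue: str) -> str:
    """Queue a claim actually lands in, honoring the kill switch."""
    return "manual_intake" if registry.manual_only else agent_queue
```

Keeping rollback and the kill switch as separate operations matters: rollback assumes a previous configuration was good, while the kill switch makes no assumption about the agent at all.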

The whole design maps to NIST's AI RMF Generative AI Profile, which names confabulation as one of twelve generative-AI risk categories — giving the insurer a recognized framework to show regulators and internal audit.

05 What shipped

A production triage agent handling FNOL intake on a defined claim-type scope, with:

The five-stage harness (ingest → classify → extract → score → route) with full structured logging and a queryable decision audit trail.
A golden eval suite of labeled historical claims, wired into the deploy pipeline as a release gate.
A confidence-threshold + category-based escalation path to senior adjusters, with mandatory human review on bodily-injury and fraud-flagged claims.
A tested one-command rollback and a manual-intake kill switch, validated in a cutover drill.
An operator dashboard showing per-day volume, automation rate, escalation rate, and rolling routing accuracy — so degradation is visible, not silent.
A short enablement handoff so the insurer's own claims-ops and platform teams can read the logs, run the evals, and operate the kill switch without us in the room.

What we deliberately did not ship: autonomous decisions on high-severity claims, any write to the core policy system of record without human confirmation, and any PII flowing to a model context without access controls and logging. Scope discipline is a feature.

Crucially, the build does not end the relationship with a model frozen in time. 91% of ML models degrade over time (Vela et al., Scientific Reports, 2022), and claim language, fraud patterns, and product mix all drift. The eval suite plus the operator dashboard are what make ongoing monitoring — and, if the insurer chooses, a managed retainer — a small, defined task rather than a re-build.
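The rolling routing-accuracy figure on the dashboard could be computed along these lines. This sketch assumes ground truth comes from adjuster-confirmed queues on reviewed claims; the class name, window size, and alert level are hypothetical.

```python
from collections import deque
from typing import Optional


class RollingAccuracy:
    """Routing accuracy over the most recent `window` reviewed claims."""

    def __init__(self, window: int, alert_below: float) -> None:
        self._outcomes = deque(maxlen=window)  # True where agent and adjuster agreed
        self.alert_below = alert_below

    def record(self, agent_queue: str, adjuster_queue: str) -> None:
        self._outcomes.append(agent_queue == adjuster_queue)

    @property
    def accuracy(self) -> Optional[float]:
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def degraded(self) -> bool:
        """Alert only once the window is full, to avoid noise from tiny samples."""
        full = len(self._outcomes) == self._outcomes.maxlen
        return full and self.accuracy < self.alert_below
```

A `degraded()` signal is a prompt to rerun the eval suite and, if needed, use the rollback or kill switch; it does not diagnose the cause of the drift on its own.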

Exhibit 1

The triage agent harness — five controlled stages

01 Ingest

02 Classify

03 Extract FNOL fields

04 Score severity

05 Route (or escalate to adjuster)

Every claim flows through typed, logged stages; anything below the confidence threshold or flagged bodily-injury/fraud escalates to a human untouched.

Illustrative

06 Outcomes

A framing note before the numbers: this is a representative composite engagement. The architecture and controls above are real and reusable. The engagement-level outcomes below are modeled from the Discovery baseline and labeled illustrative — they are what a build of this shape and volume is designed to produce, not an audited result for a named client. Industry benchmarks are cited and real.

The model the CFO checked in Discovery, carried into production:

Routine intakes routed without human touch rose from zero to a majority of in-scope claims (illustrative), with senior adjusters reallocated from data entry to complex and high-severity work — exactly where the comparable Aviva deployment found value: 80+ claims AI models, complex-case liability assessment cut by 23 days, routing accuracy improved 30% (McKinsey, 2025, cited).
Discovery-to-decision in two weeks, fixed fee, with a payback model the finance team validated before the build was approved (illustrative). The point of the paid Discovery is that the go/no-go is made on evidence, not optimism.
Per-claim run cost a rounding error against adjuster time, thanks to the >280x collapse in inference pricing (Stanford HAI, 2025, cited) — the durable cost is operations, which the eval suite and dashboard keep small.

The industry context for why this matters: only 39% of organizations attribute any enterprise-wide EBIT impact to AI, and just ~6% are high performers (McKinsey, 2025, cited). The difference is not the model — it's the workflow redesign and the discipline to ship something that survives contact with production. That is what the harness, evals, human-in-the-loop, and rollback buy.

Exhibit 2

Why agentic builds fail — and the cost of getting it wrong

Label	Value (%)
GenAI pilots with no measurable ROI (MIT NANDA 2025)	95
Agentic projects canceled by 2027 (Gartner 2025)	40
Orgs with mature agentic governance (Deloitte 2025)	21
Orgs attributing enterprise EBIT impact to AI (McKinsey 2025)	39

The failure modes a harness, eval suite, and rollback are designed to avoid. All figures cited and real (percent).

Gartner 2025; MIT NANDA 2025; Deloitte 2025; McKinsey 2025

07 What we'd tell the next buyer

If you run claims, underwriting, or any high-volume intake and you're weighing an agent build, four things from this engagement transfer directly:

Pay for Discovery. Decide on evidence. A two-week paid diagnostic that models payback against a measured baseline is the cheapest insurance you can buy against the 40%-plus cancellation rate Gartner forecasts. If the payback doesn't model, you've spent two weeks instead of two quarters finding out.
The model is the easy 20%. The harness, evals, and rollback are the 80% that decides whether this survives an audit and a 2 a.m. incident. Don't buy a demo; commission a controlled system.
Draw the automation boundary on purpose. Automate the routine majority, escalate the consequential minority to a human, untouched. "Augment the volume, decide the edge cases yourself" is where the cited 14-34% productivity gains actually live.
Budget for drift, not just launch. Models degrade; claim language and fraud patterns move. An eval suite and an operator dashboard turn ongoing monitoring into a small, defined task — and make a managed retainer optional rather than forced.

We do this on a fixed fee, with senior, AI-literate engineers, no platform lock-in, and a Discovery Assessment first. No subcontractor armies. No six-month roadmap that dies in committee. A scoped build that ships, with the controls to keep it shipping.

Exhibit 3

A comparable public insurance AI deployment

Label	Value
AI models deployed in claims	80
Days cut from complex-case liability assessment	23
Routing accuracy improvement (%)	30
Complaint reduction (%)	65

A real, named carrier proves the claims-automation thesis (Aviva, via McKinsey 2025) — context for the modeled outcomes of this composite engagement. Externally reported public reference — not a Maverin engagement.

McKinsey, 2025



Weighing a claims or intake agent? Start with a paid two-week Discovery Assessment — we model the payback before you commit to a build. Talk to Maverin.
