AI Anomaly Detection for Claims Fraud: A Practitioner’s Blueprint

Fraud costs U.S. property and casualty insurers roughly $34 billion annually—about 10% of all claims payments—per the FBI and Coalition Against Insurance Fraud. Yet many carriers still rely on rule-based red flags or manual review. That’s where anomaly detection shines. Not as a silver bullet, but as a force-multiplier: it flags patterns, not just rules, cutting false positives from 40% to under 10% in some deployments.

I’ve seen claims teams at regional MGAs cut SIU referral time by 55% using unsupervised models, and a Top 20 carrier drop questionable payouts by $12 million in the first 12 months. But the real value isn’t detection—it’s integration. Fraud scores must feed straight into adjuster queues, bordereaux systems, and TPAs without manual re-entry. Anything else is just a sandbox experiment.

Below is a battle-tested playbook I’ve used to deploy AI anomaly detection across lines like auto, workers’ comp, and marine cargo. It’s opinionated: you’ll use PyOD for modeling, FastAPI for inference, and S3 + PostgreSQL for storage. No “choose your own adventure.”

---

Phase 1: Data Foundation—Garbage In, Alerts Out

Anomaly detection is only as good as the data it trains on. Skip this phase and your model will learn to hate its life.

Step 1: Ingest Raw Claims & Transactions

Pull structured and semi-structured data. Minimum schema:

claims: claim_id, policy_id, line_of_business, reported_date, injury_date, loss_date, total_incurred, indemnity, expense, status
claims_parts: claim_part_id, claim_id, part_type (vehicle, injury, property), amount, quantity, provider_id
transactions: tx_id, claim_id, payment_date, amount, payment_type (indemnity, expense), check_number
provider_fees: provider_id, specialty, avg_fee, fee_schedule

Use S3 event triggers (or Kafka) to land CSV/JSON into raw zone. Expect 1–3 GB/day for a $1B premium carrier.

Step 2: Feature Engineering Pipeline

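Build a daily batch job. A PySpark sketch:

# feature_pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff

spark = SparkSession.builder.getOrCreate()

def build_daily_features(date):
    claims = spark.read.parquet(f"s3://raw/{date}/claims/")
    # provider_id and amount live on claims_parts; these two paths are assumed to mirror the claims path
    parts = spark.read.parquet(f"s3://raw/{date}/claims_parts/")
    provider_fees = spark.read.parquet(f"s3://raw/{date}/provider_fees/")
    df = claims.withColumn("days_to_report", datediff(col("reported_date"), col("loss_date")))
    df = df.withColumn("indemnity_ratio", col("indemnity") / col("total_incurred"))
    df = df.join(parts, "claim_id").join(provider_fees, "provider_id")
    return df.withColumn("fee_anomaly", col("amount") / col("avg_fee"))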

Critical features:

Temporal outliers: days_to_report > 99th percentile per LOB
Spend velocity: weekly incurred > 3x rolling 4-week median
Fee deviation: fee_anomaly > 2.5 (provider-specific)
Network density: provider involved in > N claims in region (adjust N per LOB)

Trade-off: Including too many provider flags increases false positives in low-claim regions. Cap at 5 network-based features.
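
To make three of the flags above concrete, here is a minimal plain-Python sketch. The function names and the dict-per-claim input shape are my assumptions for illustration; in the actual pipeline these would be Spark or pandas expressions over the feature table.

```python
from statistics import median, quantiles

def temporal_outlier_cutoffs(claims):
    """Return the 99th-percentile days_to_report for each line of business."""
    by_lob = {}
    for claim in claims:
        by_lob.setdefault(claim["line_of_business"], []).append(claim["days_to_report"])
    # quantiles(n=100) yields 99 cut points; the last one is the 99th percentile
    return {lob: quantiles(vals, n=100, method="inclusive")[-1]
            for lob, vals in by_lob.items()}

def spend_velocity_flag(weekly_incurred, multiple=3.0):
    """Flag the latest week if it exceeds `multiple` x the median of the prior 4 weeks."""
    if len(weekly_incurred) < 5:
        return False  # not enough history to form a 4-week baseline
    baseline = median(weekly_incurred[-5:-1])
    return weekly_incurred[-1] > multiple * baseline

def fee_deviation_flag(amount, avg_fee, cutoff=2.5):
    """Flag a billed amount that is more than `cutoff` x the provider's average fee."""
    return avg_fee > 0 and amount / avg_fee > cutoff
```

A claim is a temporal outlier when its days_to_report exceeds the cutoff for its own LOB, which keeps slow-reporting lines such as workers' comp from swamping the alert list.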

Step 3: Label Generation (Optional but Useful)

If you have historical SIU cases, label them 1. For unlabeled data, use a weak signal: “closed_without_payment” = 0, “litigated” = 1, “SIU_referral” = 1. This gives you a noisy but usable training set.
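
The weak-labeling rule can be sketched as a lookup. This assumes the three signals arrive as values of the claims status field and that confirmed SIU cases are available as a set of claim IDs; both are illustrative assumptions, not a fixed schema.

```python
# Weak signals from the text, mapped to noisy labels (1 = suspicious, 0 = likely clean)
WEAK_LABELS = {"closed_without_payment": 0, "litigated": 1, "SIU_referral": 1}

def weak_label(claim, confirmed_siu_ids):
    """Return 1, 0, or None (leave unlabeled) for a single claim dict."""
    if claim["claim_id"] in confirmed_siu_ids:
        return 1  # confirmed historical SIU case overrides any weak signal
    return WEAK_LABELS.get(claim["status"])
```

Claims that return None stay unlabeled; the unsupervised models in Phase 2 still train on them.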

Resource estimate for Phase 1:

Task	Tech	Runtime	Cost
Raw ingestion	Glue / Airflow	30 min/day	$120/month
Feature pipeline	EMR Serverless + Spark	45 min/day	$280/month
Storage	S3 + Parquet	-	$45/month

---

Phase 2: Model Selection—Not All Anomalies Are Equal

Choose the wrong model and you’ll drown in false alerts. Choose wisely and you’ll catch padded invoices before they hit the check printer.

Step 4: Start with Isolation Forest (Unsupervised)

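Why Isolation Forest? It's fast, needs no labels, and scales to 1M+ claims. Its weak spot is mixed numeric/categorical data, so encode categoricals as numbers first. Use PyOD's IForest, which wraps the scikit-learn implementation: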

import pandas as pd
from pyod.models.iforest import IForest
from sklearn.preprocessing import RobustScaler

# Load features + labels (if available)
X = pd.read_parquet("features_20240101.parquet").drop(columns=["claim_id"])
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Train
clf = IForest(contamination=0.05, random_state=42)
clf.fit(X_scaled)

Tune contamination to 1–5% per LOB. Higher values = more false positives.
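
Why higher contamination means more alerts: the setting fixes the share of training claims that land above the alert cutoff. The sketch below shows the idea in plain Python; `cutoff_for` and `flagged_count` are hypothetical helpers, and PyOD's own percentile interpolation may differ slightly at the boundary.

```python
def cutoff_for(scores, contamination):
    """Score at or above which the top `contamination` share of claims falls."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(len(ranked) * contamination))  # number of claims to flag
    return ranked[k - 1]

def flagged_count(scores, contamination):
    """How many claims would be flagged at this contamination setting."""
    cutoff = cutoff_for(scores, contamination)
    return sum(1 for s in scores if s >= cutoff)
```

Moving from 1% to 5% on the same book flags five times as many claims, whether or not five times as much fraud exists.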

Step 5: Add Isolation Forest + Autoencoder (Ensemble)

Combine unsupervised outputs using a simple weighted average. This catches both point anomalies (e.g., single padded invoice) and contextual anomalies (e.g., cluster of claims with same provider + high fee deviation).

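from pyod.models.auto_encoder import AutoEncoder

# hidden_neurons is the argument name in older PyOD releases; check your installed version
ae = AutoEncoder(hidden_neurons=[64, 32, 64], contamination=0.03)
ae.fit(X_scaled)

# Raw scores from the two models are on different scales, so z-score each before averaging
def zscore(s):
    return (s - s.mean()) / s.std()

# Combine scores
combined_score = 0.6 * zscore(clf.decision_scores_) + 0.4 * zscore(ae.decision_scores_)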

Limitation: Autoencoders need >10k claims to stabilize. Don’t use on small MGAs.

Step 6: Parametric Trigger for Cat Claims

For catastrophe claims, use a parametric trigger instead of ML. Example: if wind_speed > 70 mph AND distance_to_coast < 50 miles, score = 0.9. Feed this into the same scoring pipeline.
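
As a minimal sketch, the rule is a two-condition function. The function name and the choice to return None when the rule does not fire (so the caller falls back to the model score) are my assumptions.

```python
def cat_rule_score(wind_speed_mph, distance_to_coast_miles):
    """Fixed score for cat claims that meet both conditions from the text."""
    if wind_speed_mph > 70 and distance_to_coast_miles < 50:
        return 0.9
    return None  # assumption: rule not triggered, defer to the model score
```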

---

Phase 3: Scoring & Threshold Tuning—From Lab to Prod

Scores alone mean nothing. You need thresholds that align with adjuster capacity and SIU bandwidth.

Step 7: Generate Daily Anomaly Scores

Wrap the model in a FastAPI service:

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

model = joblib.load("iforest_20240101.pkl")
scaler = joblib.load("scaler_20240101.pkl")

class Claim(BaseModel):
    claim_id: str
    days_to_report: float
    indemnity_ratio: float
    fee_anomaly: float

app = FastAPI()

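@app.post("/score")
def score_claim(claim: Claim):
    # Feature order must match the columns the scaler and model were fitted on
    X = [[claim.days_to_report, claim.indemnity_ratio, claim.fee_anomaly]]
    X_scaled = scaler.transform(X)
    # predict_proba returns [P(normal), P(outlier)] in 0-1, the scale the thresholds in Step 8 use
    anomaly_score = model.predict_proba(X_scaled)[0, 1]
    return {"claim_id": claim.claim_id, "anomaly_score": float(anomaly_score)}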

Deploy to EC2 (t3.medium) or ECS Fargate. Expect 500–1,000 claims/day, which averages about 0.01 req/sec, so serving cost is negligible.

Step 8: Tune Thresholds per LOB

Use historical SIU cases to set initial thresholds. Example:

LOB	Initial Threshold	False Positive Rate	SIU Catch Rate
Workers’ Comp	0.85	8%	62%
Auto PIP	0.92	12%	78%
Marine Cargo	0.78	5%	45%

Adjust weekly based on adjuster feedback. Use a simple feedback loop: if adjuster marks a claim as “fraudulent,” increment true positive count; if they reject, increment false positive.

Trade-off: Lowering thresholds increases SIU workload. Raising them risks missing sophisticated rings.
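
A small sketch of how a candidate threshold can be checked against historical SIU outcomes before it goes live. Here I assume false positive rate means flagged legitimate claims divided by all legitimate claims; if your team measures it as a share of flagged claims, change the denominator. `threshold_metrics` is a hypothetical helper.

```python
def threshold_metrics(scored, threshold):
    """scored: list of (anomaly_score, is_fraud) pairs, is_fraud being 0 or 1."""
    flagged = [(s, y) for s, y in scored if s > threshold]
    fraud_total = sum(y for _, y in scored)
    legit_total = len(scored) - fraud_total
    true_pos = sum(y for _, y in flagged)
    false_pos = len(flagged) - true_pos
    return {
        "flagged": len(flagged),  # compare this against adjuster and SIU capacity
        "false_positive_rate": false_pos / legit_total if legit_total else 0.0,
        "siu_catch_rate": true_pos / fraud_total if fraud_total else 0.0,
    }
```

Running this over a grid of thresholds per LOB gives the trade-off curve directly: pick the lowest threshold whose flagged count the SIU team can actually work.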

---

Phase 4: Integration—Closing the Loop

An anomaly score sitting in a database is useless. It must trigger actions in the claims system.

Step 9: Push Scores to Adjuster Queues

Use a lightweight ETL job to write scores to PostgreSQL:

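# write_scores.py
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your own host and credentials
engine = create_engine("postgresql+psycopg2://user:password@host:5432/claims")

df = pd.read_csv("daily_scores.csv")
df.to_sql(
    "anomaly_scores",
    engine,
    if_exists="append",
    index=False
)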

Create a view in the adjuster portal:

CREATE VIEW high_risk_claims AS
SELECT c.*, a.anomaly_score
FROM claims c
JOIN anomaly_scores a ON c.claim_id = a.claim_id
WHERE a.anomaly_score > 0.85
ORDER BY a.anomaly_score DESC;

Add a “Fraud Alert” badge in the UI. Include a one-click SIU referral button.

Step 10: Automate Bordereaux & TPA Feeds

Send bordereaux to TPAs with a fraud flag:

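# bordereaux.py
import pandas as pd

df = pd.read_csv("daily_scores.csv")  # assumed input: the score file from Step 9, joined to bordereaux rows upstream
df["fraud_flag"] = df["anomaly_score"].apply(lambda x: "Y" if x > 0.85 else "N")
df.to_csv("tpa_bordereaux_20240101.csv", index=False)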

Use SFTP or API (if TPA supports it). Expect 15–30 minutes extra per monthly bordereaux prep.

Step 11: Real-Time Alerts for High-Risk Payments

For auto lines, trigger a webhook when a payment > $5k hits the queue and has an anomaly score > 0.9. Example:

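# webhook_lambda.py
import boto3

sns = boto3.client("sns")  # created once per Lambda container and reused across invocations

def lambda_handler(event, context):
    # Alert only when a large payment coincides with a high anomaly score
    if event["payment_amount"] > 5000 and event["anomaly_score"] > 0.9:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:high-risk-payment",
            Message=f"Payment {event['check_number']} flagged for review"
        )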

---

Phase 5: Monitoring & Governance—Keeping It Alive

Models degrade. Fraudsters adapt. Without monitoring, your $12M savings become a $2M headache.

Step 12: Track Model Drift

Run daily Kolmogorov-Smirnov tests on feature distributions. Flag if p-value < 0.01 for any feature. Use Evidently or custom scripts.
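
For intuition, here is a dependency-free sketch of the two-sample test applied to one feature. It compares the KS statistic to the large-sample critical value at alpha = 0.01 rather than computing an exact p-value; in production I would lean on scipy.stats.ks_2samp or Evidently instead. The function names are hypothetical.

```python
import math

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both samples past every value <= x, then compare the CDFs at x
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def drifted(reference, current, alpha=0.01):
    """True when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.63 for alpha = 0.01
    return ks_statistic(reference, current) > c_alpha * math.sqrt((n + m) / (n * m))
```

Here `reference` would be a feature's values from the training window and `current` the same feature from today's batch. With dozens of features tested daily, expect occasional false alarms at 0.01.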

Step 13: Monitor Business Impact

Track these KPIs weekly:

False positive rate (adjuster feedback)
SIU referral rate (should increase slightly)
Loss ratio improvement (target: 0.5–1.5 point reduction)
Combined ratio delta (if integrated with UW)

Example dashboard:

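# metrics.py
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/claims")  # placeholder DSN
# Assumes anomaly_scores carries daily counts: total_flagged, adjuster_marked_fp, siu_referral
df = pd.read_sql("SELECT * FROM anomaly_scores WHERE date > CURRENT_DATE - 30", engine)
# Pool counts over the 30-day window so low-volume days don't dominate the rates
fp_rate = df["adjuster_marked_fp"].sum() / df["total_flagged"].sum()
siu_rate = df["siu_referral"].sum() / df["total_flagged"].sum()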

Step 14: Retrain Quarterly or on Drift

Use Airflow to trigger retraining when drift is detected. Keep the last 3 model versions in S3. Use a canary deployment: roll out to 10% of claims, monitor for 7 days, then move to full rollout.

Risk: Retraining without guardrails can introduce new biases. Always validate on a holdout set.
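
One way to pick the 10% canary slice is a stable hash of the claim ID, so a given claim is always scored by the same model version for the whole 7-day window. This is an assumption about mechanism, not something the rollout requires; `in_canary` is a hypothetical helper.

```python
import hashlib

def in_canary(claim_id, percent=10):
    """Deterministically route roughly `percent`% of claims to the new model."""
    digest = hashlib.sha256(claim_id.encode("utf-8")).hexdigest()
    # Map the hash to a bucket 0-99; buckets below `percent` go to the canary
    return int(digest, 16) % 100 < percent
```

Because the assignment depends only on the ID, reruns and backfills land in the same bucket, which keeps the canary comparison clean.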

---

Phase 6: Advanced Tactics—When Basics Aren’t Enough

If your baseline catches 60% of fraud but misses rings, it’s time to go deeper.

Graph-Based Anomaly Detection

Use Neo4j or Amazon Neptune to model:

Nodes: Claim, Provider, Policyholder, Location
Edges: Submitted_by, Treated_by, Lives_at
Anomalies: Claims with high betweenness centrality in provider subgraphs

Example query:

MATCH (c:Claim)-[:Treated_by]->(p:Provider)
WHERE c.anomaly_score > 0.8
RETURN p.provider_id, count(c) as claim_count
ORDER BY claim_count DESC
LIMIT 20;

Resource cost: ~$500/month for a small graph cluster. Worth it for auto or workers’ comp rings.

NLP for Narratives

Use spaCy or Hugging Face to extract entities from adjuster notes:

Temporal inconsistencies (“reported injury 3 days after accident but notes say same day”)
Provider duplication (“same chiropractor listed on 12 claims in 2 weeks”)

Combine NLP score with anomaly score. Example:

from transformers import pipeline

classifier = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
text = "Claimant reported slip on wet floor 2024-01-15. Witnesses: none. Treatment: Dr. Smith, same provider as prior claim."
entities = classifier(text)
# Placeholder heuristic: a keyword match stands in for a real rule built on the extracted entities
nlp_score = 1.0 if "prior claim" in text else 0.0

Limitation: NLP adds 2–3 seconds per claim. Use only for high-score claims.

---

Cost & ROI Reality Check

Here’s what it actually costs to run this at scale:

Component	Monthly Cost (USD)	Notes
Data pipeline (EMR + Glue)	$400	Scales with claims volume
Model serving (EC2 t3.medium)	$50	Can use Lambda for low volume
Graph DB (Neo4j Aura)	$500	Optional for rings
Storage (S3 + PostgreSQL)	$120	Parquet + RDS
Monitoring (Evidently + CloudWatch)	$80	Open-source mostly
Total	$1,150	Break-even at ~$13.8k saved fraud per year (infrastructure only)

ROI math: If your baseline fraud loss is $20M/year and you catch 30% with this system, you save $6M. Minus $13.8k/year in infrastructure costs → net of about $5.99M. That's roughly a 434x return on infrastructure spend, before staff time. But it assumes a 30% catch rate; many carriers only get 15–20%.
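
The same arithmetic as a short script, so you can plug in your own baseline loss and catch rate. The monthly figures are the infrastructure components from the cost table; staff time is not included, and the function name is mine.

```python
# Monthly infrastructure costs from the table: pipeline, serving, graph DB, storage, monitoring
MONTHLY_COST = 400 + 50 + 500 + 120 + 80
ANNUAL_COST = MONTHLY_COST * 12

def roi(baseline_fraud_loss, catch_rate):
    """Return (gross savings, net savings, net savings as a multiple of annual cost)."""
    saved = baseline_fraud_loss * catch_rate
    net = saved - ANNUAL_COST
    return saved, net, net / ANNUAL_COST

for rate in (0.15, 0.20, 0.30):
    saved, net, multiple = roi(20_000_000, rate)
    print(f"catch rate {rate:.0%}: saved ${saved:,.0f}, net ${net:,.0f}, {multiple:.0f}x")
```

Because infrastructure cost is tiny next to the savings, the multiple scales almost linearly with catch rate. The catch rate is the number to argue about, not the cloud bill.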

Hard truth: If your SIU team is understaffed or your data quality is poor, this will fail. Fix those first.

---

Common Pitfalls & How to Avoid Them

I’ve seen teams waste six months on models that never shipped. Avoid these:

Over-engineering: Don’t build a real-time streaming pipeline for a 500-claims/day book. Start with daily batch.
Ignoring adjuster workflow: If your fraud score isn’t visible in