AI Anomaly Detection for Claims Fraud: A Practitioner’s Blueprint
Fraud costs U.S. property and casualty insurers roughly $34 billion annually—about 10% of all claims payments—per the FBI and Coalition Against Insurance Fraud. Yet many carriers still rely on rule-based red flags or manual review. That’s where anomaly detection shines. Not as a silver bullet, but as a force-multiplier: it flags patterns, not just rules, cutting false positives from 40% to under 10% in some deployments.
I’ve seen claims teams at regional MGAs cut SIU referral time by 55% using unsupervised models, and a Top 20 carrier drop questionable payouts by $12 million in the first 12 months. But the real value isn’t detection—it’s integration. Fraud scores must feed straight into adjuster queues, bordereaux systems, and TPAs without manual re-entry. Anything else is just a sandbox experiment.
Below is a battle-tested playbook I’ve used to deploy AI anomaly detection across lines like auto, workers’ comp, and marine cargo. It’s opinionated: you’ll use PyOD for modeling, FastAPI for inference, and S3 + PostgreSQL for storage. No “choose your own adventure.”
---
Phase 1: Data Foundation - Garbage In, Alerts Out
Anomaly detection is only as good as the data it trains on. Skip this phase and your model will learn to hate its life.
Step 1: Ingest Raw Claims & Transactions
Pull structured and semi-structured data. Minimum schema:
- claims: claim_id, policy_id, line_of_business, reported_date, injury_date, loss_date, total_incurred, indemnity, expense, status
- claims_parts: claim_part_id, claim_id, part_type (vehicle, injury, property), amount, quantity, provider_id
- transactions: tx_id, claim_id, payment_date, amount, payment_type (indemnity, expense), check_number
- provider_fees: provider_id, specialty, avg_fee, fee_schedule
Use S3 event triggers (or Kafka) to land CSV/JSON in the raw zone, then convert to Parquet so the feature pipeline can read it. Expect 1–3 GB/day for a $1B premium carrier.
Step 2: Feature Engineering Pipeline
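Build a daily batch job. A PySpark sketch:
# feature_pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff

spark = SparkSession.builder.getOrCreate()

def build_daily_features(date):
    claims = spark.read.parquet(f"s3://raw/{date}/claims/")
    # Assumed paths: provider_id and amount live in claims_parts, not claims
    parts = spark.read.parquet(f"s3://raw/{date}/claims_parts/")
    provider_fees = spark.read.parquet("s3://raw/provider_fees/")
    df = claims.withColumn("days_to_report", datediff(col("reported_date"), col("loss_date")))
    df = df.withColumn("indemnity_ratio", col("indemnity") / col("total_incurred"))
    df = df.join(parts, "claim_id").join(provider_fees, "provider_id")
    df = df.withColumn("fee_anomaly", col("amount") / col("avg_fee"))
    return df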
Critical features:
- Temporal outliers: days_to_report > 99th percentile per LOB
- Spend velocity: weekly incurred > 3x rolling 4-week median
- Fee deviation: fee_anomaly > 2.5 (provider-specific)
- Network density: provider involved in > N claims in region (adjust N per LOB)
Trade-off: Including too many provider flags increases false positives in low-claim regions. Cap at 5 network-based features.
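The flags listed above can be sketched in plain Python before committing them to the Spark job. This is a minimal illustration, not the production pipeline: the function names and the list-of-dicts input shape are assumptions, and the cutoffs (99th percentile, 3x, 2.5) are the ones quoted in the list.

```python
from collections import defaultdict
from statistics import median, quantiles


def p99_days_to_report(claims):
    """99th-percentile days_to_report per line of business."""
    by_lob = defaultdict(list)
    for claim in claims:
        by_lob[claim["line_of_business"]].append(claim["days_to_report"])
    # quantiles(n=100) yields 99 cut points; the last is the 99th percentile.
    # It needs at least two observations, so thinner LOBs are skipped.
    return {
        lob: quantiles(values, n=100, method="inclusive")[-1]
        for lob, values in by_lob.items()
        if len(values) >= 2
    }


def spend_velocity_flag(weekly_incurred, multiple=3.0, window=4):
    """True when the latest week exceeds `multiple` x the median of the prior `window` weeks."""
    if len(weekly_incurred) <= window:
        return False  # not enough history to form a baseline
    baseline = median(weekly_incurred[-window - 1:-1])
    return weekly_incurred[-1] > multiple * baseline


def fee_deviation_flag(amount, avg_fee, limit=2.5):
    """True when a billed amount exceeds `limit` x the provider's average fee."""
    return avg_fee > 0 and amount / avg_fee > limit
```

A claim is a temporal outlier when its days_to_report exceeds the cutoff returned for its LOB. Computing the cutoff per LOB matters because reporting lags differ sharply between, say, auto and workers' comp.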
Step 3: Label Generation (Optional but Useful)
If you have historical SIU cases, label them 1. For unlabeled data, use a weak signal: “closed_without_payment” = 0, “litigated” = 1, “SIU_referral” = 1. This gives you a noisy but usable training set.
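The weak-labeling rule can be written as a small function. This is a sketch under assumptions: the dictionary keys (`siu_confirmed`, `siu_referral`, `litigated`, `status`) are hypothetical field names, and returning `None` for claims with no signal is a choice made here so they can be left out of any supervised evaluation.

```python
def weak_label(claim):
    """Return 1 (suspicious), 0 (presumed clean), or None (no usable signal)."""
    if claim.get("siu_confirmed"):  # historical SIU case, the strongest label
        return 1
    if claim.get("siu_referral") or claim.get("litigated"):
        return 1  # weak positive signals
    if claim.get("status") == "closed_without_payment":
        return 0  # weak negative signal
    return None


claims = [
    {"claim_id": "A1", "siu_confirmed": True},
    {"claim_id": "A2", "litigated": True},
    {"claim_id": "A3", "status": "closed_without_payment"},
    {"claim_id": "A4", "status": "open"},
]
labels = {c["claim_id"]: weak_label(c) for c in claims}
```

Positive signals are checked first, so a claim that was both litigated and closed without payment is labeled 1.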
Resource estimate for Phase 1:
| Task | Tech | Runtime | Cost |
|---|---|---|---|
| Raw ingestion | Glue / Airflow | 30 min/day | $120/month |
| Feature pipeline | EMR Serverless + Spark | 45 min/day | $280/month |
| Storage | S3 + Parquet | - | $45/month |
---
Phase 2: Model Selection - Not All Anomalies Are Equal
Choose the wrong model and you’ll drown in false alerts. Choose wisely and you’ll catch padded invoices before they hit the check printer.
Step 4: Start with Isolation Forest (Unsupervised)
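Why Isolation Forest? It's fast, needs no labels, and scales to 1M+ claims. Its weak spot is mixed numeric/categorical data, which it handles poorly, so encode categoricals numerically first. Use PyOD's IForest, which wraps scikit-learn's IsolationForest:
import joblib
import pandas as pd
from pyod.models.iforest import IForest
from sklearn.preprocessing import RobustScaler

# Load numeric features; drop the identifier so it is not treated as a signal
X = pd.read_parquet("features_20240101.parquet").drop(columns=["claim_id"])
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Train
clf = IForest(contamination=0.05, random_state=42)
clf.fit(X_scaled)

# Persist both artifacts for the scoring service
joblib.dump(clf, "iforest_20240101.pkl")
joblib.dump(scaler, "scaler_20240101.pkl")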
Tune contamination to 1–5% per LOB. Higher values = more false positives.
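To see why contamination drives alert volume, it helps to treat it as the share of claims that will be flagged: PyOD sets its cutoff at the matching percentile of the training scores. The sketch below mimics that idea in plain Python. `flag_top_fraction` is a hypothetical helper written for illustration, not a PyOD function.

```python
def flag_top_fraction(scores, contamination):
    """Flag the highest-scoring share of claims (higher score = more anomalous)."""
    n_flag = min(round(len(scores) * contamination), len(scores))
    if n_flag <= 0:
        return [False] * len(scores)
    # The cutoff is the score of the last claim that still makes the flagged set.
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [s >= cutoff for s in scores]


scores = [i / 100 for i in range(100)]
flagged_at_1pct = sum(flag_top_fraction(scores, 0.01))
flagged_at_5pct = sum(flag_top_fraction(scores, 0.05))
```

If the true fraud rate in a LOB is closer to 1%, a 5% setting sends roughly five times as many claims to review, and most of the extra ones are false positives.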
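Step 5: Add an Autoencoder to the Isolation Forest (Ensemble)
Combine the two unsupervised outputs with a simple weighted average. The raw scores sit on different scales (an isolation-based score versus a reconstruction error), so rescale each to [0, 1] before averaging. This catches both point anomalies (e.g., single padded invoice) and contextual anomalies (e.g., cluster of claims with same provider + high fee deviation).
import numpy as np
from pyod.models.auto_encoder import AutoEncoder

# hidden_neurons is the PyOD 1.x argument name; check the signature in your installed version
ae = AutoEncoder(hidden_neurons=[64, 32, 64], contamination=0.03)
ae.fit(X_scaled)

def minmax(scores):
    # Rescale to [0, 1] so neither model dominates the average
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())

# Combine scores
combined_score = 0.6 * minmax(clf.decision_scores_) + 0.4 * minmax(ae.decision_scores_)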
Limitation: Autoencoders need >10k claims to stabilize. Don’t use on small MGAs.
Step 6: Parametric Trigger for Cat Claims
For catastrophe claims, use a parametric trigger instead of ML. Example: if wind_speed > 70 mph AND distance_to_coast < 50 miles, score = 0.9. Feed this into the same scoring pipeline.
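The example rule translates directly into a function. The 70 mph and 50 mile limits and the 0.9 score come from the text. The function name and the `fallback` argument (what to return when the trigger does not fire) are assumptions for illustration.

```python
def cat_trigger_score(wind_speed_mph, distance_to_coast_miles, fallback=None):
    """Return the fixed parametric score when both conditions hold, else `fallback`."""
    if wind_speed_mph > 70 and distance_to_coast_miles < 50:
        return 0.9
    return fallback


# Example: a claim 12 miles inland during 85 mph winds trips the trigger.
triggered = cat_trigger_score(85, 12)
not_triggered = cat_trigger_score(60, 12, fallback=0.2)
```

Because the result is a plain number on the same 0 to 1 scale as the thresholds, it can be written to the same score table as the model output.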
---
Phase 3: Scoring & Threshold Tuning - From Lab to Prod
Scores alone mean nothing. You need thresholds that align with adjuster capacity and SIU bandwidth.
Step 7: Generate Daily Anomaly Scores
Wrap the model in a FastAPI service:
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

model = joblib.load("iforest_20240101.pkl")
scaler = joblib.load("scaler_20240101.pkl")

class Claim(BaseModel):
    claim_id: str
    days_to_report: float
    indemnity_ratio: float
    fee_anomaly: float

app = FastAPI()

@app.post("/score")
def score(claim: Claim):
    # Feature order must match the order used at training time
    X = [[claim.days_to_report, claim.indemnity_ratio, claim.fee_anomaly]]
    X_scaled = scaler.transform(X)
    # predict_proba returns [P(inlier), P(outlier)] in [0, 1], the scale the
    # per-LOB thresholds assume; decision_function returns raw scores that are not on that scale
    anomaly_score = model.predict_proba(X_scaled)[0, 1]
    return {"claim_id": claim.claim_id, "anomaly_score": float(anomaly_score)}
Deploy to EC2 (t3.medium) or ECS Fargate. Expect 500–1,000 claims/day, which averages well under 1 req/sec even with intraday bursts, so serving cost is negligible.
Step 8: Tune Thresholds per LOB
Use historical SIU cases to set initial thresholds. Example:
| LOB | Initial Threshold | False Positive Rate | SIU Catch Rate |
|---|---|---|---|
| Workers’ Comp | 0.85 | 8% | 62% |
| Auto PIP | 0.92 | 12% | 78% |
| Marine Cargo | 0.78 | 5% | 45% |
Adjust weekly based on adjuster feedback. Use a simple feedback loop: if adjuster marks a claim as “fraudulent,” increment true positive count; if they reject, increment false positive.
Trade-off: Lowering thresholds increases SIU workload. Raising them risks missing sophisticated rings.
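One way to set an initial threshold from historical SIU cases is to fix the false positive rate the team can absorb and read off the resulting catch rate. The sketch below assumes scores where higher means more suspicious and labels of 1 for confirmed SIU cases and 0 otherwise. `tune_threshold` is a hypothetical helper, not part of any library.

```python
import math


def tune_threshold(scores, labels, max_fpr):
    """Pick a cutoff whose false positive rate stays within max_fpr.

    A claim is flagged when score > cutoff.
    Returns (cutoff, false_positive_rate, catch_rate).
    """
    clean = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    fraud = [s for s, y in zip(scores, labels) if y == 1]
    if not clean or not fraud:
        raise ValueError("need both labeled classes")
    # Number of clean claims we are willing to flag.
    allowed = math.floor(max_fpr * len(clean) + 1e-9)
    # Using the next clean score as the cutoff keeps flagged clean claims <= allowed.
    cutoff = clean[allowed] if allowed < len(clean) else float("-inf")
    fpr = sum(s > cutoff for s in clean) / len(clean)
    catch_rate = sum(s > cutoff for s in fraud) / len(fraud)
    return cutoff, fpr, catch_rate


clean_scores = [i / 10 for i in range(1, 11)]
fraud_scores = [0.95, 0.85, 0.5]
cutoff, fpr, catch_rate = tune_threshold(
    clean_scores + fraud_scores,
    [0] * len(clean_scores) + [1] * len(fraud_scores),
    max_fpr=0.2,
)
```

Rerunning this per LOB each week with the latest adjuster feedback as labels gives the weekly adjustment described above.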
---
Phase 4: Integration - Closing the Loop
An anomaly score sitting in a database is useless. It must trigger actions in the claims system.
Step 9: Push Scores to Adjuster Queues
Use a lightweight ETL job to write scores to PostgreSQL:
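# write_scores.py
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; load real credentials from the environment or a secrets manager
engine = create_engine("postgresql+psycopg2://user:password@host:5432/claims")

df = pd.read_csv("daily_scores.csv")
df.to_sql(
    "anomaly_scores",
    engine,
    if_exists="append",
    index=False,
)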
Create a view in the adjuster portal:
CREATE VIEW high_risk_claims AS
SELECT c.*, a.anomaly_score
FROM claims c
JOIN anomaly_scores a ON c.claim_id = a.claim_id
WHERE a.anomaly_score > 0.85
ORDER BY a.anomaly_score DESC;
Add a “Fraud Alert” badge in the UI. Include a one-click SIU referral button.
Step 10: Automate Bordereaux & TPA Feeds
Send bordereaux to TPAs with a fraud flag:
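# bordereaux.py
import pandas as pd

df = pd.read_csv("daily_scores.csv")
df["fraud_flag"] = df["anomaly_score"].apply(lambda x: "Y" if x > 0.85 else "N")
df.to_csv("tpa_bordereaux_20240101.csv", index=False)  # index=False keeps the row index out of the feed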
Use SFTP or API (if TPA supports it). Expect 15–30 minutes extra per monthly bordereaux prep.
Step 11: Real-Time Alerts for High-Risk Payments
For auto lines, trigger a webhook when a payment > $5k hits the queue and has an anomaly score > 0.9. Example:
# webhook_lambda.py
import boto3

sns = boto3.client("sns")  # created once per container and reused across invocations

def lambda_handler(event, context):
    if event["payment_amount"] > 5000 and event["anomaly_score"] > 0.9:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:high-risk-payment",
            Message=f"Payment {event['check_number']} flagged for review",
        )
---
Phase 5: Monitoring & Governance - Keeping It Alive
Models degrade. Fraudsters adapt. Without monitoring, your $12M savings become a $2M headache.
Step 12: Track Model Drift
Run daily Kolmogorov-Smirnov tests on feature distributions. Flag if p-value < 0.01 for any feature. Use Evidently or custom scripts.
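For teams going the custom-script route, the check can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction. In practice `scipy.stats.ks_2samp` is the usual choice.

```python
import math


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(reference), sorted(current)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value for a two-sample KS statistic."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.1:
        return 1.0  # series converges slowly here; the p-value is effectively 1
    total = 0.0
    for k in range(1, 101):
        term = 2 * (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)


def drifted_features(reference, current, alpha=0.01):
    """reference/current: dicts mapping feature name -> list of values."""
    flagged = []
    for name, ref_values in reference.items():
        cur_values = current[name]
        d = ks_statistic(ref_values, cur_values)
        if ks_pvalue(d, len(ref_values), len(cur_values)) < alpha:
            flagged.append(name)
    return flagged
```

With many features and large daily volumes, tiny shifts can produce very small p-values, so it is worth logging the KS statistic itself alongside the flag.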
Step 13: Monitor Business Impact
Track these KPIs weekly:
- False positive rate (adjuster feedback)
- SIU referral rate (should increase slightly)
- Loss ratio improvement (target: 0.5–1.5 point reduction)
- Combined ratio delta (if integrated with UW)
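Example query behind the dashboard:
# metrics.py
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/claims")  # placeholder
# Assumes one row per day with counts in adjuster_marked_fp, siu_referral, and total_flagged
df = pd.read_sql("SELECT * FROM anomaly_scores WHERE date > CURRENT_DATE - 30", engine)
# Ratio of sums, so low-volume days do not skew the rate
fp_rate = df["adjuster_marked_fp"].sum() / df["total_flagged"].sum()
siu_rate = df["siu_referral"].sum() / df["total_flagged"].sum()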
Step 14: Retrain Quarterly or on Drift
Use Airflow to trigger retraining when drift is detected. Keep the last 3 model versions in S3. Use a canary deployment: roll out to 10% of claims, monitor for 7 days, then move to full rollout.
Risk: Retraining without guardrails can introduce new biases. Always validate on holdout set.
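A simple way to route 10% of claims to the candidate model is a deterministic hash of the claim ID, so the same claim always hits the same model during the 7-day window. This is a sketch: `in_canary` is a hypothetical helper, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib


def in_canary(claim_id, percent=10):
    """True if this claim should be scored by the candidate model."""
    digest = hashlib.sha256(claim_id.encode("utf-8")).hexdigest()
    # Map the hash to a stable bucket in 0..99 and take the lowest `percent` buckets.
    return int(digest, 16) % 100 < percent


sample_ids = [f"CLM-{i}" for i in range(10000)]
canary_share = sum(in_canary(cid) for cid in sample_ids) / len(sample_ids)
```

Hashing rather than random sampling keeps assignments reproducible, which makes it easier to compare adjuster feedback between the two models afterwards.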
---
Phase 6: Advanced Tactics - When Basics Aren’t Enough
If your baseline catches 60% of fraud but misses rings, it’s time to go deeper.
Graph-Based Anomaly Detection
Use Neo4j or Amazon Neptune to model:
- Nodes: Claim, Provider, Policyholder, Location
- Edges: Submitted_by, Treated_by, Lives_at
- Anomalies: Claims with high betweenness centrality in provider subgraphs
Example query:
MATCH (c:Claim)-[:Treated_by]->(p:Provider)
WHERE c.anomaly_score > 0.8
RETURN p.provider_id, count(c) AS claim_count
ORDER BY claim_count DESC
LIMIT 20;
Resource cost: ~$500/month for a small graph cluster. Worth it for auto or workers’ comp rings.
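Before paying for a graph cluster, the same provider-concentration question can be checked offline. The sketch below counts high-score claims per provider from a flat list of (claim_id, provider_id) pairs. The function name and input shapes are assumptions, and it covers only this one count, not centrality measures.

```python
from collections import Counter


def top_providers(treatments, scores, min_score=0.8, limit=20):
    """Rank providers by how many of their claims exceed min_score.

    treatments: iterable of (claim_id, provider_id) pairs
    scores: dict mapping claim_id -> anomaly score
    """
    counts = Counter(
        provider_id
        for claim_id, provider_id in treatments
        if scores.get(claim_id, 0.0) > min_score
    )
    return counts.most_common(limit)


treatments = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"), ("C4", "P2"), ("C5", "P2"), ("C6", "P3")]
scores = {"C1": 0.9, "C2": 0.95, "C3": 0.85, "C4": 0.91, "C5": 0.3, "C6": 0.1}
ranking = top_providers(treatments, scores)
```

If this flat count already surfaces a handful of providers worth a look, that is a reasonable signal the graph investment will pay off.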
NLP for Narratives
Use spaCy or Hugging Face to extract entities from adjuster notes:
- Temporal inconsistencies (“reported injury 3 days after accident but notes say same day”)
- Provider duplication (“same chiropractor listed on 12 claims in 2 weeks”)
Combine NLP score with anomaly score. Example:
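from transformers import pipeline

# NER tags people, organizations, and locations; it does not detect inconsistencies by itself
classifier = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")
text = "Claimant reported slip on wet floor 2024-01-15. Witnesses: none. Treatment: Dr. Smith, same provider as prior claim."
entities = classifier(text)  # e.g., provider names to match against other claims
# Placeholder keyword heuristic; replace with entity matching against claim history
nlp_score = 1.0 if "prior claim" in text.lower() else 0.0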
Limitation: NLP adds 2–3 seconds per claim. Use only for high-score claims.
---
Cost & ROI Reality Check
Here’s what it actually costs to run this at scale:
| Component | Monthly Cost (USD) | Notes |
|---|---|---|
| Data pipeline (EMR + Glue) | $400 | Scales with claims volume |
| Model serving (EC2 t3.medium) | $50 | Can use Lambda for low volume |
| Graph DB (Neo4j Aura) | $500 | Optional for rings |
| Storage (S3 + PostgreSQL) | $120 | Parquet + RDS |
| Monitoring (Evidently + CloudWatch) | $80 | Open-source mostly |
| Total | $1,150 | About $13.8k/year; infrastructure breaks even at ~$13.8k saved fraud per year |
ROI math: If your baseline fraud loss is $20M/year and you catch 30% with this system, you save $6M. Minus $13.8k/year in infrastructure costs, that nets about $5.99M, roughly a 434x return. But it assumes a 30% catch rate, and many carriers only get 15–20%.
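The arithmetic is easy to rerun with your own numbers. The sketch below uses the monthly figures from the cost table and covers infrastructure only. Staffing and SIU time are not included, and the function name is hypothetical.

```python
MONTHLY_COSTS_USD = {
    "data_pipeline": 400,
    "model_serving": 50,
    "graph_db": 500,
    "storage": 120,
    "monitoring": 80,
}
ANNUAL_COST_USD = 12 * sum(MONTHLY_COSTS_USD.values())


def fraud_roi(baseline_fraud_loss, catch_rate, annual_cost=ANNUAL_COST_USD):
    """Return (net_savings, roi_multiple) for a given catch rate."""
    savings = baseline_fraud_loss * catch_rate
    net = savings - annual_cost
    return net, net / annual_cost


net_at_30, roi_at_30 = fraud_roi(20_000_000, 0.30)
net_at_15, roi_at_15 = fraud_roi(20_000_000, 0.15)
```

Halving the catch rate roughly halves the return, which is why the catch-rate assumption deserves more scrutiny than the infrastructure bill.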
Hard truth: If your SIU team is understaffed or your data quality is poor, this will fail. Fix those first.
---
Common Pitfalls & How to Avoid Them
I’ve seen teams waste six months on models that never shipped. Avoid these:
- Over-engineering: Don’t build a real-time streaming pipeline for a 500-claims/day book. Start with daily batch.
- Ignoring adjuster workflow: If your fraud score isn’t visible in