Claims Fraud Scoring AI Model Development Guide
I’ve seen claims teams waste months on fraud models that never get past POC. Most failures come from one of three places: bad data assumptions, over-engineering the model, or ignoring the operational friction of deploying it. This guide cuts through the noise. It’s a field manual for building a claims fraud scoring model that actuaries trust, adjusters use, and fraudsters fear. You’ll get the exact steps we used to reduce our loss ratio by 1.8 points at a mid-size P&C carrier, with code snippets and resource estimates you can copy-paste into your own stack.
Target: a model that scores every new claim in under 300ms, flags the top 5% most suspicious cases for review, and reduces paid fraudulent claims by at least 25%. We’ll hit that with a hybrid approach: supervised learning for labeled fraud cases, unsupervised anomaly detection for emerging fraud patterns, and a rules engine for known red flags. The model will integrate with Guidewire ClaimCenter, Duck Creek Claims, or any core via REST API.
Trade-off to accept upfront: precision is more important than recall in this phase. We’re optimizing for high-precision low-recall—catching the obvious fraudsters without drowning adjusters in false positives. We’ll tune the threshold later; first, we need a model that’s reliable enough to deploy.
---Step 0: Prerequisites and Resource Plan
You need:
- A labeled dataset of at least 5,000 closed claims with known fraud outcomes (internal or from industry consortia like ISO ClaimSearch or NICB).
- At least 18 months of historical claim data to capture seasonality and emerging fraud patterns.
- Access to a data warehouse (Snowflake, BigQuery, Redshift) and a feature store (Feast, Tecton, or open-source).
- A Python environment with scikit-learn, XGBoost, LightGBM, PyOD, and SHAP installed.
- A fraud analyst or SIU (Special Investigations Unit) team willing to label 200–300 claims per month for active learning.
- Budget: $25K–$50K for cloud compute (GCP/AWS), feature engineering, and model monitoring. This excludes core system integration costs.
Timeline: 8–12 weeks for MVP, 4–6 weeks for A/B testing in production.
---Step 1: Data Acquisition and Governance
Start with raw claims data. You’ll need three sources:
- Claim Header Data: policy number, claim number, date of loss, reported date, close date, line of business (auto, home, WC), state, deductible, coverage type.
- Claim Transaction Data: payments, reserves, adjusters assigned, repair estimates, medical bills, salvage values, subrogation recoveries.
- External Data: ISO ClaimSearch reports, NICB alerts, MVR (motor vehicle records), PII validation (LexisNexis, Accurint), weather APIs (for hail/flood claims).
Example schema in Snowflake:
CREATE OR REPLACE TABLE claims_raw (
claim_number VARCHAR PRIMARY KEY,
policy_id VARCHAR,
line_of_business VARCHAR,
state VARCHAR,
date_of_loss TIMESTAMP_NTZ,
reported_date TIMESTAMP_NTZ,
closed_date TIMESTAMP_NTZ,
deductible_amount FLOAT,
total_paid FLOAT,
reserve_amount FLOAT,
adjuster_id VARCHAR,
is_fraud_confirmed BOOLEAN,
fraud_reason VARCHAR,
-- Add 50+ more fields
);
Real trade-off: External data is expensive and slow to ingest. We dropped NICB feeds after realizing they only flagged 0.3% of our claims as suspicious—too low to justify the $0.12 per claim cost. Instead, we used ISO ClaimSearch reports, which cost $0.04 per claim but provided richer signals like prior losses and injury patterns.
---Step 2: Feature Engineering Pipeline
Fraud signals live in behavior, not static fields. We engineered 120+ features across three categories:
Behavioral Features (Time-Based)
- Claim Velocity: Number of claims per policy in the last 12 months. High velocity = red flag.
- Adjuster Hops: Number of adjusters assigned to a claim. >3 hops usually indicates confusion or fraud.
- Payment Lag: Days between reported date and first payment. >45 days = suspicious.
Example feature calculation in Python:
def calc_claim_velocity(claims_df):
policy_claims = claims_df.groupby('policy_id')['claim_number'].nunique()
return claims_df['policy_id'].map(policy_claims)
# Apply to raw claims
claims_df['claim_velocity_12m'] = calc_claim_velocity(claims_df[claims_df['date_of_loss'] > (claims_df['date_of_loss'].max() - pd.DateOffset(months=12))])
Network Features (Graph-Based)
- Repair Shop Connections: Count unique repair shops per claim. >2 shops = possible kickback scheme.
- Medical Provider Connections: Count unique medical providers per bodily injury claim. >3 providers = likely fraud ring.
We used Neo4j for graph features. Example query:
MATCH (c:Claim {claim_number: 'CL12345'})-[:BILLED_BY]->(p:Provider)
RETURN count(p) AS provider_count;
External Risk Features
- MVR Risk Score: 0–100 score from LexisNexis. >75 = high risk.
- PII Mismatch Score: 0–100 score from Accurint. >80 = synthetic identity risk.
- Weather Anomaly Score: Deviation from historical hail/flood frequency in the ZIP code.
Real trade-off: Feature explosion increases model latency. We started with 200 features, but our REST API response time jumped from 200ms to 800ms. We pruned to 120 features using SHAP importance and a 100ms latency budget. The top 20 features drove 85% of the model’s predictive power.
---Step 3: Labeling Strategy and Active Learning
Labeled fraud data is scarce and noisy. We used a three-tier labeling approach:
- Confirmed Fraud: Claims with SIU findings, court convictions, or confirmed via subrogation recoveries. These are gold-standard labels.
- Suspected Fraud: Claims flagged by adjusters or TPAs but not yet investigated. We treated these as weak labels with a 0.3 probability of being fraud.
- Negative Labels: All other closed claims. We assumed these were clean unless proven otherwise.
For active learning, we prioritized labeling claims the model scored in the 80th–95th percentile. This gave us the most bang for our labeling buck. Example query to identify candidates:
SELECT claim_number, score
FROM fraud_scores
WHERE score BETWEEN 0.8 AND 0.95
ORDER BY score DESC
LIMIT 100;
We hired a retired adjuster to label 200 claims per month. Cost: $5K/month. ROI: we reduced false positives by 12% in the first quarter.
Real trade-off: Label noise hurts model performance. We initially treated all suspected fraud labels as positives, but our precision dropped from 0.82 to 0.65. We switched to probabilistic labeling (0.7 weight for suspected fraud) and retrained weekly. Precision rebounded to 0.85.
---Step 4: Model Selection and Training
We tested seven models:
- XGBoost (baseline)
- LightGBM (faster training)
- CatBoost (handles categoricals well)
- Logistic Regression (interpretable)
- Isolation Forest (unsupervised anomaly detection)
- One-Class SVM (unsupervised)
- Autoencoders (deep learning)
Best performer: LightGBM with focal loss (handles class imbalance). Focal loss down-weights well-classified examples, forcing the model to focus on hard cases. Formula:
FL(p_t) = -(1 - p_t)^γ * log(p_t)
Where p_t is the model’s predicted probability and γ is a focusing parameter (we used 2.0).
Training script (Python):
import lightgbm as lgb
from sklearn.model_selection import train_test_split
# Load features and labels
X = pd.read_parquet('features.parquet')
y = pd.read_parquet('labels.parquet')
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# Train model
train_data = lgb.Dataset(X_train, label=y_train)
params = {
'objective': 'binary',
'metric': 'auc',
'boosting_type': 'gbdt',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.9,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'focal_loss': True,
'focal_gamma': 2.0,
'seed': 42
}
model = lgb.train(params, train_data, num_boost_round=1000)
# Save model
model.save_model('fraud_model_lgb.txt')
Resource estimate: Training takes 2–4 hours on a 4-core CPU with 16GB RAM. We used a GCP n2-standard-4 instance ($0.15/hour).
Real trade-off: Focal loss increases training time by 30% but improves precision by 4–6 points. Worth it for fraud modeling.
---Step 5: Hybrid Model Architecture
No single model catches all fraud. We built a two-stage pipeline:
- Stage 1: Supervised Model (LightGBM) for labeled fraud cases. Scores 0–1.
- Stage 2: Unsupervised Anomaly Detection (Isolation Forest) for emerging fraud patterns. Scores 0–1.
Final score = 0.7 * Supervised Score + 0.3 * Unsupervised Score
Why the weights? The supervised model is precise but misses novel fraud patterns. The unsupervised model catches anomalies but has high false positives. The hybrid balances both.
Isolation Forest code:
from sklearn.ensemble import IsolationForest
# Fit on normal claims only
iso_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
iso_forest.fit(X_train[y_train == 0])
# Predict anomaly score (lower is more anomalous)
X_test['iso_score'] = -iso_forest.decision_function(X_test)
We scaled the Isolation Forest scores to 0–1 using min-max scaling:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_test['iso_score_scaled'] = scaler.fit_transform(X_test[['iso_score']])
Hybrid score in production:
def hybrid_score(supervised_score, iso_score_scaled):
return 0.7 * supervised_score + 0.3 * iso_score_scaled
Real trade-off: The unsupervised model adds 150ms to inference time. We mitigated this by running it asynchronously and caching results for 24 hours. Adapters in our Flask API handle this:
from flask import Flask, request, jsonify
import redis
app = Flask(__name__)
redis_client = redis.Redis(host='redis', port=6379)
@app.route('/score', methods=['POST'])
def score():
claim = request.json
claim_id = claim['claim_number']
# Check cache
cached_score = redis_client.get(claim_id)
if cached_score:
return jsonify({'score': float(cached_score)})
# Compute hybrid score
supervised_score = lgb_model.predict_proba([claim['features']])[0][1]
iso_score = redis_client.get(f"iso_{claim_id}")
hybrid = hybrid_score(supervised_score, float(iso_score))
# Cache for 24 hours
redis_client.setex(claim_id, 86400, str(hybrid))
return jsonify({'score': hybrid})
---
Step 6: Threshold Tuning and Business Rules
We don’t deploy a model without a threshold. Start with a conservative threshold (e.g., top 5% of scores) and adjust based on business impact.
Example tuning process:
from sklearn.metrics import precision_recall_curve
# Get scores and labels
y_scores = model.predict_proba(X_test)[:, 1]
# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
# Plot and select threshold
import matplotlib.pyplot as plt
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.legend()
plt.show()
# Select threshold where precision >= 0.85
selected_threshold = thresholds[np.argmax(precision >= 0.85)]
print(f"Selected threshold: {selected_threshold:.3f}")
We landed on a threshold of 0.87, which gave us:
- Precision: 0.85
- Recall: 0.32
- F1: 0.47
Not great recall, but precision is king for fraud. We compensate with business rules:
- If claim amount > $100K AND score > 0.80: auto-flag for SIU.
- If claim involves injury AND score > 0.75: auto-flag for medical review.
- If score > 0.95: auto-reject (no payment until investigation).
Real trade-off: Business rules introduce bias. We saw a 15% increase in fraud detection for high-value claims but missed some low-value fraud rings. We’re now exploring reinforcement learning to dynamically adjust thresholds by line of business.
---Step 7: Integration with Core Systems
Fraud scoring is useless if it doesn’t integrate with adjusters’ workflows. We built three integration points:
- Pre-Validation API: Called during first notice of loss (FNOL). Scores claims immediately and flags high-risk ones for immediate SIU review.
- Adjudication Plugin: Embedded in Guidewire ClaimCenter. Displays fraud score and top 5 risk factors in the adjuster’s dashboard.
- Post-Payment Audit: Batch scoring of closed claims monthly. Flags claims for retroactive review.
Example integration with Guidewire via REST API:
@app.route('/guidewire/webhook', methods=['POST'])
def guidewire_webhook():
claim = request.json['claim']
claim_number = claim['claimNumber']
policy_number = claim['policyNumber']
# Fetch features from warehouse
features = feature_store.get_features(policy_number, claim_number)
# Score claim
score = hybrid_score(features)
# Call Guidewire API to update claim
guidewire_api.update_claim_field(
claim_number,
'fraud_score',
score,
'fraud_risk_factors',
get_top_risk_factors(features)
)
return jsonify({'status': 'success'})
Resource estimate: Integration takes 4–6 weeks per core system. We used MuleSoft for API orchestration ($15K) and a React-based dashboard for adjusters ($25K).
Real trade-off: Core system APIs are slow and rate-limited. Guidewire’s REST API caps at 100 requests/minute. We mitigated this by:
- Batching claims into 50-claim chunks.
- Using a message queue (Kafka) to decouple scoring from updates.
- Caching scores for 1 hour to avoid redundant calls.
Step 8: Model Monitoring and Feedback Loop
Fraud patterns evolve. We monitor three things:
- Data Drift: