NLP for Insurance Fraud Investigation: A Practitioner’s Implementation Guide
I’ve spent years watching claims teams drown in unstructured data—adjuster notes, police reports, repair invoices, social media posts—while fraudsters exploit gaps in manual review. Natural Language Processing (NLP) can automate much of this, but only if you build systems that handle insurance’s unique linguistic patterns: dense jargon, contradictory narratives, and intentional obfuscation. This guide walks you through building an end-to-end NLP pipeline for fraud detection, from data ingestion to model deployment, with realistic trade-offs at each step.
We’ll use a real-world example: detecting padded auto repair claims. The average padded claim costs insurers 12% more than legitimate ones, according to the Insurance Research Council. But without NLP, flagging these requires armies of investigators squinting at PDFs of "1987 Toyota Camry radiator replacement" receipts that seem identical to legitimate ones. NLP won’t replace investigators—it will let them focus where it matters: the 3% of claims that actually need human scrutiny.
Assumptions and Prerequisites
This tutorial targets a practitioner with:
- Python 3.9+ and basic ML knowledge (you know what a TF-IDF vectorizer does)
- Access to 2–5 years of historical claims data (structured + unstructured text)
- A cloud budget of ~$5k/month for experimentation (adjust based on scale)
- No existing NLP infrastructure (we’ll build from scratch)
If your data is locked in legacy systems, plan for 4–6 weeks of ETL work before this pipeline. Most insurers underestimate the cost of cleaning 30-year-old claims notes written in shorthand by adjusters who retired in 1998.
Step 1: Data Acquisition and Legal Scoping
Start with a narrow, high-impact use case. Padded auto repair claims are ideal: fraudsters often reuse the same inflated labor codes across multiple shops, and the language patterns are repetitive enough for NLP to catch.
Data sources to collect:
- Structured claims data: Policy IDs, claim IDs, loss dates, repair facility IDs, paid amounts
- Unstructured text: Adjuster notes, repair estimates, police reports, customer statements, third-party medical reports
- External signals: Vehicle history reports (from Carfax/Experian), weather data (for hail claims), social media scrapes (public posts only)
Legal and ethical guardrails:
- GDPR/CCPA compliance: Exclude claims from EU/California if your model uses PII for training (some jurisdictions treat claim narratives as personal data).
- Bias audits: Review training data for demographic skew—fraud models have historically over-flagged claims from ZIP codes with lower average incomes.
- Model explainability: Document feature importance for regulators. I’ve seen models rejected by state insurance departments for being "black boxes."
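A first-pass bias audit of the kind described above can be as simple as comparing flag rates across income bands. The sketch below is illustrative only: the field names (`zip_income_band`, `flagged`) and the sample records are assumptions, and a real audit would use whatever fairness metric your compliance team has agreed on.

```python
from collections import defaultdict

def flag_rate_by_group(claims, group_key="zip_income_band", flag_key="flagged"):
    """Return {group: share of claims flagged} for a list of claim dicts."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for claim in claims:
        group = claim[group_key]
        totals[group] += 1
        flagged[group] += int(bool(claim[flag_key]))
    return {g: flagged[g] / totals[g] for g in totals}

def disparity_ratio(rates):
    """Highest group flag rate divided by the lowest (inf if the lowest is 0)."""
    lowest, highest = min(rates.values()), max(rates.values())
    return float("inf") if lowest == 0 else highest / lowest

# Hypothetical sample: field names and values are made up for illustration
sample = [
    {"zip_income_band": "low", "flagged": True},
    {"zip_income_band": "low", "flagged": True},
    {"zip_income_band": "low", "flagged": False},
    {"zip_income_band": "low", "flagged": False},
    {"zip_income_band": "high", "flagged": True},
    {"zip_income_band": "high", "flagged": False},
    {"zip_income_band": "high", "flagged": False},
    {"zip_income_band": "high", "flagged": False},
]
rates = flag_rate_by_group(sample)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A ratio well above 1 does not prove the model is biased, but it tells you which groups to sample for manual review before the model goes in front of a regulator.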
Resource estimate: 3–4 weeks for data acquisition if you’re pulling from multiple TPAs and legacy systems. The biggest bottleneck is usually PDF OCR accuracy—older estimates scanned as images will require manual correction.
Step 2: Text Preprocessing for Insurance-Specific Noise
Insurance text isn’t normal text. It’s full of:
- Typos disguised as jargon: "radiator" → "raditor", "estimate" → "estmate"
- Acronym soup: "PPO" (Preferred Provider Org) vs "PPO" (Parking Protection Ordinance)
- Numeric shorthand: "30K miles" vs "30,000 miles"
- Obfuscation: "additional work identified" instead of "we charged for extra parts"
Here’s a preprocessing pipeline that handles these quirks. We’ll use spaCy for linguistic parsing and scikit-learn for vectorization.
import re

import spacy

# Load spaCy's small general-purpose English model; swap in a domain-tuned
# pipeline here if you have one
nlp = spacy.load("en_core_web_sm")

def insurance_tokenizer(text):
    """Normalize insurance shorthand and return a space-joined string of lemmas."""
    # Normalize numerics (e.g., "30K" → "30000"); \g<1> avoids the ambiguous "\1000"
    text = re.sub(r'(\d+)k\b', r'\g<1>000', text, flags=re.IGNORECASE)
    # Drop thousands separators before distance units, keeping the unit
    # (e.g., "30,000 miles" → "30000 miles")
    text = re.sub(r'(\d+),(\d{3})\s*(miles|mi|km)\b', r'\1\2 \3', text, flags=re.IGNORECASE)
    # Fix common typos
    text = re.sub(r'\braditor\b', 'radiator', text, flags=re.IGNORECASE)
    text = re.sub(r'\bengin\b', 'engine', text, flags=re.IGNORECASE)
    # Expand "rep" to "repair" and remove other adjuster shorthand ("adj", "pt")
    text = re.sub(r'\brep\b', 'repair', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(adj|pt)\b', '', text, flags=re.IGNORECASE)
    # Lemmatize and lowercase (e.g., "repaired" → "repair", "Camry" → "camry"),
    # dropping stop words, punctuation, and whitespace tokens
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc if not (t.is_stop or t.is_punct or t.is_space)]
    return " ".join(tokens)

# Example preprocessing
text = "Adjuster noted raditor leak on 2018 Toyota Camry. Pt claims 30K miles. Estimate shows rep work."
cleaned = insurance_tokenizer(text)
print(cleaned)
# Expect something close to the line below; exact lemmas depend on the spaCy model version:
# "adjuster note radiator leak 2018 toyota camry claim 30000 mile estimate show repair work"
Trade-off: Aggressive normalization can strip away meaningful context. For example, "adjuster" vs "adjustor" might seem like a typo, but in some claims systems, "adjustor" refers to a specific role. Always validate preprocessing rules against a sample of your data.
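One way to validate rules against a sample is to run each rule on its own and collect before/after pairs for a reviewer to skim. This is a minimal sketch under assumptions: the rule table and sample texts are hypothetical, and the rules are simplified versions of the regexes used in this step.

```python
import re

# Hypothetical rule table: (name, compiled pattern, replacement)
RULES = [
    ("expand_k", re.compile(r"(\d+)k\b", re.IGNORECASE), r"\g<1>000"),
    ("fix_raditor", re.compile(r"\braditor\b", re.IGNORECASE), "radiator"),
    ("drop_shorthand", re.compile(r"\b(adj|pt)\b", re.IGNORECASE), ""),
]

def audit_rules(texts, rules=RULES):
    """For each rule, collect (before, after) pairs for the texts it changes."""
    report = {name: [] for name, _, _ in rules}
    for text in texts:
        for name, pattern, repl in rules:
            after = pattern.sub(repl, text)
            if after != text:
                report[name].append((text, after))
    return report

sample = [
    "Pt claims 30K miles on odometer",
    "raditor leak noted by adj",
    "Replaced PT cruiser bumper",  # "PT" is part of a vehicle name here
]
report = audit_rules(sample)
for name, changes in report.items():
    print(f"{name}: {len(changes)} of {len(sample)} texts changed")
```

The third sample shows why this matters: the shorthand rule strips "PT" from a vehicle name, which is exactly the kind of over-normalization a reviewer should catch before the rule ships.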
Step 3: Feature Engineering for Fraud Signals
Fraud in insurance isn’t just about "lying"—it’s about anomalous patterns. The best NLP features capture linguistic and temporal anomalies.
Linguistic features:
- Repetition score: Count how often repair facilities reuse the exact same phrase across claims (e.g., "additional work identified"). Fraudulent shops often copy-paste estimates.
- Passive voice ratio: Fraudulent narratives use passive voice to distance themselves ("damage was observed" vs "I saw damage").
- Quantitative inconsistency: Compare the number of parts listed in the text vs the number in the structured claim data.
Temporal features:
- Claim velocity: Time between estimate creation and repair completion. Fraudulent shops often inflate labor by adding "discovered damage" after the initial estimate.
- Repair facility churn: How often a facility appears in claims with inflated labor codes. A shop whose paid amounts run 40% above the expected amount on the same labor code across 50 claims is a red flag.
External signals:
- Vehicle age vs mileage mismatch: A 2010 Toyota with 15,000 miles should raise eyebrows. Compare repair estimates against VIN data.
- Weather correlation: For hail claims, check if the reported loss date aligns with historical weather data. Fraudulent hail claims often cluster on sunny days.
Here’s how to compute some of these features:
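from collections import Counter

def compute_repetition_score(text_series, n=3):
    """Per facility, the share of word n-grams that recur across its claims.

    Expects a pandas Series of claim texts indexed by 'repair_facility_id'.
    """
    def facility_score(texts):
        # Count each n-gram once per claim so one long estimate cannot dominate
        counts = Counter()
        for text in texts:
            words = text.lower().split()
            counts.update({tuple(words[i:i + n]) for i in range(len(words) - n + 1)})
        if not counts:
            return 0.0
        return sum(1 for c in counts.values() if c > 1) / len(counts)

    # One score between 0 and 1 per repair facility
    return text_series.groupby(level='repair_facility_id').apply(facility_score)

def compute_passive_voice_ratio(text):
    """Estimate passive voice usage with dependency parsing (uses the nlp pipeline from Step 2)."""
    doc = nlp(text)
    # "auxpass" marks passive auxiliaries in spaCy's English dependency labels
    passive_count = sum(1 for token in doc if token.dep_ == "auxpass")
    verb_count = sum(1 for token in doc if token.pos_ == "VERB")
    # Guard against texts with no verbs (e.g., a bare parts list)
    return passive_count / verb_count if verb_count else 0.0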
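The quantitative inconsistency feature listed above can be approximated by counting the distinct parts a narrative names and comparing that with the structured line-item count. This is a rough sketch: `PART_TERMS` is a hypothetical vocabulary, and in practice it would come from your parts catalog rather than a hard-coded list.

```python
import re

# Hypothetical part vocabulary; a real one would come from your parts catalog
PART_TERMS = ["radiator", "bumper", "hood", "fender", "headlight", "windshield"]
PART_PATTERN = re.compile(r"\b(" + "|".join(PART_TERMS) + r")s?\b", re.IGNORECASE)

def parts_mentioned(text):
    """Return the set of distinct catalog parts named in the narrative."""
    return {m.group(1).lower() for m in PART_PATTERN.finditer(text)}

def part_count_gap(text, structured_part_count):
    """Structured line-item count minus distinct parts named in the text."""
    return structured_part_count - len(parts_mentioned(text))

narrative = "Replaced radiator and front bumper. Radiator hoses reused."
gap = part_count_gap(narrative, structured_part_count=5)
print(parts_mentioned(narrative), gap)
```

A large positive gap means the invoice bills for parts the narrative never mentions, which is worth a closer look but is not proof of padding on its own.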
Trade-off: Linguistic features can be gamed. Fraudsters learn to mimic legitimate language patterns over time. I’ve seen shops start using active voice after a model flagged their passive-heavy narratives. Combine NLP with behavioral signals (e.g., claim velocity) to stay ahead.
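As a behavioral counterpart to the text features, claim velocity can be computed from two dates and compared with peer claims. The function names and the peer comparison below are assumptions for illustration; how you define a "peer" (same labor code, region, vehicle class) is a design decision this sketch does not settle.

```python
from datetime import date

def claim_velocity_days(estimate_created, repair_completed):
    """Days between estimate creation and repair completion."""
    return (repair_completed - estimate_created).days

def velocity_zscore(value, peer_values):
    """Distance of one claim's velocity from its peers, in standard deviations."""
    n = len(peer_values)
    mean = sum(peer_values) / n
    variance = sum((v - mean) ** 2 for v in peer_values) / n
    return 0.0 if variance == 0 else (value - mean) / variance ** 0.5

# Hypothetical claim and peer group (days to complete similar repairs)
velocity = claim_velocity_days(date(2024, 3, 1), date(2024, 3, 15))
z = velocity_zscore(velocity, peer_values=[4, 6, 5, 5])
print(velocity, round(z, 2))
```

Because this signal comes from timestamps rather than wording, a shop cannot change it by rewriting its narratives.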
Step 4: Labeling Strategy and Weak Supervision
You need labels to train a model, but fraud labels are scarce. Most insurers have a few hundred confirmed fraud cases from their Special Investigation Units (SIUs) but millions of claims. Here’s how to bridge the gap:
Option 1: Expert-labeled seed data
- Pull confirmed fraud cases from your SIU team (usually 100–500 claims).
- Pull confirmed legitimate claims (e.g., claims from policyholders with no other claims in the preceding 5 years).
- Use these as seed data for weak supervision.
Option 2: Anomaly detection as weak supervision
- Use unsupervised methods to flag outliers, then have adjusters review a sample.
- For padded repair claims, flag facilities with loss ratios >150% on the same labor code. Loss ratio = (Paid Amount) / (Expected Amount based on region/vehicle).
- This might give you 10k–50k "likely fraud" labels with minimal human effort.
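The loss-ratio heuristic in Option 2 can be sketched as a grouped sum of paid over expected amounts. The record layout and field names here are assumptions, and how `expected_amount` is derived from region and vehicle is left to your own pricing data.

```python
from collections import defaultdict

def facility_loss_ratios(claims):
    """Aggregate paid / expected per (facility, labor code) pair."""
    paid = defaultdict(float)
    expected = defaultdict(float)
    for c in claims:
        key = (c["repair_facility_id"], c["labor_code"])
        paid[key] += c["paid_amount"]
        expected[key] += c["expected_amount"]
    return {k: paid[k] / expected[k] for k in paid if expected[k] > 0}

def flag_outliers(ratios, threshold=1.5):
    """Return the (facility, labor code) pairs above the loss-ratio threshold."""
    return sorted(k for k, r in ratios.items() if r > threshold)

# Hypothetical claims; IDs, codes, and amounts are invented for illustration
claims = [
    {"repair_facility_id": "F1", "labor_code": "RAD-01", "paid_amount": 900.0, "expected_amount": 500.0},
    {"repair_facility_id": "F1", "labor_code": "RAD-01", "paid_amount": 700.0, "expected_amount": 500.0},
    {"repair_facility_id": "F2", "labor_code": "RAD-01", "paid_amount": 520.0, "expected_amount": 500.0},
]
ratios = facility_loss_ratios(claims)
flagged = flag_outliers(ratios)
print(ratios, flagged)
```

Summing before dividing weights each facility by dollars rather than by claim count, so one small outlier claim cannot push a high-volume shop over the threshold.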
Option 3: Synthetic labeling
- Use LLMs to generate synthetic fraud narratives based on real cases.
- Example prompt: "Generate 50 auto repair claim narratives for a fraudulent shop padding labor on Toyota Camry radiator replacements."
- Verify with domain experts—LLMs can hallucinate unrealistic details (e.g., "1987 Camry radiator replacement requiring 8 hours of labor").
Here’s a weak supervision pipeline using Snorkel:
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

# Snorkel label convention: -1 means "abstain", not "legitimate"
ABSTAIN, LEGIT, FRAUD = -1, 0, 1

# Define labeling functions (LFs)
@labeling_function()
def lf_high_loss_ratio(x):
    # Flag repair facilities with loss ratio > 150%
    return FRAUD if x['facility_loss_ratio'] > 1.5 else ABSTAIN

@labeling_function()
def lf_repetitive_phrase(x):
    # Flag claims with heavily reused phrases; otherwise stay silent
    return FRAUD if x['repetition_score'] > 0.8 else ABSTAIN

@labeling_function()
def lf_weather_mismatch(x):
    # Hail claims only: no matching weather event suggests fraud,
    # a matching event is treated as evidence of a legitimate claim
    if x['claim_type'] != 'hail':
        return ABSTAIN
    return LEGIT if x['weather_correlation'] else FRAUD

# Apply LFs to your dataset (df is your claims DataFrame with the columns used above)
lfs = [lf_high_loss_ratio, lf_repetitive_phrase, lf_weather_mismatch]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)

# Train a label model; it needs at least some LEGIT votes to separate the classes
label_model = LabelModel(cardinality=2)
label_model.fit(L_train=L_train, n_epochs=50, seed=42)
Trade-off: Weak supervision introduces label noise. A model trained on labels from loss ratios will inherit those ratios’ biases. Always validate weak labels against a small expert-labeled set. I’ve seen models flag claims from rural facilities (lower expected labor costs) as fraudulent because the loss ratio heuristic didn’t account for regional differences.
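Validating weak labels against an expert-labeled set comes down to precision, recall, and coverage. The sketch below assumes the Snorkel convention of -1 for abstain and 1 for fraud; the two label lists are invented for illustration.

```python
def weak_label_quality(weak, gold, abstain=-1, positive=1):
    """Precision, recall, and coverage of weak labels against expert labels."""
    voted = [(w, g) for w, g in zip(weak, gold) if w != abstain]
    tp = sum(1 for w, g in voted if w == positive and g == positive)
    fp = sum(1 for w, g in voted if w == positive and g != positive)
    # Missed fraud includes both "legit" votes and abstentions on true fraud
    fn = sum(1 for w, g in zip(weak, gold) if g == positive and w != positive)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "coverage": len(voted) / len(gold) if gold else 0.0,
    }

# Hypothetical labels: weak = label-model output, gold = SIU-confirmed outcome
weak = [1, 1, 0, -1, 1, 0, -1, 0]
gold = [1, 0, 0, 1, 1, 0, 0, 1]
quality = weak_label_quality(weak, gold)
print(quality)
```

Running the same check separately per region or facility type is a cheap way to surface the rural-facility problem described above before it reaches the trained model.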
Step 5: Model Selection and Training
For fraud detection, you need a model that:
- Handles class imbalance (fraud is rare: ~1–3% of claims)
- Provides interpretable features (adjusters need to explain flags)
- Works with mixed data types (text + structured features)
Recommended approaches:
- Text-only model (baseline)
  - Use TF-IDF + Logistic Regression or BERT for narrative text.
  - Pros: Simple, interpretable.
  - Cons: Misses structured fraud signals (e.g., claim velocity).
- Hybrid model (text + structured features)
  - Use CatBoost or XGBoost with:
    - Text embeddings (TF-IDF, BERT, or custom)
    - Structured fraud signals (loss ratio, claim velocity, etc.)
  - Pros: Captures both linguistic and behavioral patterns.
  - Cons: Requires careful feature scaling.
- Transformer-based (for high-volume claims)
  - Fine-tune RoBERTa or DeBERTa on insurance-specific corpus.
  - Add structured features via concatenation or cross-attention.
  - Pros: State-of-the-art performance on complex narratives.
  - Cons: Expensive to train and serve (~$2k/month for GPU inference at scale).
Here’s a hybrid model using CatBoost and TF-IDF embeddings:
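from catboost import CatBoostClassifier
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Text features. insurance_tokenizer returns a cleaned string, so it is passed
# as the preprocessor; the vectorizer's default token pattern then splits it.
text_vectorizer = TfidfVectorizer(
    preprocessor=insurance_tokenizer,
    max_features=5000,
    ngram_range=(1, 3)
)

# Structured features
structural_features = ['loss_ratio', 'claim_velocity', 'passive_voice_ratio']

# Combine features: TF-IDF on the narrative column, scaling on the numeric columns
preprocessor = ColumnTransformer([
    ('text', text_vectorizer, 'claim_narrative'),
    ('structural', StandardScaler(), structural_features)
])

# Train-test split, stratified so both splits keep the rare fraud class
X_train, X_test, y_train, y_test = train_test_split(
    df[['claim_narrative'] + structural_features],
    df['is_fraud'],
    test_size=0.2,
    stratify=df['is_fraud'],
    random_state=42
)

# Train CatBoost
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    class_weights=[1, 5],  # Adjust for imbalance
    verbose=100
)

# Fit the preprocessor on training data only, then reuse it for the test set
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
model.fit(X_train_processed, y_train)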
Model performance expectations:
- Baseline (TF-IDF + Logistic Regression): AUC ~0.75
- Hybrid (CatBoost + TF-IDF): AUC ~0.82
- Transformer (RoBERTa + structured features): AUC ~0.88
Trade-off: Transformer models are overkill for most fraud use cases. The marginal gain in AUC (0.82 vs 0.88) rarely justifies the 5–10x cost in compute and latency. Stick with hybrid models unless you’re processing tens of thousands of claims daily.
Step 6: Threshold Tuning and False Positive Management
Fraud models aren’t evaluated on AUC alone. The real metric is net savings:
Net Savings = (True Positives * Average Fraud Amount Saved) - (False Positives * Cost of Investigation)
Example:
- Your model flags 70 claims as fraudulent.
- Investigating each claim costs $200 (adjuster time + external vendor).
- 50 claims are actual fraud, averaging $1,500 in savings per claim.
- 20 claims are false positives, costing $4,000 in wasted investigation.
- Net savings = (50 * $1,500) - (20 * $200) = $75,000 - $4,000 = $71,000.
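The arithmetic in the example above reduces to a one-line function, and the same numbers give a break-even precision: the share of flags that must be real fraud before investigating pays for itself. The function names are illustrative, and the break-even formula assumes the simplified cost model used in this example.

```python
def net_savings(true_positives, false_positives, avg_fraud_saved, investigation_cost):
    """Savings from caught fraud minus the cost of chasing false positives."""
    return true_positives * avg_fraud_saved - false_positives * investigation_cost

def break_even_precision(avg_fraud_saved, investigation_cost):
    """Precision at which net savings is zero: p * saved == (1 - p) * cost."""
    return investigation_cost / (avg_fraud_saved + investigation_cost)

savings = net_savings(50, 20, avg_fraud_saved=1500, investigation_cost=200)
min_precision = break_even_precision(avg_fraud_saved=1500, investigation_cost=200)
print(savings, round(min_precision, 3))
```

With these figures the break-even precision is low, roughly 12%, which is why a fraud model can tolerate many false positives as long as each confirmed case recovers several times the cost of an investigation.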
Here’s how to optimize the threshold:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import precision_recall_curve

# Get predicted probabilities (the test set must go through the fitted preprocessor first)
y_probs = model.predict_proba(preprocessor.transform(X_test))[:, 1]

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)

# Plot (precision and recall have one more entry than thresholds, hence [:-1])
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.legend()
plt.show()

# Among thresholds with precision >= 0.80 and recall >= 0.60, pick the most precise
p, r = precision[:-1], recall[:-1]
eligible = (p >= 0.80) & (r >= 0.60)
if eligible.any():
    optimal_idx = np.argmax(np.where(eligible, p, 0))
    optimal_threshold = thresholds[optimal_idx]
    print(f"Optimal threshold: {optimal_threshold:.3f}")
else:
    print("No threshold meets both targets; relax one constraint or improve the model.")
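Since net savings is the metric this step argues for, an alternative is to pick the threshold that maximizes it directly instead of fixing precision and recall targets. This is a sketch under the same simplified cost model as the worked example; the labels, scores, and dollar figures are hypothetical.

```python
def best_threshold_by_savings(y_true, y_prob, avg_fraud_saved=1500, investigation_cost=200):
    """Try each observed score as a threshold and return (threshold, net savings)."""
    best = (None, float("-inf"))
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        savings = tp * avg_fraud_saved - fp * investigation_cost
        if savings > best[1]:
            best = (t, savings)
    return best

# Hypothetical labels and model scores, invented for illustration
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.60, 0.75, 0.80, 0.95]
best = best_threshold_by_savings(y_true, y_prob)
print(best)
```

On this toy data the savings-optimal threshold is low, because a missed fraud case costs far more than a wasted investigation. Whether your SIU has the capacity to investigate that many flags is a separate constraint the sweep does not model.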