MSN-01 // DEPLOYED

FRAUD TRIAGE

A two-stage pipeline that catches what the model misses — XGBoost for speed, Google Gemini for the reasoning.

0.978 AUC · 84% Recall · 84% Precision · 284K Records · ~2s Latency · $0 on Free Tier

BRIEFING

Card networks process millions of transactions a day. Under 1% are fraudulent, but each one is real money lost. The job is two-sided: catch fraud without drowning analysts in false flags, and give them enough context to make a fast, defensible call on the ones that do get flagged. Most systems stop at a score — a 0.87 doesn't tell an analyst why a transaction looks wrong or what to do about it.

This project pairs two complementary AI systems so each does what it's best at: XGBoost rapidly scores every transaction, then Google Gemini takes the small fraction flagged as suspicious and produces a structured, analyst-actionable verdict with reasoning — the speed of classical ML with the explainability of generative AI.

ROLE / METHOD / OUTCOME

Role
Solo build, end to end — modeling, LLM layer, and batch pipeline.
Method
Weighted-loss XGBoost plus a Gemini triage layer with strict JSON output.
Outcome
A held-out AUC of 0.978, with a reasoning trail analysts can audit.

HOW IT WORKS

STAGE 1 — XGBOOST

An XGBoost classifier trained on 284,807 historical card transactions assigns every transaction a fraud probability. Fast and cheap, it scales to millions of transactions a day and flags only the highest-risk ones for review.
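The method summary mentions a weighted loss. One common way to get that in XGBoost is the `scale_pos_weight` parameter, usually set to the ratio of legitimate to fraudulent examples. The sketch below assumes that approach (this page does not say which weighting was actually used) and runs on made-up label counts, not the project's data.

```python
from collections import Counter


def positive_class_weight(labels):
    """Return negatives / positives, a common heuristic for scale_pos_weight."""
    counts = Counter(labels)
    positives, negatives = counts[1], counts[0]
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# Illustrative labels only: 990 legitimate (0) and 10 fraudulent (1).
labels = [0] * 990 + [1] * 10
weight = positive_class_weight(labels)

# Hypothetical parameter dict; "objective" and "scale_pos_weight" are
# XGBoost parameter names, the values here are for illustration.
params = {"objective": "binary:logistic", "scale_pos_weight": weight}
print(params)
```

With a weight like this, each missed fraud case costs the model roughly as much as that many misclassified legitimate ones, which pushes training toward recall on the rare class.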

STAGE 2 — GOOGLE GEMINI

A reasoning layer that examines each flagged transaction's raw features and writes a structured verdict: what's suspicious, why, and the recommended action. The LLM cost is paid only on the few cases that need deep analysis.
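Strict JSON output is only useful if malformed responses are rejected before they reach an analyst. The sketch below shows one way to check a response. The key names, the 0 to 1 confidence range, and the sample response are all assumptions for illustration; this page names the fields but not their exact spelling or types.

```python
import json

# Assumed key names and types; hypothetical, not taken from the project.
REQUIRED_FIELDS = {
    "verdict": str,
    "confidence": (int, float),
    "primary_signals": list,
    "reasoning": str,
    "recommended_action": str,
}


def parse_case_file(raw_text):
    """Parse an LLM response and reject anything that is not a complete case file."""
    # json.loads raises json.JSONDecodeError (a ValueError) on non-JSON text.
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("case file must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(data["confidence"], bool):
        raise ValueError("confidence must be a number")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data


# Made-up response text for illustration; not real model output.
sample = json.dumps({
    "verdict": "suspicious",
    "confidence": 0.9,
    "primary_signals": ["placeholder signal"],
    "reasoning": "placeholder reasoning",
    "recommended_action": "hold for review",
})
case = parse_case_file(sample)
print(case["verdict"], case["recommended_action"])
```

A response that fails validation can be retried or routed to manual review rather than shown as a verdict.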

In practice: overnight the model scores a million transactions and flags the top 100. Each flag is sent through the Gemini triage agent, which returns a standardized JSON case file — verdict, confidence, primary signals, reasoning, and a recommended action. The analyst reviews a decision instead of a bare number. What used to take ~5 minutes of investigation per case becomes ~30 seconds of verification.
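The overnight flow above comes down to a top-K selection followed by some simple arithmetic on analyst time. A minimal sketch, assuming scores are held in a dictionary keyed by transaction ID; the function name `flag_top_k` and the scores are made up for illustration.

```python
import heapq


def flag_top_k(scores, k):
    """Return the k (transaction_id, score) pairs with the highest fraud scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up scores for illustration.
scores = {"t1": 0.02, "t2": 0.91, "t3": 0.40, "t4": 0.87, "t5": 0.01}
flagged = flag_top_k(scores, k=2)
print(flagged)  # [('t2', 0.91), ('t4', 0.87)]

# Analyst time for 100 flags, using the per-case figures quoted above.
flags = 100
minutes_before = flags * 5       # ~5 minutes of investigation per case
minutes_after = flags * 30 / 60  # ~30 seconds of verification per case
print(minutes_before, minutes_after)  # 500 50.0
```

On those figures, a queue of 100 flags drops from roughly 500 analyst-minutes to roughly 50.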

THE FINDING WORTH TELLING

During testing, a transaction scored 0.000000 — the ML model was certain it was legitimate. It was actually fraud. The model would have let it through. Reading the raw feature values rather than the model's confidence, the LLM noticed multiple anomalous signals and flagged it anyway. That's the whole case for two layers: the model is excellent on the bulk of cases but has occasional blind spots, and a second opinion reasoning differently sometimes catches what it missed.

THE ENGINEERING ARC

PRODUCTION THINKING

RESULTS

Records processed
284,807
Held-out AUC
0.978
Recall / Precision
84% / 84%
LLM latency
~2s / verdict
Run cost
$0 (free tier)

STACK: XGBoost · Google Gemini · Python · scikit-learn · pandas
