FRAUD TRIAGE
A two-stage pipeline that catches what the model misses — XGBoost for speed, Google Gemini for the reasoning.
BRIEFING
Card networks process millions of transactions a day. Under 1% are fraudulent, but each one is real money lost. The job is two-sided: catch fraud without drowning analysts in false flags, and give them enough context to make a fast, defensible call on the ones that do get flagged. Most systems stop at a score — a 0.87 doesn't tell an analyst why a transaction looks wrong or what to do about it.
This project pairs two complementary AI systems so each does what it's best at: XGBoost rapidly scores every transaction, then Google Gemini takes the small fraction flagged as suspicious and produces a structured, analyst-actionable verdict with reasoning — the speed of classical ML with the explainability of generative AI.
HOW IT WORKS
STAGE 1 — XGBOOST
An XGBoost classifier, trained on 284,807 historical card transactions, scores every transaction with a fraud probability. It is fast and cheap enough to scale to millions of transactions a day, and it flags only the highest-risk ones for review.
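A minimal sketch of the flagging step, assuming the scores arrive as a mapping of transaction id to fraud probability. The function name, the ids, and the toy scores are hypothetical and are not taken from the project.

```python
import heapq


def flag_highest_risk(scores, k=100):
    """Return the k (transaction_id, score) pairs with the highest scores.

    `scores` is a hypothetical dict of transaction id -> fraud probability,
    for example built from the Stage 1 model's predicted probabilities.
    """
    # nlargest keeps a heap of size k, so it avoids sorting every score.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy scores for illustration only; these are not results from the project.
scores = {"tx-001": 0.02, "tx-002": 0.91, "tx-003": 0.40, "tx-004": 0.87}
flagged = flag_highest_risk(scores, k=2)
print(flagged)  # [('tx-002', 0.91), ('tx-004', 0.87)]
```

A top-k cut keeps the review queue at a fixed size regardless of volume. A fixed probability threshold would be the alternative if the queue is allowed to grow on bad days.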
STAGE 2 — GOOGLE GEMINI
A reasoning layer that examines each flagged transaction's raw features and writes a structured verdict: what's suspicious, why, and the recommended action. The LLM cost is paid only on the few cases that need deep analysis.
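One way the prompt for this layer could be assembled is sketched below. It assumes "most anomalous" means largest absolute z-score against training means and standard deviations. That ranking rule, the feature names, the prompt wording, and the JSON key names are all assumptions rather than the project's actual code.

```python
def top_anomalous_features(features, means, stds, n=5):
    """Rank features by |z-score| and keep the n most extreme (assumed rule)."""
    z_scores = {}
    for name, value in features.items():
        std = stds[name]
        # A zero-variance feature carries no anomaly signal; treat it as 0.
        z_scores[name] = 0.0 if std == 0 else (value - means[name]) / std
    ranked = sorted(z_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]


def build_prompt(score, features, means, stds, n=5):
    """Format the model score plus the top-n signals into a triage prompt."""
    signals = top_anomalous_features(features, means, stds, n)
    lines = [
        f"- {name}: value={features[name]:.3f}, z={z:+.2f}"
        for name, z in signals
    ]
    return (
        "You are a fraud triage analyst.\n"
        f"Model fraud score: {score:.6f}\n"
        "Most anomalous features:\n"
        + "\n".join(lines)
        + "\nRespond with JSON containing: verdict, confidence, "
        "primary_signals, reasoning, recommended_action."
    )


# Hypothetical feature names and statistics, for illustration only.
features = {"f1": 0.1, "f2": -6.0, "f3": 2.0, "amount": 150.0}
means = {"f1": 0.0, "f2": 0.0, "f3": 0.0, "amount": 100.0}
stds = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "amount": 50.0}

prompt = build_prompt(0.87, features, means, stds, n=2)
print(prompt)
```

Because the prompt carries raw feature values alongside the score, the LLM can reason from the features themselves rather than only echoing the model's confidence.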
In practice: overnight the model scores a million transactions and flags the top 100. Each flag is sent through the Gemini triage agent, which returns a standardized JSON case file — verdict, confidence, primary signals, reasoning, and a recommended action. The analyst reviews a decision instead of a bare number. What used to take ~5 minutes of investigation per case becomes ~30 seconds of verification.
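The case file can be checked before it reaches an analyst. The sketch below assumes snake_case key names for the five fields listed above and a list for the primary signals. The exact schema is not given in this write-up, so treat both as placeholders.

```python
import json

# Assumed key names for the five fields of the case file.
REQUIRED_FIELDS = (
    "verdict",
    "confidence",
    "primary_signals",
    "reasoning",
    "recommended_action",
)


def parse_case_file(raw):
    """Parse an LLM response and confirm every expected field is present."""
    case = json.loads(raw)
    if not isinstance(case, dict):
        raise ValueError("case file must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in case]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if not isinstance(case["primary_signals"], list):
        raise ValueError("primary_signals must be a list")
    return case


# Toy response for illustration only; not real model output.
raw_response = json.dumps({
    "verdict": "likely_fraud",
    "confidence": 0.9,
    "primary_signals": ["f2 far below typical range"],
    "reasoning": "Several features sit far outside their usual ranges.",
    "recommended_action": "hold_and_review",
})
case = parse_case_file(raw_response)
print(case["verdict"], case["recommended_action"])
```

Even when the response is syntactically valid JSON, a field check like this catches a response that is well-formed but incomplete.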
THE FINDING WORTH TELLING
During testing, a transaction scored 0.000000 — the ML model was certain it was legitimate. It was actually fraud. The model would have let it through. Reading the raw feature values rather than the model's confidence, the LLM noticed multiple anomalous signals and flagged it anyway. That's the whole case for two layers: the model is excellent on the bulk of cases but has occasional blind spots, and a second opinion reasoning differently sometimes catches what it missed.
THE ENGINEERING ARC
- Started with vanilla Gradient Boosting — the textbook baseline. It hit a wall at 0.60 AUC, barely better than random.
- Diagnosed the failure: a 577:1 class imbalance was overwhelming the loss function, and 1,081 duplicate rows produced bimodal predictions that broke the metric.
- Switched to XGBoost with weighted loss: scale_pos_weight=577 weights fraud examples 577× during training; L2 regularization smoothed the duplicate-row impact. AUC jumped from 0.60 to 0.978, at 84% recall and 84% precision on the rare class.
- Added the Gemini triage layer: a function that formats the XGBoost score plus the top features into a structured prompt and returns a strict-JSON verdict.
- Scaled to batch — a rate-limited processor that runs 100 transactions through the LLM in ~10 minutes with timing, token tracking, error handling, and caching.
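The weighting step in the arc above comes down to a simple ratio. The sketch below derives it from label counts and shows its effect on a hand-written weighted log loss. The helper names and toy labels are hypothetical, and the loss function is an illustration of the idea, not XGBoost's internal code.

```python
import math


def scale_pos_weight(labels):
    """Ratio of negative to positive examples, the usual starting value."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples to weight")
    return neg / pos


def weighted_log_loss(labels, probs, pos_weight=1.0):
    """Mean log loss with positive examples multiplied by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(labels)


# Toy data with the same 577:1 imbalance: 577 legitimate rows, 1 fraud row.
labels = [0] * 577 + [1]
ratio = scale_pos_weight(labels)
print(ratio)  # 577.0

# A lazy model that predicts 1% fraud probability for every row.
probs = [0.01] * len(labels)
unweighted = weighted_log_loss(labels, probs)
weighted = weighted_log_loss(labels, probs, pos_weight=ratio)
print(unweighted < weighted)  # True: the missed fraud row now dominates

# With xgboost installed, the ratio would be passed as
# XGBClassifier(scale_pos_weight=ratio).
```

Without the weight, predicting "legitimate" for everything costs almost nothing. With it, the single missed fraud row costs about as much as all 577 legitimate rows combined.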
PRODUCTION THINKING
- Structured output enforcement: Gemini is set to response_mime_type="application/json", guaranteeing valid JSON, with no regex parsing, no markdown fences, and no malformed responses.
- Signal compression: only the top 5 most anomalous features (not all 30) go to the LLM, keeping prompts focused and token cost low.
- Rate limiting & caching: 5s between calls to stay under the free-tier limit, with results saved to disk after every batch so a dropped session loses no work.
- Cost projections: tokens-per-call tracked and projected across 1K–1M daily volumes (~$30/month at 100K/day). This is the analysis that actually decides whether to deploy.
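The rate limiting and caching item can be sketched as a resumable loop. This version assumes string transaction ids, a caller-supplied call_llm function, and a JSON file as the cache. For simplicity it saves after every call rather than after every batch. All names here are hypothetical.

```python
import json
import os
import tempfile
import time


def run_batch(transactions, call_llm, cache_path, min_interval=5.0,
              sleep=time.sleep):
    """Send each uncached transaction to call_llm, pausing between calls.

    Returns (results, errors). Results are persisted to cache_path so a
    rerun after a dropped session skips work that is already done.
    """
    results = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            results = json.load(f)

    errors = {}
    made_call = False
    for tx_id, payload in transactions.items():
        if tx_id in results:
            continue  # already answered in an earlier run
        if made_call:
            sleep(min_interval)  # space out calls to respect the rate limit
        made_call = True
        try:
            results[tx_id] = call_llm(payload)
        except Exception as exc:
            # Failures are not cached, so they are retried on the next run.
            errors[tx_id] = str(exc)
            continue
        with open(cache_path, "w") as f:
            json.dump(results, f)
    return results, errors


# Stand-in for the real LLM call, used only to demonstrate the caching.
calls = []


def fake_llm(payload):
    calls.append(payload)
    return {"verdict": "review"}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "verdicts.json")
    txs = {"tx-001": "prompt one", "tx-002": "prompt two"}
    first, first_errors = run_batch(txs, fake_llm, path, sleep=lambda s: None)
    second, second_errors = run_batch(txs, fake_llm, path, sleep=lambda s: None)

print(len(calls))  # 2: the second run was served entirely from the cache
```

Injecting sleep as a parameter keeps the loop testable without waiting for real delays.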
RESULTS
STACK: XGBoost · Google Gemini · Python · scikit-learn · pandas