Two-Stage Fraud Detection: XGBoost for Speed, an LLM for Reasoning

Most fraud systems hand the analyst a number. The model scores a transaction 0.87 and the analyst has to figure out, from scratch, why it looks suspicious and what to do about it. So I built a pipeline that splits the job between two systems that are each good at different things, and gives the analyst a decision instead of a number.

// The whole pipeline in three sentences

A fast model scores every transaction and flags only the risky handful. A reasoning model reads those few in detail and writes a verdict with its reasoning. Speed where you need volume, judgment where you need a human to trust it.

01 A number is not a decision

A bare score is fine at low volume and brutal at scale. Card networks see millions of transactions a day, under 1% of them fraud, each one real money. Handing a reviewer 0.87 and asking them to reconstruct the case from scratch does not scale, and it leaves no audit trail of why a transaction was blocked.

02 The pipeline at a glance

Two stages. The classical model handles the firehose; the language model handles the few cases that actually reach a human. You only pay the expensive model where it earns its keep.

[Figure: pipeline diagram. Every transaction goes through Stage 1; only the top-risk few reach Stage 2. Teal marks the reasoning path.]

03 Stage 1: XGBoost scores everything

A gradient-boosted model trained on 284,807 historical card transactions scores every transaction with a fraud probability. It is fast and cheap, so it scales to the full firehose and flags only the highest-risk handful for review. Most of the volume never needs a second look.
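How the "highest-risk handful" gets picked is not spelled out here. One plausible reading is a top-k cut with a score floor. The sketch below assumes exactly that; the function name `flag_top_risk`, the transaction ids, and the values of `k` and `min_score` are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def flag_top_risk(
    scores: Dict[str, float], k: int, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Return up to k (transaction_id, score) pairs, highest score first."""
    if k <= 0:
        return []
    # Drop anything below the floor, then keep the k largest scores.
    candidates = ((tx_id, s) for tx_id, s in scores.items() if s >= min_score)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Hypothetical Stage 1 output: transaction id -> fraud probability.
stage1_scores = {
    "tx-001": 0.02,
    "tx-002": 0.87,
    "tx-003": 0.40,
    "tx-004": 0.95,
    "tx-005": 0.01,
}

review_queue = flag_top_risk(stage1_scores, k=2, min_score=0.5)
print(review_queue)  # [('tx-004', 0.95), ('tx-002', 0.87)]
```

A fixed `k` keeps the Stage 2 bill predictable; a pure threshold would instead let the queue grow on a bad day. Which one fits depends on how much reviewer time is available.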

04 Stage 2: the LLM reasons about the few

Each flagged transaction goes to a Google Gemini triage agent that reads the raw feature values and writes a structured case file in a fixed JSON shape. It gives the verdict, the confidence, the primary signals, the reasoning, and a recommended action:

{
  "verdict": "block",
  "confidence": 0.92,
  "primary_signals": ["V14", "V10", "V12"],
  "reasoning": "Multiple anomalous patterns coincide with an elevated model score, consistent with card-testing fraud.",
  "recommended_action": "Block and trigger SMS verification."
}

The analyst reviews a decision instead of a bare number. What used to take ~5 minutes of investigation per case becomes ~30 seconds of verification. And because the LLM only sees the small slice the model already flagged, its cost scales with the review queue rather than with total transaction volume.
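A fixed JSON shape is only useful if something downstream checks it. Here is a minimal sketch of that check using only the standard library. The five field names come from the case file above; the set of allowed verdicts is an assumption (only "block" is shown in this post), and `CaseFile` and `parse_case_file` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List

# Assumption: the post only shows "block"; "review" and "allow" are illustrative.
ALLOWED_VERDICTS = {"block", "review", "allow"}
REQUIRED_FIELDS = (
    "verdict", "confidence", "primary_signals", "reasoning", "recommended_action",
)


@dataclass(frozen=True)
class CaseFile:
    verdict: str
    confidence: float
    primary_signals: List[str]
    reasoning: str
    recommended_action: str


def parse_case_file(raw: str) -> CaseFile:
    """Parse one JSON reply; raise ValueError if it is off-shape."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("case file must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    verdict = data["verdict"]
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    signals = data["primary_signals"]
    if not isinstance(signals, list) or not all(isinstance(s, str) for s in signals):
        raise ValueError("primary_signals must be a list of strings")
    return CaseFile(
        verdict=verdict,
        confidence=float(confidence),
        primary_signals=list(signals),
        reasoning=str(data["reasoning"]),
        recommended_action=str(data["recommended_action"]),
    )


# The example case file from this section, round-tripped through the parser.
reply = json.dumps({
    "verdict": "block",
    "confidence": 0.92,
    "primary_signals": ["V14", "V10", "V12"],
    "reasoning": "Multiple anomalous patterns coincide with an elevated "
                 "model score, consistent with card-testing fraud.",
    "recommended_action": "Block and trigger SMS verification.",
})
case = parse_case_file(reply)
print(case.verdict, case.confidence, case.primary_signals)
```

Rejecting an off-shape reply loudly is a deliberate choice in this sketch: a case file with a missing field is worse than no case file, because the analyst may not notice the gap.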

05 The case that justified the whole thing

Score: 0.000000. During testing, one transaction scored a perfect zero. The model was certain it was legitimate. It was actually fraud, and the model would have let it straight through.

The LLM, reading the raw features rather than the model's confidence, noticed several anomalous signals and flagged it anyway. That is the entire argument for two layers: the model is excellent on the bulk of cases but has occasional blind spots, and a second opinion that reasons differently sometimes catches what it missed.

06 The engineering arc

It did not start clean. Here is how a near-coin-flip baseline became a 0.978 AUC model.

Baseline stalled at 0.60 AUC
A vanilla gradient-boosting baseline was barely better than a coin flip. The cause was the data, not the algorithm: with a 577-to-1 class imbalance, legitimate transactions drowned out the fraud signal in the loss function, and 1,081 duplicate rows pushed the model toward bimodal predictions that broke the metric.
Weight the rare class
Switching to XGBoost with scale_pos_weight=577 told the model to weight each fraud example 577× as heavily as a legitimate one during training. L2 regularization damped the effect of the duplicate rows. AUC jumped from 0.60 to 0.978, at 84% recall and 84% precision on the rare class.
Force clean JSON
The LLM layer enforces JSON with response_mime_type="application/json". No regex parsing, no markdown fences to strip.
Keep the prompt tight
Only the top 5 most anomalous features go in the prompt, not all 30. Focused prompts, lower tokens, sharper reasoning.
Batch and cache
A rate-limited processor handles 100 transactions in ~10 minutes with caching, so a dropped session loses no work. Cost projects cleanly across 1K to 1M daily volumes (about $30/month at 100K/day).
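The "Batch and cache" step can be sketched with the standard library alone. This is an illustration under assumptions, not the processor described above: `process_batch`, the `"id"` key, and the JSON cache file are hypothetical, and the 6-second default spacing is simply 10 minutes divided by 100 transactions, not a documented rate limit.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_batch(
    transactions: Iterable[dict],
    triage: Callable[[dict], dict],
    cache_path: Path,
    min_interval_s: float = 6.0,
) -> Dict[str, dict]:
    """Triage each transaction once, skipping ids already in the cache file."""
    cache: Dict[str, dict] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
    last_call = None
    for tx in transactions:
        tx_id = tx["id"]
        if tx_id in cache:
            continue  # already handled in an earlier (possibly dropped) session
        if last_call is not None:
            # Space out calls so they stay under the assumed rate limit.
            wait = min_interval_s - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        cache[tx_id] = triage(tx)
        # Persist after every call so an interruption loses no finished work.
        cache_path.write_text(json.dumps(cache))
    return cache


calls = []


def fake_triage(tx: dict) -> dict:
    """Hypothetical stand-in for the LLM call; records which ids were sent."""
    calls.append(tx["id"])
    return {"verdict": "block"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "triage_cache.json"
    batch = [{"id": "tx-1"}, {"id": "tx-2"}]
    process_batch(batch, fake_triage, cache_file, min_interval_s=0.0)
    # Second run simulates resuming after a dropped session.
    results = process_batch(
        batch + [{"id": "tx-3"}], fake_triage, cache_file, min_interval_s=0.0
    )

print(calls)  # ['tx-1', 'tx-2', 'tx-3']
```

Rewriting the whole cache file after each call is wasteful at large volumes, but at 100 transactions per batch it keeps the resume logic trivial. An append-only log or a small database would be the natural next step.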

XGBoost · Google Gemini · scale_pos_weight=577 · JSON mode · L2 regularization · Batch + caching
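The arithmetic behind the "Weight the rare class" step is worth seeing once. The sketch below uses toy counts that preserve the 577-to-1 ratio (they are not the real dataset counts) to show two things: why a model that never predicts fraud still looks accurate, and how `scale_pos_weight` rebalances the classes. The helper names are hypothetical, and the `reg_lambda` value is a placeholder, since the L2 strength actually used is not stated.

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Common heuristic: weight positives by the negative-to-positive ratio."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


def always_legit_accuracy(n_negative: int, n_positive: int) -> float:
    """Accuracy of a useless model that labels every transaction legitimate."""
    return n_negative / (n_negative + n_positive)


# Toy counts with a 577-to-1 ratio (illustrative, not the real data).
n_pos, n_neg = 100, 57_700

weight = scale_pos_weight(n_neg, n_pos)
accuracy = always_legit_accuracy(n_neg, n_pos)

print(weight)             # 577.0
print(f"{accuracy:.4f}")  # 0.9983

# With the weight applied, both classes carry the same total weight in the loss.
assert n_pos * weight == n_neg

# How the weight might be passed to XGBoost. The parameter names are real
# XGBoost parameters; reg_lambda=1.0 is a placeholder, not the author's value.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": weight,
    "reg_lambda": 1.0,  # L2 regularization strength (assumed)
}
```

The 99.83% figure is the reason accuracy is the wrong yardstick here: a model can reach it while catching zero fraud, which is why the numbers above are quoted as AUC, recall, and precision on the rare class.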

07 Why the shape matters

The interesting part is not the AUC. It is the shape. This is how major payments teams are starting to deploy LLMs internally: as a reasoning layer on top of traditional ML, not as a replacement. The model gives you speed. The LLM gives you explainability and an audit trail. You do not have to pick.

