Most fraud systems hand the analyst a number. The model scores a transaction 0.87 and the
analyst has to figure out, from scratch, why it looks suspicious and what to do about it. So I
built a pipeline that splits the job between two systems that are each good at different things, and
gives the analyst a decision instead of a number.
A fast model scores every transaction and flags only the risky handful. A reasoning model reads those few in detail and writes a verdict with its reasoning. Speed where you need volume, judgment where you need a human to trust it.
01 A number is not a decision
A bare score is fine at low volume and brutal at scale. Card networks see millions of transactions a
day, under 1% of them fraud, each one real money. Handing a reviewer 0.87 and asking
them to reconstruct the case from scratch does not scale, and it leaves no audit trail of why a
transaction was blocked.
02 The pipeline at a glance
Two stages. The classical model handles the firehose; the language model handles the few cases that actually reach a human. You only pay the expensive model where it earns its keep.
03 Stage 1: XGBoost scores everything
A gradient-boosted model trained on 284,807 historical card transactions scores every transaction with a fraud probability. It is fast and cheap, so it scales to the full firehose and flags only the highest-risk handful for review. Most of the volume never needs a second look.
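As a rough illustration of the flagging step, here is a minimal sketch in plain Python. The names `flag_high_risk`, `score_fn`, and `toy_score`, and the 0.7 threshold, are hypothetical stand-ins and not taken from the project. In the real pipeline, `score_fn` would presumably wrap the trained XGBoost model's probability output.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Transaction = Dict[str, float]


def flag_high_risk(
    transactions: Sequence[Transaction],
    score_fn: Callable[[Transaction], float],
    threshold: float,
) -> List[Tuple[Transaction, float]]:
    """Score every transaction and keep only those at or above the threshold.

    The result is sorted riskiest first, so reviewers see the worst cases first.
    """
    scored = [(tx, score_fn(tx)) for tx in transactions]
    flagged = [(tx, score) for tx, score in scored if score >= threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged


def toy_score(tx: Transaction) -> float:
    """Hypothetical stand-in for the model: a larger |V14| means a higher score."""
    return min(1.0, abs(tx["V14"]) / 10.0)


batch = [{"V14": -0.3}, {"V14": -9.1}, {"V14": 1.2}, {"V14": -7.4}]
flagged = flag_high_risk(batch, toy_score, threshold=0.7)  # threshold is an assumption
for tx, score in flagged:
    print(tx, round(score, 2))
```

Everything below the threshold is dropped at this point, which is what keeps the second stage small.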
04 Stage 2: the LLM reasons about the few
Each flagged transaction goes to a Google Gemini triage agent that reads the raw feature values and writes a structured case file in a fixed JSON shape. It gives the verdict, the confidence, the primary signals, the reasoning, and a recommended action:
{
  "verdict": "block",
  "confidence": 0.92,
  "primary_signals": ["V14", "V10", "V12"],
  "reasoning": "Multiple anomalous patterns coincide with an elevated model score, consistent with card-testing fraud.",
  "recommended_action": "Block and trigger SMS verification."
}
The analyst reviews a decision instead of a bare number. What used to take ~5 minutes of investigation per case becomes ~30 seconds of verification. And because the LLM only sees the small slice the model already flagged, you only pay its cost where it earns its keep.
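A fixed shape is only useful if something checks it before the case file reaches the analyst. The sketch below shows one way that check might look, using only the standard library. `CaseFile`, `parse_case_file`, and the rule that confidence lies between 0 and 1 are assumptions made for illustration; the field names come from the example case file.

```python
import json
from dataclasses import dataclass
from typing import List

# Field names from the example case file, with the Python type each should have.
REQUIRED_FIELDS = {
    "verdict": str,
    "confidence": (int, float),
    "primary_signals": list,
    "reasoning": str,
    "recommended_action": str,
}


@dataclass(frozen=True)
class CaseFile:
    verdict: str
    confidence: float
    primary_signals: List[str]
    reasoning: str
    recommended_action: str


def parse_case_file(raw: str) -> CaseFile:
    """Parse the LLM response and reject anything that is off-shape."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("case file must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(data[name], bool) or not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    if not 0.0 <= data["confidence"] <= 1.0:  # assumed range
        raise ValueError("confidence must be between 0 and 1")
    if not all(isinstance(s, str) for s in data["primary_signals"]):
        raise ValueError("primary_signals must be a list of strings")
    return CaseFile(
        verdict=data["verdict"],
        confidence=float(data["confidence"]),
        primary_signals=list(data["primary_signals"]),
        reasoning=data["reasoning"],
        recommended_action=data["recommended_action"],
    )


raw_response = json.dumps({
    "verdict": "block",
    "confidence": 0.92,
    "primary_signals": ["V14", "V10", "V12"],
    "reasoning": "Multiple anomalous patterns coincide with an elevated model score.",
    "recommended_action": "Block and trigger SMS verification.",
})
case = parse_case_file(raw_response)
print(case.verdict, case.confidence, case.primary_signals)
```

Rejecting an off-shape response at this point means the analyst never sees a half-formed case file.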
05 The case that justified the whole thing
In one case, the LLM, reading the raw features rather than the model's confidence, noticed several anomalous signals and flagged the transaction anyway. That is the entire argument for two layers: the model is excellent on the bulk of cases but has occasional blind spots, and a second opinion that reasons differently sometimes catches what it missed.
06 The engineering arc
It did not start clean. Here is how a coin-flip baseline became a 0.978 model.
- Baseline stalled at 0.60 AUC
A vanilla gradient-boosting baseline was barely better than a coin flip. The cause was not the algorithm, it was the data: a 577-to-1 class imbalance drowned the loss function, and 1,081 duplicate rows pushed it toward bimodal predictions that broke the metric.
- Weight the rare class
Switching to XGBoost with scale_pos_weight=577 told the model to weight fraud examples 577× during training. L2 regularization smoothed the duplicate-row impact. AUC jumped from 0.60 to 0.978, at 84% recall and 84% precision on the rare class.
- Force clean JSON
The LLM layer enforces JSON with response_mime_type="application/json". No regex parsing, no markdown fences to strip.
- Keep the prompt tight
Only the top 5 most anomalous features go in the prompt, not all 30. Focused prompts, lower tokens, sharper reasoning.
- Batch and cache
A rate-limited processor handles 100 transactions in ~10 minutes with caching, so a dropped session loses no work. Cost projects cleanly across 1K to 1M daily volumes (about $30/month at 100K/day).
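Three of the steps above are small enough to sketch in plain Python. This is an illustration under stated assumptions, not the project's code. All function names are hypothetical. "Most anomalous" is assumed to mean the largest absolute z-score against training statistics, which the write-up does not specify. The cache is assumed to be a single JSON file keyed by transaction id.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict, List, Mapping, Sequence


def positive_class_weight(labels: Sequence[int]) -> float:
    """Negatives divided by positives, the usual starting value for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("need at least one positive example")
    return (len(labels) - positives) / positives


def top_anomalous_features(
    tx: Mapping[str, float],
    means: Mapping[str, float],
    stds: Mapping[str, float],
    k: int = 5,
) -> List[str]:
    """Return the k feature names with the largest absolute z-score (assumed metric)."""
    def z_score(name: str) -> float:
        return abs(tx[name] - means[name]) / stds[name] if stds[name] > 0 else 0.0
    return sorted(tx, key=z_score, reverse=True)[:k]


def triage_with_cache(
    transactions: Mapping[str, dict],
    triage_fn: Callable[[dict], dict],
    cache_path: Path,
    delay_seconds: float = 0.0,
) -> Dict[str, dict]:
    """Call triage_fn only for ids missing from the cache, saving after every call.

    Returns the whole cache, including entries written by earlier sessions.
    """
    cache: Dict[str, dict] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
    for tx_id, tx in transactions.items():
        if tx_id in cache:
            continue  # already triaged in an earlier session
        cache[tx_id] = triage_fn(tx)
        cache_path.write_text(json.dumps(cache))  # persist so a crash loses nothing
        time.sleep(delay_seconds)  # crude rate limit between calls
    return cache


weight = positive_class_weight([0] * 577 + [1])

tx = {"V14": -8.0, "V10": -5.0, "V12": -6.0, "V4": 0.5, "Amount": 1.0, "V1": 0.1, "V2": -3.0}
zero_means = {name: 0.0 for name in tx}
unit_stds = {name: 1.0 for name in tx}
top = top_anomalous_features(tx, zero_means, unit_stds)

calls = []

def fake_triage(item: dict) -> dict:
    """Hypothetical stand-in for the LLM call."""
    calls.append(item)
    return {"verdict": "block"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "triage_cache.json"
    triage_with_cache({"t1": tx}, fake_triage, path)  # first session
    results = triage_with_cache({"t1": tx, "t2": tx}, fake_triage, path)  # resumed session
```

In the resumed session only `t2` triggers a call, because `t1` is already on disk. Rewriting the whole file after each call is simple but slow for large caches, so an append-only log or a small database might suit higher volumes better.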
07 Why the shape matters
The interesting part is not the AUC. It is the shape. This is how major payments teams are starting to deploy LLMs internally: as a reasoning layer on top of traditional ML, not as a replacement. The model gives you speed. The LLM gives you explainability and an audit trail. You do not have to pick.