3. Cross-Validation – Estimating Uncertainty, Not Just Performance

In the last two posts we built a clean, uncontaminated signal. Now the question is: how much should you trust what that signal tells you?

If your test set happened to contain easier transactions – quieter fraud patterns, cleaner signals (a non-representative sample, in ML terms) – your model looks better than it is. If the test set happened to cover a hard month, the model looks worse. One split, one number, one gamble.

Cross-validation exists because a single estimate is not enough. You need to know how stable that estimate is.

What it actually does

Instead of one train/test split, you make several. Each time, a different portion of the data becomes the test set. The model trains on the rest. You collect multiple performance estimates and look at them together – not just the average, but the spread.

That spread is the real output. It tells you how sensitive your model’s performance is to which data it saw. A tight spread means stable generalization. A wide spread means your model is fragile – highly dependent on the specific sample it trained on.
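To make the mechanics concrete, here is a minimal sketch of k-fold cross-validation using only the Python standard library. Everything in it is an assumption made for illustration: the data is synthetic, the "model" is a toy one-feature threshold standing in for whatever you actually train, and the function names (`k_fold_indices`, `fit_threshold`, `cross_validate`) are made up for this example.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once, then deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy 'model': the midpoint between the two class means of one feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def accuracy(threshold, xs, ys):
    """Fraction of samples where 'x above threshold' matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(ys)


def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        threshold = fit_threshold([xs[i] for i in train_idx],
                                  [ys[i] for i in train_idx])
        scores.append(accuracy(threshold,
                               [xs[i] for i in test_idx],
                               [ys[i] for i in test_idx]))
    return scores


# Synthetic data (assumption): roughly 30% positives, one noisy feature.
rng = random.Random(42)
ys = [int(rng.random() < 0.3) for _ in range(500)]
xs = [rng.gauss(2.0 if y else 0.0, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
print("fold scores:", [round(s, 3) for s in scores])
print("mean:", round(statistics.mean(scores), 3))
print("spread (std dev):", round(statistics.stdev(scores), 3))
```

The last two lines are the point: the mean is the number you would have reported anyway, and the standard deviation across folds is the part a single split can never give you.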

Daniel Kahneman, in his book Thinking, Fast and Slow, describes this as the difference between confidence and accuracy. We are wired to trust a single confident number more than a range of uncertain ones. Cross-validation forces the range into view whether you want it or not.

What breaks without it

You make model selection decisions – this architecture over that one, these features over those – based on a single measurement that might not hold.

In payments, this matters a lot. Fraud patterns shift across months, regions, and transaction types. A model that looks strong on one test window may be quietly worse everywhere else. Cross-validation surfaces that weakness before production does.
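When the shift you care about is over time, one option is to split by period instead of at random, so the model is always tested on a month it has not seen. The sketch below shows an expanding-window split. It assumes each transaction carries a month label. The month values and the `expanding_window_splits` name are hypothetical, chosen only for this example.

```python
def expanding_window_splits(periods, min_train=3):
    """Yield (train_periods, test_period) pairs.

    Each split trains on every period before the test period,
    so no future data leaks into training.
    """
    ordered = sorted(set(periods))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical month labels; ISO-style strings sort chronologically.
months = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"]

splits = list(expanding_window_splits(months, min_train=3))
for train, test in splits:
    print(f"train {train[0]}..{train[-1]} -> test {test}")
```

Scoring the model once per split gives one estimate per test month, which shows directly whether performance holds up or drifts as the window moves forward.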

The tradeoff

Compute. You are training the model multiple times. In prototype work, this is negligible. On large datasets with expensive models, it becomes a real infrastructure decision.

This is why cross-validation strategy (how many folds, what kind, time-based or random) is an architectural choice, not a default setting.
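A back-of-the-envelope calculation shows how quickly the compute adds up. This is a rough sketch under stated assumptions: every fit costs about the same, fits run one after another, and the 30-minute figure is invented for illustration. The function names are hypothetical.

```python
def total_fits(n_folds, n_candidates=1, n_repeats=1):
    """Number of training runs: one per fold, per candidate model, per repeat."""
    return n_folds * n_candidates * n_repeats


def estimated_hours(n_folds, minutes_per_fit, n_candidates=1, n_repeats=1):
    """Sequential wall-clock estimate, assuming every fit takes the same time."""
    return total_fits(n_folds, n_candidates, n_repeats) * minutes_per_fit / 60


# Illustrative numbers only: 5 folds, 20 candidate configurations, 30 min per fit.
fits = total_fits(n_folds=5, n_candidates=20)
hours = estimated_hours(n_folds=5, minutes_per_fit=30, n_candidates=20)
print(f"{fits} fits, about {hours:.0f} hours if run sequentially")
```

The estimate is slightly pessimistic, since each fold trains on only part of the data, but it is close enough to decide whether the fold count needs to be budgeted for.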

The mental model to carry forward

Cross-validation is not a better way to measure performance.

It is a way to measure uncertainty in your performance estimate – which is a different and more honest question.

A model with 85% accuracy and low variance across folds is a more trustworthy system than one with 88% accuracy from a single lucky split. Always choose the first one.
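The comparison can be sketched in a few lines. The scores below are made-up numbers that mirror the example above, not results from any real model, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(scores):
    """Mean and spread of a list of scores; spread is undefined for one split."""
    spread = statistics.stdev(scores) if len(scores) > 1 else None
    return {"mean": statistics.mean(scores), "spread": spread, "n": len(scores)}


# Made-up scores for illustration only.
model_a_folds = [0.84, 0.85, 0.86, 0.85, 0.85]  # five cross-validation folds
model_b_single = [0.88]                          # one train/test split

a = summarize(model_a_folds)
b = summarize(model_b_single)
print("model A:", a)
print("model B:", b)
```

Model B's spread comes back as `None`: with one split there is nothing to compute it from, so the 88% arrives with no indication of how far it could move on different data.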

But stability of an estimate only matters if you’re estimating the right thing. Cross-validation tells you how consistently your model performs. It says nothing about whether the metric you’re measuring actually reflects the decision you’re trying to make. That’s the problem the next post is all about.