1. Why Building Fancy Models is Not Enough—They Also Need Adequate Monitoring!

The job of a Data Scientist doesn’t end with building a model and moving it to production. That’s actually just the beginning: the start of a chain of events that can cause sleepless nights!

Machine learning has given us many different types of models to choose from: linear models, tree-based models, neural networks, ensembles, and more. I see many ML practitioners trying out multiple model types on a problem without really understanding the basic assumptions and requirements behind each one. But that’s a separate discussion for another time.

For now, let’s focus on what happens after you build your model.

What is a Machine Learning Model?

Simply put, a machine learning model learns patterns from historical data—it figures out the relationships between inputs and outputs. Once trained, the model uses these learned patterns to make predictions about new situations it hasn’t seen before. These predictions then drive real-world decisions and actions.

A Practical Example: Fraud Detection

Let me illustrate this with a practical case. Consider fraud detection in UPI transactions. We build a classification model (using supervised learning) that predicts whether an incoming UPI transaction is fraudulent or legitimate.

The model learns from historical data containing both fraudulent and legitimate transactions, identifying complex patterns that indicate risk. While classification models can output simple Yes/No predictions, we typically prefer probability scores (called Model Scores) because they give end users more flexibility in decision-making.
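To make the idea of Model Scores concrete, here is a minimal sketch of the difference between hard Yes/No predictions and probability scores. It uses scikit-learn on synthetic data; the features, labels, and setup are purely illustrative assumptions, not the actual fraud model.

```python
# Minimal sketch: probability scores (Model Scores) vs. hard Yes/No labels.
# Synthetic data; feature meanings are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))  # e.g. amount, hour-of-day, device age (illustrative)
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)  # 1 = fraud

model = LogisticRegression().fit(X, y)

hard_labels = model.predict(X[:5])          # plain Yes/No (1/0) decisions
scores = model.predict_proba(X[:5])[:, 1]   # probability of fraud, i.e. the Model Score

print(hard_labels)
print(scores.round(3))
```

The probability output is what gives end users flexibility: different teams can apply different cut-offs to the same score depending on their risk appetite.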

The Goal: Keep Model Scores Accurate

Every data scientist’s goal is to keep model scores as accurate as possible, close to reality, so that they continue to drive real business value.

Why do Model Scores become unreliable over time?

As mentioned earlier, models rely on complex relationships between input features and outcomes. Model scores can become unreliable when:

  1. Input features change – For example, a specific age group suddenly starts doing more transactions
  2. The outcome rate changes – For example, better compliance policies reduce overall fraud activity
  3. The relationship itself changes – For example, new fraud patterns emerge that didn’t exist during training

This typically happens when the training data becomes outdated compared to current data. As a result, the model’s prediction distribution shifts significantly from what it used to be. This phenomenon is called Model Prediction Drift.

This should not be confused with Model Performance Drift which I will cover separately sometime.

How to check if your model has drifted?

The simplest way is to compare the model’s score distribution at different points in time, looking at what percentage of the population falls into each score range.

Reference Time (Rt): the period when the model’s score distribution was considered stable and usable

Current Time (Ct): the period we’re evaluating for drift

Score Bucket | Reference Time (Rt) | Current Time (Ct)
0.00-0.10    | 22%                 | 15%
0.10-0.20    | 18%                 | 14%
0.20-0.30    | 12%                 | 16%
0.30-0.40    | 11%                 | 16%
0.40-0.50    |  9%                 | 15%
0.50-0.60    |  8%                 |  7%
0.60-0.70    |  7%                 |  6%
0.70-0.80    |  7%                 |  5%
0.80-0.90    |  5%                 |  4%
0.90-1.00    |  1%                 |  2%

Looking at this distribution, the model is now placing a noticeably larger share of the population in the middle score buckets (0.20-0.50) and a smaller share in the lowest buckets than before, i.e. the scores have shifted toward higher predicted fraud risk.

You can apply this same comparison across major business segments to monitor for drift. Typically, this monitoring is automated with alerts triggered whenever drift exceeds acceptable levels.
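As a rough sketch of how such a comparison can be produced in practice, the snippet below bins model scores from two time windows into 0.1-wide buckets and prints the share of the population in each. The score arrays here are synthetic stand-ins for real reference-time and current-time scores.

```python
# Sketch: compare bucket-wise score distributions between two time windows.
# Uses numpy only; the score arrays are synthetic placeholders.
import numpy as np

def bucket_distribution(scores, n_buckets=10):
    """Return the fraction of scores falling in each equal-width bucket on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    counts, _ = np.histogram(scores, bins=edges)
    return counts / counts.sum()

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=10_000)  # stand-in for scores at Reference Time (Rt)
current_scores = rng.beta(3, 4, size=10_000)    # stand-in for scores at Current Time (Ct)

ref_dist = bucket_distribution(reference_scores)
cur_dist = bucket_distribution(current_scores)

for i, (r, c) in enumerate(zip(ref_dist, cur_dist)):
    print(f"{i/10:.1f}-{(i+1)/10:.1f}: Rt={r:6.1%}  Ct={c:6.1%}")
```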

Population Stability Index (PSI)

One popular and easy-to-use measure for detecting univariate model drift is the Population Stability Index (PSI).

At a high level, PSI measures how much the proportion of records in each score bucket has changed between the reference and current periods, using a logarithmic scale. The total PSI is calculated by summing across all score buckets, where Expected% is a bucket’s share in the reference period and Actual% is its share in the current period:

PSI = Σ [ (Actual% - Expected%) * ln(Actual% / Expected%) ]

If

  • PSI < 0.1: small/minor drift
  • PSI is between 0.1 and 0.25: moderate drift (monitor/adjust)
  • PSI > 0.25: significant drift (the model likely needs retraining)

Detailed working of PSI {covered here}
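For readers who prefer code, here is a minimal PSI sketch following the formula and rules of thumb above. It assumes both periods are already bucketed into the same equal-width score ranges; the small epsilon guard is an assumption to avoid taking the log of, or dividing by, an empty bucket.

```python
# Minimal PSI sketch: sum of (Actual% - Expected%) * ln(Actual% / Expected%)
# over score buckets, with an epsilon guard for empty buckets (assumption).
import numpy as np

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two bucketed distributions (fractions summing to 1)."""
    expected = np.clip(np.asarray(expected_pct, dtype=float), eps, None)
    actual = np.clip(np.asarray(actual_pct, dtype=float), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Bucket shares from the Rt/Ct table above
rt = [0.22, 0.18, 0.12, 0.11, 0.09, 0.08, 0.07, 0.07, 0.05, 0.01]
ct = [0.15, 0.14, 0.16, 0.16, 0.15, 0.07, 0.06, 0.05, 0.04, 0.02]

value = psi(rt, ct)
if value < 0.10:
    verdict = "minor drift"
elif value <= 0.25:
    verdict = "moderate drift - monitor/adjust"
else:
    verdict = "significant drift - consider retraining"
print(f"PSI = {value:.3f} ({verdict})")
```

On the bucket shares from the table above, this works out to roughly 0.12, i.e. moderate drift under the rules of thumb listed earlier.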

While other drift measures exist (such as the Kolmogorov-Smirnov statistic and Jensen-Shannon divergence), PSI is the most commonly used because it is intuitive to explain and effective at catching drift.

For multivariate or more sophisticated drift detection, an autoencoder can be used. {covered here}
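As one possible sketch of that idea (not the exact approach referenced above), the snippet below trains a small autoencoder-style network on reference-period features and flags current records whose reconstruction error exceeds a reference-based threshold. The data, network architecture, and 95th-percentile alerting rule are all illustrative assumptions.

```python
# Sketch: multivariate drift via reconstruction error.
# scikit-learn's MLPRegressor is used as a stand-in autoencoder (input == target);
# the synthetic data and the 95th-percentile threshold are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_reference = rng.normal(size=(5000, 8))          # features from the reference window
X_current = rng.normal(loc=0.5, size=(5000, 8))   # current window, deliberately shifted

scaler = StandardScaler().fit(X_reference)
X_ref_s, X_cur_s = scaler.transform(X_reference), scaler.transform(X_current)

# A 3-unit bottleneck forces the network to learn the reference data's structure
autoencoder = MLPRegressor(hidden_layer_sizes=(3,), max_iter=500, random_state=1)
autoencoder.fit(X_ref_s, X_ref_s)

ref_err = np.mean((autoencoder.predict(X_ref_s) - X_ref_s) ** 2, axis=1)
cur_err = np.mean((autoencoder.predict(X_cur_s) - X_cur_s) ** 2, axis=1)

threshold = np.percentile(ref_err, 95)            # illustrative alerting rule
drift_share = float(np.mean(cur_err > threshold))
print(f"Share of current records above the reference error threshold: {drift_share:.1%}")
```

The intuition is that a model trained only to reconstruct reference-period data will reconstruct drifted data poorly, so a rising reconstruction error signals that the joint feature distribution has moved.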

Final Notes:

Building accurate prediction models is important, but it’s only half the job. Adequate model governance is equally critical to catch drift early.

In production systems, even minor drift in model predictions can cause significant business impact—revenue decline, customer dissatisfaction, or worse. That’s why monitoring isn’t optional—it’s essential.

If you are interested in 1-1 coaching/mentoring sessions, please check out here. (https://topmate.io/ashok_suthar_iitdhn/)

(Thoughts are personal but polished by GenAI companions)