
Imbalanced Data: When One Class Is Rare

When 99% of rows are one class, plain accuracy lies — here is how to train and judge fairly.

What you will learn

See why accuracy fails on imbalanced data
Fix the balance with class_weight or resampling
Judge the model with recall, not accuracy

The trap of a rare class

Many of the most important problems are imbalanced: out of 10,000 card payments maybe 50 are fraud; out of 10,000 patients maybe 100 have a rare disease. The class you actually care about (fraud, disease, churn) is the rare one, and that changes how you must train and evaluate the model.

Suppose 99% of payments are genuine and 1% are fraud. A lazy model that simply predicts “genuine” every single time scores 99% accuracy and catches zero fraud. High accuracy, completely useless. This is why accuracy alone is a poor metric on imbalanced data.

A worked example: the lazy 99% model

Let us make a fake dataset with 990 genuine and 10 fraud rows, then check what “always predict genuine” scores.

A do-nothing model gets 99% accuracy but catches no fraud

from sklearn.metrics import accuracy_score, recall_score

# 990 genuine (0) and 10 fraud (1)
y_true = [0] * 990 + [1] * 10
y_lazy = [0] * 1000          # predict "genuine" for everyone

print('Accuracy:', round(accuracy_score(y_true, y_lazy), 3))
print('Recall on fraud:', recall_score(y_true, y_lazy))

Note: Output:

Accuracy: 0.99
Recall on fraud: 0.0

The lazy model looks brilliant on accuracy (0.99), yet its recall on fraud is 0: it caught none of the 10 fraud cases. On imbalanced data, recall on the rare class tells the real story; accuracy hides the failure.
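To see where those two numbers come from, it can help to count the four outcomes directly. The sketch below rebuilds the same 990/10 labels and derives accuracy and recall by hand from the confusion-matrix counts; the variable names are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Same toy labels as above: 990 genuine (0), 10 fraud (1)
y_true = [0] * 990 + [1] * 10
y_lazy = [0] * 1000  # always predict "genuine"

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_lazy, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all rows predicted correctly
recall = tp / (tp + fn)                     # share of real fraud rows that were caught

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('Accuracy:', accuracy)
print('Recall on fraud:', recall)
```

All 990 correct predictions are true negatives and all 10 fraud rows are false negatives, which is why accuracy is high while recall is zero.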

Fix 1 — tell the model the rare class matters

The easiest fix needs no new data: set class_weight='balanced'. This weights each class inversely to its frequency, so mistakes on the rare class hurt more and the model stops ignoring it. Many scikit-learn classifiers accept it, including LogisticRegression, SVC and RandomForestClassifier.

class_weight='balanced' makes the model take the rare class seriously

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# build an imbalanced dataset: ~5% positive
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

plain    = LogisticRegression(max_iter=500).fit(Xtr, ytr)
weighted = LogisticRegression(max_iter=500,
                              class_weight='balanced').fit(Xtr, ytr)

print('Recall, plain   :', round(recall_score(yte, plain.predict(Xte)), 2))
print('Recall, balanced:', round(recall_score(yte, weighted.predict(Xte)), 2))

Note: Output:

Recall, plain   : 0.45
Recall, balanced: 0.83

The plain model caught under half the rare cases. Just adding class_weight='balanced' nearly doubled recall, so it now catches most of them. (Catching more rare cases usually costs some extra false alarms, that is, lower precision; it is a trade-off you tune to the problem.)
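For intuition about what 'balanced' means: scikit-learn documents the balanced weight of each class as n_samples / (n_classes * count of that class), so the rarer the class, the larger its weight. The sketch below computes those weights for a hypothetical 950/50 label split (made-up numbers, not the dataset above) and checks them against the formula.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels for illustration: 950 negatives, 50 positives
y = np.array([0] * 950 + [1] * 50)
classes = np.array([0, 1])

# Weights scikit-learn would use for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# Same numbers by hand: n_samples / (n_classes * per-class count)
manual = len(y) / (len(classes) * np.bincount(y))

print('balanced weights:', dict(zip(classes.tolist(), weights.round(3).tolist())))
print('manual formula  :', manual.round(3))
```

In this toy split each positive row gets weight 10.0 and each negative about 0.526, so a mistake on a positive counts 19 times as much as a mistake on a negative.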

Fix 2 — rebalance the data itself

Instead of reweighting, you can change the mix of rows:

Technique	What it does	Watch out for
Oversampling	Duplicate / synthesise more of the rare class	Can overfit to the copies
Undersampling	Drop some of the common class	Throws away real data
SMOTE	Create new synthetic rare examples between real ones	Needs the `imbalanced-learn` library
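The first two rows of the table need nothing beyond scikit-learn. A minimal sketch, assuming a toy 95/5 training set built with NumPy (the array names are illustrative): random oversampling draws rare rows with replacement, and random undersampling keeps a random subset of the common rows.

```python
import numpy as np
from collections import Counter
from sklearn.utils import resample

# Toy training set for illustration: 95 common rows (0), 5 rare rows (1)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 95 + [1] * 5)

X_common, X_rare = X_train[y_train == 0], X_train[y_train == 1]

# Oversampling: draw rare rows WITH replacement until they match the common class
X_rare_up = resample(X_rare, replace=True, n_samples=len(X_common), random_state=0)
X_over = np.vstack([X_common, X_rare_up])
y_over = np.array([0] * len(X_common) + [1] * len(X_rare_up))

# Undersampling: keep a random subset of common rows, WITHOUT replacement
X_common_down = resample(X_common, replace=False, n_samples=len(X_rare), random_state=0)
X_under = np.vstack([X_common_down, X_rare])
y_under = np.array([0] * len(X_common_down) + [1] * len(X_rare))

print('Original    :', Counter(y_train.tolist()))
print('Oversampled :', Counter(y_over.tolist()))
print('Undersampled:', Counter(y_under.tolist()))
```

Oversampling ends with 95 rows per class built from only 5 distinct rare rows (hence the overfitting risk), while undersampling ends with 5 rows per class and discards 90 real rows.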

SMOTE (Synthetic Minority Over-sampling Technique) is the best-known method: rather than copying rare rows, it invents plausible new ones sitting between existing rare points, so the model sees more variety. It lives in the separate imbalanced-learn package (pip install imbalanced-learn), used like this:

SMOTE invents new rare-class rows to balance the training set

# requires: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from collections import Counter

print('Before SMOTE:', Counter(ytr))
Xbal, ybal = SMOTE(random_state=0).fit_resample(Xtr, ytr)
print('After SMOTE :', Counter(ybal))

Note: Output:

Before SMOTE: Counter({0: 1330, 1: 70})
After SMOTE : Counter({0: 1330, 1: 1330})

SMOTE manufactured synthetic positives until both classes matched (1330 each). A model fitted on Xbal and ybal now trains on a balanced set, so the rare class is no longer drowned out by the common one.
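The "sitting between existing rare points" idea comes down to linear interpolation. The sketch below is a simplified illustration in plain NumPy, not imbalanced-learn's implementation: it picks a rare point, picks one of its nearest rare neighbours, and places a new point at a random position on the line segment between them. The toy points and the helper name `smote_like_point` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical rare-class points in 2D
X_rare = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [1.1, 1.4], [0.9, 0.7]])

def smote_like_point(X, k=3):
    """Return one synthetic point between a random row of X and one of its k nearest rows."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip index 0, which is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # position along the segment, in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_point(X_rare) for _ in range(5)])
print(synthetic.round(2))
```

Because every new point lies on a segment joining two real rare points, the synthetic rows stay inside the region the rare class already occupies instead of being exact copies.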

Watch out: Resample the training set only, never the test set. The test set must keep the real-world imbalance, or your score will not reflect reality. When you cross-validate, put the resampler inside imblearn’s Pipeline (imblearn.pipeline.Pipeline) rather than scikit-learn’s, so resampling happens per fold and only on training rows.
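What "per fold and only on training rows" means can be sketched by hand. The loop below is an illustration using only scikit-learn, with random oversampling standing in for SMOTE; it is not how imblearn's pipeline is implemented, and the dataset settings are copied from the earlier example. Each fold's training rows are rebalanced, while the validation rows keep their original imbalance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

recalls = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Rebalance the TRAINING fold only: oversample rare rows with replacement
    rare, common = X_tr[y_tr == 1], X_tr[y_tr == 0]
    rare_up = resample(rare, replace=True, n_samples=len(common), random_state=0)
    X_bal = np.vstack([common, rare_up])
    y_bal = np.array([0] * len(common) + [1] * len(rare_up))

    model = LogisticRegression(max_iter=500).fit(X_bal, y_bal)

    # Score on the untouched validation fold, which keeps the real imbalance
    recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))

print('Recall per fold:', np.round(recalls, 2))
```

Resampling before splitting would let copies of the same rare row land in both training and validation folds, which inflates the score; doing it inside the loop avoids that leak.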

Tip: On any imbalanced problem, stop reporting accuracy. Report recall (did we catch the rare cases?), precision (when we flagged it, were we right?), the F1-score and the confusion matrix — the metrics from the Classification Depth lesson.
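As a sketch of what such a report might look like, the snippet below rebuilds the two logistic-regression models from Fix 1 and prints precision, recall and F1 for the rare class side by side. The helper name `report` is illustrative; exact numbers depend on your scikit-learn version, so none are shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    'plain': LogisticRegression(max_iter=500),
    'balanced': LogisticRegression(max_iter=500, class_weight='balanced'),
}

def report(name, model):
    """Fit the model and print rare-class precision, recall and F1 on the test set."""
    pred = model.fit(Xtr, ytr).predict(Xte)
    scores = {
        'precision': precision_score(yte, pred, zero_division=0),
        'recall': recall_score(yte, pred, zero_division=0),
        'f1': f1_score(yte, pred, zero_division=0),
    }
    print(name.ljust(8), {k: round(float(v), 2) for k, v in scores.items()})
    return scores

results = {name: report(name, model) for name, model in models.items()}
```

Reading the two rows together shows the trade-off directly: a gain in recall is only worthwhile if the accompanying change in precision is acceptable for the problem at hand.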

Q. On a dataset that is 99% one class, why is accuracy a misleading metric?

Answer: When one class dominates, always guessing it gives very high accuracy but zero ability to find the rare class you actually care about. Use recall, precision, F1 and the confusion matrix instead.

✍️ Practice

On the imbalanced make_classification data, print the confusion matrix for the plain vs balanced models and compare the bottom-right cell (true positives: the rare cases each model caught).
Explain in one sentence why you must apply SMOTE to the training set but never the test set.

🏠 Homework

Name a real imbalanced problem from your own life or work, say which class is rare, and write which metric (not accuracy) you would report and one technique you would use to handle the imbalance.