Imbalanced Data: When One Class Is Rare
When 99% of rows are one class, plain accuracy lies — here is how to train and judge fairly.
What you will learn
- See why accuracy fails on imbalanced data
- Fix the balance with class_weight or resampling
- Judge the model with recall, not accuracy
The trap of a rare class
Many of the most important problems are imbalanced: out of 10,000 card payments maybe 50 are fraud; out of 10,000 patients maybe 100 have a rare disease. The class you actually care about — fraud, disease, churn — is the rare one. That changes everything.
Suppose 99% of payments are genuine and 1% are fraud. A lazy model that simply predicts “genuine” every single time scores 99% accuracy and catches zero fraud. High accuracy, completely useless. This is why accuracy on its own is a poor metric whenever one class is rare.
A worked example: the lazy 99% model
Let us make a fake dataset with 990 genuine and 10 fraud rows, then check what “always predict genuine” scores.
# 990 genuine (0) and 10 fraud (1)
y_true = [0] * 990 + [1] * 10
y_lazy = [0] * 1000 # predict "genuine" for everyone
from sklearn.metrics import accuracy_score, recall_score
print('Accuracy:', round(accuracy_score(y_true, y_lazy), 3))
print('Recall on fraud:', recall_score(y_true, y_lazy))
Note: Output:
Accuracy: 0.99
Recall on fraud: 0.0
The lazy model looks brilliant on accuracy (0.99) yet its recall on fraud is 0: it caught none of the 10 fraud cases. On imbalanced data, recall on the rare class tells the real story; accuracy hides the failure.
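The same lazy model can also be built as a real scikit-learn estimator, which makes it a handy baseline to beat. The sketch below assumes DummyClassifier with strategy="most_frequent" as a stand-in for "always predict genuine"; the X_dummy array is a hypothetical placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Same toy labels as above: 990 genuine (0) and 10 fraud (1).
y_true = np.array([0] * 990 + [1] * 10)
# Hypothetical placeholder features: the baseline never looks at them.
X_dummy = np.zeros((len(y_true), 1))

# strategy="most_frequent" always predicts the majority class seen in fit().
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_recall = recall_score(y_true, y_base)
print("Baseline accuracy:", round(baseline_accuracy, 3))
print("Baseline recall  :", baseline_recall)
```

Any model worth keeping should beat this baseline on recall for the rare class, not just match it on accuracy.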
Fix 1 — tell the model the rare class matters
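The easiest fix needs no new data: set class_weight='balanced'. This weights each class inversely to how often it appears, so mistakes on the rare class hurt more and the model stops ignoring it. Many scikit-learn classifiers accept it, including LogisticRegression, SVC and RandomForestClassifier.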
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
# build an imbalanced dataset: ~5% positive
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
plain = LogisticRegression(max_iter=500).fit(Xtr, ytr)
weighted = LogisticRegression(max_iter=500,
                              class_weight='balanced').fit(Xtr, ytr)
print('Recall, plain :', round(recall_score(yte, plain.predict(Xte)), 2))
print('Recall, balanced:', round(recall_score(yte, weighted.predict(Xte)), 2))
Note: Output:
Recall, plain : 0.45
Recall, balanced: 0.83
The plain model caught under half the rare cases. Just adding class_weight='balanced' nearly doubled recall — it now catches most of them. (Catching more rare cases usually costs a few false alarms, a trade you tune to the problem.)
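To see what class_weight='balanced' actually hands the model, it can help to compute the weights yourself. The sketch below assumes scikit-learn's documented heuristic, n_samples / (n_classes * count of each class), and checks it against the compute_class_weight helper. The y_demo labels are hypothetical, chosen to mirror the 95/5 split used in this lesson.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 common rows (0) and 50 rare rows (1).
y_demo = np.array([0] * 950 + [1] * 50)

# 'balanced' heuristic: n_samples / (n_classes * count_per_class).
n_classes = len(np.unique(y_demo))
manual_weights = len(y_demo) / (n_classes * np.bincount(y_demo))

# scikit-learn's helper for the same calculation.
sk_weights = compute_class_weight(class_weight="balanced",
                                  classes=np.unique(y_demo), y=y_demo)

print("Manual :", manual_weights)
print("sklearn:", sk_weights)
```

With this split, each rare row counts about nineteen times as much as a common row, which is why the weighted model can no longer afford to ignore the rare class.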
Fix 2 — rebalance the data itself
Instead of reweighting, you can change the mix of rows:
| Technique | What it does | Watch out for |
|---|---|---|
| Oversampling | Duplicate / synthesise more of the rare class | Can overfit to the copies |
| Undersampling | Drop some of the common class | Throws away real data |
| SMOTE | Create new synthetic rare examples between real ones | Needs the imbalanced-learn library |
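The first two rows of the table can be sketched with plain NumPy. This is a minimal illustration of random oversampling (duplicating rare rows) and random undersampling (dropping common rows), not the imbalanced-learn implementation; X_train and y_train here are hypothetical toy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 95 common rows (0) and 5 rare rows (1).
y_train = np.array([0] * 95 + [1] * 5)
X_train = rng.normal(size=(len(y_train), 2))

rare_idx = np.flatnonzero(y_train == 1)
common_idx = np.flatnonzero(y_train == 0)

# Random oversampling: draw rare rows WITH replacement until the classes match.
extra_rare = rng.choice(rare_idx, size=len(common_idx) - len(rare_idx), replace=True)
over_idx = np.concatenate([common_idx, rare_idx, extra_rare])
X_over, y_over = X_train[over_idx], y_train[over_idx]

# Random undersampling: keep only as many common rows as there are rare rows.
kept_common = rng.choice(common_idx, size=len(rare_idx), replace=False)
under_idx = np.concatenate([kept_common, rare_idx])
X_under, y_under = X_train[under_idx], y_train[under_idx]

print("Oversampled :", np.bincount(y_over))    # class counts [common, rare]
print("Undersampled:", np.bincount(y_under))
```

Oversampling keeps every real row but repeats the rare ones many times; undersampling ends up with only 10 rows here, which shows how much real data it can throw away.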
SMOTE (Synthetic Minority Over-sampling Technique) is the best-known method: rather than copying rare rows, it invents plausible new ones sitting between existing rare points, so the model sees more variety. It lives in the separate imbalanced-learn package (pip install imbalanced-learn), used like this:
# requires: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from collections import Counter
print('Before SMOTE:', Counter(ytr))
Xbal, ybal = SMOTE(random_state=0).fit_resample(Xtr, ytr)
print('After SMOTE :', Counter(ybal))
Note: Output:
Before SMOTE: Counter({0: 1330, 1: 70})
After SMOTE : Counter({0: 1330, 1: 1330})
SMOTE manufactured new synthetic positives until both classes matched (1330 each). The model now trains on a balanced set and learns the rare class properly.
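The idea of "new points sitting between existing rare points" can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation step only, not imbalanced-learn's actual code; the rare array, the synth_point helper and k=2 are all hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rare-class points in 2-D.
rare = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synth_point(points, i, k=2):
    """Make one synthetic point between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # The closest entry is the point itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # position along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])

# One synthetic point per real rare point.
synthetic = np.array([synth_point(rare, i) for i in range(len(rare))])
print(synthetic)
```

Each synthetic row lies on the straight line between two real rare rows, so it stays inside the region the rare class already occupies.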
Watch out: Resample the training set only, never the test set. The test set must keep the real-world imbalance, or your score will not reflect reality. When you cross-validate, put the resampler inside imblearn's Pipeline (imblearn.pipeline.Pipeline) rather than scikit-learn's, so resampling is redone for each fold and touches only that fold's training rows.
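To show what "per fold, training rows only" means, here is a hand-rolled sketch using scikit-learn alone. It swaps in plain random oversampling instead of SMOTE so that it runs without imbalanced-learn; the oversample helper is hypothetical and assumes label 1 is the rarer class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)

def oversample(X_part, y_part):
    """Randomly duplicate rare rows (label 1) until both classes are the same size."""
    rare = np.flatnonzero(y_part == 1)
    common = np.flatnonzero(y_part == 0)
    extra = rng.choice(rare, size=len(common) - len(rare), replace=True)
    idx = np.concatenate([common, rare, extra])
    return X_part[idx], y_part[idx]

fold_recalls = []
val_positive_rates = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X, y):
    # Resample the training rows of this fold only.
    X_bal, y_bal = oversample(X[train_idx], y[train_idx])
    model = LogisticRegression(max_iter=500).fit(X_bal, y_bal)
    # Validation rows are scored as-is, with the original imbalance.
    fold_recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    val_positive_rates.append(y[val_idx].mean())

print("Recall per fold:", np.round(fold_recalls, 2))
print("Positive rate in validation folds:", np.round(val_positive_rates, 3))
```

The validation folds keep roughly the original share of rare rows, so the recall you read off is the one you could expect on real, imbalanced data.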
Tip: On any imbalanced problem, stop reporting accuracy. Report recall (did we catch the rare cases?), precision (when we flagged it, were we right?), the F1-score and the confusion matrix — the metrics from the Classification Depth lesson.
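As a quick reminder of how those metrics relate, the sketch below computes them on a small hand-made example and checks them against the confusion-matrix cells. The y_true and y_pred lists are hypothetical, built so that 8 of 10 rare cases are caught at the cost of 5 false alarms.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 90 common rows (0) and 10 rare rows (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical predictions: 85 correct zeros, 5 false alarms, 8 caught, 2 missed.
y_pred = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print("Accuracy :", round(accuracy, 3))
print("Recall   :", round(recall, 3))
print("Precision:", round(precision, 3))
print("F1       :", round(f1, 3))
```

Accuracy still looks high here, while recall and precision show separately how many rare cases were missed and how many alarms were false.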
Q. On a dataset that is 99% one class, why is accuracy a misleading metric?
✍️ Practice
- On the imbalanced make_classification data, print the confusion matrix for the plain vs balanced models and compare the bottom-right cell (rare cases caught).
- Explain in one sentence why you must apply SMOTE to the training set but never the test set.
🏠 Homework
- Name a real imbalanced problem from your own life or work, say which class is rare, and write which metric (not accuracy) you would report and one technique you would use to handle the imbalance.