
Classification Depth: Confusion Matrix, F1 & ROC-AUC

Go beyond accuracy: read a confusion matrix in code, balance precision and recall with F1, and measure how well a model ranks cases with ROC-AUC.

What you will learn

Read a confusion matrix produced by scikit-learn
Compute and interpret the F1 score
Explain what ROC-AUC measures

Building on accuracy, precision and recall

You have already met accuracy, precision and recall. This lesson adds three tools used to judge real classification models, both in interviews and on the job: the confusion matrix in code, the F1 score, and ROC-AUC.

Quick refresher: precision = of everything flagged positive, how much truly was (trust the alarms); recall = of all the real positives, how many were caught (catch them all).

The confusion matrix in scikit-learn

The confusion matrix lays out exactly where the model was right and wrong. scikit-learn prints it as a grid: rows are the true class, columns are the predicted class, and the diagonal holds the correct answers.

A confusion matrix: rows = truth, columns = prediction

from sklearn.metrics import confusion_matrix, classification_report

# 0 = healthy, 1 = sick
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))

Note: Output:

[[4 1]
 [2 3]]

Reading it: top-left 4 = healthy correctly called healthy (true negatives); top-right 1 = healthy wrongly called sick (false positives, the false alarms); bottom-left 2 = sick wrongly called healthy (false negatives, the dangerous misses); bottom-right 3 = sick correctly caught (true positives). The diagonal (4 and 3) holds the correct predictions.
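To work with the four counts in code rather than reading them off the grid, a common idiom for the two-class case is to flatten the matrix with ravel(). The sketch below reuses the labels from the example above; the variable names tn, fp, fn and tp are illustrative choices, not part of the scikit-learn API.

```python
from sklearn.metrics import confusion_matrix

# Same labels as the example above: 0 = healthy, 1 = sick
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() flattens the 2x2 grid row by row:
# [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything flagged sick, how much truly was
recall = tp / (tp + fn)     # of all the truly sick, how many were caught

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('precision:', precision, 'recall:', recall)
```

With these labels the counts should come out as 4, 1, 2 and 3, giving precision 0.75 and recall 0.60 for the sick class.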

The classification report: every score at once

Rather than computing each metric by hand, classification_report prints precision, recall and F1 for every class in one go.

One call gives precision, recall and F1 per class

print(classification_report(y_true, y_pred,
      target_names=['healthy', 'sick']))

Note: Output (the macro avg and weighted avg rows that scikit-learn also prints are omitted):

              precision    recall  f1-score   support

     healthy       0.67      0.80      0.73         5
        sick       0.75      0.60      0.67         5

    accuracy                           0.70        10

For “sick”: precision 0.75 (3 of the 4 cases it called sick truly were) and recall 0.60 (it caught 3 of the 5 truly sick cases). The f1-score of 0.67 blends those two into one number, as explained next.
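If you need the numbers themselves rather than a printed table (for logging or comparing models, say), classification_report can return a nested dictionary via output_dict=True. This is a small sketch on the same labels; the variable names report and sick are illustrative.

```python
from sklearn.metrics import classification_report

# Same labels as the example above: 0 = healthy, 1 = sick
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(
    y_true, y_pred,
    target_names=['healthy', 'sick'],
    output_dict=True,
)

sick = report['sick']  # per-class scores are keyed by the target name
print('sick precision:', round(sick['precision'], 2))
print('sick recall:   ', round(sick['recall'], 2))
print('sick F1:       ', round(sick['f1-score'], 2))
print('accuracy:      ', round(report['accuracy'], 2))
```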

F1: one number that balances precision and recall

Often precision and recall pull in opposite directions — push one up and the other drops. The F1 score combines them into a single number using a harmonic mean, which is just an average that punishes imbalance: F1 is only high when both precision and recall are high. If either is poor, F1 is poor.

The formula is F1 = 2 × (precision × recall) ÷ (precision + recall). For “sick” above: 2 × (0.75 × 0.60) ÷ (0.75 + 0.60) = 2 × 0.45 ÷ 1.35 = 0.90 ÷ 1.35 ≈ 0.67, matching the report. Use F1 when you care about precision and recall together, especially on imbalanced data where accuracy can mislead.
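The formula is short enough to write by hand, which makes the "punishes imbalance" behaviour easy to see. The sketch below uses a hypothetical helper named f1 (not a scikit-learn function) and compares it with a plain average for a lopsided model whose precision and recall values are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The "sick" class from the report: precision 0.75, recall 0.60
print('sick F1:', round(f1(0.75, 0.60), 2))

# A lopsided model: very precise, but it misses most positives
lopsided_p, lopsided_r = 0.90, 0.10
print('plain average:', round((lopsided_p + lopsided_r) / 2, 2))
print('F1:           ', round(f1(lopsided_p, lopsided_r), 2))
```

The plain average of 0.90 and 0.10 is 0.5, but F1 comes out near 0.18: the weak recall drags the score down instead of being averaged away.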

ROC-AUC: how well the model ranks

Most classifiers do not just say “sick” or “healthy”: they give each case a probability of being positive, and a threshold (usually 0.5) turns that into a label. ROC-AUC grades the model across all thresholds at once, so it does not depend on where you set the line.
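To make the threshold idea concrete, here is a minimal pure-Python sketch that turns probabilities into labels at three different thresholds and scores each result. The data and the helper name precision_recall_at are invented for illustration.

```python
# Illustrative data: true labels and the model's probability of class 1
y_true = [1, 1, 1, 0, 0, 0]
y_proba = [0.9, 0.8, 0.3, 0.35, 0.2, 0.1]

def precision_recall_at(threshold):
    """Turn probabilities into labels at `threshold`, then score them."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.25, 0.5, 0.85):
    precision, recall = precision_recall_at(threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this data, lowering the threshold to 0.25 catches every positive but lets in a false alarm, while raising it to 0.85 keeps precision perfect but catches only one of the three positives. The labels change with the threshold even though the model's scores do not.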

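In plain words, AUC is the chance that the model gives a higher score to a random truly-positive case than to a random truly-negative one. In principle it runs from 0 to 1: 1.0 is a perfect ranking, 0.5 is no better than a coin flip, and a value below 0.5 means the model ranks cases worse than chance (its scores tend to point the wrong way).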
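That pairwise definition can be computed directly by comparing every positive case with every negative case. The sketch below is a hand-rolled illustration, not how scikit-learn implements it; the function name pairwise_auc and the score lists are made up, and ties are counted as half a win, which is the usual convention.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half.

    Assumes binary labels 0/1 with at least one case of each class.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]    # every positive outranks every negative
one_swap = [0.9, 0.8, 0.3, 0.35, 0.2, 0.1]  # one positive sits below one negative
constant = [0.5] * 6                        # the model cannot tell cases apart

print(round(pairwise_auc(y_true, perfect), 3))
print(round(pairwise_auc(y_true, one_swap), 3))
print(round(pairwise_auc(y_true, constant), 3))
```

With three positives and three negatives there are nine pairs, so a single misordered pair gives 8/9 ≈ 0.889, and a model that scores everything the same lands on 0.5.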

ROC-AUC scores the model’s ranking, not a single threshold

from sklearn.metrics import roc_auc_score

# the model's predicted PROBABILITY of class 1 (sick) for each case
y_true  = [1, 1, 1, 0, 0, 0]
y_proba = [0.9, 0.8, 0.3, 0.35, 0.2, 0.1]

print('ROC-AUC:', round(roc_auc_score(y_true, y_proba), 3))

Note: Output:

ROC-AUC: 0.889

An AUC of 0.889 means that if you picked one real sick case and one real healthy case at random, the model gives the sick one a higher risk score about 89% of the time. That is well above 0.5 (random), so it ranks cases well.

Metric	Tells you	Best for
Confusion matrix	Exactly where errors happen	Diagnosing what the model confuses
F1 score	Balance of precision and recall	Imbalanced data; one number to compare
ROC-AUC	Ranking quality across all thresholds	Comparing models when the threshold may change

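Watch out: On heavily imbalanced data (say 99% healthy), accuracy is misleading: a model that calls every case healthy scores 99% while catching no sick cases. Look at the confusion matrix and F1 as well, and use ROC-AUC to check ranking quality. The Imbalanced Data ideas matter here too.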
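A quick way to see the problem is to score a "lazy" model that calls everyone healthy. The data below is invented for illustration (99 healthy cases, 1 sick); zero_division=0 tells scikit-learn to report 0 rather than warn when a score's denominator is zero.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Illustrative data: 99 healthy cases (0) and 1 sick case (1)
y_true = [0] * 99 + [1]
# A "lazy" model that calls everyone healthy
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print('accuracy:', accuracy)
print('recall:  ', recall)
print('F1:      ', f1)
```

Accuracy comes out at 0.99, while recall and F1 for the sick class are both 0: the model never catches the one case that matters.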

Tip: Workflow: print the confusion matrix to see what goes wrong, use F1 for a single fair score, and use ROC-AUC when you might move the decision threshold (e.g. flag more cases for review).
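One possible way to bundle that workflow is a small helper that reports all three numbers for a chosen threshold. The function name summarize, its threshold parameter and the data are hypothetical, a sketch rather than a standard scikit-learn utility.

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def summarize(y_true, y_proba, threshold=0.5):
    """Hypothetical helper: score probabilities at one decision threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    return {
        'confusion_matrix': confusion_matrix(y_true, y_pred).tolist(),
        'f1': round(float(f1_score(y_true, y_pred)), 3),
        # ROC-AUC uses the raw probabilities, so the threshold does not affect it
        'roc_auc': round(float(roc_auc_score(y_true, y_proba)), 3),
    }

# Illustrative data: true labels and predicted probability of class 1
y_true = [1, 1, 1, 0, 0, 0]
y_proba = [0.9, 0.8, 0.3, 0.35, 0.2, 0.1]

default = summarize(y_true, y_proba)                  # threshold 0.5
lenient = summarize(y_true, y_proba, threshold=0.25)  # flag more cases
print(default)
print(lenient)
```

Lowering the threshold moves counts between the boxes of the confusion matrix and changes F1, but ROC-AUC stays the same because it is computed from the scores rather than from the thresholded labels.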

Q. Why is the F1 score often preferred over accuracy on imbalanced data?

Answer: F1 is the harmonic mean of precision and recall, so it is only high when both are high. A lazy model that ignores the rare positive class gets a low F1, even if its accuracy looks great.

✍️ Practice

Change two predictions in the confusion-matrix example and re-read which box each change lands in.
For a fraud detector where missing fraud is costly, say whether you would watch recall or precision more closely, and why.

🏠 Homework

Take any 10 true/predicted labels you invent, print the confusion matrix and classification report, and explain the F1 score for the positive class in your own words.