Hypothesis Testing & A/B Tests
Is that difference real or just luck? Hypothesis testing answers with a number (the p-value) and a reject-or-don't-reject decision, and it is the backbone of A/B testing.
What you will learn
- State a null and alternative hypothesis
- Run a t-test and read its p-value
- Decide significance and avoid common p-value traps
The everyday question: real, or just chance?
A new web-page button got a 6% click rate; the old one got 5%. Did the new button really work, or did we just get a lucky batch of visitors? Hypothesis testing is the formal way to answer “is this difference real or random noise?” — and it is a daily data-scientist task, especially in A/B testing (showing version A to some users, version B to others, and comparing).
The two hypotheses
Every test starts by writing down two opposing claims. You assume the boring one is true until the data forces you to abandon it — like “innocent until proven guilty”.
- Null hypothesis (H0) — the “nothing happened” claim: there is no real difference (the new button is no better).
- Alternative hypothesis (H1) — what you suspect: there is a real difference (the new button is better).
The test then asks: “if H0 (no difference) were true, how surprising is the difference we actually saw?” That surprise is measured by a p-value.
The p-value, in plain words
The p-value is the probability of seeing a difference at least as big as yours purely by chance, if there were really no difference. A small p-value means “this would be very unlikely by luck alone” — so the no-difference story looks wrong, and you reject H0. The usual cut-off is 0.05 (5%).
Suppose two groups of students used different study apps and we want to know if their exam scores really differ. A two-sample t-test compares the two averages and returns a p-value.
from scipy import stats
# Exam scores from two groups (app A vs app B)
group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print('t-statistic:', round(t_stat, 2))
print('p-value :', round(p_value, 5))
Output:
t-statistic: -7.76
p-value : 1e-05
The p-value is essentially 0 (about 0.000005, far below 0.05). That means: if the two apps truly made no difference, a gap this large would almost never arise by luck. So we reject the null: group B really did score higher. (Group A averages about 71, group B about 81.)
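To make "by luck alone" concrete, here is a minimal sketch of an exact permutation test on the same exam scores. This is a different technique from the t-test above, shown only for illustration: if the app made no difference, the group labels are arbitrary, so we can try every way of relabelling 7 of the 14 scores as "group A" and count how often the gap is at least as big as the one observed. The variable names are our own.

```python
from itertools import combinations
from statistics import mean

# Same exam scores as in the t-test example
group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]

observed_gap = abs(mean(group_a) - mean(group_b))

pooled = group_a + group_b
total = sum(pooled)
n_a, n_b = len(group_a), len(group_b)

count_all = 0      # relabellings tried
count_extreme = 0  # relabellings with a gap at least as big as observed

# Try every way of calling 7 of the 14 scores "group A".
for idx in combinations(range(len(pooled)), n_a):
    sum_a = sum(pooled[i] for i in idx)
    gap = abs(sum_a / n_a - (total - sum_a) / n_b)
    count_all += 1
    if gap >= observed_gap - 1e-9:  # small tolerance for float rounding
        count_extreme += 1

perm_p_value = count_extreme / count_all
print('relabellings tried :', count_all)
print('gaps >= observed   :', count_extreme)
print('permutation p-value:', round(perm_p_value, 5))
```

Only the real split and its mirror image produce a gap this large, so the permutation p-value is 2 out of 3432, roughly 0.0006. It is larger than the t-test's p-value because with 7 scores per group a permutation test cannot go any lower, but both lead to the same decision at alpha = 0.05.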
Making the decision
The rule is mechanical once you have a p-value and a chosen cut-off (called alpha, usually 0.05).
p_value = 0.000005  # approximate p-value from the t-test above
alpha = 0.05
if p_value < alpha:
    print('Reject the null: the difference is statistically significant.')
else:
    print('Fail to reject the null: the difference could be chance.')
Output:
Reject the null: the difference is statistically significant.
Because 0.000005 < 0.05, we reject the “no difference” story. “Statistically significant” is just shorthand for “unlikely to be luck alone”. Had the p-value been, say, 0.31, we would have failed to reject: the gap could easily be chance.
A/B testing: the same test, a business decision
An A/B test is this exact tool aimed at a product choice. You split users randomly into group A (old design) and group B (new design), measure an outcome (clicks, sign-ups), and run a test to see if B truly beats A before you ship it to everyone.
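For click rates like the 5% versus 6% button example, a common choice is a two-proportion z-test rather than a t-test. The sketch below uses only the standard library. The visitor and click counts are hypothetical (the text gives only the rates), and the variable names are our own.

```python
import math

# Hypothetical counts (not from the text): 1,000 visitors per version.
visitors_a, clicks_a = 1000, 50  # old button: 5% click rate
visitors_b, clicks_b = 1000, 60  # new button: 6% click rate

rate_a = clicks_a / visitors_a
rate_b = clicks_b / visitors_b

# Under H0 both versions share one click rate, estimated by pooling.
pooled_rate = (clicks_a + clicks_b) / (visitors_a + visitors_b)
std_error = math.sqrt(
    pooled_rate * (1 - pooled_rate) * (1 / visitors_a + 1 / visitors_b)
)

z_stat = (rate_b - rate_a) / std_error
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print('z-statistic:', round(z_stat, 2))
print('p-value    :', round(p_value, 3))
```

With these assumed counts the p-value comes out around 0.33, so the test would not reject H0: a 5% versus 6% gap on 1,000 visitors per version is well within what chance can produce. The same rates measured on a much larger sample could give a very different answer.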
| Piece | Plain meaning |
|---|---|
| Null (H0) | No real difference |
| Alternative (H1) | There is a real difference |
| p-value | Chance of a result at least this extreme if H0 were true |
| alpha (0.05) | Our “too unlikely to be luck” line |
| p < 0.05 | Reject H0 — call it significant |
Watch out: A small p-value means the difference is unlikely to be luck — it does not mean the difference is large or important. With huge samples even a tiny, useless difference can be “significant”. Always report the actual size of the effect too.
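One way to report the size of an effect next to the p-value is the raw difference in means together with Cohen's d (the difference divided by the pooled standard deviation). Cohen's d is not named in the text above; it is used here as one common convention. A minimal sketch on the exam scores from the t-test example:

```python
import math
from statistics import mean, variance

# Same exam scores as in the t-test example
group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]

# Raw effect size: how many points apart are the averages?
mean_gap = mean(group_b) - mean(group_a)

# Pooled standard deviation from the sample variance of each group.
n_a, n_b = len(group_a), len(group_b)
pooled_var = (
    (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
) / (n_a + n_b - 2)
pooled_sd = math.sqrt(pooled_var)

# Standardized effect size: the gap measured in standard deviations.
cohens_d = mean_gap / pooled_sd

print('difference in means:', round(mean_gap, 2), 'points')
print("Cohen's d          :", round(cohens_d, 2))
```

Here the gap is 10 points, about four standard deviations, so the effect is large as well as significant. A tiny d with a tiny p-value is the situation the warning above describes.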
Watch out: “Fail to reject H0” is not the same as “H0 is proven true”. It only means you lacked enough evidence — maybe your sample was too small. Absence of proof is not proof of absence.
Tip: Pick your alpha (usually 0.05) before you see the data, and decide the sample size in advance. Peeking and stopping the test the moment it looks significant — “p-hacking” — manufactures false positives.
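The cost of peeking can be seen in a small simulation. The sketch below runs many A/A tests (both groups have the same true click rate, so every "significant" result is a false positive) and compares testing once at the planned end with checking at every peek. All numbers (click rate, sample size, peeking schedule, number of experiments) are assumptions chosen for illustration, and the p-value comes from a two-proportion z-test written with the standard library.

```python
import math
import random

def two_sided_p(clicks_a, clicks_b, n):
    """Two-proportion z-test p-value for two groups of n users each."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    if pooled in (0, 1):
        return 1.0  # no variation yet, so no evidence of a difference
    std_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (clicks_b / n - clicks_a / n) / std_error
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
alpha = 0.05
true_rate = 0.05     # same click rate for A and B, so H0 is true
total_users = 2000   # planned sample size per group
peek_every = 100     # look at the results every 100 users
n_experiments = 300

hits_fixed = 0    # significant when tested once, at the planned end
hits_peeking = 0  # significant at one or more peeks along the way

for _ in range(n_experiments):
    clicks_a = clicks_b = 0
    significant_at_some_peek = False
    for user in range(1, total_users + 1):
        clicks_a += int(random.random() < true_rate)
        clicks_b += int(random.random() < true_rate)
        if user % peek_every == 0 and two_sided_p(clicks_a, clicks_b, user) < alpha:
            # A peeker would stop here; we keep going only so we can
            # also record the result at the planned end.
            significant_at_some_peek = True
    hits_peeking += int(significant_at_some_peek)
    hits_fixed += int(two_sided_p(clicks_a, clicks_b, total_users) < alpha)

print('false-positive rate, test once at the end  :', hits_fixed / n_experiments)
print('false-positive rate, stop at first p < 0.05:', hits_peeking / n_experiments)
```

Because the final look is one of the peeks, the peeking rate can never be lower than the test-once rate, and in simulations like this it is typically well above the 5% that alpha promises.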
Q. A t-test comparing two groups returns a p-value of 0.002, and you chose alpha = 0.05. What do you conclude?
✍️ Practice
- Write a null and alternative hypothesis for the question “does a new homepage get more sign-ups than the old one?”
- Run stats.ttest_ind on two small lists of numbers you invent and decide, using alpha = 0.05, whether to reject the null.
🏠 Homework
- Find or invent two groups of measurements (e.g. test scores before vs after a change). Run a t-test, report the p-value, state your decision at alpha = 0.05, and write one sentence on what it means in plain language.