Going Deeper · Pro · 45 min read

Hypothesis Testing & A/B Tests

Is that difference real or just luck? Hypothesis testing gives a yes/no answer with a number — the backbone of A/B testing.

What you will learn

  • State a null and alternative hypothesis
  • Run a t-test and read its p-value
  • Decide significance and avoid common p-value traps

The everyday question: real, or just chance?

A new web-page button got a 6% click rate; the old one got 5%. Did the new button really work, or did we just get a lucky batch of visitors? Hypothesis testing is the formal way to answer “is this difference real or random noise?” — and it is a daily data-scientist task, especially in A/B testing (showing version A to some users, version B to others, and comparing).

The two hypotheses

Every test starts by writing down two opposing claims. You assume the boring one is true until the data forces you to abandon it — like “innocent until proven guilty”.

  1. Null hypothesis (H0) — the “nothing happened” claim: there is no real difference (the new button is no better).
  2. Alternative hypothesis (H1) — what you suspect: there is a real difference (the new button is better).

The test then asks: “if H0 (no difference) were true, how surprising is the difference we actually saw?” That surprise is measured by a p-value.

The p-value, in plain words

The p-value is the probability of seeing a difference at least as big as yours purely by chance, if there were really no difference. A small p-value means “this would be very unlikely by luck alone” — so the no-difference story looks wrong, and you reject H0. The usual cut-off is 0.05 (5%).

Suppose two groups of students used different study apps and we want to know if their exam scores really differ. A two-sample t-test compares the two averages and returns a p-value.

A two-sample t-test comparing two groups’ averages
from scipy import stats

# Exam scores from two groups (app A vs app B)
group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print('t-statistic:', round(t_stat, 2))
print('p-value    :', round(p_value, 5))

Note: Output: t-statistic: -7.76 p-value : 1e-05 The p-value is tiny (about 0.000005 before rounding, far below 0.05). That means: if the two apps truly made no difference, a gap this large would almost never arise by luck. So we reject the null: group B really did score higher. (Group A averages about 71.1, group B about 81.1.)
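
To make "at least as big as yours, purely by chance" concrete, here is a minimal sketch of an exact permutation test on the same 14 scores. This is a different technique from the t-test (it is not what stats.ttest_ind does internally), and the variable names are illustrative. If the app made no difference, the labels A and B are arbitrary, so we can try every way of splitting the 14 scores into two groups of 7 and count how often the gap in averages is at least as large as the one we observed.

```python
from itertools import combinations

group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]

pooled = group_a + group_b
n_a = len(group_a)
n_b = len(pooled) - n_a
total = sum(pooled)

def mean_gap(sum_a):
    """Average of the first group minus average of the second, given the first group's sum."""
    return sum_a / n_a - (total - sum_a) / n_b

observed_gap = abs(mean_gap(sum(group_a)))

n_splits = 0    # every possible way to relabel the scores
n_extreme = 0   # relabelings with a gap at least as big as the real one
for idx in combinations(range(len(pooled)), n_a):
    sum_a = sum(pooled[i] for i in idx)
    n_splits += 1
    # Small tolerance so floating-point rounding cannot hide an exact tie.
    if abs(mean_gap(sum_a)) >= observed_gap - 1e-9:
        n_extreme += 1

perm_p_value = n_extreme / n_splits
print(n_extreme, 'of', n_splits, 'splits are at least as extreme')
print('permutation p-value:', round(perm_p_value, 5))
```

Only 2 of the 3432 possible splits (the real one and its mirror image) produce a gap this large, which is a p-value of about 0.0006. The number differs from the t-test's because the two methods make different assumptions, but both land far below 0.05 and tell the same story.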

Making the decision

The rule is mechanical once you have a p-value and a chosen cut-off (called alpha, usually 0.05).

Turn a p-value into a decision
p_value = 0.0
alpha = 0.05

if p_value < alpha:
    print('Reject the null — the difference is statistically significant.')
else:
    print('Fail to reject the null — the difference could be chance.')

Note: Output: Reject the null — the difference is statistically significant. Because 0.0 < 0.05, we reject the “no difference” story. “Statistically significant” is just shorthand for “unlikely to be luck”. Had the p-value been, say, 0.31, we would have failed to reject — the gap could easily be chance.

A/B testing: the same test, a business decision

An A/B test is this exact tool aimed at a product choice. You split users randomly into group A (old design) and group B (new design), measure an outcome (clicks, sign-ups), and run a test to see if B truly beats A before you ship it to everyone.
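
Click rates are yes/no outcomes rather than averages of scores, so a common choice for them is a two-proportion z-test instead of a t-test. The sketch below is one way to run it with scipy.stats.norm. The helper name two_proportion_p_value and the visitor counts are assumptions made for illustration: 1,000 visitors per version, with the 5% and 6% click rates from the button example.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for H0: both versions share one click rate."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Under H0 both groups share one rate, so pool all clicks to estimate it.
    pooled_rate = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled_rate * (1 - pooled_rate)
                     * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # norm.sf is the upper-tail probability; double it for a two-sided test.
    return z, 2 * norm.sf(abs(z))

# Hypothetical counts: 50/1000 clicks for the old button, 60/1000 for the new one.
z_stat, ab_p_value = two_proportion_p_value(50, 1000, 60, 1000)
print('z-statistic:', round(z_stat, 2))
print('p-value    :', round(ab_p_value, 3))
```

With these made-up counts the p-value comes out around 0.33, well above 0.05. A jump from 5% to 6% looks promising, but with only 1,000 visitors per version it could easily be a lucky batch, so this data alone would not justify shipping.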

Piece | Plain meaning
Null (H0) | No real difference
Alternative (H1) | There is a real difference
p-value | Chance of a result at least this extreme if H0 were true
alpha (0.05) | Our “too unlikely to be luck” line
p < 0.05 | Reject H0 and call it significant

Watch out: A small p-value means the difference is unlikely to be luck — it does not mean the difference is large or important. With huge samples even a tiny, useless difference can be “significant”. Always report the actual size of the effect too.
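
One way to report the size of an effect alongside the p-value is to give the raw gap in averages plus a standardized version of it. The sketch below uses Cohen's d, which is one common choice among several (an assumption here, not something the lesson prescribes), on the study-app scores from earlier.

```python
from math import sqrt
from statistics import mean, stdev

group_a = [70, 72, 68, 75, 71, 69, 73]
group_b = [78, 82, 80, 85, 79, 83, 81]

# Raw effect: how many exam points separate the two averages.
mean_gap = mean(group_b) - mean(group_a)

# Pooled standard deviation: the typical spread of scores inside a group.
n_a, n_b = len(group_a), len(group_b)
pooled_sd = sqrt(((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2))

# Cohen's d: the gap measured in units of that spread.
cohens_d = mean_gap / pooled_sd

print('gap in averages:', round(mean_gap, 1), 'points')
print("Cohen's d      :", round(cohens_d, 2))
```

Here the gap is 10 points and d is a little above 4, meaning the averages sit about four standard deviations apart. That is a very large effect by any usual yardstick. A significant result with a d near 0.01 would deserve a very different reaction.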

Watch out: “Fail to reject H0” is not the same as “H0 is proven true”. It only means you lacked enough evidence — maybe your sample was too small. Absence of proof is not proof of absence.
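
To see how sample size alone can flip the verdict, the sketch below holds the averages and spread fixed and changes only the number of people per group. It uses scipy's stats.ttest_ind_from_stats, which runs the same two-sample t-test from summary numbers. The means (71 vs 73) and standard deviation (6) are hypothetical values picked for illustration.

```python
from scipy import stats

# Hypothetical summary numbers: a 2-point gap, same spread in both groups.
p_by_size = {}
for n in [5, 20, 100]:
    result = stats.ttest_ind_from_stats(mean1=71, std1=6, nobs1=n,
                                        mean2=73, std2=6, nobs2=n)
    p_by_size[n] = result.pvalue
    print('n per group =', n, ' p-value =', round(result.pvalue, 3))
```

The 2-point gap is identical in every row, yet the p-value is far above 0.05 with 5 or 20 people per group and drops below it with 100. The small samples did not prove the gap was absent; they were just too small to detect it.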

Tip: Pick your alpha (usually 0.05) before you see the data, and decide the sample size in advance. Peeking and stopping the test the moment it looks significant — “p-hacking” — manufactures false positives.
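
The cost of peeking can be checked with a small simulation. The sketch below is an illustration under assumed settings, not a recipe: it runs 500 fake A/B tests in which both groups come from the same distribution, so any "significant" result is a false positive by construction. It compares testing once at the planned sample size with peeking after every 10 users per group and stopping at the first p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 500                 # assumed number of simulated A/B tests
checkpoints = range(10, 101, 10)    # peek after every 10 users per group

fixed_hits = 0     # significant at the single planned look (n = 100)
peeking_hits = 0   # significant at ANY of the ten looks

for _ in range(n_experiments):
    # Same distribution for both groups: H0 is true, there is no real effect.
    a = rng.normal(loc=0.0, scale=1.0, size=100)
    b = rng.normal(loc=0.0, scale=1.0, size=100)
    p_at_peeks = [stats.ttest_ind(a[:n], b[:n]).pvalue for n in checkpoints]
    fixed_hits += int(p_at_peeks[-1] < alpha)
    peeking_hits += int(min(p_at_peeks) < alpha)

fixed_rate = fixed_hits / n_experiments
peeking_rate = peeking_hits / n_experiments
print('false-positive rate, one planned look  :', fixed_rate)
print('false-positive rate, stop at first hit :', peeking_rate)
```

The single planned look should come out near the promised 5%, while the stop-at-first-hit strategy typically produces a false-positive rate several times higher. Exact figures depend on the seed and on how often you peek.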

Q. A t-test comparing two groups returns a p-value of 0.002, and you chose alpha = 0.05. What do you conclude?

Answer: Because 0.002 < 0.05, a difference this large would be very unlikely if there were truly no difference, so you reject the null and call the result statistically significant.

✍️ Practice

  1. Write a null and alternative hypothesis for the question “does a new homepage get more sign-ups than the old one?”
  2. Run stats.ttest_ind on two small lists of numbers you invent and decide, using alpha = 0.05, whether to reject the null.

🏠 Homework

  1. Find or invent two groups of measurements (e.g. test scores before vs after a change). Run a t-test, report the p-value, state your decision at alpha = 0.05, and write one sentence on what it means in plain language.