A/B Significance
Test whether the difference between two conversion rates is statistically significant using a two‑proportion z‑test. Enter visitors and conversions for variants A and B, along with the alpha (significance) threshold.
What is A/B significance?
Significance tests estimate the likelihood that an observed difference is due to chance rather than a real effect. A two‑proportion z‑test compares the conversion rates of two variants.
How to use this calculator
Provide visitors and conversions for A and B, and choose an alpha (commonly 5%). The tool returns conversion rates, lifts, p‑value, and significance.
Why it matters
Significance guards against false positives from random noise, helping you ship changes that truly improve outcomes.
Assumptions & limitations
Assumes independent samples and sample sizes large enough for the normal approximation to hold. For rare events or small samples, consider exact tests (e.g., Fisher's exact test) or Bayesian approaches.
Formula and interpretation
Let CRA = conversionsA ÷ visitorsA and CRB = conversionsB ÷ visitorsB. The pooled rate is p = (conversionsA + conversionsB) ÷ (visitorsA + visitorsB). The standard error is SE = √( p(1 − p) (1/visitorsA + 1/visitorsB) ). The test statistic is z = (CRB − CRA) ÷ SE. The two‑tailed p‑value is the probability of observing a difference at least as extreme as the one measured under the null hypothesis (no true difference).
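The formula maps directly to code. Here is a minimal Python sketch using only the standard library (the function name and signature are illustrative, not the calculator's actual implementation):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, vis_a, conv_b, vis_b):
    """Two-tailed two-proportion z-test on conversion rates."""
    cr_a = conv_a / vis_a
    cr_b = conv_b / vis_b
    p = (conv_a + conv_b) / (vis_a + vis_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / vis_a + 1 / vis_b))  # standard error
    z = (cr_b - cr_a) / se
    # Two-tailed p-value from the standard normal CDF, Phi(x) = 0.5*(1+erf(x/sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

Using `erf` avoids any third-party dependency; in practice a statistics library (e.g., SciPy or statsmodels) provides equivalent, better-tested routines.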
Sample size, power, and MDE
Statistical power is the probability of detecting a true effect. Larger samples increase power and reduce the minimum detectable effect (MDE). Define a practical lift you care about, pick α (e.g., 5%) and desired power (e.g., 80%), then estimate the required sample size. Avoid underpowered tests that frequently yield inconclusive results.
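The sizing step can be sketched with the classic normal-approximation formula, again using only Python's standard library (the helper name and the example rates are illustrative assumptions):

```python
from statistics import NormalDist

def sample_size_per_variant(base_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-tailed
    two-proportion z-test, via the normal-approximation formula."""
    p1 = base_rate
    p2 = base_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1  # round up to whole visitors

# Detecting a +0.5 pp absolute lift from a 2.5% baseline
# at alpha = 5% and 80% power needs roughly 17,000 visitors per variant.
n = sample_size_per_variant(0.025, 0.005)
```

Note how the required sample grows with the square of a shrinking MDE: halving the detectable lift roughly quadruples the traffic needed.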
One‑tailed vs two‑tailed tests
Two‑tailed tests detect differences in either direction and are the safe default when a change could help or hurt. One‑tailed tests can be used when you only care about improvements in one direction, but the direction must be pre‑registered and not changed mid‑test to avoid bias.
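A small illustration of how the two choices differ on the same data: a z statistic near 1.8 (a made-up value, not from the calculator) clears the 5% bar one‑tailed but not two‑tailed:

```python
from math import sqrt, erf

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

z = 1.8                        # hypothetical test statistic
p_one = normal_sf(z)           # one-tailed: only "B beats A" counts
p_two = 2 * normal_sf(abs(z))  # two-tailed: a difference either way counts
# p_one is about 0.036 (< 0.05); p_two is about 0.072 (> 0.05)
```

This is exactly why switching tails after seeing the data is biased: it lets a borderline result be re-labeled as significant.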
Peeking and multiple comparisons
Repeatedly checking significance before the planned sample size is reached (“peeking”) inflates the false‑positive rate. Use fixed horizons or sequential methods designed for interim looks. When testing many variants or metrics, apply a correction such as Bonferroni (controls the family‑wise error rate) or Benjamini‑Hochberg (controls the false discovery rate).
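Both corrections are simple to apply to a list of p‑values. A plain-Python sketch of each (for illustration; a statistics library's implementations are preferable in production):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m (family-wise error control)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (false discovery rate control):
    sort p-values, find the largest rank k with p_(k) <= k*q/m, and reject
    the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Bonferroni is stricter (fewer rejections, fewer false positives); Benjamini‑Hochberg trades a controlled share of false discoveries for more power across many tests.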
Step‑by‑step example
With A = 500 conversions / 20,000 visitors (2.50%) and B = 600 / 20,000 (3.00%), the absolute lift is +0.50 percentage points and the relative lift is +20%. The z‑test gives z ≈ 3.06 and a two‑tailed p ≈ 0.002; since p < 0.05, the result is statistically significant.
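The example can be reproduced end to end in a few lines of standard-library Python (a sketch, not the calculator's actual code):

```python
from math import sqrt, erf

vis_a, conv_a = 20_000, 500
vis_b, conv_b = 20_000, 600

cr_a, cr_b = conv_a / vis_a, conv_b / vis_b     # 2.50% and 3.00%
abs_lift = cr_b - cr_a                          # +0.005, i.e. +0.50 pp
rel_lift = abs_lift / cr_a                      # +0.20, i.e. +20%

p = (conv_a + conv_b) / (vis_a + vis_b)         # pooled rate
se = sqrt(p * (1 - p) * (1 / vis_a + 1 / vis_b))
z = abs_lift / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed
```

Running this yields z of roughly 3.06 and a p‑value of roughly 0.002, below the 0.05 threshold.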
Best practices
- Define success metrics and stopping rules before launching tests.
- Segment by traffic source and device to check consistency.
- Run long enough to cover typical weekly seasonality.
- Focus on practical significance (impact, cost) alongside p‑values.
- Validate winners with follow‑up experiments or rollout monitoring.