T-Test
A statistical hypothesis test that compares the means of two groups to determine if they are significantly different, using the t-distribution to account for uncertainty in small to moderate sample sizes.
What Is a T-Test?
The t-test compares two group means and tells you whether their difference is larger than you would expect from noise alone. It is like a z-test but uses the t-distribution, whose heavier tails account for the extra uncertainty of estimating the standard deviation from the sample rather than knowing it exactly.
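To make the heavier tails concrete, here is a small sketch (assuming SciPy is available) comparing the two-sided 95% critical value of the t-distribution against the normal's 1.96 as sample size shrinks:

```python
from scipy.stats import t, norm

# Two-sided 95% critical values: the t cutoff exceeds 1.96 at small df
# and converges to the normal cutoff as df grows.
z_crit = norm.ppf(0.975)            # ~1.96, the z-test cutoff
for df in (5, 30, 1000):
    t_crit = t.ppf(0.975, df)
    print(f"df={df:>4}: t critical = {t_crit:.3f} (z = {z_crit:.3f})")
```

At 5 degrees of freedom the cutoff is noticeably above 2.5; by df = 1000 it is essentially the normal's 1.96.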
Also Known As
- Data science teams: Student's t, Welch's t-test, independent-samples t
- Growth teams: significance test for revenue lifts
- Marketing teams: "the test we use for time on page"
- Engineering teams: t-statistic, two-sample t
How It Works
Imagine running an A/B test with 10,000 visitors per variant measuring average session duration. Variant A has a mean of 120 seconds with SD 90; Variant B has a mean of 125 seconds with SD 92. The standard error of the difference is about 1.29 seconds, giving a t-statistic of roughly 3.9. With ~20,000 degrees of freedom, this is far beyond the 1.96 critical value (two-sided α = 0.05), so p < 0.001. The difference is statistically real, but whether 5 seconds is practically meaningful is a product decision, not a statistical one.
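The arithmetic in this example takes only a few lines; a plain-Python sketch using the numbers above:

```python
import math

n = 10_000                      # visitors per variant
mean_a, sd_a = 120.0, 90.0      # Variant A: mean session duration (s), SD
mean_b, sd_b = 125.0, 92.0      # Variant B

# Standard error of the difference between the two sample means
se = math.sqrt(sd_a**2 / n + sd_b**2 / n)   # ~1.29 seconds
t_stat = (mean_b - mean_a) / se             # ~3.9
print(f"SE = {se:.2f}, t = {t_stat:.2f}")
```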
Best Practices
- Do use Welch's t-test by default; it does not assume equal variances.
- Do log-transform or winsorize heavy-tailed metrics before running a t-test.
- Do pair t-tests with confidence intervals on the difference, not just p-values.
- Do not use t-tests on binary conversion data; a chi-squared or proportions z-test is more appropriate.
- Do not rely on t-tests when n < 30 per group for skewed metrics; use bootstrap or Mann-Whitney.
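To show what Welch's variant actually computes, here is a minimal plain-Python sketch (in practice you would likely reach for `scipy.stats.ttest_ind(a, b, equal_var=False)`); the toy data is purely illustrative:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t-statistic and Welch-Satterthwaite degrees of
    freedom; makes no equal-variance assumption."""
    va, vb = variance(a) / len(a), variance(b) / len(b)   # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, df)   # -1.0, 8.0 for this toy data
```

When the two variances happen to be equal, Welch's result matches the classic Student's t, so defaulting to it costs essentially nothing.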
Common Mistakes
- Running t-tests on revenue data without handling outliers, producing unstable results.
- Mismatching the test to the design (a paired t-test on independent observations, or an unpaired one on paired data), which distorts the p-value.
- Confusing statistical significance with practical significance on large samples.
Industry Context
- SaaS/B2B: Engagement metrics (sessions, feature use) fit t-tests well after transformation.
- Ecommerce/DTC: Average order value is the classic t-test target, but needs outlier handling.
- Lead gen/services: Time-to-respond metrics are usually right-skewed and need log transforms.
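For right-skewed metrics like time-to-respond, a common pattern is to compare means on the log scale; a sketch with a hypothetical sample (`log1p`, i.e. log(1 + x), also handles zeros):

```python
import math

# Hypothetical right-skewed response times in minutes, with a long tail
response_times = [2, 3, 3, 4, 5, 5, 6, 8, 45, 120]

logged = [math.log1p(x) for x in response_times]   # tames the heavy tail

raw_mean = sum(response_times) / len(response_times)
log_mean = sum(logged) / len(logged)
print(f"raw mean = {raw_mean:.1f} min, mean of log1p = {log_mean:.2f}")
# The t-test would then be run on the logged values from each group.
```

On the raw scale the two outliers dominate the mean; on the log scale every observation contributes at a comparable magnitude.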
The Behavioral Science Connection
The t-distribution's fatter tails encode epistemic humility — acknowledging that small samples leave us uncertain about the true spread. This mirrors the "illusion of validity" Kahneman describes, where practitioners mistake pattern detection in small samples for real signal. The t-test's penalty for small n formalizes the wisdom of skepticism.
Key Takeaway
Use the t-test for comparing means; use Welch's variant as the safe default, and always pair it with a confidence interval.
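A minimal sketch of the recommended confidence interval on the difference, reusing the session-duration numbers from the worked example (large-sample interval with the normal 1.96 cutoff; at small n you would substitute a t critical value):

```python
import math

n = 10_000
mean_a, sd_a = 120.0, 90.0
mean_b, sd_b = 125.0, 92.0

diff = mean_b - mean_a
se = math.sqrt(sd_a**2 / n + sd_b**2 / n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se   # ~(2.5, 7.5) seconds
print(f"difference = {diff:.1f}s, 95% CI = ({lo:.1f}, {hi:.1f})")
```

The interval answers the question the p-value cannot: not just "is there a difference?" but "how large could it plausibly be?"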