Day 8: Comparing Two Groups — The t-Test
The Problem
Dr. Sofia Reyes is a cancer biologist studying BRCA1 expression in breast tissue. She has RNA-seq data from 12 tumor samples and 12 matched normal samples from the same patients. The mean BRCA1 expression in tumors is 4.2 log2-CPM versus 6.8 log2-CPM in normals, a difference of 2.6 log2 units (roughly a six-fold reduction). But with only 12 samples per group and considerable biological variability, can she confidently claim BRCA1 is downregulated in tumors?
She cannot use a z-test because the population standard deviation is unknown — she must estimate it from the data itself. She needs the t-test, the most widely used statistical test in biomedical research. But which version? Her samples are paired (tumor and normal from the same patient), which adds another consideration. And before running any test, she should verify that the data meet the test’s assumptions.
This chapter covers the t-test in all its forms: independent, Welch’s, paired, and one-sample. You will learn when each is appropriate, how to check assumptions, and how to quantify the magnitude of differences with Cohen’s d.
What Is the t-Test?
The t-test asks: “Is the difference between two group means larger than what we would expect from random sampling variation alone?”
Think of it this way: you have two piles of measurements. The t-test weighs how far apart the piles’ centers are, relative to how spread out each pile is. If the piles are far apart and tight, the difference is convincing. If they overlap substantially, it is not.
The t-statistic = (difference in means) / (standard error of the difference)
A larger t-statistic means more evidence of a real difference.
The Four Flavors of t-Test
| Test | When to Use | Formula |
|---|---|---|
| One-sample | Compare sample mean to a known value | t = (x-bar - mu0) / (s / sqrt(n)) |
| Independent two-sample | Compare means of two unrelated groups | t = (x-bar1 - x-bar2) / (s_p x sqrt(1/n1 + 1/n2)) |
| Welch’s | Two unrelated groups, unequal variances | t = (x-bar1 - x-bar2) / sqrt(s1^2/n1 + s2^2/n2) |
| Paired | Matched or before/after measurements | t = d-bar / (s_d / sqrt(n)) |
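To make one row of the table concrete, here is a sketch in Python with scipy that applies the one-sample formula by hand and checks it against the library; the GC-content values are the ones used in the one-sample BioLang example later in this chapter.

```python
import numpy as np
from scipy import stats

# Per-contig GC content (%) from the one-sample example later in the chapter
gc = np.array([40.2, 41.5, 39.8, 42.1, 40.7, 41.3, 39.5, 42.4,
               40.1, 41.8, 40.5, 41.0, 39.9, 41.6, 40.3])

# One-sample formula by hand: t = (x-bar - mu0) / (s / sqrt(n))
mu0 = 41.0
t_manual = (gc.mean() - mu0) / (gc.std(ddof=1) / np.sqrt(len(gc)))

# scipy's built-in one-sample t-test should agree
result = stats.ttest_1samp(gc, popmean=mu0)
print(f"t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

The hand-computed statistic and scipy's match exactly; the remaining rows differ only in how the standard error in the denominator is estimated.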
Independent Two-Sample t-Test
Assumptions
- Independence: Observations within and between groups are independent
- Normality: Data in each group are approximately normally distributed
- Equal variances: Both groups have similar spread (homoscedasticity)
The Pooled Standard Error
When variances are assumed equal, we pool them for a better estimate:
s_p = sqrt(((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2))
Degrees of freedom: df = n1 + n2 - 2
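The pooling step is easy to verify numerically. Here is a sketch in Python with scipy, using the chapter's BRCA1 data; the hand-computed statistic should match scipy's pooled (equal-variance) test.

```python
import numpy as np
from scipy import stats

# BRCA1 expression (log2-CPM) from the chapter's example
tumor = np.array([3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5])
normal = np.array([6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1])

n1, n2 = len(tumor), len(normal)
s1_sq, s2_sq = tumor.var(ddof=1), normal.var(ddof=1)

# Pooled SD: s_p = sqrt(((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2))
s_p = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

# t = (x-bar1 - x-bar2) / (s_p * sqrt(1/n1 + 1/n2)), df = n1 + n2 - 2
t_manual = (tumor.mean() - normal.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2  # 22 here

# scipy's pooled test should give the same statistic
t_scipy, p = stats.ttest_ind(tumor, normal, equal_var=True)
```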
Welch’s t-Test: The Safer Default
Welch’s t-test does not assume equal variances. It uses each group’s own variance estimate and adjusts the degrees of freedom downward with the Welch-Satterthwaite equation.
Key insight: Welch’s t-test is almost always the better default choice. It performs nearly as well as the pooled t-test when variances ARE equal, and much better when they are not. Most modern statistical software (including R’s t.test()) uses Welch’s version by default.
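The downward adjustment can be checked directly: a sketch in Python that computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand on the chapter's BRCA1 data and compares against scipy's Welch test.

```python
import numpy as np
from scipy import stats

tumor = np.array([3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5])
normal = np.array([6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1])

n1, n2 = len(tumor), len(normal)
v1 = tumor.var(ddof=1) / n1    # s1^2 / n1
v2 = normal.var(ddof=1) / n2   # s2^2 / n2

# Welch's t uses each group's own variance estimate
t_manual = (tumor.mean() - normal.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (always <= n1 + n2 - 2)
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# scipy's Welch test should agree on the statistic
t_scipy, p = stats.ttest_ind(tumor, normal, equal_var=False)
print(f"Welch t = {t_scipy:.4f}, df = {df_welch:.1f}")
```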
Paired t-Test: Matched Samples
When observations are naturally paired — tumor/normal from the same patient, before/after treatment on the same subject — the paired t-test is far more powerful because it controls for inter-subject variability.
The trick: compute the difference for each pair, then perform a one-sample t-test on the differences:
t = d-bar / (s_d / sqrt(n))
Where d-bar is the mean of the paired differences and s_d is their standard deviation.
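This equivalence is easy to verify: a short Python sketch with scipy, using the tumor-volume data from the paired example later in the chapter, shows that the paired test and a one-sample test on the differences give identical results.

```python
import numpy as np
from scipy import stats

# Tumor volumes (mm^3) from the chapter's paired example
before = np.array([245, 312, 198, 367, 289, 421, 156, 334, 278, 305])
after = np.array([180, 245, 165, 298, 220, 350, 132, 270, 210, 248])

# The paired t-test...
t_paired, p_paired = stats.ttest_rel(before, after)

# ...is exactly a one-sample t-test on the per-pair differences against 0
diffs = before - after
t_onesample, p_onesample = stats.ttest_1samp(diffs, popmean=0.0)
```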
| Design | Pairing | Correct Test |
|---|---|---|
| Tumor vs normal from same patient | Paired | Paired t-test |
| Drug vs placebo in different patients | Independent | Welch’s t-test |
| Before vs after treatment, same patients | Paired | Paired t-test |
| Wild-type vs knockout mice | Independent | Welch’s t-test |
| Left eye vs right eye of same individuals | Paired | Paired t-test |
Common pitfall: Using an independent t-test on paired data wastes statistical power. If you have natural pairs, always use the paired test. Conversely, running a paired test on unpaired data is invalid, because it treats unrelated observations as if they were matched.
Checking Assumptions
Normality: Shapiro-Wilk Test
The Shapiro-Wilk test checks whether the data could plausibly have come from a normal distribution.
- H0: Data are normally distributed
- If p > 0.05, there is no significant evidence against normality; the assumption is reasonable
- If p < 0.05, the data deviate significantly from normality
Keep in mind that the test has little power at small sample sizes and flags trivial deviations at large ones, so complement it with a QQ plot: if the points fall along the diagonal line, the data are approximately normal.
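BioLang has no built-in Shapiro-Wilk (the assumption-checking code later in the chapter falls back on QQ plots), but scipy provides one. A hedged sketch on synthetic data, where the seed and sample sizes are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
gaussian = rng.normal(loc=5.0, scale=1.0, size=100)
skewed = rng.exponential(scale=2.0, size=100)  # clearly non-normal

# H0: the data are normally distributed
w_g, p_g = stats.shapiro(gaussian)
w_s, p_s = stats.shapiro(skewed)
print(f"Gaussian sample:    p = {p_g:.3f}")
print(f"Exponential sample: p = {p_s:.2e}")
```

The exponential sample is rejected decisively, while the genuinely normal sample typically is not.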
Equal Variances: Levene’s Test
Levene’s test checks whether two groups have equal variances.
- H0: Variances are equal
- If p > 0.05, equal variance assumption is reasonable
- If p < 0.05, use Welch’s t-test (or just always use Welch’s)
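Again there is no BioLang built-in, but scipy's levene covers it. A sketch on the chapter's BRCA1 data, alongside the rough variance-ratio heuristic the BioLang assumption-checking code uses later:

```python
import numpy as np
from scipy import stats

tumor = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

# H0: the two groups have equal variances
stat, p = stats.levene(tumor, normal)
print(f"Levene's W = {stat:.3f}, p = {p:.3f}")

# Rough rule of thumb: a sample-variance ratio within about 2x is unremarkable
ratio = np.var(tumor, ddof=1) / np.var(normal, ddof=1)
print(f"Variance ratio (tumor/normal) = {ratio:.3f}")
```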
Cohen’s d: Quantifying Effect Size
A p-value tells you whether a difference is statistically detectable. Cohen’s d tells you how large it is, in standard deviation units:
d = (x-bar1 - x-bar2) / s_pooled
| Cohen’s d | Interpretation | Biological Example |
|---|---|---|
| 0.2 | Small | Subtle expression change |
| 0.5 | Medium | Moderate drug effect |
| 0.8 | Large | Strong phenotypic difference |
| > 1.2 | Very large | Knockout vs wild-type |
Key insight: A large p-value with a large Cohen’s d suggests you are underpowered — you may have a real effect but too few samples to detect it. A small p-value with a tiny Cohen’s d suggests the effect, while real, may not be biologically meaningful.
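The underpowered case is easy to reproduce with made-up numbers: in this hypothetical three-per-group pilot (values invented purely for illustration), the effect size is large yet the test cannot reach significance.

```python
import numpy as np
from scipy import stats

# Hypothetical pilot data, n = 3 per group (invented for illustration)
group_a = np.array([5.0, 6.0, 5.5])
group_b = np.array([6.0, 7.0, 6.2])

# Welch's t-test: with n this small, power is very low
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Cohen's d with the simple equal-n pooled SD
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"p = {p:.3f}, Cohen's d = {d:.2f}")  # |d| > 0.8, yet p > 0.05
```

A large |d| with a non-significant p is the signature of an underpowered study: collect more samples before concluding there is no effect.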
The t-Test in BioLang
Independent Two-Sample t-Test: Gene Expression
# BRCA1 expression (log2-CPM) in tumor vs normal breast tissue
let tumor = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
let normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]
# Default: Welch's t-test (unequal variances)
let result = ttest(tumor, normal)
print("=== Welch's t-test: BRCA1 Tumor vs Normal ===")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Degrees of freedom: {result.df:.1}")
print("Mean tumor: {mean(tumor):.2}, Mean normal: {mean(normal):.2}")
print("Difference: {mean(tumor) - mean(normal):.2} log2-CPM")
# Effect size (Cohen's d inline)
let d = (mean(tumor) - mean(normal)) / sqrt((variance(tumor) + variance(normal)) / 2.0)
print("Cohen's d: {d:.3}")
# Visualize
let bp_table = table({"Tumor": tumor, "Normal": normal})
boxplot(bp_table, {title: "BRCA1 Expression: Tumor vs Normal"})
Checking Assumptions
let tumor = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
let normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]
# 1. Normality check: use QQ plots for visual assessment
# (no built-in Shapiro-Wilk; use QQ plots + summary stats)
let s_tumor = summary(tumor)
let s_normal = summary(normal)
print("Tumor summary: {s_tumor}")
print("Normal summary: {s_normal}")
# 2. Equal variance check: compare variances from summary()
let var_ratio = variance(tumor) / variance(normal)
print("Variance ratio (tumor/normal): {var_ratio:.3}")
if var_ratio > 2.0 or var_ratio < 0.5 {
print("Variances appear unequal -> use Welch's t-test (the default)")
} else {
print("Variances appear similar -> pooled t-test is also valid")
}
# 3. QQ plots for visual normality assessment
qq_plot(tumor, {title: "QQ Plot: Tumor BRCA1 Expression"})
qq_plot(normal, {title: "QQ Plot: Normal BRCA1 Expression"})
Paired t-Test: Before/After Treatment
# Tumor volume (mm^3) before and after 6 weeks of treatment
# Same 10 patients measured at both time points
let before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
let after = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]
# Paired t-test: accounts for patient-to-patient variability
let result = ttest_paired(before, after)
print("=== Paired t-test: Tumor Volume Before vs After Treatment ===")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.6}")
# Show the paired differences
let diffs = zip(before, after) |> map(|pair| pair[0] - pair[1])
print("Mean reduction: {mean(diffs):.1} mm^3")
print("Individual reductions: {diffs}")
# Compare: what if we wrongly used an independent t-test?
let wrong_result = ttest(before, after)
print("\nWrong (independent) t-test p-value: {wrong_result.p_value:.6}")
print("Correct (paired) t-test p-value: {result.p_value:.6}")
print("Paired test is more powerful because it removes inter-patient variability")
# Visualize paired differences
histogram(diffs, {title: "Distribution of Tumor Volume Reductions", x_label: "Reduction (mm^3)", bins: 8})
One-Sample t-Test
# Is the GC content of our assembled genome different from the expected 41%?
let gc_per_contig = [40.2, 41.5, 39.8, 42.1, 40.7, 41.3, 39.5, 42.4,
40.1, 41.8, 40.5, 41.0, 39.9, 41.6, 40.3]
let result = ttest_one(gc_per_contig, 41.0)
print("=== One-sample t-test: GC Content vs Expected 41% ===")
print("Sample mean: {mean(gc_per_contig):.2}%")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.4}")
if result.p_value > 0.05 {
print("No significant deviation from expected GC content")
}
Complete Workflow: Multiple Genes
# Test multiple genes at once
let genes = ["BRCA1", "TP53", "MYC", "GAPDH", "EGFR"]
let tumor_expr = [
[3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5],
[2.1, 1.8, 2.5, 1.4, 2.2, 1.9, 2.0, 1.6, 2.3, 1.7, 2.4, 1.5],
[9.2, 10.1, 8.8, 9.5, 10.3, 9.7, 8.6, 9.9, 10.5, 9.1, 9.8, 10.2],
[8.1, 8.3, 7.9, 8.2, 8.0, 8.4, 7.8, 8.1, 8.3, 8.0, 8.2, 7.9],
[7.5, 8.2, 7.8, 8.5, 7.1, 8.0, 7.6, 8.3, 7.9, 8.1, 7.4, 8.4]
]
let normal_expr = [
[6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1],
[5.8, 6.2, 5.5, 6.0, 5.9, 6.3, 5.7, 6.1, 5.6, 6.4, 5.8, 6.0],
[5.1, 5.4, 4.9, 5.3, 5.6, 5.0, 5.2, 5.5, 4.8, 5.7, 5.1, 5.4],
[8.0, 8.2, 7.8, 8.3, 8.1, 8.0, 8.4, 7.9, 8.2, 8.1, 8.3, 8.0],
[5.0, 5.3, 4.8, 5.1, 5.5, 4.9, 5.2, 5.4, 4.7, 5.6, 5.0, 5.3]
]
print("Gene | t-stat | p-value | Cohen's d | Interpretation")
print("-----------|--------|------------|-----------|---------------")
for i in 0..len(genes) {
let result = ttest(tumor_expr[i], normal_expr[i])
let d = (mean(tumor_expr[i]) - mean(normal_expr[i])) / sqrt((variance(tumor_expr[i]) + variance(normal_expr[i])) / 2.0)
let interp = if abs(d) > 0.8 then "Large" else if abs(d) > 0.5 then "Medium" else "Small"
print("{genes[i]:<10} | {result.statistic:>6.2} | {result.p_value:>10.2e} | {d:>9.3} | {interp}")
}
Python:
from scipy import stats
import numpy as np
tumor = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]
# Welch's t-test (default)
t, p = stats.ttest_ind(tumor, normal, equal_var=False)
print(f"Welch's t = {t:.4f}, p = {p:.2e}")
# Paired t-test
before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
after = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]
t, p = stats.ttest_rel(before, after)
print(f"Paired t = {t:.4f}, p = {p:.6f}")
# Cohen's d (manual)
pooled_std = np.sqrt((np.std(tumor, ddof=1)**2 + np.std(normal, ddof=1)**2) / 2)
d = (np.mean(tumor) - np.mean(normal)) / pooled_std
print(f"Cohen's d = {d:.3f}")
R:
tumor <- c(3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5)
normal <- c(6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1)
# Welch's t-test (default in R)
t.test(tumor, normal)
# Paired t-test
before <- c(245, 312, 198, 367, 289, 421, 156, 334, 278, 305)
after <- c(180, 245, 165, 298, 220, 350, 132, 270, 210, 248)
t.test(before, after, paired = TRUE)
# Cohen's d
library(effsize)
cohen.d(tumor, normal)
Exercises
Exercise 1: Choose the Right t-Test
For each scenario, state which t-test variant is appropriate and why:
a) Comparing white blood cell counts between 20 patients with sepsis and 25 healthy volunteers
b) Measuring gene expression in liver biopsies taken before and after drug treatment (same 15 patients)
c) Testing whether mean read length from your sequencer matches the expected 150 bp
Exercise 2: Full t-Test Workflow
Hemoglobin levels (g/dL) in two groups:
- Anemia patients: [9.2, 8.8, 10.1, 9.5, 8.3, 9.7, 8.6, 9.0, 10.3, 8.9]
- Healthy controls: [13.5, 14.2, 12.8, 13.9, 14.5, 13.1, 14.0, 13.6, 12.9, 14.3]
let anemia = [9.2, 8.8, 10.1, 9.5, 8.3, 9.7, 8.6, 9.0, 10.3, 8.9]
let healthy = [13.5, 14.2, 12.8, 13.9, 14.5, 13.1, 14.0, 13.6, 12.9, 14.3]
# TODO: 1. Check normality with qq_plot() on each group
# TODO: 2. Check equal variances by comparing variance() per group
# TODO: 3. Run the appropriate t-test with ttest()
# TODO: 4. Compute Cohen's d inline: (mean(a)-mean(b)) / sqrt((variance(a)+variance(b))/2)
# TODO: 5. Create a boxplot
# TODO: 6. Interpret results in a biological context
Exercise 3: Paired vs Independent
Run both a paired and independent t-test on the tumor volume data below. Compare the p-values and explain why they differ.
let before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
let after = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]
# TODO: Run ttest_paired and ttest
# TODO: Which gives a smaller p-value? Why?
# TODO: What does the paired test "remove" that the independent test cannot?
Exercise 4: When Assumptions Fail
The following data are highly skewed (as often seen in cytokine measurements):
let treatment = [2.1, 1.8, 45.2, 3.5, 2.9, 1.2, 38.7, 4.1, 2.3, 1.5]
let control = [0.8, 0.5, 0.9, 0.3, 1.1, 0.7, 0.4, 0.6, 1.0, 0.2]
# TODO: Test normality with qq_plot()
# TODO: Run the t-test with ttest() anyway — what does it say?
# TODO: Try log-transforming the data and re-testing
# TODO: Preview: tomorrow we'll learn non-parametric alternatives
Key Takeaways
- The t-test compares two group means, accounting for variability and sample size
- Welch’s t-test (unequal variances) should be the default — it is robust even when variances are equal
- Paired t-tests are more powerful when observations are naturally matched (same patient, same timepoint)
- Always check assumptions: Shapiro-Wilk for normality, Levene’s for equal variances, QQ plots for visual inspection
- Cohen’s d quantifies effect size independently of sample size: 0.2 = small, 0.5 = medium, 0.8 = large
- A significant t-test with a small Cohen’s d may not be biologically meaningful
- A non-significant t-test with a large Cohen’s d suggests you need more samples
What’s Next
What happens when your data violate the normality assumption? Cytokine levels, bacterial abundances, and many other biological measurements are wildly skewed. Tomorrow we introduce non-parametric tests — rank-based alternatives to the t-test that make no assumptions about the shape of your data distribution.