Day 8: Comparing Two Groups — The t-Test

The Problem

Dr. Sofia Reyes is a cancer biologist studying BRCA1 expression in breast tissue. She has RNA-seq data from 12 tumor samples and 12 matched normal samples from the same patients. The mean BRCA1 expression in tumors is 4.2 log2-CPM versus 6.8 log2-CPM in normals: a difference of 2.6 log2 units, which corresponds to roughly a 6-fold reduction (2^2.6 ≈ 6.1). But with only 12 samples per group and considerable biological variability, can she confidently claim BRCA1 is downregulated in tumors?

She cannot use a z-test because the population standard deviation is unknown — she must estimate it from the data itself. She needs the t-test, the most widely used statistical test in biomedical research. But which version? Her samples are paired (tumor and normal from the same patient), which adds another consideration. And before running any test, she should verify that the data meet the test’s assumptions.

This chapter covers the t-test in all its forms: independent, Welch’s, paired, and one-sample. You will learn when each is appropriate, how to check assumptions, and how to quantify the magnitude of differences with Cohen’s d.

What Is the t-Test?

The t-test asks: “Is the difference between two group means larger than what we would expect from random sampling variation alone?”

Think of it this way: you have two piles of measurements. The t-test weighs how far apart the piles’ centers are, relative to how spread out each pile is. If the piles are far apart and tight, the difference is convincing. If they overlap substantially, it is not.

The t-statistic = (difference in means) / (standard error of the difference)

A larger t-statistic means more evidence of a real difference.
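To make the "far apart and tight" idea concrete, here is a minimal Python sketch using only the standard library. The helper name `t_statistic` and the toy values are invented for illustration, and the standard error uses the unpooled form (each group's own variance), which is only one of the variants this chapter covers.

```python
import math
import statistics

def t_statistic(a, b):
    """Difference in means divided by the standard error of that difference."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical piles: both pairs differ by 2 units on average,
# but the first pair is tight and the second is spread out.
tight_a, tight_b = [9.9, 10.0, 10.1], [11.9, 12.0, 12.1]
wide_a, wide_b = [8.0, 10.0, 12.0], [10.0, 12.0, 14.0]

t_tight = t_statistic(tight_a, tight_b)
t_wide = t_statistic(wide_a, wide_b)
print(f"tight piles: t = {t_tight:.2f}")
print(f"wide piles:  t = {t_wide:.2f}")
```

The mean difference is identical in both cases; only the spread changes, and with it the size of the t-statistic.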

[Figure: Easy vs hard to distinguish: overlap determines significance. Large effect (easy): Groups A and B have a large d, small overlap, and a small p-value. Small effect (hard): Groups A and B have a small d, large overlap, and a large p-value. Cohen's d = difference in means / pooled standard deviation.]

The Four Flavors of t-Test

Test                   | When to Use                             | Formula
-----------------------|-----------------------------------------|------------------------------------------------------
One-sample             | Compare sample mean to a known value    | t = (x-bar - mu0) / (s / sqrt(n))
Independent two-sample | Compare means of two unrelated groups   | t = (x-bar1 - x-bar2) / (s_p * sqrt(1/n1 + 1/n2))
Welch’s                | Two unrelated groups, unequal variances | t = (x-bar1 - x-bar2) / sqrt(s1^2/n1 + s2^2/n2)
Paired                 | Matched or before/after measurements    | t = d-bar / (s_d / sqrt(n))
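As a worked instance of the one-sample formula, the sketch below computes t by hand in plain Python for the per-contig GC values used in the one-sample example later in this chapter. The variable names are illustrative, and only the statistic is computed here; the p-value would come from a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics

# GC content (%) per contig, from the chapter's one-sample example
gc_per_contig = [40.2, 41.5, 39.8, 42.1, 40.7, 41.3, 39.5, 42.4,
                 40.1, 41.8, 40.5, 41.0, 39.9, 41.6, 40.3]
mu0 = 41.0  # expected GC content (%)

n = len(gc_per_contig)
x_bar = statistics.mean(gc_per_contig)
s = statistics.stdev(gc_per_contig)          # sample SD (n - 1 denominator)

t_one = (x_bar - mu0) / (s / math.sqrt(n))   # t = (x-bar - mu0) / (s / sqrt(n))
df_one = n - 1
print(f"mean = {x_bar:.2f}, t = {t_one:.3f}, df = {df_one}")
```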

Independent Two-Sample t-Test

Assumptions

  1. Independence: Observations within and between groups are independent
  2. Normality: Data in each group are approximately normally distributed
  3. Equal variances: Both groups have similar spread (homoscedasticity)

The Pooled Standard Error

When variances are assumed equal, we pool them for a better estimate:

s_p = sqrt(((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2))

Degrees of freedom: df = n1 + n2 - 2
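The pooled formulas can be checked by hand. The following sketch applies them in plain Python to the BRCA1 tumor and normal values from the BioLang example later in this chapter; the variable names are illustrative.

```python
import math
import statistics

tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

n1, n2 = len(tumor), len(normal)
var1, var2 = statistics.variance(tumor), statistics.variance(normal)

# s_p = sqrt(((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2))
pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Standard error of the difference and the pooled t-statistic
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
t_pooled = (statistics.mean(tumor) - statistics.mean(normal)) / se_diff
df_pooled = n1 + n2 - 2

print(f"s_p = {pooled_sd:.4f}, t = {t_pooled:.2f}, df = {df_pooled}")
```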

Welch’s t-Test: The Safer Default

Welch’s t-test does not assume equal variances. It uses each group’s own variance estimate and adjusts the degrees of freedom downward with the Welch-Satterthwaite equation.

Key insight: Welch’s t-test is almost always the better default choice. It performs nearly as well as the pooled t-test when variances ARE equal, and much better when they are not. Software defaults differ, so check them: R’s t.test() uses Welch’s version by default, whereas SciPy’s ttest_ind() assumes equal variances unless you pass equal_var=False.
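The Welch-Satterthwaite adjustment is easy to compute directly. Below is a small sketch in plain Python, applied to the chapter's BRCA1 values; the function name `welch_df` is made up for this example.

```python
import statistics

def welch_df(a, b):
    """Welch-Satterthwaite approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)   # s1^2 / n1
    vb = statistics.variance(b) / len(b)   # s2^2 / n2
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

df_welch = welch_df(tumor, normal)
df_pooled = len(tumor) + len(normal) - 2
print(f"Welch df = {df_welch:.2f} (pooled df = {df_pooled})")
```

Because the two sample variances differ somewhat, the Welch degrees of freedom come out a little below the pooled value of 22, which makes the test slightly more conservative.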

Paired t-Test: Matched Samples

When observations are naturally paired — tumor/normal from the same patient, before/after treatment on the same subject — the paired t-test is far more powerful because it controls for inter-subject variability.

The trick: compute the difference for each pair, then perform a one-sample t-test on the differences:

t = d-bar / (s_d / sqrt(n))

Where d-bar is the mean of the paired differences and s_d is their standard deviation.
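The "differences first, then a one-sample test" recipe looks like this in plain Python. The sketch uses the before/after tumor volumes from the paired example later in this chapter, and the variable names are illustrative.

```python
import math
import statistics

# Tumor volume (mm^3) for the same 10 patients, before and after treatment
before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
after  = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]

# Step 1: one difference per patient
diffs = [b - a for b, a in zip(before, after)]

# Step 2: one-sample t-test of the differences against zero
d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)
t_paired = d_bar / (s_d / math.sqrt(len(diffs)))

print(f"mean reduction = {d_bar:.1f} mm^3, t = {t_paired:.2f}, df = {len(diffs) - 1}")
```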

Design                                     | Pairing     | Correct Test
-------------------------------------------|-------------|---------------
Tumor vs normal from same patient          | Paired      | Paired t-test
Drug vs placebo in different patients      | Independent | Welch’s t-test
Before vs after treatment, same patients   | Paired      | Paired t-test
Wild-type vs knockout mice                 | Independent | Welch’s t-test
Left eye vs right eye of same individuals  | Paired      | Paired t-test

[Figure: Paired design, before/after connected by patient. Each arrow shows one patient's change in tumor volume (mm^3) from Before Treatment to After Treatment; every patient improved. The paired test analyzes the differences, removing between-patient variability.]

Common pitfall: Using an independent t-test on paired data wastes statistical power. If you have natural pairs, always use the paired test. Conversely, running a paired test on unpaired data gives invalid results, because the pairing of values is arbitrary.
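The loss of power shows up in the standard error. The sketch below (plain Python, illustrative variable names) compares the standard error an independent analysis would use with the one the paired analysis uses, for the chapter's before/after tumor volumes.

```python
import math
import statistics

before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
after  = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]
n = len(before)

mean_diff = statistics.mean(before) - statistics.mean(after)

# Independent (Welch-style) standard error: includes patient-to-patient spread
se_independent = math.sqrt(statistics.variance(before) / n + statistics.variance(after) / n)

# Paired standard error: based only on the spread of within-patient differences
diffs = [b - a for b, a in zip(before, after)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)

t_independent = mean_diff / se_independent
t_paired = mean_diff / se_paired
print(f"independent: SE = {se_independent:.2f}, t = {t_independent:.2f}")
print(f"paired:      SE = {se_paired:.2f}, t = {t_paired:.2f}")
```

The mean difference is the same in both analyses; the paired standard error is much smaller because large and small tumors no longer count as noise.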

Which t-Test? Decision flowchart (comparing 2 groups):

  1. Are observations paired (same subject, before/after, matched)? If yes, use the paired t-test.
  2. If not paired, are the data approximately normal? If no, use Mann-Whitney (Day 9).
  3. If normal, are the variances approximately equal? If yes, the pooled t-test is valid; if no, use Welch’s t-test.

Welch’s t-test is the recommended default.
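The decision path in the flowchart can be written as a small function. This is a sketch only: the function name `choose_test` and its boolean arguments are hypothetical, and in practice the three answers come from the study design and the assumption checks described in the next section.

```python
def choose_test(paired, approx_normal, equal_variances):
    """Follow the flowchart: pairing first, then normality, then variances."""
    if paired:
        return "paired t-test"
    if not approx_normal:
        return "Mann-Whitney"
    if equal_variances:
        return "pooled t-test"
    return "Welch's t-test"

# Tumor vs normal tissue from the same patients
print(choose_test(paired=True, approx_normal=True, equal_variances=True))
# Wild-type vs knockout mice with clearly different spread
print(choose_test(paired=False, approx_normal=True, equal_variances=False))
```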

Checking Assumptions

Normality: Shapiro-Wilk Test

The Shapiro-Wilk test checks whether data could have come from a normal distribution.

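  • H0: Data are normally distributed
  • If p > 0.05, there is no evidence against normality, so the assumption is reasonable (with small samples the test has little power to detect departures)
  • If p < 0.05, the data deviate significantly from a normal distribution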

Also use QQ plots: if points fall along the diagonal line, data are approximately normal.
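For readers curious what a QQ plot actually plots, here is a minimal sketch in plain Python that computes the points for the chapter's tumor values. The helper name `qq_points` is invented, and the plotting positions (i + 0.5) / n are one common convention among several; plotting libraries may use slightly different ones.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal QQ plot."""
    n = len(data)
    ordered = sorted(data)
    fitted = NormalDist(mean(data), stdev(data))   # normal with the sample's mean and SD
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

tumor = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
points = qq_points(tumor)
for expected, observed in points:
    print(f"expected {expected:.2f}   observed {observed:.2f}")
```

If the data are roughly normal, each observed value sits close to its expected quantile, which is what "points fall along the diagonal" means.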

Equal Variances: Levene’s Test

Levene’s test checks whether two groups have equal variances.

  • H0: Variances are equal
  • If p > 0.05, equal variance assumption is reasonable
  • If p < 0.05, use Welch’s t-test (or just always use Welch’s)
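To show what such a test measures, the sketch below computes a Levene-type W statistic in plain Python for the chapter's tumor and normal values. Two caveats: it centers each group on its median (the Brown-Forsythe variant, which is also the default of SciPy's scipy.stats.levene), and it stops at the statistic, because the p-value requires an F distribution with (k - 1, N - k) degrees of freedom that the standard library does not provide. The function name `levene_w` is made up for this example.

```python
import statistics

def levene_w(*groups):
    """Levene-type W statistic using absolute deviations from each group's median."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group's median
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(x - med) for x in g])
    z_means = [statistics.mean(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - grand_mean) ** 2 for zi, m in zip(z, z_means))
    within = sum((x - m) ** 2 for zi, m in zip(z, z_means) for x in zi)
    return (n_total - k) / (k - 1) * between / within

tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

w = levene_w(tumor, normal)
print(f"W = {w:.3f}")
```

W is essentially a one-way ANOVA F statistic computed on the absolute deviations, so values near or below 1 give no indication that the spreads differ.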

Cohen’s d: Quantifying Effect Size

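A p-value tells you how surprising the observed difference would be if there were no true difference; it does not tell you how large the difference is. Cohen’s d measures the size of the difference in standard deviation units: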

d = (x-bar1 - x-bar2) / s_pooled

Cohen’s d | Interpretation | Biological Example
----------|----------------|-----------------------------
0.2       | Small          | Subtle expression change
0.5       | Medium         | Moderate drug effect
0.8       | Large          | Strong phenotypic difference
> 1.2     | Very large     | Knockout vs wild-type

Key insight: A large p-value with a large Cohen’s d suggests you are underpowered — you may have a real effect but too few samples to detect it. A small p-value with a tiny Cohen’s d suggests the effect, while real, may not be biologically meaningful.
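A minimal Python sketch of Cohen's d with the pooled standard deviation, plus a label based on the thresholds in the table above. The function names and the "negligible" label for |d| below 0.2 are assumptions added for illustration; the data are the chapter's BRCA1 values.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (works for unequal group sizes)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a) +
                  (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def effect_label(d):
    """Map |d| onto the chapter's rule-of-thumb categories."""
    size = abs(d)
    if size > 1.2:
        return "very large"
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # label chosen for this sketch, not from the table

tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

d = cohens_d(tumor, normal)
print(f"Cohen's d = {d:.2f} ({effect_label(d)})")
```

The sign of d only reflects which group is listed first; the label uses the absolute value.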

The t-Test in BioLang

Independent Two-Sample t-Test: Gene Expression

# BRCA1 expression (log2-CPM) in tumor vs normal breast tissue
let tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
let normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

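# Default: Welch's t-test (unequal variances), which treats the two groups as independent.
# If the values are matched by patient (same index = same patient), ttest_paired(tumor, normal) is the appropriate test.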
let result = ttest(tumor, normal)
print("=== Welch's t-test: BRCA1 Tumor vs Normal ===")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Degrees of freedom: {result.df:.1}")
print("Mean tumor: {mean(tumor):.2}, Mean normal: {mean(normal):.2}")
print("Difference: {mean(tumor) - mean(normal):.2} log2-CPM")

# Effect size (Cohen's d inline; averaging the two variances assumes equal group sizes)
let d = (mean(tumor) - mean(normal)) / sqrt((variance(tumor) + variance(normal)) / 2.0)
print("Cohen's d: {d:.3}")

# Visualize
let bp_table = table({"Tumor": tumor, "Normal": normal})
boxplot(bp_table, {title: "BRCA1 Expression: Tumor vs Normal"})

Checking Assumptions

let tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
let normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

# 1. Normality check: use QQ plots for visual assessment
# (no built-in Shapiro-Wilk; use QQ plots + summary stats)
let s_tumor = summary(tumor)
let s_normal = summary(normal)
print("Tumor summary:  {s_tumor}")
print("Normal summary: {s_normal}")

# 2. Equal variance check: compare the sample variances (rule of thumb: ratio between 0.5 and 2)
let var_ratio = variance(tumor) / variance(normal)
print("Variance ratio (tumor/normal): {var_ratio:.3}")
if var_ratio > 2.0 or var_ratio < 0.5 {
  print("Variances appear unequal -> use Welch's t-test (the default)")
} else {
  print("Variances appear similar -> pooled t-test is also valid")
}

# 3. QQ plots for visual normality assessment
qq_plot(tumor, {title: "QQ Plot: Tumor BRCA1 Expression"})
qq_plot(normal, {title: "QQ Plot: Normal BRCA1 Expression"})

Paired t-Test: Before/After Treatment

# Tumor volume (mm^3) before and after 6 weeks of treatment
# Same 10 patients measured at both time points
let before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
let after  = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]

# Paired t-test: accounts for patient-to-patient variability
let result = ttest_paired(before, after)
print("=== Paired t-test: Tumor Volume Before vs After Treatment ===")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.6}")

# Show the paired differences
let diffs = zip(before, after) |> map(|pair| pair[0] - pair[1])
print("Mean reduction: {mean(diffs):.1} mm^3")
print("Individual reductions: {diffs}")

# Compare: what if we wrongly used an independent t-test?
let wrong_result = ttest(before, after)
print("\nWrong (independent) t-test p-value: {wrong_result.p_value:.6}")
print("Correct (paired) t-test p-value: {result.p_value:.6}")
print("Paired test is more powerful because it removes inter-patient variability")

# Visualize paired differences
histogram(diffs, {title: "Distribution of Tumor Volume Reductions", x_label: "Reduction (mm^3)", bins: 8})

One-Sample t-Test

# Is the GC content of our assembled genome different from the expected 41%?
let gc_per_contig = [40.2, 41.5, 39.8, 42.1, 40.7, 41.3, 39.5, 42.4,
                     40.1, 41.8, 40.5, 41.0, 39.9, 41.6, 40.3]

let result = ttest_one(gc_per_contig, 41.0)
print("=== One-sample t-test: GC Content vs Expected 41% ===")
print("Sample mean: {mean(gc_per_contig):.2}%")
print("t-statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.4}")

if result.p_value > 0.05 {
  print("No significant deviation from expected GC content")
}

Complete Workflow: Multiple Genes

# Test multiple genes at once
let genes = ["BRCA1", "TP53", "MYC", "GAPDH", "EGFR"]

let tumor_expr = [
  [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5],
  [2.1, 1.8, 2.5, 1.4, 2.2, 1.9, 2.0, 1.6, 2.3, 1.7, 2.4, 1.5],
  [9.2, 10.1, 8.8, 9.5, 10.3, 9.7, 8.6, 9.9, 10.5, 9.1, 9.8, 10.2],
  [8.1, 8.3, 7.9, 8.2, 8.0, 8.4, 7.8, 8.1, 8.3, 8.0, 8.2, 7.9],
  [7.5, 8.2, 7.8, 8.5, 7.1, 8.0, 7.6, 8.3, 7.9, 8.1, 7.4, 8.4]
]

let normal_expr = [
  [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1],
  [5.8, 6.2, 5.5, 6.0, 5.9, 6.3, 5.7, 6.1, 5.6, 6.4, 5.8, 6.0],
  [5.1, 5.4, 4.9, 5.3, 5.6, 5.0, 5.2, 5.5, 4.8, 5.7, 5.1, 5.4],
  [8.0, 8.2, 7.8, 8.3, 8.1, 8.0, 8.4, 7.9, 8.2, 8.1, 8.3, 8.0],
  [5.0, 5.3, 4.8, 5.1, 5.5, 4.9, 5.2, 5.4, 4.7, 5.6, 5.0, 5.3]
]

print("Gene       | t-stat | p-value    | Cohen's d | Interpretation")
print("-----------|--------|------------|-----------|---------------")

for i in 0..len(genes) {
  let result = ttest(tumor_expr[i], normal_expr[i])
  let d = (mean(tumor_expr[i]) - mean(normal_expr[i])) / sqrt((variance(tumor_expr[i]) + variance(normal_expr[i])) / 2.0)
  let interp = if abs(d) > 0.8 then "Large" else if abs(d) > 0.5 then "Medium" else "Small"
  print("{genes[i]:<10} | {result.statistic:>6.2} | {result.p_value:>10.2e} | {d:>9.3} | {interp}")
}

Python:

from scipy import stats
import numpy as np

tumor  = [3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5]
normal = [6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1]

# Welch's t-test (requires equal_var=False; SciPy's default is the pooled test)
t, p = stats.ttest_ind(tumor, normal, equal_var=False)
print(f"Welch's t = {t:.4f}, p = {p:.2e}")

# Paired t-test
before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
after  = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]
t, p = stats.ttest_rel(before, after)
print(f"Paired t = {t:.4f}, p = {p:.6f}")

# Cohen's d (manual)
pooled_std = np.sqrt((np.std(tumor, ddof=1)**2 + np.std(normal, ddof=1)**2) / 2)
d = (np.mean(tumor) - np.mean(normal)) / pooled_std
print(f"Cohen's d = {d:.3f}")

R:

tumor  <- c(3.8, 4.5, 4.1, 3.2, 4.8, 3.9, 4.3, 5.1, 3.6, 4.0, 4.7, 3.5)
normal <- c(6.2, 7.1, 6.5, 7.4, 6.8, 7.0, 6.3, 7.2, 6.9, 6.6, 7.3, 6.1)

# Welch's t-test (default in R)
t.test(tumor, normal)

# Paired t-test
before <- c(245, 312, 198, 367, 289, 421, 156, 334, 278, 305)
after  <- c(180, 245, 165, 298, 220, 350, 132, 270, 210, 248)
t.test(before, after, paired = TRUE)

# Cohen's d
library(effsize)
cohen.d(tumor, normal)

Exercises

Exercise 1: Choose the Right t-Test

For each scenario, state which t-test variant is appropriate and why:

a) Comparing white blood cell counts between 20 patients with sepsis and 25 healthy volunteers

b) Measuring gene expression in liver biopsies taken before and after drug treatment (same 15 patients)

c) Testing whether mean read length from your sequencer matches the expected 150 bp

Exercise 2: Full t-Test Workflow

Hemoglobin levels (g/dL) in two groups:

  • Anemia patients: [9.2, 8.8, 10.1, 9.5, 8.3, 9.7, 8.6, 9.0, 10.3, 8.9]
  • Healthy controls: [13.5, 14.2, 12.8, 13.9, 14.5, 13.1, 14.0, 13.6, 12.9, 14.3]
let anemia  = [9.2, 8.8, 10.1, 9.5, 8.3, 9.7, 8.6, 9.0, 10.3, 8.9]
let healthy = [13.5, 14.2, 12.8, 13.9, 14.5, 13.1, 14.0, 13.6, 12.9, 14.3]

# TODO: 1. Check normality with qq_plot() on each group
# TODO: 2. Check equal variances by comparing variance() per group
# TODO: 3. Run the appropriate t-test with ttest()
# TODO: 4. Compute Cohen's d inline: (mean(a)-mean(b)) / sqrt((variance(a)+variance(b))/2)
# TODO: 5. Create a boxplot
# TODO: 6. Interpret results in a biological context

Exercise 3: Paired vs Independent

Run both a paired and independent t-test on the tumor volume data below. Compare the p-values and explain why they differ.

let before = [245, 312, 198, 367, 289, 421, 156, 334, 278, 305]
let after  = [180, 245, 165, 298, 220, 350, 132, 270, 210, 248]

# TODO: Run ttest_paired and ttest
# TODO: Which gives a smaller p-value? Why?
# TODO: What does the paired test "remove" that the independent test cannot?

Exercise 4: When Assumptions Fail

The following data are highly skewed (as often seen in cytokine measurements):

let treatment = [2.1, 1.8, 45.2, 3.5, 2.9, 1.2, 38.7, 4.1, 2.3, 1.5]
let control   = [0.8, 0.5, 0.9, 0.3, 1.1, 0.7, 0.4, 0.6, 1.0, 0.2]

# TODO: Test normality with qq_plot()
# TODO: Run the t-test with ttest() anyway — what does it say?
# TODO: Try log-transforming the data and re-testing
# TODO: Preview: tomorrow we'll learn non-parametric alternatives

Key Takeaways

  • The t-test compares two group means, accounting for variability and sample size
  • Welch’s t-test (unequal variances) should be the default: it loses little power when variances are equal and remains valid when they are not
  • Paired t-tests are more powerful when observations are naturally matched (same patient, matched tissue, before/after)
  • Always check assumptions: Shapiro-Wilk for normality, Levene’s for equal variances, QQ plots for visual inspection
  • Cohen’s d quantifies effect size independently of sample size: 0.2 = small, 0.5 = medium, 0.8 = large
  • A significant t-test with a small Cohen’s d may not be biologically meaningful
  • A non-significant t-test with a large Cohen’s d suggests you need more samples

What’s Next

What happens when your data violate the normality assumption? Cytokine levels, bacterial abundances, and many other biological measurements are wildly skewed. Tomorrow we introduce non-parametric tests: rank-based alternatives to the t-test that do not assume your data are normally distributed.