Day 12: The Multiple Testing Crisis — FDR and Correction
The Problem
Dr. Rachel Kim is a genomicist analyzing differential gene expression between tumor and normal tissue. She runs a Welch’s t-test on each of 20,000 genes and finds 1,200 with p < 0.05. Exciting — until she does the arithmetic: 20,000 genes multiplied by a 5% false positive rate equals 1,000 genes that would appear “significant” purely by chance, even if not a single gene were truly differentially expressed.
Her 1,200 “hits” likely contain about 1,000 false positives and perhaps 200 true findings. That is a false discovery rate of over 80%. If she publishes this list and a validation lab tries to confirm 50 genes drawn at random from it, roughly 40 will fail to replicate. Her scientific reputation — and the lab’s funding — depend on solving this problem.
This is the multiple testing crisis, and it is the single most important statistical concept in genomics. Every differential expression study, every GWAS, every proteomics screen must address it. The solutions — Bonferroni correction, Benjamini-Hochberg FDR, and their relatives — are what make genome-scale analysis possible.
Making It Visceral: A Simulation
Before discussing theory, let us see the problem with our own eyes.
set_seed(42)
# Simulate 20,000 genes where NONE are truly different (complete null)
let p_values = []
for i in 1..20000 {
# Both groups drawn from the same distribution — no real differences
let group1 = rnorm(10, 0, 1)
let group2 = rnorm(10, 0, 1)
let result = ttest(group1, group2)
p_values = append(p_values, result.p_value)
}
# Count "discoveries" at various thresholds
let sig_05 = p_values |> filter(|p| p < 0.05) |> len()
let sig_01 = p_values |> filter(|p| p < 0.01) |> len()
let sig_001 = p_values |> filter(|p| p < 0.001) |> len()
print("=== 20,000 Null Genes (No True Differences) ===")
print("'Significant' at p < 0.05: {sig_05} (expected: ~1000)")
print("'Significant' at p < 0.01: {sig_01} (expected: ~200)")
print("'Significant' at p < 0.001: {sig_001} (expected: ~20)")
print("\nEvery single one is a false positive!")
histogram(p_values, {title: "p-Value Distribution Under Complete Null", x_label: "p-value", bins: 50})
Under the null hypothesis, p-values are uniformly distributed between 0 and 1, so on average 5% of them fall below 0.05. With 20,000 tests, that is roughly 1,000 false alarms.
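The same experiment can be repeated outside BioLang. Here is a minimal Python sketch (assuming NumPy and SciPy are available) that runs the null simulation on a smaller scale with Welch’s t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 5,000 null "genes": both groups drawn from the same N(0, 1)
m = 5000
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, 10), rng.normal(0, 1, 10),
                    equal_var=False).pvalue   # Welch's t-test
    for _ in range(m)
])

# Under the null, p-values are ~uniform: expect about 5% below 0.05
n_sig = (p_values < 0.05).sum()
print(f"'Significant' at p < 0.05: {n_sig} of {m} (expected ~{int(0.05 * m)})")
```

The count varies with the seed but stays close to 0.05 × m, which is the whole problem.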
Family-Wise Error Rate (FWER)
The family-wise error rate is the probability of making at least one Type I error across all tests:
FWER = 1 - (1 - alpha)^m (assuming independent tests)
Where m is the number of tests. Under dependence the exact value differs, but the FWER still climbs toward 100% as m grows.
| Number of Tests (m) | FWER at alpha = 0.05 |
|---|---|
| 1 | 5.0% |
| 10 | 40.1% |
| 100 | 99.4% |
| 1,000 | ~100% |
| 20,000 | ~100% |
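The table values follow directly from the formula; a quick Python check under the independence assumption:

```python
# FWER = 1 - (1 - alpha)^m, assuming independent tests
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in [1, 10, 100, 1000, 20000]:
    print(f"m = {m:>6}: FWER = {fwer(m):.1%}")
```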
Bonferroni Correction
The simplest and most conservative approach: divide alpha by the number of tests.
Adjusted alpha = alpha / m
For 20,000 tests at alpha = 0.05: adjusted alpha = 0.05 / 20,000 = 2.5 x 10^-6.
Equivalently, multiply each p-value by m and compare to the original alpha:
p_adjusted = min(p x m, 1.0)
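The adjustment is a one-liner; a minimal NumPy sketch on a toy vector of four p-values:

```python
import numpy as np

def bonferroni(p):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * len(p), 1.0)

p = np.array([1e-7, 0.001, 0.01, 0.04])
p_adj = bonferroni(p)
print(p_adj)          # [4e-07 0.004 0.04 0.16]
# Equivalent to comparing raw p against alpha / m = 0.0125
print(p_adj < 0.05)   # [ True  True  True False]
```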
Strengths and Weaknesses
| Property | Assessment |
|---|---|
| Controls FWER | Yes, strongly |
| Simple to compute | Yes |
| Power | Very low — misses many true effects |
| Assumes independence | Works regardless, but overly conservative for correlated tests |
Common pitfall: Bonferroni is often TOO conservative for genomics. With 20,000 correlated gene expression measurements, it throws out far too many true positives. It is appropriate when you need to be extremely cautious (drug safety) or when you have few tests.
Holm’s Step-Down Correction
Holm’s method is uniformly more powerful than Bonferroni while still controlling FWER.
Procedure:
- Sort p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
- For the i-th smallest p-value, compute adjusted p: p(i) x (m - i + 1)
- Enforce monotonicity: each adjusted p must be >= the previous
- Reject all hypotheses whose adjusted p < alpha
Holm’s method is always at least as powerful as Bonferroni, and sometimes substantially more so.
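The step-down procedure above can be sketched in a few lines of NumPy (a minimal illustration, not a library implementation):

```python
import numpy as np

def holm(p):
    """Holm adjusted p-values: sort, scale by (m - i + 1), enforce monotonicity."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # 0-based rank i gets factor m - i, i.e. (m - i + 1) in 1-based notation
    adj_sorted = p[order] * (m - np.arange(m))
    # Each adjusted p must be >= the previous one; cap at 1
    adj_sorted = np.minimum(np.maximum.accumulate(adj_sorted), 1.0)
    adj = np.empty(m)
    adj[order] = adj_sorted
    return adj

p = [0.008, 0.023, 0.041, 0.062, 0.110]
print(holm(p))   # matches R's p.adjust(p, "holm")
```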
False Discovery Rate (FDR): The Breakthrough
In 1995, Benjamini and Hochberg introduced a paradigm shift. Instead of controlling the probability of any false positive (FWER), they controlled the expected proportion of false positives among rejected hypotheses.
FDR = E[V / R] (with V / R defined as 0 when R = 0)
Where V = number of false positives and R = total number of rejections.
If you set FDR = 0.05, you accept that about 5% of your “discoveries” may be false — a much more practical threshold for genomics than guaranteeing zero false positives.
Benjamini-Hochberg (BH) Procedure
Procedure:
- Sort p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
- For each p(i), compute the BH threshold: (i / m) x q, where q is the desired FDR level
- Find the largest i where p(i) <= (i / m) x q
- Reject all hypotheses 1, 2, …, i
Equivalently, the BH-adjusted p-value (q-value) is:
q(i) = min(p(i) x m / i, 1.0) (with monotonicity enforced)
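A minimal NumPy sketch of the BH-adjusted p-values, consistent with R’s p.adjust(…, "BH"):

```python
import numpy as np

def bh_adjust(p):
    """BH adjusted p-values: p(i) * m / i, made monotone from the largest rank down."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the top: each value <= the one after it; cap at 1
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = ranked
    return adj

p = [0.008, 0.023, 0.041, 0.062, 0.110]
print(bh_adjust(p))   # [0.04 0.0575 0.0683... 0.0775 0.11]
```

Note how the smallest p-value gets the harshest factor (m / 1) and the largest gets no penalty at all (m / m), which is why BH retains so much more power than Bonferroni.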
Why BH Changed Genomics
| Method | Controls | Typical Threshold | Power |
|---|---|---|---|
| Bonferroni | FWER | p < 2.5e-6 (for 20K tests) | Very low |
| Holm | FWER | Similar to Bonferroni | Slightly higher |
| BH (FDR) | FDR | q < 0.05 | Much higher |
Key insight: BH-FDR is the standard for differential expression, GWAS, proteomics, and almost all high-throughput biology. When a paper reports “genes with FDR < 0.05,” they almost always mean Benjamini-Hochberg adjusted p-values.
Other Correction Methods
Hochberg’s Step-Up
Similar to Holm but slightly more powerful; assumes independence or positive dependence among tests.
Benjamini-Yekutieli (BY)
A conservative FDR procedure that works under any dependency structure between tests. Use when you have strong correlations (e.g., genes in the same pathway).
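The BY adjustment is just BH inflated by the harmonic factor c(m) = 1 + 1/2 + … + 1/m. A minimal NumPy sketch (the helper name bh_like_adjust is ours, not a library function):

```python
import numpy as np

def bh_like_adjust(p, dependence_factor=1.0):
    """FDR-adjusted p-values: BH with factor 1.0, BY with factor c(m) = sum_{i<=m} 1/i."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m * dependence_factor / np.arange(1, m + 1)
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = ranked
    return adj

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060]
c_m = np.sum(1.0 / np.arange(1, len(p) + 1))        # c(6) ~ 2.45
p_bh = bh_like_adjust(p)
p_by = bh_like_adjust(p, dependence_factor=c_m)
print("BH:", np.round(p_bh, 4))
print("BY:", np.round(p_by, 4))   # uniformly larger: the price of arbitrary dependence
```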
Choosing a Method
| Method | Controls | Best For |
|---|---|---|
| Bonferroni | FWER | Few tests, safety-critical decisions |
| Holm | FWER | Same as Bonferroni but always more powerful |
| BH | FDR | Standard genomics, proteomics, any large-scale screen |
| Hochberg | FWER | Independent or positively dependent tests |
| BY | FDR | Strongly correlated tests, conservative FDR |
The Volcano Plot
The volcano plot is the most iconic visualization in differential expression analysis. It plots:
- x-axis: log2 fold change (effect size)
- y-axis: -log10(adjusted p-value) (statistical significance)
Genes in the upper corners are both statistically significant AND biologically meaningful — the ones you actually care about.
| Region | Interpretation |
|---|---|
| Upper-right | Significantly upregulated (large positive FC, low p) |
| Upper-left | Significantly downregulated (large negative FC, low p) |
| Bottom center | Not significant (high p, any FC) |
| Upper center | Significant but small effect (low FC, low p) |
Multiple Testing Correction in BioLang
Simulating the Crisis and Applying Corrections
set_seed(42)
# Simulate 20,000 genes: 18,000 null + 2,000 truly differential
let p_values = []
let is_true = [] # Ground truth: 1 = truly differential, 0 = null
let fold_changes = []
for i in 1..20000 {
let group1 = rnorm(10, 0, 1)
# Assign group2 via an if-expression so it remains in scope below
let group2 = if i <= 2000 {
# True differential: shifted mean
is_true = append(is_true, 1)
let shift = rnorm(1, 2.0, 0.5)[0]
rnorm(10, shift, 1)
} else {
# Null gene: no difference
is_true = append(is_true, 0)
rnorm(10, 0, 1)
}
let result = ttest(group1, group2)
p_values = append(p_values, result.p_value)
fold_changes = append(fold_changes, mean(group2) - mean(group1))
}
print("Total genes: {len(p_values)}")
print("True DE genes: {is_true |> filter(|x| x == 1) |> len()}")
print("Null genes: {is_true |> filter(|x| x == 0) |> len()}")
Applying All Correction Methods
# Apply multiple correction methods
let p_bonf = p_adjust(p_values, "bonferroni")
let p_holm = p_adjust(p_values, "holm")
let p_bh = p_adjust(p_values, "BH")
let p_hoch = p_adjust(p_values, "hochberg")
let p_by = p_adjust(p_values, "BY")
# Count discoveries at adjusted p < 0.05
let count_sig = |adj_p| adj_p |> filter(|p| p < 0.05) |> len()
print("=== Discoveries at Adjusted p < 0.05 ===")
print("Method | Total Discoveries | True Pos | False Pos | FDR")
print("-------------|-------------------|----------|-----------|--------")
for method_name, adj_p in [
["Unadjusted ", p_values],
["Bonferroni ", p_bonf],
["Holm ", p_holm],
["BH (FDR) ", p_bh],
["Hochberg ", p_hoch],
["BY ", p_by]
] {
let discoveries = []
let true_pos = 0
let false_pos = 0
for j in 0..len(adj_p) {
if adj_p[j] < 0.05 {
discoveries = append(discoveries, j)
if is_true[j] == 1 { true_pos = true_pos + 1 }
else { false_pos = false_pos + 1 }
}
}
let total = len(discoveries)
let fdr = if total > 0 then false_pos / total else 0.0
print("{method_name} | {total:>17} | {true_pos:>8} | {false_pos:>9} | {fdr:>6.3}")
}
Drawing a Volcano Plot
# Volcano plot: the most important visualization in DE analysis
let log2_fc = fold_changes
let neg_log10_p = p_bh |> map(|p| if p > 0 then -log10(p) else 10)
# Classify genes
let colors = []
for i in 0..len(p_bh) {
if p_bh[i] < 0.05 and abs(log2_fc[i]) > 1.0 {
colors = append(colors, "significant")
} else {
colors = append(colors, "not_significant")
}
}
volcano(log2_fc, neg_log10_p, {title: "Differential Expression: Tumor vs Normal", x_label: "log2 Fold Change", y_label: "-log10(FDR-adjusted p-value)", fc_threshold: 1.0, p_threshold: 0.05, highlight: colors})
# Count genes in each quadrant
let sig_up = 0
let sig_down = 0
let not_sig = 0
for i in 0..len(p_bh) {
if p_bh[i] < 0.05 and log2_fc[i] > 1.0 { sig_up = sig_up + 1 }
else if p_bh[i] < 0.05 and log2_fc[i] < -1.0 { sig_down = sig_down + 1 }
else { not_sig = not_sig + 1 }
}
print("Significantly upregulated: {sig_up}")
print("Significantly downregulated: {sig_down}")
print("Not significant: {not_sig}")
Visualizing the BH Procedure
set_seed(42)
# Step-by-step BH procedure visualization
# Smaller example for clarity: 100 tests, 10 truly different
let p_vals = []
for i in 1..100 {
let g1 = rnorm(10, 0, 1)
let g2 = if i <= 10 {
rnorm(10, 3, 1)
} else {
rnorm(10, 0, 1)
}
p_vals = append(p_vals, ttest(g1, g2).p_value)
}
# Sort p-values
let sorted_p = sort(p_vals)
let bh_thresholds = range(1, 101) |> map(|i| i / 100 * 0.05)
# Plot sorted p-values against BH thresholds
scatter(range(1, 101), sorted_p, {title: "BH Procedure: Sorted p-values vs Threshold Line", x_label: "Rank", y_label: "p-value", overlay_lines: [[range(1, 101), bh_thresholds]]})
let bh_adjusted = p_adjust(p_vals, "BH")
let n_sig = bh_adjusted |> filter(|p| p < 0.05) |> len()
print("BH discoveries (FDR < 0.05): {n_sig}")
p-Value Histograms: Diagnostic Tool
set_seed(42)
# A well-behaved p-value histogram tells you a lot
# Uniform = all null; spike at 0 = true signal exists
# Scenario 1: All null (should be uniform)
let null_ps = []
for i in 1..5000 {
let g1 = rnorm(10, 0, 1)
let g2 = rnorm(10, 0, 1)
null_ps = append(null_ps, ttest(g1, g2).p_value)
}
# Scenario 2: 20% true signal (spike near 0 + uniform)
let mixed_ps = []
for i in 1..5000 {
let g1 = rnorm(10, 0, 1)
let g2 = if i <= 1000 {
rnorm(10, 2, 1)
} else {
rnorm(10, 0, 1)
}
mixed_ps = append(mixed_ps, ttest(g1, g2).p_value)
}
histogram(null_ps, {title: "All Null: Uniform p-value Distribution", x_label: "p-value", bins: 20})
histogram(mixed_ps, {title: "20% True Signal: Spike Near Zero + Uniform Background", x_label: "p-value", bins: 20})
Python:
from statsmodels.stats.multitest import multipletests
import numpy as np
# Simulate p-values (example with 1000 tests)
np.random.seed(42)
p_values = np.random.uniform(0, 1, 1000)
p_values[:100] = np.random.uniform(0, 0.01, 100) # 100 true signals
# Apply corrections
reject_bonf, padj_bonf, _, _ = multipletests(p_values, method='bonferroni')
reject_holm, padj_holm, _, _ = multipletests(p_values, method='holm')
reject_bh, padj_bh, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Bonferroni: {reject_bonf.sum()} discoveries")
print(f"Holm: {reject_holm.sum()} discoveries")
print(f"BH (FDR): {reject_bh.sum()} discoveries")
R:
# Simulate p-values
set.seed(42)
p_values <- c(runif(100, 0, 0.01), runif(900, 0, 1))
# Apply corrections (all built-in)
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_holm <- p.adjust(p_values, method = "holm")
p_bh <- p.adjust(p_values, method = "BH")
p_by <- p.adjust(p_values, method = "BY")
cat("Bonferroni:", sum(p_bonf < 0.05), "\n")
cat("Holm: ", sum(p_holm < 0.05), "\n")
cat("BH (FDR): ", sum(p_bh < 0.05), "\n")
cat("BY: ", sum(p_by < 0.05), "\n")
# Volcano plot (using EnhancedVolcano; 'results' is a data frame
# with log2FoldChange and padj columns, e.g. DESeq2 output)
library(EnhancedVolcano)
EnhancedVolcano(results, lab = rownames(results),
x = 'log2FoldChange', y = 'padj',
pCutoff = 0.05, FCcutoff = 1.0)
Exercises
Exercise 1: Simulate and Count
Simulate 10,000 tests where all genes are null (no true differences). Verify that approximately 500 have p < 0.05, 100 have p < 0.01, and 10 have p < 0.001.
# TODO: Simulate 10,000 null t-tests
# TODO: Count p < 0.05, p < 0.01, p < 0.001
# TODO: Plot the p-value histogram (should be uniform)
Exercise 2: Compare Correction Power
Simulate 5,000 genes (500 truly DE with log2FC = 1.5, 4,500 null). Apply Bonferroni, Holm, and BH. For each, compute: (a) total discoveries, (b) true positives, (c) false positives, (d) actual FDR.
# TODO: Simulate 5,000 genes (500 DE + 4,500 null)
# TODO: Apply all three corrections
# TODO: Build a comparison table
# TODO: Which method gives the best balance of power and error control?
Exercise 3: Build a Volcano Plot
Using the simulation from Exercise 2, create a volcano plot. Color genes as “significant” (BH-adjusted p < 0.05 AND |log2FC| > 1) versus “not significant.”
# TODO: Use fold changes and BH-adjusted p-values from Exercise 2
# TODO: Create a volcano plot with appropriate thresholds
# TODO: Count genes in each quadrant
Exercise 4: p-Value Histogram Diagnostics
Generate p-value histograms for three scenarios: (a) 100% null genes, (b) 10% true DE genes, (c) 50% true DE genes. Describe the characteristic shape of each and explain what you would look for in real data.
# TODO: Scenario A — all null
# TODO: Scenario B — 10% DE
# TODO: Scenario C — 50% DE
# TODO: Create histograms for each
# TODO: Describe the shape and what it tells you
Exercise 5: When Does Bonferroni Make Sense?
You are testing 5 pre-specified candidate genes (not genome-wide). Apply Bonferroni and BH to these 5 p-values: [0.008, 0.023, 0.041, 0.062, 0.110]. How do the results differ? When would you prefer Bonferroni here?
let p_vals = [0.008, 0.023, 0.041, 0.062, 0.110]
# TODO: Apply Bonferroni and BH
# TODO: Which genes are significant under each method?
# TODO: Argue for which method is more appropriate for 5 candidate genes
Key Takeaways
- Testing m hypotheses at alpha = 0.05 yields approximately m x 0.05 false positives — devastating for genomics
- Bonferroni correction (p x m) controls FWER but is often too conservative for large-scale studies
- Holm’s step-down method is always at least as powerful as Bonferroni and should be preferred
- Benjamini-Hochberg (BH) FDR correction is the standard for genomics: it controls the expected proportion of false discoveries rather than the probability of any false discovery
- At FDR = 0.05, you accept that about 5% of discoveries may be false — a practical trade-off for high-throughput biology
- p-value histograms are essential diagnostics: uniform = all null, spike at zero = true signal present
- The volcano plot (-log10 adjusted p vs log2 fold change) is the standard visualization for differential expression: genes in the upper corners are both significant and biologically meaningful
- Always report which correction method you used — “p < 0.05” means very different things with and without adjustment
What’s Next
With the multiple testing crisis solved, we have completed the core toolkit of statistical hypothesis testing. Week 3 shifts to modeling and relationships: tomorrow we explore correlation — how to measure and test whether two continuous variables move together, from gene expression co-regulation to dose-response curves. This transition from “are groups different?” to “how are variables related?” opens the door to regression, prediction, and the modeling approaches that power modern biostatistics.