Day 12: The Multiple Testing Crisis — FDR and Correction
The Problem
Dr. Rachel Kim is a genomicist analyzing differential gene expression between tumor and normal tissue. She runs a Welch’s t-test on each of 20,000 genes and finds 1,200 with p < 0.05. Exciting — until she does the arithmetic: 20,000 genes multiplied by a 5% false positive rate equals 1,000 genes that would appear “significant” purely by chance, even if not a single gene were truly differentially expressed.
Her 1,200 “hits” likely contain about 1,000 false positives and perhaps 200 true findings. That is a false discovery rate of over 80%. If she publishes this list and a validation lab tries to confirm 50 genes drawn at random from it, roughly 40 will fail to replicate. Her scientific reputation — and the lab’s funding — depend on solving this problem.
This is the multiple testing crisis, and it is the single most important statistical concept in genomics. Every differential expression study, every GWAS, every proteomics screen must address it. The solutions — Bonferroni correction, Benjamini-Hochberg FDR, and their relatives — are what make genome-scale analysis possible.
Making It Visceral: A Simulation
Before discussing theory, let us see the problem with our own eyes.
set_seed(42)
# Simulate 20,000 genes where NONE are truly different (complete null)
let p_values = []
for i in 1..20000 {
# Both groups drawn from the same distribution — no real differences
let group1 = rnorm(10, 0, 1)
let group2 = rnorm(10, 0, 1)
let result = ttest(group1, group2)
p_values = append(p_values, result.p_value)
}
# Count "discoveries" at various thresholds
let sig_05 = p_values |> filter(|p| p < 0.05) |> len()
let sig_01 = p_values |> filter(|p| p < 0.01) |> len()
let sig_001 = p_values |> filter(|p| p < 0.001) |> len()
print("=== 20,000 Null Genes (No True Differences) ===")
print("'Significant' at p < 0.05: {sig_05} (expected: ~1000)")
print("'Significant' at p < 0.01: {sig_01} (expected: ~200)")
print("'Significant' at p < 0.001: {sig_001} (expected: ~20)")
print("\nEvery single one is a false positive!")
histogram(p_values, {title: "p-Value Distribution Under Complete Null", x_label: "p-value", bins: 50})
Under the null hypothesis, p-values are uniformly distributed between 0 and 1, so on average 5% of them fall below 0.05. With 20,000 tests, that is roughly 1,000 false alarms.
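The same experiment can be repeated outside BioLang. Here is a minimal Python sketch (assuming NumPy and SciPy are available) that runs the null simulation on a smaller scale with Welch’s t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 5,000 null "genes": both groups drawn from the same N(0, 1)
m = 5000
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, 10), rng.normal(0, 1, 10),
                    equal_var=False).pvalue   # Welch's t-test
    for _ in range(m)
])

# Under the null, p-values are ~uniform: expect about 5% below 0.05
n_sig = (p_values < 0.05).sum()
print(f"'Significant' at p < 0.05: {n_sig} of {m} (expected ~{int(0.05 * m)})")
```

The count varies with the seed but stays close to 0.05 × m, which is the whole problem.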
Family-Wise Error Rate (FWER)
The family-wise error rate is the probability of making at least one Type I error across all tests:
FWER = 1 - (1 - alpha)^m (assuming independent tests)
Where m is the number of tests. Under dependence the exact value differs, but the FWER still climbs toward 100% as m grows.
| Number of Tests (m) | FWER at alpha = 0.05 |
|---|---|
| 1 | 5.0% |
| 10 | 40.1% |
| 100 | 99.4% |
| 1,000 | ~100% |
| 20,000 | ~100% |
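The table values follow directly from the formula; a quick Python check under the independence assumption:

```python
# FWER = 1 - (1 - alpha)^m, assuming independent tests
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in [1, 10, 100, 1000, 20000]:
    print(f"m = {m:>6}: FWER = {fwer(m):.1%}")
```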
Bonferroni Correction
The simplest and most conservative approach: divide alpha by the number of tests.
Adjusted alpha = alpha / m
For 20,000 tests at alpha = 0.05: adjusted alpha = 0.05 / 20,000 = 2.5 x 10^-6.
Equivalently, multiply each p-value by m and compare to the original alpha:
p_adjusted = min(p x m, 1.0)
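The adjustment is a one-liner; a minimal NumPy sketch on a toy vector of four p-values:

```python
import numpy as np

def bonferroni(p):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * len(p), 1.0)

p = np.array([1e-7, 0.001, 0.01, 0.04])
p_adj = bonferroni(p)
print(p_adj)          # [4e-07 0.004 0.04 0.16]
# Equivalent to comparing raw p against alpha / m = 0.0125
print(p_adj < 0.05)   # [ True  True  True False]
```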
Strengths and Weaknesses
| Property | Assessment |
|---|---|
| Controls FWER | Yes, strongly |
| Simple to compute | Yes |
| Power | Very low — misses many true effects |
| Assumes independence | Works regardless, but overly conservative for correlated tests |
Common pitfall: Bonferroni is often TOO conservative for genomics. With 20,000 correlated gene expression measurements, it throws out far too many true positives. It is appropriate when you need to be extremely cautious (drug safety) or when you have few tests.
Holm’s Step-Down Correction
Holm’s method is uniformly more powerful than Bonferroni while still controlling FWER.
Procedure:
- Sort p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
- For the i-th smallest p-value, compute adjusted p: p(i) x (m - i + 1)
- Enforce monotonicity: each adjusted p must be >= the previous
- Reject all hypotheses whose adjusted p < alpha
Holm’s method is always at least as powerful as Bonferroni, and sometimes substantially more so.
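The step-down procedure above can be sketched in a few lines of NumPy (a minimal illustration, not a library implementation):

```python
import numpy as np

def holm(p):
    """Holm adjusted p-values: sort, scale by (m - i + 1), enforce monotonicity."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # 0-based rank i gets factor m - i, i.e. (m - i + 1) in 1-based notation
    adj_sorted = p[order] * (m - np.arange(m))
    # Each adjusted p must be >= the previous one; cap at 1
    adj_sorted = np.minimum(np.maximum.accumulate(adj_sorted), 1.0)
    adj = np.empty(m)
    adj[order] = adj_sorted
    return adj

p = [0.008, 0.023, 0.041, 0.062, 0.110]
print(holm(p))   # matches R's p.adjust(p, "holm")
```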
False Discovery Rate (FDR): The Breakthrough
In 1995, Benjamini and Hochberg introduced a paradigm shift. Instead of controlling the probability of any false positive (FWER), they controlled the expected proportion of false positives among rejected hypotheses.
FDR = E[V / R] (with V / R defined as 0 when R = 0)
Where V = number of false positives and R = total number of rejections.
If you set FDR = 0.05, you accept that about 5% of your “discoveries” may be false — a much more practical threshold for genomics than guaranteeing zero false positives.
Benjamini-Hochberg (BH) Procedure
Procedure:
- Sort p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
- For each p(i), compute the BH threshold: (i / m) x q, where q is the desired FDR level
- Find the largest i where p(i) <= (i / m) x q
- Reject all hypotheses 1, 2, …, i
Equivalently, the BH-adjusted p-value (q-value) is:
q(i) = min(p(i) x m / i, 1.0) (with monotonicity enforced)
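A minimal NumPy sketch of the BH-adjusted p-values, consistent with R’s p.adjust(…, "BH"):

```python
import numpy as np

def bh_adjust(p):
    """BH adjusted p-values: p(i) * m / i, made monotone from the largest rank down."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the top: each value <= the one after it; cap at 1
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = ranked
    return adj

p = [0.008, 0.023, 0.041, 0.062, 0.110]
print(bh_adjust(p))   # [0.04 0.0575 0.0683... 0.0775 0.11]
```

Note how the smallest p-value gets the harshest factor (m / 1) and the largest gets no penalty at all (m / m), which is why BH retains so much more power than Bonferroni.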
Why BH Changed Genomics
| Method | Controls | Typical Threshold | Power |
|---|---|---|---|
| Bonferroni | FWER | p < 2.5e-6 (for 20K tests) | Very low |
| Holm | FWER | Similar to Bonferroni | Slightly higher |
| BH (FDR) | FDR | q < 0.05 | Much higher |
Key insight: BH-FDR is the standard for differential expression, GWAS, proteomics, and almost all high-throughput biology. When a paper reports “genes with FDR < 0.05,” they almost always mean Benjamini-Hochberg adjusted p-values.
Other Correction Methods
Hochberg’s Step-Up
Similar to Holm but slightly more powerful; assumes independence or positive dependence among tests.
Benjamini-Yekutieli (BY)
A conservative FDR procedure that works under any dependency structure between tests. Use when you have strong correlations (e.g., genes in the same pathway).
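The BY adjustment is just BH inflated by the harmonic factor c(m) = 1 + 1/2 + … + 1/m. A minimal NumPy sketch (the helper name bh_like_adjust is ours, not a library function):

```python
import numpy as np

def bh_like_adjust(p, dependence_factor=1.0):
    """FDR-adjusted p-values: BH with factor 1.0, BY with factor c(m) = sum_{i<=m} 1/i."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m * dependence_factor / np.arange(1, m + 1)
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = ranked
    return adj

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060]
c_m = np.sum(1.0 / np.arange(1, len(p) + 1))        # c(6) ~ 2.45
p_bh = bh_like_adjust(p)
p_by = bh_like_adjust(p, dependence_factor=c_m)
print("BH:", np.round(p_bh, 4))
print("BY:", np.round(p_by, 4))   # uniformly larger: the price of arbitrary dependence
```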
Choosing a Method
| Method | Controls | Best For |
|---|---|---|
| Bonferroni | FWER | Few tests, safety-critical decisions |
| Holm | FWER | Same as Bonferroni but always more powerful |
| BH | FDR | Standard genomics, proteomics, any large-scale screen |
| Hochberg | FWER | Independent or positively dependent tests |
| BY | FDR | Strongly correlated tests, conservative FDR |
The Volcano Plot
The volcano plot is the most iconic visualization in differential expression analysis. It plots:
- x-axis: log2 fold change (effect size)
- y-axis: -log10(adjusted p-value) (statistical significance)
Genes in the upper corners are both statistically significant AND biologically meaningful — the ones you actually care about.
| Region | Interpretation |
|---|---|
| Upper-right | Significantly upregulated (large positive FC, low p) |
| Upper-left | Significantly downregulated (large negative FC, low p) |
| Bottom center | Not significant (high p, any FC) |
| Upper center | Significant but small effect (low FC, low p) |
Multiple Testing Correction in BioLang
Simulating the Crisis and Applying Corrections
set_seed(42)
# Simulate 20,000 genes: 18,000 null + 2,000 truly differential
let p_values = []
let is_true = [] # Ground truth: 1 = truly differential, 0 = null
let fold_changes = []
for i in 1..20000 {
let group1 = rnorm(10, 0, 1)
# Assign group2 via an if-expression so it remains in scope below
let group2 = if i <= 2000 {
# True differential: shifted mean
is_true = append(is_true, 1)
let shift = rnorm(1, 2.0, 0.5)[0]
rnorm(10, shift, 1)
} else {
# Null gene: no difference
is_true = append(is_true, 0)
rnorm(10, 0, 1)
}
let result = ttest(group1, group2)
p_values = append(p_values, result.p_value)
fold_changes = append(fold_changes, mean(group2) - mean(group1))
}
print("Total genes: {len(p_values)}")
print("True DE genes: {is_true |> filter(|x| x == 1) |> len()}")
print("Null genes: {is_true |> filter(|x| x == 0) |> len()}")
Applying All Correction Methods
# Apply multiple correction methods
let p_bonf = p_adjust(p_values, "bonferroni")
let p_holm = p_adjust(p_values, "holm")
let p_bh = p_adjust(p_values, "BH")
let p_hoch = p_adjust(p_values, "hochberg")
let p_by = p_adjust(p_values, "BY")
# Count discoveries at adjusted p < 0.05
let count_sig = |adj_p| adj_p |> filter(|p| p < 0.05) |> len()
print("=== Discoveries at Adjusted p < 0.05 ===")
print("Method | Total Discoveries | True Pos | False Pos | FDR")
print("-------------|-------------------|----------|-----------|--------")
for method_name, adj_p in [
["Unadjusted ", p_values],
["Bonferroni ", p_bonf],
["Holm ", p_holm],
["BH (FDR) ", p_bh],
["Hochberg ", p_hoch],
["BY ", p_by]
] {
let discoveries = []
let true_pos = 0
let false_pos = 0
for j in 0..len(adj_p) {
if adj_p[j] < 0.05 {
discoveries = append(discoveries, j)
if is_true[j] == 1 { true_pos = true_pos + 1 }
else { false_pos = false_pos + 1 }
}
}
let total = len(discoveries)
let fdr = if total > 0 then false_pos / total else 0.0
print("{method_name} | {total:>17} | {true_pos:>8} | {false_pos:>9} | {fdr:>6.3}")
}
Drawing a Volcano Plot
# Volcano plot: the most important visualization in DE analysis
let log2_fc = fold_changes
let neg_log10_p = p_bh |> map(|p| if p > 0 then -log10(p) else 10)
# Classify genes
let colors = []
for i in 0..len(p_bh) {
if p_bh[i] < 0.05 and abs(log2_fc[i]) > 1.0 {
colors = append(colors, "significant")
} else {
colors = append(colors, "not_significant")
}
}
volcano(log2_fc, neg_log10_p, {title: "Differential Expression: Tumor vs Normal", x_label: "log2 Fold Change", y_label: "-log10(FDR-adjusted p-value)", fc_threshold: 1.0, p_threshold: 0.05, highlight: colors})
# Count genes in each quadrant
let sig_up = 0
let sig_down = 0
let not_sig = 0
for i in 0..len(p_bh) {
if p_bh[i] < 0.05 and log2_fc[i] > 1.0 { sig_up = sig_up + 1 }
else if p_bh[i] < 0.05 and log2_fc[i] < -1.0 { sig_down = sig_down + 1 }
else { not_sig = not_sig + 1 }
}
print("Significantly upregulated: {sig_up}")
print("Significantly downregulated: {sig_down}")
print("Not significant: {not_sig}")
Visualizing the BH Procedure
set_seed(42)
# Step-by-step BH procedure visualization
# Smaller example for clarity: 100 tests, 10 truly different
let p_vals = []
for i in 1..100 {
let g1 = rnorm(10, 0, 1)
let g2 = if i <= 10 {
rnorm(10, 3, 1)
} else {
rnorm(10, 0, 1)
}
p_vals = append(p_vals, ttest(g1, g2).p_value)
}
# Sort p-values
let sorted_p = sort(p_vals)
let bh_thresholds = range(1, 101) |> map(|i| i / 100 * 0.05)
# Plot sorted p-values against BH thresholds
scatter(range(1, 101), sorted_p, {title: "BH Procedure: Sorted p-values vs Threshold Line", x_label: "Rank", y_label: "p-value", overlay_lines: [[range(1, 101), bh_thresholds]]})
let bh_adjusted = p_adjust(p_vals, "BH")
let n_sig = bh_adjusted |> filter(|p| p < 0.05) |> len()
print("BH discoveries (FDR < 0.05): {n_sig}")
p-Value Histograms: Diagnostic Tool
set_seed(42)
# A well-behaved p-value histogram tells you a lot
# Uniform = all null; spike at 0 = true signal exists
# Scenario 1: All null (should be uniform)
let null_ps = []
for i in 1..5000 {
let g1 = rnorm(10, 0, 1)
let g2 = rnorm(10, 0, 1)
null_ps = append(null_ps, ttest(g1, g2).p_value)
}
# Scenario 2: 20% true signal (spike near 0 + uniform)
let mixed_ps = []
for i in 1..5000 {
let g1 = rnorm(10, 0, 1)
let g2 = if i <= 1000 {
rnorm(10, 2, 1)
} else {
rnorm(10, 0, 1)
}
mixed_ps = append(mixed_ps, ttest(g1, g2).p_value)
}
histogram(null_ps, {title: "All Null: Uniform p-value Distribution", x_label: "p-value", bins: 20})
histogram(mixed_ps, {title: "20% True Signal: Spike Near Zero + Uniform Background", x_label: "p-value", bins: 20})
Python:
from statsmodels.stats.multitest import multipletests
import numpy as np
# Simulate p-values (example with 1000 tests)
np.random.seed(42)
p_values = np.random.uniform(0, 1, 1000)
p_values[:100] = np.random.uniform(0, 0.01, 100) # 100 true signals
# Apply corrections
reject_bonf, padj_bonf, _, _ = multipletests(p_values, method='bonferroni')
reject_holm, padj_holm, _, _ = multipletests(p_values, method='holm')
reject_bh, padj_bh, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Bonferroni: {reject_bonf.sum()} discoveries")
print(f"Holm: {reject_holm.sum()} discoveries")
print(f"BH (FDR): {reject_bh.sum()} discoveries")
R:
# Simulate p-values
set.seed(42)
p_values <- c(runif(100, 0, 0.01), runif(900, 0, 1))
# Apply corrections (all built-in)
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_holm <- p.adjust(p_values, method = "holm")
p_bh <- p.adjust(p_values, method = "BH")
p_by <- p.adjust(p_values, method = "BY")
cat("Bonferroni:", sum(p_bonf < 0.05), "\n")
cat("Holm: ", sum(p_holm < 0.05), "\n")
cat("BH (FDR): ", sum(p_bh < 0.05), "\n")
cat("BY: ", sum(p_by < 0.05), "\n")
# Volcano plot (using EnhancedVolcano; 'results' is a data frame
# with log2FoldChange and padj columns, e.g. DESeq2 output)
library(EnhancedVolcano)
EnhancedVolcano(results, lab = rownames(results),
x = 'log2FoldChange', y = 'padj',
pCutoff = 0.05, FCcutoff = 1.0)
Exercises
Exercise 1: Simulate and Count
Simulate 10,000 tests where all genes are null (no true differences). Verify that approximately 500 have p < 0.05, 100 have p < 0.01, and 10 have p < 0.001.
# TODO: Simulate 10,000 null t-tests
# TODO: Count p < 0.05, p < 0.01, p < 0.001
# TODO: Plot the p-value histogram (should be uniform)
Exercise 2: Compare Correction Power
Simulate 5,000 genes (500 truly DE with log2FC = 1.5, 4,500 null). Apply Bonferroni, Holm, and BH. For each, compute: (a) total discoveries, (b) true positives, (c) false positives, (d) actual FDR.
# TODO: Simulate 5,000 genes (500 DE + 4,500 null)
# TODO: Apply all three corrections
# TODO: Build a comparison table
# TODO: Which method gives the best balance of power and error control?
Exercise 3: Build a Volcano Plot
Using the simulation from Exercise 2, create a volcano plot. Color genes as “significant” (BH-adjusted p < 0.05 AND |log2FC| > 1) versus “not significant.”
# TODO: Use fold changes and BH-adjusted p-values from Exercise 2
# TODO: Create a volcano plot with appropriate thresholds
# TODO: Count genes in each quadrant
Exercise 4: p-Value Histogram Diagnostics
Generate p-value histograms for three scenarios: (a) 100% null genes, (b) 10% true DE genes, (c) 50% true DE genes. Describe the characteristic shape of each and explain what you would look for in real data.
# TODO: Scenario A — all null
# TODO: Scenario B — 10% DE
# TODO: Scenario C — 50% DE
# TODO: Create histograms for each
# TODO: Describe the shape and what it tells you
Exercise 5: When Does Bonferroni Make Sense?
You are testing 5 pre-specified candidate genes (not genome-wide). Apply Bonferroni and BH to these 5 p-values: [0.008, 0.023, 0.041, 0.062, 0.110]. How do the results differ? When would you prefer Bonferroni here?
let p_vals = [0.008, 0.023, 0.041, 0.062, 0.110]
# TODO: Apply Bonferroni and BH
# TODO: Which genes are significant under each method?
# TODO: Argue for which method is more appropriate for 5 candidate genes
Key Takeaways
- Testing m hypotheses at alpha = 0.05 yields approximately m x 0.05 false positives — devastating for genomics
- Bonferroni correction (p x m) controls FWER but is often too conservative for large-scale studies
- Holm’s step-down method is always at least as powerful as Bonferroni and should be preferred
- Benjamini-Hochberg (BH) FDR correction is the standard for genomics: it controls the expected proportion of false discoveries rather than the probability of any false discovery
- At FDR = 0.05, you accept that about 5% of discoveries may be false — a practical trade-off for high-throughput biology
- p-value histograms are essential diagnostics: uniform = all null, spike at zero = true signal present
- The volcano plot (-log10 adjusted p vs log2 fold change) is the standard visualization for differential expression: genes in the upper corners are both significant and biologically meaningful
- Always report which correction method you used — “p < 0.05” means very different things with and without adjustment
What’s Next
With the multiple testing crisis solved, we have completed the core toolkit of statistical hypothesis testing. Week 3 shifts to modeling and relationships: tomorrow we explore correlation — how to measure and test whether two continuous variables move together, from gene expression co-regulation to dose-response curves. This transition from “are groups different?” to “how are variables related?” opens the door to regression, prediction, and the modeling approaches that power modern biostatistics.