Day 18: Experimental Design and Statistical Power
The Problem
Dr. Ana Reyes is a junior PI writing her first R01 grant. She proposes a study comparing gene expression between psoriatic skin and normal skin using RNA-seq, planning 3 samples per group because “that’s what the lab down the hall used.”
The grant comes back with this reviewer comment:
“The proposed sample size of n=3 per group is inadequate. The applicant provides no power analysis to justify this number. With 3 replicates, the study is severely underpowered to detect anything less than a 4-fold change, which is biologically unrealistic for most genes. We recommend at least 8-10 biological replicates per condition based on published power analyses for RNA-seq DE studies.”
Grant rejected. Six months of proposal writing, wasted — because she didn’t plan the sample size.
Statistical power determines whether your study can actually detect the effect you’re looking for. Getting it wrong wastes time, money, animals, and patient samples.
What Is Statistical Power?
Power analysis sits at the intersection of four interconnected quantities:
| Quantity | Symbol | Definition | Typical Value |
|---|---|---|---|
| Significance level | α | Probability of false positive (Type I error) | 0.05 |
| Power | 1 - β | Probability of detecting a true effect | 0.80 (80%) |
| Effect size | d, Δ | Magnitude of the real biological difference | Varies |
| Sample size | n | Number of observations per group | What you solve for |
The fundamental relationship: Given any three, you can calculate the fourth.
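A quick way to see this in practice is statsmodels' `TTestIndPower` (the same class used in the Python snippet at the end of this chapter): leave exactly one argument unset and `solve_power` computes it.

```python
# Given any three of {alpha, power, effect size, n}, solve for the
# fourth: solve_power() returns whichever argument is left out.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# Solve for sample size (the usual planning question)
n = solver.solve_power(effect_size=0.8, alpha=0.05, power=0.80)

# Solve for power at a fixed sample size
power = solver.solve_power(effect_size=0.8, alpha=0.05, nobs1=26)

# Solve for the smallest detectable effect at that sample size
d_min = solver.solve_power(alpha=0.05, power=0.80, nobs1=26)

print(f"required n per group:  {n:.1f}")
print(f"power at n=26:         {power:.2f}")
print(f"detectable d at n=26:  {d_min:.2f}")
```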
Type I and Type II Errors
| | H₀ True (no real effect) | H₀ False (real effect exists) |
|---|---|---|
| Reject H₀ | Type I error (α) — false positive | Correct! (Power = 1 - β) |
| Fail to reject H₀ | Correct! (1 - α) | Type II error (β) — false negative |
Key insight: An underpowered study has a high β — it frequently misses real effects. This doesn’t just waste resources; it can lead to the false conclusion that an effect doesn’t exist, discouraging further research on a real phenomenon.
Why 80% Power?
The convention of 80% power means accepting a 20% chance of missing a real effect. Some contexts demand higher:
| Context | Minimum Power | Rationale |
|---|---|---|
| Exploratory study | 70-80% | Acceptable miss rate for discovery |
| Confirmatory clinical trial | 80-90% | Regulatory requirement |
| Safety/non-inferiority trial | 90-95% | Must not miss harmful effects |
| Rare disease (limited patients) | 60-70% | Pragmatic constraint |
Effect Size: The Missing Ingredient
Effect size is the hardest quantity to estimate because it requires knowledge about the biology before doing the experiment. Sources:
| Source | Approach |
|---|---|
| Pilot data | Small preliminary experiment (best source) |
| Literature | Previous studies on similar questions |
| Clinical significance | “What’s the smallest difference that matters?” |
| Conventions | Cohen’s standards (d = 0.2 small, 0.5 medium, 0.8 large) |
Common pitfall: Using an inflated effect size from a small pilot study. Small studies overestimate effects (the “winner’s curse”). If your pilot with n=5 shows d=1.5, the true effect is probably smaller. Be conservative.
Cohen’s d for Two-Group Comparisons
$$d = \frac{|\mu_1 - \mu_2|}{\sigma_{pooled}}$$
| Cohen’s d | Interpretation | Biological Example |
|---|---|---|
| 0.2 | Small | Subtle expression change between tissues |
| 0.5 | Medium | Moderate expression shift between conditions |
| 0.8 | Large | Drug vs. placebo in responsive patients |
| 1.2 | Very large | Knockout vs. wild-type for target gene |
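The formula translates directly into code; this helper (`cohens_d` is my own name, not a library function) uses the usual (n - 1)-weighted pooled variance:

```python
import numpy as np

def cohens_d(x, y):
    """|mean difference| / pooled SD, with (n-1)-weighted variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return abs(np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 200)   # mean 10, SD 2
treated = rng.normal(11.6, 2.0, 200)   # shifted by 1.6, so true d = 0.8
d_est = cohens_d(control, treated)
print(f"estimated d: {d_est:.2f}")     # close to 0.8
```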
Power for Common Designs
Two-Group Comparison (t-test)
The most basic design: compare means between two independent groups.
Required sample size per group (approximate, for α=0.05, power=0.80):
| Effect Size (d) | n per group |
|---|---|
| 0.2 (small) | 394 |
| 0.5 (medium) | 64 |
| 0.8 (large) | 26 |
| 1.0 | 17 |
| 1.5 | 9 |
| 2.0 | 6 |
Key insight: Detecting a small effect requires nearly 400 samples per group! This is why GWAS need thousands of subjects — individual genetic variants typically have very small effects (d ≈ 0.1-0.2).
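These numbers can be reproduced (up to rounding) with a closed-form calculation in statsmodels, assuming a two-sided test at α = 0.05 and 80% power:

```python
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
required = {}
for d in [0.2, 0.5, 0.8, 1.0, 1.5, 2.0]:
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.80)
    required[d] = math.ceil(n)   # round up: you can't enroll 25.5 subjects
    print(f"d = {d}: n = {required[d]} per group")
```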
Paired Design (Paired t-test)
When the same subjects are measured before and after treatment, pairing removes between-subject variability. Power increases dramatically:
| Correlation between pairs | Power improvement |
|---|---|
| 0.3 | ~30% fewer samples needed |
| 0.5 | ~50% fewer samples needed |
| 0.7 | ~70% fewer samples needed |
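A sketch of where those savings come from, using statsmodels: the SD of a within-pair difference is sigma * sqrt(2 * (1 - rho)), so the effect size on difference scores becomes d / sqrt(2 * (1 - rho)) and the one-sample test on differences needs fewer pairs. The ~30/50/70% figures in the table come out approximately, not exactly:

```python
import math
from statsmodels.stats.power import TTestIndPower, TTestPower

d = 0.5   # between-group effect in SD units
n_unpaired = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                         power=0.80)
saving = {}
for rho in [0.3, 0.5, 0.7]:
    # effect size on within-pair difference scores
    d_diff = d / math.sqrt(2 * (1 - rho))
    n_pairs = TTestPower().solve_power(effect_size=d_diff, alpha=0.05,
                                       power=0.80)
    saving[rho] = 1 - n_pairs / n_unpaired
    print(f"rho = {rho}: {math.ceil(n_pairs)} pairs vs "
          f"{math.ceil(n_unpaired)}/group unpaired "
          f"(~{saving[rho]:.0%} fewer)")
```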
RNA-seq Differential Expression
RNA-seq power depends on additional factors:
| Factor | Effect on Power |
|---|---|
| Sequencing depth | More reads → more power for low-expression genes |
| Biological replicates | THE major driver of power |
| Fold change threshold | Larger FC → easier to detect |
| Dispersion | Higher variability → need more samples |
Rules of thumb for RNA-seq DE:
- Minimum: 3 biological replicates (detects only >4-fold changes)
- Good: 6-8 replicates (detects 2-fold changes)
- Ideal: 12+ replicates (detects 1.5-fold changes)
- Technical replicates have diminishing returns — invest in biological replicates
Common pitfall: Confusing technical replicates (sequencing the same library twice) with biological replicates (independent biological samples). Only biological replicates give you power to generalize. Ten technical replicates of one sample give you n=1, not n=10.
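A toy simulation (my own numbers: biological SD of 1, technical SD of 0.1) makes the danger concrete: with no true effect at all, treating ten technical replicates per condition as n=10 yields "significant" p-values most of the time, because the t-test sees only the tiny measurement noise:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n_sim = 2000
false_pos = 0
for _ in range(n_sim):
    # ONE biological sample per condition; identical true means (no effect)
    sample_a = rng.normal(0, 1)   # between-sample (biological) SD = 1
    sample_b = rng.normal(0, 1)
    # ten technical replicates of each, with small measurement noise
    tech_a = sample_a + rng.normal(0, 0.1, 10)
    tech_b = sample_b + rng.normal(0, 0.1, 10)
    if ttest_ind(tech_a, tech_b).pvalue < 0.05:
        false_pos += 1
fp_rate = false_pos / n_sim
print(f"false positive rate at nominal alpha = 0.05: {fp_rate:.2f}")
```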
GWAS
| Study Type | Typical n | Detectable Effect |
|---|---|---|
| Candidate gene | 500-1000 | Large OR (>2.0) |
| Moderate GWAS | 5,000-10,000 | Medium OR (1.3-1.5) |
| Large GWAS | 50,000-500,000 | Small OR (1.05-1.1) |
| UK Biobank scale | 500,000+ | Tiny effects |
Power Curves
Power curves show how power varies with sample size for different effect sizes. They’re the most informative visualization for study planning — you can see the “sweet spot” where adding more samples gives diminishing returns.
The Cost of Underpowered Studies
An underpowered study is not just a failed study — it’s actively harmful:
- Waste: Money, time, and irreplaceable biological samples consumed for inconclusive results
- Publication bias: Only “lucky” underpowered studies (that happen to find p < 0.05) get published, inflating reported effect sizes
- False negatives: Real treatments or biomarkers get abandoned
- Ethical cost: Patients enrolled in clinical trials with no realistic chance of detecting a benefit
Clinical relevance: The FDA and EMA require power analyses for all clinical trial protocols. Journal reviewers increasingly require them for observational studies too. “How many samples do you need?” is the first question of good experimental design.
Experimental Design in BioLang
Basic Power Analysis for t-test
set_seed(42)
# How many samples to detect a 2-fold change in gene expression?
# Parameters
let alpha = 0.05
let power_target = 0.80
let effect_size = 0.8 # Cohen's d for ~2-fold change
# Simulate power at different sample sizes
let sample_sizes = [5, 10, 15, 20, 30, 50, 75, 100]
let n_simulations = 1000
print("=== Power Analysis: Two-Sample t-test ===")
print("Effect size (Cohen's d): {effect_size}")
print("Alpha: {alpha}")
print("")
print("n per group Estimated Power")
for n in sample_sizes {
let significant = 0
for sim in 0..n_simulations {
# Simulate two groups with known effect
let group1 = rnorm(n, 0, 1)
let group2 = rnorm(n, effect_size, 1)
let result = ttest(group1, group2)
if result.p_value < alpha {
significant = significant + 1
}
}
let power = significant / n_simulations
let marker = if power >= power_target { " <-- sufficient" } else { "" }
print("{n} {power |> round(3)}{marker}")
}
Power Curves for Different Effect Sizes
set_seed(42)
# Visualize power as a function of sample size
let sample_sizes = [5, 10, 15, 20, 25, 30, 40, 50, 75, 100]
let effect_sizes = [0.3, 0.5, 0.8, 1.2]
let n_sims = 500
let power_curves = {}
for d in effect_sizes {
let powers = []
for n in sample_sizes {
let sig_count = 0
for s in 0..n_sims {
let g1 = rnorm(n, 0, 1)
let g2 = rnorm(n, d, 1)
if ttest(g1, g2).p_value < 0.05 {
sig_count = sig_count + 1
}
}
powers = powers + [sig_count / n_sims]
}
power_curves["{d}"] = powers
}
# Plot power curves
let curve_table = table({
"n": sample_sizes,
"d_0.3": power_curves["0.3"],
"d_0.5": power_curves["0.5"],
"d_0.8": power_curves["0.8"],
"d_1.2": power_curves["1.2"]
})
plot(curve_table, {type: "line", x: "n",
title: "Power Curves: Two-Sample t-test",
x_label: "Sample Size per Group", y_label: "Statistical Power"})
RNA-seq Experiment Design
set_seed(42)
# How many biological replicates for RNA-seq DE?
# Simulate RNA-seq-like data
let fold_changes = [1.5, 2.0, 3.0, 4.0]
let replicates = [3, 5, 8, 12, 20]
let n_sims = 200
print("=== RNA-seq Power by Fold Change and Replicates ===")
print("FC n=3 n=5 n=8 n=12 n=20")
for fc in fold_changes {
let powers = []
for n in replicates {
let detected = 0
for sim in 0..n_sims {
# Simulate one gene with known fold change
let control = rnorm(n, 10, 2)
let treatment = rnorm(n, 10 * fc, 2 * fc)
# Log-transform (as in real RNA-seq analysis)
let log_ctrl = control |> map(|x| log2(max(x, 0.1)))
let log_treat = treatment |> map(|x| log2(max(x, 0.1)))
let p = ttest(log_ctrl, log_treat).p_value
if p < 0.05 { detected = detected + 1 }
}
powers = powers + [detected / n_sims]
}
print("{fc}  " ++ (powers |> map(|p| "{(p * 100) |> round(0)}%") |> join("  ")))
}
# Key takeaway: n=3 barely detects 4-fold changes;
# n=8 reliably detects 2-fold changes
Paired vs. Unpaired Design
set_seed(42)
# Show the power advantage of paired designs
let n_sims = 1000
let n = 20
let effect = 0.5 # medium effect
let subject_sd = 2.0 # between-subject variability
let within_sd = 0.5 # within-subject variability
let power_unpaired = 0
let power_paired = 0
for sim in 0..n_sims {
# Unpaired: independent groups
let group1 = rnorm(n, 0, subject_sd)
let group2 = rnorm(n, effect, subject_sd)
if ttest(group1, group2).p_value < 0.05 {
power_unpaired = power_unpaired + 1
}
# Paired: same subjects, before and after
let baseline = rnorm(n, 0, subject_sd)
let after = baseline + effect + rnorm(n, 0, within_sd)
let diff = after - baseline
if ttest_one(diff, 0).p_value < 0.05 {
power_paired = power_paired + 1
}
}
print("=== Paired vs Unpaired Design (n={n}, d={effect}) ===")
print("Unpaired power: {(power_unpaired / n_sims * 100) |> round(1)}%")
print("Paired power: {(power_paired / n_sims * 100) |> round(1)}%")
print("Pairing advantage: {((power_paired - power_unpaired) / n_sims * 100) |> round(1)} percentage points")
Multi-Group Design (ANOVA)
set_seed(42)
# Power for detecting differences among 4 treatment groups
let n_sims = 500
let k = 4 # number of groups
let group_means = [0, 0.3, 0.6, 0.9] # increasing effect
let sample_sizes = [5, 10, 15, 20, 30]
print("=== ANOVA Power (k={k} groups) ===")
for n in sample_sizes {
let sig = 0
for sim in 0..n_sims {
let groups = []
for i in 0..k {
groups = groups + [rnorm(n, group_means[i], 1)]
}
let result = anova(groups)
if result.p_value < 0.05 { sig = sig + 1 }
}
print("n = {n} per group: power = {(sig / n_sims * 100) |> round(1)}%")
}
Sample Size Recommendation Report
set_seed(42)
# Generate a complete sample size recommendation
let scenarios = [
{ name: "Conservative (d=0.5)", effect: 0.5 },
{ name: "Expected (d=0.8)", effect: 0.8 },
{ name: "Optimistic (d=1.2)", effect: 1.2 }
]
print("=== SAMPLE SIZE RECOMMENDATION REPORT ===")
print("Two-group comparison, alpha=0.05, power=80%")
print("")
for s in scenarios {
# Find minimum n for 80% power via simulation
let required_n = 0
for n in 5..200 {
let power = 0
for sim in 0..500 {
let g1 = rnorm(n, 0, 1)
let g2 = rnorm(n, s.effect, 1)
if ttest(g1, g2).p_value < 0.05 { power = power + 1 }
}
if power / 500 >= 0.80 {
required_n = n
break
}
}
print("{s.name}: n = {required_n}/group")
}
print("")
print("Recommendation: Plan for the CONSERVATIVE")
print("estimate + 10-20% for dropout/QC failures")
Python:
import numpy as np
from statsmodels.stats.power import TTestIndPower
# Power analysis for two-sample t-test
analysis = TTestIndPower()
# Required sample size
n = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.80)
print(f"Required n per group: {n:.0f}")
# Power curve (plot_power expects array-like nobs and effect sizes)
import matplotlib.pyplot as plt
fig = analysis.plot_power(
    dep_var='nobs', nobs=np.arange(5, 100),
    effect_size=np.array([0.3, 0.5, 0.8, 1.2]))
# Simulation-based power
from scipy.stats import ttest_ind
def simulate_power(n, d, n_sim=1000):
    sig = sum(ttest_ind(np.random.normal(0, 1, n),
                        np.random.normal(d, 1, n)).pvalue < 0.05
              for _ in range(n_sim))
    return sig / n_sim
R:
# Power analysis for two-sample t-test
# (base R takes the raw mean difference `delta` plus `sd`,
#  so set sd = 1 to work in Cohen's d units)
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.80)
# Same calculation with the pwr package (takes Cohen's d directly)
library(pwr)
pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.80)
# RNA-seq specific power (depth = average coverage per gene, not total reads)
library(RNASeqPower)
rnapower(depth = 20, cv = 0.4, effect = 2,
         alpha = 0.05, power = 0.8)
# Simulation-based
library(simr)
# simr provides power simulation for mixed models
Exercises
Exercise 1: Power for Your Study
You’re planning a study comparing tumor mutation burden between immunotherapy responders and non-responders. Pilot data suggests d ≈ 0.6 with SD = 5 mutations/Mb. How many patients per group do you need for 80% power?
# 1. Simulate power at n = 10, 20, 30, 50, 75, 100
# 2. Find the minimum n for 80% power
# 3. Add 15% for anticipated dropout
# 4. Create a power curve plot
Exercise 2: Paired vs. Unpaired
A study can either use 30 independent samples per group OR 30 paired before/after measurements. The between-subject SD is 3x the within-subject SD. Compare the power of both designs.
# 1. Simulate paired and unpaired designs with n=30
# 2. Effect size d = 0.5
# 3. Between-subject SD = 3, within-subject SD = 1
# 4. Which design achieves higher power?
# 5. How many unpaired samples would match the paired design's power?
Exercise 3: RNA-seq Planning
You’re designing an RNA-seq experiment to identify genes with at least 1.5-fold change between two conditions. Your budget allows either 6 samples at 30M reads each or 12 samples at 15M reads each. Which design has more power?
# Simulate both scenarios
# Track how many "true DE genes" each design detects
# Which is better: more depth or more replicates?
Exercise 4: The Underpowered Literature
Simulate 100 “studies” with n=10 per group and a true small effect (d=0.3). Show that:
- Most studies (>80%) fail to detect the effect
- The “significant” studies dramatically overestimate the effect size
- This creates a biased picture in the published literature
# 1. Run 100 simulated two-sample t-tests (n=10, d=0.3)
# 2. Count how many achieve p < 0.05
# 3. For the significant ones, compute Cohen's d from the data
# 4. Compare the average "published" d to the true d = 0.3
# 5. This is the "winner's curse" — published effects are inflated
Key Takeaways
- Power analysis determines how many samples you need BEFORE starting an experiment — it’s not optional, it’s essential
- The four pillars are α (false positive rate), power (1-β, false negative rate), effect size, and sample size — fix three, solve for the fourth
- Underpowered studies waste resources, inflate published effect sizes, and can falsely suggest an effect doesn’t exist
- Biological replicates drive power in genomics — technical replicates give diminishing returns
- For RNA-seq: n=3 is barely adequate (detects >4-fold), n=8 is good (2-fold), n=12+ is ideal (1.5-fold)
- Paired designs dramatically increase power by removing between-subject variability
- The winner’s curse: underpowered studies that happen to be significant overestimate the true effect
- Always use conservative effect size estimates and add buffer for dropout/QC failures
- Power curves visualize the sample size / power trade-off and help identify the sweet spot
What’s Next
Statistical significance tells you whether an effect is real, but not whether it matters. Day 19 introduces effect sizes — Cohen’s d, odds ratios, relative risk — and the critical distinction between statistical significance and practical importance.