Day 18: Experimental Design and Statistical Power
The Problem
Dr. Ana Reyes is a junior PI writing her first R01 grant. She proposes a study comparing gene expression between psoriatic skin and normal skin using RNA-seq, planning 3 samples per group because “that’s what the lab down the hall used.”
The grant comes back with this reviewer comment:
“The proposed sample size of n=3 per group is inadequate. The applicant provides no power analysis to justify this number. With 3 replicates, the study is severely underpowered to detect anything less than a 4-fold change, which is biologically unrealistic for most genes. We recommend at least 8-10 biological replicates per condition based on published power analyses for RNA-seq DE studies.”
Grant rejected. Six months of proposal writing, wasted — because she didn’t plan the sample size.
Statistical power determines whether your study can actually detect the effect you’re looking for. Getting it wrong wastes time, money, animals, and patient samples.
What Is Statistical Power?
Power analysis sits at the intersection of four interconnected quantities:
| Quantity | Symbol | Definition | Typical Value |
|---|---|---|---|
| Significance level | α | Probability of false positive (Type I error) | 0.05 |
| Power | 1 - β | Probability of detecting a true effect | 0.80 (80%) |
| Effect size | d, Δ | Magnitude of the real biological difference | Varies |
| Sample size | n | Number of observations per group | What you solve for |
The fundamental relationship: Given any three, you can calculate the fourth.
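A quick way to see this in practice is statsmodels' `TTestIndPower` (the same class used in the Python snippet at the end of this chapter): leave exactly one argument unset and `solve_power` computes it.

```python
# Given any three of {alpha, power, effect size, n}, solve for the
# fourth: solve_power() returns whichever argument is left out.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# Solve for sample size (the usual planning question)
n = solver.solve_power(effect_size=0.8, alpha=0.05, power=0.80)

# Solve for power at a fixed sample size
power = solver.solve_power(effect_size=0.8, alpha=0.05, nobs1=26)

# Solve for the smallest detectable effect at that sample size
d_min = solver.solve_power(alpha=0.05, power=0.80, nobs1=26)

print(f"required n per group:  {n:.1f}")
print(f"power at n=26:         {power:.2f}")
print(f"detectable d at n=26:  {d_min:.2f}")
```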
Type I and Type II Errors
| | H₀ True (no real effect) | H₀ False (real effect exists) |
|---|---|---|
| Reject H₀ | Type I error (α) — false positive | Correct! (Power = 1 - β) |
| Fail to reject H₀ | Correct! (1 - α) | Type II error (β) — false negative |
Key insight: An underpowered study has a high β — it frequently misses real effects. This doesn’t just waste resources; it can lead to the false conclusion that an effect doesn’t exist, discouraging further research on a real phenomenon.
Why 80% Power?
The convention of 80% power means accepting a 20% chance of missing a real effect. Some contexts demand higher:
| Context | Minimum Power | Rationale |
|---|---|---|
| Exploratory study | 70-80% | Acceptable miss rate for discovery |
| Confirmatory clinical trial | 80-90% | Regulatory requirement |
| Safety/non-inferiority trial | 90-95% | Must not miss harmful effects |
| Rare disease (limited patients) | 60-70% | Pragmatic constraint |
Effect Size: The Missing Ingredient
Effect size is the hardest quantity to estimate because it requires knowledge about the biology before doing the experiment. Sources:
| Source | Approach |
|---|---|
| Pilot data | Small preliminary experiment (best source) |
| Literature | Previous studies on similar questions |
| Clinical significance | “What’s the smallest difference that matters?” |
| Conventions | Cohen’s standards (d = 0.2 small, 0.5 medium, 0.8 large) |
Common pitfall: Using an inflated effect size from a small pilot study. Small studies overestimate effects (the “winner’s curse”). If your pilot with n=5 shows d=1.5, the true effect is probably smaller. Be conservative.
Cohen’s d for Two-Group Comparisons
$$d = \frac{|\mu_1 - \mu_2|}{\sigma_{pooled}}$$
| Cohen’s d | Interpretation | Biological Example |
|---|---|---|
| 0.2 | Small | Subtle expression change between tissues |
| 0.5 | Medium | Moderate expression shift between conditions |
| 0.8 | Large | Drug vs. placebo in responsive patients |
| 1.2 | Very large | Knockout vs. wild-type for target gene |
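The formula translates directly into code; this helper (`cohens_d` is my own name, not a library function) uses the usual (n - 1)-weighted pooled variance:

```python
import numpy as np

def cohens_d(x, y):
    """|mean difference| / pooled SD, with (n-1)-weighted variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return abs(np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 200)   # mean 10, SD 2
treated = rng.normal(11.6, 2.0, 200)   # shifted by 1.6, so true d = 0.8
d_est = cohens_d(control, treated)
print(f"estimated d: {d_est:.2f}")     # close to 0.8
```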
Power for Common Designs
Two-Group Comparison (t-test)
The most basic design: compare means between two independent groups.
Required sample size per group (approximate, for α=0.05, power=0.80):
| Effect Size (d) | n per group |
|---|---|
| 0.2 (small) | 394 |
| 0.5 (medium) | 64 |
| 0.8 (large) | 26 |
| 1.0 | 17 |
| 1.5 | 9 |
| 2.0 | 6 |
Key insight: Detecting a small effect requires nearly 400 samples per group! This is why GWAS need thousands of subjects — individual genetic variants typically have very small effects (d ≈ 0.1-0.2).
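These numbers can be reproduced (up to rounding) with a closed-form calculation in statsmodels, assuming a two-sided test at α = 0.05 and 80% power:

```python
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
required = {}
for d in [0.2, 0.5, 0.8, 1.0, 1.5, 2.0]:
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.80)
    required[d] = math.ceil(n)   # round up: you can't enroll 25.5 subjects
    print(f"d = {d}: n = {required[d]} per group")
```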
Paired Design (Paired t-test)
When the same subjects are measured before and after treatment, pairing removes between-subject variability. Power increases dramatically:
| Correlation between pairs | Power improvement |
|---|---|
| 0.3 | ~30% fewer samples needed |
| 0.5 | ~50% fewer samples needed |
| 0.7 | ~70% fewer samples needed |
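A sketch of where those savings come from, using statsmodels: the SD of a within-pair difference is sigma * sqrt(2 * (1 - rho)), so the effect size on difference scores becomes d / sqrt(2 * (1 - rho)) and the one-sample test on differences needs fewer pairs. The ~30/50/70% figures in the table come out approximately, not exactly:

```python
import math
from statsmodels.stats.power import TTestIndPower, TTestPower

d = 0.5   # between-group effect in SD units
n_unpaired = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                         power=0.80)
saving = {}
for rho in [0.3, 0.5, 0.7]:
    # effect size on within-pair difference scores
    d_diff = d / math.sqrt(2 * (1 - rho))
    n_pairs = TTestPower().solve_power(effect_size=d_diff, alpha=0.05,
                                       power=0.80)
    saving[rho] = 1 - n_pairs / n_unpaired
    print(f"rho = {rho}: {math.ceil(n_pairs)} pairs vs "
          f"{math.ceil(n_unpaired)}/group unpaired "
          f"(~{saving[rho]:.0%} fewer)")
```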
RNA-seq Differential Expression
RNA-seq power depends on additional factors:
| Factor | Effect on Power |
|---|---|
| Sequencing depth | More reads → more power for low-expression genes |
| Biological replicates | THE major driver of power |
| Fold change threshold | Larger FC → easier to detect |
| Dispersion | Higher variability → need more samples |
Rules of thumb for RNA-seq DE:
- Minimum: 3 biological replicates (detects only >4-fold changes)
- Good: 6-8 replicates (detects 2-fold changes)
- Ideal: 12+ replicates (detects 1.5-fold changes)
- Technical replicates have diminishing returns — invest in biological replicates
Common pitfall: Confusing technical replicates (sequencing the same library twice) with biological replicates (independent biological samples). Only biological replicates give you power to generalize. Ten technical replicates of one sample give you n=1, not n=10.
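A toy simulation (my own numbers: biological SD of 1, technical SD of 0.1) makes the danger concrete: with no true effect at all, treating ten technical replicates per condition as n=10 yields "significant" p-values most of the time, because the t-test sees only the tiny measurement noise:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n_sim = 2000
false_pos = 0
for _ in range(n_sim):
    # ONE biological sample per condition; identical true means (no effect)
    sample_a = rng.normal(0, 1)   # between-sample (biological) SD = 1
    sample_b = rng.normal(0, 1)
    # ten technical replicates of each, with small measurement noise
    tech_a = sample_a + rng.normal(0, 0.1, 10)
    tech_b = sample_b + rng.normal(0, 0.1, 10)
    if ttest_ind(tech_a, tech_b).pvalue < 0.05:
        false_pos += 1
fp_rate = false_pos / n_sim
print(f"false positive rate at nominal alpha = 0.05: {fp_rate:.2f}")
```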
GWAS
| Study Type | Typical n | Detectable Effect |
|---|---|---|
| Candidate gene | 500-1000 | Large OR (>2.0) |
| Moderate GWAS | 5,000-10,000 | Medium OR (1.3-1.5) |
| Large GWAS | 50,000-500,000 | Small OR (1.05-1.1) |
| UK Biobank scale | 500,000+ | Tiny effects |
Power Curves
Power curves show how power varies with sample size for different effect sizes. They’re the most informative visualization for study planning — you can see the “sweet spot” where adding more samples gives diminishing returns.
The Cost of Underpowered Studies
An underpowered study is not just a failed study — it’s actively harmful:
- Waste: Money, time, and irreplaceable biological samples consumed for inconclusive results
- Publication bias: Only “lucky” underpowered studies (that happen to find p < 0.05) get published, inflating reported effect sizes
- False negatives: Real treatments or biomarkers get abandoned
- Ethical cost: Patients enrolled in clinical trials with no realistic chance of detecting a benefit
Clinical relevance: The FDA and EMA require power analyses for all clinical trial protocols. Journal reviewers increasingly require them for observational studies too. “How many samples do you need?” is the first question of good experimental design.
Experimental Design in BioLang
Basic Power Analysis for t-test
set_seed(42)
# How many samples to detect a 2-fold change in gene expression?
# Parameters
let alpha = 0.05
let power_target = 0.80
let effect_size = 0.8 # Cohen's d for ~2-fold change
# Simulate power at different sample sizes
let sample_sizes = [5, 10, 15, 20, 30, 50, 75, 100]
let n_simulations = 1000
print("=== Power Analysis: Two-Sample t-test ===")
print("Effect size (Cohen's d): {effect_size}")
print("Alpha: {alpha}")
print("")
print("n per group Estimated Power")
for n in sample_sizes {
let significant = 0
for sim in 0..n_simulations {
# Simulate two groups with known effect
let group1 = rnorm(n, 0, 1)
let group2 = rnorm(n, effect_size, 1)
let result = ttest(group1, group2)
if result.p_value < alpha {
significant = significant + 1
}
}
let power = significant / n_simulations
let marker = if power >= power_target { " <-- sufficient" } else { "" }
print("{n} {power |> round(3)}{marker}")
}
Power Curves for Different Effect Sizes
set_seed(42)
# Visualize power as a function of sample size
let sample_sizes = [5, 10, 15, 20, 25, 30, 40, 50, 75, 100]
let effect_sizes = [0.3, 0.5, 0.8, 1.2]
let n_sims = 500
let power_curves = {}
for d in effect_sizes {
let powers = []
for n in sample_sizes {
let sig_count = 0
for s in 0..n_sims {
let g1 = rnorm(n, 0, 1)
let g2 = rnorm(n, d, 1)
if ttest(g1, g2).p_value < 0.05 {
sig_count = sig_count + 1
}
}
powers = powers + [sig_count / n_sims]
}
power_curves["{d}"] = powers
}
# Plot power curves
let curve_table = table({
"n": sample_sizes,
"d_0.3": power_curves["0.3"],
"d_0.5": power_curves["0.5"],
"d_0.8": power_curves["0.8"],
"d_1.2": power_curves["1.2"]
})
plot(curve_table, {type: "line", x: "n",
title: "Power Curves: Two-Sample t-test",
x_label: "Sample Size per Group", y_label: "Statistical Power"})
RNA-seq Experiment Design
set_seed(42)
# How many biological replicates for RNA-seq DE?
# Simulate RNA-seq-like data
let fold_changes = [1.5, 2.0, 3.0, 4.0]
let replicates = [3, 5, 8, 12, 20]
let n_sims = 200
print("=== RNA-seq Power by Fold Change and Replicates ===")
print("FC n=3 n=5 n=8 n=12 n=20")
for fc in fold_changes {
let powers = []
for n in replicates {
let detected = 0
for sim in 0..n_sims {
# Simulate one gene with known fold change
let control = rnorm(n, 10, 2)
let treatment = rnorm(n, 10 * fc, 2 * fc)
# Log-transform (as in real RNA-seq analysis)
let log_ctrl = control |> map(|x| log2(max(x, 0.1)))
let log_treat = treatment |> map(|x| log2(max(x, 0.1)))
let p = ttest(log_ctrl, log_treat).p_value
if p < 0.05 { detected = detected + 1 }
}
powers = powers + [detected / n_sims]
}
print("{fc}  " ++ (powers |> map(|p| "{(p * 100) |> round(0)}%") |> join("  ")))
}
# Key takeaway: n=3 barely detects 4-fold changes;
# n=8 reliably detects 2-fold changes
Paired vs. Unpaired Design
set_seed(42)
# Show the power advantage of paired designs
let n_sims = 1000
let n = 20
let effect = 0.5 # medium effect
let subject_sd = 2.0 # between-subject variability
let within_sd = 0.5 # within-subject variability
let power_unpaired = 0
let power_paired = 0
for sim in 0..n_sims {
# Unpaired: independent groups
let group1 = rnorm(n, 0, subject_sd)
let group2 = rnorm(n, effect, subject_sd)
if ttest(group1, group2).p_value < 0.05 {
power_unpaired = power_unpaired + 1
}
# Paired: same subjects, before and after
let baseline = rnorm(n, 0, subject_sd)
let after = baseline + effect + rnorm(n, 0, within_sd)
let diff = after - baseline
if ttest_one(diff, 0).p_value < 0.05 {
power_paired = power_paired + 1
}
}
print("=== Paired vs Unpaired Design (n={n}, d={effect}) ===")
print("Unpaired power: {(power_unpaired / n_sims * 100) |> round(1)}%")
print("Paired power: {(power_paired / n_sims * 100) |> round(1)}%")
print("Pairing advantage: {((power_paired - power_unpaired) / n_sims * 100) |> round(1)} percentage points")
Multi-Group Design (ANOVA)
set_seed(42)
# Power for detecting differences among 4 treatment groups
let n_sims = 500
let k = 4 # number of groups
let group_means = [0, 0.3, 0.6, 0.9] # increasing effect
let sample_sizes = [5, 10, 15, 20, 30]
print("=== ANOVA Power (k={k} groups) ===")
for n in sample_sizes {
let sig = 0
for sim in 0..n_sims {
let groups = []
for i in 0..k {
groups = groups + [rnorm(n, group_means[i], 1)]
}
let result = anova(groups)
if result.p_value < 0.05 { sig = sig + 1 }
}
print("n = {n} per group: power = {(sig / n_sims * 100) |> round(1)}%")
}
Sample Size Recommendation Report
set_seed(42)
# Generate a complete sample size recommendation
let scenarios = [
{ name: "Conservative (d=0.5)", effect: 0.5 },
{ name: "Expected (d=0.8)", effect: 0.8 },
{ name: "Optimistic (d=1.2)", effect: 1.2 }
]
print("=== SAMPLE SIZE RECOMMENDATION REPORT ===")
print("Two-group comparison, alpha=0.05, power=80%")
print("")
for s in scenarios {
# Find minimum n for 80% power via simulation
let required_n = 0
for n in 5..200 {
let power = 0
for sim in 0..500 {
let g1 = rnorm(n, 0, 1)
let g2 = rnorm(n, s.effect, 1)
if ttest(g1, g2).p_value < 0.05 { power = power + 1 }
}
if power / 500 >= 0.80 {
required_n = n
break
}
}
print("{s.name}: n = {required_n}/group")
}
print("")
print("Recommendation: Plan for the CONSERVATIVE")
print("estimate + 10-20% for dropout/QC failures")
Python:
import numpy as np
from statsmodels.stats.power import TTestIndPower
# Power analysis for two-sample t-test
analysis = TTestIndPower()
# Required sample size
n = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.80)
print(f"Required n per group: {n:.0f}")
# Power curve (plot_power expects array-like nobs and effect sizes)
import matplotlib.pyplot as plt
fig = analysis.plot_power(
    dep_var='nobs', nobs=np.arange(5, 100),
    effect_size=np.array([0.3, 0.5, 0.8, 1.2]))
# Simulation-based power
from scipy.stats import ttest_ind
def simulate_power(n, d, n_sim=1000):
    sig = sum(ttest_ind(np.random.normal(0, 1, n),
                        np.random.normal(d, 1, n)).pvalue < 0.05
              for _ in range(n_sim))
    return sig / n_sim
R:
# Power analysis for two-sample t-test
# (base R takes the raw mean difference `delta` plus `sd`,
#  so set sd = 1 to work in Cohen's d units)
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.80)
# Same calculation with the pwr package (takes Cohen's d directly)
library(pwr)
pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.80)
# RNA-seq specific power (depth = average coverage per gene, not total reads)
library(RNASeqPower)
rnapower(depth = 20, cv = 0.4, effect = 2,
         alpha = 0.05, power = 0.8)
# Simulation-based
library(simr)
# simr provides power simulation for mixed models
Exercises
Exercise 1: Power for Your Study
You’re planning a study comparing tumor mutation burden between immunotherapy responders and non-responders. Pilot data suggests d ≈ 0.6 with SD = 5 mutations/Mb. How many patients per group do you need for 80% power?
# 1. Simulate power at n = 10, 20, 30, 50, 75, 100
# 2. Find the minimum n for 80% power
# 3. Add 15% for anticipated dropout
# 4. Create a power curve plot
Exercise 2: Paired vs. Unpaired
A study can either use 30 independent samples per group OR 30 paired before/after measurements. The between-subject SD is 3x the within-subject SD. Compare the power of both designs.
# 1. Simulate paired and unpaired designs with n=30
# 2. Effect size d = 0.5
# 3. Between-subject SD = 3, within-subject SD = 1
# 4. Which design achieves higher power?
# 5. How many unpaired samples would match the paired design's power?
Exercise 3: RNA-seq Planning
You’re designing an RNA-seq experiment to identify genes with at least 1.5-fold change between two conditions. Your budget allows either 6 samples at 30M reads each or 12 samples at 15M reads each. Which design has more power?
# Simulate both scenarios
# Track how many "true DE genes" each design detects
# Which is better: more depth or more replicates?
Exercise 4: The Underpowered Literature
Simulate 100 “studies” with n=10 per group and a true small effect (d=0.3). Show that:
- Most studies (>80%) fail to detect the effect
- The “significant” studies dramatically overestimate the effect size
- This creates a biased picture in the published literature
# 1. Run 100 simulated two-sample t-tests (n=10, d=0.3)
# 2. Count how many achieve p < 0.05
# 3. For the significant ones, compute Cohen's d from the data
# 4. Compare the average "published" d to the true d = 0.3
# 5. This is the "winner's curse" — published effects are inflated
Key Takeaways
- Power analysis determines how many samples you need BEFORE starting an experiment — it’s not optional, it’s essential
- The four pillars are α (false positive rate), power (1-β, false negative rate), effect size, and sample size — fix three, solve for the fourth
- Underpowered studies waste resources, inflate published effect sizes, and can falsely suggest an effect doesn’t exist
- Biological replicates drive power in genomics — technical replicates give diminishing returns
- For RNA-seq: n=3 is barely adequate (detects >4-fold), n=8 is good (2-fold), n=12+ is ideal (1.5-fold)
- Paired designs dramatically increase power by removing between-subject variability
- The winner’s curse: underpowered studies that happen to be significant overestimate the true effect
- Always use conservative effect size estimates and add buffer for dropout/QC failures
- Power curves visualize the sample size / power trade-off and help identify the sweet spot
What’s Next
Statistical significance tells you whether an effect is real, but not whether it matters. Day 19 introduces effect sizes — Cohen’s d, odds ratios, relative risk — and the critical distinction between statistical significance and practical importance.