
Day 18: Experimental Design and Statistical Power

The Problem

Dr. Ana Reyes is a junior PI writing her first R01 grant. She proposes a study comparing gene expression between psoriatic skin and normal skin using RNA-seq, planning 3 samples per group because “that’s what the lab down the hall used.”

The grant comes back with this reviewer comment:

“The proposed sample size of n=3 per group is inadequate. The applicant provides no power analysis to justify this number. With 3 replicates, the study is severely underpowered to detect anything less than a 4-fold change, which is biologically unrealistic for most genes. We recommend at least 8-10 biological replicates per condition based on published power analyses for RNA-seq DE studies.”

Grant rejected. Six months of proposal writing, wasted — because she didn’t plan the sample size.

Statistical power determines whether your study can actually detect the effect you’re looking for. Getting it wrong wastes time, money, animals, and patient samples.

What Is Statistical Power?

Power analysis sits at the intersection of four interconnected quantities:

| Quantity | Symbol | Definition | Typical Value |
|---|---|---|---|
| Significance level | α | Probability of a false positive (Type I error) | 0.05 |
| Power | 1 - β | Probability of detecting a true effect | 0.80 (80%) |
| Effect size | d, Δ | Magnitude of the real biological difference | Varies |
| Sample size | n | Number of observations per group | What you solve for |

The fundamental relationship: Given any three, you can calculate the fourth.
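To make this concrete, here is a minimal stdlib-Python sketch of the two-sample case using the standard normal approximation (the exact t-test answer is slightly larger at small n; function names are illustrative):

```python
# Sketch of "fix three quantities, solve for the fourth" for a two-sample
# comparison, using the normal approximation (stdlib only).
from statistics import NormalDist
import math

Z = NormalDist().inv_cdf  # quantile function of the standard normal

def n_per_group(d, alpha=0.05, power=0.80):
    """Sample size per group to detect effect size d (two-sided test)."""
    return math.ceil(2 * (Z(1 - alpha / 2) + Z(power)) ** 2 / d ** 2)

def achieved_power(n, d, alpha=0.05):
    """Power of a two-sample test with n per group and true effect d."""
    ncp = d * math.sqrt(n / 2)  # shift of the test statistic under H1
    return 1 - NormalDist().cdf(Z(1 - alpha / 2) - ncp)

def detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest effect size detectable with n per group."""
    return (Z(1 - alpha / 2) + Z(power)) * math.sqrt(2 / n)

print(n_per_group(0.8))                    # 25 (exact t-test: 26)
print(round(achieved_power(3, 1.0), 2))    # 0.23 — n=3 misses even d=1
print(round(detectable_effect(3), 2))      # 2.29 — n=3 only sees huge effects
```

Each function is just the same equation rearranged for a different unknown, which is the sense in which any three quantities determine the fourth.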

Type I and Type II Errors

| | H₀ True (no real effect) | H₀ False (real effect exists) |
|---|---|---|
| Reject H₀ | Type I error (α) — false positive | Correct! (Power = 1 - β) |
| Fail to reject H₀ | Correct! (1 - α) | Type II error (β) — false negative |

Key insight: An underpowered study has a high β — it frequently misses real effects. This doesn’t just waste resources; it can lead to the false conclusion that an effect doesn’t exist, discouraging further research on a real phenomenon.

Figure: Distributions of the test statistic under H₀ (no effect) and H₁ (true effect). The critical value partitions the outcomes into α (Type I, false positive), β (Type II, false negative), and power = 1 - β (correctly detected effect).

Why 80% Power?

The convention of 80% power means accepting a 20% chance of missing a real effect. Some contexts demand higher:

| Context | Minimum Power | Rationale |
|---|---|---|
| Exploratory study | 70-80% | Acceptable miss rate for discovery |
| Confirmatory clinical trial | 80-90% | Regulatory requirement |
| Safety/non-inferiority trial | 90-95% | Must not miss harmful effects |
| Rare disease (limited patients) | 60-70% | Pragmatic constraint |

Effect Size: The Missing Ingredient

Effect size is the hardest quantity to estimate because it requires knowledge about the biology before doing the experiment. Sources:

| Source | Approach |
|---|---|
| Pilot data | Small preliminary experiment (best source) |
| Literature | Previous studies on similar questions |
| Clinical significance | "What's the smallest difference that matters?" |
| Conventions | Cohen's standards (d = 0.2 small, 0.5 medium, 0.8 large) |

Common pitfall: Using an inflated effect size from a small pilot study. Small studies overestimate effects (the “winner’s curse”). If your pilot with n=5 shows d=1.5, the true effect is probably smaller. Be conservative.
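The winner's curse is easy to see by simulation. The stdlib-Python sketch below runs many small pilot studies with a true d = 0.5 and looks at the estimated effect among just the ones that happened to reach p < 0.05 (a z-test on the standardized difference stands in for the t-test; the pilot size n = 5 is illustrative):

```python
# Winner's curse sketch: among small studies that reach significance,
# the estimated effect size is far above the true effect (stdlib only).
import random, math
from statistics import NormalDist, mean, stdev

random.seed(42)
TRUE_D, N = 0.5, 5
published = []

for _ in range(5000):
    g1 = [random.gauss(0, 1) for _ in range(N)]
    g2 = [random.gauss(TRUE_D, 1) for _ in range(N)]
    pooled = math.sqrt((stdev(g1) ** 2 + stdev(g2) ** 2) / 2)
    d_hat = (mean(g2) - mean(g1)) / pooled
    # Normal-approximation test on the estimated effect size
    z = d_hat * math.sqrt(N / 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < 0.05:
        published.append(d_hat)

print(f"true d: {TRUE_D}")
print(f"significant pilots: {len(published) / 5000:.0%}")
print(f"mean 'published' d: {mean(published):.2f}")  # well above 0.5
```

Only a small fraction of these pilots reach significance, and those that do report an effect roughly triple the truth, which is exactly why a significant pilot estimate should be shrunk, not trusted.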

Cohen’s d for Two-Group Comparisons

$$d = \frac{|\mu_1 - \mu_2|}{\sigma_{pooled}}$$

| Cohen's d | Interpretation | Biological Example |
|---|---|---|
| 0.2 | Small | Subtle expression change between tissues |
| 0.5 | Medium | DE gene in RNA-seq (2-fold change) |
| 0.8 | Large | Drug vs. placebo in responsive patients |
| 1.2 | Very large | Knockout vs. wild-type for target gene |
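The pooled-SD formula above translates directly into code. A minimal stdlib-Python sketch, with made-up expression values for illustration:

```python
# Cohen's d with the pooled standard deviation, as in the formula above.
import math
from statistics import mean, stdev

def cohens_d(x, y):
    sx, sy = stdev(x), stdev(y)
    nx, ny = len(x), len(y)
    pooled = math.sqrt(((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2))
    return abs(mean(x) - mean(y)) / pooled

# Hypothetical expression measurements for one gene
control   = [9.8, 10.1, 10.4, 9.9, 10.2]
treatment = [10.1, 10.5, 10.8, 10.0, 10.6]
print(round(cohens_d(control, treatment), 2))  # 1.09 — a large effect
```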

Power for Common Designs

Two-Group Comparison (t-test)

The most basic design: compare means between two independent groups.

Required sample size per group (approximate, for α=0.05, power=0.80):

| Effect Size (d) | n per group |
|---|---|
| 0.2 (small) | 394 |
| 0.5 (medium) | 64 |
| 0.8 (large) | 26 |
| 1.0 | 17 |
| 1.5 | 9 |
| 2.0 | 6 |

Key insight: Detecting a small effect requires nearly 400 samples per group! This is why GWAS studies need thousands of subjects — individual genetic variants typically have very small effects (d ≈ 0.1-0.2).
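Any row of the table above can be checked by brute-force simulation. A stdlib-Python sketch for the d = 0.8, n = 26 row (a z-test with known σ = 1 stands in for the t-test, so the result lands slightly above 80%):

```python
# Monte Carlo check: d = 0.8 with n = 26 per group should give ~80% power.
import random, math
from statistics import NormalDist, mean

random.seed(1)
D, N, SIMS = 0.8, 26, 4000
crit = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value

hits = 0
for _ in range(SIMS):
    g1 = [random.gauss(0, 1) for _ in range(N)]
    g2 = [random.gauss(D, 1) for _ in range(N)]
    # z statistic for the difference in means, known sigma = 1
    z = (mean(g2) - mean(g1)) / math.sqrt(2 / N)
    if abs(z) > crit:
        hits += 1

print(f"simulated power: {hits / SIMS:.2f}")  # ~0.82
```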

Paired Design (Paired t-test)

When the same subjects are measured before and after treatment, pairing removes between-subject variability. Power increases dramatically:

Figure: Paired vs. unpaired design, and why pairing reduces noise. With unpaired (independent) groups, between-subject variance dominates the signal; with paired before/after measurements, each subject is its own control and only within-subject noise matters.

| Correlation between pairs | Power improvement |
|---|---|
| 0.3 | ~30% fewer samples needed |
| 0.5 | ~50% fewer samples needed |
| 0.7 | ~70% fewer samples needed |
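These percentages follow from the variance of a difference: Var(A - B) = 2σ²(1 - r), so pairing inflates the effect size on the difference scale by 1/sqrt(2(1 - r)). A stdlib-Python sketch under the normal approximation (function names are illustrative):

```python
# Required sample size: unpaired groups vs. paired measurements with
# correlation r between the members of a pair (normal approximation).
from statistics import NormalDist
import math

Z = NormalDist().inv_cdf

def n_unpaired(d, alpha=0.05, power=0.80):
    """n per group for an unpaired two-sample comparison."""
    return math.ceil(2 * (Z(1 - alpha / 2) + Z(power)) ** 2 / d ** 2)

def n_paired(d, r, alpha=0.05, power=0.80):
    """Number of pairs for a one-sample test on the differences."""
    d_diff = d / math.sqrt(2 * (1 - r))  # effect size on the difference scale
    return math.ceil((Z(1 - alpha / 2) + Z(power)) ** 2 / d_diff ** 2)

for r in (0.3, 0.5, 0.7):
    print(f"r = {r}: unpaired n = {n_unpaired(0.5)}, "
          f"paired n = {n_paired(0.5, r)}")
```

For d = 0.5 this gives roughly 63 per group unpaired versus 44, 32, and 19 pairs at r = 0.3, 0.5, and 0.7, matching the ~30/50/70% reductions in the table.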

RNA-seq Differential Expression

RNA-seq power depends on additional factors:

| Factor | Effect on Power |
|---|---|
| Sequencing depth | More reads → more power for low-expression genes |
| Biological replicates | THE major driver of power |
| Fold change threshold | Larger FC → easier to detect |
| Dispersion | Higher variability → need more samples |

Rules of thumb for RNA-seq DE:

  • Minimum: 3 biological replicates (detects only >4-fold changes)
  • Good: 6-8 replicates (detects 2-fold changes)
  • Ideal: 12+ replicates (detects 1.5-fold changes)
  • Technical replicates have diminishing returns — invest in biological replicates

Common pitfall: Confusing technical replicates (sequencing the same library twice) with biological replicates (independent biological samples). Only biological replicates give you power to generalize. Ten technical replicates of one sample give you n=1, not n=10.

GWAS

| Study Type | Typical n | Detectable Effect |
|---|---|---|
| Candidate gene | 500-1,000 | Large OR (>2.0) |
| Moderate GWAS | 5,000-10,000 | Medium OR (1.3-1.5) |
| Large GWAS | 50,000-500,000 | Small OR (1.05-1.1) |
| UK Biobank scale | 500,000+ | Tiny effects |

Power Curves

Power curves show how power varies with sample size for different effect sizes. They’re the most informative visualization for study planning — you can see the “sweet spot” where adding more samples gives diminishing returns.

Figure: Power curves by effect size (d = 0.3, 0.5, 0.8) as a function of sample size per group, with the 80% power line marked. At 80% power, d = 0.8 needs n ≈ 26, d = 0.5 needs n ≈ 64, and d = 0.3 needs n ≈ 394.

The Cost of Underpowered Studies

An underpowered study is not just a failed study — it’s actively harmful:

  1. Waste: Money, time, and irreplaceable biological samples consumed for inconclusive results
  2. Publication bias: Only “lucky” underpowered studies (that happen to find p < 0.05) get published, inflating reported effect sizes
  3. False negatives: Real treatments or biomarkers get abandoned
  4. Ethical cost: Patients enrolled in clinical trials with no realistic chance of detecting a benefit

Clinical relevance: The FDA and EMA require power analyses for all clinical trial protocols. Journal reviewers increasingly require them for observational studies too. “How many samples do you need?” is the first question of good experimental design.

Experimental Design in BioLang

Basic Power Analysis for t-test

set_seed(42)
# How many samples to detect a 2-fold change in gene expression?

# Parameters
let alpha = 0.05
let power_target = 0.80
let effect_size = 0.8   # Cohen's d for ~2-fold change

# Simulate power at different sample sizes
let sample_sizes = [5, 10, 15, 20, 30, 50, 75, 100]
let n_simulations = 1000

print("=== Power Analysis: Two-Sample t-test ===")
print("Effect size (Cohen's d): {effect_size}")
print("Alpha: {alpha}")
print("")
print("n per group    Estimated Power")

for n in sample_sizes {
    let significant = 0

    for sim in 0..n_simulations {
        # Simulate two groups with known effect
        let group1 = rnorm(n, 0, 1)
        let group2 = rnorm(n, effect_size, 1)

        let result = ttest(group1, group2)
        if result.p_value < alpha {
            significant = significant + 1
        }
    }

    let power = significant / n_simulations
    let marker = if power >= power_target { " <-- sufficient" } else { "" }
    print("{n}            {power |> round(3)}{marker}")
}

Power Curves for Different Effect Sizes

set_seed(42)
# Visualize power as a function of sample size
let sample_sizes = [5, 10, 15, 20, 25, 30, 40, 50, 75, 100]
let effect_sizes = [0.3, 0.5, 0.8, 1.2]
let n_sims = 500

let power_curves = {}

for d in effect_sizes {
    let powers = []

    for n in sample_sizes {
        let sig_count = 0
        for s in 0..n_sims {
            let g1 = rnorm(n, 0, 1)
            let g2 = rnorm(n, d, 1)
            if ttest(g1, g2).p_value < 0.05 {
                sig_count = sig_count + 1
            }
        }
        powers = powers + [sig_count / n_sims]
    }

    power_curves["{d}"] = powers
}

# Plot power curves
let curve_table = table({
    "n": sample_sizes,
    "d_0.3": power_curves["0.3"],
    "d_0.5": power_curves["0.5"],
    "d_0.8": power_curves["0.8"],
    "d_1.2": power_curves["1.2"]
})
plot(curve_table, {type: "line", x: "n",
    title: "Power Curves: Two-Sample t-test",
    x_label: "Sample Size per Group", y_label: "Statistical Power"})

RNA-seq Experiment Design

set_seed(42)
# How many biological replicates for RNA-seq DE?

# Simulate RNA-seq-like data
let fold_changes = [1.5, 2.0, 3.0, 4.0]
let replicates = [3, 5, 8, 12, 20]
let n_sims = 200

print("=== RNA-seq Power by Fold Change and Replicates ===")
print("FC       n=3     n=5     n=8    n=12    n=20")

for fc in fold_changes {
    let powers = []

    for n in replicates {
        let detected = 0

        for sim in 0..n_sims {
            # Simulate one gene with known fold change
            let control = rnorm(n, 10, 2)
            let treatment = rnorm(n, 10 * fc, 2 * fc)

            # Log-transform (as in real RNA-seq analysis)
            let log_ctrl = control |> map(|x| log2(max(x, 0.1)))
            let log_treat = treatment |> map(|x| log2(max(x, 0.1)))

            let p = ttest(log_ctrl, log_treat).p_value
            if p < 0.05 { detected = detected + 1 }
        }

        powers = powers + [detected / n_sims]
    }

    print("{fc}     " ++ powers |> map(|p| "{(p * 100) |> round(0)}%") |> join("   "))
}

# Key takeaway: n=3 barely detects 4-fold changes;
# n=8 reliably detects 2-fold changes

Paired vs. Unpaired Design

set_seed(42)
# Show the power advantage of paired designs
let n_sims = 1000
let n = 20
let effect = 0.5  # medium effect
let subject_sd = 2.0  # between-subject variability
let within_sd = 0.5    # within-subject variability

let power_unpaired = 0
let power_paired = 0

for sim in 0..n_sims {
    # Unpaired: independent groups
    let group1 = rnorm(n, 0, subject_sd)
    let group2 = rnorm(n, effect, subject_sd)
    if ttest(group1, group2).p_value < 0.05 {
        power_unpaired = power_unpaired + 1
    }

    # Paired: same subjects, before and after
    let baseline = rnorm(n, 0, subject_sd)
    let after = baseline + effect + rnorm(n, 0, within_sd)
    let diff = after - baseline
    if ttest_one(diff, 0).p_value < 0.05 {
        power_paired = power_paired + 1
    }
}

print("=== Paired vs Unpaired Design (n={n}, d={effect}) ===")
print("Unpaired power: {(power_unpaired / n_sims * 100) |> round(1)}%")
print("Paired power:   {(power_paired / n_sims * 100) |> round(1)}%")
print("Pairing advantage: {((power_paired - power_unpaired) / n_sims * 100) |> round(1)} percentage points")

Multi-Group Design (ANOVA)

set_seed(42)
# Power for detecting differences among 4 treatment groups
let n_sims = 500
let k = 4  # number of groups
let group_means = [0, 0.3, 0.6, 0.9]  # increasing effect
let sample_sizes = [5, 10, 15, 20, 30]

print("=== ANOVA Power (k={k} groups) ===")
for n in sample_sizes {
    let sig = 0

    for sim in 0..n_sims {
        let groups = []
        for i in 0..k {
            groups = groups + [rnorm(n, group_means[i], 1)]
        }

        let result = anova(groups)
        if result.p_value < 0.05 { sig = sig + 1 }
    }

    print("n = {n} per group: power = {(sig / n_sims * 100) |> round(1)}%")
}

Sample Size Recommendation Report

set_seed(42)
# Generate a complete sample size recommendation
let scenarios = [
    { name: "Conservative (d=0.5)", effect: 0.5 },
    { name: "Expected (d=0.8)", effect: 0.8 },
    { name: "Optimistic (d=1.2)", effect: 1.2 }
]

print("=== SAMPLE SIZE RECOMMENDATION REPORT ===")
print("Two-group comparison, alpha=0.05, power=80%")
print("")

for s in scenarios {
    # Find minimum n for 80% power via simulation
    let required_n = 0
    for n in 5..200 {
        let power = 0
        for sim in 0..500 {
            let g1 = rnorm(n, 0, 1)
            let g2 = rnorm(n, s.effect, 1)
            if ttest(g1, g2).p_value < 0.05 { power = power + 1 }
        }
        if power / 500 >= 0.80 {
            required_n = n
            break
        }
    }

    print("{s.name}: n = {required_n}/group")
}

print("")
print("Recommendation: Plan for the CONSERVATIVE")
print("estimate + 10-20% for dropout/QC failures")

Python:

from statsmodels.stats.power import TTestIndPower

# Power analysis for two-sample t-test
analysis = TTestIndPower()

# Required sample size
n = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.80)
print(f"Required n per group: {n:.0f}")

# Power curve
import numpy as np
import matplotlib.pyplot as plt
fig = analysis.plot_power(
    dep_var='nobs', nobs=np.arange(5, 100),
    effect_size=np.array([0.3, 0.5, 0.8, 1.2]))

# Simulation-based power
import numpy as np
from scipy.stats import ttest_ind

def simulate_power(n, d, n_sim=1000):
    sig = sum(ttest_ind(np.random.normal(0, 1, n),
                        np.random.normal(d, 1, n)).pvalue < 0.05
              for _ in range(n_sim))
    return sig / n_sim

R:

# Power analysis for two-sample t-test (base R uses delta and sd, not d)
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.80)

# Same analysis with the pwr package (takes Cohen's d directly)
library(pwr)
pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.80)

# RNA-seq specific power (depth = per-gene coverage, not total read count)
library(RNASeqPower)
rnapower(depth = 20, cv = 0.4, effect = 2,
         alpha = 0.05, power = 0.8)

# Simulation-based
library(simr)
# simr provides power simulation for mixed models

Exercises

Exercise 1: Power for Your Study

You’re planning a study comparing tumor mutation burden between immunotherapy responders and non-responders. Pilot data suggests d ≈ 0.6 with SD = 5 mutations/Mb. How many patients per group do you need for 80% power?


# 1. Simulate power at n = 10, 20, 30, 50, 75, 100
# 2. Find the minimum n for 80% power
# 3. Add 15% for anticipated dropout
# 4. Create a power curve plot

Exercise 2: Paired vs. Unpaired

A study can either use 30 independent samples per group OR 30 paired before/after measurements. The between-subject SD is 3x the within-subject SD. Compare the power of both designs.


# 1. Simulate paired and unpaired designs with n=30
# 2. Effect size d = 0.5
# 3. Between-subject SD = 3, within-subject SD = 1
# 4. Which design achieves higher power?
# 5. How many unpaired samples would match the paired design's power?

Exercise 3: RNA-seq Planning

You’re designing an RNA-seq experiment to identify genes with at least 1.5-fold change between two conditions. Your budget allows either 6 samples at 30M reads each or 12 samples at 15M reads each. Which design has more power?


# Simulate both scenarios
# Track how many "true DE genes" each design detects
# Which is better: more depth or more replicates?

Exercise 4: The Underpowered Literature

Simulate 100 “studies” with n=10 per group and a true small effect (d=0.3). Show that:

  1. Most studies (>80%) fail to detect the effect
  2. The “significant” studies dramatically overestimate the effect size
  3. This creates a biased picture in the published literature

# 1. Run 100 simulated two-sample t-tests (n=10, d=0.3)
# 2. Count how many achieve p < 0.05
# 3. For the significant ones, compute Cohen's d from the data
# 4. Compare the average "published" d to the true d = 0.3
# 5. This is the "winner's curse" — published effects are inflated

Key Takeaways

  • Power analysis determines how many samples you need BEFORE starting an experiment — it’s not optional, it’s essential
  • The four pillars are α (false positive rate), power (1-β, false negative rate), effect size, and sample size — fix three, solve for the fourth
  • Underpowered studies waste resources, inflate published effect sizes, and can falsely suggest an effect doesn’t exist
  • Biological replicates drive power in genomics — technical replicates give diminishing returns
  • For RNA-seq: n=3 is barely adequate (detects >4-fold), n=8 is good (2-fold), n=12+ is ideal (1.5-fold)
  • Paired designs dramatically increase power by removing between-subject variability
  • The winner’s curse: underpowered studies that happen to be significant overestimate the true effect
  • Always use conservative effect size estimates and add buffer for dropout/QC failures
  • Power curves visualize the sample size / power trade-off and help identify the sweet spot

What’s Next

Statistical significance tells you whether an effect is real, but not whether it matters. Day 19 introduces effect sizes — Cohen’s d, odds ratios, relative risk — and the critical distinction between statistical significance and practical importance.