
Day 19: Effect Sizes — Beyond p-Values

The Problem

Two papers land on Dr. Rachel Nguyen’s desk the same morning.

Paper A: “GENE_X is significantly differentially expressed between tumor and normal tissue (p = 0.00001, n = 5,000).” She looks at the supplementary data: the mean difference is 0.02 FPKM on a scale where genes range from 0 to 50,000 FPKM. The fold change is 1.001.

Paper B: “DRUG_Y showed a non-significant trend toward tumor reduction (p = 0.08, n = 24).” She looks at the data: the median tumor volume shrank by 40% in the treatment arm.

Paper A is “highly significant” but biologically meaningless — the difference is lost in measurement noise. Paper B fails the significance threshold but describes a potentially life-changing clinical effect that just needs more patients.

The p-value alone tells you almost nothing. You need effect sizes.

[Figure: P-value vs. effect size — they measure different things. Scatter of Cohen's d against −log10(p) showing three regions: significant but trivial (large n inflates p, e.g. n = 5,000), significant AND meaningful, and large effect but not significant (needs more samples, e.g. n = 12).]

The Tyranny of p-Values

In 2016, the American Statistical Association took the extraordinary step of issuing a formal statement on p-values — the first time in its 177-year history. Key points:

  1. P-values do NOT measure the probability that the hypothesis is true
  2. P-values do NOT measure the size or importance of an effect
  3. Scientific conclusions should NOT be based only on whether p < 0.05
  4. A p-value near 0.05 provides only weak evidence against the null

Key insight: A p-value is a function of effect size AND sample size. With enough data, any trivial difference becomes “significant.” With too few data, any real effect becomes “non-significant.” The p-value alone is fundamentally incomplete.

The Problem of Large n

True Effect Size     n per group   p-value   Significant?   Meaningful?
d = 0.01 (trivial)   50,000        0.02      Yes            No
d = 0.8 (large)      5             0.12      No             Yes
d = 0.5 (medium)     64            0.04      Yes            Likely
d = 0.3 (small)      30            0.15      No             Maybe
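The pattern in this table follows directly from the t-statistic: with equal group sizes, t ≈ d·√(n/2), so the p-value is driven by sample size as much as by the effect. A minimal Python sketch of that relationship (the exact p-values will differ slightly from the table's simulated values):

```python
# Same test, same alpha — the verdict flips purely because of n.
# For equal group sizes, t = d * sqrt(n/2) with df = 2n - 2.
import math
from scipy import stats

def p_from_d(d, n_per_group):
    """Two-sided p-value implied by Cohen's d and the per-group sample size."""
    t = d * math.sqrt(n_per_group / 2)
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t), df)

print(p_from_d(0.05, 50_000))  # trivial effect, huge n -> highly "significant"
print(p_from_d(0.8, 5))        # large effect, tiny n  -> p > 0.05
```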

What Are Effect Sizes?

An effect size quantifies how large a difference or association is, independent of sample size. There are two families:

Standardized Effect Sizes (unit-free)

Metric               Used For                        Scale
Cohen’s d            Two-group mean difference       0.2 small, 0.5 medium, 0.8 large
Odds Ratio (OR)      Binary outcome association      1.0 = no effect
Relative Risk (RR)   Binary outcome risk             1.0 = no effect
Cramér’s V           Categorical association         0 to 1
Eta-squared (η²)     ANOVA variance explained        0 to 1
R² (R-squared)       Regression variance explained   0 to 1

Unstandardized Effect Sizes (original units)

Metric                       Example
Mean difference              “Treatment reduces tumor volume by 2.3 cm³”
Regression slope             “Each year of age increases risk by 3%”
Median survival difference   “Treatment arm survived 6 months longer”

Key insight: Unstandardized effect sizes are often MORE useful than standardized ones because they’re in meaningful units. “The drug reduced blood pressure by 8 mmHg” is more interpretable than “Cohen’s d = 0.5.”

Cohen’s d: Standardized Mean Difference

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}$$

where the pooled standard deviation is:

$$s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$
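These two formulas translate directly into Python (NumPy, with ddof=1 for sample variances); a small worked example with both sample SDs equal to 2 and a mean difference of 1 gives d = 0.5 exactly:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled SD defined above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Means 12 vs. 11, both sample SDs = 2 -> d = 0.5
print(cohens_d([10, 12, 14], [9, 11, 13]))  # 0.5
```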

Cohen’s benchmarks (widely used, sometimes criticized as arbitrary):

d     Label        What It Means
0.2   Small        Groups overlap ~85%
0.5   Medium       Groups overlap ~67%
0.8   Large        Groups overlap ~53%
1.2   Very large   Groups overlap ~40%
2.0   Huge         Groups barely overlap

Common pitfall: Cohen himself warned that “small/medium/large” are context-dependent. In pharmacogenomics, d = 0.3 might be clinically important. In quality control, d = 2.0 might be necessary. Always judge effect sizes in your domain context.

[Figure: Cohen's d — how much do the control and treatment distributions overlap? d = 0.2 (small effect) gives ~85% overlap; d = 0.8 (large effect) gives ~53% overlap. Scale: 0.0–0.2 negligible, 0.2 small, 0.5 medium, 0.8–1.2+ large.]

Odds Ratio and Relative Risk

For binary outcomes (response/no response, mutation/wild-type), two key measures:

Odds Ratio (OR)

$$OR = \frac{a \cdot d}{b \cdot c}$$

From a 2x2 table:

            Outcome +   Outcome -
Exposed     a           b
Unexposed   c           d

Relative Risk (RR)

$$RR = \frac{a/(a+b)}{c/(c+d)}$$
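Both formulas applied to a hypothetical 2x2 table (35/15 in the exposed row, 20/30 in the unexposed row — the same counts used in the BioLang example later) look like this in Python:

```python
# OR and RR from the 2x2 cell counts a, b, c, d (labels as in the table above)
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

a, b, c, d = 35, 15, 20, 30  # hypothetical counts
print(odds_ratio(a, b, c, d))     # 3.5
print(relative_risk(a, b, c, d))  # 1.75
```

Note how far apart OR (3.5) and RR (1.75) are here: the outcome is common (55% overall), so the rare-disease approximation does not apply.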

OR vs. RR: Why They Differ

Baseline Risk   OR    RR    Interpretation
1% (rare)       2.0   2.0   Nearly identical (rare disease approximation)
10%             2.0   1.8   Starting to diverge
30%             2.0   1.5   Substantial difference
50%             2.0   1.3   OR greatly overstates RR
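The rows of this table follow from a standard conversion: for a fixed OR and baseline risk p₀, the implied relative risk is RR = OR / (1 − p₀ + p₀·OR). A sketch reproducing the divergence:

```python
def rr_from_or(or_value, baseline_risk):
    """Relative risk implied by a fixed odds ratio at a given baseline risk."""
    return or_value / (1 - baseline_risk + baseline_risk * or_value)

for p0 in (0.01, 0.10, 0.30, 0.50):
    print(f"baseline {p0:.0%}: OR = 2.0 -> RR = {rr_from_or(2.0, p0):.2f}")
```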

Clinical relevance: Case-control studies can only estimate OR (not RR). Cohort studies and RCTs can estimate both. When reporting to patients, RR or absolute risk difference is more intuitive: “Your risk goes from 10% to 15%” is clearer than “OR = 1.6.”

Cramér’s V: Categorical Associations

For larger contingency tables (beyond 2x2), Cramér’s V measures association strength:

$$V = \sqrt{\frac{\chi^2}{n \cdot (k-1)}}$$

where k = min(rows, columns).

V     Interpretation
0.0   No association
0.1   Weak
0.3   Moderate
0.5   Strong
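Applied to the 4x3 subtype-by-response table used later in this chapter (SciPy's chi2_contingency computes the expected counts under independence for you):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [30, 15, 5],   # Luminal A
    [20, 20, 10],  # Luminal B
    [10, 15, 25],  # HER2+
    [5, 10, 35],   # Triple-negative
])

chi2, p, dof, expected = chi2_contingency(observed)
n = observed.sum()
k = min(observed.shape)                # k = min(rows, cols) = 3
v = np.sqrt(chi2 / (n * (k - 1)))
print(f"chi2 = {chi2:.2f}, V = {v:.3f}")  # chi2 = 56.36, V = 0.375
```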

Eta-squared: ANOVA Effect Size

For ANOVA (comparing means across multiple groups):

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

η²     Interpretation
0.01   Small
0.06   Medium
0.14   Large

Eta-squared tells you what fraction of total variance is explained by group membership.
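Since η² is just a ratio of sums of squares, it is easy to compute directly. A small worked example with hypothetical data chosen so the arithmetic is exact (SS_between = 14, SS_total = 20, so η² = 0.7):

```python
import numpy as np

def eta_squared(groups):
    """SS_between / SS_total for a one-way layout."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_total = ((all_vals - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

groups = [np.array([1., 2., 3.]), np.array([2., 3., 4.]), np.array([4., 5., 6.])]
print(eta_squared(groups))  # 0.7 -> 70% of variance explained by group
```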

Forest Plots: The Standard Display

Forest plots are the gold standard for displaying effect sizes across multiple comparisons. Each row shows:

  • A point estimate (the effect size)
  • A horizontal line (the confidence interval)
  • A vertical reference line (null effect: 0 for differences, 1 for ratios)

They’re essential for:

  • Meta-analyses (combining studies)
  • Multi-gene comparisons (e.g., DE genes ranked by effect size)
  • Subgroup analyses (effect by age, sex, stage)
  • Cox regression hazard ratios
[Figure: Anatomy of a forest plot. Each row shows a gene's point estimate with its 95% CI against a vertical no-effect line at d = 0, e.g. BRCA1 0.65 [0.20, 1.10], TP53 0.82 [0.45, 1.19], EGFR 0.15 [-0.30, 0.60], KRAS 0.45 [0.05, 0.85], MYC -0.10 [-0.55, 0.35], pooled 0.40 [0.20, 0.60]. Effects left of the line favor control; right of the line favor treatment.]
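The horizontal CI in each row is typically the point estimate ± 1.96 × SE. A common large-sample approximation for the standard error of Cohen's d is √((n₁+n₂)/(n₁n₂) + d²/(2(n₁+n₂))) — the same formula the simulation later in this chapter uses. A sketch:

```python
import math

def d_ci(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d (large-sample normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

lo, hi = d_ci(0.5, 40, 40)
print(f"d = 0.50, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [0.05, 0.95]
```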

The Reporting Checklist

Every statistical result should include all four pieces:

Element               What It Answers       Example
Effect size           How large?            d = 0.72
Confidence interval   How precise?          95% CI [0.35, 1.09]
p-value               Could it be chance?   p = 0.003
Sample size           How much data?        n = 45 per group

Omitting any one of these leaves the reader unable to fully evaluate the finding.

Effect Sizes in BioLang

Cohen’s d for Gene Expression

set_seed(42)
# Compare expression of a gene between tumor and normal
let n = 50

let tumor = rnorm(n, 12.5, 3.0)
let normal_expr = rnorm(n, 10.0, 2.8)

# Cohen's d — compute inline (with equal group sizes, the pooled
# variance reduces to the simple average of the two group variances)
let pooled_sd = sqrt((variance(tumor) + variance(normal_expr)) / 2.0)
let d = (mean(tumor) - mean(normal_expr)) / pooled_sd
print("=== Cohen's d ===")
print("d = {d |> round(3)}")

let interpretation = if abs(d) >= 0.8 { "large" }
    else if abs(d) >= 0.5 { "medium" }
    else if abs(d) >= 0.2 { "small" }
    else { "negligible" }

print("Interpretation: {interpretation} effect")
print("")

# Also report the raw difference
let mean_diff = mean(tumor) - mean(normal_expr)
print("Mean difference: {mean_diff |> round(2)} FPKM")
print("This means tumor expression is ~{(mean_diff / mean(normal_expr) * 100) |> round(0)}% higher")

# Complete report: effect + CI + p + n
let t = ttest(tumor, normal_expr)
print("\n=== Complete Report ===")
print("Cohen's d = {d |> round(2)}")
print("p = {t.p_value |> round(4)}, n = {n} per group")

Odds Ratio and Relative Risk

# Immunotherapy response by PD-L1 status

# 2x2 table:
#              Respond  Non-respond
# PD-L1 high     35        15        (70% response)
# PD-L1 low      20        30        (40% response)

let a = 35  # PD-L1 high + respond
let b = 15  # PD-L1 high + non-respond
let c = 20  # PD-L1 low + respond
let d_val = 30  # PD-L1 low + non-respond

# Odds ratio — compute inline
let or_val = (a * d_val) / (b * c)
print("=== Odds Ratio ===")
print("OR = {or_val |> round(2)}")

# Relative risk — compute inline
let risk_high = a / (a + b)
let risk_low = c / (c + d_val)
let rr_val = risk_high / risk_low
print("\n=== Relative Risk ===")
print("RR = {rr_val |> round(2)}")

# Absolute risk difference
let ard = risk_high - risk_low
print("\n=== Absolute Risk Difference ===")
print("Risk (PD-L1 high): {(risk_high * 100) |> round(1)}%")
print("Risk (PD-L1 low):  {(risk_low * 100) |> round(1)}%")
print("Absolute difference: {(ard * 100) |> round(1)} percentage points")
print("NNT: {(1 / ard) |> round(1)} (treat this many to get 1 extra responder)")

# Fisher's exact test for significance
let fe = fisher_exact(a, b, c, d_val)
print("Fisher's exact p-value: {fe.p_value |> round(4)}")

# Note: OR overstates the relative risk when the outcome is common

Cramér’s V for Categorical Data

# Association between tumor subtype (4 types) and treatment response (3 levels)
# observed counts in a 4x3 contingency table
let observed = [
    [30, 15, 5],   # Luminal A
    [20, 20, 10],  # Luminal B
    [10, 15, 25],  # HER2+
    [5, 10, 35]    # Triple-neg
]

# Flatten for chi-square test; expected counts under independence
# are row_total * col_total / grand_total for each cell
let obs_flat = [30, 15, 5, 20, 20, 10, 10, 15, 25, 5, 10, 35]
let row_totals = [50, 50, 50, 50]
let col_totals = [65, 60, 75]
let total = obs_flat |> sum

let exp_flat = []
for i in 0..4 {
    for j in 0..3 {
        exp_flat = exp_flat + [row_totals[i] * col_totals[j] / total]
    }
}
let chi2 = chi_square(obs_flat, exp_flat)

# Compute Cramer's V inline
let n_obs = total
let k = 3  # min(rows, cols) = min(4, 3)
let v = sqrt(chi2.statistic / (n_obs * (k - 1)))

print("=== Cramer's V ===")
print("Chi-square: {chi2.statistic |> round(2)}")
print("p-value: {chi2.p_value |> round(4)}")
print("Cramer's V: {v |> round(3)}")
print("Interpretation: {if v > 0.3 { "moderate to strong" } else { "weak" }} association")

Eta-squared for ANOVA

set_seed(42)
# Expression differences across 4 cancer subtypes
let subtype_a = rnorm(30, 10, 3)
let subtype_b = rnorm(30, 12, 3)
let subtype_c = rnorm(30, 11, 3)
let subtype_d = rnorm(30, 15, 3)

let aov = anova([subtype_a, subtype_b, subtype_c, subtype_d])

# Compute eta-squared inline from ANOVA output
# eta2 = SS_between / SS_total
let eta2 = aov.ss_between / aov.ss_total

print("=== Eta-squared (ANOVA Effect Size) ===")
print("F = {aov.f_statistic |> round(2)}, p = {aov.p_value |> round(4)}")
print("eta2 = {eta2 |> round(3)}")
print("{(eta2 * 100) |> round(1)}% of expression variance is explained by subtype")

let eta_interp = if eta2 >= 0.14 { "large" }
    else if eta2 >= 0.06 { "medium" }
    else { "small" }
print("Interpretation: {eta_interp} effect")

Forest Plot for Multiple Genes

set_seed(42)
# Cohen's d for 10 differentially expressed genes
let gene_names = ["BRCA1", "TP53", "EGFR", "KRAS", "MYC",
                  "PTEN", "PIK3CA", "BRAF", "CDH1", "RB1"]
let effect_sizes = []
let ci_lowers = []
let ci_uppers = []

for i in 0..10 {
    let true_effect = rnorm(1, 0.5, 0.4)[0]
    let tumor_g = rnorm(40, 10 + true_effect, 2)
    let normal_g = rnorm(40, 10, 2)

    # Compute Cohen's d inline
    let pooled = sqrt((variance(tumor_g) + variance(normal_g)) / 2.0)
    let d = (mean(tumor_g) - mean(normal_g)) / pooled
    let se_d = sqrt(2 / 40 + d ** 2 / (4 * 40))

    effect_sizes = effect_sizes + [d]
    ci_lowers = ci_lowers + [d - 1.96 * se_d]
    ci_uppers = ci_uppers + [d + 1.96 * se_d]
}

# Forest plot — the standard effect size visualization
let forest_data = table({
    "gene": gene_names,
    "effect": effect_sizes,
    "ci_lower": ci_lowers,
    "ci_upper": ci_uppers
})
forest_plot(forest_data)

Contrasting “Significant Tiny” vs. “Non-Significant Large”

set_seed(42)
# The key lesson: significance does not equal importance

# Scenario 1: huge n, tiny effect
let n_large = 5000
let group1_large = rnorm(n_large, 10.000, 2)
let group2_large = rnorm(n_large, 10.020, 2)

let t_large = ttest(group1_large, group2_large)
let pooled_lg = sqrt((variance(group1_large) + variance(group2_large)) / 2.0)
let d_large = (mean(group1_large) - mean(group2_large)) / pooled_lg

print("=== Scenario 1: Large n, Tiny Effect ===")
print("n = {n_large} per group")
print("Mean difference: {(mean(group1_large) - mean(group2_large)) |> abs |> round(4)}")
print("Cohen's d: {d_large |> round(4)}")
print("p-value: {t_large.p_value |> round(6)}")
print("Significant? {if t_large.p_value < 0.05 { "YES" } else { "NO" }}")
print("Biologically meaningful? VERY UNLIKELY (d ~ 0.01)")

# Scenario 2: small n, large effect
let n_small = 12
let group1_small = rnorm(n_small, 10, 3)
let group2_small = rnorm(n_small, 13, 3)

let t_small = ttest(group1_small, group2_small)
let pooled_sm = sqrt((variance(group1_small) + variance(group2_small)) / 2.0)
let d_small = (mean(group1_small) - mean(group2_small)) / pooled_sm

print("\n=== Scenario 2: Small n, Large Effect ===")
print("n = {n_small} per group")
print("Mean difference: {(mean(group1_small) - mean(group2_small)) |> abs |> round(2)}")
print("Cohen's d: {d_small |> round(2)}")
print("p-value: {t_small.p_value |> round(4)}")
print("Significant? {if t_small.p_value < 0.05 { "YES" } else { "NO" }}")
print("Biologically meaningful? LIKELY (d ~ 1.0) — needs more samples")

print("\n=== Lesson ===")
print("Scenario 1 is 'significant' but meaningless")
print("Scenario 2 is 'non-significant' but potentially important")
print("ALWAYS report effect sizes alongside p-values")

Python:

import numpy as np

# Cohen's d
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx-1)*np.var(x, ddof=1) + (ny-1)*np.var(y, ddof=1)) / (nx+ny-2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Odds ratio (from statsmodels)
from statsmodels.stats.contingency_tables import Table2x2
table = np.array([[35, 15], [20, 30]])
t = Table2x2(table)
print(f"OR = {t.oddsratio:.2f}, 95% CI: {t.oddsratio_confint()}")
print(f"RR = {t.riskratio:.2f}")

# Cramér's V
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Forest plot (using matplotlib); effect_sizes, gene_names, and
# ci_widths come from your own per-gene analysis
import matplotlib.pyplot as plt
plt.errorbar(effect_sizes, range(len(gene_names)), xerr=ci_widths, fmt='o')
plt.axvline(0, color='black', linestyle='--')

R:

# Cohen's d
library(effectsize)
cohens_d(tumor ~ group)

# Odds ratio
library(epitools)
oddsratio(table)
riskratio(table)

# Cramér's V
library(rcompanion)
cramerV(table)

# Eta-squared
library(effectsize)
eta_squared(aov_result)

# Forest plot
library(forestplot)
forestplot(labeltext = gene_names,
           mean = effect_sizes,
           lower = ci_lower,
           upper = ci_upper)

Exercises

Exercise 1: Cohen’s d Across Genes

Compute Cohen’s d for 20 genes comparing tumor vs. normal. Rank them by effect size and create a forest plot. Which genes have the largest biological impact (not just the smallest p-value)?

let n = 30

# Generate 20 genes with varying true effects
# Some have large effects, some small, some zero
# 1. Compute Cohen's d inline for each gene
# 2. Compute p-value via ttest() for each gene
# 3. Create a forest_plot() sorted by effect size
# 4. Find a gene that is significant (p < 0.05) but has d < 0.2
# 5. Find a gene that is non-significant but has d > 0.5

Exercise 2: OR vs. RR

Create three 2x2 tables where the outcome prevalence is 5%, 30%, and 60%. Set the true relative risk to 2.0 in all three. Show that OR ≈ RR when prevalence is low but OR >> RR when prevalence is high.

# Table 1: rare outcome (5% baseline)
# Table 2: moderate outcome (30% baseline)
# Table 3: common outcome (60% baseline)

# For each: compute OR = (a*d)/(b*c) and RR = (a/(a+b))/(c/(c+d))
# Show the divergence as prevalence increases
# Which metric should you report to patients?

Exercise 3: Complete Reporting

Given the following analysis, write a complete results paragraph with effect size, CI, p-value, and sample size. Then write it again omitting the effect size — notice how much information is lost.

set_seed(42)
let treatment = rnorm(45, 15, 4)
let control = rnorm(45, 12, 4)

# Compute: ttest(), Cohen's d inline, mean difference
# Write: "Treatment group showed significantly higher X
# (mean = __, SD = __) compared to control (mean = __, SD = __),
# with a [small/medium/large] effect (d = __),
# p = __."

Exercise 4: The Winner’s Curse Revisited

Run 100 simulations of an underpowered study (n=10, true d=0.3). For the significant results only, compute the observed d. Show that “published” effect sizes are inflated.

set_seed(42)
let true_d = 0.3
let n = 10
let n_sims = 100

# 1. Simulate 100 ttest() calls
# 2. Record which are significant (p < 0.05)
# 3. For significant studies, compute observed Cohen's d inline
# 4. Compare mean "published d" to true d = 0.3
# 5. Plot histogram of published d values

Exercise 5: Forest Plot for a Meta-Analysis

Combine results from 8 “studies” of the same drug effect (true d = 0.5) with varying sample sizes (n = 10 to 200). Show that larger studies give more precise estimates (narrower CIs) and cluster closer to the true effect.

set_seed(42)
let study_sizes = [10, 15, 25, 40, 60, 100, 150, 200]

# 1. Simulate each study
# 2. Compute Cohen's d and 95% CI inline for each
# 3. Create a forest_plot()
# 4. Which studies have CIs that include the true d = 0.5?
# 5. Do larger studies provide better estimates?

Key Takeaways

  • P-values conflate effect size with sample size — a tiny effect can be “significant” with enough data, and a large effect can be “non-significant” with too few data
  • Always report effect sizes with confidence intervals alongside p-values: this is the modern reporting standard
  • Cohen’s d quantifies standardized mean differences: 0.2 small, 0.5 medium, 0.8 large (but context matters)
  • Odds ratio ≠ relative risk when the outcome is common (>10% prevalence) — OR overstates RR
  • Cramér’s V measures categorical association strength; eta-squared measures variance explained in ANOVA
  • Forest plots are the standard visualization for effect sizes across multiple comparisons or studies
  • The winner’s curse: underpowered studies that reach significance overestimate the true effect
  • A complete result reports effect size + confidence interval + p-value + sample size — omitting any element is incomplete reporting

What’s Next

We’ve learned to quantify and test effects. But what happens when technical artifacts overwhelm biology? Day 20 tackles the critical problem of batch effects and confounders — when your PCA reveals that the dominant signal in your data is the lab that processed the sample, not the biology you’re studying.