Day 20: Batch Effects and Confounders

The Problem

Dr. David Liu is leading a multi-center study comparing gene expression in 200 breast tumors versus 100 normal breast tissues. Three hospitals contributed samples: Memorial (80 tumors, 40 normals), Hopkins (70 tumors, 30 normals), and Mayo (50 tumors, 30 normals). He runs PCA on the full expression matrix, expecting to see tumor and normal separate.

Instead, the PCA plot shows three tight clusters — one per hospital. The first principal component explains 35% of variance and perfectly separates the centers. Tumor vs. normal? It’s buried in PC4 at 4% of variance.

The dominant signal in his data is which hospital processed the sample, not the biology he’s studying. These are batch effects, and they’re one of the most insidious problems in genomics.

What Are Batch Effects?

Batch effects are systematic technical differences between groups of samples that were processed differently. They have nothing to do with biology but can completely dominate a dataset.

Source	Mechanism	Example
Processing date	Reagent lots, temperature, humidity	Monday vs. Friday RNA extractions
Technician	Handling differences	Technician A vs. B
Center/site	Different protocols, equipment	Hospital A vs. B
Sequencing lane	Flow cell chemistry, loading density	Lane 1 vs. Lane 2
Plate position	Edge effects, evaporation	Well A1 vs. H12
Reagent lot	Batch-to-batch kit variation	Kit lot 2024A vs. 2024B
Storage time	RNA degradation over time	Samples banked in 2020 vs. 2024

Key insight: Batch effects are not random noise — they are systematic biases that affect thousands of genes simultaneously. They shift entire samples in the same direction, which is why PCA detects them so readily.

How Large Are Batch Effects?

In a landmark 2010 study, Leek et al. analyzed publicly available microarray data and found:

Batch effects were present in virtually all high-throughput datasets
They often explained more variance than the biological signal of interest
They affected thousands of genes per batch, not just a handful

In RNA-seq, batch effects are typically smaller than in microarrays but still substantial — often explaining 10-30% of total variance.

Identifying Batch Effects

1. PCA Visualization

The most powerful diagnostic. Color samples by batch variable and biological variable. If batch dominates PC1/PC2, you have a problem.

2. Correlation Heatmap

Compute sample-to-sample correlation. If samples cluster by batch rather than biology, batch effects are present.

3. Box Plots by Batch

Plot the distribution of expression values (or a summary statistic) stratified by batch. Systematic shifts indicate batch effects.

4. ANOVA for Batch

For each gene, test whether expression differs significantly by batch. If hundreds or thousands of genes show batch effects, the problem is pervasive.

Confounders: A Deeper Problem

A confounder is a variable associated with BOTH the predictor and the outcome, creating a spurious association (or masking a real one).

Scenario	Confounder	Danger
Gene expression differs by sex	Tumor subtype differs by sex	Sex drives both
Drug response varies by ethnicity	Genetic variant frequency varies by ethnicity	Population stratification
Survival differs by TP53 status	Stage differs by TP53 status	Stage confounds TP53 effect
Expression differs by treatment	Samples processed on different days	Processing day = treatment

Simpson’s Paradox

The most dramatic form of confounding. A trend that appears in aggregate reverses when groups are separated:

Hospital	Drug A Survival	Drug B Survival	Better Drug
Hospital 1 (mild cases)	95%	90%	Drug A
Hospital 2 (severe cases)	40%	30%	Drug A
Combined	55%	70%	Drug B??

Drug A is better at BOTH hospitals, but Drug B appears better overall because Hospital 2 (severe cases, low survival) preferentially used Drug A.

Clinical relevance: Simpson’s paradox has real consequences. In the 1970s, Berkeley was accused of sex discrimination in graduate admissions. Overall, women had lower acceptance rates. But department by department, women were accepted at equal or higher rates — they had applied more often to competitive departments with low overall acceptance rates.

The Confounded Design Trap

The most dangerous scenario: when batch perfectly correlates with biology.

Sample	Center	Condition
1-50	Center A	All Tumor
51-100	Center B	All Normal

Here, center and condition are completely confounded. Every difference between tumor and normal could equally be a difference between Center A and Center B. No statistical method can separate them. The experiment is fatally flawed.

Common pitfall: This happens more often than you’d think. A collaborator sends tumor samples from one hospital and normal samples from another. Or samples are processed tumor on Monday, normal on Tuesday. The solution is balanced design — ensure batch variables are distributed across biological groups.

Strategies for Handling Batch Effects

1. Prevention (Best Option)

Strategy	How
Balanced design	Distribute conditions across batches equally
Randomization	Random assignment to processing order/position
Blocking	Process one sample from each condition per batch
Standard protocols	Minimize technical variation with SOPs
Single batch	Process everything together (often impractical)

2. Detection

Method	What It Reveals
PCA colored by batch	Visual clustering by technical variable
Correlation heatmap	Sample similarity by batch
Box plots by batch	Distribution shifts
ANOVA per gene	Number of genes affected

3. Correction

Method	Approach	Caveat
Include as covariate	Add batch to regression/DE model	Requires balanced design
ComBat (parametric)	Empirical Bayes batch correction	Assumes batch is known
ComBat-seq	ComBat for raw RNA-seq counts	Preserves count nature
SVA (surrogate variables)	Discovers unknown batch effects	May remove biology
RUVseq	Uses control genes	Needs negative controls

Common pitfall: Do NOT use batch correction methods when batch is perfectly confounded with biology. They will remove both the batch effect AND the biological signal. If all tumors were processed in batch 1 and all normals in batch 2, no correction method can save you — the experiment must be redesigned.

Batch Effects in BioLang

Simulating Batch-Contaminated Data

set_seed(42)
# Simulate a multi-center gene expression study
let n_samples = 300
let n_genes = 50

# Assign samples to centers and conditions
let center = []
let condition = []
for i in 0..n_samples {
    if i < 100 {
        center = center + ["Memorial"]
        condition = condition + [if i < 60 { "Tumor" } else { "Normal" }]
    } else if i < 200 {
        center = center + ["Hopkins"]
        condition = condition + [if i < 170 { "Tumor" } else { "Normal" }]
    } else {
        center = center + ["Mayo"]
        condition = condition + [if i < 250 { "Tumor" } else { "Normal" }]
    }
}

# Generate expression matrix with batch effects
let expr_matrix = []
let batch_effects = {
    "Memorial": rnorm(n_genes, 2.0, 0.5),
    "Hopkins": rnorm(n_genes, -1.5, 0.5),
    "Mayo": rnorm(n_genes, 0.5, 0.5)
}

# True biological signal (tumor vs normal)
let bio_signal = rnorm(n_genes, 0, 0.3)
# Only 10 genes are truly DE
for g in 0..10 {
    bio_signal[g] = rnorm(1, 1.5, 0.3)[0]
}

for i in 0..n_samples {
    let sample_expr = []
    for g in 0..n_genes {
        let base = 10 + rnorm(1, 0, 1)[0]
        let batch = batch_effects[center[i]][g]
        let bio = if condition[i] == "Tumor" { bio_signal[g] } else { 0 }
        sample_expr = sample_expr + [base + batch + bio]
    }
    expr_matrix = expr_matrix + [sample_expr]
}

print("Expression matrix: {n_samples} samples x {n_genes} genes")
print("Centers: Memorial=100, Hopkins=100, Mayo=100")

Detecting Batch Effects with PCA

# PCA on the expression matrix
let pca_result = pca(expr_matrix, 5)

print("=== PCA Variance Explained ===")
for i in 0..5 {
    print("PC{i+1}: {(pca_result.variance_explained[i] * 100) |> round(1)}%")
}

# Plot PC1 vs PC2, colored by center
let pca_data = table({
    "PC1": pca_result.scores |> map(|s| s[0]),
    "PC2": pca_result.scores |> map(|s| s[1]),
    "center": center,
    "condition": condition
})

pca_plot(pca_data)

# If center dominates PC1 and condition is only visible on PC3+,
# batch effects are overwhelming the biological signal

Quantifying Batch Effect with ANOVA

# For each gene, test how much variance is explained by batch vs biology
let batch_significant = 0
let bio_significant = 0

print("=== ANOVA: Batch vs Biology ===")

for g in 0..10 {
    let gene_expr = expr_matrix |> map(|row| row[g])

    # Group by center
    let memorial_expr = []
    let hopkins_expr = []
    let mayo_expr = []
    for i in 0..n_samples {
        if center[i] == "Memorial" { memorial_expr = memorial_expr + [gene_expr[i]] }
        else if center[i] == "Hopkins" { hopkins_expr = hopkins_expr + [gene_expr[i]] }
        else { mayo_expr = mayo_expr + [gene_expr[i]] }
    }
    let batch_test = anova([memorial_expr, hopkins_expr, mayo_expr])

    # Group by condition
    let tumor_expr = []
    let normal_expr = []
    for i in 0..n_samples {
        if condition[i] == "Tumor" { tumor_expr = tumor_expr + [gene_expr[i]] }
        else { normal_expr = normal_expr + [gene_expr[i]] }
    }
    let bio_test = ttest(tumor_expr, normal_expr)

    if batch_test.p_value < 0.05 { batch_significant = batch_significant + 1 }
    if bio_test.p_value < 0.05 { bio_significant = bio_significant + 1 }

    print("Gene {g+1}: batch F={batch_test.f_statistic |> round(2)} p={batch_test.p_value |> round(4)}  bio p={bio_test.p_value |> round(4)}")
}

print("\nGenes with significant batch effect: {batch_significant} / 10")
print("Genes with significant biological effect: {bio_significant} / 10")

Correlation Heatmap for Batch Detection

# Sample-to-sample correlation: compute a few representative pairs
# Full matrix would be 300x300; spot-check a few
let s1 = expr_matrix[0]   # Memorial, Tumor
let s2 = expr_matrix[50]  # Memorial, Normal
let s3 = expr_matrix[100] # Hopkins, Tumor

print("=== Sample Correlations ===")
print("Memorial Tumor vs Memorial Normal: {cor(s1, s2) |> round(3)}")
print("Memorial Tumor vs Hopkins Tumor:   {cor(s1, s3) |> round(3)}")
print("If same-center pairs are more correlated than same-condition pairs,")
print("batch effects dominate")

Box Plots by Batch

# Distribution of a batch-affected gene across centers
let gene_1_expr = expr_matrix |> map(|row| row[0])

let memorial_g1 = []
let hopkins_g1 = []
let mayo_g1 = []
for i in 0..n_samples {
    if center[i] == "Memorial" { memorial_g1 = memorial_g1 + [gene_1_expr[i]] }
    else if center[i] == "Hopkins" { hopkins_g1 = hopkins_g1 + [gene_1_expr[i]] }
    else { mayo_g1 = mayo_g1 + [gene_1_expr[i]] }
}

let bp_table = table({"Memorial": memorial_g1, "Hopkins": hopkins_g1, "Mayo": mayo_g1})
boxplot(bp_table, {title: "Gene 1 Expression by Center"})

# Systematic shifts between centers = batch effect

Correcting Batch Effects: Include as Covariate

# The simplest and most transparent correction:
# include batch as a covariate in your statistical model

# For differential expression: multiple regression
for g in 0..5 {
    let gene_expr = expr_matrix |> map(|row| row[g])

    # Encode condition: Tumor=1, Normal=0
    let cond_numeric = condition |> map(|c| if c == "Tumor" { 1.0 } else { 0.0 })

    # Encode center: dummy variables
    let is_hopkins = center |> map(|c| if c == "Hopkins" { 1.0 } else { 0.0 })
    let is_mayo = center |> map(|c| if c == "Mayo" { 1.0 } else { 0.0 })

    # Model WITHOUT batch correction
    let model_naive = lm(gene_expr, cond_numeric)

    # Model WITH batch correction (include center as covariate)
    let adj_data = table({
        "expr": gene_expr, "cond": cond_numeric,
        "hopkins": is_hopkins, "mayo": is_mayo
    })
    let model_adjusted = lm("expr ~ cond + hopkins + mayo", adj_data)

    print("Gene {g+1}:")
    print("  Naive:    slope = {model_naive.slope |> round(3)}, p = {model_naive.p_value |> round(4)}")
    print("  Adjusted: cond coef = {model_adjusted.coefficients[0] |> round(3)}")
}

Before/After Comparison

# Visualize the effect of batch correction with PCA

# Step 1: PCA on raw data (batch-contaminated)
let pca_before = pca(expr_matrix, 3)
print("=== Before Correction ===")
print("PC1 variance: {(pca_before.variance_explained[0] * 100) |> round(1)}% (likely batch)")

# Step 2: Regress out batch effect from each gene
let is_hopkins = center |> map(|c| if c == "Hopkins" { 1.0 } else { 0.0 })
let is_mayo = center |> map(|c| if c == "Mayo" { 1.0 } else { 0.0 })

let corrected_matrix = []
for i in 0..n_samples {
    let corrected_sample = []
    for g in 0..n_genes {
        let gene_expr = expr_matrix |> map(|row| row[g])
        let batch_data = table({"expr": gene_expr, "hopkins": is_hopkins, "mayo": is_mayo})
        let model = lm("expr ~ hopkins + mayo", batch_data)

        let residual = gene_expr[i] - model.coefficients[0] * is_hopkins[i]
            - model.coefficients[1] * is_mayo[i]
        corrected_sample = corrected_sample + [residual]
    }
    corrected_matrix = corrected_matrix + [corrected_sample]
}

# Step 3: PCA on corrected data
let pca_after = pca(corrected_matrix, 3)
print("\n=== After Correction ===")
print("PC1 variance: {(pca_after.variance_explained[0] * 100) |> round(1)}% (should be biology now)")

# Compare side by side
let pca_corrected = table({
    "PC1": pca_after.scores |> map(|s| s[0]),
    "PC2": pca_after.scores |> map(|s| s[1]),
    "condition": condition
})
pca_plot(pca_corrected)

Detecting the Confounded Design Trap

# Check whether batch and condition are confounded
print("=== Balance Check ===")
print("Center       Tumor    Normal   % Tumor")

let centers = ["Memorial", "Hopkins", "Mayo"]
for c in centers {
    let n_tumor = 0
    let n_normal = 0
    for i in 0..n_samples {
        if center[i] == c {
            if condition[i] == "Tumor" { n_tumor = n_tumor + 1 }
            else { n_normal = n_normal + 1 }
        }
    }
    let pct = (n_tumor / (n_tumor + n_normal) * 100) |> round(1)
    print("{c}       {n_tumor}       {n_normal}      {pct}%")
}

# If one center is 100% tumor and another is 100% normal,
# the design is fatally confounded — no statistical fix exists
print("\nDesign assessment:")
let balanced = true
for c in centers {
    let has_tumor = false
    let has_normal = false
    for i in 0..n_samples {
        if center[i] == c {
            if condition[i] == "Tumor" { has_tumor = true }
            else { has_normal = true }
        }
    }
    if !has_tumor || !has_normal {
        print("FATAL: {c} has only one condition!")
        balanced = false
    }
}
if balanced {
    print("Design is balanced — batch correction is possible")
}

Python:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# PCA for batch detection
pca = PCA(n_components=5)
scores = pca.fit_transform(expr_matrix)
plt.scatter(scores[:, 0], scores[:, 1], c=batch_labels, cmap='Set1')
plt.title('PCA Colored by Batch')

# ComBat batch correction
from combat.pycombat import pycombat
corrected = pycombat(expr_df, batch_series)

# SVA for unknown batches
# Use the sva package via rpy2 or pydeseq2

# Include batch as covariate in DE analysis
import statsmodels.api as sm
X = sm.add_constant(pd.get_dummies(df[['condition', 'center']], drop_first=True))
model = sm.OLS(gene_expr, X).fit()

# PCA for batch detection
pca <- prcomp(t(expr_matrix), scale. = TRUE)
plot(pca$x[,1], pca$x[,2], col = batch, pch = 19)

# ComBat batch correction
library(sva)
corrected <- ComBat(dat = expr_matrix, batch = center)

# ComBat-seq for RNA-seq counts
corrected <- ComBat_seq(counts = count_matrix, batch = center)

# SVA for unknown batches
mod <- model.matrix(~ condition)
mod0 <- model.matrix(~ 1, data = pdata)
svobj <- sva(expr_matrix, mod, mod0)

# Include in DE model (limma)
design <- model.matrix(~ condition + center)
fit <- lmFit(expr_matrix, design)
fit <- eBayes(fit)
topTable(fit, coef = "conditionTumor")

Exercises

Exercise 1: Detect Batch Effects

Given a simulated expression matrix with hidden batch effects, use PCA and ANOVA to identify which technical variable is causing the problem.

let n = 120
let n_genes = 30

# Simulate with batch effect on processing_date
# Biology: treatment vs control
# Batch: 3 processing dates

# 1. Run pca() and color by treatment, then by processing date
# 2. Which variable dominates PC1?
# 3. How many genes show significant batch effects (anova)?
# 4. How many show significant treatment effects?

Exercise 2: Balanced vs. Confounded Design

Create two study designs: one where batch and condition are balanced, another where they are completely confounded. Show that the confounded design cannot be corrected.


# Design A (balanced): equal tumor/normal at each center
# Design B (confounded): all tumor at Center 1, all normal at Center 2

# 1. Simulate data for both designs with identical biological effects
# 2. Apply batch correction (include center as covariate in lm())
# 3. Compare: does Design A recover the true biological signal?
# 4. Does Design B? Why or why not?

Exercise 3: Before/After Correction

Apply batch correction to a multi-center dataset and create a before/after PCA comparison. Quantify how much the batch effect is reduced.


# 1. Simulate 200 samples from 4 centers
# 2. pca() before correction — measure batch variance (anova on PC1 scores)
# 3. Correct by regressing out center effects with lm()
# 4. pca() after correction — measure batch variance again
# 5. What percentage of the batch effect was removed?

Exercise 4: Simpson’s Paradox

Simulate a drug trial where Drug A is better within every hospital, but Drug B appears better overall due to confounding. Demonstrate the paradox numerically.


# Hospital 1: treats mild cases (80% survival baseline)
#   Drug A: 85% survival, Drug B: 75% survival
# Hospital 2: treats severe cases (30% survival baseline)
#   Drug A: 35% survival, Drug B: 25% survival
# Hospital 2 uses Drug A more often

# 1. Show Drug A wins within each hospital
# 2. Combine data — show Drug B appears better overall
# 3. Explain why (confounding by disease severity)
# 4. What analysis would give the correct answer?

Exercise 5: Designing a Batch-Robust Study

You have 60 tumor and 60 normal samples that must be processed across 3 days (40 per day). Design the processing schedule to minimize confounding, and simulate data to verify your design resists batch effects.

# Good design: 20 tumor + 20 normal per day
# Bad design: 40 tumor on day 1, 40 tumor on day 2, 60 normal on day 3

# 1. Create the good balanced assignment
# 2. Create the bad confounded assignment
# 3. Simulate data with identical batch and biological effects
# 4. Analyze both designs — which recovers the true biology?

Key Takeaways

Batch effects are systematic technical differences that can dominate biological signals — they are present in virtually all high-throughput datasets
PCA is the most powerful tool for detecting batch effects: color by batch and biological variables to see which dominates
Confounders create spurious associations (or mask real ones); Simpson’s paradox is the extreme case where aggregate trends reverse within subgroups
Prevention is the best strategy: use balanced, randomized designs that distribute biological conditions across batches
Confounded designs (batch = biology) cannot be rescued by any statistical method — the experiment must be redesigned
Correction approaches: include batch as a covariate (simplest), ComBat (empirical Bayes), SVA (discover unknown batches), or RUVseq (negative controls)
Always perform a balance check before analysis: ensure no batch variable is perfectly correlated with the biological variable of interest
Before/after PCA is the standard way to demonstrate that batch correction worked

What’s Next

With three full weeks of biostatistics under your belt, you’re ready for the advanced topics of Week 4. Day 21 introduces dimensionality reduction — PCA, t-SNE, and UMAP — the tools that turn 20,000-dimensional gene expression data into interpretable 2D visualizations.

Keyboard shortcuts

Practical Biostatistics in 30 Days