Day 20: Batch Effects and Confounders
The Problem
Dr. David Liu is leading a multi-center study comparing gene expression in 200 breast tumors versus 100 normal breast tissues. Three hospitals contributed samples: Memorial (80 tumors, 40 normals), Hopkins (70 tumors, 30 normals), and Mayo (50 tumors, 30 normals). He runs PCA on the full expression matrix, expecting to see tumor and normal separate.
Instead, the PCA plot shows three tight clusters — one per hospital. The first principal component explains 35% of variance and perfectly separates the centers. Tumor vs. normal? It’s buried in PC4 at 4% of variance.
The dominant signal in his data is which hospital processed the sample, not the biology he’s studying. These are batch effects, and they’re one of the most insidious problems in genomics.
What Are Batch Effects?
Batch effects are systematic technical differences between groups of samples that were processed differently. They have nothing to do with biology but can completely dominate a dataset.
| Source | Mechanism | Example |
|---|---|---|
| Processing date | Reagent lots, temperature, humidity | Monday vs. Friday RNA extractions |
| Technician | Handling differences | Technician A vs. B |
| Center/site | Different protocols, equipment | Hospital A vs. B |
| Sequencing lane | Flow cell chemistry, loading density | Lane 1 vs. Lane 2 |
| Plate position | Edge effects, evaporation | Well A1 vs. H12 |
| Reagent lot | Batch-to-batch kit variation | Kit lot 2024A vs. 2024B |
| Storage time | RNA degradation over time | Samples banked in 2020 vs. 2024 |
Key insight: Batch effects are not random noise — they are systematic biases that affect thousands of genes simultaneously. They shift entire samples in the same direction, which is why PCA detects them so readily.
How Large Are Batch Effects?
In a landmark 2010 study, Leek et al. analyzed publicly available microarray data and found:
- Batch effects were present in virtually all high-throughput datasets
- They often explained more variance than the biological signal of interest
- They affected thousands of genes per batch, not just a handful
In RNA-seq, batch effects are typically smaller than in microarrays but still substantial — often explaining 10-30% of total variance.
Identifying Batch Effects
1. PCA Visualization
The most powerful diagnostic. Color samples by batch variable and biological variable. If batch dominates PC1/PC2, you have a problem.
2. Correlation Heatmap
Compute sample-to-sample correlation. If samples cluster by batch rather than biology, batch effects are present.
3. Box Plots by Batch
Plot the distribution of expression values (or a summary statistic) stratified by batch. Systematic shifts indicate batch effects.
4. ANOVA for Batch
For each gene, test whether expression differs significantly by batch. If hundreds or thousands of genes show batch effects, the problem is pervasive.
Confounders: A Deeper Problem
A confounder is a variable associated with BOTH the predictor and the outcome, creating a spurious association (or masking a real one).
| Scenario | Confounder | Danger |
|---|---|---|
| Gene expression differs by sex | Tumor subtype differs by sex | Sex drives both |
| Drug response varies by ethnicity | Genetic variant frequency varies by ethnicity | Population stratification |
| Survival differs by TP53 status | Stage differs by TP53 status | Stage confounds TP53 effect |
| Expression differs by treatment | Samples processed on different days | Processing day = treatment |
Simpson’s Paradox
The most dramatic form of confounding. A trend that appears in aggregate reverses when groups are separated:
| Hospital | Drug A Survival | Drug B Survival | Better Drug |
|---|---|---|---|
| Hospital 1 (mild cases) | 95% | 90% | Drug A |
| Hospital 2 (severe cases) | 40% | 30% | Drug A |
| Combined | 55% | 70% | Drug B?? |
Drug A is better at BOTH hospitals, but Drug B appears better overall because Hospital 2 (severe cases, low survival) preferentially used Drug A.
Clinical relevance: Simpson’s paradox has real consequences. In the 1970s, Berkeley was accused of sex discrimination in graduate admissions. Overall, women had lower acceptance rates. But department by department, women were accepted at equal or higher rates — they had applied more often to competitive departments with low overall acceptance rates.
The Confounded Design Trap
The most dangerous scenario: when batch perfectly correlates with biology.
| Sample | Center | Condition |
|---|---|---|
| 1-50 | Center A | All Tumor |
| 51-100 | Center B | All Normal |
Here, center and condition are completely confounded. Every difference between tumor and normal could equally be a difference between Center A and Center B. No statistical method can separate them. The experiment is fatally flawed.
Common pitfall: This happens more often than you’d think. A collaborator sends tumor samples from one hospital and normal samples from another. Or samples are processed tumor on Monday, normal on Tuesday. The solution is balanced design — ensure batch variables are distributed across biological groups.
Strategies for Handling Batch Effects
1. Prevention (Best Option)
| Strategy | How |
|---|---|
| Balanced design | Distribute conditions across batches equally |
| Randomization | Random assignment to processing order/position |
| Blocking | Process one sample from each condition per batch |
| Standard protocols | Minimize technical variation with SOPs |
| Single batch | Process everything together (often impractical) |
2. Detection
| Method | What It Reveals |
|---|---|
| PCA colored by batch | Visual clustering by technical variable |
| Correlation heatmap | Sample similarity by batch |
| Box plots by batch | Distribution shifts |
| ANOVA per gene | Number of genes affected |
3. Correction
| Method | Approach | Caveat |
|---|---|---|
| Include as covariate | Add batch to regression/DE model | Requires balanced design |
| ComBat (parametric) | Empirical Bayes batch correction | Assumes batch is known |
| ComBat-seq | ComBat for raw RNA-seq counts | Preserves count nature |
| SVA (surrogate variables) | Discovers unknown batch effects | May remove biology |
| RUVseq | Uses control genes | Needs negative controls |
Common pitfall: Do NOT use batch correction methods when batch is perfectly confounded with biology. They will remove both the batch effect AND the biological signal. If all tumors were processed in batch 1 and all normals in batch 2, no correction method can save you — the experiment must be redesigned.
Batch Effects in BioLang
Simulating Batch-Contaminated Data
set_seed(42)
# Simulate a multi-center gene expression study
let n_samples = 300
let n_genes = 50
# Assign samples to centers and conditions
let center = []
let condition = []
for i in 0..n_samples {
if i < 100 {
center = center + ["Memorial"]
condition = condition + [if i < 60 { "Tumor" } else { "Normal" }]
} else if i < 200 {
center = center + ["Hopkins"]
condition = condition + [if i < 170 { "Tumor" } else { "Normal" }]
} else {
center = center + ["Mayo"]
condition = condition + [if i < 250 { "Tumor" } else { "Normal" }]
}
}
# Generate expression matrix with batch effects
let expr_matrix = []
let batch_effects = {
"Memorial": rnorm(n_genes, 2.0, 0.5),
"Hopkins": rnorm(n_genes, -1.5, 0.5),
"Mayo": rnorm(n_genes, 0.5, 0.5)
}
# True biological signal (tumor vs normal)
let bio_signal = rnorm(n_genes, 0, 0.3)
# Only 10 genes are truly DE
for g in 0..10 {
bio_signal[g] = rnorm(1, 1.5, 0.3)[0]
}
for i in 0..n_samples {
let sample_expr = []
for g in 0..n_genes {
let base = 10 + rnorm(1, 0, 1)[0]
let batch = batch_effects[center[i]][g]
let bio = if condition[i] == "Tumor" { bio_signal[g] } else { 0 }
sample_expr = sample_expr + [base + batch + bio]
}
expr_matrix = expr_matrix + [sample_expr]
}
print("Expression matrix: {n_samples} samples x {n_genes} genes")
print("Centers: Memorial=100, Hopkins=100, Mayo=100")
Detecting Batch Effects with PCA
# PCA on the expression matrix
let pca_result = pca(expr_matrix, 5)
print("=== PCA Variance Explained ===")
for i in 0..5 {
print("PC{i+1}: {(pca_result.variance_explained[i] * 100) |> round(1)}%")
}
# Plot PC1 vs PC2, colored by center
let pca_data = table({
"PC1": pca_result.scores |> map(|s| s[0]),
"PC2": pca_result.scores |> map(|s| s[1]),
"center": center,
"condition": condition
})
pca_plot(pca_data)
# If center dominates PC1 and condition is only visible on PC3+,
# batch effects are overwhelming the biological signal
Quantifying Batch Effect with ANOVA
# For each gene, test how much variance is explained by batch vs biology
let batch_significant = 0
let bio_significant = 0
print("=== ANOVA: Batch vs Biology ===")
for g in 0..10 {
let gene_expr = expr_matrix |> map(|row| row[g])
# Group by center
let memorial_expr = []
let hopkins_expr = []
let mayo_expr = []
for i in 0..n_samples {
if center[i] == "Memorial" { memorial_expr = memorial_expr + [gene_expr[i]] }
else if center[i] == "Hopkins" { hopkins_expr = hopkins_expr + [gene_expr[i]] }
else { mayo_expr = mayo_expr + [gene_expr[i]] }
}
let batch_test = anova([memorial_expr, hopkins_expr, mayo_expr])
# Group by condition
let tumor_expr = []
let normal_expr = []
for i in 0..n_samples {
if condition[i] == "Tumor" { tumor_expr = tumor_expr + [gene_expr[i]] }
else { normal_expr = normal_expr + [gene_expr[i]] }
}
let bio_test = ttest(tumor_expr, normal_expr)
if batch_test.p_value < 0.05 { batch_significant = batch_significant + 1 }
if bio_test.p_value < 0.05 { bio_significant = bio_significant + 1 }
print("Gene {g+1}: batch F={batch_test.f_statistic |> round(2)} p={batch_test.p_value |> round(4)} bio p={bio_test.p_value |> round(4)}")
}
print("\nGenes with significant batch effect: {batch_significant} / 10")
print("Genes with significant biological effect: {bio_significant} / 10")
Correlation Heatmap for Batch Detection
# Sample-to-sample correlation: compute a few representative pairs
# Full matrix would be 300x300; spot-check a few
let s1 = expr_matrix[0] # Memorial, Tumor
let s2 = expr_matrix[50] # Memorial, Normal
let s3 = expr_matrix[100] # Hopkins, Tumor
print("=== Sample Correlations ===")
print("Memorial Tumor vs Memorial Normal: {cor(s1, s2) |> round(3)}")
print("Memorial Tumor vs Hopkins Tumor: {cor(s1, s3) |> round(3)}")
print("If same-center pairs are more correlated than same-condition pairs,")
print("batch effects dominate")
Box Plots by Batch
# Distribution of a batch-affected gene across centers
let gene_1_expr = expr_matrix |> map(|row| row[0])
let memorial_g1 = []
let hopkins_g1 = []
let mayo_g1 = []
for i in 0..n_samples {
if center[i] == "Memorial" { memorial_g1 = memorial_g1 + [gene_1_expr[i]] }
else if center[i] == "Hopkins" { hopkins_g1 = hopkins_g1 + [gene_1_expr[i]] }
else { mayo_g1 = mayo_g1 + [gene_1_expr[i]] }
}
let bp_table = table({"Memorial": memorial_g1, "Hopkins": hopkins_g1, "Mayo": mayo_g1})
boxplot(bp_table, {title: "Gene 1 Expression by Center"})
# Systematic shifts between centers = batch effect
Correcting Batch Effects: Include as Covariate
# The simplest and most transparent correction:
# include batch as a covariate in your statistical model
# For differential expression: multiple regression
for g in 0..5 {
let gene_expr = expr_matrix |> map(|row| row[g])
# Encode condition: Tumor=1, Normal=0
let cond_numeric = condition |> map(|c| if c == "Tumor" { 1.0 } else { 0.0 })
# Encode center: dummy variables
let is_hopkins = center |> map(|c| if c == "Hopkins" { 1.0 } else { 0.0 })
let is_mayo = center |> map(|c| if c == "Mayo" { 1.0 } else { 0.0 })
# Model WITHOUT batch correction
let model_naive = lm(gene_expr, cond_numeric)
# Model WITH batch correction (include center as covariate)
let adj_data = table({
"expr": gene_expr, "cond": cond_numeric,
"hopkins": is_hopkins, "mayo": is_mayo
})
let model_adjusted = lm("expr ~ cond + hopkins + mayo", adj_data)
print("Gene {g+1}:")
print(" Naive: slope = {model_naive.slope |> round(3)}, p = {model_naive.p_value |> round(4)}")
print(" Adjusted: cond coef = {model_adjusted.coefficients[0] |> round(3)}")
}
Before/After Comparison
# Visualize the effect of batch correction with PCA
# Step 1: PCA on raw data (batch-contaminated)
let pca_before = pca(expr_matrix, 3)
print("=== Before Correction ===")
print("PC1 variance: {(pca_before.variance_explained[0] * 100) |> round(1)}% (likely batch)")
# Step 2: Regress out batch effect from each gene
let is_hopkins = center |> map(|c| if c == "Hopkins" { 1.0 } else { 0.0 })
let is_mayo = center |> map(|c| if c == "Mayo" { 1.0 } else { 0.0 })
let corrected_matrix = []
for i in 0..n_samples {
let corrected_sample = []
for g in 0..n_genes {
let gene_expr = expr_matrix |> map(|row| row[g])
let batch_data = table({"expr": gene_expr, "hopkins": is_hopkins, "mayo": is_mayo})
let model = lm("expr ~ hopkins + mayo", batch_data)
let residual = gene_expr[i] - model.coefficients[0] * is_hopkins[i]
- model.coefficients[1] * is_mayo[i]
corrected_sample = corrected_sample + [residual]
}
corrected_matrix = corrected_matrix + [corrected_sample]
}
# Step 3: PCA on corrected data
let pca_after = pca(corrected_matrix, 3)
print("\n=== After Correction ===")
print("PC1 variance: {(pca_after.variance_explained[0] * 100) |> round(1)}% (should be biology now)")
# Compare side by side
let pca_corrected = table({
"PC1": pca_after.scores |> map(|s| s[0]),
"PC2": pca_after.scores |> map(|s| s[1]),
"condition": condition
})
pca_plot(pca_corrected)
Detecting the Confounded Design Trap
# Check whether batch and condition are confounded
print("=== Balance Check ===")
print("Center Tumor Normal % Tumor")
let centers = ["Memorial", "Hopkins", "Mayo"]
for c in centers {
let n_tumor = 0
let n_normal = 0
for i in 0..n_samples {
if center[i] == c {
if condition[i] == "Tumor" { n_tumor = n_tumor + 1 }
else { n_normal = n_normal + 1 }
}
}
let pct = (n_tumor / (n_tumor + n_normal) * 100) |> round(1)
print("{c} {n_tumor} {n_normal} {pct}%")
}
# If one center is 100% tumor and another is 100% normal,
# the design is fatally confounded — no statistical fix exists
print("\nDesign assessment:")
let balanced = true
for c in centers {
let has_tumor = false
let has_normal = false
for i in 0..n_samples {
if center[i] == c {
if condition[i] == "Tumor" { has_tumor = true }
else { has_normal = true }
}
}
if !has_tumor || !has_normal {
print("FATAL: {c} has only one condition!")
balanced = false
}
}
if balanced {
print("Design is balanced — batch correction is possible")
}
Python:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# PCA for batch detection
pca = PCA(n_components=5)
scores = pca.fit_transform(expr_matrix)
plt.scatter(scores[:, 0], scores[:, 1], c=batch_labels, cmap='Set1')
plt.title('PCA Colored by Batch')
# ComBat batch correction
from combat.pycombat import pycombat
corrected = pycombat(expr_df, batch_series)
# SVA for unknown batches
# Use the sva package via rpy2 or pydeseq2
# Include batch as covariate in DE analysis
import statsmodels.api as sm
X = sm.add_constant(pd.get_dummies(df[['condition', 'center']], drop_first=True))
model = sm.OLS(gene_expr, X).fit()
R:
# PCA for batch detection
pca <- prcomp(t(expr_matrix), scale. = TRUE)
plot(pca$x[,1], pca$x[,2], col = batch, pch = 19)
# ComBat batch correction
library(sva)
corrected <- ComBat(dat = expr_matrix, batch = center)
# ComBat-seq for RNA-seq counts
corrected <- ComBat_seq(counts = count_matrix, batch = center)
# SVA for unknown batches
mod <- model.matrix(~ condition)
mod0 <- model.matrix(~ 1, data = pdata)
svobj <- sva(expr_matrix, mod, mod0)
# Include in DE model (limma)
design <- model.matrix(~ condition + center)
fit <- lmFit(expr_matrix, design)
fit <- eBayes(fit)
topTable(fit, coef = "conditionTumor")
Exercises
Exercise 1: Detect Batch Effects
Given a simulated expression matrix with hidden batch effects, use PCA and ANOVA to identify which technical variable is causing the problem.
let n = 120
let n_genes = 30
# Simulate with batch effect on processing_date
# Biology: treatment vs control
# Batch: 3 processing dates
# 1. Run pca() and color by treatment, then by processing date
# 2. Which variable dominates PC1?
# 3. How many genes show significant batch effects (anova)?
# 4. How many show significant treatment effects?
Exercise 2: Balanced vs. Confounded Design
Create two study designs: one where batch and condition are balanced, another where they are completely confounded. Show that the confounded design cannot be corrected.
# Design A (balanced): equal tumor/normal at each center
# Design B (confounded): all tumor at Center 1, all normal at Center 2
# 1. Simulate data for both designs with identical biological effects
# 2. Apply batch correction (include center as covariate in lm())
# 3. Compare: does Design A recover the true biological signal?
# 4. Does Design B? Why or why not?
Exercise 3: Before/After Correction
Apply batch correction to a multi-center dataset and create a before/after PCA comparison. Quantify how much the batch effect is reduced.
# 1. Simulate 200 samples from 4 centers
# 2. pca() before correction — measure batch variance (anova on PC1 scores)
# 3. Correct by regressing out center effects with lm()
# 4. pca() after correction — measure batch variance again
# 5. What percentage of the batch effect was removed?
Exercise 4: Simpson’s Paradox
Simulate a drug trial where Drug A is better within every hospital, but Drug B appears better overall due to confounding. Demonstrate the paradox numerically.
# Hospital 1: treats mild cases (80% survival baseline)
# Drug A: 85% survival, Drug B: 75% survival
# Hospital 2: treats severe cases (30% survival baseline)
# Drug A: 35% survival, Drug B: 25% survival
# Hospital 2 uses Drug A more often
# 1. Show Drug A wins within each hospital
# 2. Combine data — show Drug B appears better overall
# 3. Explain why (confounding by disease severity)
# 4. What analysis would give the correct answer?
Exercise 5: Designing a Batch-Robust Study
You have 60 tumor and 60 normal samples that must be processed across 3 days (40 per day). Design the processing schedule to minimize confounding, and simulate data to verify your design resists batch effects.
# Good design: 20 tumor + 20 normal per day
# Bad design: 40 tumor on day 1, 40 tumor on day 2, 60 normal on day 3
# 1. Create the good balanced assignment
# 2. Create the bad confounded assignment
# 3. Simulate data with identical batch and biological effects
# 4. Analyze both designs — which recovers the true biology?
Key Takeaways
- Batch effects are systematic technical differences that can dominate biological signals — they are present in virtually all high-throughput datasets
- PCA is the most powerful tool for detecting batch effects: color by batch and biological variables to see which dominates
- Confounders create spurious associations (or mask real ones); Simpson’s paradox is the extreme case where aggregate trends reverse within subgroups
- Prevention is the best strategy: use balanced, randomized designs that distribute biological conditions across batches
- Confounded designs (batch = biology) cannot be rescued by any statistical method — the experiment must be redesigned
- Correction approaches: include batch as a covariate (simplest), ComBat (empirical Bayes), SVA (discover unknown batches), or RUVseq (negative controls)
- Always perform a balance check before analysis: ensure no batch variable is perfectly correlated with the biological variable of interest
- Before/after PCA is the standard way to demonstrate that batch correction worked
What’s Next
With three full weeks of biostatistics under your belt, you’re ready for the advanced topics of Week 4. Day 21 introduces dimensionality reduction — PCA, t-SNE, and UMAP — the tools that turn 20,000-dimensional gene expression data into interpretable 2D visualizations.