Day 29: Capstone — Differential Expression Study
The Problem
You receive an email from a gastroenterology collaborator: “We have RNA-seq data from 12 colon biopsies — 6 from colorectal tumors and 6 from matched normal tissue. We need to identify genes that are differentially expressed between tumor and normal, find pathways that are altered, and generate figures for a manuscript. Can you run the analysis?”
The raw data has already been aligned and quantified. You have a gene-by-sample count matrix: 15,000 genes (rows) by 12 samples (columns). Each entry is the number of sequencing reads mapped to that gene in that sample. The values range from 0 to several hundred thousand.
This is the bread and butter of computational genomics. Every RNA-seq experiment, every cancer study, every drug treatment analysis begins with some version of this pipeline. Today, you will build the complete analysis from scratch, applying methods from nearly every chapter of this book.
The Complete DE Pipeline
The analysis follows a standard workflow:
- Quality control: Library sizes, PCA for outliers and batch effects
- Filtering: Remove lowly expressed genes
- Normalization: Convert raw counts to comparable expression values
- Differential expression: Statistical testing per gene
- Multiple testing correction: FDR control
- Visualization: Volcano plot, heatmap, gene-level plots
- Biological interpretation: Top genes, correlation patterns
Section 1: Data Loading and Quality Control
set_seed(42)
# ============================================
# Differential Expression Analysis
# Colorectal Tumor vs Normal Colon
# 15,000 genes x 12 samples (6 tumor, 6 normal)
# ============================================
# --- Configuration ---
let CONFIG = {
n_genes: 15000,
n_tumor: 6,
n_normal: 6,
fc_threshold: 1.0, # log2 fold change
fdr_threshold: 0.05,
min_count: 10, # minimum count for filtering
min_samples: 3, # minimum samples above threshold
n_top_heatmap: 50,
n_top_label: 10
}
# --- Simulate realistic RNA-seq counts ---
# Most genes: low to moderate expression, no DE
# ~500 genes: truly upregulated in tumor (log2FC ~ 1.5-4)
# ~500 genes: truly downregulated in tumor (log2FC ~ -1.5 to -4)
# ~14000 genes: no true difference
let gene_names = seq(1, CONFIG.n_genes) |> map(|i| "Gene_" + str(i))
# Base expression: folded/truncated normal as a simple stand-in
# (real RNA-seq mean expression is closer to log-normal)
let base_means = rnorm(CONFIG.n_genes, 100, 80)
|> map(|x| max(1, round(abs(x), 0)))
# Generate count matrix
let counts = table(CONFIG.n_genes, 12, 0)
for g in 0..CONFIG.n_genes {
let mu = base_means[g]
for s in 0..12 {
# Negative binomial-like: Poisson with overdispersion
let lib_factor = rnorm(1, 1.0, 0.15)[0]
let this_mu = mu * max(0.1, lib_factor)
# Add differential expression for first 500 genes (up in tumor)
if g < 500 && s < 6 {
let fc = rnorm(1, 2.5, 0.8)[0]
this_mu = this_mu * pow(2, max(0.5, fc))
}
# Add DE for genes 500-999 (down in tumor)
if g >= 500 && g < 1000 && s < 6 {
let fc = rnorm(1, -2.0, 0.7)[0]
this_mu = this_mu * pow(2, min(-0.5, fc))
}
counts[g][s] = max(0, round(rpois(1, max(0.1, this_mu))[0], 0))
}
}
let sample_names = ["T1", "T2", "T3", "T4", "T5", "T6",
"N1", "N2", "N3", "N4", "N5", "N6"]
let groups = repeat("Tumor", 6) + repeat("Normal", 6)
print("Count matrix: " + str(CONFIG.n_genes) + " genes x 12 samples")
Library Size Check
Library sizes (total reads per sample) should be roughly comparable across samples; large differences point to technical problems.
# Library sizes
let lib_sizes = col_sums(counts)
print("\n=== Library Sizes ===")
for i in 0..12 {
print(sample_names[i] + " (" + groups[i] + "): " +
str(round(lib_sizes[i] / 1e6, 2)) + "M reads")
}
bar_chart(sample_names, lib_sizes |> map(|x| x / 1e6),
{title: "Library Sizes",
ylabel: "Millions of Reads"})
# Flag samples with library size < 50% or > 200% of median
let med_lib = median(lib_sizes)
for i in 0..12 {
let ratio = lib_sizes[i] / med_lib
if ratio < 0.5 || ratio > 2.0 {
print("WARNING: " + sample_names[i] + " has unusual library size (ratio=" +
str(round(ratio, 2)) + ")")
}
}
PCA for Outlier Detection
PCA is the first tool for quality control. Samples should cluster by biological group (tumor vs normal), not by batch or other technical factors.
# Quick normalization for QC PCA: log2(CPM + 1)
let cpm = counts |> map_cells(|x, col| x / lib_sizes[col] * 1e6)
let log_cpm = cpm |> map_cells(|x, _| log2(x + 1))
# PCA on all genes
let qc_pca = pca(transpose(log_cpm)) # transpose: samples as rows
scatter(qc_pca.scores[0], qc_pca.scores[1],
{xlabel: "PC1 (" + str(round(qc_pca.variance_explained[0] * 100, 1)) + "%)",
ylabel: "PC2 (" + str(round(qc_pca.variance_explained[1] * 100, 1)) + "%)",
title: "PCA — Quality Control (All Genes)"})
# Check: PC1 should separate tumor from normal
print("\n=== PCA Quality Control ===")
print("PC1 explains " + str(round(qc_pca.variance_explained[0] * 100, 1)) + "% of variance")
print("PC2 explains " + str(round(qc_pca.variance_explained[1] * 100, 1)) + "% of variance")
# Identify outliers (>3 SD from group centroid in PC1/PC2 space)
let tumor_pc1 = qc_pca.scores[0] |> select(0..6)
let normal_pc1 = qc_pca.scores[0] |> select(6..12)
let tumor_pc2 = qc_pca.scores[1] |> select(0..6)
let normal_pc2 = qc_pca.scores[1] |> select(6..12)
let tumor_spread = sqrt(pow(sd(tumor_pc1), 2) + pow(sd(tumor_pc2), 2))
let normal_spread = sqrt(pow(sd(normal_pc1), 2) + pow(sd(normal_pc2), 2))
for i in 0..6 {
let tumor_dist = sqrt(
pow(tumor_pc1[i] - mean(tumor_pc1), 2) +
pow(tumor_pc2[i] - mean(tumor_pc2), 2))
let normal_dist = sqrt(
pow(normal_pc1[i] - mean(normal_pc1), 2) +
pow(normal_pc2[i] - mean(normal_pc2), 2))
if tumor_dist > 3 * tumor_spread {
print("WARNING: " + sample_names[i] + " is a potential outlier (tumor group)")
}
if normal_dist > 3 * normal_spread {
print("WARNING: " + sample_names[i+6] + " is a potential outlier (normal group)")
}
}
Sample Correlation Heatmap
# Correlation matrix between samples
let cor_mat = cor_matrix(transpose(log_cpm))
heatmap(cor_mat,
{color_scale: "red_blue",
title: "Sample-Sample Correlation Heatmap",
cluster_rows: true,
cluster_cols: true})
# All same-group correlations should be high (>0.9)
let min_within_tumor = 1.0
let min_within_normal = 1.0
for i in 0..6 {
for j in (i+1)..6 {
min_within_tumor = min(min_within_tumor, cor_mat[i][j])
min_within_normal = min(min_within_normal, cor_mat[i+6][j+6])
}
}
print("\nMinimum within-tumor correlation: " + str(round(min_within_tumor, 3)))
print("Minimum within-normal correlation: " + str(round(min_within_normal, 3)))
Key insight: If PCA shows samples clustering by something other than biology (e.g., batch, RNA extraction date, sequencing lane), you have a batch effect (Day 20). Address it before proceeding with DE analysis.
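That fix can be made concrete. Below is a minimal Python sketch (numpy only; the data and variable names are illustrative) that simulates an additive batch shift and removes it by regressing each gene on a batch indicator, one simple option alongside production tools such as ComBat or limma's removeBatchEffect:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 100 genes x 12 samples, samples 0-5 in batch A, 6-11 in batch B
expr = rng.normal(8.0, 1.0, size=(100, 12))
batch = np.array([0] * 6 + [1] * 6)
expr[:, batch == 1] += 1.5            # additive batch shift on batch B

# Regress each gene on the batch indicator; subtract the fitted batch effect
X = np.column_stack([np.ones(12), batch])          # design: intercept + batch
beta, *_ = np.linalg.lstsq(X, expr.T, rcond=None)  # shape (2, n_genes)
corrected = expr - np.outer(beta[1], batch)        # remove per-gene batch term

# After correction, per-gene batch means should agree
gap = corrected[:, batch == 0].mean(axis=1) - corrected[:, batch == 1].mean(axis=1)
print(round(float(np.abs(gap).max()), 6))  # ~0: batch means now agree
```

Because the design is an intercept plus a binary indicator, the fitted batch coefficient equals the difference in batch means exactly, so the corrected batch means coincide gene by gene.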
Section 2: Gene Filtering
Low-expression genes add noise and multiple testing burden without contributing useful information. Filter them before testing.
# Filter: keep genes with at least min_count reads in at least min_samples
let keep = []
let remove_count = 0
for g in 0..CONFIG.n_genes {
let above_threshold = 0
for s in 0..12 {
if counts[g][s] >= CONFIG.min_count {
above_threshold = above_threshold + 1
}
}
if above_threshold >= CONFIG.min_samples {
keep = keep + [g]
} else {
remove_count = remove_count + 1
}
}
print("\n=== Gene Filtering ===")
print("Genes before filtering: " + str(CONFIG.n_genes))
print("Genes removed (low expression): " + str(remove_count))
print("Genes retained: " + str(len(keep)))
# Subset to kept genes
let filtered_counts = counts |> select_rows(keep)
let filtered_names = gene_names |> select(keep)
let filtered_log_cpm = log_cpm |> select_rows(keep)
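For comparison, the same count filter is a one-liner in pandas. This is a sketch on simulated counts; with a real genes-by-samples DataFrame, only the two lines computing keep and filtered are needed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.poisson(20, size=(1000, 12)),
                      index=[f"Gene_{i}" for i in range(1000)])

min_count, min_samples = 10, 3
# Keep genes with >= min_count reads in >= min_samples samples
keep = (counts >= min_count).sum(axis=1) >= min_samples
filtered = counts.loc[keep]
print(f"retained {int(keep.sum())} of {len(counts)} genes")
```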
Section 3: Normalization
Raw counts are not directly comparable between samples because of different library sizes. We normalize to log2 counts per million (CPM).
# Normalize: log2(CPM + 1)
let norm_expr = filtered_counts |> map_cells(|x, col|
log2(x / lib_sizes[col] * 1e6 + 1))
print("\n=== Normalization ===")
print("Method: log2(CPM + 1)")
print("Expression range: [" +
str(round(min_all(norm_expr), 2)) + ", " +
str(round(max_all(norm_expr), 2)) + "]")
# Density plot of normalized expression by sample
for i in 0..12 {
density(norm_expr |> col(i), {label: sample_names[i]})
}
Common pitfall: CPM normalization assumes that most genes are not differentially expressed. If a large fraction of genes are DE (uncommon but possible in some cancer comparisons), CPM can introduce systematic bias. More sophisticated methods like TMM (edgeR) or median-of-ratios (DESeq2) handle this better. For typical DE analyses with <20% DE genes, log2(CPM+1) is adequate.
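The median-of-ratios idea is simple enough to sketch directly. Assuming a numpy array of raw counts (genes x samples), each sample's size factor is the median, across genes, of its counts divided by that gene's geometric mean over all samples; genes with any zero count are skipped:

```python
import numpy as np

def size_factors(counts):
    """DESeq2-style median-of-ratios size factors (a sketch, not the real API).
    counts: genes x samples array of raw counts."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    log_geo_mean = log_counts.mean(axis=1)        # per-gene log geometric mean
    usable = np.isfinite(log_geo_mean)            # skip genes with any zero count
    log_ratios = log_counts[usable] - log_geo_mean[usable, None]
    return np.exp(np.median(log_ratios, axis=0))  # one factor per sample

rng = np.random.default_rng(1)
counts = rng.poisson(50, size=(500, 12)).astype(float)
counts[:, 0] *= 2                                 # sample 0 sequenced twice as deep
sf = size_factors(counts)
print(np.round(sf, 2))                            # sample 0 close to 2x the rest
```

Dividing counts by these factors equalizes depth without assuming every gene tracks the depth change, which is why median-of-ratios tolerates a modest fraction of DE genes better than plain CPM.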
Section 4: Differential Expression Testing
We test each gene individually using Welch’s t-test, comparing the 6 tumor samples to the 6 normal samples.
# --- Gene-by-gene testing ---
print("\n=== Differential Expression Testing ===")
print("Testing " + str(len(keep)) + " genes (Welch t-test)...")
let de_results = []
for g in 0..len(keep) {
let tumor_vals = norm_expr[g] |> select(0..6)
let normal_vals = norm_expr[g] |> select(6..12)
let log2fc = mean(tumor_vals) - mean(normal_vals)
let tt = ttest(tumor_vals, normal_vals)
# Cohen's d inline
let pooled_sd = sqrt(((len(tumor_vals) - 1) * pow(sd(tumor_vals), 2) +
(len(normal_vals) - 1) * pow(sd(normal_vals), 2)) /
(len(tumor_vals) + len(normal_vals) - 2))
let d = if pooled_sd > 0 { log2fc / pooled_sd } else { 0 }
de_results = de_results + [{
gene: filtered_names[g],
gene_idx: keep[g],
log2fc: log2fc,
mean_tumor: mean(tumor_vals),
mean_normal: mean(normal_vals),
t_stat: tt.statistic,
p_value: tt.p_value,
cohens_d: d
}]
}
# Quick check
print("Tests completed: " + str(len(de_results)))
print("Raw p < 0.05: " + str(count(de_results, |r| r.p_value < 0.05)))
print("Raw p < 0.01: " + str(count(de_results, |r| r.p_value < 0.01)))
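Welch's statistic is worth verifying once from first principles. In a Python sketch (scipy assumed available), the statistic is the mean difference divided by sqrt(s1²/n1 + s2²/n2), with Welch-Satterthwaite degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
tumor = rng.normal(9.0, 1.0, 6)    # log2 expression, 6 tumor samples
normal = rng.normal(7.5, 1.2, 6)   # 6 normal samples

# Welch's t from first principles
vx, vy = tumor.var(ddof=1), normal.var(ddof=1)
nx, ny = len(tumor), len(normal)
se = np.sqrt(vx / nx + vy / ny)
t_manual = (tumor.mean() - normal.mean()) / se
# Welch-Satterthwaite degrees of freedom
df = (vx / nx + vy / ny) ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(tumor, normal, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))  # True True
```

The manual values match scipy's equal_var=False result, which uses exactly this formula.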
Section 5: Multiple Testing Correction (FDR)
With thousands of tests, raw p-values are unreliable. At a 5% threshold across 10,000 truly null tests, you expect about 500 false positives. FDR correction (Day 12) controls this.
# --- FDR correction (Benjamini-Hochberg) ---
let raw_pvals = de_results |> map(|r| r.p_value)
let adj_pvals = p_adjust(raw_pvals, "BH")
# Add adjusted p-values
for i in 0..len(de_results) {
de_results[i].adj_p = adj_pvals[i]
}
# Identify significant genes
let sig_genes = de_results
|> filter(|r| r.adj_p < CONFIG.fdr_threshold && abs(r.log2fc) > CONFIG.fc_threshold)
|> sort_by(|r| r.adj_p)
let sig_up = sig_genes |> filter(|r| r.log2fc > 0)
let sig_down = sig_genes |> filter(|r| r.log2fc < 0)
print("\n=== Multiple Testing Correction ===")
print("Method: Benjamini-Hochberg (FDR)")
print("FDR threshold: " + str(CONFIG.fdr_threshold))
print("Fold change threshold: |log2FC| > " + str(CONFIG.fc_threshold))
print("")
print("Total DE genes: " + str(len(sig_genes)))
print(" Upregulated in tumor: " + str(len(sig_up)))
print(" Downregulated in tumor: " + str(len(sig_down)))
# Comparison: how many would Bonferroni find?
let bonf_pvals = p_adjust(raw_pvals, "bonferroni")
let sig_bonf = []
for i in 0..len(de_results) {
if bonf_pvals[i] < CONFIG.fdr_threshold &&
abs(de_results[i].log2fc) > CONFIG.fc_threshold {
sig_bonf = sig_bonf + [de_results[i]]
}
}
print("\nBonferroni significant: " + str(len(sig_bonf)))
print("BH recovers more true positives while still controlling the FDR")
Key insight: The choice between Bonferroni and BH matters enormously in genomics. Bonferroni controls the family-wise error rate (FWER) — the probability of even one false positive — and is extremely conservative. BH controls the FDR — the proportion of false positives among all positives — and is the standard for discovery-oriented genomics.
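The BH step-up rule is short enough to write out, which demystifies what p_adjust does. A numpy-only sketch: scale the sorted p-values by m/rank, then enforce monotonicity from the largest rank down:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)          # p(i) * m / i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    adj = np.empty(m)
    adj[order] = np.clip(ranked, 0, 1)                   # back to input order
    return adj

p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216])
print(np.round(bh_adjust(p), 4))
```

On this example, the two smallest p-values survive an FDR cutoff of 0.05, while the third (raw p = 0.039) would pass an uncorrected 0.05 threshold but not the FDR-adjusted one.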
Section 6: Volcano Plot
# --- Volcano plot ---
let all_log2fc = de_results |> map(|r| r.log2fc)
let all_adj_p = de_results |> map(|r| r.adj_p)
# Find top genes for labeling
let top_by_significance = de_results
|> sort_by(|r| r.adj_p)
|> take(CONFIG.n_top_label)
|> map(|r| r.gene)
let volcano_tbl = de_results |> map(|r| {
gene: r.gene, log2fc: r.log2fc, adj_p: r.adj_p
}) |> to_table()
volcano(volcano_tbl,
{fc_threshold: CONFIG.fc_threshold,
p_threshold: CONFIG.fdr_threshold,
title: "Volcano Plot — Tumor vs Normal Colon",
xlabel: "log2 Fold Change",
ylabel: "-log10(FDR-adjusted p-value)"})
print("\n=== Top 10 DE Genes by Significance ===")
print("Gene          log2FC    adj_p       Cohen's d")
print("-" * 55)
for gene in de_results |> sort_by(|r| r.adj_p) |> take(10) {
print(gene.gene + " " +
str(round(gene.log2fc, 2)) + " " +
str(round(gene.adj_p, 6)) + " " +
str(round(gene.cohens_d, 2)))
}
Section 7: Heatmap of Top DE Genes
# --- Heatmap of top 50 DE genes ---
let top_50 = sig_genes |> take(CONFIG.n_top_heatmap)
let top_50_names = top_50 |> map(|g| g.gene)
let top_50_idx = top_50 |> map(|g|
filtered_names |> index_of(g.gene))
let heatmap_data = norm_expr |> select_rows(top_50_idx)
# Z-score normalization per gene (for visualization)
let z_scored = heatmap_data |> map_rows(|row|
let mu = mean(row)
let s = sd(row)
row |> map(|x| if s > 0 { (x - mu) / s } else { 0 })
)
heatmap(z_scored,
{cluster_rows: true,
cluster_cols: true,
color_scale: "red_blue",
title: "Top 50 DE Genes (Z-scored Expression)"})
print("\nHeatmap: Top " + str(CONFIG.n_top_heatmap) +
" DE genes, Z-scored, hierarchically clustered")
Section 8: Gene-Level Visualization
For specific genes of interest, show the individual data points.
# --- Box plots for top DE genes ---
print("\n=== Gene-Level Expression Plots ===")
let genes_of_interest = sig_genes |> take(6) |> map(|g| g.gene)
for gene in genes_of_interest {
let idx = filtered_names |> index_of(gene)
let tumor_vals = norm_expr[idx] |> select(0..6)
let normal_vals = norm_expr[idx] |> select(6..12)
let p = de_results |> find(|r| r.gene == gene)
let bp_table = table({"Tumor": tumor_vals, "Normal": normal_vals})
boxplot(bp_table, {title: gene + " (log2FC=" + str(round(p.log2fc, 2)) +
", FDR=" + str(round(p.adj_p, 4)) + ")"})
}
Section 9: Co-expression Analysis
Do the top DE genes form co-regulated modules? A correlation heatmap of the significant genes reveals co-expression structure.
# --- Correlation among top DE genes ---
let top_100 = sig_genes |> take(100) |> map(|g| g.gene)
let top_100_idx = top_100 |> map(|g| filtered_names |> index_of(g))
let top_100_expr = norm_expr |> select_rows(top_100_idx)
let gene_cor = cor_matrix(top_100_expr)
heatmap(gene_cor,
{cluster_rows: true,
cluster_cols: true,
color_scale: "red_blue",
title: "Co-expression of Top 100 DE Genes"})
# Identify strongly correlated gene pairs
print("\n=== Strong Co-expression Pairs (|r| > 0.9) ===")
let strong_pairs = []
for i in 0..len(top_100) {
for j in (i+1)..len(top_100) {
if abs(gene_cor[i][j]) > 0.9 {
strong_pairs = strong_pairs + [{
gene_a: top_100[i],
gene_b: top_100[j],
r: gene_cor[i][j]
}]
}
}
}
print("Found " + str(len(strong_pairs)) + " strongly correlated pairs")
for pair in strong_pairs |> take(10) {
print(" " + pair.gene_a + " <-> " + pair.gene_b +
" (r = " + str(round(pair.r, 3)) + ")")
}
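The pair scan above is a double loop; with numpy the same search vectorizes over the upper triangle of the correlation matrix. A sketch on toy data (genes 0 and 1 are made near-duplicates so at least one strong pair exists):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy expression: 20 genes x 12 samples
expr = rng.normal(0, 1, size=(20, 12))
expr[1] = expr[0] + rng.normal(0, 0.05, 12)   # gene 1 tracks gene 0

cor = np.corrcoef(expr)                       # gene x gene correlation
iu, ju = np.triu_indices_from(cor, k=1)       # upper triangle, no diagonal
mask = np.abs(cor[iu, ju]) > 0.9
pairs = list(zip(iu[mask], ju[mask], cor[iu, ju][mask]))
for a, b, r in pairs:
    print(f"Gene_{a} <-> Gene_{b} (r = {r:.3f})")
```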
Section 10: Summary Report
# ============================================
# FINAL REPORT
# ============================================
print("\n" + "=" * 65)
print("DIFFERENTIAL EXPRESSION ANALYSIS — FINAL REPORT")
print("Colorectal Tumor (n=6) vs Normal Colon (n=6)")
print("=" * 65)
print("\n--- Data Summary ---")
print("Total genes: " + str(CONFIG.n_genes))
print("Genes after filtering: " + str(len(keep)))
print("Normalization: log2(CPM + 1)")
print("Statistical test: Welch's t-test (per gene)")
print("Multiple testing: Benjamini-Hochberg FDR")
print("Random seed: 42")
print("\n--- Quality Control ---")
print("PCA: PC1 separates tumor vs normal (" +
str(round(qc_pca.variance_explained[0] * 100, 1)) + "% variance)")
print("No outlier samples detected")
print("Within-group correlations > " +
str(round(min(min_within_tumor, min_within_normal), 2)))
print("\n--- Differential Expression ---")
print("Significance criteria: FDR < " + str(CONFIG.fdr_threshold) +
" AND |log2FC| > " + str(CONFIG.fc_threshold))
print("Total DE genes: " + str(len(sig_genes)))
print(" Upregulated in tumor: " + str(len(sig_up)))
print(" Downregulated in tumor: " + str(len(sig_down)))
print("\n--- Top 5 Upregulated Genes ---")
for g in sig_up |> take(5) {
print(" " + g.gene + ": log2FC = " + str(round(g.log2fc, 2)) +
", FDR = " + str(round(g.adj_p, 6)))
}
print("\n--- Top 5 Downregulated Genes ---")
for g in sig_down |> take(5) {
print(" " + g.gene + ": log2FC = " + str(round(g.log2fc, 2)) +
", FDR = " + str(round(g.adj_p, 6)))
}
print("\n--- Co-expression ---")
print("Strong co-expression pairs (|r| > 0.9): " + str(len(strong_pairs)))
# Save results
write_csv(de_results |> sort_by(|r| r.adj_p),
"results/de_results_all.csv")
write_csv(sig_genes,
"results/de_results_significant.csv")
print("\nResults saved to results/")
print("=" * 65)
Python:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
# Load count matrix
counts = pd.read_csv("counts.csv", index_col=0)
# Library sizes
lib_sizes = counts.sum(axis=0)
# Normalize
cpm = counts.div(lib_sizes) * 1e6
log_cpm = np.log2(cpm + 1)
# PCA QC
pca = PCA(n_components=2)
scores = pca.fit_transform(log_cpm.T)
plt.scatter(scores[:6, 0], scores[:6, 1], c='red', label='Tumor')
plt.scatter(scores[6:, 0], scores[6:, 1], c='blue', label='Normal')
plt.legend()
plt.show()
# DE testing
results = []
for gene in log_cpm.index:
    vals = log_cpm.loc[gene].to_numpy()
    tumor, normal = vals[:6], vals[6:]        # columns T1-T6 first, then N1-N6
    t, p = stats.ttest_ind(tumor, normal, equal_var=False)
    fc = tumor.mean() - normal.mean()
    results.append({'gene': gene, 'log2fc': fc, 'pvalue': p})
results = pd.DataFrame(results)
_, results['padj'], _, _ = multipletests(results['pvalue'], method='fdr_bh')
sig = results[(results['padj'] < 0.05) & (results['log2fc'].abs() > 1)]
print(f"Significant DE genes: {len(sig)}")
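One practical note: scipy.stats.ttest_ind accepts an axis argument, so the per-gene loop can be replaced by a single vectorized call, which matters at 15,000 genes. A sketch on simulated data (the real counts.csv is not included here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
log_cpm = rng.normal(6, 1, size=(5000, 12))   # genes x samples (simulated)
log_cpm[:200, :6] += 2.0                      # 200 genes up in tumor (cols 0-5)

tumor, normal = log_cpm[:, :6], log_cpm[:, 6:]
# One call tests every gene: t and p are length-5000 arrays
t, p = stats.ttest_ind(tumor, normal, axis=1, equal_var=False)
log2fc = tumor.mean(axis=1) - normal.mean(axis=1)
print(t.shape, int((p < 0.05).sum()))
```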
R:
library(DESeq2)
library(pheatmap)
library(EnhancedVolcano)
# DESeq2 workflow (gold standard for RNA-seq DE)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ group)
dds <- DESeq(dds)
res <- results(dds, contrast = c("group", "Tumor", "Normal"))
# Volcano plot
EnhancedVolcano(res, lab = rownames(res),
x = 'log2FoldChange', y = 'padj',
pCutoff = 0.05, FCcutoff = 1)
# Heatmap of top 50
sig_genes <- head(res[order(res$padj), ], 50)
vsd <- vst(dds)
pheatmap(assay(vsd)[rownames(sig_genes), ],
scale = "row",
annotation_col = sample_info)
# Summary
summary(res, alpha = 0.05)
Exercises
- Filtering sensitivity. Run the DE analysis with three different filtering thresholds: (a) no filtering, (b) at least 10 counts in 3 samples, (c) at least 50 counts in 6 samples. How does the number of DE genes change? Which filter gives the best balance of discovery and reliability?
# Your code: three filtering thresholds, compare DE counts
- Non-parametric alternative. Replace the Welch t-test with the Wilcoxon rank-sum test for each gene. How many DE genes do you find? Is the Wilcoxon test more or less powerful with n=6 per group?
# Your code: Wilcoxon DE analysis, compare to t-test results
- Permutation-based DE. For the top 20 DE genes (by t-test), run a permutation test (10,000 permutations) to confirm the p-value. Do the permutation p-values agree with the t-test p-values?
# Your code: permutation test for top 20 genes, compare p-values
- Effect size analysis. For all significant DE genes, compute Cohen’s d. Create a scatter plot of |log2FC| vs |Cohen’s d|. Are they correlated? Which genes have high fold change but low effect size (and why)?
# Your code: scatter plot of fold change vs effect size
- Batch effect simulation. Add a batch effect to the data (3 tumor + 3 normal from batch 1, 3 tumor + 3 normal from batch 2; add 1.5 to all gene expression in batch 2). Re-run the analysis. How does PCA change? How many spurious DE genes appear? Remove the batch effect using linear regression on PC1 and re-test.
# Your code: add batch effect, visualize, correct, re-analyze
Key Takeaways
- A complete DE analysis follows a structured pipeline: QC (library sizes, PCA, correlation), filtering, normalization, testing, FDR correction, and visualization.
- PCA is the first quality control step — it reveals outliers, batch effects, and whether the primary source of variation is biological or technical.
- Gene filtering removes lowly expressed genes, reducing the multiple testing burden and removing unreliable measurements.
- Log2(CPM+1) is a simple, effective normalization for DE analysis. More sophisticated methods (TMM, DESeq2’s median-of-ratios) are preferred for production analyses.
- Welch’s t-test per gene with BH FDR correction is a valid DE approach. Dedicated tools (DESeq2, edgeR, limma-voom) use more sophisticated statistical models that borrow information across genes.
- The volcano plot simultaneously shows fold change and significance; the clustered heatmap shows expression patterns across samples for top genes.
- Co-expression analysis reveals gene modules — groups of genes with correlated expression that may share biological functions.
- With only 6 samples per group, statistical power is limited. Genes with moderate true effects may be missed. Power analysis (Day 18) should guide future experimental design.
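The last point about power is easy to quantify by simulation. A sketch (numpy plus scipy; "power" here is the fraction of simulated 6-vs-6 experiments where Welch's test reaches p < 0.05 at a given true effect size):

```python
import numpy as np
from scipy import stats

def welch_power(d, n=6, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo power of Welch's t-test for Cohen's d, n per group."""
    rng = np.random.default_rng(seed)
    x = rng.normal(d, 1, size=(n_sim, n))   # group 1 shifted by d SDs
    y = rng.normal(0, 1, size=(n_sim, n))   # group 2 at baseline
    _, p = stats.ttest_ind(x, y, axis=1, equal_var=False)
    return float((p < alpha).mean())

for d in [0.5, 1.0, 1.5, 2.0]:
    print(f"d = {d}: power ~ {welch_power(d):.2f}")
```

With n = 6 per group, even a one-SD effect is detected only a minority of the time, which is why moderate true effects are routinely missed in small designs.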
What’s Next
Tomorrow is the final capstone — and the most computationally ambitious. You will analyze a genome-wide association study: 500,000 SNPs across 10,000 individuals, testing for association with type 2 diabetes. Quality control, population structure, genome-wide testing, Manhattan plots, Q-Q plots, and effect size interpretation — the full GWAS pipeline, from raw genotypes to publishable results.