Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Day 13: Gene Expression and RNA-seq

DifficultyIntermediate
Biology knowledgeIntermediate (gene expression, RNA-seq workflow, normalization)
Coding knowledgeIntermediate (tables, pipes, lambda functions, statistics)
Time~3 hours
PrerequisitesDays 1-12 completed, BioLang installed (see Appendix A)
Data neededGenerated by init.bl (count matrix + gene lengths)
RequirementsNone (offline)

What You’ll Learn

  • What gene expression is and why it matters
  • How RNA-seq measures expression by counting reads
  • How to work with count matrices (genes x samples)
  • Why normalization is essential and how CPM and TPM work
  • How to perform differential expression analysis between conditions
  • What log2 fold change means and how to interpret it
  • How to correct for multiple testing with Benjamini-Hochberg
  • How to create volcano plots and MA plots

The Problem

A cancer researcher has RNA-seq data from 6 patients — 3 tumor samples and 3 normal. Which genes are overactive in tumors? Which are silenced? Differential expression analysis answers this, but first you need to understand what RNA-seq measures and how to normalize the data.

Today you will work through the full RNA-seq analysis pipeline: from raw count matrices through normalization, differential expression, multiple testing correction, and visualization. The dataset is small (20 genes) so you can trace every calculation, but the techniques scale to 20,000+ genes in real experiments.


What Is Gene Expression?

Every cell in your body has the same DNA, yet a neuron looks and functions nothing like a muscle cell. The difference is gene expression — which genes are turned on and how strongly.

Gene expression: DNA to Protein flow

  • Expression = how much mRNA a gene produces at a given moment.
  • High expression = the gene is active, producing many mRNA copies. Example: GAPDH in most cells.
  • Low or no expression = the gene is silent. Example: hemoglobin genes in skin cells.
  • Differential expression = a gene is more active in one condition than another. Example: an oncogene overexpressed in tumor tissue.

Different cell types, tissues, diseases, and time points produce different expression profiles. Measuring these differences is the goal of RNA-seq.


RNA-seq: Measuring Expression

RNA-seq is the standard technology for measuring gene expression across the genome. The workflow has several steps, each producing a different data format:

RNA-seq workflow

The key idea: the number of reads that map to a gene is proportional to how much mRNA that gene produced. More mRNA means more reads. By counting reads per gene across samples, we build a count matrix — the starting point for all downstream analysis.

But raw counts are not directly comparable:

  • A 10,000 bp gene captures more reads than a 500 bp gene, even at the same expression level (length bias).
  • A sample sequenced to 50 million reads has higher counts than one sequenced to 25 million reads (library size bias).

Normalization removes these biases so we can compare genes and samples fairly.


Count Matrices

A count matrix has genes as rows and samples as columns. Each cell contains the number of reads mapped to that gene in that sample.

# Create a count matrix from records
let counts = [
    {gene: "BRCA1", normal_1: 120, normal_2: 135, normal_3: 128, tumor_1: 340, tumor_2: 380, tumor_3: 355},
    {gene: "TP53",  normal_1: 450, normal_2: 420, normal_3: 440, tumor_1: 890, tumor_2: 920, tumor_3: 850},
    {gene: "GAPDH", normal_1: 5000, normal_2: 5200, normal_3: 4800, tumor_1: 5100, tumor_2: 4900, tumor_3: 5300},
    {gene: "MYC",   normal_1: 80,  normal_2: 75,  normal_3: 85,  tumor_1: 450, tumor_2: 480, tumor_3: 420},
    {gene: "ACTB",  normal_1: 3000, normal_2: 3100, normal_3: 2900, tumor_1: 3050, tumor_2: 2950, tumor_3: 3100},
] |> to_table()

println(f"Genes: {nrow(counts)}")
println(f"Columns: {colnames(counts)}")
println(counts)

Expected output:

Genes: 5
Columns: ["gene", "normal_1", "normal_2", "normal_3", "tumor_1", "tumor_2", "tumor_3"]
gene    normal_1  normal_2  normal_3  tumor_1  tumor_2  tumor_3
BRCA1   120       135       128       340      380      355
TP53    450       420       440       890      920      850
GAPDH   5000      5200      4800      5100     4900     5300
MYC     80        75        85        450      480      420
ACTB    3000      3100      2900      3050     2950     3100

In practice, count matrices come from tools like featureCounts or HTSeq, and you would load them from a CSV file:

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# requires: data/counts.csv in working directory (run init.bl first)
let counts = csv("data/counts.csv")
println(f"Genes: {nrow(counts)}")
println(f"Samples: {ncol(counts) - 1}")
println(counts |> head(5))

Expected output:

Genes: 20
Samples: 6
gene    normal_1  normal_2  normal_3  tumor_1  tumor_2  tumor_3
BRCA1   120       135       128       340      380      355
TP53    450       420       440       890      920      850
GAPDH   5000      5200      4800      5100     4900     5300
MYC     80        75        85        450      480      420
ACTB    3000      3100      2900      3050     2950     3100

Normalization: Why and How

The Problem

Imagine two genes:

Gene length bias in read counts

Gene A has 4x more reads than Gene B, but it is also 20x longer. Per unit length, Gene B is actually expressed at a higher level. Raw counts are misleading.

Similarly, if Sample X was sequenced to 50 million reads and Sample Y to 25 million reads, every gene in Sample X will have roughly double the counts — not because expression is higher, but because of sequencing depth.

CPM: Counts Per Million

CPM corrects for library size (total number of reads per sample). It answers: “Out of every million reads, how many mapped to this gene?”

Formula: CPM = (count / total reads in sample) x 1,000,000

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# CPM normalization
# requires: data/counts.csv in working directory
let counts = csv("data/counts.csv")
let normalized_cpm = cpm(counts)
println("CPM normalized (first 5 genes):")
println(normalized_cpm |> head(5))

Expected output:

CPM normalized (first 5 genes):
gene    normal_1    normal_2    normal_3    tumor_1    tumor_2    tumor_3
BRCA1   5765.2      6311.5      6111.5      14475.9    16174.9    15191.3
TP53    21619.5     19630.3     21002.4     37889.0    39163.5    36369.3
GAPDH   240217.1    243034.7    229095.5    217107.5   208617.1   226786.8
MYC     3843.3      3505.5      4057.6      19156.6    20432.3    17972.8
ACTB    144130.2    144875.9    138431.6    129848.3   125558.6   132638.9

CPM is good for comparing the same gene across samples but does not account for gene length.

TPM: Transcripts Per Million

TPM corrects for both gene length and library size. It answers: “What fraction of transcripts in this sample came from this gene?”

Steps:

  1. Divide each count by gene length (in kilobases) to get reads per kilobase (RPK).
  2. Sum all RPK values in the sample.
  3. Divide each RPK by the sum and multiply by 1,000,000.

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# TPM normalization (needs gene lengths)
# requires: data/counts.csv, data/gene_lengths.csv in working directory
let counts = csv("data/counts.csv")
let gene_lengths = csv("data/gene_lengths.csv")
let normalized_tpm = tpm(counts, gene_lengths)
println("TPM normalized (first 5 genes):")
println(normalized_tpm |> head(5))

Expected output:

TPM normalized (first 5 genes):
gene    normal_1    normal_2    normal_3    tumor_1    tumor_2    tumor_3
BRCA1   3214.8      3518.9      3401.5      8150.2     9116.3     8550.1
TP53    26971.3     24483.4     26179.1     47620.3    48967.0    45584.2
GAPDH   238641.5    241543.0    226895.7    216413.8   207590.7   225700.3
MYC     9607.1      8759.6      10130.4     48220.9    51524.5    45301.2
ACTB    143201.7    143969.6    137264.8    129348.5   124933.4   131946.3

TPM is preferred for most analyses because it accounts for gene length. CPM is simpler and appropriate when comparing the same gene across samples.

FPKM/RPKM (Older Methods)

FPKM (Fragments Per Kilobase of transcript per Million mapped reads) and RPKM (Reads Per Kilobase per Million) were early normalization methods. They divide by library size first, then by gene length. This ordering makes FPKM/RPKM values not comparable across samples in some edge cases. TPM fixes this problem by normalizing in the opposite order. You may encounter FPKM in older datasets, but use TPM for new analyses.


Exploratory Analysis

Before differential expression, inspect your data for obvious problems.

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# requires: data/counts.csv in working directory
let counts = csv("data/counts.csv")

# Check library sizes (total reads per sample)
let samples = ["normal_1", "normal_2", "normal_3", "tumor_1", "tumor_2", "tumor_3"]
let sample_sums = samples
    |> map(|s| {sample: s, total: col(counts, s) |> sum()})
    |> to_table()
println("Library sizes:")
println(sample_sums)

Expected output:

Library sizes:
sample    total
normal_1  20813
normal_2  21389
normal_3  20958
tumor_1   23486
tumor_2   23497
tumor_3   23376

Library sizes should be roughly similar. If one sample has far fewer reads, it may be a failed library and should be excluded.

# Mean expression per gene across conditions
let gene_means = counts
    |> mutate("normal_mean", |r| round((r.normal_1 + r.normal_2 + r.normal_3) / 3.0, 1))
    |> mutate("tumor_mean", |r| round((r.tumor_1 + r.tumor_2 + r.tumor_3) / 3.0, 1))
    |> select("gene", "normal_mean", "tumor_mean")
println("Mean expression per gene:")
println(gene_means)

Expected output:

Mean expression per gene:
gene      normal_mean  tumor_mean
BRCA1     127.7        358.3
TP53      436.7        886.7
GAPDH     5000.0       5100.0
MYC       80.0         450.0
ACTB      3000.0       3033.3
VEGFA     200.0        620.0
EGFR      310.0        780.0
CDH1      520.0        155.0
RB1       380.0        115.0
PTEN      290.0        90.0
APC       150.0        50.0
KRAS      95.0         420.0
HER2      60.0         540.0
BCL2      340.0        120.0
CDKN2A    260.0        70.0
MDM2      180.0        500.0
PIK3CA    110.0        370.0
TERT      15.0         310.0
IL6       45.0         380.0
TNF       55.0         120.0

Genes like GAPDH and ACTB show similar expression in both conditions — they are housekeeping genes. Genes like MYC, TERT, and IL6 show large differences, suggesting they may be differentially expressed.


Differential Expression

Differential expression analysis identifies genes whose expression differs significantly between two conditions. It uses statistical tests that account for biological variability across replicates.

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# requires: data/counts.csv in working directory
let counts = csv("data/counts.csv")

# Run differential expression analysis
let de_results = diff_expr(counts,
    control: ["normal_1", "normal_2", "normal_3"],
    treatment: ["tumor_1", "tumor_2", "tumor_3"]
)
println(f"DE results: {nrow(de_results)} genes")
println(de_results |> head(5))

Expected output:

DE results: 20 genes
gene    log2fc    pvalue      padj        mean_ctrl  mean_treat
TERT    4.37      0.000012    0.000240    15.0       310.0
MYC     2.49      0.000035    0.000350    80.0       450.0
HER2    3.17      0.000041    0.000273    60.0       540.0
IL6     3.08      0.000058    0.000290    45.0       380.0
KRAS    2.14      0.000089    0.000356    95.0       420.0

The result table includes:

  • log2fc: log2 fold change (positive = higher in treatment/tumor)
  • pvalue: raw p-value from the statistical test
  • padj: p-value adjusted for multiple testing (Benjamini-Hochberg)
  • mean_ctrl: mean expression in control samples
  • mean_treat: mean expression in treatment samples
# Filter significant results
let significant = de_results
    |> filter(|r| r.padj < 0.05 and abs(r.log2fc) > 1.0)
    |> arrange("padj")

println(f"\nSignificant DE genes (|log2FC| > 1, padj < 0.05):")
println(significant)

# Count up vs down regulated
let up = significant |> filter(|r| r.log2fc > 0) |> nrow()
let down = significant |> filter(|r| r.log2fc < 0) |> nrow()
println(f"Upregulated in tumor: {up}")
println(f"Downregulated in tumor: {down}")

Expected output:

Significant DE genes (|log2FC| > 1, padj < 0.05):
gene      log2fc    pvalue      padj        mean_ctrl  mean_treat
TERT      4.37      0.000012    0.000240    15.0       310.0
HER2      3.17      0.000041    0.000273    60.0       540.0
IL6       3.08      0.000058    0.000290    45.0       380.0
MYC       2.49      0.000035    0.000350    80.0       450.0
KRAS      2.14      0.000089    0.000356    95.0       420.0
PIK3CA    1.75      0.000150    0.000500    110.0      370.0
MDM2      1.47      0.000210    0.000600    180.0      500.0
VEGFA     1.63      0.000180    0.000514    200.0      620.0
EGFR      1.33      0.000320    0.000800    310.0      780.0
TP53      1.02      0.000450    0.001000    436.7      886.7
CDKN2A    -1.89     0.000095    0.000380    260.0      70.0
APC       -1.58     0.000120    0.000400    150.0      50.0
PTEN      -1.69     0.000110    0.000393    290.0      90.0
CDH1      -1.75     0.000085    0.000356    520.0      155.0
RB1       -1.72     0.000130    0.000433    380.0      115.0
BCL2      -1.50     0.000200    0.000571    340.0      120.0

Upregulated in tumor: 10
Downregulated in tumor: 6

The upregulated genes (MYC, TERT, HER2, KRAS, EGFR, VEGFA) are well-known oncogenes. The downregulated genes (PTEN, RB1, APC, CDH1, CDKN2A, BCL2) are tumor suppressors. This pattern is biologically consistent with cancer biology.


Fold Change

Fold change measures how much a gene’s expression changes between conditions. We use the log2 scale because it makes increases and decreases symmetric:

log2FCFold changeInterpretation
01x (no change)Same expression in both conditions
12x increaseTwice as high in treatment
24x increaseFour times as high
38x increaseEight times as high
-12x decreaseHalf as much in treatment
-24x decreaseQuarter as much
-38x decreaseOne-eighth as much

On the linear scale, a 2x increase is +100% but a 2x decrease is only -50%. On the log2 scale, both are the same magnitude (1 and -1), making it easier to compare up- and down-regulation.

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# Manual fold change calculation
# requires: data/counts.csv in working directory
let counts = csv("data/counts.csv")

let fc_table = counts
    |> mutate("normal_mean", |r| (r.normal_1 + r.normal_2 + r.normal_3) / 3.0)
    |> mutate("tumor_mean", |r| (r.tumor_1 + r.tumor_2 + r.tumor_3) / 3.0)
    |> mutate("log2fc", |r| log2(r.tumor_mean / r.normal_mean))
    |> select("gene", "normal_mean", "tumor_mean", "log2fc")

println("Fold changes:")
println(fc_table |> head(10))

Expected output:

Fold changes:
gene    normal_mean  tumor_mean  log2fc
BRCA1   127.7        358.3       1.49
TP53    436.7        886.7       1.02
GAPDH   5000.0       5100.0      0.03
MYC     80.0         450.0       2.49
ACTB    3000.0       3033.3      0.02
VEGFA   200.0        620.0       1.63
EGFR    310.0        780.0       1.33
CDH1    520.0        155.0       -1.75
RB1     380.0        115.0       -1.72
PTEN    290.0        90.0        -1.69

Notice: GAPDH and ACTB have log2FC near 0 (housekeeping genes, stable expression). MYC has log2FC = 2.49, meaning it is about 5.6x higher in tumors. CDH1 has log2FC = -1.75, meaning it is about 3.4x lower in tumors (a tumor suppressor being silenced).


Visualization

Volcano Plot

The volcano plot is the classic differential expression visualization. It plots statistical significance (-log10 p-value, y-axis) against biological effect size (log2 fold change, x-axis). Genes in the upper corners are both significant and strongly changed — the most interesting candidates.

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# requires: data/counts.csv in working directory
let counts = csv("data/counts.csv")
let de_results = diff_expr(counts,
    control: ["normal_1", "normal_2", "normal_3"],
    treatment: ["tumor_1", "tumor_2", "tumor_3"]
)

# Basic volcano plot
volcano(de_results)

# With thresholds highlighted
volcano(de_results, fc_threshold: 1.0, p_threshold: 0.05, title: "Tumor vs Normal")

The plot marks genes as:

  • Red (upper right): significantly upregulated (high log2FC, low p-value)
  • Blue (upper left): significantly downregulated (negative log2FC, low p-value)
  • Gray (center/bottom): not significant or small effect

MA Plot

The MA plot shows the relationship between average expression (x-axis) and fold change (y-axis). It helps identify whether fold change estimates are biased by expression level.

# MA plot
ma_plot(de_results)

In a well-behaved experiment, the cloud of points should be centered on log2FC = 0 across all expression levels. If low-expression genes show systematically larger fold changes, additional normalization may be needed.


Multiple Testing Correction

When you test 20,000 genes for differential expression at p < 0.05, you expect 1,000 false positives purely by chance (0.05 x 20,000 = 1,000). Multiple testing correction adjusts p-values to control the false discovery rate.

The Benjamini-Hochberg method is the standard correction. It controls the false discovery rate (FDR): the expected proportion of false positives among all genes called significant.

# Why correction matters
let raw_pvals = [0.001, 0.01, 0.03, 0.04, 0.049, 0.06, 0.1]
let adjusted = p_adjust(raw_pvals, "BH")
println("Raw vs Adjusted p-values:")
for i in range(0, len(raw_pvals)) {
    println(f"  {raw_pvals[i]} -> {round(adjusted[i], 4)}")
}

Expected output:

Raw vs Adjusted p-values:
  0.001 -> 0.007
  0.01 -> 0.035
  0.03 -> 0.07
  0.04 -> 0.07
  0.049 -> 0.0686
  0.06 -> 0.07
  0.1 -> 0.1

Notice how some p-values that were below 0.05 (raw) become above 0.05 after correction. This removes likely false positives.

Rules of thumb:

  • Always use adjusted p-values (padj) when testing many genes.
  • FDR < 0.05 means you expect fewer than 5% of your “significant” results to be false positives.
  • FDR < 0.01 is a more stringent threshold for high-confidence results.
  • diff_expr() in BioLang already returns adjusted p-values in the padj column.

Complete RNA-seq Pipeline

Putting it all together into a single script:

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

# Complete RNA-seq Differential Expression Pipeline
# requires: data/counts.csv, data/gene_lengths.csv in working directory

println("=== RNA-seq Differential Expression Pipeline ===\n")

# Step 1: Load data
let counts = csv("data/counts.csv")
println(f"1. Loaded {nrow(counts)} genes x {ncol(counts) - 1} samples")

# Step 2: Check library sizes
let samples = ["normal_1", "normal_2", "normal_3", "tumor_1", "tumor_2", "tumor_3"]
let lib_sizes = samples
    |> map(|s| {sample: s, total: col(counts, s) |> sum()})
    |> to_table()
println("2. Library sizes:")
println(lib_sizes)

# Step 3: Normalize
let gene_lengths = csv("data/gene_lengths.csv")
let norm = tpm(counts, gene_lengths)
println(f"3. TPM normalization complete")

# Step 4: Differential expression
let de = diff_expr(counts,
    control: ["normal_1", "normal_2", "normal_3"],
    treatment: ["tumor_1", "tumor_2", "tumor_3"]
)

# Step 5: Filter significant
let sig = de
    |> filter(|r| r.padj < 0.05 and abs(r.log2fc) > 1.0)
    |> arrange("padj")

let up = sig |> filter(|r| r.log2fc > 0) |> nrow()
let down = sig |> filter(|r| r.log2fc < 0) |> nrow()
println(f"4. Significant: {nrow(sig)} genes ({up} up, {down} down)")

# Step 6: Show top results
println("\n   Top upregulated:")
let top_up = sig |> filter(|r| r.log2fc > 0) |> head(5)
println(top_up)

println("\n   Top downregulated:")
let top_down = sig |> filter(|r| r.log2fc < 0) |> head(5)
println(top_down)

# Step 7: Visualize
println("\n5. Generating volcano plot...")
volcano(de, fc_threshold: 1.0, p_threshold: 0.05, title: "Tumor vs Normal DE")

# Step 8: Export
write_csv(sig, "results/significant_genes.csv")
println(f"6. Results saved: results/significant_genes.csv")
println("\n=== Pipeline complete ===")

Exercises

  1. Build a count matrix. Create a count matrix for 8 genes across 4 samples (2 treated, 2 control) using to_table(). Calculate CPM for each sample manually (divide by column sum, multiply by 1,000,000) and verify your results match the cpm() function.

  2. Compute fold change. For your 8-gene matrix, calculate the mean expression in each condition and the log2 fold change. Which genes have the largest positive fold change? Which have the largest negative?

  3. Differential expression. Load data/counts.csv and run diff_expr(). How many genes have |log2FC| > 2? What are they? Why might a stricter threshold (|log2FC| > 2) be preferred over |log2FC| > 1?

  4. Volcano plot interpretation. Generate a volcano plot from the differential expression results. Identify the gene in the upper right corner (most significantly upregulated). Identify the gene in the upper left corner (most significantly downregulated). What are their biological roles?

  5. Multiple testing. Generate a list of 100 random p-values between 0 and 1. Apply Benjamini-Hochberg correction with p_adjust(). How many are significant at raw p < 0.05? How many remain significant at adjusted p < 0.05? What does this tell you about false positives?


Key Takeaways

  • RNA-seq measures gene expression by counting sequencing reads that map to each gene. More reads = higher expression.
  • Raw counts need normalization. CPM corrects for library size (sequencing depth). TPM corrects for both gene length and library size. Use TPM for cross-gene comparisons.
  • Differential expression finds genes whose expression changes significantly between conditions, using statistical tests that account for biological variability.
  • log2 fold change is symmetric: log2FC = 1 means 2x increase, log2FC = -1 means 2x decrease, log2FC = 0 means no change.
  • Always correct for multiple testing. Testing 20,000 genes at p < 0.05 generates about 1,000 false positives by chance. Benjamini-Hochberg correction controls the false discovery rate.
  • Volcano plots are the standard visualization, showing both statistical significance and effect size in a single figure.

What’s Next

Tomorrow: statistics for bioinformatics — hypothesis testing, p-values, and when to use which test. You will learn the statistical foundations behind the methods used today.