Day 16: Pathway and Enrichment Analysis


Difficulty	Intermediate
Biology knowledge	Intermediate (gene function, pathways, ontologies)
Coding knowledge	Intermediate (tables, pipes, lambda functions, maps)
Time	~3 hours
Prerequisites	Days 1-15 completed, BioLang installed (see Appendix A)
Data needed	Generated by `init.bl` (GMT file, DE results, ranked genes)
Requirements	Internet connection for API sections (GO, KEGG, Reactome, STRING)

What You’ll Learn

Why enrichment analysis is the bridge between gene lists and biological meaning
How Over-Representation Analysis (ORA) uses Fisher’s exact test to find enriched terms
How Gene Set Enrichment Analysis (GSEA) uses ranked lists to detect subtle coordinated shifts
How to read GMT files and query GO, KEGG, Reactome, and STRING databases
How to build interaction networks from your gene lists
How to run a complete enrichment pipeline from DE results to biological interpretation

Differential expression gave you 500 significantly changed genes. But what do they mean together? Are they all in the same pathway? Do they share a function? A list of gene names is not biology — it is a phone book. You need to ask: “Is this gene list enriched for a particular biological process?”

Enrichment analysis answers that question. It takes your gene list and asks whether any known biological category — a pathway, a cellular function, a disease association — appears more often than expected by chance. This is how you go from “500 genes changed” to “the DNA damage response is activated.”

What Is Enrichment Analysis?

Think of it as a marble analogy. You have a bag with 1000 marbles: 100 red, 900 blue. You pull 50 marbles at random. You would expect about 5 red ones (10%). But you pulled 25 red marbles. Red is “enriched” in your draw — something non-random is going on.

The same logic applies to genes. Your genome has ~20,000 genes. Only 200 are annotated as “DNA repair.” Your DE list has 500 genes. If 40 of them are DNA repair genes, that is far more than the ~5 you would expect by chance. DNA repair is enriched.

Gene set enrichment workflow

Two Approaches

There are two main strategies for enrichment analysis, and they answer slightly different questions.

ORA (Over-Representation Analysis): Binary. A gene is either “in the list” or “not in the list.” You define a cutoff (e.g., padj < 0.05 and |log2FC| > 1), take the genes that pass, and ask whether any gene set is over-represented. Uses Fisher’s exact test (hypergeometric distribution). Fast and intuitive, but throws away information — a gene with padj = 0.049 is “in” and padj = 0.051 is “out.”

GSEA (Gene Set Enrichment Analysis): Ranked. Uses all genes ranked by their fold change (or any other metric). Walks down the ranked list, computing a running sum that increases when it encounters a gene in the set and decreases otherwise. Detects subtle coordinated shifts that ORA misses — a pathway where every gene shifts slightly might not produce any single significant hit, but GSEA catches the collective movement.

Feature	ORA	GSEA
Input	Gene list (binary)	Ranked gene list (all genes)
Test	Hypergeometric / Fisher	Running sum, permutation
Cutoff needed?	Yes	No
Detects subtle shifts?	No	Yes
Speed	Fast	Slower (permutations)

Gene Set Databases

Before running enrichment, you need gene set databases — curated collections that group genes by shared function, pathway, or property.

Gene Ontology (GO)

The most widely used annotation system. Organizes gene function into three namespaces:

Biological Process (BP): What the gene does in the cell (e.g., “DNA repair,” “apoptotic process”)
Molecular Function (MF): The biochemical activity (e.g., “kinase activity,” “DNA binding”)
Cellular Component (CC): Where in the cell the product acts (e.g., “nucleus,” “mitochondrion”)

GO is a directed acyclic graph: terms are linked from specific to general. “Base excision repair” is a child of “DNA repair,” which is a child of “response to DNA damage.”

KEGG

The Kyoto Encyclopedia of Genes and Genomes. Focuses on metabolic and signaling pathways drawn as maps. KEGG pathways show how proteins interact in specific processes (e.g., “p53 signaling pathway,” “cell cycle”). Good for understanding mechanism.

Reactome

A curated, peer-reviewed pathway database. Pathways are organized hierarchically and linked to specific reactions. More detailed than KEGG for signaling cascades and immune pathways.

MSigDB Hallmark Gene Sets

The Molecular Signatures Database curates gene sets for computational biology. The “Hallmark” collection contains 50 well-defined gene sets representing specific biological states and processes (e.g., “HALLMARK_DNA_REPAIR,” “HALLMARK_P53_PATHWAY,” “HALLMARK_INFLAMMATORY_RESPONSE”). These are particularly useful for cancer biology.

Reading Gene Sets

Gene sets are commonly distributed in GMT (Gene Matrix Transposed) format. Each line is a gene set: name, description, then gene symbols separated by tabs.

# Load gene sets from a GMT file
let gene_sets = read_gmt("data/hallmark.gmt")
println(f"Gene sets loaded: {len(gene_sets)}")

# gene_sets is a Map: set_name -> List of gene symbols
# Examine a specific set
let dna_repair = gene_sets["HALLMARK_DNA_REPAIR"]
println(f"DNA repair genes: {len(dna_repair)}")
println(f"First 5: {dna_repair |> take(5)}")

Expected output:

Gene sets loaded: 8
DNA repair genes: 15
First 5: [BRCA1, BRCA2, RAD51, ATM, ATR]

The read_gmt() function returns a Map where each key is a gene set name and each value is a list of gene symbols. This is the format that enrich() and gsea() expect.

Over-Representation Analysis (ORA)

ORA asks: “Are my DE genes enriched for any gene set?” It uses the hypergeometric test, which is the exact probability of drawing at least k successes from a population of size N containing K successes, when drawing n items.

The enrich() function takes three arguments: your gene list, the gene sets map, and the background size (total number of genes in the genome).

# Define DE genes (from a differential expression experiment)
let de_genes = ["BRCA1", "RAD51", "ATM", "CHEK2", "TP53", "MDM2",
                "CDKN1A", "EGFR", "KRAS", "MYC", "BCL2", "BAX",
                "CASP3", "CASP9", "PTEN", "RB1", "E2F1", "CDK4"]

# Load gene sets
let gene_sets = read_gmt("data/hallmark.gmt")

# Run ORA with background size of 20,000 (approximate human gene count)
let results = enrich(de_genes, gene_sets, 20000)
println(f"Total terms tested: {nrow(results)}")

# Filter for significant results and sort by FDR
let sig = results |> filter(|r| r.fdr < 0.05) |> arrange("fdr")
println(f"\nSignificant terms (FDR < 0.05): {nrow(sig)}")
println(sig)

Expected output:

Total terms tested: 8

Significant terms (FDR < 0.05): 3
term                     overlap  p_value   fdr       genes
HALLMARK_P53_PATHWAY     6        0.00001   0.00008   TP53,MDM2,CDKN1A,BAX,PTEN,RB1
HALLMARK_DNA_REPAIR      4        0.00023   0.00092   BRCA1,RAD51,ATM,CHEK2
HALLMARK_APOPTOSIS       4        0.00031   0.00083   BCL2,BAX,CASP3,CASP9

The output table has five columns:

term: the gene set name
overlap: how many of your genes are in this set
p_value: raw hypergeometric p-value
fdr: Benjamini-Hochberg adjusted p-value
genes: which of your genes overlapped

Gene Set Enrichment Analysis (GSEA)

GSEA does not use a cutoff. Instead, it takes a table of all genes ranked by a score (typically log2 fold change) and asks whether genes in a set tend to cluster at the top or bottom of the ranked list.

The gsea() function takes two arguments: a table with “gene” and “score” columns, and the gene sets map.

# Load the full ranked gene list (all genes, not just significant ones)
let ranked = csv("data/ranked_genes.csv")
println(f"Total ranked genes: {nrow(ranked)}")
println(ranked |> head(5))

# Load gene sets
let gene_sets = read_gmt("data/hallmark.gmt")

# Run GSEA
let gsea_results = gsea(ranked, gene_sets)
println(f"\nGSEA results: {nrow(gsea_results)}")

# Filter for significant results
let gsea_sig = gsea_results |> filter(|r| r.fdr < 0.25)
println(f"Significant terms (FDR < 0.25): {nrow(gsea_sig)}")
println(gsea_sig)

Expected output:

Total ranked genes: 100
gene    score
EGFR    3.12
ERBB2   2.91
KRAS    2.67
CDKN2A  2.53
BRCA1   2.45

GSEA results: 8
Significant terms (FDR < 0.25): 4
term                       es      nes     p_value  fdr     leading_edge
HALLMARK_P53_PATHWAY       0.72    1.85    0.001    0.004   TP53,MDM2,CDKN1A,BAX,PTEN,RB1
HALLMARK_DNA_REPAIR        0.68    1.72    0.003    0.008   BRCA1,RAD51,ATM,CHEK2,ATR
HALLMARK_APOPTOSIS         0.55    1.41    0.012    0.032   BCL2,BAX,CASP3,CASP9
HALLMARK_CELL_CYCLE       -0.48    -1.23   0.045    0.12    CDK4,E2F1,CCND1,CDK2

The GSEA output table has six columns:

term: the gene set name
es: enrichment score (positive = enriched at top of ranked list, negative = enriched at bottom)
nes: normalized enrichment score (ES normalized to null distribution)
p_value: permutation-based p-value
fdr: Benjamini-Hochberg adjusted p-value
leading_edge: the genes driving the enrichment signal

GO Term Analysis

The Gene Ontology provides structured annotations for every gene. You can look up what a term means and what annotations a protein has.

# requires: internet connection
# Look up what a GO term means
let term = go_term("GO:0006281")
println(f"Term: {term.name}")
println(f"Namespace: {term.aspect}")
println(f"Definition: {term.definition}")

Expected output:

Term: DNA repair
Namespace: biological_process
Definition: The process of restoring DNA after damage...

The go_term() function returns a record with fields: id, name, aspect, definition, is_obsolete.

# requires: internet connection
# Get GO annotations for a protein (using UniProt accession)
let annotations = go_annotations("P38398")  # BRCA1
println(f"Total annotations: {len(annotations)}")

# Classify by namespace
let bp = annotations |> filter(|a| a.aspect == "biological_process")
let mf = annotations |> filter(|a| a.aspect == "molecular_function")
let cc = annotations |> filter(|a| a.aspect == "cellular_component")
println(f"Biological processes: {len(bp)}")
println(f"Molecular functions: {len(mf)}")
println(f"Cellular components: {len(cc)}")

# Show biological process annotations
for a in bp |> take(5) {
    println(f"  {a.go_id}: {a.go_name} [{a.evidence}]")
}

Expected output:

Total annotations: 25
Biological processes: 12
Molecular functions: 8
Cellular components: 5
  GO:0006281: DNA repair [IDA]
  GO:0006302: double-strand break repair [IDA]
  GO:0006974: cellular response to DNA damage stimulus [IEA]
  GO:0010165: response to X-ray [IMP]
  GO:0045893: positive regulation of transcription [IDA]

Each annotation record has fields: go_id, go_name, aspect, evidence, gene_product_id.

KEGG Pathway Analysis

KEGG provides metabolic and signaling pathway maps. You can search for pathways and retrieve their details.

# requires: internet connection
# Search for DNA repair pathways
let kegg_result = kegg_find("pathway", "DNA repair")
println(f"DNA repair pathways found: {len(kegg_result)}")
for entry in kegg_result |> take(5) {
    println(f"  {entry.id}: {entry.description}")
}

Expected output:

DNA repair pathways found: 4
  hsa03410: Base excision repair
  hsa03420: Nucleotide excision repair
  hsa03430: Mismatch repair
  hsa03440: Homologous recombination

# requires: internet connection
# Get details for a specific pathway
let pathway = kegg_get("hsa03410")  # Base excision repair
println(pathway)

Expected output:

ENTRY       hsa03410                    Pathway
NAME        Base excision repair - Homo sapiens (human)
...

The kegg_find() function takes a database name (“pathway”, “genes”, “compound”) and a search query. It returns a list of records with id and description fields. The kegg_get() function returns the raw KEGG flat-file text for an entry.

You can also use kegg_link() to find cross-references between KEGG databases:

# requires: internet connection
# Find genes linked to a pathway
let genes_in_pathway = kegg_link("genes", "hsa03410")
println(f"Genes in base excision repair: {len(genes_in_pathway)}")

Reactome Pathways

Reactome provides curated biological pathway data. You can look up which pathways a gene participates in.

# requires: internet connection
# Find pathways for BRCA1
let pathways = reactome_pathways("BRCA1")
println(f"BRCA1 pathways: {len(pathways)}")
for p in pathways |> take(5) {
    println(f"  [{p.id}] {p.name}")
}

Expected output:

BRCA1 pathways: 12
  [R-HSA-73894] DNA Repair
  [R-HSA-5685942] HDR through Homologous Recombination (HRR)
  [R-HSA-5693532] DNA Double-Strand Break Repair
  [R-HSA-69473] G2/M DNA damage checkpoint
  [R-HSA-73886] Chromosome Maintenance

Each pathway record has fields: id, name, species.

# requires: internet connection
# Search Reactome for a topic
let results = reactome_search("apoptosis")
println(f"Apoptosis entries: {len(results)}")
for r in results |> take(3) {
    println(f"  [{r.id}] {r.name} ({r.species})")
}

Visualizing Enrichment Results

A bar chart of the top enriched terms is the standard visualization for enrichment results.

# Visualize top enriched terms from ORA
let gene_sets = read_gmt("data/hallmark.gmt")
let de_genes = ["BRCA1", "RAD51", "ATM", "CHEK2", "TP53", "MDM2",
                "CDKN1A", "EGFR", "KRAS", "MYC", "BCL2", "BAX",
                "CASP3", "CASP9", "PTEN", "RB1", "E2F1", "CDK4"]

let results = enrich(de_genes, gene_sets, 20000)
let top_terms = results
    |> filter(|r| r.fdr < 0.05)
    |> arrange("fdr")
    |> head(10)

# Create a bar chart of overlap counts
let chart_data = top_terms |> map(|r| {category: r.term, count: r.overlap})
bar_chart(chart_data)

Expected output:

HALLMARK_P53_PATHWAY   ██████████████████████████████ 6
HALLMARK_DNA_REPAIR    ████████████████████ 4
HALLMARK_APOPTOSIS     ████████████████████ 4

Network Context with STRING

Your enriched genes do not act in isolation. STRING is a database of known and predicted protein-protein interactions. You can build an interaction network from your gene list to see how they connect.

# requires: internet connection
# Get protein interactions for DNA repair genes
let dna_repair_genes = ["BRCA1", "RAD51", "ATM", "CHEK2", "TP53"]
let network = string_network(dna_repair_genes, 9606)  # 9606 = Homo sapiens
println(f"Interactions found: {len(network)}")

for edge in network |> take(5) {
    println(f"  {edge.protein_a} -- {edge.protein_b} (score: {edge.score})")
}

Expected output:

Interactions found: 8
  BRCA1 -- RAD51 (score: 0.999)
  BRCA1 -- ATM (score: 0.998)
  BRCA1 -- CHEK2 (score: 0.997)
  ATM -- TP53 (score: 0.999)
  ATM -- CHEK2 (score: 0.999)

Each interaction record has fields: protein_a, protein_b, score.

You can build a graph from these interactions to analyze network properties:

# requires: internet connection
# Build a graph from STRING interactions
let dna_repair_genes = ["BRCA1", "RAD51", "ATM", "CHEK2", "TP53"]
let network = string_network(dna_repair_genes, 9606)

let g = graph()
for edge in network {
    let g = add_edge(g, edge.protein_a, edge.protein_b)
}
println(f"Nodes: {node_count(g)}, Edges: {edge_count(g)}")

# Find the most connected gene (highest degree)
let gene_nodes = nodes(g)
for gene in gene_nodes {
    println(f"  {gene}: {degree(g, gene)} connections")
}

Expected output:

Nodes: 5, Edges: 8
  ATM: 4 connections
  BRCA1: 3 connections
  TP53: 3 connections
  CHEK2: 2 connections
  RAD51: 2 connections

The most connected node (highest degree) is often a hub gene — a central regulator in the pathway. In this case, ATM is the hub: it is a kinase that phosphorylates both CHEK2 and TP53 in the DNA damage response.

Complete Enrichment Pipeline

Here is a full pipeline that takes DE results, runs both ORA and GSEA, queries pathway databases, and exports the results.

# Complete Pathway Enrichment Pipeline
# Requires: data/de_results.csv, data/hallmark.gmt, data/ranked_genes.csv
# (run init.bl first)

println("=== Enrichment Analysis Pipeline ===\n")

# Step 1: Load DE results and extract significant genes
let de = csv("data/de_results.csv")
println(f"1. Total genes in DE results: {nrow(de)}")

let sig_genes = de
    |> filter(|r| r.padj < 0.05 and abs(r.log2fc) > 1.0)
    |> col("gene")
    |> collect()
println(f"   Significant DE genes (|log2FC| > 1, padj < 0.05): {len(sig_genes)}")

# Step 2: Load gene sets
let gene_sets = read_gmt("data/hallmark.gmt")
println(f"\n2. Gene sets loaded: {len(gene_sets)}")

# Step 3: Over-Representation Analysis
let ora_results = enrich(sig_genes, gene_sets, 20000)
let ora_sig = ora_results |> filter(|r| r.fdr < 0.05) |> arrange("fdr")
println(f"\n3. ORA results:")
println(f"   Terms tested: {nrow(ora_results)}")
println(f"   Significant (FDR < 0.05): {nrow(ora_sig)}")
println(ora_sig |> head(5))

# Step 4: Gene Set Enrichment Analysis
let ranked = csv("data/ranked_genes.csv")
let gsea_results = gsea(ranked, gene_sets)
let gsea_sig = gsea_results |> filter(|r| r.fdr < 0.25)
println(f"\n4. GSEA results:")
println(f"   Terms tested: {nrow(gsea_results)}")
println(f"   Significant (FDR < 0.25): {nrow(gsea_sig)}")
println(gsea_sig |> head(5))

# Step 5: Compare ORA and GSEA
let ora_terms = ora_sig |> col("term") |> collect()
let gsea_terms = gsea_sig |> col("term") |> collect()
println(f"\n5. Comparison:")
println(f"   ORA significant terms: {ora_terms}")
println(f"   GSEA significant terms: {gsea_terms}")

# Step 6: Export results
write_csv(ora_sig, "results/ora_results.csv")
write_csv(gsea_sig, "results/gsea_results.csv")
println(f"\n6. Results saved:")
println(f"   results/ora_results.csv")
println(f"   results/gsea_results.csv")

println("\n=== Pipeline complete ===")

Expected output:

=== Enrichment Analysis Pipeline ===

1. Total genes in DE results: 50
   Significant DE genes (|log2FC| > 1, padj < 0.05): 20

2. Gene sets loaded: 8

3. ORA results:
   Terms tested: 8
   Significant (FDR < 0.05): 3
   term                     overlap  p_value   fdr       genes
   HALLMARK_P53_PATHWAY     5        0.00003   0.00024   TP53,MDM2,CDKN2A,RB1,PTEN
   HALLMARK_DNA_REPAIR      4        0.00018   0.00072   BRCA1,BRCA2,ATM,RAD51
   HALLMARK_APOPTOSIS       3        0.00095   0.0025    BCL2,BAX,CASP3

4. GSEA results:
   Terms tested: 8
   Significant (FDR < 0.25): 4
   term                       es      nes     p_value  fdr     leading_edge
   HALLMARK_P53_PATHWAY       0.71    1.82    0.001    0.005   TP53,MDM2,CDKN2A,RB1,PTEN
   HALLMARK_DNA_REPAIR        0.65    1.68    0.004    0.011   BRCA1,BRCA2,ATM,RAD51,ATR
   HALLMARK_APOPTOSIS         0.52    1.35    0.015    0.04    BCL2,BAX,CASP3
   HALLMARK_CELL_CYCLE       -0.45   -1.18    0.048    0.13    CDK4,E2F1,CCND1

5. Comparison:
   ORA significant terms: [HALLMARK_P53_PATHWAY, HALLMARK_DNA_REPAIR, HALLMARK_APOPTOSIS]
   GSEA significant terms: [HALLMARK_P53_PATHWAY, HALLMARK_DNA_REPAIR, HALLMARK_APOPTOSIS, HALLMARK_CELL_CYCLE]

6. Results saved:
   results/ora_results.csv
   results/gsea_results.csv

=== Pipeline complete ===

Notice that GSEA detected HALLMARK_CELL_CYCLE as significant even though ORA did not. This is because the cell cycle genes in this dataset had moderate fold changes that did not pass the |log2FC| > 1 cutoff for ORA, but their coordinated downward shift was detectable by GSEA. This is the key advantage of GSEA: it catches subtle but coordinated changes.

Exercises

Count gene set membership. Load the GMT file and count how many gene sets contain “TP53.” (Hint: iterate over the map and check if each list contains the gene.)
Run ORA on a custom gene list. Pick 15 genes from the DE results and run enrich(). How do the results change compared to using all significant genes?
Compare ORA and GSEA. Run both methods on the same data. Do they agree on the top pathways? Which method finds more significant terms?
GO annotation classifier. Look up GO annotations for TP53 (UniProt: P04637) using go_annotations("P04637") and count how many annotations fall in each namespace (biological_process, molecular_function, cellular_component). (Requires internet.)
Network hub analysis. Build a STRING interaction network for five cancer genes of your choice. Find the gene with the highest degree (most connections). Is it biologically meaningful that this gene is the hub? (Requires internet.)

Key Takeaways

Enrichment analysis finds biological themes in gene lists — it is the bridge between statistics and biology.
ORA (Fisher’s exact test) is simple, fast, and intuitive. It uses a binary gene list and the hypergeometric distribution.
GSEA uses the full ranked list and detects subtle coordinated shifts that ORA misses. Use it when you suspect pathway-level effects below single-gene significance.
GO, KEGG, and Reactome are complementary. GO provides broad functional classification. KEGG shows pathway maps. Reactome offers detailed reaction-level curation. Use multiple databases for a complete picture.
Always correct for multiple testing. With hundreds of terms tested, raw p-values are meaningless. Use FDR (Benjamini-Hochberg) adjusted p-values.
Network context (STRING) shows how your genes interact physically. Hub genes with many connections are often key regulators.
GMT format is the standard for gene set distribution. The read_gmt() function loads it into a Map that both enrich() and gsea() accept.

What’s Next

Tomorrow: protein analysis — UniProt entries, domain architecture, sequence features, and structural context for the proteins your enrichment analysis highlighted.

Practical Bioinformatics in 30 Days

Day 16: Pathway and Enrichment Analysis

What You’ll Learn

The Problem

What Is Enrichment Analysis?

Two Approaches

Gene Set Databases

Gene Ontology (GO)

KEGG

Reactome

MSigDB Hallmark Gene Sets

Reading Gene Sets

Over-Representation Analysis (ORA)

Gene Set Enrichment Analysis (GSEA)

GO Term Analysis

KEGG Pathway Analysis

Reactome Pathways

Visualizing Enrichment Results

Network Context with STRING

Complete Enrichment Pipeline

Exercises

Key Takeaways

What’s Next

Keyboard shortcuts

Practical Bioinformatics in 30 Days