Day 22: Clustering — Finding Structure in Omics Data

Day 22 of 30 · Prerequisites: Days 2-3, 13, 21 · ~60 min reading · Unsupervised Learning

The Problem

You are part of a cancer genomics consortium. Five hundred tumor samples have been profiled with RNA-seq, measuring the expression of 18,000 genes in each. The pathologist has classified these tumors into three histological subtypes based on what she sees under the microscope. But molecular data often reveals finer distinctions invisible to the eye.

Your task: find natural groupings in the gene expression data — without peeking at the pathologist’s labels. If the molecular subtypes align with the histological ones, confidence in the classification increases. If the data reveals additional subtypes, you may have discovered clinically distinct groups that respond differently to treatment. In breast cancer, this is exactly how the PAM50 molecular subtypes were discovered — and they now guide treatment decisions for millions of patients worldwide.

But there is a danger lurking. Clustering algorithms always find clusters, even in random noise. The critical question is not “can I find groups?” but “are the groups real?”

What Is Clustering?

Clustering is unsupervised grouping: divide observations into sets such that observations within a set are more similar to each other than to observations in other sets.

Unlike classification (supervised learning), clustering has no labels to learn from. You are not training the algorithm to distinguish “tumor” from “normal.” You are asking the algorithm to discover structure on its own.

Think of it as sorting a pile of coins. If you have pennies, nickels, dimes, and quarters, the task is straightforward — there are obvious groupings by size and color. But if someone hands you a pile of irregularly shaped pebbles and says “sort these into groups,” you must decide what “similar” means. By weight? By color? By texture? Different definitions of similarity lead to different groupings. This subjectivity is both the power and the peril of clustering.

Key insight: Clustering is a tool for exploration, not proof. It generates hypotheses about structure in your data. Validating those hypotheses requires independent evidence — clinical outcomes, functional assays, or replication in new datasets.

K-Means Clustering

K-means is the simplest and most widely used clustering algorithm. It partitions data into exactly k groups by iterating two steps.

The Algorithm

  1. Choose k (the number of clusters you want).
  2. Initialize k random “centroids” (cluster centers).
  3. Assign each data point to the nearest centroid.
  4. Update each centroid to the mean of its assigned points.
  5. Repeat steps 3-4 until assignments stop changing.

That is it. The algorithm converges quickly, typically in 10-20 iterations.
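As a concrete illustration, the five steps can be sketched in a few lines of NumPy (a toy implementation for intuition, using made-up 2-D blobs, not a replacement for a library routine):

```python
import numpy as np

def kmeans_toy(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # steps 1-2: pick k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 4: move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop once nothing moves
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_toy(X, k=2)
```

With blobs this well separated, the loop converges in a handful of iterations and splits the data cleanly into the two true groups.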

Properties

| Property | K-means |
| --- | --- |
| Shape of clusters | Spherical (roughly equal-sized blobs) |
| Requires k in advance | Yes |
| Handles outliers | Poorly — outliers pull centroids |
| Handles unequal cluster sizes | Poorly — tends to split large clusters |
| Speed | Very fast, even on large datasets |
| Deterministic | No — depends on random initialization |

[Figure: K-means iteration with k=3. Panel 1: centroids placed randomly. Panel 2: each point assigned to its nearest centroid (colors show assignments). Panel 3: centroids updated to the means of their clusters. Steps 2-3 repeat until convergence.]

[Figure: Hierarchical clustering dendrogram of samples S1-S8, with merge height on the vertical axis. Cutting the tree at the indicated height yields 3 clusters.]

The Initialization Problem

Because k-means starts with random centroids, different runs can give different results. A bad initialization might converge to a suboptimal solution. The standard fix is to run k-means multiple times (say 10-50) with different random seeds and keep the solution with the lowest total within-cluster variance.
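That restart strategy looks like this in scikit-learn (shown here with `n_init=1` so each individual run is visible; in practice a single `KMeans(n_init=25)` call performs the restarts internally):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three loose blobs in 5 dimensions (simulated stand-in for PCA scores)
X = np.vstack([rng.normal(c, 1.0, (60, 5)) for c in (0, 4, 8)])

# run k-means 20 times with different seeds, keep the lowest-inertia solution
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(20)]
best = min(runs, key=lambda km: km.inertia_)
print(f"inertia across runs: {min(r.inertia_ for r in runs):.1f} "
      f"to {max(r.inertia_ for r in runs):.1f}")
```

`inertia_` is scikit-learn's name for the total within-cluster sum of squares, so keeping the minimum matches the rule described above.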

set_seed(42)
# K-means clustering on PCA-reduced expression data
let result = pca(expression_matrix)
let pca_scores = result.scores |> take_cols(0..20)  # first 20 PCs

let clusters = kmeans(pca_scores, 4)
print("Cluster sizes: " + str(clusters.sizes))
print("Total within-cluster variance: " + str(clusters.total_within_ss))

Choosing k: The Elbow Method

The hardest part of k-means is choosing k. If you pick k = 500, every tumor gets its own cluster — perfect within-cluster similarity but meaningless. If you pick k = 1, everything is in one group — also meaningless. The right k is somewhere in between.

The elbow method runs k-means for k = 1, 2, 3, …, K and plots the total within-cluster sum of squares (WCSS) against k. As k increases, WCSS always decreases (more clusters = tighter fits). The “elbow” — where the rate of decrease sharply levels off — suggests the best k.

set_seed(42)
# Elbow plot to find optimal k
let wcss = seq(1, 15) |> map(|k| {
  let cl = kmeans(pca_scores, k)
  {k: k, within_ss: cl.total_within_ss}
}) |> to_table()
plot(wcss, {type: "line", x: "k", y: "within_ss",
  title: "Elbow Plot — Optimal Number of Clusters"})

Common pitfall: The elbow is often ambiguous. Real data rarely shows a sharp bend. If the elbow plot suggests k could be 3, 4, or 5, use biological knowledge or validation metrics (like silhouette scores) to decide.

Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters rather than producing a flat partition. It does not require you to choose k in advance — you can cut the tree at any height to get the desired number of clusters.

Agglomerative (Bottom-Up) Algorithm

  1. Start with each sample as its own cluster (500 clusters for 500 tumors).
  2. Find the two closest clusters and merge them (now 499 clusters).
  3. Repeat until everything is in one cluster.
  4. The dendrogram records every merge and the distance at which it occurred.
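The merge sequence can be inspected directly with SciPy's `linkage`, whose output matrix records exactly the information in step 4 (a toy example with six one-dimensional points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# six points: two tight triplets far apart
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

Z = linkage(X, method="ward")
# each row of Z is one merge: (cluster_i, cluster_j, height, new cluster size)
for i, (a, b, h, n) in enumerate(Z):
    print(f"merge {i}: {int(a)} + {int(b)} at height {h:.2f} (size {int(n)})")

# cutting below the final big jump recovers the two triplets
labels = fcluster(Z, t=2, criterion="maxclust")
```

The last merge happens at a much greater height than all the earlier ones, which is the dendrogram's signal that two well-separated groups were joined.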

Linkage Methods

“Distance between clusters” is ambiguous when clusters contain multiple points. The linkage method resolves this:

| Linkage | Distance between clusters A and B | Tendency |
| --- | --- | --- |
| Single | Minimum distance between any pair | Produces long, chained clusters |
| Complete | Maximum distance between any pair | Produces compact, equal-sized clusters |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward | Increase in total variance when merged | Minimizes within-cluster variance (like k-means) |

Ward linkage is the most common choice in genomics because it tends to produce balanced, compact clusters similar to k-means.

# Hierarchical clustering with Ward linkage
let hc = hclust(pca_scores, "ward")

# Cut tree at k=4 clusters
let labels = hc |> cut_tree(4)
print("Hierarchical cluster sizes: " + str(table_counts(labels)))

The Dendrogram

The dendrogram is one of the most informative visualizations in genomics. The height of each merge indicates how dissimilar the merged clusters were. A long vertical line before a merge suggests a clear separation; short lines suggest gradual transitions.

# Dendrogram with colored clusters
dendrogram(hc, {k: 4, title: "Tumor Expression — Hierarchical Clustering"})

DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach. Instead of assuming spherical clusters, it finds regions of high density separated by regions of low density. It also identifies outliers — points in low-density regions that do not belong to any cluster.

The Algorithm

  1. For each point, count neighbors within distance epsilon.
  2. If a point has at least min_samples neighbors, it is a “core point.”
  3. Connect core points that are within epsilon of each other.
  4. Connected components of core points form clusters.
  5. Non-core points within epsilon of a core point join that cluster.
  6. Remaining points are labeled as noise (outliers).
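These steps map directly onto scikit-learn's `DBSCAN`, which labels noise points `-1` (a toy run on made-up data with two dense blobs and one isolated outlier):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus one isolated point far from everything
X = np.vstack([
    rng.normal(0, 0.3, (40, 2)),
    rng.normal(5, 0.3, (40, 2)),
    [[20.0, 20.0]],            # too far from any core point -> noise
])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"clusters: {n_clusters}, noise points: {(db.labels_ == -1).sum()}")
```

Note that k was never specified: the two clusters fall out of the density structure, and the outlier is flagged rather than forced into a group.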

Properties

| Property | DBSCAN |
| --- | --- |
| Shape of clusters | Any shape (can find crescents, rings, etc.) |
| Requires k in advance | No — finds k automatically |
| Handles outliers | Excellent — explicitly labels them |
| Parameters | epsilon (neighborhood radius), min_samples |
| Handles varying density | Poorly — one epsilon for all regions |

set_seed(42)
# DBSCAN on PCA scores
let db = dbscan(pca_scores, 2.5, 10)
print("Clusters found: " + str(db.n_clusters))
print("Noise points: " + str(db.n_noise))

Key insight: DBSCAN is excellent for scRNA-seq data where clusters are irregularly shaped and outlier cells (doublets, dying cells) should be flagged rather than forced into a cluster. K-means forces every point into a cluster — DBSCAN does not.

Silhouette Scores: Validating Clusters

How do you know your clusters are real? The silhouette score provides internal validation. For each point, it measures:

  • a = mean distance to other points in the same cluster
  • b = mean distance to points in the nearest other cluster
  • Silhouette = (b - a) / max(a, b)
[Figure: silhouette score for a point X in cluster A, with a = 28 (mean distance to its own cluster) and b = 215 (mean distance to the nearest other cluster): s = (215 - 28) / 215 = 0.87. Scores close to +1 indicate a well-clustered point, near 0 a boundary point, and negative values a probable misassignment.]
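The formula can be checked by hand on a tiny made-up example (four points in two tight pairs) against scikit-learn's per-point `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# two tight pairs; examine the first point
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# a = mean distance to own cluster, b = mean distance to nearest other cluster
a = 1.0                                    # distance to (0, 1)
b = np.mean([10.0, np.hypot(10.0, 1.0)])   # distances to the two far points
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[0]
print(f"manual: {s_manual:.4f}, sklearn: {s_sklearn:.4f}")
```

The two values agree, and the score is close to +1 because the point sits in a tight pair far from the other cluster.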

The score ranges from -1 to +1:

| Score | Interpretation |
| --- | --- |
| +1 | Point is far from neighboring clusters — excellent clustering |
| 0 | Point is on the boundary between clusters |
| -1 | Point is probably in the wrong cluster |

Average silhouette scores:

| Average Score | Quality |
| --- | --- |
| 0.71 - 1.00 | Strong structure |
| 0.51 - 0.70 | Reasonable structure |
| 0.26 - 0.50 | Weak structure, possibly artificial |
| < 0.25 | No substantial structure found |

set_seed(42)
# Silhouette analysis for different k
for k in 2..8 {
  let cl = kmeans(pca_scores, k)
  let sil = cl.silhouette
  print("k=" + str(k) + " silhouette: " + str(round(sil, 3)))
}

Common pitfall: A high silhouette score does not prove biological relevance. Random data with well-separated clusters in PCA space can have high silhouette scores. Always validate clusters against independent information — clinical outcomes, known markers, or held-out data.

Clustering Always Finds Clusters — Even in Noise

This is the single most important warning about clustering. Run k-means with k = 3 on completely random data, and it will dutifully return three clusters. The clusters will look somewhat real in a scatter plot. But they are meaningless.

set_seed(42)
# DANGER: Clustering random noise
let noise = table(500, 20, "rnorm")
let fake_clusters = kmeans(noise, 3)
let noise_pca = pca(noise)

# These clusters are pure noise — but the plot looks convincing!
scatter(noise_pca.scores[0], noise_pca.scores[1], {color: fake_clusters.labels})

# Silhouette score will be low
let sil = fake_clusters.silhouette
print("Silhouette on noise: " + str(round(sil, 3)))  # ~0.1-0.2

Always compare your clustering against a null — random data of the same dimensions and sample size. If the silhouette score on real data is not substantially higher than on random data, the clusters may not be real.
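A minimal sketch of that null comparison with scikit-learn, clustering structured data and same-shape noise with the same k:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# structured data: three blocks of samples shifted apart in every dimension
real = np.vstack([rng.normal(c, 1.0, (50, 20)) for c in (0, 3, 6)])
# null data: pure noise with the same dimensions and sample size
noise = rng.normal(0.0, 1.0, real.shape)

scores = {}
for name, X in [("real", real), ("noise", noise)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    scores[name] = silhouette_score(X, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

K-means dutifully returns three clusters in both cases, but only the structured data scores well above the noise baseline, which is the comparison that matters.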

Heatmaps with Clustering

The clustered heatmap is the workhorse visualization of genomics. It shows expression values as colors, with both rows (genes) and columns (samples) reordered by hierarchical clustering. Similar genes are placed next to similar genes; similar samples next to similar samples.

# Clustered heatmap of top variable genes
let top_genes = variance_per_column(expression_matrix)
  |> sort_by(|x| -x.value)
  |> take(100)
  |> map(|x| x.name)

let sub_matrix = expression_matrix |> select_cols(top_genes)

heatmap(sub_matrix, {cluster_rows: true, cluster_cols: true,
  color_scale: "viridis",
  title: "Top 100 Variable Genes — Clustered Heatmap"})

Clustering in BioLang — Complete Pipeline

set_seed(42)
# Full clustering analysis on simulated tumor expression data

# Simulate 500 tumors with 4 molecular subtypes
let n_per_group = 125
let n_genes = 200

let subtype_a = table(n_per_group, n_genes, "rnorm")
let subtype_b = table(n_per_group, n_genes, "rnorm")
let subtype_c = table(n_per_group, n_genes, "rnorm")
let subtype_d = table(n_per_group, n_genes, "rnorm")

# Each subtype has distinct gene blocks elevated
for i in 0..n_per_group {
  for j in 0..50 { subtype_a[i][j] = subtype_a[i][j] + 3.0 }
  for j in 50..100 { subtype_b[i][j] = subtype_b[i][j] + 3.0 }
  for j in 100..150 { subtype_c[i][j] = subtype_c[i][j] + 3.0 }
  for j in 150..200 { subtype_d[i][j] = subtype_d[i][j] + 3.0 }
}

let expr = rbind(subtype_a, subtype_b, subtype_c, subtype_d)
let true_labels = repeat("A", n_per_group) + repeat("B", n_per_group) + repeat("C", n_per_group) + repeat("D", n_per_group)

# 1. PCA for visualization and dimensionality reduction
let pca_result = pca(expr)
scatter(pca_result.scores[0], pca_result.scores[1])

# 2. K-means clustering
let km = kmeans(expr, 4)
scatter(pca_result.scores[0], pca_result.scores[1], {color: km.labels, title: "K-means Clusters (PCA space)"})

# 3. Elbow plot
let wcss = seq(1, 10) |> map(|k| {
  let cl = kmeans(expr, k)
  {k: k, within_ss: cl.total_within_ss}
}) |> to_table()
plot(wcss, {type: "line", x: "k", y: "within_ss",
  title: "Elbow Plot"})

# 4. Silhouette analysis
for k in 2..8 {
  let cl = kmeans(expr, k)
  let sil = cl.silhouette
  print("k=" + str(k) + " silhouette=" + str(round(sil, 3)))
}

# 5. Hierarchical clustering
let hc = hclust(expr, "ward")
dendrogram(hc, {k: 4, title: "Dendrogram — Ward Linkage"})

let hc_labels = hc |> cut_tree(4)
scatter(pca_result.scores[0], pca_result.scores[1], {color: hc_labels, title: "Hierarchical Clusters (PCA space)"})

# 6. DBSCAN
let db = dbscan(expr, 8.0, 10)
print("DBSCAN found " + str(db.n_clusters) + " clusters, " + str(db.n_noise) + " noise")

# 7. Compare clustering to true labels
print("\nK-means vs true labels:")
print(cross_tabulate(km.labels, true_labels))

print("\nHierarchical vs true labels:")
print(cross_tabulate(hc_labels, true_labels))

# 8. Clustered heatmap
heatmap(expr, {cluster_rows: true, cluster_cols: true,
  color_scale: "red_blue",
  title: "Expression Heatmap — K-means Clusters"})

Python:

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import seaborn as sns

# K-means
km = KMeans(n_clusters=4, n_init=25, random_state=42)
km_labels = km.fit_predict(expr)

# Elbow plot
wcss = [KMeans(n_clusters=k, n_init=25).fit(expr).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), wcss, 'bo-')
plt.show()

# Silhouette
for k in range(2, 8):
    km_k = KMeans(n_clusters=k, n_init=25).fit(expr)
    print(f"k={k} silhouette={silhouette_score(expr, km_k.labels_):.3f}")

# Hierarchical
Z = linkage(expr, method='ward')
dendrogram(Z)
plt.show()

# Heatmap
sns.clustermap(expr, method='ward', cmap='RdBu_r')
plt.show()

R:

# K-means
km <- kmeans(expr, centers = 4, nstart = 25)
table(km$cluster, true_labels)

# Elbow plot
wcss <- sapply(1:9, function(k) kmeans(expr, k, nstart=25)$tot.withinss)
plot(1:9, wcss, type="b", xlab="k", ylab="Total within SS")

# Hierarchical
hc <- hclust(dist(expr), method = "ward.D2")
plot(hc)
cutree(hc, k = 4)

# Heatmap
library(pheatmap)
pheatmap(expr, clustering_method = "ward.D2", show_rownames = FALSE)

# Silhouette
library(cluster)
for (k in 2:7) {
  cl <- kmeans(expr, k, nstart=25)
  cat(sprintf("k=%d silhouette=%.3f\n", k, mean(silhouette(cl$cluster, dist(expr))[,3])))
}

Exercises

  1. Three subtypes. The pathologist says 3 subtypes, but your elbow plot suggests 4 or 5. Simulate 500 tumors with 5 true subtypes (100 each). Run k-means for k = 3, 4, 5, 6. Use silhouette scores and cross-tabulation against true labels. Which k best recovers the truth?
# Your code: simulate 5 subtypes, test multiple k values
  2. Linkage comparison. Run hierarchical clustering on the same data with single, complete, average, and Ward linkage. Cut each at k = 5. Which linkage best recovers the true labels? Visualize the four dendrograms.
# Your code: four linkage methods, compare cross-tabulations
  3. DBSCAN tuning. On the 5-subtype data, try DBSCAN with epsilon values from 3 to 15 (step 1) and min_samples from 5 to 20. Find the combination that gives 5 clusters with the fewest noise points.
# Your code: grid search over epsilon and min_samples
  4. Noise test. Generate pure random data (500 samples x 200 genes, no structure). Run k-means with k = 4. Plot the result. Calculate the silhouette score. Now compare to the silhouette score from the structured data. What is the difference?
# Your code: cluster noise, compare silhouette to real data
  5. Heatmap with annotation. Create a clustered heatmap of the top 50 most variable genes, with a color bar showing both k-means cluster assignment and true subtype. Do the two agree?
# Your code: select top variable genes, heatmap with dual annotation

Key Takeaways

  • Clustering is unsupervised grouping — it finds structure without labels, making it essential for discovering molecular subtypes, cell populations, and other hidden patterns.
  • K-means is fast and simple but requires choosing k in advance, assumes spherical clusters, and is sensitive to initialization.
  • The elbow method and silhouette scores help choose k, but neither is definitive — biological knowledge must guide the final decision.
  • Hierarchical clustering produces a dendrogram showing nested relationships; Ward linkage is the default choice for genomics.
  • DBSCAN finds arbitrarily shaped clusters and identifies outliers, making it valuable for scRNA-seq data, but requires tuning epsilon and min_samples.
  • Silhouette scores validate clustering quality: above 0.5 is reasonable, above 0.7 is strong, but always compare against random data.
  • Clustering always finds clusters, even in noise. Never trust clusters without validation against independent evidence.
  • Clustered heatmaps are the standard visualization for gene expression patterns across samples.

What’s Next

PCA and clustering assume that the data you have is large enough to draw conclusions. But what if your dataset is tiny — six mice per group, far too few to verify normality or trust asymptotic theory? Tomorrow, we meet resampling methods: bootstrap and permutation tests. These techniques let your data speak for itself, building confidence intervals and testing hypotheses without any distributional assumptions, by literally reshuffling the data thousands of times.