Day 13: Correlation — Finding Relationships

The Problem

Dr. Sarah Kim is studying breast cancer transcriptomics across 200 tumor samples. She notices that BRCA1 and BARD1 expression seem to rise and fall together — when one is high, the other tends to be high too. Exciting! These genes encode proteins that form a heterodimer critical for DNA repair.

But her collaborator raises a concern: “Both genes are upregulated in rapidly dividing cells. Couldn’t cell proliferation be driving both signals independently? You might be seeing a spurious association.”

Sarah needs to:

  1. Quantify how strongly BRCA1 and BARD1 co-vary
  2. Determine whether the relationship is statistically significant
  3. Control for the confounding effect of proliferation
  4. Visualize relationships across an entire panel of DNA repair genes

This is the domain of correlation analysis — one of the most used (and most misused) tools in all of biology.

What Is Correlation?

Correlation measures the strength and direction of the relationship between two variables. It answers: “When one variable goes up, does the other tend to go up, go down, or do nothing?”

The correlation coefficient ranges from -1 to +1:

| Value | Interpretation | Example |
|---|---|---|
| +1.0 | Perfect positive | Identical twins’ heights |
| +0.7 to +0.9 | Strong positive | BRCA1 and BARD1 expression |
| +0.3 to +0.7 | Moderate positive | BMI and blood pressure |
| -0.3 to +0.3 | Weak / none | Shoe size and IQ |
| -0.3 to -0.7 | Moderate negative | Exercise and resting heart rate |
| -0.7 to -1.0 | Strong negative | Tumor suppressor vs. proliferation rate |
| -1.0 | Perfect negative | Altitude and air pressure |

Key insight: Correlation is symmetric — the correlation between X and Y is the same as between Y and X. It doesn’t imply direction or causation.

[Figure: correlation spectrum from r = -0.9 (strong negative) through r = -0.5 (moderate negative) and r = 0 (no correlation) to r = +0.5 (moderate positive) and r = +0.9 (strong positive).]

Three Types of Correlation

1. Pearson’s r: The Linear Standard

Pearson’s correlation coefficient measures linear association:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Assumptions:

  • Both variables are continuous
  • Relationship is approximately linear
  • Data are roughly normally distributed (for inference)
  • No extreme outliers (a single outlier can flip the sign)

Strengths: Most powerful when assumptions are met. Directly related to R² in linear regression.

Weakness: Sensitive to outliers. Misses non-linear relationships entirely.
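
The formula above can be checked directly against a library implementation. The sketch below (synthetic data, assumed parameters) computes Pearson's r from the definition and compares it with `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 3, 200)                # e.g. expression of one gene
y = 0.5 * x + rng.normal(0, 2, 200)       # linearly related second gene

# Pearson's r from the definition: sum of products of centered values,
# normalized by the product of the two sums of squares
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy)  # the two values agree
```
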

2. Spearman’s ρ (rho): The Rank-Based Alternative

Spearman’s correlation operates on ranks rather than raw values. It measures monotonic association — whether Y tends to increase (or decrease) as X increases, even if not linearly.

How it works:

  1. Rank each variable separately (1, 2, 3, …)
  2. Compute Pearson’s r on the ranks

Assumptions:

  • Ordinal or continuous data
  • Monotonic relationship (doesn’t need to be linear)

Strengths: Robust to outliers. Works with non-normal distributions. Handles log-scale expression data naturally.

Weakness: Less powerful than Pearson when linearity holds.
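
The two-step recipe above (rank, then Pearson) can be verified in a few lines. This sketch uses skewed synthetic data with assumed parameters; `scipy.stats.rankdata` handles the ranking:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(2, 1, 100)                 # skewed, like raw expression
y = x ** 0.5 + rng.normal(0, 0.2, 100)       # monotonic but non-linear

# Step 1: rank each variable; step 2: Pearson on the ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

# Library one-liner gives the same coefficient
rho_scipy, _ = stats.spearmanr(x, y)
print(rho_manual, rho_scipy)
```
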

3. Kendall’s τ (tau): The Most Robust

Kendall’s tau counts concordant and discordant pairs:

$$\tau = \frac{(\text{concordant pairs}) - (\text{discordant pairs})}{\binom{n}{2}}$$

A pair (i, j) is concordant if the orderings agree: both xᵢ > xⱼ and yᵢ > yⱼ, or both xᵢ < xⱼ and yᵢ < yⱼ. It is discordant if the orderings disagree.

Strengths: Most robust to outliers. Better for small samples. Has clearer probabilistic interpretation.

Weakness: Computationally slower for large datasets. Values tend to be smaller than Pearson/Spearman for the same data.
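
The concordant/discordant count is easy to implement by brute force over all pairs, which makes the definition concrete. A sketch with small tie-free example data (values chosen arbitrarily for illustration), checked against `scipy.stats.kendalltau`:

```python
import itertools
from scipy import stats

x = [3.1, 1.2, 4.8, 2.4, 5.9, 0.7, 3.6, 2.2]
y = [2.9, 1.5, 4.1, 3.3, 5.2, 1.1, 2.8, 2.0]

# Count concordant and discordant pairs over all n-choose-2 pairs
concordant = discordant = 0
for (xi, yi), (xj, yj) in itertools.combinations(zip(x, y), 2):
    s = (xi - xj) * (yi - yj)
    if s > 0:
        concordant += 1     # orderings agree
    elif s < 0:
        discordant += 1     # orderings disagree

n = len(x)
tau_manual = (concordant - discordant) / (n * (n - 1) / 2)
tau_scipy, _ = stats.kendalltau(x, y)
print(tau_manual, tau_scipy)  # equal when there are no ties
```

Note that with ties present, `kendalltau` defaults to the tie-corrected tau-b, which differs from this simple formula.
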

Decision Table: Which Correlation to Use?

| Situation | Best Choice | Why |
|---|---|---|
| Both variables normal, linear relationship | Pearson | Most powerful |
| Skewed data (e.g., raw gene expression) | Spearman | Rank-based, outlier-robust |
| Small sample (n < 30) | Kendall | Most reliable for small n |
| Ordinal data (e.g., tumor grade 1-4) | Spearman or Kendall | Ranks are appropriate |
| Suspect outliers | Spearman or Kendall | Rank-based methods resist outliers |
| Large RNA-seq dataset, log-transformed | Pearson or Spearman | Both work well after log transform |
| Survival times with censoring | Kendall | Handles ties from censoring |

Common pitfall: Using Pearson on raw RNA-seq counts. These are heavily right-skewed — a few highly expressed genes dominate the correlation. Always log-transform first or use Spearman.
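
A quick simulation illustrates the pitfall. The data below are synthetic log-normal "counts" with assumed parameters: Pearson's r shifts depending on whether you correlate raw or log-transformed values, while Spearman's ρ is identical on both scales because ranks are unchanged by any monotone transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Counts that are correlated on the log scale (true log-scale r ≈ 0.8)
log_a = rng.normal(5, 1, 300)
log_b = 0.8 * log_a + rng.normal(0, 0.6, 300)
raw_a, raw_b = np.exp(log_a), np.exp(log_b)

r_raw, _ = stats.pearsonr(raw_a, raw_b)                    # skew-distorted
r_log, _ = stats.pearsonr(np.log(raw_a), np.log(raw_b))    # after transform
rho_raw, _ = stats.spearmanr(raw_a, raw_b)
rho_log, _ = stats.spearmanr(np.log(raw_a), np.log(raw_b))

# Spearman is invariant under the monotone log transform;
# Pearson is not, because a few large counts dominate the raw scale
print(r_raw, r_log, rho_raw, rho_log)
```
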

Partial Correlation: Controlling for Confounders

Standard correlation between X and Y can be inflated or deflated by a confounding variable Z that influences both. Partial correlation removes the effect of Z:

$$r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ} \cdot r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$

Example: BRCA1 and BARD1 might correlate at r = 0.85. But if cell proliferation (MKI67 expression) drives both, the partial correlation controlling for MKI67 might drop to r = 0.45 — still real, but weaker.

Clinical relevance: In pharmacogenomics, two drug targets may appear correlated simply because both are overexpressed in a particular cancer subtype. Partial correlation controlling for subtype reveals whether the targets have an independent relationship.
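
The partial-correlation formula is equivalent to regressing the confounder out of both variables and correlating the residuals, which is a useful sanity check. A sketch with simulated data (variable names and parameters chosen to mirror the BRCA1/BARD1/proliferation example, not real measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
z = rng.normal(10, 3, n)             # confounder (proliferation)
x = 0.8 * z + rng.normal(5, 2, n)    # BRCA1-like signal
y = 0.7 * z + rng.normal(4, 2, n)    # BARD1-like signal

# Formula: adjust the raw correlation by each variable's
# correlation with the confounder
r_xy, _ = stats.pearsonr(x, y)
r_xz, _ = stats.pearsonr(x, z)
r_yz, _ = stats.pearsonr(y, z)
r_partial = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Equivalent route: regress z out of both variables (with intercept),
# then correlate the residuals
res_x = x - np.polyval(np.polyfit(z, x, 1), z)
res_y = y - np.polyval(np.polyfit(z, y, 1), z)
r_resid, _ = stats.pearsonr(res_x, res_y)

print(r_xy, r_partial, r_resid)  # r_partial and r_resid agree
```

Here the entire x–y association is driven by z, so the partial correlation collapses toward zero while the raw correlation stays sizable.
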

Anscombe’s Quartet: Why You Must Visualize

In 1973, Francis Anscombe created four datasets with identical Pearson correlations (r = 0.816), identical means and variances, and identical regression lines — but wildly different patterns:

| Dataset | Pattern | Lesson |
|---|---|---|
| I | Normal linear | Correlation works correctly |
| II | Perfect curve | r misses non-linearity |
| III | Tight line + one outlier | Single point inflates r |
| IV | Vertical cluster + outlier | r is meaningless |

The lesson: Never trust a correlation coefficient without a scatter plot.

[Figure: Anscombe's quartet — four scatter plots (I: linear; II: curved; III: outlier; IV: clustered), all with r = 0.82 and the same mean, variance, and regression line. Lesson: always visualize before trusting a correlation coefficient.]
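
You can reproduce the quartet's headline result from Anscombe's published 1973 values (datasets I–III share the same x values; IV has its own):

```python
from scipy import stats

# Anscombe's original data
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

results = {}
for label, (x, y) in {"I": (x123, y1), "II": (x123, y2),
                      "III": (x123, y3), "IV": (x4, y4)}.items():
    r, _ = stats.pearsonr(x, y)
    results[label] = r
    print(f"Dataset {label}: r = {r:.3f}")  # all four come out ~0.816
```

Identical coefficients, four entirely different scatter plots.
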

Correlation Matrix and Heatmaps

When studying many variables simultaneously, a correlation matrix shows all pairwise correlations. For p variables, this is a p × p symmetric matrix with 1s on the diagonal.

For genomics, this is invaluable for:

  • Identifying co-expression modules
  • Detecting batch effects (technical variables cluster together)
  • Revealing pathway relationships

A heatmap visualization uses color intensity to represent correlation strength, making patterns immediately visible across dozens or hundreds of variables.

Testing Significance: Is This Correlation Real?

A correlation of r = 0.3 might be noise in 10 samples but highly significant in 1000. The correlation test evaluates:

  • H₀: ρ = 0 (no correlation in the population)
  • H₁: ρ ≠ 0

The test statistic follows a t-distribution with n-2 degrees of freedom:

$$t = r\sqrt{\frac{n-2}{1-r^2}}$$

Key insight: With large n (common in genomics), even tiny correlations become “significant.” A correlation of r = 0.05 is significant at p < 0.05 when n > 1500. Always report both the coefficient AND the p-value, and judge practical significance by the magnitude of r.
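
The t transformation above makes this easy to see numerically. A small helper (hypothetical function name, written here for illustration) reproduces the two-sided p-value and shows the same tiny r flipping from non-significant to significant as n grows:

```python
import numpy as np
from scipy import stats

def pearson_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0, via t with n-2 df."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same r = 0.05 is noise at n = 100 but "significant" at n = 2000
print(pearson_pvalue(0.05, 100))   # well above 0.05
print(pearson_pvalue(0.05, 2000))  # below 0.05
```

The coefficient hasn't changed at all; only the sample size has. That is why the magnitude of r, not just the p-value, must drive interpretation.
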

Correlation ≠ Causation

This cannot be overstated. Correlation tells you variables co-vary — nothing more.

Classic examples in biology:

  • Ice cream sales correlate with drowning deaths (confounder: hot weather)
  • Shoe size correlates with reading ability in children (confounder: age)
  • Stork population correlates with birth rate across European countries (confounder: rural vs. urban)

To establish causation, you need:

  1. Temporal precedence — cause precedes effect
  2. Experimental manipulation — perturb X, observe Y change
  3. Elimination of confounders — no third variable explains both
  4. Biological mechanism — plausible pathway

[Figure: confounding diagram — hot weather (the confounder) drives both ice cream sales and drowning deaths, producing a spurious r = 0.87 with no causal link between the two.]

Correlation in BioLang

Basic Correlations

set_seed(42)
# Generate tumor expression data for 200 samples
let n = 200

# Simulate BRCA1 and BARD1 with true correlation + noise
let proliferation = rnorm(n, 10, 3)
let brca1 = proliferation * 0.8 + rnorm(n, 5, 2)
let bard1 = proliferation * 0.7 + rnorm(n, 4, 2)

# Pearson correlation — assumes linearity
let r_pearson = cor(brca1, bard1)
print("Pearson r: {r_pearson}")  # ~0.78

# Spearman rank correlation — monotonic, robust
let r_spearman = spearman(brca1, bard1)
print("Spearman ρ: {r_spearman}")  # ~0.76

# Kendall tau — concordant/discordant pairs, most robust
let r_kendall = kendall(brca1, bard1)
print("Kendall τ: {r_kendall}")  # ~0.57 (typically smaller)

Statistical Testing

# Test whether correlation is significantly different from zero
# cor() returns the coefficient; use a t-test transformation for p-value
let r = cor(brca1, bard1)
let t_stat = r * sqrt((n - 2) / (1 - r * r))
print("r = {r}, t = {t_stat}")

# Spearman likewise returns the coefficient; the same t transformation
# gives an approximate p-value for moderate-to-large n
let rho = spearman(brca1, bard1)
let t_rho = rho * sqrt((n - 2) / (1 - rho * rho))
print("Spearman ρ = {rho}, t = {t_rho}")

Partial Correlation: Removing Confounders

set_seed(42)
# BRCA1-BARD1 correlation controlling for proliferation (MKI67)
let mki67 = proliferation + rnorm(n, 0, 1)

# Raw correlation
let r_raw = cor(brca1, bard1)
print("Raw Pearson r: {r_raw}")  # ~0.78

# Partial correlation controlling for MKI67
# Compute manually: regress out confounder from both variables
let r_xz = cor(brca1, mki67)
let r_yz = cor(bard1, mki67)
let r_partial = (r_raw - r_xz * r_yz) / sqrt((1 - r_xz * r_xz) * (1 - r_yz * r_yz))
print("Partial r (controlling MKI67): {r_partial}")  # lower — confounder removed

# The difference reveals how much of the BRCA1-BARD1
# association was driven by shared proliferation signal
print("Reduction: {((r_raw - r_partial) / r_raw * 100) |> round(1)}%")

Correlation Matrix and Heatmap

set_seed(42)
# Build expression matrix for 8 DNA repair genes
let base = rnorm(200, 0, 1)

let genes = {
    "BRCA1":  base * 0.8 + rnorm(200, 10, 2),
    "BARD1":  base * 0.7 + rnorm(200, 8, 2),
    "RAD51":  base * 0.6 + rnorm(200, 12, 3),
    "PALB2":  base * 0.5 + rnorm(200, 9, 2),
    "ATM":    rnorm(200, 11, 3),
    "TP53":   base * -0.4 + rnorm(200, 15, 4),
    "MDM2":   base * -0.3 + rnorm(200, 7, 2),
    "GAPDH":  rnorm(200, 20, 1)
}

# Compute pairwise correlations
let gene_names = ["BRCA1", "BARD1", "RAD51", "PALB2", "ATM", "TP53", "MDM2", "GAPDH"]
for i in 0..8 {
    for j in (i+1)..8 {
        let r = cor(genes[gene_names[i]], genes[gene_names[j]])
        print("{gene_names[i]} vs {gene_names[j]}: r = {r |> round(3)}")
    }
}

# Visualize as heatmap — instantly reveals co-expression modules
heatmap(genes, {title: "DNA Repair Gene Co-Expression", color_scale: "RdBu"})

Visualizing with Scatter Plots

# Scatter plot with correlation annotation
let scatter_data = table({"BRCA1": brca1, "BARD1": bard1})
plot(scatter_data, {type: "scatter", x: "BRCA1", y: "BARD1",
    title: "BRCA1 vs BARD1 Co-Expression",
    x_label: "BRCA1 Expression (log2 FPKM)",
    y_label: "BARD1 Expression (log2 FPKM)"})

Demonstrating Anscombe’s Quartet Effect

set_seed(42)
# Two datasets with identical Pearson r but different patterns
let x_linear = rnorm(100, 10, 3)
let y_linear = x_linear * 0.5 + rnorm(100, 0, 2)

let x_curve = rnorm(100, 10, 5)
let y_curve = (x_curve - 10) ** 2 / 10 + rnorm(100, 0, 1)

print("Linear: Pearson r = {cor(x_linear, y_linear)}")
print("Curved: Pearson r = {cor(x_curve, y_curve)}")
print("Linear: Spearman ρ = {spearman(x_linear, y_linear)}")
print("Curved: Spearman ρ = {spearman(x_curve, y_curve)}")

# Spearman catches the monotonic but non-linear pattern
# Always plot your data!

Python:

import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson
r, p = stats.pearsonr(brca1, bard1)

# Spearman
rho, p = stats.spearmanr(brca1, bard1)

# Kendall
tau, p = stats.kendalltau(brca1, bard1)

# Partial correlation (using pingouin)
import pingouin as pg
partial = pg.partial_corr(data=df, x='BRCA1', y='BARD1', covar='MKI67')

# Correlation matrix heatmap
corr = df.corr(method='spearman')
sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0)

R:

# Pearson
cor.test(brca1, bard1, method = "pearson")

# Spearman
cor.test(brca1, bard1, method = "spearman")

# Kendall
cor.test(brca1, bard1, method = "kendall")

# Partial correlation (using ppcor)
library(ppcor)
pcor.test(brca1, bard1, mki67)

# Correlation matrix heatmap
library(corrplot)
corrplot(cor(gene_matrix, method = "spearman"),
         method = "color", type = "upper")

Exercises

Exercise 1: Compare Three Methods

Compute Pearson, Spearman, and Kendall correlations for the following gene pairs. Which method gives the most different result, and why?

set_seed(42)
let n = 150

# Gene pair 1: linear relationship with outliers
let gene_a = rnorm(n, 8, 2)
let gene_b = gene_a * 0.6 + rnorm(n, 3, 1)

# Add 5 extreme outliers
# (imagine contaminated samples)

# Compute all three correlations for gene_a vs gene_b
# Which is most affected by outliers?

Exercise 2: Partial Correlation in Drug Response

Three variables are measured across 100 cancer cell lines: drug sensitivity (IC50), target gene expression, and cell doubling time. The target gene and IC50 appear correlated. Is the relationship real, or driven by growth rate?

set_seed(42)
let n = 100
let growth_rate = rnorm(n, 24, 6)

let target_expr = growth_rate * 0.5 + rnorm(n, 10, 3)
let ic50 = growth_rate * -0.3 + rnorm(n, 50, 10)

# 1. Compute raw correlation between target_expr and ic50
# 2. Compute partial correlation controlling for growth_rate
# 3. What fraction of the apparent association was confounded?

Exercise 3: Build and Interpret a Heatmap

Create a correlation heatmap for the following gene expression panel. Identify which genes cluster into co-expression modules.

set_seed(42)
let n = 300

# Simulate 3 biological modules:
# Module 1 (immune): CD8A, GZMB, PRF1, IFNG
# Module 2 (proliferation): MKI67, TOP2A, PCNA
# Module 3 (housekeeping): GAPDH, ACTB

# Create correlated expression within modules
# and weak/no correlation between modules
# Compute pairwise cor() and visualize with heatmap
# Which modules emerge from the clustering?

Exercise 4: Significance vs. Magnitude

Generate datasets with n = 20 and n = 2000. Show that a weak correlation (r ≈ 0.1) is non-significant with small n but highly significant with large n. Argue why the p-value alone is misleading.

set_seed(42)
# Small sample: n = 20, weak correlation
let x_small = rnorm(20, 0, 1)
let y_small = x_small * 0.1 + rnorm(20, 0, 1)

# Large sample: n = 2000, same weak correlation
let x_large = rnorm(2000, 0, 1)
let y_large = x_large * 0.1 + rnorm(2000, 0, 1)

# Compute cor() on both and test significance
# Compare p-values and correlation magnitudes
# What should you report?

Exercise 5: Anscombe Challenge

Create two synthetic gene-pair datasets where Pearson r is nearly identical (~0.7) but scatter plots reveal completely different biology — one linear, one with a threshold effect (flat then rising). Show that Spearman catches the difference.

Key Takeaways

  • Pearson measures linear association; Spearman measures monotonic (rank-based); Kendall counts concordant pairs and is most robust
  • Correlation ranges from -1 to +1; the sign indicates direction, the magnitude indicates strength
  • Always visualize — identical correlations can hide wildly different patterns (Anscombe’s quartet)
  • Partial correlation removes the effect of confounders, revealing true associations
  • With large genomics datasets, tiny correlations become “significant” — always report the magnitude alongside the p-value
  • Correlation is not causation — co-expression does not imply co-regulation or functional relationship
  • Spearman is generally the safest default for gene expression data

What’s Next

Now that we can quantify relationships between two variables, Day 14 takes the next step: using one variable to predict another. We’ll build our first linear regression models, learning to fit lines, interpret slopes, and critically evaluate whether our predictions are trustworthy.