Day 13: Correlation — Finding Relationships
The Problem
Dr. Sarah Kim is studying breast cancer transcriptomics across 200 tumor samples. She notices that BRCA1 and BARD1 expression seem to rise and fall together — when one is high, the other tends to be high too. Exciting! These genes encode proteins that form a heterodimer critical for DNA repair.
But her collaborator raises a concern: “Both genes are upregulated in rapidly dividing cells. Couldn’t cell proliferation be driving both signals independently? You might be seeing a spurious association.”
Sarah needs to:
- Quantify how strongly BRCA1 and BARD1 co-vary
- Determine whether the relationship is statistically significant
- Control for the confounding effect of proliferation
- Visualize relationships across an entire panel of DNA repair genes
This is the domain of correlation analysis — one of the most used (and most misused) tools in all of biology.
What Is Correlation?
Correlation measures the strength and direction of the relationship between two variables. It answers: “When one variable goes up, does the other tend to go up, go down, or do nothing?”
The correlation coefficient ranges from -1 to +1:
| Value | Interpretation | Example |
|---|---|---|
| +1.0 | Perfect positive | Identical twins’ heights |
| +0.7 to +0.9 | Strong positive | BRCA1 and BARD1 expression |
| +0.3 to +0.7 | Moderate positive | BMI and blood pressure |
| -0.3 to +0.3 | Weak / none | Shoe size and IQ |
| -0.3 to -0.7 | Moderate negative | Exercise and resting heart rate |
| -0.7 to -1.0 | Strong negative | Tumor suppressor vs. proliferation rate |
| -1.0 | Perfect negative | Altitude and air pressure |
Key insight: Correlation is symmetric — the correlation between X and Y is the same as between Y and X. It doesn’t imply direction or causation.
Three Types of Correlation
1. Pearson’s r: The Linear Standard
Pearson’s correlation coefficient measures linear association:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Assumptions:
- Both variables are continuous
- Relationship is approximately linear
- Data are roughly normally distributed (for inference)
- No extreme outliers (a single outlier can flip the sign)
Strengths: Most powerful when assumptions are met. Its square equals R² in simple linear regression.
Weakness: Sensitive to outliers. Misses non-linear relationships entirely.
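As a quick sanity check (in Python, mirroring the SciPy snippets later in this lesson), the formula above can be computed by hand and compared against `scipy.stats.pearsonr`; the simulated expression vectors here are illustrative, not real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate two positively correlated "expression" vectors
x = rng.normal(10, 2, size=200)
y = 0.6 * x + rng.normal(0, 1, size=200)

# Pearson's r from the definition: sum of cross-deviations
# divided by the product of the deviation norms
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

r_scipy, p = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))  # the two routes agree
```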
2. Spearman’s ρ (rho): The Rank-Based Alternative
Spearman’s correlation operates on ranks rather than raw values. It measures monotonic association — whether Y tends to increase (or decrease) as X increases, even if not linearly.
How it works:
- Rank each variable separately (1, 2, 3, …)
- Compute Pearson’s r on the ranks
Assumptions:
- Ordinal or continuous data
- Monotonic relationship (doesn’t need to be linear)
Strengths: Robust to outliers. Works with non-normal distributions. Handles log-scale expression data naturally.
Weakness: Less powerful than Pearson when linearity holds.
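The "Pearson on ranks" recipe can be verified directly, and a single fabricated outlier (an assumption for illustration) shows why the rank-based version is more robust:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 2, size=200)
y = 0.6 * x + rng.normal(0, 1, size=200)

# Spearman's rho is exactly Pearson's r computed on the ranks
rho_via_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
rho_direct, _ = stats.spearmanr(x, y)

# One extreme outlier barely moves Spearman but drags Pearson down
x_out = np.append(x, 100.0)
y_out = np.append(y, -50.0)
print(round(stats.pearsonr(x_out, y_out)[0], 3),
      round(stats.spearmanr(x_out, y_out)[0], 3))
```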
3. Kendall’s τ (tau): The Most Robust
Kendall’s tau counts concordant and discordant pairs:
$$\tau = \frac{(\text{concordant pairs}) - (\text{discordant pairs})}{\binom{n}{2}}$$
A pair (i, j) is concordant if both xᵢ > xⱼ and yᵢ > yⱼ (or both less). It’s discordant if they disagree.
Strengths: Most robust to outliers. Better for small samples. Has a clearer probabilistic interpretation: τ is the probability of concordance minus the probability of discordance for a randomly chosen pair.
Weakness: Computationally slower for large datasets. Values tend to be smaller than Pearson/Spearman for the same data.
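The pair-counting definition can be implemented in a few lines and checked against `scipy.stats.kendalltau` (which uses a faster algorithm internally); this is a sketch on simulated data:

```python
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 0.5 * x + rng.normal(scale=0.8, size=40)

# Count concordant and discordant pairs directly from the definition
conc = disc = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        conc += 1
    elif s < 0:
        disc += 1

n_pairs = len(x) * (len(x) - 1) // 2  # binomial(n, 2)
tau_manual = (conc - disc) / n_pairs
tau_scipy, _ = stats.kendalltau(x, y)
print(round(tau_manual, 4), round(tau_scipy, 4))
```

With continuous data and no ties, the naive count matches SciPy's tau-b exactly; ties are where the library implementation earns its keep.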
Decision Table: Which Correlation to Use?
| Situation | Best Choice | Why |
|---|---|---|
| Both variables normal, linear relationship | Pearson | Most powerful |
| Skewed data (e.g., raw gene expression) | Spearman | Rank-based, outlier-robust |
| Small sample (n < 30) | Kendall | Most reliable for small n |
| Ordinal data (e.g., tumor grade 1-4) | Spearman or Kendall | Ranks are appropriate |
| Suspect outliers | Spearman or Kendall | Rank-based methods resist outliers |
| Large RNA-seq dataset, log-transformed | Pearson or Spearman | Both work well after log transform |
| Survival times with censoring | Kendall | Handles ties from censoring |
Common pitfall: Using Pearson on raw RNA-seq counts. These are heavily right-skewed — a few highly expressed genes dominate the correlation. Always log-transform first or use Spearman.
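The pitfall is easy to demonstrate with simulated skewed data (lognormal here, standing in for raw counts): Pearson on the raw scale is distorted by a handful of huge values, while Spearman is unchanged by any monotone transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated right-skewed "raw counts": correlated on the log scale,
# then exponentiated so a few samples dominate the raw values
log_a = rng.normal(5, 1, size=300)
log_b = 0.7 * log_a + rng.normal(0, 1, size=300)
raw_a, raw_b = np.exp(log_a), np.exp(log_b)

r_raw = stats.pearsonr(raw_a, raw_b)[0]   # distorted by the largest values
r_log = stats.pearsonr(np.log1p(raw_a), np.log1p(raw_b))[0]
rho = stats.spearmanr(raw_a, raw_b)[0]    # invariant to monotone transforms
print(round(r_raw, 3), round(r_log, 3), round(rho, 3))
```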
Partial Correlation: Controlling for Confounders
Standard correlation between X and Y can be inflated or deflated by a confounding variable Z that influences both. Partial correlation removes the effect of Z:
$$r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ} \cdot r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$
Example: BRCA1 and BARD1 might correlate at r = 0.85. But if cell proliferation (MKI67 expression) drives both, the partial correlation controlling for MKI67 might drop to r = 0.45 — still real, but weaker.
Clinical relevance: In pharmacogenomics, two drug targets may appear correlated simply because both are overexpressed in a particular cancer subtype. Partial correlation controlling for subtype reveals whether the targets have an independent relationship.
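The formula above is equivalent to a second route: regress the confounder out of both variables and correlate the residuals. A sketch on simulated data (echoing the proliferation story, not real measurements) confirms the two agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200

# Confounder z drives both "genes"
z = rng.normal(10, 3, size=n)          # proliferation signal
x = 0.8 * z + rng.normal(5, 2, size=n)
y = 0.7 * z + rng.normal(4, 2, size=n)

r_xy = stats.pearsonr(x, y)[0]
r_xz = stats.pearsonr(x, z)[0]
r_yz = stats.pearsonr(y, z)[0]

# Route 1: the closed-form partial-correlation formula
partial_formula = (r_xy - r_xz * r_yz) / np.sqrt(
    (1 - r_xz**2) * (1 - r_yz**2)
)

# Route 2: correlate the residuals after regressing z out of each variable
res_x = x - np.polyval(np.polyfit(z, x, 1), z)
res_y = y - np.polyval(np.polyfit(z, y, 1), z)
partial_resid = stats.pearsonr(res_x, res_y)[0]
print(round(partial_formula, 4), round(partial_resid, 4))
```

Since x and y share nothing but z in this simulation, the partial correlation collapses toward zero while the raw correlation stays substantial.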
Anscombe’s Quartet: Why You Must Visualize
In 1973, Francis Anscombe created four datasets with identical Pearson correlations (r = 0.816), identical means, and identical regression lines — but wildly different patterns:
| Dataset | Pattern | Lesson |
|---|---|---|
| I | Normal linear | Correlation works correctly |
| II | Perfect curve | r misses non-linearity |
| III | Tight line + one outlier | Single point inflates r |
| IV | Vertical cluster + outlier | r is meaningless |
The lesson: Never trust a correlation coefficient without a scatter plot.
Correlation Matrix and Heatmaps
When studying many variables simultaneously, a correlation matrix shows all pairwise correlations. For p variables, this is a p × p symmetric matrix with 1s on the diagonal.
For genomics, this is invaluable for:
- Identifying co-expression modules
- Detecting batch effects (technical variables cluster together)
- Revealing pathway relationships
A heatmap visualization uses color intensity to represent correlation strength, making patterns immediately visible across dozens or hundreds of variables.
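A minimal NumPy sketch of the matrix itself (assumed toy data: two co-expressed "genes" plus an independent one); `np.corrcoef` returns the p × p symmetric matrix described above, which `sns.heatmap` can then render:

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=100)  # shared signal for the co-expressed pair

# Columns = variables: two correlated "genes" and one independent
mat = np.column_stack([
    0.8 * base + rng.normal(scale=0.5, size=100),
    0.7 * base + rng.normal(scale=0.5, size=100),
    rng.normal(size=100),
])

corr = np.corrcoef(mat, rowvar=False)  # 3 x 3, symmetric, 1s on diagonal
print(np.round(corr, 2))
```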
Testing Significance: Is This Correlation Real?
A correlation of r = 0.3 might be noise in 10 samples but highly significant in 1000. The correlation test evaluates:
- H₀: ρ = 0 (no correlation in the population)
- H₁: ρ ≠ 0
The test statistic follows a t-distribution with n-2 degrees of freedom:
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
Key insight: With large n (common in genomics), even tiny correlations become “significant.” A correlation of r = 0.05 is significant at p < 0.05 when n > 1500. Always report both the coefficient AND the p-value, and judge practical significance by the magnitude of r.
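The t transformation above converts directly into a p-value; a small helper (hypothetical name `cor_pvalue`) makes the large-n effect concrete:

```python
import numpy as np
from scipy import stats

def cor_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0, via the t transformation."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same tiny correlation, two very different verdicts
print(cor_pvalue(0.1, 20))    # far from significant
print(cor_pvalue(0.1, 2000))  # "significant" despite r = 0.1
```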
Correlation ≠ Causation
This cannot be overstated. Correlation tells you variables co-vary — nothing more.
Classic examples in biology:
- Ice cream sales correlate with drowning deaths (confounder: hot weather)
- Shoe size correlates with reading ability in children (confounder: age)
- Stork population correlates with birth rate across European countries (confounder: rural vs. urban)
To establish causation, you need:
- Temporal precedence — cause precedes effect
- Experimental manipulation — perturb X, observe Y change
- Elimination of confounders — no third variable explains both
- Biological mechanism — plausible pathway
Correlation in BioLang
Basic Correlations
set_seed(42)
# Generate tumor expression data for 200 samples
let n = 200
# Simulate BRCA1 and BARD1 with true correlation + noise
let proliferation = rnorm(n, 10, 3)
let brca1 = proliferation * 0.8 + rnorm(n, 5, 2)
let bard1 = proliferation * 0.7 + rnorm(n, 4, 2)
# Pearson correlation — assumes linearity
let r_pearson = cor(brca1, bard1)
print("Pearson r: {r_pearson}") # ~0.78
# Spearman rank correlation — monotonic, robust
let r_spearman = spearman(brca1, bard1)
print("Spearman ρ: {r_spearman}") # ~0.76
# Kendall tau — concordant/discordant pairs, most robust
let r_kendall = kendall(brca1, bard1)
print("Kendall τ: {r_kendall}") # ~0.57 (typically smaller)
Statistical Testing
# Test whether correlation is significantly different from zero
# cor() returns the coefficient; use a t-test transformation for p-value
let r = cor(brca1, bard1)
let t_stat = r * sqrt((n - 2) / (1 - r * r))
print("r = {r}, t = {t_stat}")
# spearman() returns both the coefficient and a p-value
let spearman_result = spearman(brca1, bard1)
print("Spearman ρ = {spearman_result.coefficient}, p = {spearman_result.pvalue}")
Partial Correlation: Removing Confounders
set_seed(42)
# BRCA1-BARD1 correlation controlling for proliferation (MKI67)
let mki67 = proliferation + rnorm(n, 0, 1)
# Raw correlation
let r_raw = cor(brca1, bard1)
print("Raw Pearson r: {r_raw}") # ~0.78
# Partial correlation controlling for MKI67,
# computed directly from the pairwise-correlation formula
let r_xz = cor(brca1, mki67)
let r_yz = cor(bard1, mki67)
let r_partial = (r_raw - r_xz * r_yz) / sqrt((1 - r_xz * r_xz) * (1 - r_yz * r_yz))
print("Partial r (controlling MKI67): {r_partial}") # lower — confounder removed
# The difference reveals how much of the BRCA1-BARD1
# association was driven by shared proliferation signal
print("Reduction: {((r_raw - r_partial) / r_raw * 100) |> round(1)}%")
Correlation Matrix and Heatmap
set_seed(42)
# Build expression matrix for 8 DNA repair genes
let base = rnorm(200, 0, 1)
let genes = {
"BRCA1": base * 0.8 + rnorm(200, 10, 2),
"BARD1": base * 0.7 + rnorm(200, 8, 2),
"RAD51": base * 0.6 + rnorm(200, 12, 3),
"PALB2": base * 0.5 + rnorm(200, 9, 2),
"ATM": rnorm(200, 11, 3),
"TP53": base * -0.4 + rnorm(200, 15, 4),
"MDM2": base * -0.3 + rnorm(200, 7, 2),
"GAPDH": rnorm(200, 20, 1)
}
# Compute pairwise correlations
let gene_names = ["BRCA1", "BARD1", "RAD51", "PALB2", "ATM", "TP53", "MDM2", "GAPDH"]
for i in 0..8 {
for j in (i+1)..8 {
let r = cor(genes[gene_names[i]], genes[gene_names[j]])
print("{gene_names[i]} vs {gene_names[j]}: r = {r |> round(3)}")
}
}
# Visualize as heatmap — instantly reveals co-expression modules
heatmap(genes, {title: "DNA Repair Gene Co-Expression", color_scale: "RdBu"})
Visualizing with Scatter Plots
# Scatter plot with correlation annotation
let scatter_data = table({"BRCA1": brca1, "BARD1": bard1})
plot(scatter_data, {type: "scatter", x: "BRCA1", y: "BARD1",
title: "BRCA1 vs BARD1 Co-Expression",
x_label: "BRCA1 Expression (log2 FPKM)",
y_label: "BARD1 Expression (log2 FPKM)"})
Demonstrating Anscombe’s Quartet Effect
set_seed(42)
# Two relationships of similar strength but different shape:
# one linear, one monotonic but strongly non-linear
let x_linear = rnorm(100, 10, 3)
let y_linear = x_linear * 0.5 + rnorm(100, 0, 2)
let x_curve = rnorm(100, 10, 5)
let y_curve = exp(x_curve / 5) + rnorm(100, 0, 1)
print("Linear: Pearson r = {cor(x_linear, y_linear)}")
print("Curved: Pearson r = {cor(x_curve, y_curve)}")
print("Linear: Spearman ρ = {spearman(x_linear, y_linear)}")
print("Curved: Spearman ρ = {spearman(x_curve, y_curve)}")
# Spearman catches the monotonic but non-linear pattern
# Always plot your data!
Python:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
# df: samples x genes DataFrame; brca1, bard1, mki67 are expression vectors
# Pearson
r, p = stats.pearsonr(brca1, bard1)
# Spearman
rho, p = stats.spearmanr(brca1, bard1)
# Kendall
tau, p = stats.kendalltau(brca1, bard1)
# Partial correlation (using pingouin)
import pingouin as pg
partial = pg.partial_corr(data=df, x='BRCA1', y='BARD1', covar='MKI67')
# Correlation matrix heatmap
corr = df.corr(method='spearman')
sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0)
R:
# Pearson
cor.test(brca1, bard1, method = "pearson")
# Spearman
cor.test(brca1, bard1, method = "spearman")
# Kendall
cor.test(brca1, bard1, method = "kendall")
# Partial correlation (using ppcor)
library(ppcor)
pcor.test(brca1, bard1, mki67)
# Correlation matrix heatmap
library(corrplot)
corrplot(cor(gene_matrix, method = "spearman"),
method = "color", type = "upper")
Exercises
Exercise 1: Compare Three Methods
Compute Pearson, Spearman, and Kendall correlations for the following gene pairs. Which method gives the most different result, and why?
set_seed(42)
let n = 150
# Gene pair 1: linear relationship with outliers
let gene_a = rnorm(n, 8, 2)
let gene_b = gene_a * 0.6 + rnorm(n, 3, 1)
# Add 5 extreme outliers
# (imagine contaminated samples)
# Compute all three correlations for gene_a vs gene_b
# Which is most affected by outliers?
Exercise 2: Partial Correlation in Drug Response
Three variables are measured across 100 cancer cell lines: drug sensitivity (IC50), target gene expression, and cell doubling time. The target gene and IC50 appear correlated. Is the relationship real, or driven by growth rate?
set_seed(42)
let n = 100
let growth_rate = rnorm(n, 24, 6)
let target_expr = growth_rate * 0.5 + rnorm(n, 10, 3)
let ic50 = growth_rate * -0.3 + rnorm(n, 50, 10)
# 1. Compute raw correlation between target_expr and ic50
# 2. Compute partial correlation controlling for growth_rate
# 3. What fraction of the apparent association was confounded?
Exercise 3: Build and Interpret a Heatmap
Create a correlation heatmap for the following gene expression panel. Identify which genes cluster into co-expression modules.
set_seed(42)
let n = 300
# Simulate 3 biological modules:
# Module 1 (immune): CD8A, GZMB, PRF1, IFNG
# Module 2 (proliferation): MKI67, TOP2A, PCNA
# Module 3 (housekeeping): GAPDH, ACTB
# Create correlated expression within modules
# and weak/no correlation between modules
# Compute pairwise cor() and visualize with heatmap
# Which modules emerge from the clustering?
Exercise 4: Significance vs. Magnitude
Generate datasets with n = 20 and n = 2000. Show that a weak correlation (r ≈ 0.1) is non-significant with small n but highly significant with large n. Argue why the p-value alone is misleading.
set_seed(42)
# Small sample: n = 20, weak correlation
let x_small = rnorm(20, 0, 1)
let y_small = x_small * 0.1 + rnorm(20, 0, 1)
# Large sample: n = 2000, same weak correlation
let x_large = rnorm(2000, 0, 1)
let y_large = x_large * 0.1 + rnorm(2000, 0, 1)
# Compute cor() on both and test significance
# Compare p-values and correlation magnitudes
# What should you report?
Exercise 5: Anscombe Challenge
Create two synthetic gene-pair datasets where Pearson r is nearly identical (~0.7) but scatter plots reveal completely different biology — one linear, one with a threshold effect (flat then rising). Show that Spearman catches the difference.
Key Takeaways
- Pearson measures linear association; Spearman measures monotonic (rank-based); Kendall counts concordant pairs and is most robust
- Correlation ranges from -1 to +1; the sign indicates direction, the magnitude indicates strength
- Always visualize — identical correlations can hide wildly different patterns (Anscombe’s quartet)
- Partial correlation removes the effect of confounders, revealing true associations
- With large genomics datasets, tiny correlations become “significant” — always report the magnitude alongside the p-value
- Correlation is not causation — co-expression does not imply co-regulation or functional relationship
- Spearman is generally the safest default for gene expression data
What’s Next
Now that we can quantify relationships between two variables, Day 14 takes the next step: using one variable to predict another. We’ll build our first linear regression models, learning to fit lines, interpret slopes, and critically evaluate whether our predictions are trustworthy.