
Day 3: Distributions — The Shape of Biological Variation

Day 3 of 30 · Prerequisites: Days 1-2 · ~55 min · Hands-on

The Problem

Dr. James Park has RNA-seq data from 12 tumor samples. He needs to identify genes that are differentially expressed between treatment and control groups. He knows the standard approach: run a t-test on each gene. The t-test assumes data is normally distributed, so he checks his data.

Gene FPKM values range from 0 to 50,000. Most genes sit near zero, with a handful expressed at astronomical levels. The histogram looks nothing like a bell curve — it is a massive spike near zero with a long right tail. He runs the t-test anyway. Out of 20,000 genes, 4,700 come back “significant” at p < 0.05. That is 23.5% of all genes, far more than seems biologically plausible for this modest treatment.

Something is deeply wrong. The t-test assumed his data was symmetric and bell-shaped. It was neither. The test produced garbage because the assumption was violated. If Dr. Park had understood distributions — the subject of today’s chapter — he would have known to transform his data or use a different test entirely.

What Is a Distribution?

A distribution is a recipe that tells you how likely each possible value is. Think of it as a map of a city, where the height at each point represents how many people live there. Downtown is a tall spike (many people). The suburbs are a gentle slope. The surrounding farmland is nearly flat.

Every dataset has an underlying distribution — the theoretical shape that generated the data you observe. When you draw a histogram of your data, you are estimating this shape from a finite sample. With 10 data points, the histogram is choppy and unreliable. With 10,000, it smooths out and begins to reveal the true underlying curve.
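This sampling effect is easy to demonstrate numerically. A quick sketch in Python (NumPy/SciPy; the mean and SD are arbitrary height-like values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two samples from the SAME underlying normal distribution (mean 170, sd 8)
small = rng.normal(170, 8, size=10)
large = rng.normal(170, 8, size=10_000)

# True skewness is 0; the small-sample estimate is noisy,
# while the large-sample estimate settles near the truth
print(f"skewness estimate, n=10:    {stats.skew(small):+.3f}")
print(f"skewness estimate, n=10000: {stats.skew(large):+.3f}")
```

The same instability affects every feature of a histogram, which is why small samples can look wildly non-normal even when the underlying distribution is a perfect bell curve.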

Why does this matter? Because every statistical test makes assumptions about the distribution of your data. The t-test assumes normality. The chi-square test assumes expected counts are large enough. The Poisson regression assumes counts follow a Poisson process. Violate the assumption, and the test’s guarantees evaporate.

Key insight: A statistical test is a contract. It says: “If your data follows distribution X, then I guarantee my conclusions are reliable with probability Y.” Break the contract, and you get no guarantees.

The Normal Distribution

The normal distribution — the bell curve — is the most famous distribution in statistics, and for good reason. It arises naturally whenever many small, independent effects add together.

Properties

The normal distribution is defined by two parameters:

  • μ (mu): the mean, which determines the center
  • σ (sigma): the standard deviation, which determines the width

The curve is perfectly symmetric around μ. It extends infinitely in both directions, though values far from the mean are exceedingly unlikely.
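For reference, the density behind the bell curve, with μ controlling location and σ controlling spread:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```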

The 68-95-99.7 Rule

[Figure: The 68-95-99.7 rule (empirical rule) — the normal curve with shaded bands at μ ± 1σ, μ ± 2σ, and μ ± 3σ.]

| Range | Probability | Meaning |
|---|---|---|
| μ ± 1σ | 68.3% | About two-thirds of data |
| μ ± 2σ | 95.4% | Nearly all data |
| μ ± 3σ | 99.7% | Almost everything |
| Beyond 3σ | 0.3% | Extreme outliers |

If a measurement falls more than 3 standard deviations from the mean, it is either a genuine outlier or something went wrong.
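These probabilities follow directly from the standard normal CDF, since P(|Z| ≤ k) = 2Φ(k) − 1. A quick verification with SciPy:

```python
from scipy import stats

# Probability mass within k standard deviations of the mean:
# P(|Z| <= k) = 2*Phi(k) - 1 for the standard normal CDF Phi
for k in (1, 2, 3):
    p = 2 * stats.norm.cdf(k) - 1
    print(f"within {k} sigma: {p:.1%}")
# prints 68.3%, 95.4%, 99.7%
```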

Biological Examples of Normality

The normal distribution is a good model for:

  • Measurement error: Technical replicates of the same sample tend to be normally distributed.
  • Height and weight in a homogeneous population (though mixtures of populations are not normal).
  • Blood pressure readings in healthy adults.
  • Quantitative traits influenced by many genes of small effect (additive genetic model).
  • Log-transformed gene expression (more on this below).

When Data Is NOT Normal

The normal distribution is a terrible model for:

  • Raw gene expression (FPKM, TPM, counts) — heavily right-skewed
  • Read counts — discrete, non-negative, often zero-inflated
  • Allele frequencies — bounded between 0 and 1
  • Survival times — always positive, typically right-skewed
  • Any data with a hard boundary (concentrations cannot be negative)
set_seed(42)
# Generate and visualize normal data
let heights = rnorm(5000, 170, 8)

histogram(heights, {bins: 50, title: "Adult Heights (cm) — Normal Distribution"})
let stats = summary(heights)
print("Mean: {stats.mean:.1}, Median: {stats.median:.1}, Skewness: {stats.skewness:.3}")
# Mean and median nearly identical; skewness near zero — hallmarks of normality

The Log-Normal Distribution

If gene expression is not normal, what is it? In most cases, it is log-normal: the data itself is skewed, but the logarithm of the data is normally distributed.

Why Gene Expression Is Log-Normal

Gene regulation is a cascade of multiplicative processes. A transcription factor binds (or doesn’t), an enhancer activates (fold-change), mRNA is stabilized (half-life multiplied), ribosomes translate at varying rates (multiplied efficiency). When effects multiply rather than add, the result is log-normal, not normal.

This is a mathematical fact: if X = Y₁ × Y₂ × … × Yₙ and the Y values are independent, then log(X) = log(Y₁) + log(Y₂) + … + log(Yₙ). Sums of independent variables tend toward normal (by the Central Limit Theorem), so log(X) is approximately normal, meaning X is log-normal.
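This multiplicative story can be simulated directly. The sketch below (Python; the factor range and cascade depth are arbitrary) multiplies 20 independent positive factors and compares skewness before and after taking logs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Each simulated "expression level" is the product of 20 independent
# positive factors, mimicking a multiplicative regulatory cascade
factors = rng.uniform(0.5, 2.0, size=(5000, 20))
x = factors.prod(axis=1)

# The raw products are strongly right-skewed; their logs are near-normal
print(f"skewness of X:      {stats.skew(x):.2f}")
print(f"skewness of log(X): {stats.skew(np.log(x)):.2f}")
```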

The Log-Transform Trick

This is why bioinformaticians routinely log-transform expression data before analysis:

set_seed(42)
# Simulate gene expression (log-normal)
let log_expr = rnorm(5000, 3.0, 2.0)
let expression = log_expr |> map(|x| 2.0 ** x)  # 2^x to simulate FPKM

# Raw expression: heavily skewed
histogram(expression, {bins: 50, title: "Raw FPKM — Right Skewed"})
let raw_stats = summary(expression)
print("Raw — Mean: {raw_stats.mean:.1}, Median: {raw_stats.median:.1}, Skew: {raw_stats.skewness:.2}")

# Log2-transformed: approximately normal
let log2_expr = expression |> map(|x| log2(x + 1))  # +1 to handle zeros
histogram(log2_expr, {bins: 50, title: "log2(FPKM+1) — Approximately Normal"})
let log_stats = summary(log2_expr)
print("Log2 — Mean: {log_stats.mean:.1}, Median: {log_stats.median:.1}, Skew: {log_stats.skewness:.2}")

After log-transformation, the mean and median converge, skewness drops toward zero, and the histogram looks bell-shaped. Now parametric tests are appropriate.

Clinical relevance: Differential expression tools like DESeq2 and edgeR work with counts and model them with the negative binomial distribution, but many downstream analyses (clustering, PCA, visualization) require log-transformed data. Understanding why is essential for correct analysis.

The Poisson Distribution

The Poisson distribution models the number of events that occur in a fixed interval, when events happen independently at a constant average rate.

Properties

  • Parameter: λ (lambda) — the average rate
  • Support: 0, 1, 2, 3, … (non-negative integers)
  • Mean = Variance = λ (this is the key property)

Biological Examples

| Application | What is the “event”? | What is the “interval”? |
|---|---|---|
| RNA-seq read counts | One read mapping to a gene | One gene in one sample |
| Mutations | One mutation | Per megabase of genome |
| Rare diseases | One case | Per 100,000 population per year |
| Sequencing errors | One error | Per 1000 bases |
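The Poisson pmf is P(X = k) = λᵏ e^(−λ) / k!. As a sanity check, here is the formula computed by hand against SciPy, reusing the λ = 3.5 mutation-rate example:

```python
import math
from scipy import stats

lam = 3.5  # average mutations per megabase

# pmf by hand: P(X = k) = lam**k * exp(-lam) / k!
k = 5
by_hand = lam**k * math.exp(-lam) / math.factorial(k)
print(f"P(X=5) by formula: {by_hand:.4f}")          # 0.1322
print(f"P(X=5) via SciPy:  {stats.poisson.pmf(k, lam):.4f}")

# Tail probability P(X >= 10) via the survival function
print(f"P(X>=10): {stats.poisson.sf(9, lam):.4f}")
```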
[Figure: Distribution shape comparison. Normal (μ=0, σ=1): symmetric bell curve — heights, measurement error. Log-Normal (μ=0, σ=1 of log): right-skewed, always positive — gene expression (FPKM/TPM). Poisson (λ=3): discrete counts — read counts, mutations.]

The Overdispersion Problem

In theory, RNA-seq counts should be Poisson. In practice, biological replicates show more variability than Poisson predicts — the variance exceeds the mean. This is overdispersion, caused by biological variability between samples.

The solution is the negative binomial distribution, which adds a dispersion parameter to allow variance > mean. This is why DESeq2 uses negative binomial, not Poisson.

set_seed(42)
# Poisson distribution for mutation counts

# Average 3.5 mutations per megabase in a tumor
let mutation_counts = rpois(1000, 3.5)

histogram(mutation_counts, {bins: 15, title: "Mutations per Megabase (Poisson, lambda=3.5)"})

# Verify mean ~ variance (Poisson property)
print("Mean: {mean(mutation_counts):.2}")
print("Variance: {variance(mutation_counts):.2}")
# Both should be close to 3.5

# Probability of seeing 10+ mutations in a region (hypermutation?)
let p_hyper = 1.0 - ppois(9, 3.5)
print("P(10+ mutations): {p_hyper:.4}")
# Very low — a region with 10+ mutations is genuinely unusual
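Overdispersion can also be made concrete with a gamma-Poisson mixture, which is one standard way to derive the negative binomial: give each sample its own rate and the variance inflates past the mean. A NumPy sketch (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100_000

# Pure Poisson: every observation shares one rate, so variance ~ mean
pois = rng.poisson(8.0, size=n)

# Gamma-Poisson mixture: each observation gets its own rate
# (gamma with shape 4, scale 2 has mean 8, variance 16)
rates = rng.gamma(shape=4.0, scale=2.0, size=n)
nb = rng.poisson(rates)

print(f"Poisson:   mean={pois.mean():.2f}  var={pois.var():.2f}")  # var near 8
print(f"Neg binom: mean={nb.mean():.2f}  var={nb.var():.2f}")      # var near 24
```

The mixture's variance is E[λ] + Var[λ] = 8 + 16 = 24, triple its mean — exactly the variance > mean behavior DESeq2's dispersion parameter is built to absorb.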

The Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.

Properties

  • Parameters: n (number of trials), p (probability of success)
  • Support: 0, 1, 2, …, n
  • Mean = np, Variance = np(1-p)

Biological Examples

| Application | What is a “trial”? | What is “success”? | What is p? |
|---|---|---|---|
| Heterozygous genotype | One offspring | Inherits variant allele | 0.5 |
| SNP calling | One read at a position | Carries alt allele | VAF |
| Drug response | One patient | Responds | Response rate |
| Hardy-Weinberg | One individual | Has genotype AA | p² |

Hardy-Weinberg Equilibrium

For a biallelic locus with allele frequencies p and q = 1-p, Hardy-Weinberg predicts genotype frequencies:

  • AA: p²
  • Aa: 2pq
  • aa: q²

Deviations from HWE can indicate selection, population structure, or genotyping error.

set_seed(42)
# Genotype frequencies under Hardy-Weinberg
let p = 0.3  # frequency of allele A
let q = 1.0 - p

let freq_AA = p * p         # 0.09
let freq_Aa = 2.0 * p * q   # 0.42
let freq_aa = q * q         # 0.49

print("Expected genotype frequencies:")
print("  AA: {freq_AA:.3}")
print("  Aa: {freq_Aa:.3}")
print("  aa: {freq_aa:.3}")

# Simulate genotyping 500 individuals
let n_individuals = 500
let AA_count = rbinom(1, n_individuals, freq_AA) |> sum()
print("Simulated AA count: {AA_count} (expected ~45)")

# Probability of observing exactly k heterozygotes
let k = 200
let p_exact = dbinom(k, n_individuals, freq_Aa)
print("P(exactly {k} heterozygotes in {n_individuals}): {p_exact:.6}")

# Probability of 230+ heterozygotes (possible HWE violation?)
let p_excess = 1.0 - pbinom(229, n_individuals, freq_Aa)
print("P(230+ heterozygotes): {p_excess:.4}")
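For completeness, here is a sketch of the standard chi-square goodness-of-fit test for HWE in Python, using hypothetical genotype counts. Note the single degree of freedom: three genotype classes give 2 df, minus 1 for the allele frequency estimated from the data.

```python
import numpy as np
from scipy import stats

# Hypothetical genotype counts in 500 individuals: AA, Aa, aa
observed = np.array([50, 205, 245])

# Estimate the allele frequency of A from the counts themselves
n = observed.sum()
p = (2 * observed[0] + observed[1]) / (2 * n)
q = 1 - p

# Expected counts under Hardy-Weinberg: p^2, 2pq, q^2
expected = n * np.array([p**2, 2 * p * q, q**2])

# Chi-square goodness of fit; 1 df because p was estimated from the data
chi2 = ((observed - expected) ** 2 / expected).sum()
pval = stats.chi2.sf(chi2, df=1)
print(f"p(A) = {p:.3f}, chi2 = {chi2:.3f}, p-value = {pval:.3f}")
```

A large p-value here means the counts are consistent with HWE; a small one flags the possible selection, structure, or genotyping-error scenarios mentioned above.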

Checking Your Distribution: Diagnostic Tools

Before running any parametric test, verify that your data matches its assumptions. Here are the essential tools.

Q-Q Plots

A Q-Q (quantile-quantile) plot compares your data’s quantiles against the theoretical quantiles of a reference distribution (usually normal). If data follows the reference distribution, points fall on a straight diagonal line. Deviations reveal departures from the assumed shape.

set_seed(42)
# Q-Q plot for normal data (should be a straight line)
let normal_data = rnorm(500, 0, 1)
qq_plot(normal_data, {title: "Q-Q Plot: Normal Data"})

# Q-Q plot for log-normal data (curved — not normal!)
let lognormal_data = normal_data |> map(|x| exp(x))
qq_plot(lognormal_data, {title: "Q-Q Plot: Log-Normal Data (Raw)"})

# Q-Q plot after log-transform (straight again)
let transformed = lognormal_data |> map(|x| log(x))
qq_plot(transformed, {title: "Q-Q Plot: Log-Normal Data (After Log Transform)"})

Reading a Q-Q plot:

  • Points on the line: data matches the assumed distribution
  • Upward curve at the right end: right skew (heavy right tail)
  • Downward curve at the left end: left skew (heavy left tail)
  • S-shape: heavy tails on both sides (high kurtosis)
[Figure: Q-Q plot interpretation guide. Good fit (normal): points follow the diagonal. Heavy tails: S-curve, more extreme values than normal. Right skew: upper tail curves up, heavier than normal.]

Shapiro-Wilk Test

The Shapiro-Wilk test formally tests whether data is normally distributed. A small p-value (< 0.05) means the data is significantly non-normal.

set_seed(42)
# Test normality visually with Q-Q plots and histograms
let normal_data = rnorm(200, 50, 10)
let skewed_data = rnorm(200, 3, 1) |> map(|x| exp(x))

# For normal data: Q-Q plot should show points on the diagonal
qq_plot(normal_data, {title: "Q-Q: Normal Data"})
let stats_normal = summary(normal_data)
print("Normal data — Skewness: {stats_normal.skewness:.4}")
# Skewness near 0: consistent with normality

# For skewed data: Q-Q plot will curve away from the diagonal
qq_plot(skewed_data, {title: "Q-Q: Skewed Data"})
let stats_skewed = summary(skewed_data)
print("Skewed data — Skewness: {stats_skewed.skewness:.4}")
# High skewness: definitely not normal

Common pitfall: The Shapiro-Wilk test is very sensitive with large samples. With n = 10,000, it will reject normality for data that is “close enough” to normal for practical purposes. For large samples, rely more on Q-Q plots and skewness/kurtosis values than on the Shapiro-Wilk p-value.
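This pitfall is easy to reproduce. The sketch below (Python/SciPy; the t-distribution and its 25 degrees of freedom are an arbitrary choice of a "nearly normal" shape) draws from the same mildly non-normal distribution at two sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A t-distribution with 25 df is visually almost indistinguishable
# from a normal, yet is not exactly normal (slightly heavy tails)
for n in (50, 5000):
    stat, p = stats.shapiro(rng.standard_t(df=25, size=n))
    print(f"n = {n:5d}  Shapiro-Wilk p = {p:.4f}")
# Small n: the test usually fails to reject. Large n: even this tiny,
# practically harmless deviation can drive the p-value below 0.05.
```

(SciPy's `shapiro` also documents that its p-value may be inaccurate for samples larger than 5000, another reason to lean on Q-Q plots at scale.)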

Histogram with Density Overlay

Overlay the theoretical density curve on your histogram to visually assess fit:

set_seed(42)
let data = rnorm(2000, 100, 15)

# Histogram with normal density overlay
histogram(data, {bins: 40, title: "Data with Normal Fit Overlay"})
density(data, {title: "Kernel Density Estimate"})

Distribution Summary Table

| Distribution | Shape | Parameters | Mean | Variance | Biology Use Case |
|---|---|---|---|---|---|
| Normal | Symmetric bell | μ, σ | μ | σ² | Measurement error, heights, log-expression |
| Log-Normal | Right-skewed | μ, σ (of log) | exp(μ + σ²/2) | Complex | Raw gene expression, protein abundance |
| Poisson | Right-skewed (low λ), symmetric (high λ) | λ | λ | λ | Read counts, mutation rates |
| Binomial | Varies | n, p | np | np(1-p) | Genotype counts, allele sampling |
| Negative Binomial | Right-skewed | r, p | r(1-p)/p | r(1-p)/p² | Overdispersed counts (DESeq2) |
| Beta | Flexible on (0,1) | α, β | α/(α+β) | Complex | Allele frequencies, methylation |
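Most of the mean and variance columns can be cross-checked against SciPy's distribution objects (the parameter values below are arbitrary; note that SciPy names the negative binomial's r parameter `n`):

```python
from scipy import stats

# Frozen distribution objects with arbitrary parameter choices
norm   = stats.norm(loc=2.0, scale=3.0)   # mu=2, sigma=3
pois   = stats.poisson(mu=3.5)            # lambda=3.5
binom  = stats.binom(n=500, p=0.42)
nbinom = stats.nbinom(n=5, p=0.3)         # SciPy calls r "n"
beta   = stats.beta(a=2.0, b=5.0)

print(norm.mean(), norm.var())      # mu, sigma^2
print(pois.mean(), pois.var())      # lambda, lambda
print(binom.mean(), binom.var())    # np, np(1-p)
print(nbinom.mean(), nbinom.var())  # r(1-p)/p, r(1-p)/p^2
print(beta.mean())                  # alpha/(alpha+beta)
```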

Python and R Equivalents

Python:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Normal distribution
x = np.linspace(-4, 4, 1000)
plt.plot(x, stats.norm.pdf(x, 0, 1))

# Poisson
counts = np.random.poisson(lam=3.5, size=1000)

# Binomial
genotypes = np.random.binomial(n=500, p=0.42, size=1000)

# Q-Q plot
stats.probplot(data, dist="norm", plot=plt)

# Shapiro-Wilk test
stat, p = stats.shapiro(data)

# Distribution fitting
params = stats.norm.fit(data)  # MLE fit

R:

# Normal distribution
x <- seq(-4, 4, length.out = 1000)
plot(x, dnorm(x, 0, 1), type = "l")

# Poisson
counts <- rpois(1000, lambda = 3.5)

# Binomial
genotypes <- rbinom(1000, size = 500, prob = 0.42)

# Q-Q plot
qqnorm(data)
qqline(data)

# Shapiro-Wilk test
shapiro.test(data)

# Density overlay
hist(data, freq = FALSE)
curve(dnorm(x, mean(data), sd(data)), add = TRUE, col = "red")

Why This Matters for Testing

Here is the critical connection between today’s material and the rest of the book:

| If your data is… | Then you can use… | But NOT… |
|---|---|---|
| Normal | t-test, ANOVA, Pearson correlation | — |
| Log-normal | t-test after log-transform | t-test on raw values |
| Poisson counts | Poisson regression, exact tests | t-test |
| Overdispersed counts | Negative binomial (DESeq2, edgeR) | Poisson, t-test |
| Non-normal, unknown | Mann-Whitney, Kruskal-Wallis | t-test, ANOVA |
| Bounded (0,1) | Beta regression, logit transform | Linear regression |

Choosing the wrong test because you assumed the wrong distribution is one of the most common errors in computational biology. Dr. Park’s mistake from our opening scenario — running a t-test on raw FPKM values — is committed daily in bioinformatics labs around the world.
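Dr. Park's mistake, and its fix, can be sketched in a few lines of Python (the effect size, spread, and sample sizes are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups of log-normal "expression" values with a true 2-fold change
# (a +1 shift on the log2 scale)
control = 2.0 ** rng.normal(5.0, 1.0, size=50)
treated = 2.0 ** rng.normal(6.0, 1.0, size=50)

# t-test on the raw, skewed values: the normality assumption is violated
t_raw, p_raw = stats.ttest_ind(control, treated)

# t-test on log2 values: the assumption holds and the test behaves well
t_log, p_log = stats.ttest_ind(np.log2(control), np.log2(treated))

print(f"raw scale:  p = {p_raw:.3g}")
print(f"log2 scale: p = {p_log:.3g}")
```

Same data, same question, two different answers: the distribution you assume determines the test you may run.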

Key insight: The distribution is not a detail. It is the foundation. Get it right, and your downstream analysis is trustworthy. Get it wrong, and no amount of sophisticated testing can rescue your conclusions.

Exercises

Exercise 1: Identify the Distribution

For each dataset, determine the most appropriate distribution and justify your choice.

set_seed(42)
# Dataset A: Sequencing read counts per gene
let dataset_a = rpois(1000, 25)

# Dataset B: Patient blood pressure
let dataset_b = rnorm(500, 120, 15)

# Dataset C: Gene expression (raw TPM)
let log_vals = rnorm(2000, 2.0, 1.5)
let dataset_c = log_vals |> map(|x| 10.0 ** x)

# TODO: For each dataset, create a histogram and Q-Q plot
# TODO: Check normality visually with qq_plot() and histogram()
# TODO: For dataset C, try log-transforming and re-check
# TODO: State which distribution best describes each and why

Exercise 2: The Poisson Check

Verify whether mutation count data follows a Poisson distribution by checking the mean-variance relationship.

set_seed(42)
# Scenario 1: Pure Poisson (technical replicates)
let technical = rpois(500, 8.0)

# Scenario 2: Overdispersed (biological replicates)
# Simulate by mixing Poisson with varying lambda
let lambdas = rnorm(500, 8.0, 3.0) |> map(|x| max(x, 0.1))
# TODO: Draw one Poisson count per lambda to form the overdispersed sample

# TODO: Compute mean and variance for technical replicates
# TODO: Are they approximately equal? (Poisson property)
# TODO: For overdispersed data, compute the dispersion ratio (variance/mean)
# TODO: What does a ratio >> 1 tell you about the data?

Exercise 3: Transform and Test

Take a skewed dataset, find the right transformation, and verify normality.

set_seed(42)
let protein_abundance = rnorm(300, 4, 1.2) |> map(|x| exp(x))

# TODO: Plot histogram of raw data
# TODO: Check normality with qq_plot() — is it normal?
# TODO: Apply log transform
# TODO: Plot histogram of transformed data
# TODO: Check normality of transformed data with qq_plot()
# TODO: Compare skewness before and after

Exercise 4: Distribution Detective

A collaborator gives you mystery data. Identify its distribution using all tools from today.

set_seed(42)
# Mystery data — what distribution is this?
let mystery = rbinom(1000, 50, 0.15)

# TODO: Compute summary()
# TODO: Create histogram
# TODO: Try Q-Q plot against normal
# TODO: Note: the data is discrete. What distributions produce discrete data?
# TODO: Estimate the parameters and identify the distribution

Key Takeaways

  • A distribution is the theoretical shape describing how likely each value is. Every dataset has one, and every statistical test assumes one.
  • The normal distribution arises from additive effects and is defined by mean and standard deviation. It is appropriate for measurement error and many physiological traits.
  • Gene expression is NOT normal — it is log-normal because gene regulation is multiplicative. Always log-transform before using parametric tests.
  • The Poisson distribution models count data (reads, mutations) with the key property that mean equals variance. When variance exceeds the mean (overdispersion), use the negative binomial instead.
  • The binomial distribution models fixed trials with a success probability — relevant for genotype frequencies and allele sampling.
  • Q-Q plots are the most informative visual diagnostic for distribution checking. The Shapiro-Wilk test provides a formal hypothesis test for normality.
  • Choosing the right distribution is not optional — it determines which statistical tests are valid and which will produce misleading results.

What’s Next

You now understand the shapes that biological data takes. But distributions describe what values are likely — which is just another way of saying they describe probabilities. Tomorrow, on Day 4, we formalize probability itself. You will learn to compute the chance that a child inherits a BRCA1 mutation, understand why a positive genetic test might mean less than you think (Bayes’ theorem will surprise you), and discover why the prosecutor’s fallacy has sent innocent people to prison. Probability is the language of uncertainty, and uncertainty is the native tongue of biology.