Day 9: When Normality Fails — Non-Parametric Tests
The Problem
Dr. Maria Gonzalez studies the gut microbiome in inflammatory bowel disease (IBD). She has 16S rRNA sequencing data from 15 IBD patients and 15 healthy controls, measuring the relative abundance of Faecalibacterium prausnitzii, a key anti-inflammatory bacterium. Looking at the data, she sees a mess: most values cluster near zero, a few patients have moderate levels, and one healthy individual has an enormous abundance of 45%. The histogram looks nothing like a bell curve — it is right-skewed with a long tail.
She runs a Shapiro-Wilk test on each group: both return p < 0.001, firmly rejecting normality. The t-test assumes normally distributed data. With data this skewed and samples this small, the t-test’s p-value can be badly inaccurate: too liberal or too conservative, depending on the specific pattern. She needs tests that do not assume the data follow a normal distribution.
These are non-parametric tests: methods that operate on the ranks of data rather than the raw values, making them robust to skewness, outliers, and any distributional shape.
What Are Non-Parametric Tests?
Imagine you are judging a cooking competition. A parametric judge scores each dish on a precise 1-100 scale and compares average scores. A non-parametric judge simply ranks the dishes from best to worst — first place, second place, third place. The ranking approach is less precise when scores are reliable, but it is far more robust when one judge has an eccentric scoring system.
Non-parametric tests replace raw data values with their ranks (1st smallest, 2nd smallest, …) and then analyze the ranks. This has powerful consequences:
| Property | Parametric (t-test) | Non-parametric (rank-based) |
|---|---|---|
| Assumes normality | Yes | No |
| Sensitive to outliers | Very | Resistant |
| Uses raw values | Yes | Uses ranks |
| Power (normal data) | Highest | Slightly lower (~95%) |
| Power (non-normal data) | Unreliable | Reliable |
| Handles ordinal data | No | Yes |
Key insight: Non-parametric tests are not “worse” versions of parametric tests. They are the correct choice when distributional assumptions are violated. Using a t-test on heavily skewed data is like measuring temperature with a ruler — you might get a number, but it doesn’t mean anything.
When to Choose Non-Parametric
Use non-parametric tests when:
- Shapiro-Wilk rejects normality (p < 0.05) and sample size is small
- Data are ordinal (pain scale 1-10, tumor grade I-IV)
- Data have heavy outliers that cannot be removed
- Sample sizes are very small (n < 10 per group)
- Data are bounded or have floor/ceiling effects (many zeros)
The Rank Transformation
The foundation of all non-parametric tests is replacing values with ranks:
| Patient | Abundance | Rank |
|---|---|---|
| P1 | 0.1% | 1 |
| P2 | 0.3% | 2 |
| P3 | 0.8% | 3 |
| P4 | 1.5% | 4 |
| P5 | 3.2% | 5 |
| P6 | 45.0% | 6 |
Notice: the outlier (45%) gets rank 6 — just one rank above 3.2%. Its extreme value no longer dominates the analysis.
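The rank transformation above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library routine: the helper name `average_ranks` is hypothetical, and it assumes the common convention that tied values share the average of the ranks they span (which matters for microbiome data with many zeros).

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    sorted_vals = sorted(values)
    # A value first appears at 0-based position index(v) and occupies count(v)
    # consecutive positions, so its mean 1-based rank is index + (count + 1) / 2.
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]


abundance = [0.1, 0.3, 0.8, 1.5, 3.2, 45.0]   # values from the table above
print(average_ranks(abundance))               # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Two zeros tie for ranks 1 and 2, so each receives 1.5
print(average_ranks([0.0, 0.1, 0.0, 0.3]))    # [1.5, 3.0, 1.5, 4.0]
```

Replacing 45.0 with 4.5 or 4500 would leave the ranks unchanged, which is exactly why the outlier loses its leverage.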
Wilcoxon Rank-Sum Test (Mann-Whitney U)
The non-parametric counterpart of the independent two-sample t-test.
Procedure:
- Combine all observations and rank them 1 through N
- Sum the ranks in each group separately
- If one group consistently has higher ranks, the rank sum will be extreme
- Compare to the expected rank sum under H0 (no difference)
H0: The two groups have identical distributions
H1: One group tends to have larger values
Common pitfall: The Wilcoxon rank-sum and Mann-Whitney U are the same test, just computed differently: U = W - n1(n1+1)/2, where W is the rank sum of group 1 and n1 is its sample size. Different software uses different names and may report either statistic, but the p-value is identical.
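To make the procedure and the W-to-U relationship concrete, here is a small from-scratch sketch in plain Python. The function names are hypothetical, ties are handled by average ranks (an assumption), and only the test statistics are computed, not the p-value. The data are the first five values of each group from the microbiome example.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]


def rank_sum_and_u(group1, group2):
    """Return (W1, U1, U2): rank sum of group 1 and both U statistics."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))  # rank the pooled data
    w1 = sum(ranks[:n1])                                # rank sum, group 1
    w2 = sum(ranks[n1:])                                # rank sum, group 2
    u1 = w1 - n1 * (n1 + 1) / 2                         # U = W - n(n+1)/2
    u2 = w2 - n2 * (n2 + 1) / 2
    return w1, u1, u2


ibd = [0.1, 0.3, 0.0, 0.8, 0.2]
healthy = [2.1, 5.4, 1.8, 8.2, 3.5]

w1, u1, u2 = rank_sum_and_u(ibd, healthy)
expected_w1 = len(ibd) * (len(ibd) + len(healthy) + 1) / 2  # rank sum under H0
print(w1, expected_w1)   # 15.0 27.5
print(u1, u2)            # 0.0 25.0
```

Here every IBD value ranks below every healthy value, so W1 takes its smallest possible value and U1 is 0. The two U statistics always sum to n1 × n2, which is a handy sanity check.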
Wilcoxon Signed-Rank Test
The non-parametric counterpart of the paired t-test.
Procedure:
- Compute the difference for each pair
- Rank the absolute differences (ignoring zeros)
- Sum ranks of positive differences (W+) and negative differences (W-)
- If the treatment consistently increases (or decreases), one sum will dominate
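The four steps above can be sketched as follows. This is an illustrative plain-Python version with hypothetical function names; it drops zero differences and uses average ranks for ties, and it stops at the rank sums W+ and W- rather than computing a p-value. The data are the IL-6 measurements used later in this chapter.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]


def signed_rank_sums(x, y):
    """Return (W+, W-) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]      # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])       # rank by magnitude
    w_plus = sum((r for r, d in zip(ranks, diffs) if d > 0), 0.0)
    w_minus = sum((r for r, d in zip(ranks, diffs) if d < 0), 0.0)
    return w_plus, w_minus


before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
after = [120, 12, 340, 22, 28, 450, 35, 65, 10, 210, 42, 890]
print(signed_rank_sums(before, after))             # (78.0, 0.0)

# A mixed example: differences are +1, -2, 0 (dropped), +4
print(signed_rank_sums([5, 3, 8, 6], [4, 5, 8, 2]))  # (4.0, 2.0)
```

In the IL-6 data every difference is positive, so W+ equals the maximum possible value n(n+1)/2 = 78 and W- is 0.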
Sign Test
Even simpler than Wilcoxon signed-rank — only considers the direction of differences, not their magnitude.
Procedure:
- For each pair, note whether the difference is positive, negative, or zero
- Count positives and negatives (discard zeros)
- Under H0, positives and negatives should be equally likely (binomial test with p = 0.5)
The sign test has less power than Wilcoxon signed-rank but makes even fewer assumptions.
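As a rough sketch, the sign test reduces to an exact binomial calculation that needs only the standard library. The function name `sign_test` is hypothetical; the two-sided p-value is taken as twice the tail probability of the more frequent sign, capped at 1, which is valid because the Binomial(n, 0.5) distribution is symmetric.

```python
from math import comb


def sign_test(x, y):
    """Return (n_pos, n_nonzero, two-sided exact p-value) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # discard zero differences
    n = len(diffs)
    n_pos = sum(1 for d in diffs if d > 0)
    k = max(n_pos, n - n_pos)                         # count of the majority sign
    # P(X >= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return n_pos, n, min(1.0, 2 * tail)


before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
after = [120, 12, 340, 22, 28, 450, 35, 65, 10, 210, 42, 890]

n_pos, n, p = sign_test(before, after)
print(n_pos, n, p)   # 12 12 0.00048828125
```

With all 12 differences positive, the p-value is 2 × 0.5^12. Note that the magnitudes of the reductions play no role here, which is the source of both the robustness and the lost power.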
Kruskal-Wallis Test
The non-parametric counterpart of one-way ANOVA, for comparing three or more groups.
H0: All groups have the same distribution
H1: At least one group differs
If significant, follow up with pairwise Wilcoxon tests (with multiple testing correction).
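For intuition, the Kruskal-Wallis statistic can be computed by hand from the pooled ranks using the standard formula H = 12 / (N(N+1)) × Σ(R_j² / n_j) − 3(N+1), where R_j is the rank sum of group j. The sketch below is a plain-Python illustration with hypothetical function names; it omits the tie correction that statistical packages apply, so it matches them only when there are no ties. Under H0, H is compared to a chi-square distribution with k − 1 degrees of freedom.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)          # rank all observations together
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])   # R_j for this group
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Shannon diversity values used in the BioLang example later in this chapter
ileum = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4]
cecum = [2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2]
rectum = [3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0]

h = kruskal_wallis_h(ileum, cecum, rectum)
print(round(h, 3))   # 25.806
```

The three groups do not overlap at all, so their rank sums are 55, 155, and 255, the most extreme arrangement possible for three groups of ten.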
Kolmogorov-Smirnov (KS) Test
Compares two entire distributions, not just their centers. Detects differences in shape, spread, or location.
H0: The two samples come from the same distribution
H1: The distributions differ in any way
Clinical relevance: The KS test is useful when you suspect groups differ not just in average abundance, but in the entire pattern of their distribution — for example, one group might be bimodal while the other is unimodal.
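The KS statistic D is the largest vertical gap between the two empirical cumulative distribution functions. A minimal plain-Python sketch follows; the function name `ks_statistic` is hypothetical, the implementation favors clarity over speed, and it computes only D, not the p-value.

```python
def ks_statistic(sample1, sample2):
    """Two-sample KS statistic: max |ECDF1(x) - ECDF2(x)| over all observed x."""
    n1, n2 = len(sample1), len(sample2)
    d = 0.0
    for x in sorted(set(sample1) | set(sample2)):
        f1 = sum(1 for v in sample1 if v <= x) / n1   # ECDF of sample 1 at x
        f2 = sum(1 for v in sample2 if v <= x) / n2   # ECDF of sample 2 at x
        d = max(d, abs(f1 - f2))
    return d


# Partially overlapping samples: the ECDFs are at most 0.5 apart
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))   # 0.5

# Completely separated samples: one ECDF reaches 1 before the other leaves 0
print(ks_statistic([1.5, 2.1, 4.2], [5.2, 8.1, 12.4]))   # 1.0
```

Because D looks at the whole cumulative curve, it responds to differences in spread or shape as well as location, at the cost of less power than a rank-sum test when the only difference is a shift.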
Decision Guide: Parametric vs Non-Parametric
| Comparison | Parametric | Non-Parametric |
|---|---|---|
| One sample vs known value | One-sample t-test | Wilcoxon signed-rank (one sample) |
| Two independent groups | Welch’s t-test | Mann-Whitney U / Wilcoxon rank-sum |
| Two paired groups | Paired t-test | Wilcoxon signed-rank |
| Three+ independent groups | One-way ANOVA | Kruskal-Wallis |
| Three+ paired groups | Repeated measures ANOVA | Friedman test |
| Compare distributions | — | KS test |
Non-Parametric Tests in BioLang
Mann-Whitney U: Microbiome Abundance
# F. prausnitzii relative abundance (%) in IBD vs healthy
let ibd = [0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
3.2, 0.5, 0.1, 0.7, 0.0]
let healthy = [2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
7.1, 3.8, 9.5, 4.2, 6.8]
# First, demonstrate why t-test is inappropriate
# Check normality visually — both distributions are right-skewed
qq_plot(ibd, {title: "QQ Plot: IBD"})
qq_plot(healthy, {title: "QQ Plot: Healthy"})
print("Both groups are heavily skewed — normality violated!\n")
# Mann-Whitney U test (non-parametric)
let result = wilcoxon(ibd, healthy)
print("=== Mann-Whitney U Test ===")
print("U statistic: {result.statistic:.1}")
print("p-value: {result.p_value:.2e}")
# Compare to (inappropriate) t-test
let t_result = ttest(ibd, healthy)
print("\n(Inappropriate) Welch's t-test p-value: {t_result.p_value:.2e}")
print("Mann-Whitney p-value: {result.p_value:.2e}")
print("Results may differ substantially with skewed data")
# Visualize the skewed distributions
let bp_table = table({"IBD": ibd, "Healthy": healthy})
boxplot(bp_table, {title: "F. prausnitzii Abundance"})
Wilcoxon Signed-Rank: Paired Treatment Data
# Inflammatory cytokine IL-6 (pg/mL) before and after anti-TNF therapy
# Same 12 patients measured twice — highly skewed cytokine data
let before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
let after = [120, 12, 340, 22, 28, 450, 35, 65, 10, 210, 42, 890]
# Normality check on differences
let diffs = zip(before, after) |> map(|p| p[0] - p[1])
qq_plot(diffs, {title: "QQ Plot: Paired Differences"})
print("Differences are non-normal -> use Wilcoxon signed-rank\n")
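# Wilcoxon signed-rank test
# NOTE: this is the same wilcoxon(x, y) call used for the unpaired Mann-Whitney
# example above; confirm that it runs the paired (signed-rank) version here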
let result = wilcoxon(before, after)
print("=== Wilcoxon Signed-Rank Test ===")
print("V statistic: {result.statistic:.1}")
print("p-value: {result.p_value:.6}")
print("All 12 patients showed reduction in IL-6")
# For comparison: the sign test via dbinom (even more robust, less powerful)
# Count how many differences are positive
let n_pos = diffs |> filter(|d| d > 0) |> len()
let n_nonzero = diffs |> filter(|d| d != 0) |> len()
# Under H0, n_pos ~ Binomial(n_nonzero, 0.5)
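# Upper tail: P(X >= n_pos)
let upper = 0.0
for k in range(n_pos, n_nonzero + 1) {
    upper = upper + dbinom(k, n_nonzero, 0.5)
}
# Lower tail: P(X <= n_pos)
let lower = 0.0
for k in range(0, n_pos + 1) {
    lower = lower + dbinom(k, n_nonzero, 0.5)
}
# Two-tailed p-value: double the smaller tail, capped at 1
let sign_p = min(2.0 * min(upper, lower), 1.0)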
print("\nSign test p-value: {sign_p:.6}")
Kruskal-Wallis: Multiple Body Sites
# Bacterial diversity (Shannon index) across three gut regions
let ileum = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4]
let cecum = [2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2]
let rectum = [3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0]
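# Kruskal-Wallis works on the joint ranks of all observations, not the raw values.
# NOTE: the call below passes raw values. Unless anova() ranks its input internally
# (not confirmed here), rank-transform the pooled data first; otherwise this is an
# ordinary one-way ANOVA and the reported statistic is F, not H.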
let result = anova([ileum, cecum, rectum])
print("=== Kruskal-Wallis Test: Diversity Across Body Sites ===")
print("H statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Degrees of freedom: {result.df}")
if result.p_value < 0.05 {
    print("\nAt least one body site differs. Running pairwise comparisons...")
    let p1 = wilcoxon(ileum, cecum).p_value
    let p2 = wilcoxon(ileum, rectum).p_value
    let p3 = wilcoxon(cecum, rectum).p_value
    # Bonferroni correction for 3 comparisons
    let adjusted = p_adjust([p1, p2, p3], "bonferroni")
    print("Ileum vs Cecum: p = {adjusted[0]:.4}")
    print("Ileum vs Rectum: p = {adjusted[1]:.4}")
    print("Cecum vs Rectum: p = {adjusted[2]:.4}")
}
let bp_table = table({"Ileum": ileum, "Cecum": cecum, "Rectum": rectum})
boxplot(bp_table, {title: "Microbial Diversity by Gut Region"})
KS Test: Comparing Distributions
# Do tumor suppressor genes and oncogenes have different
# expression distributions (not just different means)?
let tumor_suppressors = [2.1, 3.4, 1.8, 4.2, 2.9, 3.1, 2.5, 3.8, 1.5, 4.0,
2.7, 3.3, 2.0, 3.6, 2.3, 3.9, 1.9, 4.1, 2.6, 3.5]
let oncogenes = [5.2, 8.1, 6.3, 12.4, 7.5, 5.8, 9.2, 6.7, 11.3, 7.0,
5.5, 8.8, 6.1, 10.5, 7.3, 5.9, 9.7, 6.5, 11.8, 7.8]
let result = ks_test(tumor_suppressors, oncogenes)
print("=== KS Test: Expression Distributions ===")
print("D statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Maximum distance between cumulative distributions: {result.statistic:.4}")
histogram([tumor_suppressors, oncogenes], {labels: ["Tumor Suppressors", "Oncogenes"], title: "Expression Distributions by Gene Class", x_label: "Expression (log2-CPM)", bins: 12})
Comparing t-Test vs Wilcoxon on the Same Data
set_seed(42)
# Demonstrate: with normal data, both tests agree
# With skewed data, they can disagree
print("=== Normal Data: Both Tests Agree ===")
let norm_a = rnorm(20, 5.0, 1.0)
let norm_b = rnorm(20, 6.0, 1.0)
let t_p = ttest(norm_a, norm_b).p_value
let w_p = wilcoxon(norm_a, norm_b).p_value
print("t-test p = {t_p:.4}, Mann-Whitney p = {w_p:.4}")
print("\n=== Skewed Data with Outlier: Tests May Disagree ===")
let skew_a = [1.2, 1.5, 1.8, 1.1, 1.4, 1.6, 1.3, 1.7, 1.9, 50.0]
let skew_b = [2.1, 2.3, 2.5, 2.0, 2.4, 2.2, 2.6, 2.1, 2.3, 2.5]
let t_p2 = ttest(skew_a, skew_b).p_value
let w_p2 = wilcoxon(skew_a, skew_b).p_value
print("t-test p = {t_p2:.4}, Mann-Whitney p = {w_p2:.4}")
print("The outlier inflates group A's mean and variance, so the t-test misses the consistent shift that the ranks reveal")
Python:
from scipy import stats
ibd = [0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
3.2, 0.5, 0.1, 0.7, 0.0]
healthy = [2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
7.1, 3.8, 9.5, 4.2, 6.8]
# Mann-Whitney U
u, p = stats.mannwhitneyu(ibd, healthy, alternative='two-sided')
print(f"U = {u}, p = {p:.2e}")
# Wilcoxon signed-rank (paired)
before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
after = [120, 12, 340, 22, 28, 450, 35, 65, 10, 210, 42, 890]
w, p = stats.wilcoxon(before, after)
print(f"W = {w}, p = {p:.6f}")
# Kruskal-Wallis
ileum = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4]
cecum = [2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2]
rectum = [3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0]
h, p = stats.kruskal(ileum, cecum, rectum)
print(f"H = {h:.4f}, p = {p:.2e}")
R:
ibd <- c(0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
3.2, 0.5, 0.1, 0.7, 0.0)
healthy <- c(2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
7.1, 3.8, 9.5, 4.2, 6.8)
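# before, after, ileum, cecum, rectum, tumor_suppressors and oncogenes must be
# defined first with c(...), using the same values as in the examples above
wilcox.test(ibd, healthy) # Mann-Whitney
wilcox.test(before, after, paired = TRUE) # Wilcoxon signed-rank
kruskal.test(list(ileum, cecum, rectum)) # Kruskal-Wallis
ks.test(tumor_suppressors, oncogenes) # KS test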
Exercises
Exercise 1: Choose the Right Test
For each dataset, decide whether a parametric or non-parametric test is more appropriate:
a) Pain scores (0-10 scale) in drug vs placebo groups
b) Blood pressure measurements in 30 patients (continuous, approximately normal)
c) Number of bacterial colonies per plate (many zeros, some very high counts)
d) Survival time in days (typically right-skewed)
Exercise 2: Microbiome Comparison
Two diets were compared for their effect on Bacteroides abundance:
let high_fiber = [8.2, 12.5, 6.3, 15.1, 9.8, 22.4, 7.5, 11.2, 14.8, 5.9]
let low_fiber = [1.2, 0.5, 3.1, 0.8, 2.4, 0.3, 1.8, 0.9, 2.7, 0.6]
# TODO: Test normality of each group
# TODO: Run Mann-Whitney U test
# TODO: Also run a t-test and compare results
# TODO: Create a boxplot
Exercise 3: Multiple Body Sites with Post-Hoc
OTU richness from four body sites (oral, gut, skin, vaginal). Run Kruskal-Wallis and, if significant, perform all pairwise comparisons with Bonferroni correction.
let oral = [120, 95, 145, 110, 88, 132, 105, 98, 140, 115]
let gut = [350, 420, 280, 390, 310, 445, 360, 295, 410, 380]
let skin = [180, 210, 165, 195, 220, 175, 200, 185, 230, 190]
let vaginal = [45, 30, 55, 38, 25, 50, 42, 35, 48, 28]
# TODO: Kruskal-Wallis test
# TODO: If significant, pairwise Mann-Whitney with Bonferroni correction
# TODO: Which sites differ from which?
Exercise 4: The Power Trade-Off
Generate 1000 simulations where both groups are truly normal with different means. Compare how often the t-test and Mann-Whitney detect the difference (power). Then repeat with skewed data (e.g., exponential).
# TODO: Simulate normal data, compare t-test vs Mann-Whitney power
# TODO: Simulate skewed data, compare again
# TODO: Which test wins in each scenario?
Key Takeaways
- Non-parametric tests use ranks instead of raw values, making them robust to skewness and outliers
- The Mann-Whitney U (Wilcoxon rank-sum) is the non-parametric alternative to the independent t-test
- The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test
- The Kruskal-Wallis test extends to three or more groups (non-parametric ANOVA)
- The KS test compares entire distributions, not just central tendency
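- Rank-based tests are about 95% as efficient as their parametric counterparts when data ARE normal (the asymptotic relative efficiency of the Wilcoxon tests versus the t-test is 3/π ≈ 0.955), but are far more reliable when data are NOT normal
- Microbiome data, cytokine levels, survival times, and ordinal scales are typically skewed or non-numeric in scale and often call for non-parametric methods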
- Always check normality first (Shapiro-Wilk, QQ plots) — let the data guide your choice of test
What’s Next
Most of our tests so far compare two groups, and Kruskal-Wallis gave only a first look at comparing more. What if you have three, four, or ten groups, such as different drug doses, tissue types, or experimental conditions? Running all pairwise t-tests inflates false positives dramatically. Tomorrow we introduce ANOVA, the principled way to compare many groups simultaneously, along with post-hoc tests that identify which groups differ.