Day 9: When Normality Fails — Non-Parametric Tests

The Problem

Dr. Maria Gonzalez studies the gut microbiome in inflammatory bowel disease (IBD). She has 16S rRNA sequencing data from 15 IBD patients and 15 healthy controls, measuring the relative abundance of Faecalibacterium prausnitzii, a key anti-inflammatory bacterium. Looking at the data, she sees a mess: most values cluster near zero, a few patients have moderate levels, and one healthy individual has an enormous abundance of 45%. The histogram looks nothing like a bell curve — it is right-skewed with a long tail.

She runs a Shapiro-Wilk test on each group: both return p < 0.001, firmly rejecting normality. The t-test assumes normally distributed data, and with only 15 observations per group she cannot count on large-sample behavior to rescue it. With data this skewed, the t-test’s p-value could be wildly inaccurate, either too liberal or too conservative depending on the specific pattern. She needs tests that do not assume any particular shape for the distribution.

These are non-parametric tests: methods that operate on the ranks of data rather than the raw values, making them robust to skewness, outliers, and any distributional shape.

What Are Non-Parametric Tests?

Imagine you are judging a cooking competition. A parametric judge scores each dish on a precise 1-100 scale and compares average scores. A non-parametric judge simply ranks the dishes from best to worst — first place, second place, third place. The ranking approach is less precise when scores are reliable, but it is far more robust when one judge has an eccentric scoring system.

Non-parametric tests replace raw data values with their ranks (1st smallest, 2nd smallest, …) and then analyze the ranks. This has powerful consequences:

Property	Parametric (t-test)	Non-parametric (rank-based)
Assumes normality	Yes	No
Sensitive to outliers	Very	Resistant
Uses raw values	Yes	Uses ranks
Power (normal data)	Highest	Slightly lower (~95% relative efficiency)
Power (non-normal data)	Unreliable	Reliable
Handles ordinal data	No	Yes

Key insight: Non-parametric tests are not “worse” versions of parametric tests. They are the correct choice when distributional assumptions are violated. Using a t-test on heavily skewed data is like measuring temperature with a ruler — you might get a number, but it doesn’t mean anything.

When to Choose Non-Parametric

Use non-parametric tests when:

Shapiro-Wilk rejects normality (p < 0.05) and sample size is small
Data are ordinal (pain scale 1-10, tumor grade I-IV)
Data have heavy outliers that cannot be removed
Sample sizes are very small (n < 10 per group)
Data are bounded or have floor/ceiling effects (many zeros)

The Rank Transformation

The foundation of all non-parametric tests is replacing values with ranks:

Patient	Abundance	Rank
P1	0.1%	1
P2	0.3%	2
P3	0.8%	3
P4	1.5%	4
P5	3.2%	5
P6	45.0%	6

Notice: the outlier (45%) gets rank 6 — just one rank above 3.2%. Its extreme value no longer dominates the analysis.
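
The rank transformation above can be sketched in a few lines of plain Python. The helper name `midranks` is hypothetical (it is not a BioLang or SciPy function), and the tie rule it uses, giving tied values the average of the positions they occupy, is the usual convention rather than something this chapter specifies.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions.

    Hypothetical helper for illustration. It is O(n^2), which is fine for
    the small samples used in this chapter.
    """
    s = sorted(values)
    # A value first appears at 0-based index f in the sorted list and occurs
    # c times, so it occupies 1-based positions f+1 .. f+c, whose average
    # is f + (c + 1) / 2.
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


abundance = [0.1, 0.3, 0.8, 1.5, 3.2, 45.0]  # P1..P6 from the table above
print(midranks(abundance))        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(midranks([0.0, 0.0, 0.1]))  # [1.5, 1.5, 3.0]
```

The second call matters for microbiome data in particular: with many zero abundances, ties are common, and every tied zero receives the same averaged rank.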

Wilcoxon Rank-Sum Test (Mann-Whitney U)

The non-parametric counterpart of the independent two-sample t-test.

Procedure:

Combine all observations and rank them 1 through N
Sum the ranks in each group separately
If one group consistently has higher ranks, the rank sum will be extreme
Compare to the expected rank sum under H0 (no difference)

H0: The two groups have identical distributions
H1: One group tends to have larger values

Common pitfall: The Wilcoxon rank-sum and Mann-Whitney U are the same test, just computed differently. U = W - n1(n1+1)/2. Different software uses different names, but the p-value is identical.
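
As a rough illustration of the procedure and of the identity U = W - n1(n1+1)/2, here is a minimal standard-library sketch. The function names `midranks` and `rank_sum_and_u` are hypothetical, the six values are simply the first three of each group from the chapter's dataset, and no p-value is computed; in practice a library routine would handle that step.

```python
def midranks(values):
    """1-based ranks with ties averaged (hypothetical helper, O(n^2))."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def rank_sum_and_u(x, y):
    """Return (W, U) for group x: its rank sum and its Mann-Whitney U."""
    ranks = midranks(list(x) + list(y))   # rank the pooled observations
    w = sum(ranks[:len(x)])               # rank sum of the first group
    u = w - len(x) * (len(x) + 1) / 2     # subtract the smallest possible W
    return w, u


ibd_subset = [0.1, 0.3, 0.0]
healthy_subset = [2.1, 5.4, 1.8]

w, u = rank_sum_and_u(ibd_subset, healthy_subset)
print(w, u)  # 6.0 0.0
```

U = 0 here because every IBD value lies below every healthy value, so the IBD group holds the three lowest ranks. Calling the function with the groups swapped gives U = 9, and the two U values always add up to n1 * n2.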

Wilcoxon Signed-Rank Test

The non-parametric counterpart of the paired t-test.

Procedure:

Compute the difference for each pair
Rank the absolute differences (ignoring zeros)
Sum ranks of positive differences (W+) and negative differences (W-)
If the treatment consistently increases (or decreases), one sum will dominate
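
The four steps can be sketched as follows, using only the standard library. The names `midranks` and `signed_rank_sums` are hypothetical, and the five before/after pairs are invented toy numbers chosen so that both signs and one zero difference appear; turning W+ and W- into a p-value is left to a statistics library.

```python
def midranks(values):
    """1-based ranks with ties averaged (hypothetical helper, O(n^2))."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def signed_rank_sums(before, after):
    """Return (W+, W-) for paired samples, dropping zero differences."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    ranks = midranks([abs(d) for d in diffs])   # rank the absolute differences
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Invented toy data: differences are 2, -1, 0, 5, 4
toy_before = [10, 12, 9, 15, 11]
toy_after = [8, 13, 9, 10, 7]

print(signed_rank_sums(toy_before, toy_after))  # (9.0, 1.0)
```

The zero difference is discarded, leaving n = 4 pairs, so W+ and W- must sum to n(n+1)/2 = 10. A split of 9 versus 1 is the kind of imbalance step 4 describes.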

Sign Test

Even simpler than Wilcoxon signed-rank — only considers the direction of differences, not their magnitude.

Procedure:

For each pair, note whether the difference is positive, negative, or zero
Count positives and negatives (discard zeros)
Under H0, positives and negatives should be equally likely (binomial test with p = 0.5)

The sign test has less power than Wilcoxon signed-rank but makes even fewer assumptions.
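
Because the sign test reduces to a binomial calculation, an exact two-sided p-value takes only a few lines. This is a sketch under stated assumptions: `sign_test` is a hypothetical name, and the two-sided p-value is taken as twice the smaller tail, capped at 1, which is one common convention.

```python
from math import comb


def sign_test(diffs):
    """Exact two-sided sign test p-value; zero differences are discarded."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 0.5): the tail at least as extreme
    # as the rarer sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test([2, -1, 0, 5, 4]))  # 0.625
print(sign_test([1] * 12))          # 0.00048828125
```

The first call has three positives and one negative out of four non-zero differences, far too few pairs to conclude anything. The second shows that twelve differences all pointing the same way give p = 2 / 2^12, regardless of how large or small those differences are.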

Kruskal-Wallis Test

The non-parametric counterpart of one-way ANOVA, for comparing three or more groups.

H0: All groups have the same distribution
H1: At least one group differs

If significant, follow up with pairwise Wilcoxon tests (with multiple testing correction).
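
The chapter does not spell out the test statistic, so the sketch below uses the standard textbook form H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), where R_i is the rank sum of group i and N is the total sample size. The names `midranks` and `kruskal_h` are hypothetical, the toy groups are invented, and the tie correction that library implementations apply is deliberately omitted.

```python
def midranks(values):
    """1-based ranks with ties averaged (hypothetical helper, O(n^2))."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, without the correction for ties."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = midranks(pooled)              # rank all observations together
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Invented toy groups: completely separated vs. evenly interleaved
separated = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
interleaved = kruskal_h([1, 4, 7], [2, 5, 8], [3, 6, 9])
print(round(separated, 3), round(interleaved, 3))  # 7.2 0.8
```

Under H0, H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom. The separated groups give a large H and the interleaved groups a small one, although with only three observations per group the chi-square approximation is rough.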

Kolmogorov-Smirnov (KS) Test

Compares two entire distributions, not just their centers. Detects differences in shape, spread, or location.

H0: The two samples come from the same distribution
H1: The distributions differ in any way

Clinical relevance: The KS test is useful when you suspect groups differ not just in average abundance, but in the entire pattern of their distribution — for example, one group might be bimodal while the other is unimodal.
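
The KS statistic D is the largest vertical gap between the two empirical cumulative distribution functions (ECDFs). A minimal sketch of that calculation follows; `ks_statistic` is a hypothetical name, the samples are invented toy values, and the p-value step is omitted.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: the maximum gap between the two ECDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)   # fraction of x that is <= v
        fy = bisect_right(ys, v) / len(ys)   # fraction of y that is <= v
        d = max(d, abs(fx - fy))
    return d


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
print(ks_statistic([1, 2], [3, 4]))              # 1.0
```

D ranges from 0 (identical ECDFs) to 1 (no overlap at all). Because it looks at the whole curve, it can pick up a difference in spread or shape even when the two groups share the same median.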

Decision Guide: Parametric vs Non-Parametric

Comparison	Parametric	Non-Parametric
One sample vs known value	One-sample t-test	Wilcoxon signed-rank (one sample)
Two independent groups	Welch’s t-test	Mann-Whitney U / Wilcoxon rank-sum
Two paired groups	Paired t-test	Wilcoxon signed-rank
Three+ independent groups	One-way ANOVA	Kruskal-Wallis
Three+ paired groups	Repeated measures ANOVA	Friedman test
Compare distributions	—	KS test

Non-Parametric Tests in BioLang

Mann-Whitney U: Microbiome Abundance

# F. prausnitzii relative abundance (%) in IBD vs healthy
let ibd = [0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
           3.2, 0.5, 0.1, 0.7, 0.0]
let healthy = [2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
               7.1, 3.8, 9.5, 4.2, 6.8]

# First, demonstrate why t-test is inappropriate
# Check normality visually — both distributions are right-skewed
qq_plot(ibd, {title: "QQ Plot: IBD"})
qq_plot(healthy, {title: "QQ Plot: Healthy"})
print("Both groups are heavily skewed — normality violated!\n")

# Mann-Whitney U test (non-parametric)
let result = wilcoxon(ibd, healthy)
print("=== Mann-Whitney U Test ===")
print("U statistic: {result.statistic:.1}")
print("p-value: {result.p_value:.2e}")

# Compare to (inappropriate) t-test
let t_result = ttest(ibd, healthy)
print("\n(Inappropriate) Welch's t-test p-value: {t_result.p_value:.2e}")
print("Mann-Whitney p-value: {result.p_value:.2e}")
print("Results may differ substantially with skewed data")

# Visualize the skewed distributions
let bp_table = table({"IBD": ibd, "Healthy": healthy})
boxplot(bp_table, {title: "F. prausnitzii Abundance"})

Wilcoxon Signed-Rank: Paired Treatment Data

# Inflammatory cytokine IL-6 (pg/mL) before and after anti-TNF therapy
# Same 12 patients measured twice — highly skewed cytokine data
let before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
let after  = [120, 12, 340, 22, 28, 450,  35,  65, 10, 210, 42,  890]

# Normality check on differences
let diffs = zip(before, after) |> map(|p| p[0] - p[1])
qq_plot(diffs, {title: "QQ Plot: Paired Differences"})
print("Differences are non-normal -> use Wilcoxon signed-rank\n")

# Wilcoxon signed-rank test
let result = wilcoxon(before, after)
print("=== Wilcoxon Signed-Rank Test ===")
print("V statistic: {result.statistic:.1}")
print("p-value: {result.p_value:.6}")
print("All 12 patients showed reduction in IL-6")

# For comparison: the sign test via dbinom (even more robust, less powerful)
# Count how many differences are positive
let n_pos = diffs |> filter(|d| d > 0) |> len()
let n_nonzero = diffs |> filter(|d| d != 0) |> len()
# Under H0, n_pos ~ Binomial(n_nonzero, 0.5)
# Two-sided exact p-value: double the smaller tail, capped at 1
let k_min = min(n_pos, n_nonzero - n_pos)
let tail = 0.0
for k in range(0, k_min + 1) {
    tail = tail + dbinom(k, n_nonzero, 0.5)
}
let sign_p = min(1.0, 2.0 * tail)
print("\nSign test p-value: {sign_p:.6}")

Kruskal-Wallis: Multiple Body Sites

# Bacterial diversity (Shannon index) across three gut regions
let ileum   = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4]
let cecum   = [2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2]
let rectum  = [3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0]

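# Kruskal-Wallis is a one-way ANOVA carried out on pooled ranks, not raw values.
# Caution: the call below passes the raw values. Unless anova() ranks its input
# internally, rank the pooled observations first; otherwise this is an ordinary F-test.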
let result = anova([ileum, cecum, rectum])
print("=== Kruskal-Wallis Test: Diversity Across Body Sites ===")
print("H statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Degrees of freedom: {result.df}")

if result.p_value < 0.05 {
  print("\nAt least one body site differs. Running pairwise comparisons...")

  let p1 = wilcoxon(ileum, cecum).p_value
  let p2 = wilcoxon(ileum, rectum).p_value
  let p3 = wilcoxon(cecum, rectum).p_value

  # Bonferroni correction for 3 comparisons
  let adjusted = p_adjust([p1, p2, p3], "bonferroni")
  print("Ileum vs Cecum:  p = {adjusted[0]:.4}")
  print("Ileum vs Rectum: p = {adjusted[1]:.4}")
  print("Cecum vs Rectum: p = {adjusted[2]:.4}")
}

let bp_table = table({"Ileum": ileum, "Cecum": cecum, "Rectum": rectum})
boxplot(bp_table, {title: "Microbial Diversity by Gut Region"})

KS Test: Comparing Distributions

# Do tumor suppressor genes and oncogenes have different
# expression distributions (not just different means)?
let tumor_suppressors = [2.1, 3.4, 1.8, 4.2, 2.9, 3.1, 2.5, 3.8, 1.5, 4.0,
                         2.7, 3.3, 2.0, 3.6, 2.3, 3.9, 1.9, 4.1, 2.6, 3.5]
let oncogenes = [5.2, 8.1, 6.3, 12.4, 7.5, 5.8, 9.2, 6.7, 11.3, 7.0,
                 5.5, 8.8, 6.1, 10.5, 7.3, 5.9, 9.7, 6.5, 11.8, 7.8]

let result = ks_test(tumor_suppressors, oncogenes)
print("=== KS Test: Expression Distributions ===")
print("D statistic: {result.statistic:.4}")
print("p-value: {result.p_value:.2e}")
print("Maximum distance between cumulative distributions: {result.statistic:.4}")

histogram([tumor_suppressors, oncogenes], {labels: ["Tumor Suppressors", "Oncogenes"], title: "Expression Distributions by Gene Class", x_label: "Expression (log2-CPM)", bins: 12})

Comparing t-Test vs Wilcoxon on the Same Data

set_seed(42)
# Demonstrate: with normal data, both tests agree
# With skewed data, they can disagree

print("=== Normal Data: Both Tests Agree ===")
let norm_a = rnorm(20, 5.0, 1.0)
let norm_b = rnorm(20, 6.0, 1.0)
let t_p = ttest(norm_a, norm_b).p_value
let w_p = wilcoxon(norm_a, norm_b).p_value
print("t-test p = {t_p:.4}, Mann-Whitney p = {w_p:.4}")

print("\n=== Skewed Data with Outlier: Tests May Disagree ===")
let skew_a = [1.2, 1.5, 1.8, 1.1, 1.4, 1.6, 1.3, 1.7, 1.9, 50.0]
let skew_b = [2.1, 2.3, 2.5, 2.0, 2.4, 2.2, 2.6, 2.1, 2.3, 2.5]
let t_p2 = ttest(skew_a, skew_b).p_value
let w_p2 = wilcoxon(skew_a, skew_b).p_value
print("t-test p = {t_p2:.4}, Mann-Whitney p = {w_p2:.4}")
print("The outlier inflates group A's mean and variance, masking the real pattern")

Python:

from scipy import stats

ibd = [0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
       3.2, 0.5, 0.1, 0.7, 0.0]
healthy = [2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
           7.1, 3.8, 9.5, 4.2, 6.8]

# Mann-Whitney U
u, p = stats.mannwhitneyu(ibd, healthy, alternative='two-sided')
print(f"U = {u}, p = {p:.2e}")

# Wilcoxon signed-rank (paired)
before = [245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100]
after  = [120, 12, 340, 22, 28, 450,  35,  65, 10, 210, 42,  890]
w, p = stats.wilcoxon(before, after)
print(f"W = {w}, p = {p:.6f}")

# Kruskal-Wallis
ileum  = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4]
cecum  = [2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2]
rectum = [3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0]
h, p = stats.kruskal(ileum, cecum, rectum)
print(f"H = {h:.4f}, p = {p:.2e}")

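R:

ibd <- c(0.1, 0.3, 0.0, 0.8, 0.2, 0.0, 1.5, 0.4, 0.1, 0.0,
         3.2, 0.5, 0.1, 0.7, 0.0)
healthy <- c(2.1, 5.4, 1.8, 8.2, 3.5, 12.1, 4.7, 6.3, 2.9, 45.0,
             7.1, 3.8, 9.5, 4.2, 6.8)
before <- c(245, 18, 892, 45, 32, 1250, 67, 128, 15, 543, 78, 2100)
after  <- c(120, 12, 340, 22, 28, 450,  35,  65, 10, 210, 42,  890)
ileum  <- c(1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.4, 0.6, 1.8, 0.4)
cecum  <- c(2.5, 3.1, 2.8, 2.2, 3.4, 2.7, 3.0, 2.3, 2.9, 3.2)
rectum <- c(3.8, 4.2, 3.5, 4.5, 3.9, 4.1, 3.6, 4.3, 3.7, 4.0)
tumor_suppressors <- c(2.1, 3.4, 1.8, 4.2, 2.9, 3.1, 2.5, 3.8, 1.5, 4.0,
                       2.7, 3.3, 2.0, 3.6, 2.3, 3.9, 1.9, 4.1, 2.6, 3.5)
oncogenes <- c(5.2, 8.1, 6.3, 12.4, 7.5, 5.8, 9.2, 6.7, 11.3, 7.0,
               5.5, 8.8, 6.1, 10.5, 7.3, 5.9, 9.7, 6.5, 11.8, 7.8)

wilcox.test(ibd, healthy)           # Mann-Whitney
wilcox.test(before, after, paired = TRUE)  # Wilcoxon signed-rank
kruskal.test(list(ileum, cecum, rectum))   # Kruskal-Wallis
ks.test(tumor_suppressors, oncogenes)      # KS test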

Exercises

Exercise 1: Choose the Right Test

For each dataset, decide whether a parametric or non-parametric test is more appropriate:

a) Pain scores (0-10 scale) in drug vs placebo groups
b) Blood pressure measurements in 30 patients (continuous, approximately normal)
c) Number of bacterial colonies per plate (many zeros, some very high counts)
d) Survival time in days (typically right-skewed)

Exercise 2: Microbiome Comparison

Two diets were compared for their effect on Bacteroides abundance:

let high_fiber = [8.2, 12.5, 6.3, 15.1, 9.8, 22.4, 7.5, 11.2, 14.8, 5.9]
let low_fiber  = [1.2, 0.5, 3.1, 0.8, 2.4, 0.3, 1.8, 0.9, 2.7, 0.6]

# TODO: Test normality of each group
# TODO: Run Mann-Whitney U test
# TODO: Also run a t-test and compare results
# TODO: Create a boxplot

Exercise 3: Multiple Body Sites with Post-Hoc

OTU richness from four body sites (oral, gut, skin, vaginal). Run Kruskal-Wallis and, if significant, perform all pairwise comparisons with Bonferroni correction.

let oral    = [120, 95, 145, 110, 88, 132, 105, 98, 140, 115]
let gut     = [350, 420, 280, 390, 310, 445, 360, 295, 410, 380]
let skin    = [180, 210, 165, 195, 220, 175, 200, 185, 230, 190]
let vaginal = [45, 30, 55, 38, 25, 50, 42, 35, 48, 28]

# TODO: Kruskal-Wallis test
# TODO: If significant, pairwise Mann-Whitney with Bonferroni correction
# TODO: Which sites differ from which?

Exercise 4: The Power Trade-Off

Generate 1000 simulations where both groups are truly normal with different means. Compare how often the t-test and Mann-Whitney detect the difference (power). Then repeat with skewed data (e.g., exponential).


# TODO: Simulate normal data, compare t-test vs Mann-Whitney power
# TODO: Simulate skewed data, compare again
# TODO: Which test wins in each scenario?

Key Takeaways

Non-parametric tests use ranks instead of raw values, making them robust to skewness and outliers
The Mann-Whitney U (Wilcoxon rank-sum) is the non-parametric alternative to the independent t-test
The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test
The Kruskal-Wallis test extends to three or more groups (non-parametric ANOVA)
The KS test compares entire distributions, not just central tendency
Rank-based tests retain roughly 95% of the efficiency of their parametric counterparts when data ARE normal, and are far more reliable when data are NOT normal
Microbiome data, cytokine levels, survival times, and ordinal scales often call for non-parametric methods
Always check normality first (Shapiro-Wilk, QQ plots) — let the data guide your choice of test

What’s Next

So far the parametric tests we have used compare only two groups. But what if you have three, four, or ten groups, such as different drug doses, tissue types, or experimental conditions? Running all pairwise t-tests inflates false positives dramatically. Tomorrow we introduce ANOVA, the principled way to compare many groups simultaneously, along with post-hoc tests that identify which groups differ.

Practical Biostatistics in 30 Days