Day 7: Hypothesis Testing — Asking Precise Questions
The Problem
Dr. Kenji Nakamura has spent three years developing a blood-based biomarker panel for early Alzheimer’s detection. His team measures plasma levels of phosphorylated tau (p-tau217) in 40 cognitively normal individuals and 40 patients with confirmed early-stage Alzheimer’s. The mean p-tau217 level in the Alzheimer’s group is 3.8 pg/mL, compared to 2.9 pg/mL in controls. The difference looks promising — about 31% higher.
But when Dr. Nakamura submits to the FDA for breakthrough device designation, the reviewer’s response is blunt: “Your biomarker shows a numerical difference. Can you demonstrate this isn’t just sampling noise? What is the probability of seeing a difference this large if the biomarker has no real diagnostic value?” This is the fundamental question that hypothesis testing answers.
The stakes are enormous. If the biomarker works, millions of patients could be diagnosed years earlier, when interventions are most effective. If it doesn’t — if the observed difference is just statistical noise — pursuing it wastes hundreds of millions in development costs and, worse, could lead to false diagnoses.
What Is Hypothesis Testing?
Think of hypothesis testing as a courtroom trial for your scientific claim.
- The defendant is the null hypothesis (H0): “There is no effect.” In the courtroom, the defendant is presumed innocent.
- The prosecution’s evidence is your data. You are trying to show the evidence is so overwhelming that the “innocence” explanation is implausible.
- The verdict is either “guilty” (reject H0) or “not proven” (fail to reject H0). Notice: the jury never declares the defendant “innocent” — just that the evidence was insufficient.
Key insight: Hypothesis testing never proves your theory is true. It only tells you whether the data are inconsistent enough with “no effect” that you can reject that explanation with a specified level of confidence.
The Five Steps of Hypothesis Testing
| Step | Description | Alzheimer’s Example |
|---|---|---|
| 1. State H0 and H1 | Define the null and alternative | H0: mu_AD = mu_control; H1: mu_AD > mu_control |
| 2. Choose alpha | Set significance threshold | alpha = 0.05 |
| 3. Compute test statistic | Summarize evidence against H0 | z = (x_bar1 - x_bar2) / SE |
| 4. Find p-value | Probability of seeing this extreme a result under H0 | p = P(Z >= z_obs) |
| 5. Make decision | Compare p to alpha | If p < 0.05, reject H0 |
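The five steps can be sketched end-to-end in Python (a sketch, not the definitive analysis: a one-sample version against the reference mean 2.9 pg/mL with known sigma 1.2, using the illustrative measurements that appear later in this lesson):

```python
import math
from statistics import NormalDist

# Steps 1-2: H0: mu = 2.9, H1: mu > 2.9 (one-tailed), alpha = 0.05
mu0, sigma, alpha = 2.9, 1.2, 0.05

# Illustrative sample: 40 p-tau217 measurements (pg/mL)
ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
             4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
             3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
             3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]

# Step 3: test statistic
n = len(ad_levels)
x_bar = sum(ad_levels) / n
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Step 4: one-tailed p-value, P(Z >= z) under H0
p = 1 - NormalDist().cdf(z)

# Step 5: decision
print(f"z = {z:.2f}, p = {p:.2e}")
print("Reject H0" if p < alpha else "Fail to reject H0")
```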
The Null Hypothesis (H0)
The null hypothesis is the “boring” explanation — the default assumption of no effect, no difference, no relationship. It is what you assume until the data force you to abandon it.
| Research Question | Null Hypothesis (H0) |
|---|---|
| Does the drug reduce tumor size? | Mean tumor size is the same with and without drug |
| Is this SNP associated with diabetes? | Allele frequencies are the same in cases and controls |
| Does expression differ between tissues? | Mean expression is equal in both tissues |
| Is the mutation rate elevated? | Mutation rate equals the background rate |
The Alternative Hypothesis (H1)
The alternative is what you actually believe — the “interesting” claim.
- Two-tailed: H1: mu1 != mu2 (the groups differ in either direction)
- One-tailed: H1: mu1 > mu2 (specifically higher) or H1: mu1 < mu2 (specifically lower)
Common pitfall: Do not choose one-tailed vs two-tailed after looking at your data. This decision must be made before the experiment, based on your scientific question. Switching from two-tailed to one-tailed after seeing the results halves your p-value (when the effect lies in the hypothesized direction) — that is a form of p-hacking, not legitimate analysis.
The p-Value: The Most Misunderstood Number in Science
The p-value is the probability of observing data as extreme as (or more extreme than) what you got, assuming H0 is true.
What the p-value IS:
- A measure of how surprising your data are under the null hypothesis
- A continuous measure of evidence — smaller p = more evidence against H0
- The probability of the data given H0: P(data | H0)
What the p-value IS NOT:
- The probability that H0 is true: NOT P(H0 | data)
- The probability your result is due to chance
- The probability of making an error
- A measure of effect size or practical importance
| p-value | Informal Interpretation |
|---|---|
| p > 0.10 | Little evidence against H0 |
| 0.05 < p < 0.10 | Weak evidence against H0 |
| 0.01 < p < 0.05 | Moderate evidence against H0 |
| 0.001 < p < 0.01 | Strong evidence against H0 |
| p < 0.001 | Very strong evidence against H0 |
Type I and Type II Errors
Every decision carries the risk of being wrong:
| Decision | H0 is True | H0 is False |
|---|---|---|
| Reject H0 | Type I error (false positive), probability = alpha | Correct (true positive), probability = 1 - beta = power |
| Fail to reject H0 | Correct (true negative) | Type II error (false negative), probability = beta |
- Type I error (alpha): You claim a drug works when it doesn’t. A patient receives ineffective treatment.
- Type II error (beta): You miss a real effect. An effective drug gets shelved.
Clinical relevance: In drug safety testing, alpha is typically set very low (0.01 or even 0.001) because a Type I error means approving a dangerous drug. In exploratory genomics, higher alpha (0.05 or even 0.10) is acceptable because you will validate hits in follow-up experiments.
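Beta can be estimated directly by simulation: generate data where a real effect exists, run the test many times, and count how often it fails to reject. A minimal Python sketch (the 0.5 sigma-unit effect and n = 20 are invented for illustration, not from the scenario):

```python
import random
import math
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

# H0: mu = 0, but the true mean is 0.5 — a real effect exists.
sigma, n, alpha, true_mu = 1.0, 20, 0.05, 0.5
z_crit = norm.inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645

misses, trials = 0, 5000
for _ in range(trials):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    if z < z_crit:          # fail to reject H0 despite the real effect
        misses += 1

beta = misses / trials
print(f"Estimated beta: {beta:.3f}, power: {1 - beta:.3f}")
```

For these numbers the analytic power is about 0.72, so the simulation should report beta near 0.28 — a reminder that n = 20 misses a half-sigma effect more than a quarter of the time.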
Statistical vs Practical Significance
A p-value of 0.001 does not mean the effect is large or important. With enough data, trivially small effects become “statistically significant.”
| Scenario | p-value | Effect Size | Verdict |
|---|---|---|---|
| Gene expression differs by 0.01% (n=100,000) | p < 0.001 | Negligible | Statistically significant, practically meaningless |
| Drug reduces tumor by 40% (n=12) | p = 0.03 | Large | Both statistically and practically significant |
| Biomarker differs by 5% (n=20) | p = 0.08 | Moderate | Not significant — but maybe underpowered |
Always report effect sizes alongside p-values. We will dedicate Day 19 entirely to this topic.
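A quick simulation makes the first row of the table concrete (Python sketch; the 0.02-unit effect and the sample sizes are invented for illustration):

```python
import random
import math
from statistics import NormalDist

random.seed(1)
norm = NormalDist()

# True effect is tiny: 0.02 units against sigma = 1.
sigma, true_mu = 1.0, 0.02
results = {}
for n in [100, 10_000, 1_000_000]:
    x_bar = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
    z = x_bar / (sigma / math.sqrt(n))            # test against H0: mu = 0
    p = 2 * (1 - norm.cdf(abs(z)))
    results[n] = p
    print(f"n = {n:>9}: mean = {x_bar:+.4f}, p = {p:.4g}")
```

At n = 1,000,000 the p-value is vanishingly small even though the effect (0.02 units) is practically negligible — significance tracks sample size, not importance.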
The z-Test: The Simplest Hypothesis Test
When the population standard deviation sigma is known (rare, but a good starting point), the z-test compares a sample mean to a known value:
z = (x-bar - mu0) / (sigma / sqrt(n))
Under H0, z follows a standard normal distribution N(0, 1).
One-Tailed vs Two-Tailed Tests
| Test Type | H1 | p-value Calculation | Use When |
|---|---|---|---|
| Two-tailed | mu != mu0 | 2 x P(Z > abs(z)) | You care about differences in either direction |
| Right-tailed | mu > mu0 | P(Z > z) | You only care if the value is higher |
| Left-tailed | mu < mu0 | P(Z < z) | You only care if the value is lower |
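The formula and the table translate directly into code. A hedged Python sketch of a one-sample z-test helper (the function name and signature are my own, not a library API):

```python
import math
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, tail="two"):
    """One-sample z-test with known sigma. Returns (z, p). Sketch only."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    norm = NormalDist()
    if tail == "two":
        p = 2 * (1 - norm.cdf(abs(z)))   # 2 x P(Z > |z|)
    elif tail == "right":
        p = 1 - norm.cdf(z)              # P(Z > z)
    else:  # "left"
        p = norm.cdf(z)                  # P(Z < z)
    return z, p

# Example: sample mean 3.0 vs mu0 = 2.9, sigma = 1.2, n = 40
z, p = z_test(3.0, 2.9, 1.2, 40)
print(f"z = {z:.3f}, two-tailed p = {p:.3f}")
```

Note that for z > 0 the two-tailed p-value is exactly twice the right-tailed one, which is why the choice of tails must precede the data.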
Hypothesis Testing in BioLang
z-Test on Biomarker Data
# Alzheimer's biomarker study
# Known population SD from large reference database: sigma = 1.2 pg/mL
# Expected normal level: mu0 = 2.9 pg/mL
# Observed in 40 Alzheimer's patients:
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]
# Compute z-test manually (known sigma)
let n = len(ad_levels)
let x_bar = mean(ad_levels)
let se = 1.2 / sqrt(n)
let z_stat = (x_bar - 2.9) / se
let p_val = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
print("z-statistic: {z_stat:.4}")
print("p-value (two-tailed): {p_val:.6}")
print("Mean observed: {x_bar:.3} pg/mL")
if p_val < 0.05 {
print("Reject H0: Alzheimer's group significantly differs from normal level")
} else {
print("Fail to reject H0: Insufficient evidence of difference")
}
Visualizing the Null Distribution
# Show where our test statistic falls on the null distribution
let z_obs = 4.14 # from the z-test above
# Generate the null distribution (standard normal)
let x_vals = range(-4.0, 4.0, 0.01)
let y_vals = x_vals |> map(|x| dnorm(x, 0, 1))
# Plot the null distribution with our observed z marked
plot(x_vals, y_vals, {title: "Null Distribution (Standard Normal)", x_label: "z-statistic", y_label: "Density", vlines: [z_obs], shade_above: 1.96, shade_below: -1.96})
print("Critical value (two-tailed, alpha=0.05): +/- 1.96")
print("Our z = {z_obs} falls far in the rejection region")
Binomial Test: Is This Mutation Rate Elevated?
# In a cancer cohort, 18 of 100 patients carry a specific BRCA2 variant
# Population frequency is known to be 8%
# Compute binomial test using dbinom: P(X >= 18) when X ~ Binom(100, 0.08)
let p_val = 0.0
for k in range(18, 101) {
p_val = p_val + dbinom(k, 100, 0.08)
}
print("Observed proportion: 18/100 = 18%")
print("Expected under H0: 8%")
print("p-value (one-sided): {p_val:.6}")
if p_val < 0.05 {
print("Reject H0: Mutation rate is significantly elevated in this cohort")
} else {
print("Fail to reject H0")
}
Complete Hypothesis Test Workflow
# Full workflow: Is mean platelet count elevated in a disease cohort?
# Reference population: mu = 250 (x10^3/uL), sigma = 50
# Our 25 patients:
let platelets = [280, 310, 265, 295, 275, 320, 290, 305, 260, 285,
300, 270, 315, 288, 292, 278, 308, 282, 298, 272,
310, 295, 268, 302, 288]
# Step 1: State hypotheses
print("H0: mu = 250 (platelet count is normal)")
print("H1: mu > 250 (platelet count is elevated)")
print("alpha = 0.05, one-tailed test\n")
# Step 2: Compute test statistic
let n = len(platelets)
let x_bar = mean(platelets)
let se = 50 / sqrt(n) # sigma is known
let z = (x_bar - 250) / se
# Step 3: Find p-value (one-tailed)
let p = 1.0 - pnorm(z, 0, 1)
# Step 4: Decision
print("Sample mean: {x_bar:.1}")
print("z-statistic: {z:.4}")
print("p-value (one-tailed): {p:.6}")
if p < 0.05 {
print("\nDecision: Reject H0 at alpha = 0.05")
print("Conclusion: Platelet count is significantly elevated in this cohort")
} else {
print("\nDecision: Fail to reject H0")
}
Interpreting p-Values with Simulated Data
set_seed(42)
# Demonstrate: under H0 (no effect), p-values are uniformly distributed
let p_values = []
for i in 1..1000 {
# Generate two samples from the SAME distribution (H0 is true)
let group1 = rnorm(20, 10, 3)
let group2 = rnorm(20, 10, 3)
# SE of a difference of two means: sigma * sqrt(1/n1 + 1/n2)
let se_diff = 3.0 * sqrt(1.0/20 + 1.0/20)
let z_stat = (mean(group1) - mean(group2)) / se_diff
let p_val = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
p_values = append(p_values, p_val)
}
# Count false positives at alpha = 0.05
let false_pos = p_values |> filter(|p| p < 0.05) |> len()
print("False positives out of 1000 null tests: {false_pos}")
print("Expected: ~50 (5% of 1000)")
histogram(p_values, {title: "p-Value Distribution Under the Null", x_label: "p-value", bins: 20})
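The same experiment in Python (note that the standard error of a difference of two means is sigma * sqrt(1/n1 + 1/n2), not sigma / sqrt(n); trial count reduced to 2,000 to keep the sketch fast):

```python
import random
import math
from statistics import NormalDist

random.seed(42)
norm = NormalDist()
sigma, n = 3.0, 20
se_diff = sigma * math.sqrt(1/n + 1/n)   # SE of a difference of two means

p_values = []
for _ in range(2000):
    # Both groups from the SAME distribution — H0 is true.
    g1 = [random.gauss(10, sigma) for _ in range(n)]
    g2 = [random.gauss(10, sigma) for _ in range(n)]
    z = (sum(g1)/n - sum(g2)/n) / se_diff
    p_values.append(2 * (1 - norm.cdf(abs(z))))

rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"False-positive rate at alpha = 0.05: {rate:.3f}")
```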
Connecting CIs and Hypothesis Tests
# Demonstrate: a 95% CI that excludes the null value corresponds to p < 0.05
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
# z-test: is the mean different from 2.9?
let n = len(ad_levels)
let x_bar = mean(ad_levels)
let se = 1.2 / sqrt(n)
let z_stat = (x_bar - 2.9) / se
let z_p = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
print("z-test p-value: {z_p:.6}")
# 95% CI for the mean (using known sigma)
let se2 = 1.2 / sqrt(n)
let ci_lower = x_bar - 1.96 * se2
let ci_upper = x_bar + 1.96 * se2
print("95% CI: [{ci_lower:.3}, {ci_upper:.3}]")
if ci_lower > 2.9 or ci_upper < 2.9 {
print("Null value (2.9) is outside the CI")
} else {
print("Null value (2.9) is inside the CI")
}
print("This matches the hypothesis test: p < 0.05 <=> CI excludes null value")
Python:
import numpy as np
from scipy import stats
ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]
# z-test (manual — scipy has no built-in z-test for means; statsmodels.stats.weightstats.ztest provides one)
z = (np.mean(ad_levels) - 2.9) / (1.2 / np.sqrt(len(ad_levels)))
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.4f}, p = {p:.6f}")
# Binomial test
result = stats.binomtest(18, 100, 0.08, alternative='greater')
print(f"Binomial test p = {result.pvalue:.6f}")
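The CI/test duality from the BioLang section can also be verified here — a minimal stdlib-only sketch using the 20-value subset from that demonstration (1.96 is the two-tailed z critical value at alpha = 0.05):

```python
import math

# The 20 AD values from the CI demonstration; known sigma = 1.2, null mu0 = 2.9
ad_subset = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
             4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
x_bar = sum(ad_subset) / len(ad_subset)
se = 1.2 / math.sqrt(len(ad_subset))
lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
print("Null (2.9) outside CI:", lo > 2.9 or hi < 2.9)
```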
R:
ad_levels <- c(3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3)
# z-test (using BSDA package, or manual)
z <- (mean(ad_levels) - 2.9) / (1.2 / sqrt(length(ad_levels)))
p <- 2 * pnorm(-abs(z))
cat(sprintf("z = %.4f, p = %.6f\n", z, p))
# Binomial test
binom.test(18, 100, p = 0.08, alternative = "greater")
Exercises
Exercise 1: Formulate Hypotheses
For each scenario, write the null and alternative hypotheses. State whether you would use a one-tailed or two-tailed test and why.
a) Does a new antibiotic reduce bacterial colony counts compared to placebo?
b) Is the GC content of a newly sequenced genome different from the expected 42%?
c) Do patients with the variant allele have higher LDL cholesterol?
Exercise 2: z-Test on Gene Expression
A reference database reports the mean expression of housekeeping gene GAPDH as 8.5 log2-CPM with sigma = 0.8 across thousands of samples. Your RNA-seq experiment on 15 samples yields a mean of 7.9. Is your experiment’s GAPDH level significantly different?
let gapdh_expression = [7.5, 8.1, 7.8, 7.6, 8.3, 7.2, 8.0, 7.9,
8.2, 7.4, 7.7, 8.1, 7.6, 8.4, 7.3]
# TODO: Perform z-test with mu=8.5, sigma=0.8
# TODO: Interpret the result — what might explain a difference?
Exercise 3: Simulate Type I Error Rate
Run 10,000 z-tests where H0 is true (both groups from the same distribution). Count what fraction of p-values fall below 0.05, 0.01, and 0.001. Do these match expectations?
# TODO: Simulate 10,000 null tests
# TODO: Count p < 0.05, p < 0.01, p < 0.001
# TODO: Compare to expected rates (5%, 1%, 0.1%)
Exercise 4: One-Tailed vs Two-Tailed
Using the Alzheimer’s biomarker data, compute the p-value for both a one-tailed test (H1: AD levels are higher) and a two-tailed test. What is the relationship between the two p-values?
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
# TODO: Compute z-stat manually, then get two-tailed and one-tailed p-values
# Two-tailed: 2 * (1 - pnorm(abs(z), 0, 1))
# One-tailed: 1 - pnorm(z, 0, 1)
# TODO: What is the mathematical relationship?
Key Takeaways
- Hypothesis testing uses the courtroom analogy: H0 (innocence) is assumed until the evidence (data) is overwhelming
- The p-value is the probability of data this extreme under H0 — it is NOT the probability H0 is true
- Type I error (false positive) is controlled by alpha; Type II error (false negative) is controlled by power
- Statistical significance (small p) does not imply practical significance (large effect)
- The z-test is the simplest hypothesis test, applicable when sigma is known
- Always state hypotheses and choose alpha before looking at data
- Under the null, p-values of a continuous test statistic are uniformly distributed: at alpha = 0.05, about 5% of null tests will be “significant” by chance in the long run
What’s Next
Tomorrow we move from the z-test (which requires known sigma) to the workhorse of biological research: the t-test. You will learn independent, paired, and Welch’s versions, check assumptions with Shapiro-Wilk and Levene’s tests, and quantify effect sizes with Cohen’s d. If hypothesis testing is the question, the t-test is the answer for two-group comparisons.