Day 7: Hypothesis Testing — Asking Precise Questions
The Problem
Dr. Kenji Nakamura has spent three years developing a blood-based biomarker panel for early Alzheimer’s detection. His team measures plasma levels of phosphorylated tau (p-tau217) in 40 cognitively normal individuals and 40 patients with confirmed early-stage Alzheimer’s. The mean p-tau217 level in the Alzheimer’s group is 3.8 pg/mL, compared to 2.9 pg/mL in controls. The difference looks promising — about 31% higher.
But when Dr. Nakamura submits to the FDA for breakthrough device designation, the reviewer’s response is blunt: “Your biomarker shows a numerical difference. Can you demonstrate this isn’t just sampling noise? What is the probability of seeing a difference this large if the biomarker has no real diagnostic value?” This is the fundamental question that hypothesis testing answers.
The stakes are enormous. If the biomarker works, millions of patients could be diagnosed years earlier, when interventions are most effective. If it doesn’t — if the observed difference is just statistical noise — pursuing it wastes hundreds of millions in development costs and, worse, could lead to false diagnoses.
What Is Hypothesis Testing?
Think of hypothesis testing as a courtroom trial for your scientific claim.
- The defendant is the null hypothesis (H0): “There is no effect.” In the courtroom, the defendant is presumed innocent.
- The prosecution’s evidence is your data. You are trying to show the evidence is so overwhelming that the “innocence” explanation is implausible.
- The verdict is either “guilty” (reject H0) or “not proven” (fail to reject H0). Notice: the jury never declares the defendant “innocent” — just that the evidence was insufficient.
Key insight: Hypothesis testing never proves your theory is true. It only tells you whether the data are inconsistent enough with “no effect” that you can reject that explanation with a specified level of confidence.
The Five Steps of Hypothesis Testing
| Step | Description | Alzheimer’s Example |
|---|---|---|
| 1. State H0 and H1 | Define the null and alternative | H0: mu_AD = mu_control; H1: mu_AD > mu_control |
| 2. Choose alpha | Set significance threshold | alpha = 0.05 |
| 3. Compute test statistic | Summarize evidence against H0 | z = (x_bar1 - x_bar2) / SE |
| 4. Find p-value | Probability of seeing this extreme a result under H0 | p = P(Z >= z_obs) |
| 5. Make decision | Compare p to alpha | If p < 0.05, reject H0 |
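The five steps can be sketched end-to-end in Python (a sketch, not the definitive analysis: a one-sample version against the reference mean 2.9 pg/mL with known sigma 1.2, using the illustrative measurements that appear later in this lesson):

```python
import math
from statistics import NormalDist

# Steps 1-2: H0: mu = 2.9, H1: mu > 2.9 (one-tailed), alpha = 0.05
mu0, sigma, alpha = 2.9, 1.2, 0.05

# Illustrative sample: 40 p-tau217 measurements (pg/mL)
ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
             4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
             3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
             3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]

# Step 3: test statistic
n = len(ad_levels)
x_bar = sum(ad_levels) / n
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Step 4: one-tailed p-value, P(Z >= z) under H0
p = 1 - NormalDist().cdf(z)

# Step 5: decision
print(f"z = {z:.2f}, p = {p:.2e}")
print("Reject H0" if p < alpha else "Fail to reject H0")
```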
The Null Hypothesis (H0)
The null hypothesis is the “boring” explanation — the default assumption of no effect, no difference, no relationship. It is what you assume until the data force you to abandon it.
| Research Question | Null Hypothesis (H0) |
|---|---|
| Does the drug reduce tumor size? | Mean tumor size is the same with and without drug |
| Is this SNP associated with diabetes? | Allele frequencies are the same in cases and controls |
| Does expression differ between tissues? | Mean expression is equal in both tissues |
| Is the mutation rate elevated? | Mutation rate equals the background rate |
The Alternative Hypothesis (H1)
The alternative is what you actually believe — the “interesting” claim.
- Two-tailed: H1: mu1 != mu2 (the groups differ in either direction)
- One-tailed: H1: mu1 > mu2 (specifically higher) or H1: mu1 < mu2 (specifically lower)
Common pitfall: Do not choose one-tailed vs two-tailed after looking at your data. This decision must be made before the experiment, based on your scientific question. Switching from two-tailed to one-tailed after seeing the results halves your p-value (when the effect lies in the hypothesized direction) — that is a form of p-hacking, not legitimate analysis.
The p-Value: The Most Misunderstood Number in Science
The p-value is the probability of observing data as extreme as (or more extreme than) what you got, assuming H0 is true.
What the p-value IS:
- A measure of how surprising your data are under the null hypothesis
- A continuous measure of evidence — smaller p = more evidence against H0
- The probability of the data given H0: P(data | H0)
What the p-value IS NOT:
- The probability that H0 is true: NOT P(H0 | data)
- The probability your result is due to chance
- The probability of making an error
- A measure of effect size or practical importance
| p-value | Informal Interpretation |
|---|---|
| p > 0.10 | Little evidence against H0 |
| 0.05 < p < 0.10 | Weak evidence against H0 |
| 0.01 < p < 0.05 | Moderate evidence against H0 |
| 0.001 < p < 0.01 | Strong evidence against H0 |
| p < 0.001 | Very strong evidence against H0 |
Type I and Type II Errors
Every decision carries the risk of being wrong:
| Decision | H0 is True | H0 is False |
|---|---|---|
| Reject H0 | Type I error (false positive), probability = alpha | Correct (true positive), probability = 1 - beta = power |
| Fail to reject H0 | Correct (true negative) | Type II error (false negative), probability = beta |
- Type I error (alpha): You claim a drug works when it doesn’t. A patient receives ineffective treatment.
- Type II error (beta): You miss a real effect. An effective drug gets shelved.
Clinical relevance: In drug safety testing, alpha is typically set very low (0.01 or even 0.001) because a Type I error means approving a dangerous drug. In exploratory genomics, higher alpha (0.05 or even 0.10) is acceptable because you will validate hits in follow-up experiments.
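Beta can be estimated directly by simulation: generate data where a real effect exists, run the test many times, and count how often it fails to reject. A minimal Python sketch (the 0.5 sigma-unit effect and n = 20 are invented for illustration, not from the scenario):

```python
import random
import math
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

# H0: mu = 0, but the true mean is 0.5 — a real effect exists.
sigma, n, alpha, true_mu = 1.0, 20, 0.05, 0.5
z_crit = norm.inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645

misses, trials = 0, 5000
for _ in range(trials):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    if z < z_crit:          # fail to reject H0 despite the real effect
        misses += 1

beta = misses / trials
print(f"Estimated beta: {beta:.3f}, power: {1 - beta:.3f}")
```

For these numbers the analytic power is about 0.72, so the simulation should report beta near 0.28 — a reminder that n = 20 misses a half-sigma effect more than a quarter of the time.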
Statistical vs Practical Significance
A p-value of 0.001 does not mean the effect is large or important. With enough data, trivially small effects become “statistically significant.”
| Scenario | p-value | Effect Size | Verdict |
|---|---|---|---|
| Gene expression differs by 0.01% (n=100,000) | p < 0.001 | Negligible | Statistically significant, practically meaningless |
| Drug reduces tumor by 40% (n=12) | p = 0.03 | Large | Both statistically and practically significant |
| Biomarker differs by 5% (n=20) | p = 0.08 | Moderate | Not significant — but maybe underpowered |
Always report effect sizes alongside p-values. We will dedicate Day 19 entirely to this topic.
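A quick simulation makes the first row of the table concrete (Python sketch; the 0.02-unit effect and the sample sizes are invented for illustration):

```python
import random
import math
from statistics import NormalDist

random.seed(1)
norm = NormalDist()

# True effect is tiny: 0.02 units against sigma = 1.
sigma, true_mu = 1.0, 0.02
results = {}
for n in [100, 10_000, 1_000_000]:
    x_bar = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
    z = x_bar / (sigma / math.sqrt(n))            # test against H0: mu = 0
    p = 2 * (1 - norm.cdf(abs(z)))
    results[n] = p
    print(f"n = {n:>9}: mean = {x_bar:+.4f}, p = {p:.4g}")
```

At n = 1,000,000 the p-value is vanishingly small even though the effect (0.02 units) is practically negligible — significance tracks sample size, not importance.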
The z-Test: The Simplest Hypothesis Test
When the population standard deviation sigma is known (rare, but a good starting point), the z-test compares a sample mean to a known value:
z = (x-bar - mu0) / (sigma / sqrt(n))
Under H0, z follows a standard normal distribution N(0, 1).
One-Tailed vs Two-Tailed Tests
| Test Type | H1 | p-value Calculation | Use When |
|---|---|---|---|
| Two-tailed | mu != mu0 | 2 x P(Z > abs(z)) | You care about differences in either direction |
| Right-tailed | mu > mu0 | P(Z > z) | You only care if the value is higher |
| Left-tailed | mu < mu0 | P(Z < z) | You only care if the value is lower |
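The formula and the table translate directly into code. A hedged Python sketch of a one-sample z-test helper (the function name and signature are my own, not a library API):

```python
import math
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, tail="two"):
    """One-sample z-test with known sigma. Returns (z, p). Sketch only."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    norm = NormalDist()
    if tail == "two":
        p = 2 * (1 - norm.cdf(abs(z)))   # 2 x P(Z > |z|)
    elif tail == "right":
        p = 1 - norm.cdf(z)              # P(Z > z)
    else:  # "left"
        p = norm.cdf(z)                  # P(Z < z)
    return z, p

# Example: sample mean 3.0 vs mu0 = 2.9, sigma = 1.2, n = 40
z, p = z_test(3.0, 2.9, 1.2, 40)
print(f"z = {z:.3f}, two-tailed p = {p:.3f}")
```

Note that for z > 0 the two-tailed p-value is exactly twice the right-tailed one, which is why the choice of tails must precede the data.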
Hypothesis Testing in BioLang
z-Test on Biomarker Data
# Alzheimer's biomarker study
# Known population SD from large reference database: sigma = 1.2 pg/mL
# Expected normal level: mu0 = 2.9 pg/mL
# Observed in 40 Alzheimer's patients:
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]
# Compute z-test manually (known sigma)
let n = len(ad_levels)
let x_bar = mean(ad_levels)
let se = 1.2 / sqrt(n)
let z_stat = (x_bar - 2.9) / se
let p_val = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
print("z-statistic: {z_stat:.4}")
print("p-value (two-tailed): {p_val:.6}")
print("Mean observed: {x_bar:.3} pg/mL")
if p_val < 0.05 {
print("Reject H0: Alzheimer's group significantly differs from normal level")
} else {
print("Fail to reject H0: Insufficient evidence of difference")
}
Visualizing the Null Distribution
# Show where our test statistic falls on the null distribution
let z_obs = 4.14 # from the z-test above
# Generate the null distribution (standard normal)
let x_vals = range(-4.0, 4.0, 0.01)
let y_vals = x_vals |> map(|x| dnorm(x, 0, 1))
# Plot the null distribution with our observed z marked
plot(x_vals, y_vals, {title: "Null Distribution (Standard Normal)", x_label: "z-statistic", y_label: "Density", vlines: [z_obs], shade_above: 1.96, shade_below: -1.96})
print("Critical value (two-tailed, alpha=0.05): +/- 1.96")
print("Our z = {z_obs} falls far in the rejection region")
Binomial Test: Is This Mutation Rate Elevated?
# In a cancer cohort, 18 of 100 patients carry a specific BRCA2 variant
# Population frequency is known to be 8%
# Compute binomial test using dbinom: P(X >= 18) when X ~ Binom(100, 0.08)
let p_val = 0.0
for k in range(18, 101) {
p_val = p_val + dbinom(k, 100, 0.08)
}
print("Observed proportion: 18/100 = 18%")
print("Expected under H0: 8%")
print("p-value (one-sided): {p_val:.6}")
if p_val < 0.05 {
print("Reject H0: Mutation rate is significantly elevated in this cohort")
} else {
print("Fail to reject H0")
}
Complete Hypothesis Test Workflow
# Full workflow: Is mean platelet count elevated in a disease cohort?
# Reference population: mu = 250 (x10^3/uL), sigma = 50
# Our 25 patients:
let platelets = [280, 310, 265, 295, 275, 320, 290, 305, 260, 285,
300, 270, 315, 288, 292, 278, 308, 282, 298, 272,
310, 295, 268, 302, 288]
# Step 1: State hypotheses
print("H0: mu = 250 (platelet count is normal)")
print("H1: mu > 250 (platelet count is elevated)")
print("alpha = 0.05, one-tailed test\n")
# Step 2: Compute test statistic
let n = len(platelets)
let x_bar = mean(platelets)
let se = 50 / sqrt(n) # sigma is known
let z = (x_bar - 250) / se
# Step 3: Find p-value (one-tailed)
let p = 1.0 - pnorm(z, 0, 1)
# Step 4: Decision
print("Sample mean: {x_bar:.1}")
print("z-statistic: {z:.4}")
print("p-value (one-tailed): {p:.6}")
if p < 0.05 {
print("\nDecision: Reject H0 at alpha = 0.05")
print("Conclusion: Platelet count is significantly elevated in this cohort")
} else {
print("\nDecision: Fail to reject H0")
}
Interpreting p-Values with Simulated Data
set_seed(42)
# Demonstrate: under H0 (no effect), p-values are uniformly distributed
let p_values = []
for i in 1..1000 {
# Generate two samples from the SAME distribution (H0 is true)
let group1 = rnorm(20, 10, 3)
let group2 = rnorm(20, 10, 3)
# SE of a difference of two means: sigma * sqrt(1/n1 + 1/n2)
let se_diff = 3.0 * sqrt(1.0/20 + 1.0/20)
let z_stat = (mean(group1) - mean(group2)) / se_diff
let p_val = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
p_values = append(p_values, p_val)
}
# Count false positives at alpha = 0.05
let false_pos = p_values |> filter(|p| p < 0.05) |> len()
print("False positives out of 1000 null tests: {false_pos}")
print("Expected: ~50 (5% of 1000)")
histogram(p_values, {title: "p-Value Distribution Under the Null", x_label: "p-value", bins: 20})
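The same experiment in Python (note that the standard error of a difference of two means is sigma * sqrt(1/n1 + 1/n2), not sigma / sqrt(n); trial count reduced to 2,000 to keep the sketch fast):

```python
import random
import math
from statistics import NormalDist

random.seed(42)
norm = NormalDist()
sigma, n = 3.0, 20
se_diff = sigma * math.sqrt(1/n + 1/n)   # SE of a difference of two means

p_values = []
for _ in range(2000):
    # Both groups from the SAME distribution — H0 is true.
    g1 = [random.gauss(10, sigma) for _ in range(n)]
    g2 = [random.gauss(10, sigma) for _ in range(n)]
    z = (sum(g1)/n - sum(g2)/n) / se_diff
    p_values.append(2 * (1 - norm.cdf(abs(z))))

rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"False-positive rate at alpha = 0.05: {rate:.3f}")
```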
Connecting CIs and Hypothesis Tests
# Demonstrate: a 95% CI that excludes the null value corresponds to p < 0.05
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
# z-test: is the mean different from 2.9?
let n = len(ad_levels)
let x_bar = mean(ad_levels)
let se = 1.2 / sqrt(n)
let z_stat = (x_bar - 2.9) / se
let z_p = 2.0 * (1.0 - pnorm(abs(z_stat), 0, 1))
print("z-test p-value: {z_p:.6}")
# 95% CI for the mean (using known sigma)
let se2 = 1.2 / sqrt(n)
let ci_lower = x_bar - 1.96 * se2
let ci_upper = x_bar + 1.96 * se2
print("95% CI: [{ci_lower:.3}, {ci_upper:.3}]")
if ci_lower > 2.9 or ci_upper < 2.9 {
print("Null value (2.9) is outside the CI")
} else {
print("Null value (2.9) is inside the CI")
}
print("This matches the hypothesis test: p < 0.05 <=> CI excludes null value")
Python:
import numpy as np
from scipy import stats
ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3]
# z-test (manual — scipy has no built-in z-test for means; statsmodels.stats.weightstats.ztest provides one)
z = (np.mean(ad_levels) - 2.9) / (1.2 / np.sqrt(len(ad_levels)))
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.4f}, p = {p:.6f}")
# Binomial test
result = stats.binomtest(18, 100, 0.08, alternative='greater')
print(f"Binomial test p = {result.pvalue:.6f}")
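The CI/test duality from the BioLang section can also be verified here — a minimal stdlib-only sketch using the 20-value subset from that demonstration (1.96 is the two-tailed z critical value at alpha = 0.05):

```python
import math

# The 20 AD values from the CI demonstration; known sigma = 1.2, null mu0 = 2.9
ad_subset = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
             4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
x_bar = sum(ad_subset) / len(ad_subset)
se = 1.2 / math.sqrt(len(ad_subset))
lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
print("Null (2.9) outside CI:", lo > 2.9 or hi < 2.9)
```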
R:
ad_levels <- c(3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4,
3.0, 3.9, 4.2, 3.3, 3.7, 4.0, 3.5, 3.8, 4.1, 3.2,
3.6, 3.4, 4.3, 3.1, 3.9, 3.7, 4.0, 3.5, 3.8, 3.3)
# z-test (using BSDA package, or manual)
z <- (mean(ad_levels) - 2.9) / (1.2 / sqrt(length(ad_levels)))
p <- 2 * pnorm(-abs(z))
cat(sprintf("z = %.4f, p = %.6f\n", z, p))
# Binomial test
binom.test(18, 100, p = 0.08, alternative = "greater")
Exercises
Exercise 1: Formulate Hypotheses
For each scenario, write the null and alternative hypotheses. State whether you would use a one-tailed or two-tailed test and why.
a) Does a new antibiotic reduce bacterial colony counts compared to placebo?
b) Is the GC content of a newly sequenced genome different from the expected 42%?
c) Do patients with the variant allele have higher LDL cholesterol?
Exercise 2: z-Test on Gene Expression
A reference database reports the mean expression of housekeeping gene GAPDH as 8.5 log2-CPM with sigma = 0.8 across thousands of samples. Your RNA-seq experiment on 15 samples yields a mean of 7.9. Is your experiment’s GAPDH level significantly different?
let gapdh_expression = [7.5, 8.1, 7.8, 7.6, 8.3, 7.2, 8.0, 7.9,
8.2, 7.4, 7.7, 8.1, 7.6, 8.4, 7.3]
# TODO: Perform z-test with mu=8.5, sigma=0.8
# TODO: Interpret the result — what might explain a difference?
Exercise 3: Simulate Type I Error Rate
Run 10,000 z-tests where H0 is true (both groups from the same distribution). Count what fraction of p-values fall below 0.05, 0.01, and 0.001. Do these match expectations?
# TODO: Simulate 10,000 null tests
# TODO: Count p < 0.05, p < 0.01, p < 0.001
# TODO: Compare to expected rates (5%, 1%, 0.1%)
Exercise 4: One-Tailed vs Two-Tailed
Using the Alzheimer’s biomarker data, compute the p-value for both a one-tailed test (H1: AD levels are higher) and a two-tailed test. What is the relationship between the two p-values?
let ad_levels = [3.2, 4.1, 3.8, 2.7, 4.5, 3.3, 3.9, 4.2, 3.1, 3.6,
4.0, 3.5, 4.3, 3.7, 2.9, 3.8, 4.1, 3.4, 3.6, 4.4]
# TODO: Compute z-stat manually, then get two-tailed and one-tailed p-values
# Two-tailed: 2 * (1 - pnorm(abs(z), 0, 1))
# One-tailed: 1 - pnorm(z, 0, 1)
# TODO: What is the mathematical relationship?
Key Takeaways
- Hypothesis testing uses the courtroom analogy: H0 (innocence) is assumed until the evidence (data) is overwhelming
- The p-value is the probability of data this extreme under H0 — it is NOT the probability H0 is true
- Type I error (false positive) is controlled by alpha; Type II error (false negative) is controlled by power
- Statistical significance (small p) does not imply practical significance (large effect)
- The z-test is the simplest hypothesis test, applicable when sigma is known
- Always state hypotheses and choose alpha before looking at data
- Under the null, p-values of a continuous test statistic are uniformly distributed: at alpha = 0.05, about 5% of null tests will be “significant” by chance in the long run
What’s Next
Tomorrow we move from the z-test (which requires known sigma) to the workhorse of biological research: the t-test. You will learn independent, paired, and Welch’s versions, check assumptions with Shapiro-Wilk and Levene’s tests, and quantify effect sizes with Cohen’s d. If hypothesis testing is the question, the t-test is the answer for two-group comparisons.