Day 5: Sampling, Bias, and Why n Matters
The Problem
Dr. Elena Vasquez is the lead biostatistician for a pharmaceutical company. A new immunotherapy drug has shown promising results in cell lines and mouse models. Now the clinical team is designing the Phase II trial and they want her sign-off on the sample size.
The clinical lead proposes 20 patients per arm — treatment and placebo. “It’s faster, cheaper, and we can get to Phase III sooner,” he argues. Elena runs the numbers and shakes her head. With 20 patients per arm and the expected effect size, the trial has only a 23% chance of detecting the drug’s benefit even if it truly works. That means a 77% chance of concluding the drug is ineffective when it actually saves lives.
She recommends 200 patients per arm. The clinical lead winces at the cost — $12 million more and 18 extra months of enrollment. But Elena is firm: “Would you rather spend $12 million now and know the answer, or spend $50 million on a Phase III that was doomed from the start because Phase II was too small to see the signal?”
This tension — between the cost of collecting more data and the cost of drawing wrong conclusions from too little — is the central drama of experimental design. Today you will understand why sample size is not a bureaucratic detail but the most consequential decision in any study.
What Are Populations and Samples?
The Population
The population is the complete set of items you want to understand. It is usually too large, too expensive, or physically impossible to measure in its entirety.
| Research Question | Population |
|---|---|
| Does this drug lower blood pressure? | All humans with hypertension |
| Is gene X differentially expressed in tumors? | All tumors of this type, past and future |
| What is the allele frequency of rs1234? | All humans alive today |
| Does this sequencing protocol introduce bias? | All possible runs of this protocol |
The Sample
The sample is the subset you actually observe. Everything you learn comes from the sample, but everything you want to know is about the population. Statistics is the bridge between the two.
The quality of the bridge depends entirely on two factors:
- How the sample was selected (bias)
- How large the sample is (precision)
Key insight: A large biased sample is worse than a small unbiased one. The 1936 Literary Digest poll surveyed 2.4 million people and predicted Alf Landon would win the presidential election in a landslide. George Gallup surveyed 50,000 and correctly predicted Roosevelt. The Literary Digest sample was drawn from telephone directories and automobile registrations — overrepresenting wealthy voters. Size could not compensate for bias.
The Sampling Distribution
This is one of the most important concepts in all of statistics, and it is the one that most students find counterintuitive at first.
Imagine you draw a sample of 30 patients, measure their blood pressure, and compute the mean. You get 125 mmHg. Now imagine you draw a different sample of 30 patients and compute the mean. You might get 121 mmHg. A third sample: 128 mmHg.
If you repeated this process thousands of times — each time drawing 30 patients and computing the mean — you would get thousands of sample means. These sample means form a distribution called the sampling distribution of the mean.
The sampling distribution is NOT the distribution of individual data points. It is the distribution of a statistic (like the mean) computed from repeated samples.
Seeing It in Action
```
set_seed(42)

# Simulate the sampling distribution
# The "population": 100,000 blood pressure values
let population = rnorm(100000, 125, 18)

# Draw 1000 samples of size 30, compute mean of each
let sample_means_30 = []
for i in 0..1000 {
    let s = sample(population, 30)
    sample_means_30 = sample_means_30 + [mean(s)]
}

# Draw 1000 samples of size 200, compute mean of each
let sample_means_200 = []
for i in 0..1000 {
    let s = sample(population, 200)
    sample_means_200 = sample_means_200 + [mean(s)]
}

# Compare the distributions
print("Population: mean = {mean(population):.1}, SD = {stdev(population):.1}")
print("Sample means (n=30): mean = {mean(sample_means_30):.1}, SD = {stdev(sample_means_30):.1}")
print("Sample means (n=200): mean = {mean(sample_means_200):.1}, SD = {stdev(sample_means_200):.1}")

histogram(sample_means_30, {bins: 40, title: "Sampling Distribution (n=30)"})
histogram(sample_means_200, {bins: 40, title: "Sampling Distribution (n=200)"})
```
Two crucial observations:
- Both sampling distributions are centered at the true population mean (~125): the sample mean is an unbiased estimator of the population mean.
- The n=200 distribution is much narrower than the n=30 distribution: larger samples give more precise estimates.
The Central Limit Theorem
The Central Limit Theorem (CLT) is perhaps the single most important result in statistics. It says:
Regardless of the shape of the population distribution, the sampling distribution of the mean approaches a normal distribution as sample size increases.
This is remarkable. The underlying data can be skewed, bimodal, uniform, or any shape at all. As long as you take large enough samples and compute means, those means will be approximately normally distributed.
Demonstrating the CLT
```
set_seed(42)

# Create a wildly non-normal population: lognormal (exp of a normal),
# which is very right-skewed
let skewed_pop = rnorm(100000, 2, 1) |> map(|x| exp(x))

# The population is extremely skewed
histogram(skewed_pop, {bins: 50, title: "Population: Lognormal (Very Skewed)"})
let pop_stats = summary(skewed_pop)
print("Population skewness: {pop_stats.skewness:.2}")

# Sample means with n=5 (still somewhat skewed)
let means_n5 = []
for i in 0..2000 {
    let s = sample(skewed_pop, 5)
    means_n5 = means_n5 + [mean(s)]
}
histogram(means_n5, {bins: 50, title: "Sample Means, n=5"})
print("n=5 skewness: {skewness(means_n5):.2}")

# Sample means with n=30 (approaching normal)
let means_n30 = []
for i in 0..2000 {
    let s = sample(skewed_pop, 30)
    means_n30 = means_n30 + [mean(s)]
}
histogram(means_n30, {bins: 50, title: "Sample Means, n=30"})
print("n=30 skewness: {skewness(means_n30):.2}")

# Sample means with n=100 (very close to normal)
let means_n100 = []
for i in 0..2000 {
    let s = sample(skewed_pop, 100)
    means_n100 = means_n100 + [mean(s)]
}
histogram(means_n100, {bins: 50, title: "Sample Means, n=100"})
print("n=100 skewness: {skewness(means_n100):.2}")

# Verify normality visually with Q-Q plot
qq_plot(means_n100, {title: "Q-Q Plot: Sample Means n=100"})
```
Watch the skewness drop toward zero as n increases. By n=100, the sampling distribution is indistinguishable from a normal curve, even though the underlying data is wildly skewed.
Key insight: The CLT is why the normal distribution dominates statistics. Even when individual observations are non-normal, means of samples are approximately normal. Since most statistical tests are fundamentally about comparing means, the normal distribution is the right reference distribution for the test statistic — even when the raw data is not normal.
When Does the CLT “Kick In”?
The speed of convergence to normality depends on how non-normal the population is:
| Population Shape | n Needed for CLT |
|---|---|
| Already normal | Any n (even n=1) |
| Slightly skewed | n ≥ 15 |
| Moderately skewed | n ≥ 30 |
| Heavily skewed | n ≥ 50-100 |
| Extremely skewed or heavy-tailed | n ≥ 100+ |
The “n ≥ 30” rule of thumb is a rough guideline, not a universal truth.
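You can measure this convergence directly. Here is a short sketch in Python/NumPy (the language of the equivalents section later in this chapter; assumes `numpy` and `scipy` are installed) that draws sample means from a heavily skewed lognormal population and tracks their skewness as n grows:

```python
import numpy as np
from scipy.stats import skew

# How fast do sample means "normalize" for a heavily skewed population?
# Skewness near 0 means approximately symmetric / normal-looking.
rng = np.random.default_rng(42)
population = rng.lognormal(mean=2.0, sigma=1.0, size=100_000)  # very right-skewed

skews = {}
for n in [5, 30, 100]:
    # 2000 samples of size n; skewness of the resulting sample means
    means = rng.choice(population, size=(2000, n)).mean(axis=1)
    skews[n] = skew(means)
    print(f"n={n:>3}: skewness of sample means = {skews[n]:.2f}")
```

For a population this skewed, the means at n=30 are still visibly asymmetric, which is exactly why the table above pushes the requirement to n ≥ 100 for heavy skew.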
Standard Error: The Precision of Your Estimate
The standard deviation of the sampling distribution of the mean has a special name: the standard error (SE).

SE = SD / √n

Here SD is the standard deviation of the individual observations (the population SD, estimated from the sample in practice).
This formula encodes the fundamental relationship between sample size and precision:
- Double your sample size → SE shrinks by a factor of √2 ≈ 1.41
- Quadruple your sample size → SE halves
- To cut the SE in half, you need 4 times as many observations
```
set_seed(42)

# Demonstrate how SE shrinks with sample size
let population_sd = 18.0  # Blood pressure SD
let sample_sizes = [5, 10, 20, 30, 50, 100, 200, 500, 1000]

print("Sample Size | Theoretical SE | Observed SE")
print("------------|----------------|------------")
for n in sample_sizes {
    let theoretical_se = population_sd / sqrt(n)
    # Simulate to verify
    let means = []
    for i in 0..1000 {
        let s = rnorm(n, 125, population_sd)
        means = means + [mean(s)]
    }
    let observed_se = stdev(means)
    print(" {n:>6} | {theoretical_se:>6.2} | {observed_se:.2}")
}
```
| Sample Size | SE (mmHg) | 95% CI Width |
|---|---|---|
| 20 | 4.02 | ± 7.9 |
| 50 | 2.55 | ± 5.0 |
| 100 | 1.80 | ± 3.5 |
| 200 | 1.27 | ± 2.5 |
| 1000 | 0.57 | ± 1.1 |
With 20 patients, your estimate of mean blood pressure could easily be off by 8 mmHg — enough to misclassify a treatment as effective or ineffective. With 200 patients, you are unlikely to be off by more than 2.5 mmHg.
Common pitfall: Researchers often confuse SD and SE. The SD describes variability among individual observations; the SE describes the precision of the sample mean. They answer different questions, so report the right one: SD when describing the data, SE when describing the precision of an estimate.
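The distinction is easy to see in a simulation. A quick NumPy sketch (assuming `numpy` is available; names are illustrative): the SD of individual observations hovers around its true value (18 mmHg) no matter how many patients you measure, while the SE of the mean keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(42)

results = {}
for n in [20, 200, 2000]:
    bp = rng.normal(125, 18, size=n)   # one sample of n blood pressures
    sd = bp.std(ddof=1)                # spread of individual patients
    se = sd / np.sqrt(n)               # precision of the sample mean
    results[n] = (sd, se)
    print(f"n={n:>4}: SD = {sd:5.2f}, SE = {se:4.2f}")
```

The SD column stays near 18 at every n; only the SE column responds to sample size.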
Types of Bias
Sample size controls precision, but even infinite precision cannot fix a biased sample. Bias is a systematic error that pushes your estimate in a consistent direction.
Selection Bias
Your sample is not representative of the population you want to study.
Example: A study of gene expression in breast cancer recruits patients only from a single academic medical center. These patients tend to have more advanced disease (referral bias), are more likely to be white (geographic bias), and have better follow-up (compliance bias). The results may not generalize to community hospitals or diverse populations.
Genomics example: If you study “healthy controls” by recruiting university employees, your sample overrepresents educated, relatively affluent people — not the general population.
Survivorship Bias
You only observe subjects who “survived” some selection process, missing those who did not.
Classic example: During WWII, the military examined bullet holes in returning planes and planned to add armor where holes were most common. Statistician Abraham Wald pointed out the error: they were only seeing planes that survived. The areas with no holes were where planes had been hit and crashed. Armor should go where holes were absent.
Biological example: If you study long-term cancer survivors to find prognostic biomarkers, you miss the patients who died quickly. Your biomarkers will predict survival among survivors, not among all patients.
Ascertainment Bias
The way you identify subjects systematically skews who gets included.
Example: A study finds that children with autism have more genetic variants than controls. But the autistic children were ascertained through clinical evaluation (which involves deep phenotyping and genetic testing), while controls were population-based. The ascertainment process itself led to more thorough variant discovery in cases.
Measurement Bias
The way you measure introduces systematic error.
Example: A technician consistently reads gel bands as slightly brighter than they are. All expression measurements are systematically inflated. If this bias is constant across all samples, relative comparisons are still valid. If it varies between conditions (e.g., the technician knows which samples are treatment), it corrupts everything.
Genomics example: GC content bias in sequencing — regions with extreme GC content are systematically under-represented in coverage, biasing any analysis that depends on read depth.
Publication Bias
Studies with significant results are more likely to be published than studies with null results. The published literature systematically overestimates effect sizes.
Example: 20 groups test whether gene X is associated with disease Y. One group (by chance) finds p < 0.05 and publishes. The other 19 find nothing and file the results away. The published literature now says gene X is associated with disease Y, but the full evidence says otherwise.
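The arithmetic behind this example can be checked directly. A Python/SciPy sketch (assumes `numpy` and `scipy`; the setup mirrors the story: 20 independent groups, no true effect): under the null, the chance that at least one of 20 tests hits p < 0.05 is 1 − 0.95²⁰ ≈ 0.64.

```python
import numpy as np
from scipy import stats

# 20 groups each test a gene with NO real effect. How often does at
# least one group get p < 0.05 (and "publish")?
rng = np.random.default_rng(42)
n_sims, n_groups, n_per_arm = 1000, 20, 30

cases = rng.normal(0.0, 1.0, size=(n_sims, n_groups, n_per_arm))
controls = rng.normal(0.0, 1.0, size=(n_sims, n_groups, n_per_arm))
_, p = stats.ttest_ind(cases, controls, axis=-1)   # p has shape (n_sims, n_groups)

frac = (p < 0.05).any(axis=1).mean()
print(f"P(at least one 'significant' null result): {frac:.2f}")
print(f"Theoretical: {1 - 0.95**20:.2f}")
```

Roughly two times in three, some group finds a publishable "association" for a gene that does nothing.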
| Bias Type | What Goes Wrong | Genomics Example |
|---|---|---|
| Selection | Non-representative sample | Single-center cohort |
| Survivorship | Missing failures | Studying only long-term survivors |
| Ascertainment | Systematic identification skew | More testing in cases vs controls |
| Measurement | Systematic instrument/observer error | GC bias, batch effects |
| Publication | Only positive results published | “Significant” GWAS hits that don’t replicate |
Key insight: Bias cannot be fixed by increasing sample size. A biased study with 10,000 subjects gives you a very precise wrong answer. Always evaluate bias before interpreting results.
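This insight, too, can be simulated. A NumPy sketch (assumptions: `numpy` available; a hypothetical study that only enrolls patients with blood pressure above 120, a crude stand-in for selection bias): increasing n makes the biased estimate more precise, not more correct.

```python
import numpy as np

# Estimate mean blood pressure from an unbiased sample vs a
# selection-biased one (only patients with BP > 120 get enrolled).
rng = np.random.default_rng(42)
true_mean = 125.0

estimates = {}
for n in [100, 10_000]:
    unbiased = rng.normal(true_mean, 18, size=n)
    pool = rng.normal(true_mean, 18, size=n * 10)
    biased = pool[pool > 120][:n]          # selection bias: BP > 120 only
    estimates[n] = (unbiased.mean(), biased.mean())
    print(f"n={n:>6}: unbiased = {estimates[n][0]:6.1f}, biased = {estimates[n][1]:6.1f}")
```

The unbiased estimate converges to 125 as n grows; the biased one converges just as confidently to a wrong value around 136.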
Why n Matters: The Power Preview
Statistical power is the probability of detecting a real effect when it exists. It is 1 minus the Type II error rate (1 - β). Convention targets 80% power, meaning a 20% chance of missing a real effect.
Power depends on four factors:
- Effect size — How large is the true difference? Bigger effects are easier to detect.
- Sample size (n) — More data = more power.
- Variability (σ) — Less noise = more power.
- Significance threshold (α) — More stringent threshold = less power.
Think of detecting a treatment effect as hearing a whisper in a crowd. The whisper is the signal (effect size). The crowd noise is variability. Adding more listeners (larger n) helps. Making the crowd quieter (reducing variability) helps. Demanding absolute certainty before you will believe you heard something (lower α) makes it harder.
Simulating Power
```
set_seed(42)

# Simulate a clinical trial to understand power
# True effect: treatment mean = 123 (control = 130, lower is better)
let control_mean = 130.0
let treatment_mean = 123.0  # 7 mmHg real difference
let sd = 18.0

# Function to run one trial and check if we detect the difference
# Returns 1 if p < 0.05, 0 otherwise
let run_trial = fn(n_per_arm) {
    let control = rnorm(n_per_arm, control_mean, sd)
    let treatment = rnorm(n_per_arm, treatment_mean, sd)
    let result = ttest(treatment, control)
    if result.p_value < 0.05 { 1 } else { 0 }
}

# Run 1000 simulated trials for different sample sizes
let sizes = [20, 50, 100, 200, 500]
print("n per arm | Estimated Power")
print("----------|----------------")
for n in sizes {
    let detections = 0
    for i in 0..1000 {
        detections = detections + run_trial(n)
    }
    let power = detections / 1000.0
    print(" {n:>5} | {power * 100:.1}%")
}
```
Typical results:
| n per arm | Power | Interpretation |
|---|---|---|
| 20 | ~23% | Miss the effect 77% of the time — nearly useless |
| 50 | ~48% | A coin flip — unacceptable for a clinical trial |
| 100 | ~78% | Getting close but still below the 80% standard |
| 200 | ~97% | Excellent — high confidence in detecting the effect |
| 500 | ~99.9% | Virtually certain to detect even subtle effects |
This is Dr. Vasquez’s argument in numbers. With 20 patients per arm, the trial has a 77% chance of producing a false negative — concluding the drug does not work when it does. That is not an experiment; it is a waste of money.
The Bootstrap: Estimation Without Formulas
The bootstrap is a resampling method that estimates the sampling distribution empirically. Instead of relying on mathematical formulas, you resample your data with replacement thousands of times and compute your statistic each time.
The bootstrap is invaluable when:
- The formula for the standard error is unknown or complicated
- The CLT may not apply (small n, skewed data)
- You want confidence intervals for any statistic (median, correlation, ratio)
```
set_seed(42)

# Bootstrap estimation of the standard error of the median
# Original sample: 50 gene expression values
let expression = rnorm(50, 3.0, 1.5) |> map(|x| exp(x))
let observed_median = median(expression)
print("Observed median: {observed_median:.2}")

# Bootstrap: resample with replacement 5000 times
let boot_medians = []
for i in 0..5000 {
    let resample = sample(expression, len(expression))
    boot_medians = boot_medians + [median(resample)]
}

# Bootstrap SE
let boot_se = stdev(boot_medians)
print("Bootstrap SE of median: {boot_se:.2}")

# Bootstrap 95% confidence interval (percentile method)
let ci_lower = quantile(boot_medians, 0.025)
let ci_upper = quantile(boot_medians, 0.975)
print("95% Bootstrap CI: [{ci_lower:.2}, {ci_upper:.2}]")

histogram(boot_medians, {bins: 50, title: "Bootstrap Distribution of the Median"})
```
Key insight: The bootstrap treats your sample as a stand-in for the population. By resampling from your sample, you simulate what would happen if you could repeatedly sample from the population. It is remarkably effective even for small samples.
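For readers who want the same mechanics outside the chapter's own language, here is a hand-rolled percentile bootstrap in Python/NumPy (a sketch, assuming only `numpy`; the simulated expression values mirror the chapter's example):

```python
import numpy as np

# The bootstrap "by hand": resample with replacement, recompute the
# statistic, read SE and CI off the bootstrap distribution.
rng = np.random.default_rng(42)
expression = np.exp(rng.normal(3.0, 1.5, size=50))   # skewed expression values

boot_medians = np.array([
    np.median(rng.choice(expression, size=expression.size, replace=True))
    for _ in range(5000)
])
boot_se = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"Observed median: {np.median(expression):.2f}")
print(f"Bootstrap SE of median: {boot_se:.2f}")
print(f"95% percentile CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

The entire method fits in a dozen lines: no formula for the SE of a median is needed.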
Hands-On: CLT with Allele Frequencies
Let us experience the Central Limit Theorem using realistic genetic data.
```
set_seed(42)

# Simulate allele frequency estimation from 1000 Genomes-style data
# True allele frequency of a common variant
let true_af = 0.23

# Simulate genotyping different numbers of individuals
let sample_sizes = [10, 30, 100, 500]
for n in sample_sizes {
    # Simulate 2000 studies, each genotyping n individuals
    let estimated_afs = []
    for study in 0..2000 {
        # Each individual contributes 2 alleles (diploid)
        let n_alleles = 2 * n
        let alt_count = rbinom(1, n_alleles, true_af) |> sum()
        let af_estimate = alt_count / (1.0 * n_alleles)
        estimated_afs = estimated_afs + [af_estimate]
    }
    let se = stdev(estimated_afs)
    let theoretical_se = sqrt(true_af * (1.0 - true_af) / (2.0 * n))
    print("n={n}: SE={se:.4} (theoretical: {theoretical_se:.4})")
    histogram(estimated_afs, {bins: 40, title: "Allele Frequency Estimates (n={n})"})
}

# With n=10: estimates range wildly (roughly 0.05 to 0.50)
# With n=500: estimates tightly clustered around 0.23
```
This simulation shows exactly why GWAS studies need thousands of individuals. With 10 people, you cannot reliably estimate an allele frequency to better than ±10 percentage points. With 500, you can nail it to within ±1-2 percentage points.
Python and R Equivalents
Python:
```python
import numpy as np
from scipy.stats import bootstrap

# Sampling distribution simulation
population = np.random.normal(125, 18, 100000)
sample_means = [np.mean(np.random.choice(population, 30)) for _ in range(1000)]
print(f"SE: {np.std(sample_means):.2f}")  # Should be close to 18/sqrt(30) ≈ 3.29

# Bootstrap
data = np.random.exponential(1, 50)
res = bootstrap((data,), np.median, n_resamples=5000)
print(f"95% CI: [{res.confidence_interval.low:.2f}, {res.confidence_interval.high:.2f}]")

# Standard error
se = np.std(data, ddof=1) / np.sqrt(len(data))
```
R:
```r
# Sampling distribution
population <- rnorm(100000, mean = 125, sd = 18)
sample_means <- replicate(1000, mean(sample(population, 30)))
sd(sample_means)  # Empirical SE

# Bootstrap
library(boot)
data <- rexp(50, rate = 1)  # example data vector, matching the Python snippet
boot_fn <- function(data, indices) median(data[indices])
results <- boot(data, boot_fn, R = 5000)
boot.ci(results, type = "perc")

# Standard error
se <- sd(data) / sqrt(length(data))

# Central Limit Theorem demo
par(mfrow = c(2, 2))
for (n in c(5, 10, 30, 100)) {
  means <- replicate(2000, mean(rexp(n, rate = 1)))
  hist(means, breaks = 40, main = paste("n =", n))
}
```
Exercises
Exercise 1: Experience the CLT
Take a uniform distribution (flat, definitely not normal) and show the CLT in action.
```
set_seed(42)

# Uniform population: values equally likely between 0 and 100
# (a uniform on [0, 100] has SD = 100/sqrt(12) ≈ 28.87)
# Assumes a runif() generator analogous to rnorm()
let uniform_pop = runif(100000, 0, 100)

# TODO: Draw a histogram of the population (should be flat)
# TODO: Take 1000 samples of size n=5, compute means, draw histogram
# TODO: Repeat for n=10, n=30, n=100
# TODO: At what n does the sampling distribution look convincingly normal?
# TODO: Compute skewness at each n to quantify the convergence
```
Exercise 2: SE and Confidence
You measure tumor volumes in 25 mice (mean = 450 mm³, SD = 120 mm³).
```
let n = 25
let sample_mean = 450.0
let sample_sd = 120.0

# TODO: Compute the standard error
# TODO: Compute an approximate 95% CI using mean ± 2*SE
# TODO: How large would n need to be for the 95% CI to have a half-width of 10 mm³?
# TODO: How large for a half-width of 5 mm³?
```
Exercise 3: Bias Identification
For each scenario, identify the type of bias and explain how it could affect results.
- A study of BRCA1 mutation frequency recruits subjects from a cancer genetics clinic.
- A survival analysis of pancreatic cancer uses patients diagnosed 5+ years ago (all long-term survivors by definition).
- RNA-seq libraries are prepared on two different days — all treatment samples on Day 1, all controls on Day 2.
- A GWAS consortium publishes results for the 10 strongest associations and files the rest.
Exercise 4: Bootstrap a Correlation
Estimate the uncertainty in a correlation coefficient using the bootstrap.
```
set_seed(42)

# Gene expression vs. protein abundance (moderate correlation)
let n = 40
let gene_expr = rnorm(n, 5.0, 2.0)
let noise = rnorm(n, 0, 1.5)
let protein = gene_expr |> map(|x| 0.7 * x) |> zip(noise) |> map(|pair| pair.0 + pair.1)

let observed_r = cor(gene_expr, protein)
print("Observed correlation: {observed_r:.3}")

# TODO: Bootstrap the correlation 5000 times (resample pairs, not each column separately)
# TODO: Compute the 95% bootstrap CI
# TODO: Is the correlation significantly different from zero?
# TODO: Plot the bootstrap distribution of r
```
Exercise 5: Power Simulation
Explore how effect size and variability interact with sample size.
```
set_seed(42)

# TODO: Run the trial simulation from the chapter, but now vary the effect size
#       Test with differences of 2, 5, 10, and 20 mmHg (SD=18 throughout)
#       At n=50 per arm, which effect sizes can you reliably detect?
# TODO: Now fix the difference at 5 mmHg and vary SD (10, 18, 30)
#       How does variability affect the required sample size?
```
Key Takeaways
- Population vs. sample: You study a sample to learn about a population. The quality of inference depends on sample size and sampling method.
- The sampling distribution is the distribution of a statistic computed from repeated samples. It is narrower than the data distribution and centered at the true value.
- The Central Limit Theorem guarantees that sample means are approximately normal regardless of the population distribution, given sufficient sample size. This is why normal-based tests work so broadly.
- Standard error (SE = SD/√n) quantifies the precision of your estimate. Quadrupling n halves the SE.
- Bias (selection, survivorship, ascertainment, measurement, publication) is a systematic distortion that cannot be fixed by increasing n. Identify and prevent bias at the design stage.
- Statistical power is the probability of detecting a real effect. Underpowered studies waste resources and miss real effects. The four determinants of power are effect size, sample size, variability, and significance threshold.
- The bootstrap provides empirical estimates of standard errors and confidence intervals for any statistic, without relying on distributional assumptions.
What’s Next
You have now completed the foundations. You know how to summarize data (Day 2), understand its distributional shape (Day 3), reason about probabilities (Day 4), and appreciate the central role of sample size and sampling variability (Day 5). Starting next week, we put these foundations to work. Day 6 introduces confidence intervals — the formal framework for saying “I’m 95% sure the true value lies between here and here.” You will see how the standard error you learned today transforms into a rigorous statement about uncertainty, and why confidence intervals are more informative than p-values alone. The testing begins.