Day 30: Capstone — Genome-Wide Association Study

Day 30 of 30 Capstone: Days 4, 7, 11-12, 16, 18-19, 21, 25 ~90 min reading Genomics

The Problem

A large consortium has genotyped 10,000 individuals — 5,000 with type 2 diabetes (T2D) and 5,000 controls — at 500,000 single nucleotide polymorphisms (SNPs) spread across the genome. The goal is to identify genetic variants associated with T2D risk.

This is a genome-wide association study (GWAS) — the workhorse of modern human genetics. Since the first GWAS in 2005, thousands of studies have identified genetic associations for hundreds of diseases and traits. The method is conceptually simple: for each SNP, ask “is this variant more common in cases than controls?” But the execution requires careful statistical reasoning, because testing 500,000 hypotheses simultaneously creates an extreme multiple testing problem, and subtle confounders (especially population structure) can generate thousands of false positives.

By the end of today, you will have built a complete GWAS pipeline: quality control, population structure analysis, genome-wide association testing, multiple testing correction, visualization (Manhattan plot, Q-Q plot), and effect size interpretation. This capstone integrates methods from nearly every chapter of the book.

The GWAS Pipeline

SNP Quality Control: Hardy-Weinberg equilibrium, call rates, minor allele frequency
Population Structure: PCA on genotype data, identify and correct for ancestry
Association Testing: Chi-square or logistic regression per SNP
Multiple Testing Correction: Bonferroni at genome-wide threshold
Visualization: Manhattan plot, Q-Q plot
Effect Size Interpretation: Odds ratios for top hits
Power Considerations: What could we have missed?

Section 1: Data Simulation

set_seed(42)
# ============================================
# Genome-Wide Association Study
# Type 2 Diabetes: 5,000 Cases + 5,000 Controls
# 500,000 SNPs across 22 autosomes
# ============================================


# --- Configuration ---
let CONFIG = {
  n_cases: 5000,
  n_controls: 5000,
  n_total: 10000,
  n_snps: 500000,
  genome_wide_p: 5e-8,      # genome-wide significance
  suggestive_p: 1e-5,       # suggestive significance
  hwe_threshold: 1e-6,      # HWE filter in controls
  maf_threshold: 0.01,      # minimum minor allele frequency
  call_rate_threshold: 0.95  # minimum genotyping rate
}

# Phenotype: 0 = control, 1 = case
let phenotype = repeat(1, CONFIG.n_cases) + repeat(0, CONFIG.n_controls)

# Distribute SNPs across chromosomes (proportional to chromosome length)
let chr_lengths = [249, 243, 198, 191, 182, 171, 159, 145, 138, 134,
                   135, 133, 114, 107, 102, 90, 83, 80, 59, 64, 47, 51]
let total_length = sum(chr_lengths)
let snps_per_chr = chr_lengths |> map(|l|
  round(l / total_length * CONFIG.n_snps, 0))

# Generate SNP positions
let snp_chr = []
let snp_pos = []
let snp_names = []

for c in 0..22 {
  let n = snps_per_chr[c]
  let positions = range(0, n) |> map(|i| random_int(1, chr_lengths[c] * 1000000))
    |> sort_by(|x| x)
  for p in positions {
    snp_chr = snp_chr + [c + 1]
    snp_pos = snp_pos + [p]
    snp_names = snp_names + ["rs" + str(len(snp_names) + 1)]
  }
}

print("Simulated " + str(len(snp_names)) + " SNPs across 22 chromosomes")
print("Samples: " + str(CONFIG.n_cases) + " cases + " +
  str(CONFIG.n_controls) + " controls")

# Generate MAFs and genotypes
# Most SNPs: no association (null)
# ~50 SNPs: true associations with small effect sizes (OR 1.1-1.4)

let n_true_assoc = 50
let true_snp_indices = range(0, n_true_assoc) |> map(|i| random_int(0, len(snp_names) - 1))
let true_effect_sizes = rnorm(n_true_assoc, 0.18, 0.08)
  |> map(|x| max(0.05, min(0.40, x)))  # log(OR) range

# Generate p-values for each SNP
let p_values = []
let odds_ratios = []
let mafs = []

for i in 0..len(snp_names) {
  let maf = abs(rnorm(1, 0.2, 0.12)[0])
  maf = max(0.01, min(0.49, maf))
  mafs = mafs + [maf]

  if true_snp_indices |> contains(i) {
    # True association: generate p-value from the effect
    let idx = true_snp_indices |> index_of(i)
    let log_or = true_effect_sizes[idx]
    let or_val = exp(log_or)

    # Approximate chi-square statistic for this effect and sample size
    let freq_case = maf * or_val / (1 + maf * (or_val - 1))
    let freq_ctrl = maf
    let n = CONFIG.n_total
    let chi2 = n * pow(freq_case - freq_ctrl, 2) / (maf * (1 - maf))
    let p = max(1e-300, exp(-chi2 / 2))  # approximate

    p_values = p_values + [p]
    odds_ratios = odds_ratios + [or_val]
  } else {
    # Null SNP: p-value from uniform distribution
    let p = abs(rnorm(1, 0.5, 0.3)[0])
    p = max(0.0001, min(0.9999, p))
    p_values = p_values + [p]
    odds_ratios = odds_ratios + [1.0 + rnorm(1, 0, 0.02)[0]]
  }
}

Section 2: SNP Quality Control

Before testing associations, filter out low-quality SNPs.

Hardy-Weinberg Equilibrium

SNPs that violate Hardy-Weinberg equilibrium (HWE) in controls may indicate genotyping errors. We test each SNP with a chi-square goodness-of-fit test.

set_seed(42)
# --- Quality Control ---
print("\n=== SNP Quality Control ===")

# Simulate HWE p-values for each SNP
# Most SNPs: in HWE (high p-value)
# A few: genotyping errors (low p-value)
let n_geno_errors = 500  # SNPs with genotyping problems
let geno_error_idx = range(0, n_geno_errors) |> map(|i| random_int(0, len(snp_names) - 1))

let hwe_pvalues = []
for i in 0..len(snp_names) {
  if geno_error_idx |> contains(i) {
    # Genotyping error: HWE violation
    hwe_pvalues = hwe_pvalues + [abs(rnorm(1, 0, 1e-7)[0])]
  } else {
    # Normal SNP: in HWE
    hwe_pvalues = hwe_pvalues + [abs(rnorm(1, 0.5, 0.3)[0])]
  }
}

# Apply HWE filter (in controls only)
let pass_hwe = hwe_pvalues |> map(|p| p >= CONFIG.hwe_threshold)
let fail_hwe = count(pass_hwe, |x| !x)

print("HWE filter (p < " + str(CONFIG.hwe_threshold) + " in controls):")
print("  Failed: " + str(fail_hwe) + " SNPs removed")

# MAF filter
let pass_maf = mafs |> map(|m| m >= CONFIG.maf_threshold)
let fail_maf = count(pass_maf, |x| !x)
print("MAF filter (< " + str(CONFIG.maf_threshold) + "):")
print("  Failed: " + str(fail_maf) + " SNPs removed")

# Combined QC
let pass_qc = []
let qc_indices = []
for i in 0..len(snp_names) {
  if pass_hwe[i] && pass_maf[i] {
    pass_qc = pass_qc + [true]
    qc_indices = qc_indices + [i]
  } else {
    pass_qc = pass_qc + [false]
  }
}

let n_pass = len(qc_indices)
print("\nSNPs passing QC: " + str(n_pass) + " / " + str(len(snp_names)) +
  " (" + str(round(n_pass / len(snp_names) * 100, 1)) + "%)")

# Filter arrays to QC-passing SNPs
let qc_pvalues = p_values |> select(qc_indices)
let qc_ors = odds_ratios |> select(qc_indices)
let qc_chr = snp_chr |> select(qc_indices)
let qc_pos = snp_pos |> select(qc_indices)
let qc_names = snp_names |> select(qc_indices)
let qc_mafs = mafs |> select(qc_indices)

Key insight: HWE filtering is performed in controls only, not cases. Disease-associated variants may legitimately deviate from HWE in cases (this is actually expected for associated variants under certain genetic models). Testing HWE in cases would remove true associations.

Section 3: Population Structure — PCA

Population structure is the most common confounder in GWAS. If cases and controls have different ancestral backgrounds, allele frequency differences due to ancestry will be mistaken for disease associations.

set_seed(42)
# --- Population Structure Analysis ---
print("\n=== Population Structure (PCA) ===")

# Simulate PCA scores reflecting population structure
# Assume 3 major ancestry clusters with some admixture
let ancestry = range(0, CONFIG.n_total) |> map(|i| {
  let r = rnorm(1, 0, 1)[0]
  if r < 0.25 { "European" } else if r < 0.92 { "East Asian" } else { "African" }
})

# PC1 and PC2 separate ancestries
let pc1 = []
let pc2 = []
for i in 0..CONFIG.n_total {
  if ancestry[i] == "European" {
    pc1 = pc1 + [rnorm(1, 0, 5)[0]]
    pc2 = pc2 + [rnorm(1, 0, 4)[0]]
  } else if ancestry[i] == "East Asian" {
    pc1 = pc1 + [rnorm(1, 30, 5)[0]]
    pc2 = pc2 + [rnorm(1, -10, 4)[0]]
  } else {
    pc1 = pc1 + [rnorm(1, -25, 6)[0]]
    pc2 = pc2 + [rnorm(1, 20, 5)[0]]
  }
}

# PCA plot colored by ancestry
scatter(pc1, pc2,
  {xlabel: "PC1",
  ylabel: "PC2",
  title: "Population Structure — Genotype PCA"})

# PCA plot colored by case/control
scatter(pc1, pc2,
  {xlabel: "PC1",
  ylabel: "PC2",
  title: "PCA — Case/Control Overlay"})

# Check: are cases and controls balanced across ancestry?
print("\nAncestry distribution:")
print("             Cases     Controls")
for anc in ["European", "East Asian", "African"] {
  let n_case = 0
  let n_ctrl = 0
  for i in 0..CONFIG.n_total {
    if ancestry[i] == anc {
      if phenotype[i] == 1 { n_case = n_case + 1 }
      else { n_ctrl = n_ctrl + 1 }
    }
  }
  print(anc + "   " + str(n_case) + "       " + str(n_ctrl))
}

Common pitfall: If one ancestry group has a higher prevalence of T2D (which is biologically true — T2D rates differ across populations), and allele frequencies also differ by ancestry, then every ancestry-differentiated SNP will appear associated with T2D. This is confounding, not causation. Including top PCs as covariates in the regression model removes this confounding.

Section 4: Association Testing

For each SNP, test the association with disease status. The simplest approach is a chi-square test on the 2x3 genotype table. A more powerful approach is logistic regression with PC covariates.

Chi-Square Test (Basic)

# --- Association Testing ---
print("\n=== Genome-Wide Association Testing ===")
print("Testing " + str(n_pass) + " SNPs...")

# The p-values were pre-computed in simulation
# In a real GWAS, you would compute them here:
#
# for each SNP:
#   let geno_table = cross_tabulate(genotypes, phenotype)
#   let chi2 = chi_square_test(geno_table)
#   or: let lr = logistic_regression(phenotype ~ snp + PC1 + PC2 + ...)

Logistic Regression (Adjusted)

# Adjusted analysis includes top PCs as covariates
# This removes population structure confounding

# For this simulation, we adjust p-values to reflect
# the improvement from PC adjustment
let adjusted_pvalues = qc_pvalues  # In practice, from logistic_regression

print("Association method: logistic regression")
print("Covariates: PC1, PC2, PC3, PC4 (top 4 principal components)")
print("Model: disease ~ SNP + PC1 + PC2 + PC3 + PC4")

Section 5: Multiple Testing Correction

With 500,000 tests, even a tiny false positive rate generates thousands of false hits. The genome-wide significance threshold of p < 5 x 10^-8 is the standard Bonferroni correction for approximately 1 million independent tests (accounting for linkage disequilibrium).

# --- Multiple Testing ---
print("\n=== Multiple Testing Correction ===")

# Genome-wide significant hits
let gw_sig = []
for i in 0..n_pass {
  if adjusted_pvalues[i] < CONFIG.genome_wide_p {
    gw_sig = gw_sig + [{
      snp: qc_names[i],
      chr: qc_chr[i],
      pos: qc_pos[i],
      p: adjusted_pvalues[i],
      or: qc_ors[i],
      maf: qc_mafs[i]
    }]
  }
}

# Suggestive hits
let suggestive = []
for i in 0..n_pass {
  if adjusted_pvalues[i] < CONFIG.suggestive_p && adjusted_pvalues[i] >= CONFIG.genome_wide_p {
    suggestive = suggestive + [{
      snp: qc_names[i],
      chr: qc_chr[i],
      pos: qc_pos[i],
      p: adjusted_pvalues[i],
      or: qc_ors[i]
    }]
  }
}

print("Bonferroni threshold (0.05 / " + str(n_pass) + "): " +
  str(round(0.05 / n_pass, 10)))
print("Standard genome-wide threshold: 5 x 10^-8")
print("Suggestive threshold: 1 x 10^-5")
print("")
print("Genome-wide significant hits: " + str(len(gw_sig)))
print("Suggestive hits: " + str(len(suggestive)))

# Top hits table
let all_hits = gw_sig |> sort_by(|h| h.p)

print("\n=== Top Genome-Wide Significant SNPs ===")
print("SNP              Chr    Position       p-value          OR       MAF")
print("-" * 75)
for hit in all_hits |> take(20) {
  print(hit.snp + "    " + str(hit.chr) + "    " +
    str(hit.pos) + "    " +
    str(hit.p) + "    " +
    str(round(hit.or, 3)) + "    " +
    str(round(hit.maf, 3)))
}

Key insight: The genome-wide significance threshold of 5 x 10^-8 is extremely stringent — it corresponds to a Bonferroni correction for roughly 1 million independent tests. This means we need very large sample sizes (thousands to hundreds of thousands) to detect the small effects typical of common variant associations (OR 1.05-1.30). This is why modern GWAS consortia combine data from dozens of cohorts.

Section 6: Manhattan Plot

# --- Manhattan Plot ---
let gwas_tbl = range(0, n_pass) |> map(|i| {
  chr: qc_chr[i], pos: qc_pos[i], p: adjusted_pvalues[i]
}) |> to_table()

manhattan(gwas_tbl,
  {significance_line: CONFIG.genome_wide_p,
  suggestive_line: CONFIG.suggestive_p,
  title: "GWAS — Type 2 Diabetes",
  xlabel: "Chromosome",
  ylabel: "-log10(p-value)"})

Reading the Manhattan Plot

The Manhattan plot gets its name from its resemblance to the New York City skyline. Key features:

Feature	Interpretation
Tall peaks above significance line	Genome-wide significant loci — strong associations
Peaks above suggestive line	Suggestive associations — may reach significance with more samples
Uniform noise floor	Background of null associations — should be flat
Elevated baseline	Possible inflation from population structure or technical artifacts
Peak width	Multiple associated SNPs in linkage disequilibrium — one true signal region

Section 7: Q-Q Plot and Genomic Inflation

The Q-Q plot compares observed p-values to expected p-values under the null. Under the null hypothesis (no associations), p-values should follow a uniform distribution. On the Q-Q plot, this appears as points along the diagonal.

# --- Q-Q Plot ---
qq_plot(adjusted_pvalues,
  {title: "Q-Q Plot — Observed vs Expected p-values",
  ci: true})

# Genomic inflation factor (lambda)
# Lambda = median(chi-square statistics) / 0.456
# Lambda close to 1.0: no inflation
# Lambda > 1.1: population structure or other systematic bias
let neg_log_p = adjusted_pvalues |> map(|p| -log10(max(p, 1e-300)))
let chi2_equiv = adjusted_pvalues |> map(|p| -2 * log(max(p, 1e-300)))
let lambda = median(chi2_equiv) / 0.456

print("\n=== Genomic Inflation ===")
print("Lambda (genomic inflation factor): " + str(round(lambda, 3)))
if lambda < 1.05 {
  print("Interpretation: No meaningful inflation. Analysis is well-calibrated.")
} else if lambda < 1.10 {
  print("Interpretation: Mild inflation. Acceptable for large GWAS.")
} else {
  print("WARNING: Lambda > 1.10 suggests confounding.")
  print("Consider additional PC covariates or genomic control correction.")
}

Interpreting the Q-Q Plot

Pattern	Interpretation
Points follow diagonal, peel off at the tail	Expected: most SNPs are null, a few are truly associated
Points systematically above diagonal everywhere	Inflation: population structure, technical artifacts
Points below diagonal	Deflation: overly conservative test or data quality issue
Sharp deviation only at extreme tail	True strong associations — expected in well-powered GWAS

Common pitfall: A Q-Q plot that looks good (lambda ~ 1.0) does not guarantee the results are correct. It only means there is no systematic inflation. Individual false positives can still occur. Always validate top hits in independent cohorts.

Section 8: Effect Sizes — Odds Ratios

# --- Effect Size Analysis ---
print("\n=== Effect Sizes for Top Hits ===")

for hit in all_hits |> take(10) {
  let log_or = log(hit.or)
  let se_log_or = abs(log_or) / sqrt(-2 * log(hit.p))  # approximate
  let ci_lower = exp(log_or - 1.96 * se_log_or)
  let ci_upper = exp(log_or + 1.96 * se_log_or)

  print(hit.snp + " (chr" + str(hit.chr) + "): OR = " +
    str(round(hit.or, 3)) + " [" +
    str(round(ci_lower, 3)) + ", " +
    str(round(ci_upper, 3)) + "]" +
    " p = " + str(hit.p))
}

# Distribution of effect sizes among significant hits
let sig_ors = gw_sig |> map(|h| h.or)
if len(sig_ors) > 0 {
  print("\nEffect size distribution (genome-wide significant):")
  print("  Median OR: " + str(round(median(sig_ors), 3)))
  print("  Range: " + str(round(min(sig_ors), 3)) + " - " +
    str(round(max(sig_ors), 3)))

  histogram(sig_ors, {bins: 20,
    title: "Distribution of Odds Ratios — Significant Hits",
    xlabel: "Odds Ratio",
    ylabel: "Count"})
}

Interpreting Odds Ratios in GWAS

OR	Risk increase per allele	Typical for
1.05 - 1.10	5-10%	Most common variant associations
1.10 - 1.20	10-20%	Moderate-effect common variants
1.20 - 1.50	20-50%	Larger-effect variants (less common)
1.50 - 3.00	50-200%	Strong effects (rare in GWAS, common in candidate gene studies)
> 3.00	>200%	Very rare in GWAS; usually rare variants with large effects

Clinical relevance: Individual GWAS hits typically have small effects (OR 1.05-1.20). No single SNP is useful for predicting disease. However, combining hundreds of hits into a polygenic risk score (PRS) can identify individuals at meaningfully elevated risk. For T2D, top-decile PRS individuals have ~3-5x higher risk than bottom-decile individuals.

Section 9: Power Analysis — What Did We Miss?

# --- Power Analysis ---
print("\n=== Statistical Power ===")

# At our sample size (5000 cases + 5000 controls), what ORs can we detect?
let mafs_to_check = [0.01, 0.05, 0.10, 0.20, 0.30, 0.50]
let ors_to_check = [1.05, 1.10, 1.15, 1.20, 1.30, 1.50]

print("\nPower to detect association at p < 5e-8:")
print("(N = 5,000 cases + 5,000 controls)")
print("")
print("MAF    OR=1.05  OR=1.10  OR=1.15  OR=1.20  OR=1.30  OR=1.50")
print("-" * 65)

for maf in mafs_to_check {
  let powers = []
  for or_val in ors_to_check {
    # Approximate power calculation for GWAS
    let log_or = log(or_val)
    let n = CONFIG.n_total
    let var_logistic = 1 / (n * maf * (1 - maf) * 0.25)
    let z_alpha = 5.33  # z for p = 5e-8 (two-sided)
    let ncp = log_or / sqrt(var_logistic)
    let power = 1.0 - pnorm(z_alpha - ncp) + pnorm(-z_alpha - ncp)
    power = max(0, min(1, power))
    powers = powers + [power]
  }
  print(str(maf) + "   " +
    powers |> map(|p| str(round(p * 100, 0)) + "%") |> join("    "))
}

print("\nInterpretation:")
print("- At MAF=0.20 and OR=1.20, we have good power (~80%+)")
print("- At MAF=0.05 and OR=1.10, power is very low")
print("- Many true associations with small effects are missed")
print("- Larger sample sizes or meta-analysis would recover more hits")

Section 10: Locus Summary and Reporting

# --- Locus-Level Summary ---
print("\n=== Locus Summary ===")

# Group nearby significant SNPs into loci (within 500kb)
let loci = []
let used = []

for hit in all_hits {
  if used |> contains(hit.snp) { continue }

  let locus_snps = [hit]
  used = used + [hit.snp]

  for other in all_hits {
    if used |> contains(other.snp) { continue }
    if other.chr == hit.chr && abs(other.pos - hit.pos) < 500000 {
      locus_snps = locus_snps + [other]
      used = used + [other.snp]
    }
  }

  let lead = locus_snps |> sort_by(|s| s.p) |> first()
  loci = loci + [{
    lead_snp: lead.snp,
    chr: lead.chr,
    pos: lead.pos,
    p: lead.p,
    or: lead.or,
    n_snps: len(locus_snps)
  }]
}

print("\nIndependent loci: " + str(len(loci)))
print("\nLocus  Lead SNP         Chr  Position       p-value       OR     #SNPs")
print("-" * 75)
for i in 0..len(loci) {
  let l = loci[i]
  print(str(i + 1) + "     " + l.lead_snp + "   " + str(l.chr) + "    " +
    str(l.pos) + "    " + str(l.p) + "    " +
    str(round(l.or, 3)) + "    " + str(l.n_snps))
}

Section 11: Complete Report

# ============================================
# FINAL GWAS REPORT
# ============================================
print("\n" + "=" * 70)
print("GENOME-WIDE ASSOCIATION STUDY — FINAL REPORT")
print("Phenotype: Type 2 Diabetes")
print("=" * 70)

print("\n--- Study Design ---")
print("Cases: " + str(CONFIG.n_cases))
print("Controls: " + str(CONFIG.n_controls))
print("SNPs genotyped: " + str(CONFIG.n_snps))
print("SNPs after QC: " + str(n_pass))
print("Association model: logistic regression with PC1-4 covariates")
print("Random seed: 42")

print("\n--- Quality Control ---")
print("HWE filter (p < " + str(CONFIG.hwe_threshold) + " in controls): " +
  str(fail_hwe) + " removed")
print("MAF filter (< " + str(CONFIG.maf_threshold) + "): " +
  str(fail_maf) + " removed")
print("QC pass rate: " + str(round(n_pass / CONFIG.n_snps * 100, 1)) + "%")

print("\n--- Population Structure ---")
print("Ancestry groups: European (" +
  str(count(ancestry, |a| a == "European")) + "), East Asian (" +
  str(count(ancestry, |a| a == "East Asian")) + "), African (" +
  str(count(ancestry, |a| a == "African")) + ")")
print("PCA: PC1-PC2 separate ancestry clusters clearly")
print("Top 4 PCs included as covariates")

print("\n--- Genomic Inflation ---")
print("Lambda: " + str(round(lambda, 3)))
if lambda < 1.05 {
  print("Status: Well-calibrated (no inflation)")
} else {
  print("Status: Mild inflation detected — review QC")
}

print("\n--- Association Results ---")
print("Genome-wide significant (p < 5e-8): " + str(len(gw_sig)) + " SNPs")
print("Independent loci: " + str(len(loci)))
print("Suggestive (p < 1e-5): " + str(len(suggestive)) + " SNPs")

print("\n--- Effect Sizes ---")
if len(sig_ors) > 0 {
  print("Median OR among hits: " + str(round(median(sig_ors), 3)))
  print("OR range: " + str(round(min(sig_ors), 3)) + " - " +
    str(round(max(sig_ors), 3)))
}

print("\n--- Key Findings ---")
print("1. " + str(len(loci)) + " independent loci associated with T2D at genome-wide significance")
print("2. Effect sizes are modest (OR 1.1-1.4), consistent with polygenic architecture")
print("3. No evidence of systematic inflation (lambda = " + str(round(lambda, 3)) + ")")
print("4. Power analysis indicates additional loci would be detectable with larger samples")

print("\n--- Figures Generated ---")
print("1. PCA plot (population structure)")
print("2. Manhattan plot (genome-wide results)")
print("3. Q-Q plot (inflation assessment)")
print("4. OR distribution (effect sizes)")

print("\n" + "=" * 70)

Python:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Load PLINK output or compute associations
# assoc = pd.read_csv("gwas_results.assoc", sep="\t")

# Chi-square per SNP
for snp in genotype_matrix.columns:
    table = pd.crosstab(phenotype, genotype_matrix[snp])
    chi2, p, dof, expected = stats.chi2_contingency(table)

# Logistic regression with covariates
from sklearn.linear_model import LogisticRegression
for snp in snps:
    X = np.column_stack([genotype[:, snp], pc1, pc2, pc3, pc4])
    model = LogisticRegression().fit(X, phenotype)

# Manhattan plot
from qmplot import manhattanplot
manhattanplot(data=results, chrom="CHR", pos="BP", pv="P",
              suggestiveline=-np.log10(1e-5),
              genomewideline=-np.log10(5e-8))

# Q-Q plot
from qmplot import qqplot
qqplot(data=results["P"])

# Genomic inflation
chi2_stats = stats.chi2.isf(results['P'], df=1)
lambda_gc = np.median(chi2_stats) / 0.456

# PLINK association (most common tool for GWAS)
# system("plink --bfile mydata --assoc --adjust --out results")

# Or in R with snpStats
library(snpStats)
result <- single.snp.tests(phenotype ~ 1, snp.data = genotypes)

# Manhattan plot
library(qqman)
manhattan(results, chr="CHR", bp="BP", p="P", snp="SNP",
          suggestiveline=-log10(1e-5),
          genomewideline=-log10(5e-8))

# Q-Q plot
qq(results$P)

# Logistic regression with covariates
for (snp in colnames(geno)) {
  model <- glm(pheno ~ geno[, snp] + PC1 + PC2 + PC3 + PC4,
               family = binomial)
  p <- summary(model)$coefficients[2, 4]
}

# Genomic inflation
chisq <- qchisq(1 - results$P, df = 1)
lambda <- median(chisq) / qchisq(0.5, df = 1)

Exercises

Stricter QC. Re-run the analysis with a stricter HWE filter (p < 1e-4 instead of 1e-6) and higher MAF threshold (0.05 instead of 0.01). How many SNPs are removed? Does the number of significant hits change?

# Your code: stricter QC filters, compare hit counts

Bonferroni vs FDR. Apply BH FDR correction instead of the genome-wide significance threshold. How many more hits does FDR q < 0.05 find compared to p < 5e-8? What is the tradeoff?

# Your code: BH correction, compare to Bonferroni
let fdr_adjusted = p_adjust(adjusted_pvalues, "BH")

Ancestry-stratified analysis. Run the association analysis separately in each ancestry group. Do the same loci reach significance? What happens to power when you split the sample?

# Your code: subset by ancestry, run GWAS in each, compare

Genomic control. If lambda were 1.15 (inflated), apply genomic control correction: divide all chi-square statistics by lambda and recompute p-values. How does this change the Manhattan plot?

# Your code: inflate p-values, apply GC correction, re-plot

Power for future study. Your consortium plans a Phase 2 GWAS with 50,000 cases and 50,000 controls. At MAF = 0.10, what is the minimum detectable OR at 80% power? How many more loci would you expect to find?

# Your code: power calculation for larger sample

Key Takeaways

A GWAS tests hundreds of thousands of SNPs for association with a phenotype, requiring rigorous QC (HWE, MAF, call rate), population structure control (PCA), and extreme multiple testing correction (p < 5 x 10^-8).
Hardy-Weinberg equilibrium testing in controls identifies genotyping errors; MAF filtering removes uninformative rare variants; both are standard QC steps.
Population structure is the primary confounder in GWAS. PCA on genotype data reveals ancestry, and including top PCs as covariates in logistic regression removes confounding.
The genome-wide significance threshold (5 x 10^-8) is a Bonferroni correction for approximately 1 million independent tests, accounting for linkage disequilibrium.
The Manhattan plot displays -log10(p-values) by genomic position; the Q-Q plot assesses whether the test statistics are well-calibrated (lambda near 1.0).
Most GWAS hits have modest effect sizes (OR 1.05-1.30), reflecting the polygenic architecture of complex traits. Individual variants are not clinically useful predictors, but combined into polygenic risk scores, they have growing clinical applications.
Power in GWAS depends on sample size, allele frequency, and effect size. Large international consortia (>100,000 individuals) are needed to detect the small effects typical of common diseases.
This capstone integrates concepts from probability (Day 4), hypothesis testing (Day 7), chi-square tests (Day 11), multiple testing (Day 12), logistic regression (Day 16), power analysis (Day 18), effect sizes (Day 19), PCA (Day 21), and visualization (Day 25).

Congratulations

You have completed “Practical Biostatistics in 30 Days.” Over the past four weeks, you have journeyed from basic descriptive statistics to a genome-wide association study analyzing half a million genetic variants. Along the way, you have learned:

Foundations: Distributions, probability, sampling, and why n matters
Core methods: Confidence intervals, hypothesis testing, t-tests, non-parametric alternatives, ANOVA, chi-square tests
Multiple testing: The FDR crisis and how to control it
Modeling: Correlation, linear regression, multiple regression, logistic regression, survival analysis
Design: Experimental design, statistical power, effect sizes, batch effects
Advanced methods: PCA, clustering, resampling, Bayesian inference, meta-analysis
Practice: Reproducible analysis, clinical trials, differential expression, GWAS

Every method in this book is a tool. Like any tool, it can be used well or poorly. The difference is understanding — knowing not just how to run a test, but when to run it, what its assumptions are, and how to interpret its results in the context of your biological question.

Statistics is not the final step of an experiment. It is the lens through which every experiment should be designed, conducted, and interpreted. The best statisticians are not those who know the most tests — they are those who ask the right questions.

Go forth and analyze. Be rigorous. Be honest. Be curious. And always, always check your assumptions.

Keyboard shortcuts

Practical Biostatistics in 30 Days