Day 15: Multiple Regression and Model Selection
The Problem
Dr. Maria Chen is a clinical researcher studying pancreatic cancer. She has tumor samples from 120 patients, each profiled with 10 biomarkers: CA19-9, CEA, MKI67, TP53 status, tumor size, age, albumin, CRP, neutrophil-lymphocyte ratio (NLR), and platelet count. She wants to predict tumor stage (a continuous composite score from 1.0 to 4.0) from these biomarkers.
But there’s a problem. CA19-9 and CEA are highly correlated (r = 0.88) — they measure overlapping biology. Including both inflates standard errors and makes coefficients uninterpretable. And with 10 potential predictors, how does she find the best subset without overfitting?
She needs multiple regression with careful model selection.
What Is Multiple Regression?
Multiple regression extends simple regression to multiple predictors:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
Each coefficient βⱼ represents the effect of Xⱼ holding all other predictors constant. This is fundamentally different from running p separate simple regressions.
| Simple Regression | Multiple Regression |
|---|---|
| β₁ = total effect of X₁ on Y | β₁ = effect of X₁ after accounting for X₂…Xₚ |
| One predictor at a time | All predictors simultaneously |
| Can’t separate correlated effects | Separates correlated effects (when possible) |
| May show spurious associations | Controls for confounders |
Key insight: In simple regression, tumor size might predict stage with β = 0.8. In multiple regression controlling for CA19-9, the tumor size coefficient might drop to β = 0.3 — because CA19-9 captures much of the same information.
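To see this concretely, here is a small simulation sketch in Python (NumPy assumed; the data and coefficients are illustrative, not Dr. Chen's cohort). The tumor-size coefficient shrinks once the correlated CA19-9 enters the model, because simple regression lets tumor size absorb part of CA19-9's effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
tumor_size = rng.normal(3.5, 1.2, n)
ca19_9 = tumor_size * 50 + rng.normal(100, 40, n)      # correlated with size
# True model: stage = 1.0 + 0.3*size + 0.003*CA19-9 + noise
stage = 1.0 + 0.3 * tumor_size + 0.003 * ca19_9 + rng.normal(0, 0.3, n)

# Simple regression: tumor_size alone
X1 = np.column_stack([np.ones(n), tumor_size])
b_simple = np.linalg.lstsq(X1, stage, rcond=None)[0]

# Multiple regression: each coefficient is a partial effect
X2 = np.column_stack([np.ones(n), tumor_size, ca19_9])
b_multi = np.linalg.lstsq(X2, stage, rcond=None)[0]

# Simple beta is inflated (expectation 0.3 + 0.003*50 = 0.45);
# multiple beta recovers something near the true 0.3
print(f"simple beta_size:   {b_simple[1]:.2f}")
print(f"multiple beta_size: {b_multi[1]:.2f}")
```

The gap between the two estimates is exactly the portion of CA19-9's effect that tumor size proxies for when CA19-9 is omitted.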
Multicollinearity: When Predictors Are Too Similar
Multicollinearity occurs when predictors are highly correlated with each other. It doesn’t bias predictions, but it wreaks havoc on coefficient interpretation:
| Effect | Consequence |
|---|---|
| Inflated standard errors | Coefficients appear non-significant even when the true effects are real |
| Unstable coefficients | Small data changes cause wild coefficient swings |
| Sign flipping | A predictor with a positive true effect can get a negative coefficient |
| Uninterpretable | “Effect of X₁ holding X₂ constant” is meaningless if X₁ and X₂ always move together |
Detecting Multicollinearity: VIF
The Variance Inflation Factor quantifies how much each predictor is explained by the others:
$$VIF_j = \frac{1}{1 - R_j^2}$$
where R²ⱼ is the R² from regressing Xⱼ on all other predictors.
| VIF | Interpretation | Action |
|---|---|---|
| 1 | No collinearity | Good |
| 1-5 | Moderate | Usually acceptable |
| 5-10 | High | Investigate |
| > 10 | Severe | Remove or combine predictors |
Common pitfall: VIF > 10 doesn’t always mean “drop the variable.” In genomics, biologically meaningful predictors may be correlated. Consider combining them (e.g., principal component) or using regularization instead.
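The definition is easy to compute by hand. A minimal Python sketch (simulated predictors, assumed purely for illustration) regresses each predictor on the others and applies VIF = 1/(1 − R²):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                     # independent
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing
    column j on all the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

for j in range(3):
    print(f"VIF x{j + 1}: {vif(X, j):.1f}")
```

Here x1 and x2 should show severe VIF (well above 10), while the independent x3 stays near 1.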
Model Selection: Finding the Best Model
With p predictors, there are 2ᵖ possible models. For p = 10, that’s 1024 models. We need principled ways to choose.
Information Criteria
| Criterion | Formula | Preference |
|---|---|---|
| AIC (Akaike) | -2·ln(L) + 2k | Lower = better; balances fit and complexity |
| BIC (Bayesian) | -2·ln(L) + k·ln(n) | Lower = better; penalizes complexity more than AIC |
| Adjusted R² | 1 - (1-R²)·(n-1)/(n-k-1) | Higher = better; penalizes added predictors |
Where L = likelihood, k = number of parameters, n = sample size.
AIC tends to select slightly larger models (better prediction). BIC tends to select smaller models (better interpretation). When they disagree, consider your goal.
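Under Gaussian errors the log-likelihood of an OLS fit has a closed form, so AIC and BIC can be computed directly from the residual sum of squares: −2 ln L = n·ln(2π·σ̂²) + n with σ̂² = RSS/n. A hedged Python sketch (simulated data, assumed for illustration):

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for an OLS fit with Gaussian errors.
    X's column count includes the intercept; +1 below counts sigma^2."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    ll = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    k_eff = k + 1
    return -2 * ll + 2 * k_eff, -2 * ll + k_eff * np.log(n)

rng = np.random.default_rng(2)
n = 120
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 2 + 1.5 * x1 + rng.normal(size=n)

X_true = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x_noise])
print("true model   AIC/BIC:", aic_bic(y, X_true))
print("+ noise term AIC/BIC:", aic_bic(y, X_big))
```

With n = 120, the BIC penalty per parameter is ln(120) ≈ 4.8 versus AIC's flat 2, so the noise term is punished more heavily by BIC, matching the table above.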
Stepwise Regression
Automated search through predictor combinations:
| Direction | Strategy | Risk |
|---|---|---|
| Forward | Start empty, add best predictor one at a time | May miss suppressor effects |
| Backward | Start full, remove worst predictor one at a time | May keep redundant predictors |
| Both | Add and remove at each step | Best coverage, slower |
Common pitfall: Stepwise regression is exploratory, not confirmatory. The selected model may not replicate. Always validate on held-out data or use cross-validation.
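A bare-bones forward-selection loop fits in a few lines. This Python sketch (variable names and data are assumptions for illustration) greedily adds whichever predictor most improves adjusted R² and stops when no addition helps:

```python
import numpy as np

def adj_r2(y, X):
    """1 - (1 - R^2)(n - 1)/(n - k - 1); X's column count already
    includes the intercept, so the denominator is n - X.shape[1]."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - (rss / tss) * (n - 1) / (n - k)

def forward_select(y, cols):
    """Greedily add whichever predictor most improves adjusted R^2."""
    n = len(y)
    chosen, remaining = [], dict(cols)
    best = adj_r2(y, np.ones((n, 1)))          # intercept-only baseline
    while remaining:
        scores = {}
        for name, x in remaining.items():
            design = np.column_stack(
                [np.ones(n)] + [cols[c] for c in chosen] + [x])
            scores[name] = adj_r2(y, design)
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break
        chosen.append(name)
        best = score
        del remaining[name]
    return chosen

rng = np.random.default_rng(3)
n = 150
cols = {f"x{i}": rng.normal(size=n) for i in range(1, 6)}
y = 3 + 2 * cols["x1"] + 1 * cols["x2"] + rng.normal(size=n)
chosen = forward_select(y, cols)
print("selected:", chosen)   # x1 and x2 should be picked first
```

The two true predictors should enter first; any noise variable that sneaks in afterwards is exactly the replication risk the pitfall above warns about.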
Regularized Regression: Handling Many Predictors
When you have many predictors (especially in genomics, where p >> n), ordinary least squares breaks down — with more predictors than samples, the coefficient estimates are not even unique. Regularization adds a penalty term:
Ridge Regression (L2)
$$\min \sum(y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2$$
- Shrinks coefficients toward zero but never to exactly zero
- Handles multicollinearity gracefully
- Good when all predictors might be relevant
Lasso Regression (L1)
$$\min \sum(y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j|$$
- Can shrink coefficients to exactly zero (automatic variable selection)
- Produces sparse models (easy to interpret)
- Preferred when you suspect only a few predictors matter
Elastic Net
$$\min \sum(y_i - \hat{y}_i)^2 + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$$
- Combines L1 and L2 penalties
- Good when predictors are correlated groups (keeps all or drops all from a group)
- Controlled by mixing parameter α (0 = ridge, 1 = lasso)
| Method | Feature Selection | Correlated Predictors | Best For |
|---|---|---|---|
| Ridge | No (shrinks all) | Handles well | Many weak effects |
| Lasso | Yes (zeros out) | Picks one arbitrarily | Few strong effects |
| Elastic Net | Yes (grouped) | Handles well | Correlated groups |
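The contrast in the table can be demonstrated with scikit-learn (assumed installed; the data are simulated so that only two of six predictors carry signal). Ridge shrinks every coefficient but zeroes none; lasso drives the noise coefficients to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)  # only x0, x1 matter

Xs = StandardScaler().fit_transform(X)       # standardize before penalizing
ridge = Ridge(alpha=10.0).fit(Xs, y)
lasso = Lasso(alpha=0.3).fit(Xs, y)

print("ridge coefs:", np.round(ridge.coef_, 2))  # all shrunk, none exactly 0
print("lasso coefs:", np.round(lasso.coef_, 2))  # noise coefs driven to 0
print("lasso zeroed:", int((lasso.coef_ == 0).sum()), "of", p)
```

In practice the penalty strength would be chosen by cross-validation (`RidgeCV`/`LassoCV`) rather than fixed as it is in this sketch.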
Polynomial Regression
When the relationship is curved but you want to stay in the regression framework:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$$
Useful for dose-response curves, growth curves, and non-linear biomarker relationships. But beware overfitting with high-degree polynomials.
Clinical relevance: The relationship between BMI and mortality is U-shaped — both low and high BMI increase risk. A linear model misses this entirely; a quadratic model captures it.
Multiple Regression in BioLang
Building a Multiple Regression Model
set_seed(42)
# Pancreatic cancer: predict tumor stage from biomarkers
let n = 120
# Simulate correlated biomarkers
let age = rnorm(n, 65, 10)
let tumor_size = rnorm(n, 3.5, 1.2)
let ca19_9 = tumor_size * 50 + rnorm(n, 100, 40)
let cea = ca19_9 * 0.3 + rnorm(n, 5, 3) # correlated with CA19-9
let mki67 = rnorm(n, 30, 15)
let albumin = rnorm(n, 3.5, 0.5)
let crp = rnorm(n, 15, 10)
let nlr = rnorm(n, 4, 2)
# True model: stage depends on tumor_size, ca19_9, mki67, age, albumin
let stage = 1.0 + 0.4 * tumor_size + 0.003 * ca19_9 + 0.01 * mki67
+ 0.01 * age - 0.3 * albumin
+ rnorm(n, 0, 0.3)
# Fit multiple regression with all predictors
let data = table({
"stage": stage, "age": age, "tumor_size": tumor_size,
"ca19_9": ca19_9, "cea": cea, "mki67": mki67,
"albumin": albumin, "crp": crp, "nlr": nlr
})
let model_full = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + crp + nlr", data)
print("=== Full Model ===")
print("R²: {model_full.r_squared |> round(3)}")
print("Adjusted R²: {model_full.adj_r_squared |> round(3)}")
Checking Multicollinearity
# Check multicollinearity via pairwise correlations
let pred_names = ["age", "tumor_size", "ca19_9", "cea", "mki67",
"albumin", "crp", "nlr"]
let predictors = [age, tumor_size, ca19_9, cea, mki67, albumin, crp, nlr]
print("=== Pairwise Correlations (VIF proxy) ===")
for i in 0..8 {
for j in (i+1)..8 {
let r = cor(predictors[i], predictors[j])
if abs(r) > 0.7 {
print(" {pred_names[i]} vs {pred_names[j]}: r = {r |> round(3)} *** HIGH")
}
}
}
# CA19-9 and CEA likely show high correlation
Stepwise Model Selection
# Manual model comparison: fit reduced models and compare R²
# Drop CEA (collinear with CA19-9) and noise variables (CRP, NLR)
let data_reduced = table({
"stage": stage, "age": age, "tumor_size": tumor_size,
"ca19_9": ca19_9, "mki67": mki67, "albumin": albumin
})
let model_reduced = lm("stage ~ age + tumor_size + ca19_9 + mki67 + albumin", data_reduced)
print("=== Model Comparison ===")
print("Full model R²: {model_full.r_squared |> round(3)}")
print("Full model Adj R²: {model_full.adj_r_squared |> round(3)}")
print("Reduced model R²: {model_reduced.r_squared |> round(3)}")
print("Reduced model Adj R²: {model_reduced.adj_r_squared |> round(3)}")
print("If Adj R² is similar, the simpler model is preferred")
Regularized Regression
# Regularized regression concepts
# Note: Ridge, Lasso, and Elastic Net are advanced methods
# typically used when p >> n (many predictors, few samples).
# BioLang provides lm() for standard regression; for regularization,
# use Python (scikit-learn) or R (glmnet) as shown below.
# Demonstrate the concept: compare full vs sparse models
# A "lasso-like" approach: fit models dropping one predictor at a time
# and see which predictors contribute the least
let predictors_list = ["age", "tumor_size", "ca19_9", "cea", "mki67", "albumin", "crp", "nlr"]
print("=== Variable Importance (drop-one analysis) ===")
let full_r2 = model_full.r_squared
print("Full model R²: {full_r2 |> round(4)}")
# Compare by dropping noise predictors
let model_no_crp = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + nlr", data)
let model_no_nlr = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + crp", data)
print("Without CRP: R² = {model_no_crp.r_squared |> round(4)}")
print("Without NLR: R² = {model_no_nlr.r_squared |> round(4)}")
print("Predictors with minimal R² drop are candidates for removal")
Polynomial Regression
set_seed(42)
# Non-linear biomarker relationship
let bmi = rnorm(100, 29, 5)
let risk = 0.5 + 0.1 * (bmi - 25) ** 2 + rnorm(100, 0, 2)
# Linear fit — misses the U-shape
let lin_data = table({"risk": risk, "bmi": bmi})
let model_linear = lm("risk ~ bmi", lin_data)
print("Linear R²: {model_linear.r_squared |> round(3)}")
# Polynomial fit — add bmi² term
let bmi_sq = bmi |> map(|x| x ** 2)
let poly_data = table({"risk": risk, "bmi": bmi, "bmi_sq": bmi_sq})
let model_poly = lm("risk ~ bmi + bmi_sq", poly_data)
print("Quadratic R²: {model_poly.r_squared |> round(3)}")
# Visualize the improvement
let plot_data = table({"BMI": bmi, "Risk": risk})
plot(plot_data, {type: "scatter", x: "BMI", y: "Risk",
title: "BMI vs Risk: Linear vs Quadratic Fit"})
Predicted vs. Actual Plot
# The ultimate model validation plot
# Compute predicted values from the reduced model
let predicted_stage = model_reduced.fitted
let pred_actual = table({"Actual": stage, "Predicted": predicted_stage})
plot(pred_actual, {type: "scatter", x: "Actual", y: "Predicted",
title: "Predicted vs Actual (Reduced Model)"})
# Points along the diagonal = good predictions
# Systematic deviations = model problems
let r_pred = cor(stage, predicted_stage)
print("Correlation (actual vs predicted): {r_pred |> round(3)}")
Python:
import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Multiple regression (df assumed to hold the stage and biomarker columns)
X = sm.add_constant(df[['age', 'tumor_size', 'ca19_9', 'cea', 'mki67']])
model = sm.OLS(df['stage'], X).fit()
print(model.summary())
# VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Stepwise (manual — no built-in in statsmodels)
# Use mlxtend: from mlxtend.feature_selection import SequentialFeatureSelector
# Lasso with cross-validated penalty (standardize predictors first)
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(df[['age', 'tumor_size', 'ca19_9', 'cea', 'mki67']])
lasso = LassoCV(cv=5).fit(X_scaled, df['stage'])
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()}")
R:
# Multiple regression
model <- lm(stage ~ age + tumor_size + ca19_9 + cea + mki67 +
albumin + crp + nlr)
summary(model)
# VIF
library(car)
vif(model)
# Stepwise
step_model <- step(model, direction = "both")
# Lasso (X_matrix: a numeric predictor matrix, e.g. from model.matrix())
library(glmnet)
cv_lasso <- cv.glmnet(X_matrix, stage, alpha = 1)
coef(cv_lasso, s = "lambda.min")
# Ridge
cv_ridge <- cv.glmnet(X_matrix, stage, alpha = 0)
# Elastic net
cv_enet <- cv.glmnet(X_matrix, stage, alpha = 0.5)
Exercises
Exercise 1: Build and Compare Models
Given 150 samples with 6 predictors, build a full model, then use stepwise selection to find a reduced model. Compare using AIC, BIC, and adjusted R².
set_seed(42)
let n = 150
let x1 = rnorm(n, 10, 3)
let x2 = rnorm(n, 5, 2)
let x3 = x1 * 0.5 + rnorm(n, 0, 1) # correlated with x1
let x4 = rnorm(n, 20, 5)
let x5 = rnorm(n, 0, 1) # noise
let x6 = rnorm(n, 0, 1) # noise
let y = 5 + 2 * x1 + 1.5 * x2 + 0.5 * x4 + rnorm(n, 0, 3)
# 1. Fit full model with all 6 predictors using lm()
# 2. Check pairwise cor() — which predictors are collinear?
# 3. Compare full vs reduced models by adj_r_squared
# 4. Which predictors matter? Do they match the true model?
Exercise 2: Ridge vs. Lasso
Compare ridge and lasso on a dataset where only 3 of 8 predictors truly matter. Which method correctly identifies the true predictors?
set_seed(42)
let n = 100
# 8 predictors, only 3 are real
# Fit lm() with all 8, then with only the true 3
# Compare R² — does the reduced model perform similarly?
# Note: for true ridge/lasso, use Python (scikit-learn) or R (glmnet)
Exercise 3: Polynomial vs. Linear
Fit linear, quadratic, and cubic models to a dose-response curve. Use AIC to select the best model, and show that adding unnecessary complexity (cubic) hurts.
set_seed(42)
let dose = rnorm(80, 25, 12) |> map(|d| max(d, 1))
let response = 10 + 5 * log2(dose) + rnorm(80, 0, 3)
# 1. Fit lm() with dose, dose + dose², dose + dose² + dose³
# 2. Compare R² and adj_r_squared for each
# 3. Which degree gives the best balance of fit and simplicity?
Exercise 4: Predicted vs. Actual
Build a multiple regression model for gene expression prediction. Create a predicted vs. actual scatter plot. Assess where the model succeeds and fails.
set_seed(42)
let n = 200
# Simulate: predict gene expression from 4 TF binding signals
# Build model using lm(), generate predicted vs actual plot
# Calculate mean absolute error
# Identify the 5 worst predictions — what makes them outliers?
Key Takeaways
- Multiple regression estimates the effect of each predictor controlling for all others — fundamentally different from separate simple regressions
- Multicollinearity (VIF > 10) inflates standard errors and makes coefficients uninterpretable — detect it before interpreting
- Model selection balances fit and complexity: AIC favors prediction, BIC favors parsimony, adjusted R² penalizes added terms
- Stepwise regression is useful for exploration but should be validated on held-out data
- Lasso performs automatic variable selection (zeros out irrelevant predictors); Ridge shrinks all coefficients but keeps them; Elastic net combines both
- Polynomial regression captures non-linear relationships within the regression framework but risks overfitting
- Always check residuals, predicted vs. actual plots, and VIF before trusting your model
- With genomics data (p >> n), regularization is not optional — it’s essential
What’s Next
What if your outcome isn’t continuous but binary — responder vs. non-responder, alive vs. dead, mutant vs. wild-type? Day 16 introduces logistic regression, where we predict probabilities of categorical outcomes using ROC curves, odds ratios, and the powerful GLM framework.