
Day 15: Multiple Regression and Model Selection

The Problem

Dr. Maria Chen is a clinical researcher studying pancreatic cancer. She has tumor samples from 120 patients, each profiled with 10 biomarkers: CA19-9, CEA, MKI67, TP53 status, tumor size, age, albumin, CRP, neutrophil-lymphocyte ratio (NLR), and platelet count. She wants to predict tumor stage (a continuous composite score from 1.0 to 4.0) from these biomarkers.

But there’s a problem. CA19-9 and CEA are highly correlated (r = 0.88) — they measure overlapping biology. Including both inflates standard errors and makes coefficients uninterpretable. And with 10 potential predictors, how does she find the best subset without overfitting?

She needs multiple regression with careful model selection.

What Is Multiple Regression?

Multiple regression extends simple regression to multiple predictors:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

Each coefficient βⱼ represents the effect of Xⱼ holding all other predictors constant. This is fundamentally different from running p separate simple regressions.

| Simple Regression | Multiple Regression |
| --- | --- |
| β₁ = total effect of X₁ on Y | β₁ = effect of X₁ after accounting for X₂…Xₚ |
| One predictor at a time | All predictors simultaneously |
| Can't separate correlated effects | Separates correlated effects (when possible) |
| May show spurious associations | Controls for confounders |

Key insight: In simple regression, tumor size might predict stage with β = 0.8. In multiple regression controlling for CA19-9, the tumor size coefficient might drop to β = 0.3 — because CA19-9 captures much of the same information.
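
To see this numerically, here is a minimal Python sketch with statsmodels; the simulation only echoes the chapter's scenario, and the exact numbers are illustrative:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 120
tumor_size = rng.normal(3.5, 1.2, n)
ca19_9 = tumor_size * 50 + rng.normal(100, 40, n)   # correlated with size
stage = 1.0 + 0.3 * tumor_size + 0.003 * ca19_9 + rng.normal(0, 0.3, n)

# Simple regression: tumor_size also absorbs the CA19-9 signal it carries
simple = sm.OLS(stage, sm.add_constant(tumor_size)).fit()

# Multiple regression: the coefficient is now "holding CA19-9 constant"
X = sm.add_constant(np.column_stack([tumor_size, ca19_9]))
multiple = sm.OLS(stage, X).fit()

print(f"simple beta(tumor_size):   {simple.params[1]:.3f}")    # inflated
print(f"multiple beta(tumor_size): {multiple.params[1]:.3f}")  # near the true 0.3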

Multicollinearity: When Predictors Are Too Similar

Multicollinearity occurs when predictors are highly correlated with each other. It doesn’t bias predictions, but it wreaks havoc on coefficient interpretation:

| Effect | Consequence |
| --- | --- |
| Inflated standard errors | Coefficients appear non-significant even when the true effects are real |
| Unstable coefficients | Small data changes cause wild coefficient swings |
| Sign flipping | A predictor with a positive true effect can get a negative coefficient |
| Uninterpretable effects | "Effect of X₁ holding X₂ constant" is meaningless if X₁ and X₂ always move together |
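
The instability row is easy to demonstrate: refit the same model on bootstrap resamples and watch the coefficient of a nearly collinear predictor swing. A minimal Python sketch with invented data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 120
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 1.0 + 1.0 * x1 + rng.normal(scale=0.5, size=n)

betas = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)       # bootstrap resample
    X = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    betas.append(sm.OLS(y[idx], X).fit().params[1])

# With x1 and x2 almost identical, beta_1 varies wildly across resamples
print(f"beta_1 over resamples: mean = {np.mean(betas):.2f}, sd = {np.std(betas):.2f}")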

Detecting Multicollinearity: VIF

The Variance Inflation Factor quantifies how much each predictor is explained by the others:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where R²ⱼ is the R² from regressing Xⱼ on all other predictors.
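
This definition translates directly into code. Below is a minimal Python sketch (statsmodels OLS in a loop; the toy matrix at the end, with two nearly collinear columns, is invented purely to exercise it):

import numpy as np
import statsmodels.api as sm

def vif_from_definition(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses column j on the rest."""
    p = X.shape[1]
    vifs = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        r2_j = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        vifs[j] = 1.0 / (1.0 - r2_j)
    return vifs

# Toy check: two nearly collinear columns should both show large VIFs
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])
print(vif_from_definition(X).round(1))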

| VIF | Interpretation | Action |
| --- | --- | --- |
| 1 | No collinearity | Good |
| 1–5 | Moderate | Usually acceptable |
| 5–10 | High | Investigate |
| > 10 | Severe | Remove or combine predictors |

Common pitfall: VIF > 10 doesn’t always mean “drop the variable.” In genomics, biologically meaningful predictors may be correlated. Consider combining them (e.g., principal component) or using regularization instead.

[Figure: Multicollinearity as overlapping explained variance. When predictors share information, their individual effects become ambiguous. Left panel, low collinearity (VIF ≈ 1): CA19-9 and MKI67 overlap little, so each predictor contributes unique information. Right panel, high collinearity (VIF > 10): CA19-9 and CEA overlap heavily, making their effects hard to separate and the coefficients unstable.]

Model Selection: Finding the Best Model

With p predictors, there are 2ᵖ possible models. For p = 10, that’s 1024 models. We need principled ways to choose.

Information Criteria

| Criterion | Formula | Preference |
| --- | --- | --- |
| AIC (Akaike) | -2·ln(L) + 2k | Lower = better; balances fit and complexity |
| BIC (Bayesian) | -2·ln(L) + k·ln(n) | Lower = better; penalizes complexity more than AIC |
| Adjusted R² | 1 - (1-R²)·(n-1)/(n-k-1) | Higher = better; penalizes added predictors |

Where L = likelihood, k = number of parameters, n = sample size.

AIC tends to select slightly larger models (better prediction). BIC tends to select smaller models (better interpretation). When they disagree, consider your goal.
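
As a concrete illustration, statsmodels exposes both criteria (and adjusted R²) on any fitted OLS result; a sketch on simulated data where the larger candidate model carries one pure-noise predictor:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 120
x1, x2, noise_pred = rng.normal(size=(3, n))
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(scale=1.0, size=n)

X_small = sm.add_constant(np.column_stack([x1, x2]))            # true model
X_big = sm.add_constant(np.column_stack([x1, x2, noise_pred]))  # plus noise

for name, X in [("small", X_small), ("big", X_big)]:
    fit = sm.OLS(y, X).fit()
    # Lower AIC/BIC is better; BIC penalizes the extra parameter more heavily
    print(f"{name}: AIC = {fit.aic:.1f}, BIC = {fit.bic:.1f}, adj R2 = {fit.rsquared_adj:.3f}")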

[Figure: Model selection as an AIC/BIC tradeoff, balancing model fit against complexity. Plotted against the number of predictors, both criteria fall and then rise; the AIC minimum (favoring prediction) sits at a slightly more complex model than the BIC minimum (favoring parsimony), with underfitting to the left and overfitting to the right.]

Stepwise Regression

Automated search through predictor combinations:

| Direction | Strategy | Risk |
| --- | --- | --- |
| Forward | Start empty, add the best predictor one at a time | May miss suppressor effects |
| Backward | Start full, remove the worst predictor one at a time | May keep redundant predictors |
| Both | Add and remove at each step | Best coverage, slower |

Common pitfall: Stepwise regression is exploratory, not confirmatory. The selected model may not replicate. Always validate on held-out data or use cross-validation.
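
For intuition about how the forward strategy works, here is a bare-bones AIC-driven selection loop in Python. It is a sketch for exploration only, in the spirit of the pitfall above; packages such as mlxtend provide tested implementations:

import numpy as np
import statsmodels.api as sm

def forward_select(df, response):
    """Greedy forward selection: repeatedly add the predictor that lowers AIC most."""
    remaining = [c for c in df.columns if c != response]
    selected = []
    best_aic = np.inf
    while remaining:
        # Score every one-predictor extension of the current model
        trials = []
        for cand in remaining:
            X = sm.add_constant(df[selected + [cand]])
            trials.append((sm.OLS(df[response], X).fit().aic, cand))
        aic, cand = min(trials)
        if aic >= best_aic:
            break  # no candidate improves AIC, so stop
        best_aic = aic
        selected.append(cand)
        remaining.remove(cand)
    return selected

# Usage: forward_select(df, "stage") on a DataFrame of stage plus biomarkers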

Regularized Regression: Handling Many Predictors

When you have many predictors (especially in genomics where p >> n), ordinary least squares fails. Regularization adds a penalty term:

Ridge Regression (L2)

$$\min \sum(y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2$$

  • Shrinks coefficients toward zero but never to exactly zero
  • Handles multicollinearity gracefully
  • Good when all predictors might be relevant

Lasso Regression (L1)

$$\min \sum(y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j|$$

  • Can shrink coefficients to exactly zero (automatic variable selection)
  • Produces sparse models (easy to interpret)
  • Preferred when you suspect only a few predictors matter

Elastic Net

$$\min \sum(y_i - \hat{y}_i)^2 + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$$

  • Combines L1 and L2 penalties
  • Good when predictors are correlated groups (keeps all or drops all from a group)
  • Controlled by mixing parameter α (0 = ridge, 1 = lasso)

| Method | Feature Selection | Correlated Predictors | Best For |
| --- | --- | --- | --- |
| Ridge | No (shrinks all) | Handles well | Many weak effects |
| Lasso | Yes (zeros out) | Picks one arbitrarily | Few strong effects |
| Elastic Net | Yes (grouped) | Handles well | Correlated groups |
[Figure: VIF as variance inflation from redundant predictors. Predicting tumor stage, tumor size carries little redundancy (VIF = 1.2), while CA19-9 (VIF = 8.5) and CEA (VIF = 9.2) measure overlapping biology. Reference points: VIF = 1, no collinearity; VIF = 5, investigate; VIF > 10, remove or combine.]
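
A compact way to see all three penalties side by side is scikit-learn's cross-validated estimators. The sketch below uses simulated data where only 3 of 8 predictors carry signal, and standardizes first because the penalties are scale-sensitive (scikit-learn's l1_ratio plays the role of the mixing parameter α above):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=1.0, size=n)

X_std = StandardScaler().fit_transform(X)  # penalties are scale-sensitive

ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_std, y)
lasso = LassoCV(cv=5).fit(X_std, y)
enet = ElasticNetCV(cv=5, l1_ratio=0.5).fit(X_std, y)  # l1_ratio = mixing α

for name, coefs in [("ridge", ridge.coef_), ("lasso", lasso.coef_),
                    ("elastic net", enet.coef_)]:
    nonzero = int(np.sum(np.abs(coefs) > 1e-8))
    print(f"{name}: non-zero coefficients = {nonzero}")
# Expect ridge to keep all 8 small but non-zero, while lasso and elastic net
# zero out most of the 5 noise predictors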

Polynomial Regression

When the relationship is curved but you want to stay in the regression framework:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$$

Useful for dose-response curves, growth curves, and non-linear biomarker relationships. But beware overfitting with high-degree polynomials.

Clinical relevance: The relationship between BMI and mortality is U-shaped — both low and high BMI increase risk. A linear model misses this entirely; a quadratic model captures it.
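
A short scikit-learn sketch of this case, with simulated (not clinical) U-shaped risk data, shows the quadratic term doing the work:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
bmi = rng.normal(29, 5, size=200).reshape(-1, 1)
risk = 0.5 + 0.1 * (bmi.ravel() - 25) ** 2 + rng.normal(scale=2.0, size=200)

linear = LinearRegression().fit(bmi, risk)

# degree=2 adds a bmi^2 column; include_bias=False omits the constant,
# which LinearRegression supplies itself
bmi_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(bmi)
quad = LinearRegression().fit(bmi_poly, risk)

print(f"linear R^2:    {linear.score(bmi, risk):.3f}")      # misses the curvature
print(f"quadratic R^2: {quad.score(bmi_poly, risk):.3f}")   # captures the bend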

Multiple Regression in BioLang

Building a Multiple Regression Model

set_seed(42)
# Pancreatic cancer: predict tumor stage from biomarkers
let n = 120

# Simulate correlated biomarkers
let age = rnorm(n, 65, 10)
let tumor_size = rnorm(n, 3.5, 1.2)
let ca19_9 = tumor_size * 50 + rnorm(n, 100, 40)
let cea = ca19_9 * 0.3 + rnorm(n, 5, 3)  # correlated with CA19-9
let mki67 = rnorm(n, 30, 15)
let albumin = rnorm(n, 3.5, 0.5)
let crp = rnorm(n, 15, 10)
let nlr = rnorm(n, 4, 2)

# True model: stage depends on tumor_size, ca19_9, mki67, age, albumin
let stage = 1.0 + 0.4 * tumor_size + 0.003 * ca19_9 + 0.01 * mki67
    + 0.01 * age - 0.3 * albumin
    + rnorm(n, 0, 0.3)

# Fit multiple regression with all predictors
let data = table({
    "stage": stage, "age": age, "tumor_size": tumor_size,
    "ca19_9": ca19_9, "cea": cea, "mki67": mki67,
    "albumin": albumin, "crp": crp, "nlr": nlr
})
let model_full = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + crp + nlr", data)

print("=== Full Model ===")
print("R²: {model_full.r_squared |> round(3)}")
print("Adjusted R²: {model_full.adj_r_squared |> round(3)}")

Checking Multicollinearity

# Check multicollinearity via pairwise correlations
let pred_names = ["age", "tumor_size", "ca19_9", "cea", "mki67",
                  "albumin", "crp", "nlr"]
let predictors = [age, tumor_size, ca19_9, cea, mki67, albumin, crp, nlr]

print("=== Pairwise Correlations (VIF proxy) ===")
for i in 0..8 {
    for j in (i+1)..8 {
        let r = cor(predictors[i], predictors[j])
        if abs(r) > 0.7 {
            print("  {pred_names[i]} vs {pred_names[j]}: r = {r |> round(3)} *** HIGH")
        }
    }
}

# CA19-9 and CEA likely show high correlation

Stepwise Model Selection

# Manual model comparison: fit reduced models and compare R²
# Drop CEA (collinear with CA19-9) and noise variables (CRP, NLR)
let data_reduced = table({
    "stage": stage, "age": age, "tumor_size": tumor_size,
    "ca19_9": ca19_9, "mki67": mki67, "albumin": albumin
})
let model_reduced = lm("stage ~ age + tumor_size + ca19_9 + mki67 + albumin", data_reduced)

print("=== Model Comparison ===")
print("Full model R²:    {model_full.r_squared |> round(3)}")
print("Full model Adj R²:    {model_full.adj_r_squared |> round(3)}")
print("Reduced model R²: {model_reduced.r_squared |> round(3)}")
print("Reduced model Adj R²: {model_reduced.adj_r_squared |> round(3)}")
print("If Adj R² is similar, the simpler model is preferred")

Regularized Regression

# Regularized regression concepts
# Note: Ridge, Lasso, and Elastic Net are advanced methods
# typically used when p >> n (many predictors, few samples).
# BioLang provides lm() for standard regression; for regularization,
# use Python (scikit-learn) or R (glmnet) as shown below.

# Demonstrate the concept: compare full vs sparse models
# A "lasso-like" approach: fit models dropping one predictor at a time
# and see which predictors contribute the least
let predictors_list = ["age", "tumor_size", "ca19_9", "cea", "mki67", "albumin", "crp", "nlr"]

print("=== Variable Importance (drop-one analysis) ===")
let full_r2 = model_full.r_squared
print("Full model R²: {full_r2 |> round(4)}")

# Compare by dropping noise predictors
let model_no_crp = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + nlr", data)
let model_no_nlr = lm("stage ~ age + tumor_size + ca19_9 + cea + mki67 + albumin + crp", data)
print("Without CRP: R² = {model_no_crp.r_squared |> round(4)}")
print("Without NLR: R² = {model_no_nlr.r_squared |> round(4)}")
print("Predictors with minimal R² drop are candidates for removal")

Polynomial Regression

set_seed(42)
# Non-linear biomarker relationship
let bmi = rnorm(100, 29, 5)
let risk = 0.5 + 0.1 * (bmi - 25) ** 2 + rnorm(100, 0, 2)

# Linear fit — misses the U-shape
let lin_data = table({"risk": risk, "bmi": bmi})
let model_linear = lm("risk ~ bmi", lin_data)
print("Linear R²: {model_linear.r_squared |> round(3)}")

# Polynomial fit — add bmi² term
let bmi_sq = bmi |> map(|x| x ** 2)
let poly_data = table({"risk": risk, "bmi": bmi, "bmi_sq": bmi_sq})
let model_poly = lm("risk ~ bmi + bmi_sq", poly_data)
print("Quadratic R²: {model_poly.r_squared |> round(3)}")

# Visualize the improvement
let plot_data = table({"BMI": bmi, "Risk": risk})
plot(plot_data, {type: "scatter", x: "BMI", y: "Risk",
    title: "BMI vs Risk: Linear vs Quadratic Fit"})

Predicted vs. Actual Plot

# The ultimate model validation plot
# Compute predicted values from the reduced model
let predicted_stage = model_reduced.fitted

let pred_actual = table({"Actual": stage, "Predicted": predicted_stage})
plot(pred_actual, {type: "scatter", x: "Actual", y: "Predicted",
    title: "Predicted vs Actual (Reduced Model)"})

# Points along the diagonal = good predictions
# Systematic deviations = model problems
let r_pred = cor(stage, predicted_stage)
print("Correlation (actual vs predicted): {r_pred |> round(3)}")

Python:

import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Multiple regression (df is a pandas DataFrame holding stage + biomarkers)
X = sm.add_constant(df[['age', 'tumor_size', 'ca19_9', 'cea', 'mki67']])
model = sm.OLS(df['stage'], X).fit()
print(model.summary())

# VIF (index 0 is the added constant; its "VIF" is not meaningful, so skip it)
vif = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

# Stepwise (manual — no built-in in statsmodels)
# Use mlxtend: from mlxtend.feature_selection import SequentialFeatureSelector

# Lasso (standardize first: the L1 penalty is scale-sensitive)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
X_scaled = StandardScaler().fit_transform(df[['age', 'tumor_size', 'ca19_9', 'cea', 'mki67']])
lasso = LassoCV(cv=5).fit(X_scaled, df['stage'])
print(f"Non-zero: {(lasso.coef_ != 0).sum()}")

R:

# Multiple regression (df is a data frame holding stage + the biomarkers)
model <- lm(stage ~ age + tumor_size + ca19_9 + cea + mki67 +
            albumin + crp + nlr, data = df)
summary(model)

# VIF
library(car)
vif(model)

# Stepwise
step_model <- step(model, direction = "both")

# Lasso (glmnet needs a numeric predictor matrix; reuse the fitted model's)
library(glmnet)
X_matrix <- model.matrix(model)[, -1]  # drop the intercept column
cv_lasso <- cv.glmnet(X_matrix, df$stage, alpha = 1)
coef(cv_lasso, s = "lambda.min")

# Ridge
cv_ridge <- cv.glmnet(X_matrix, df$stage, alpha = 0)

# Elastic net
cv_enet <- cv.glmnet(X_matrix, df$stage, alpha = 0.5)

Exercises

Exercise 1: Build and Compare Models

Given 150 samples with 6 predictors, build a full model, then use stepwise selection to find a reduced model. Compare using AIC, BIC, and adjusted R².

set_seed(42)
let n = 150
let x1 = rnorm(n, 10, 3)
let x2 = rnorm(n, 5, 2)
let x3 = x1 * 0.5 + rnorm(n, 0, 1)  # correlated with x1
let x4 = rnorm(n, 20, 5)
let x5 = rnorm(n, 0, 1)  # noise
let x6 = rnorm(n, 0, 1)  # noise

let y = 5 + 2 * x1 + 1.5 * x2 + 0.5 * x4 + rnorm(n, 0, 3)

# 1. Fit full model with all 6 predictors using lm()
# 2. Check pairwise cor() — which predictors are collinear?
# 3. Compare full vs reduced models by adj_r_squared
# 4. Which predictors matter? Do they match the true model?

Exercise 2: Ridge vs. Lasso

Compare ridge and lasso on a dataset where only 3 of 8 predictors truly matter. Which method correctly identifies the true predictors?

set_seed(42)
let n = 100

# 8 predictors, only 3 are real
# Fit lm() with all 8, then with only the true 3
# Compare R² — does the reduced model perform similarly?
# Note: for true ridge/lasso, use Python (scikit-learn) or R (glmnet)

Exercise 3: Polynomial vs. Linear

Fit linear, quadratic, and cubic models to a dose-response curve. Use AIC to select the best model, and show that adding unnecessary complexity (cubic) hurts.

set_seed(42)
let dose = rnorm(80, 25, 12) |> map(|d| max(d, 1))
let response = 10 + 5 * log2(dose) + rnorm(80, 0, 3)

# 1. Fit lm() with dose, dose + dose², dose + dose² + dose³
# 2. Compare R² and adj_r_squared for each
# 3. Which degree gives the best balance of fit and simplicity?

Exercise 4: Predicted vs. Actual

Build a multiple regression model for gene expression prediction. Create a predicted vs. actual scatter plot. Assess where the model succeeds and fails.

set_seed(42)
let n = 200

# Simulate: predict gene expression from 4 TF binding signals
# Build model using lm(), generate predicted vs actual plot
# Calculate mean absolute error
# Identify the 5 worst predictions — what makes them outliers?

Key Takeaways

  • Multiple regression estimates the effect of each predictor controlling for all others — fundamentally different from separate simple regressions
  • Multicollinearity (VIF > 10) inflates standard errors and makes coefficients uninterpretable — detect it before interpreting
  • Model selection balances fit and complexity: AIC favors prediction, BIC favors parsimony, adjusted R² penalizes added terms
  • Stepwise regression is useful for exploration but should be validated on held-out data
  • Lasso performs automatic variable selection (zeros out irrelevant predictors); Ridge shrinks all coefficients but keeps them; Elastic net combines both
  • Polynomial regression captures non-linear relationships within the regression framework but risks overfitting
  • Always check residuals, predicted vs. actual plots, and VIF before trusting your model
  • With genomics data (p >> n), regularization is not optional — it’s essential

What’s Next

What if your outcome isn’t continuous but binary — responder vs. non-responder, alive vs. dead, mutant vs. wild-type? Day 16 introduces logistic regression, where we predict probabilities of categorical outcomes using ROC curves, odds ratios, and the powerful GLM framework.