Day 14: Linear Regression — Prediction from Data
The Problem
Dr. James Park is a pharmacogenomicist working with the NCI-60 cell line panel — 60 cancer cell lines spanning 9 tissue types. He has gene expression data and drug sensitivity measurements (IC50 values) for each line. His question: Can we predict how sensitive a cell line will be to a new kinase inhibitor based on its expression of the drug’s target gene?
He knows the target gene and IC50 seem correlated (Day 13 confirmed r = -0.72). But correlation just says “they move together.” James needs to go further: given a new cell line with a known expression level, what IC50 should he predict? And how confident should he be?
This is the leap from association to prediction — the domain of linear regression.
What Is Linear Regression?
Linear regression fits a straight line through data points to model the relationship between a predictor (X) and a response (Y):
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
| Term | Meaning | Biological Example |
|---|---|---|
| Y | Response variable | Drug IC50 |
| X | Predictor variable | Target gene expression |
| β₀ | Intercept (Y when X = 0) | Baseline sensitivity |
| β₁ | Slope (change in Y per unit X) | Sensitivity per expression unit |
| ε | Error (noise) | Biological + technical variation |
The method of least squares finds the β₀ and β₁ that minimize the sum of squared residuals:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
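The closed-form solution to this minimization is β₁ = Sxy/Sxx and β₀ = ȳ − β₁x̄, which is easy to verify numerically. A minimal Python sketch (the simulated expression/IC50 values here are illustrative, mirroring the BioLang example later in this lesson):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 3, 60)                     # simulated target expression
y = 85 - 4.5 * x + rng.normal(0, 8, 60)       # simulated IC50

# Closed-form least-squares estimates:
#   beta1 = Sxy / Sxx,  beta0 = y_bar - beta1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Cross-check against numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
assert abs(beta1 - b1_np) < 1e-8 and abs(beta0 - b0_np) < 1e-8
print(f"slope = {beta1:.2f}, intercept = {beta0:.1f}")
```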
Key insight: Regression has a clear asymmetry that correlation lacks. X predicts Y — there is a designated predictor and a designated outcome. The regression of Y on X is NOT the same as X on Y.
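The asymmetry is easy to demonstrate: regressing Y on X gives slope r·(s_Y/s_X), while regressing X on Y gives r·(s_X/s_Y), and the two are reciprocals only when |r| = 1. A small Python check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 1, 200)   # true slope 2, but r < 1

slope_yx = np.polyfit(x, y, 1)[0]   # regress Y on X
slope_xy = np.polyfit(y, x, 1)[0]   # regress X on Y

# If the two regressions were symmetric, slope_yx would equal 1/slope_xy.
# Instead their product equals r^2, which is < 1 unless the fit is perfect.
r = np.corrcoef(x, y)[0, 1]
assert abs(slope_yx * slope_xy - r**2) < 1e-10
print(f"Y on X: {slope_yx:.3f},  1/(X on Y): {1/slope_xy:.3f},  r^2 = {r**2:.3f}")
```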
From Correlation to Regression
Correlation and simple regression are intimately linked:
| Metric | What It Tells You |
|---|---|
| Pearson r | Strength and direction of linear association |
| R² = r² | Proportion of variance in Y explained by X |
| β₁ (slope) | How much Y changes per unit change in X |
| p-value of β₁ | Whether the slope is significantly different from zero |
If r = -0.72, then R² = 0.52, meaning 52% of the variation in IC50 is explained by target expression. The remaining 48% is unexplained — due to other genes, pathway redundancy, or noise.
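In simple (one-predictor) regression, R² = r² is an exact identity, which is worth verifying once. A quick Python check on simulated data matching this lesson's scenario:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

r, _ = stats.pearsonr(expression, ic50)

# R^2 from the fitted line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(expression, ic50, 1)
resid = ic50 - (slope * expression + intercept)
r_squared = 1 - np.sum(resid**2) / np.sum((ic50 - ic50.mean())**2)

assert abs(r**2 - r_squared) < 1e-10   # exact identity in simple regression
print(f"r = {r:.3f},  r^2 = {r**2:.3f},  R^2 = {r_squared:.3f}")
```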
Interpreting Regression Output
A regression produces a table of coefficients:
| Term | Estimate | Std. Error | t-value | p-value |
|---|---|---|---|---|
| Intercept (β₀) | 85.3 | 6.2 | 13.8 | < 0.001 |
| Expression (β₁) | -4.7 | 0.8 | -5.9 | < 0.001 |
Reading this:
- Intercept: When expression = 0, predicted IC50 is 85.3 μM (often not biologically meaningful)
- Slope: Each 1-unit increase in expression decreases IC50 by 4.7 μM (more expression → more sensitive)
- p-value: The slope is significantly different from zero — expression truly predicts sensitivity
- R² (reported with the model summary, not in the coefficient table): how well the line fits overall — here 0.52
Common pitfall: The intercept often represents an extrapolation beyond the data range. If expression ranges from 5-15, interpreting “IC50 when expression = 0” is meaningless. Focus on the slope.
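One standard remedy is to center the predictor before fitting: subtracting the mean makes the intercept the predicted response at average expression, a point that is always inside the data range. A Python sketch (simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

# Center X: the intercept becomes the predicted IC50 at *mean* expression
x_centered = expression - expression.mean()
slope_c, intercept_c = np.polyfit(x_centered, ic50, 1)

# Centering leaves the slope unchanged; the new intercept equals mean(Y)
slope, _ = np.polyfit(expression, ic50, 1)
assert abs(slope_c - slope) < 1e-8
assert abs(intercept_c - ic50.mean()) < 1e-8
print(f"slope = {slope_c:.2f}, intercept (at mean expression) = {intercept_c:.1f}")
```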
Prediction: Point Estimates and Intervals
Once we have a fitted model, we can predict Y for new X values. But predictions come with two types of uncertainty:
Confidence Interval (for the mean response): “We’re 95% confident the average IC50 for all cell lines with expression = 10 lies in this range.”
Prediction Interval (for a single new observation): “We’re 95% confident the IC50 for one specific new cell line with expression = 10 lies in this range.”
Prediction intervals are always wider than confidence intervals because they include individual-level noise (ε).
Clinical relevance: In precision medicine, you’re usually predicting for one patient (prediction interval), not the population average (confidence interval). The prediction interval honestly reflects your uncertainty.
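The two intervals differ by a single term. At a new point x₀, the standard error of the mean response is s·√(1/n + (x₀ − x̄)²/Sxx); the standard error for a single new observation adds 1 under the square root, accounting for the irreducible noise ε. A Python sketch using these textbook formulas (simulated data for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 3, 60)
y = 85 - 4.5 * x + rng.normal(0, 8, 60)
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s = np.sqrt(np.sum(resid**2) / (n - 2))        # residual standard error
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 10.0
y_hat = slope * x0 + intercept
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # for the CI
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # for the PI

print(f"95% CI: {y_hat - t_crit*se_mean:.1f} to {y_hat + t_crit*se_mean:.1f}")
print(f"95% PI: {y_hat - t_crit*se_pred:.1f} to {y_hat + t_crit*se_pred:.1f}")
assert se_pred > se_mean   # the prediction interval is always wider
```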
Residual Analysis: Checking Your Model
A regression model makes assumptions. Residuals (observed - predicted) reveal violations:
| Plot | What to Check | Warning Sign |
|---|---|---|
| Residuals vs. Fitted | Constant variance (homoscedasticity) | Fan/funnel shape |
| Q-Q plot of residuals | Normality of errors | Curved pattern |
| Residuals vs. order | Independence | Systematic trend |
| Scale-Location | Variance trend | Upward slope |
The four assumptions of linear regression:
- Linearity: Y is a linear function of X (check scatter plot)
- Independence: Observations are independent (study design)
- Normality: Residuals are normally distributed (Q-Q plot)
- Homoscedasticity: Constant error variance (residuals vs. fitted)
Common pitfall: Many biological relationships are not linear on the original scale but become linear after log-transformation. Always consider transforming IC50 or expression values before fitting.
When Regression Goes Wrong
Extrapolation
Predicting beyond the range of your data is dangerous. If expression ranges from 5-15 in your training data, predicting IC50 for expression = 25 assumes the linear trend continues — it may not.
Confounding
A significant regression doesn’t prove causation. The apparent effect of expression on IC50 could be mediated by a third variable (e.g., tissue type).
Non-linearity
If the true relationship is curved (threshold effect, saturation), a line fits poorly. Residual plots expose this.
Influential Points
A single extreme observation can dramatically change the fitted line. Leverage and Cook’s distance help identify such points.
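For simple regression both diagnostics have closed forms: leverage hᵢ = 1/n + (xᵢ − x̄)²/Sxx, and Cook's distance combines the squared residual with leverage, Dᵢ = eᵢ²/(p·s²) · hᵢ/(1 − hᵢ)², with p = 2 parameters. A Python sketch that plants one outlier and recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 2, 49)
y = 5 + 1.5 * x + rng.normal(0, 2, 49)
# Plant one influential outlier far off the trend (true value at x=10 is ~20)
x = np.append(x, 10.0)
y = np.append(y, 50.0)
n, p = len(x), 2

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s2 = np.sum(resid**2) / (n - p)

# Leverage for simple regression, then Cook's distance per point
h = 1/n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
cooks_d = resid**2 / (p * s2) * h / (1 - h)**2

print(f"Max Cook's D = {cooks_d.max():.3f} at index {cooks_d.argmax()}")
assert int(cooks_d.argmax()) == n - 1   # the planted outlier is most influential
```

A common rule of thumb flags points with D above 4/n (or above 1) for closer inspection.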
Linear Regression in BioLang
Simple Linear Regression
set_seed(42)
# NCI-60 pharmacogenomics: predict IC50 from target expression
let n = 60
# Simulate expression and drug sensitivity
let expression = rnorm(n, 10, 3)
let ic50 = 85 - 4.5 * expression + rnorm(n, 0, 8)
# Fit simple linear regression
let model = lm(ic50, expression)
# Print model summary
print("=== Linear Regression Summary ===")
print("Intercept: {model.intercept}")
print("Slope: {model.slope}")
print("R²: {model.r_squared}")
print("Adjusted R²: {model.adj_r_squared}")
print("F-statistic p-value: {model.p_value}")
Interpreting Coefficients
# Detailed coefficient table
print("=== Coefficients ===")
print(" Intercept: {model.intercept}")
print(" Expression β: {model.slope}")
print(" p-value: {model.p_value}")
# Interpretation
let slope = model.slope
print("\nFor each 1-unit increase in expression,")
print("IC50 decreases by {slope |> abs |> round(2)} μM")
print("R² = {model.r_squared |> round(3)}: expression explains")
print("{(model.r_squared * 100) |> round(1)}% of IC50 variation")
Prediction with Intervals
# Predict IC50 for new cell lines
let new_expression = [7.0, 10.0, 13.0]
# Point predictions using model coefficients
print("=== Predictions ===")
for i in 0..3 {
let predicted = model.slope * new_expression[i] + model.intercept
print("Expression = {new_expression[i]}: Predicted IC50 = {predicted |> round(1)} μM")
}
Residual Analysis
# Compute residuals and fitted values
let fitted = expression |> map(|x| model.slope * x + model.intercept)
let resid = []
for i in 0..n {
resid = resid + [ic50[i] - fitted[i]]
}
# 1. Residuals vs Fitted — check for patterns
let resid_table = table({"Fitted": fitted, "Residual": resid})
plot(resid_table, {type: "scatter", x: "Fitted", y: "Residual",
title: "Residuals vs Fitted"})
# 2. Check residual distribution
print("Residual summary:")
print(summary(resid))
Visualization: Scatter with Regression Line
# Scatter plot with regression line
let plot_data = table({"Expression": expression, "IC50": ic50})
plot(plot_data, {type: "scatter", x: "Expression", y: "IC50",
title: "Drug Sensitivity vs Target Expression (NCI-60)",
x_label: "Target Gene Expression (log2 FPKM)",
y_label: "Drug IC50 (μM)"})
Demonstrating Problems: Extrapolation
set_seed(42)
# Show danger of extrapolation
let x = rnorm(50, 10, 3)
let y = 100 - 3 * x + 0.2 * x ** 2 + rnorm(50, 0, 3)
# Fit linear model in observed range
let model_extrap = lm(y, x)
print("R² in training range: {model_extrap.r_squared |> round(3)}")
# Predict within range — reasonable
let pred_in = model_extrap.slope * 10.0 + model_extrap.intercept
print("Prediction at x=10 (in range): {pred_in |> round(1)}")
# Predict outside range — dangerous!
let pred_out = model_extrap.slope * 25.0 + model_extrap.intercept
print("Prediction at x=25 (extrapolation): {pred_out |> round(1)}")
print("True value at x=25 would be: {(100 - 3*25 + 0.2*25**2) |> round(1)}")
print("Extrapolation error demonstrates the danger!")
Working with Log-Transformed Data
set_seed(42)
# Many biological relationships are linear on the log scale
let dose = rnorm(40, 50, 25) |> map(|d| max(d, 0.1))
let response = 50 / (1 + (dose / 10) ** 1.5) + rnorm(40, 0, 3)
# Linear model on raw scale — poor fit
let model_raw = lm(response, dose)
print("R² (raw scale): {model_raw.r_squared |> round(3)}")
# Log-transform dose — much better
let log_dose = dose |> map(|d| log2(d))
let model_log = lm(response, log_dose)
print("R² (log scale): {model_log.r_squared |> round(3)}")
# Compare residual plots to see the difference
Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
# Simple linear regression
X = sm.add_constant(expression) # adds intercept
model = sm.OLS(ic50, X).fit()
print(model.summary())
# Prediction with intervals
predictions = model.get_prediction(sm.add_constant([7, 10, 13]))
pred_summary = predictions.summary_frame(alpha=0.05)
# obs_ci_lower/obs_ci_upper = prediction interval
# mean_ci_lower/mean_ci_upper = confidence interval
# Residual analysis
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].axhline(0, color='red')
stats.probplot(model.resid, plot=axes[1])
R:
# Simple linear regression
model <- lm(ic50 ~ expression)
summary(model)
# Prediction with intervals
new_data <- data.frame(expression = c(7, 10, 13))
predict(model, new_data, interval = "confidence")
predict(model, new_data, interval = "prediction")
# Residual diagnostics — 4 diagnostic plots
par(mfrow = c(2, 2))
plot(model)
Exercises
Exercise 1: Fit and Interpret
A study measures tumor mutation burden (TMB) and neoantigen count across 80 melanoma samples. Fit a linear regression and answer: How many additional neoantigens does each additional mutation predict?
set_seed(42)
let n = 80
let tmb = rnorm(n, 200, 80)
let neoantigens = 5 + 0.15 * tmb + rnorm(n, 0, 12)
# 1. Fit lm(neoantigens, tmb)
# 2. What is the slope? Interpret it biologically
# 3. What is model.r_squared? Is TMB a good predictor of neoantigen count?
# 4. Predict neoantigen count for TMB = 100, 200, 400
Exercise 2: Residual Diagnostics
Fit a linear model to the dose-response data below and use residual plots to determine whether the linear assumption holds. If not, what transformation improves the fit?
set_seed(42)
let dose = rnorm(60, 25, 12) |> map(|d| max(d, 1))
let effect = 20 * log2(dose) + rnorm(60, 0, 5)
# 1. Fit lm(effect, dose)
# 2. Create residuals vs fitted plot — what pattern do you see?
# 3. Try log-transforming dose and re-fitting
# 4. Compare R² values and residual patterns
Exercise 3: Confidence vs. Prediction Intervals
Demonstrate why prediction intervals are always wider than confidence intervals. Generate data, fit a model, and plot both intervals across the predictor range.
set_seed(42)
let x = rnorm(50, 10, 5)
let y = 10 + 2 * x + rnorm(50, 0, 4)
# 1. Fit lm(y, x)
# 2. Generate point predictions for x = 0, 2, 4, ..., 20
#    using model.slope * xi + model.intercept
# 3. Compute the CI half-width: t * s * sqrt(1/n + (xi - mean(x))^2 / Sxx)
# 4. Compute the PI half-width: same formula with an extra "+ 1" under the sqrt
# 5. Plot both bands — the PI is wider everywhere, and both flare at the edges
Exercise 4: The Outlier Effect
Add a single influential outlier to well-behaved data and show how it changes the slope, R², and predictions. Then remove it and compare.
set_seed(42)
let x_clean = rnorm(49, 10, 2)
let y_clean = 5 + 1.5 * x_clean + rnorm(49, 0, 2)
# Add one extreme outlier at x=10, y=50 (should be ~20)
# 1. Fit model with and without the outlier
# 2. Compare slopes and R² values
# 3. How much does prediction at x=12 change?
Key Takeaways
- Linear regression models Y = β₀ + β₁X + ε, using least squares to find the best-fit line
- R² tells you the proportion of variance explained; the slope tells you the rate of change
- Prediction intervals (for individuals) are always wider than confidence intervals (for the mean) — use the right one for your question
- Residual analysis is mandatory: check linearity, normality, and constant variance before trusting results
- Log-transforming variables often linearizes biological relationships (dose-response, expression-phenotype)
- Beware extrapolation — the linear trend may not continue beyond your data range
- A significant regression does not prove causation — confounders may drive the relationship
- Always visualize your data with a scatter plot before and after fitting
What’s Next
One predictor is rarely enough. Day 15 introduces multiple regression, where we predict outcomes from many variables simultaneously — and face new challenges like multicollinearity, model selection, and regularization.