Day 14: Linear Regression — Prediction from Data
The Problem
Dr. James Park is a pharmacogenomicist working with the NCI-60 cell line panel — 60 cancer cell lines spanning 9 tissue types. He has gene expression data and drug sensitivity measurements (IC50 values) for each line. His question: Can we predict how sensitive a cell line will be to a new kinase inhibitor based on its expression of the drug’s target gene?
He knows the target gene and IC50 seem correlated (Day 13 confirmed r = -0.72). But correlation just says “they move together.” James needs to go further: given a new cell line with a known expression level, what IC50 should he predict? And how confident should he be?
This is the leap from association to prediction — the domain of linear regression.
What Is Linear Regression?
Linear regression fits a straight line through data points to model the relationship between a predictor (X) and a response (Y):
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
| Term | Meaning | Biological Example |
|---|---|---|
| Y | Response variable | Drug IC50 |
| X | Predictor variable | Target gene expression |
| β₀ | Intercept (Y when X = 0) | Baseline sensitivity |
| β₁ | Slope (change in Y per unit X) | Sensitivity per expression unit |
| ε | Error (noise) | Biological + technical variation |
The method of least squares finds the β₀ and β₁ that minimize the sum of squared residuals:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
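The closed-form solution to this minimization is β₁ = Sxy/Sxx and β₀ = ȳ − β₁x̄, which is easy to verify numerically. A minimal Python sketch (the simulated expression/IC50 values here are illustrative, mirroring the BioLang example later in this lesson):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 3, 60)                     # simulated target expression
y = 85 - 4.5 * x + rng.normal(0, 8, 60)       # simulated IC50

# Closed-form least-squares estimates:
#   beta1 = Sxy / Sxx,  beta0 = y_bar - beta1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Cross-check against numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
assert abs(beta1 - b1_np) < 1e-8 and abs(beta0 - b0_np) < 1e-8
print(f"slope = {beta1:.2f}, intercept = {beta0:.1f}")
```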
Key insight: Regression has a clear asymmetry that correlation lacks. X predicts Y — there is a designated predictor and a designated outcome. The regression of Y on X is NOT the same as X on Y.
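The asymmetry is easy to demonstrate: regressing Y on X gives slope r·(s_Y/s_X), while regressing X on Y gives r·(s_X/s_Y), and the two are reciprocals only when |r| = 1. A small Python check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 1, 200)   # true slope 2, but r < 1

slope_yx = np.polyfit(x, y, 1)[0]   # regress Y on X
slope_xy = np.polyfit(y, x, 1)[0]   # regress X on Y

# If the two regressions were symmetric, slope_yx would equal 1/slope_xy.
# Instead their product equals r^2, which is < 1 unless the fit is perfect.
r = np.corrcoef(x, y)[0, 1]
assert abs(slope_yx * slope_xy - r**2) < 1e-10
print(f"Y on X: {slope_yx:.3f},  1/(X on Y): {1/slope_xy:.3f},  r^2 = {r**2:.3f}")
```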
From Correlation to Regression
Correlation and simple regression are intimately linked:
| Metric | What It Tells You |
|---|---|
| Pearson r | Strength and direction of linear association |
| R² = r² | Proportion of variance in Y explained by X |
| β₁ (slope) | How much Y changes per unit change in X |
| p-value of β₁ | Whether the slope is significantly different from zero |
If r = -0.72, then R² = 0.52, meaning 52% of the variation in IC50 is explained by target expression. The remaining 48% is unexplained — due to other genes, pathway redundancy, or noise.
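In simple (one-predictor) regression, R² = r² is an exact identity, which is worth verifying once. A quick Python check on simulated data matching this lesson's scenario:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

r, _ = stats.pearsonr(expression, ic50)

# R^2 from the fitted line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(expression, ic50, 1)
resid = ic50 - (slope * expression + intercept)
r_squared = 1 - np.sum(resid**2) / np.sum((ic50 - ic50.mean())**2)

assert abs(r**2 - r_squared) < 1e-10   # exact identity in simple regression
print(f"r = {r:.3f},  r^2 = {r**2:.3f},  R^2 = {r_squared:.3f}")
```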
Interpreting Regression Output
A regression produces a table of coefficients:
| Term | Estimate | Std. Error | t-value | p-value |
|---|---|---|---|---|
| Intercept (β₀) | 85.3 | 6.2 | 13.8 | < 0.001 |
| Expression (β₁) | -4.7 | 0.8 | -5.9 | < 0.001 |
Reading this:
- Intercept: When expression = 0, predicted IC50 is 85.3 μM (often not biologically meaningful)
- Slope: Each 1-unit increase in expression decreases IC50 by 4.7 μM (more expression → more sensitive)
- p-value: The slope is significantly different from zero — expression truly predicts sensitivity
- R² (reported with the model summary, not in the coefficient table): how well the line fits overall — here 0.52
Common pitfall: The intercept often represents an extrapolation beyond the data range. If expression ranges from 5-15, interpreting “IC50 when expression = 0” is meaningless. Focus on the slope.
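One standard remedy is to center the predictor before fitting: subtracting the mean makes the intercept the predicted response at average expression, a point that is always inside the data range. A Python sketch (simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

# Center X: the intercept becomes the predicted IC50 at *mean* expression
x_centered = expression - expression.mean()
slope_c, intercept_c = np.polyfit(x_centered, ic50, 1)

# Centering leaves the slope unchanged; the new intercept equals mean(Y)
slope, _ = np.polyfit(expression, ic50, 1)
assert abs(slope_c - slope) < 1e-8
assert abs(intercept_c - ic50.mean()) < 1e-8
print(f"slope = {slope_c:.2f}, intercept (at mean expression) = {intercept_c:.1f}")
```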
Prediction: Point Estimates and Intervals
Once we have a fitted model, we can predict Y for new X values. But predictions come with two types of uncertainty:
Confidence Interval (for the mean response): “We’re 95% confident the average IC50 for all cell lines with expression = 10 lies in this range.”
Prediction Interval (for a single new observation): “We’re 95% confident the IC50 for one specific new cell line with expression = 10 lies in this range.”
Prediction intervals are always wider than confidence intervals because they include individual-level noise (ε).
Clinical relevance: In precision medicine, you’re usually predicting for one patient (prediction interval), not the population average (confidence interval). The prediction interval honestly reflects your uncertainty.
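The two intervals differ by a single term. At a new point x₀, the standard error of the mean response is s·√(1/n + (x₀ − x̄)²/Sxx); the standard error for a single new observation adds 1 under the square root, accounting for the irreducible noise ε. A Python sketch using these textbook formulas (simulated data for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 3, 60)
y = 85 - 4.5 * x + rng.normal(0, 8, 60)
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s = np.sqrt(np.sum(resid**2) / (n - 2))        # residual standard error
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 10.0
y_hat = slope * x0 + intercept
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # for the CI
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # for the PI

print(f"95% CI: {y_hat - t_crit*se_mean:.1f} to {y_hat + t_crit*se_mean:.1f}")
print(f"95% PI: {y_hat - t_crit*se_pred:.1f} to {y_hat + t_crit*se_pred:.1f}")
assert se_pred > se_mean   # the prediction interval is always wider
```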
Residual Analysis: Checking Your Model
A regression model makes assumptions. Residuals (observed - predicted) reveal violations:
| Plot | What to Check | Warning Sign |
|---|---|---|
| Residuals vs. Fitted | Constant variance (homoscedasticity) | Fan/funnel shape |
| Q-Q plot of residuals | Normality of errors | Curved pattern |
| Residuals vs. order | Independence | Systematic trend |
| Scale-Location | Variance trend | Upward slope |
The four assumptions of linear regression:
- Linearity: Y is a linear function of X (check scatter plot)
- Independence: Observations are independent (study design)
- Normality: Residuals are normally distributed (Q-Q plot)
- Homoscedasticity: Constant error variance (residuals vs. fitted)
Common pitfall: Many biological relationships are not linear on the original scale but become linear after log-transformation. Always consider transforming IC50 or expression values before fitting.
When Regression Goes Wrong
Extrapolation
Predicting beyond the range of your data is dangerous. If expression ranges from 5-15 in your training data, predicting IC50 for expression = 25 assumes the linear trend continues — it may not.
Confounding
A significant regression doesn’t prove causation. The apparent effect of expression on IC50 could be mediated by a third variable (e.g., tissue type).
Non-linearity
If the true relationship is curved (threshold effect, saturation), a line fits poorly. Residual plots expose this.
Influential Points
A single extreme observation can dramatically change the fitted line. Leverage and Cook’s distance help identify such points.
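For simple regression both diagnostics have closed forms: leverage hᵢ = 1/n + (xᵢ − x̄)²/Sxx, and Cook's distance combines the squared residual with leverage, Dᵢ = eᵢ²/(p·s²) · hᵢ/(1 − hᵢ)², with p = 2 parameters. A Python sketch that plants one outlier and recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 2, 49)
y = 5 + 1.5 * x + rng.normal(0, 2, 49)
# Plant one influential outlier far off the trend (true value at x=10 is ~20)
x = np.append(x, 10.0)
y = np.append(y, 50.0)
n, p = len(x), 2

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s2 = np.sum(resid**2) / (n - p)

# Leverage for simple regression, then Cook's distance per point
h = 1/n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
cooks_d = resid**2 / (p * s2) * h / (1 - h)**2

print(f"Max Cook's D = {cooks_d.max():.3f} at index {cooks_d.argmax()}")
assert int(cooks_d.argmax()) == n - 1   # the planted outlier is most influential
```

A common rule of thumb flags points with D above 4/n (or above 1) for closer inspection.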
Linear Regression in BioLang
Simple Linear Regression
set_seed(42)
# NCI-60 pharmacogenomics: predict IC50 from target expression
let n = 60
# Simulate expression and drug sensitivity
let expression = rnorm(n, 10, 3)
let ic50 = 85 - 4.5 * expression + rnorm(n, 0, 8)
# Fit simple linear regression
let model = lm(ic50, expression)
# Print model summary
print("=== Linear Regression Summary ===")
print("Intercept: {model.intercept}")
print("Slope: {model.slope}")
print("R²: {model.r_squared}")
print("Adjusted R²: {model.adj_r_squared}")
print("F-statistic p-value: {model.p_value}")
Interpreting Coefficients
# Detailed coefficient table
print("=== Coefficients ===")
print(" Intercept: {model.intercept}")
print(" Expression β: {model.slope}")
print(" p-value: {model.p_value}")
# Interpretation
let slope = model.slope
print("\nFor each 1-unit increase in expression,")
print("IC50 decreases by {slope |> abs |> round(2)} μM")
print("R² = {model.r_squared |> round(3)}: expression explains")
print("{(model.r_squared * 100) |> round(1)}% of IC50 variation")
Prediction with Intervals
# Predict IC50 for new cell lines
let new_expression = [7.0, 10.0, 13.0]
# Point predictions using model coefficients
print("=== Predictions ===")
for i in 0..3 {
let predicted = model.slope * new_expression[i] + model.intercept
print("Expression = {new_expression[i]}: Predicted IC50 = {predicted |> round(1)} μM")
}
Residual Analysis
# Compute residuals and fitted values
let fitted = expression |> map(|x| model.slope * x + model.intercept)
let resid = []
for i in 0..n {
resid = resid + [ic50[i] - fitted[i]]
}
# 1. Residuals vs Fitted — check for patterns
let resid_table = table({"Fitted": fitted, "Residual": resid})
plot(resid_table, {type: "scatter", x: "Fitted", y: "Residual",
title: "Residuals vs Fitted"})
# 2. Check residual distribution
print("Residual summary:")
print(summary(resid))
Visualization: Scatter with Regression Line
# Scatter plot with regression line
let plot_data = table({"Expression": expression, "IC50": ic50})
plot(plot_data, {type: "scatter", x: "Expression", y: "IC50",
title: "Drug Sensitivity vs Target Expression (NCI-60)",
x_label: "Target Gene Expression (log2 FPKM)",
y_label: "Drug IC50 (μM)"})
Demonstrating Problems: Extrapolation
set_seed(42)
# Show danger of extrapolation
let x = rnorm(50, 10, 3)
let y = 100 - 3 * x + 0.2 * x ** 2 + rnorm(50, 0, 3)
# Fit linear model in observed range
let model_extrap = lm(y, x)
print("R² in training range: {model_extrap.r_squared |> round(3)}")
# Predict within range — reasonable
let pred_in = model_extrap.slope * 10.0 + model_extrap.intercept
print("Prediction at x=10 (in range): {pred_in |> round(1)}")
# Predict outside range — dangerous!
let pred_out = model_extrap.slope * 25.0 + model_extrap.intercept
print("Prediction at x=25 (extrapolation): {pred_out |> round(1)}")
print("True value at x=25 would be: {(100 - 3*25 + 0.2*25**2) |> round(1)}")
print("Extrapolation error demonstrates the danger!")
Working with Log-Transformed Data
set_seed(42)
# Many biological relationships are linear on the log scale
let dose = rnorm(40, 50, 25) |> map(|d| max(d, 0.1))
let response = 50 / (1 + (dose / 10) ** 1.5) + rnorm(40, 0, 3)
# Linear model on raw scale — poor fit
let model_raw = lm(response, dose)
print("R² (raw scale): {model_raw.r_squared |> round(3)}")
# Log-transform dose — much better
let log_dose = dose |> map(|d| log2(d))
let model_log = lm(response, log_dose)
print("R² (log scale): {model_log.r_squared |> round(3)}")
# Compare residual plots to see the difference
Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
# Simple linear regression
X = sm.add_constant(expression) # adds intercept
model = sm.OLS(ic50, X).fit()
print(model.summary())
# Prediction with intervals
predictions = model.get_prediction(sm.add_constant([7, 10, 13]))
pred_summary = predictions.summary_frame(alpha=0.05)
# obs_ci_lower/obs_ci_upper = prediction interval
# mean_ci_lower/mean_ci_upper = confidence interval
# Residual analysis
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].axhline(0, color='red')
stats.probplot(model.resid, plot=axes[1])
R:
# Simple linear regression
model <- lm(ic50 ~ expression)
summary(model)
# Prediction with intervals
new_data <- data.frame(expression = c(7, 10, 13))
predict(model, new_data, interval = "confidence")
predict(model, new_data, interval = "prediction")
# Residual diagnostics — 4 diagnostic plots
par(mfrow = c(2, 2))
plot(model)
Exercises
Exercise 1: Fit and Interpret
A study measures tumor mutation burden (TMB) and neoantigen count across 80 melanoma samples. Fit a linear regression and answer: How many additional neoantigens does each additional mutation predict?
set_seed(42)
let n = 80
let tmb = rnorm(n, 200, 80)
let neoantigens = 5 + 0.15 * tmb + rnorm(n, 0, 12)
# 1. Fit lm(neoantigens, tmb)
# 2. What is the slope? Interpret it biologically
# 3. What is model.r_squared? Is TMB a good predictor of neoantigen count?
# 4. Predict neoantigen count for TMB = 100, 200, 400
Exercise 2: Residual Diagnostics
Fit a linear model to the dose-response data below and use residual plots to determine whether the linear assumption holds. If not, what transformation improves the fit?
set_seed(42)
let dose = rnorm(60, 25, 12) |> map(|d| max(d, 1))
let effect = 20 * log2(dose) + rnorm(60, 0, 5)
# 1. Fit lm(effect, dose)
# 2. Create residuals vs fitted plot — what pattern do you see?
# 3. Try log-transforming dose and re-fitting
# 4. Compare R² values and residual patterns
Exercise 3: Confidence vs. Prediction Intervals
Demonstrate why prediction intervals are always wider than confidence intervals. Generate data, fit a model, and plot both intervals across the predictor range.
set_seed(42)
let x = rnorm(50, 10, 5)
let y = 10 + 2 * x + rnorm(50, 0, 4)
# 1. Fit lm(y, x)
# 2. Generate point predictions for x = 0, 2, 4, ..., 20
#    using model.slope * xi + model.intercept
# 3. Compute the CI half-width: t * s * sqrt(1/n + (xi - mean(x))^2 / Sxx)
# 4. Compute the PI half-width: same formula with an extra "+ 1" under the sqrt
# 5. Plot both bands — the PI is wider everywhere, and both flare at the edges
Exercise 4: The Outlier Effect
Add a single influential outlier to well-behaved data and show how it changes the slope, R², and predictions. Then remove it and compare.
set_seed(42)
let x_clean = rnorm(49, 10, 2)
let y_clean = 5 + 1.5 * x_clean + rnorm(49, 0, 2)
# Add one extreme outlier at x=10, y=50 (should be ~20)
# 1. Fit model with and without the outlier
# 2. Compare slopes and R² values
# 3. How much does prediction at x=12 change?
Key Takeaways
- Linear regression models Y = β₀ + β₁X + ε, using least squares to find the best-fit line
- R² tells you the proportion of variance explained; the slope tells you the rate of change
- Prediction intervals (for individuals) are always wider than confidence intervals (for the mean) — use the right one for your question
- Residual analysis is mandatory: check linearity, normality, and constant variance before trusting results
- Log-transforming variables often linearizes biological relationships (dose-response, expression-phenotype)
- Beware extrapolation — the linear trend may not continue beyond your data range
- A significant regression does not prove causation — confounders may drive the relationship
- Always visualize your data with a scatter plot before and after fitting
What’s Next
One predictor is rarely enough. Day 15 introduces multiple regression, where we predict outcomes from many variables simultaneously — and face new challenges like multicollinearity, model selection, and regularization.