
Day 14: Linear Regression — Prediction from Data

The Problem

Dr. James Park is a pharmacogenomicist working with the NCI-60 cell line panel — 60 cancer cell lines spanning 9 tissue types. He has gene expression data and drug sensitivity measurements (IC50 values) for each line. His question: Can we predict how sensitive a cell line will be to a new kinase inhibitor based on its expression of the drug’s target gene?

He already knows that target gene expression and IC50 are correlated (Day 13 confirmed r = -0.72). But correlation only says "they move together." James needs to go further: given a new cell line with a known expression level, what IC50 should he predict? And how confident should he be?

This is the leap from association to prediction — the domain of linear regression.

What Is Linear Regression?

Linear regression fits a straight line through data points to model the relationship between a predictor (X) and a response (Y):

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

| Term | Meaning | Biological Example |
|------|---------|--------------------|
| Y | Response variable | Drug IC50 |
| X | Predictor variable | Target gene expression |
| β₀ | Intercept (Y when X = 0) | Baseline sensitivity |
| β₁ | Slope (change in Y per unit X) | Sensitivity per expression unit |
| ε | Error (noise) | Biological + technical variation |

The method of least squares finds the β₀ and β₁ that minimize the sum of squared residuals:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
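For simple regression this minimization has a closed-form solution: β̂₁ = cov(X, Y)/var(X) and β̂₀ = ȳ − β̂₁x̄. A minimal NumPy sketch (simulated data mirroring the examples later in this chapter):

```python
import numpy as np

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

# Closed-form least-squares estimates
beta1 = np.cov(expression, ic50, ddof=1)[0, 1] / np.var(expression, ddof=1)
beta0 = ic50.mean() - beta1 * expression.mean()

# np.polyfit minimizes the same sum of squared residuals
slope, intercept = np.polyfit(expression, ic50, 1)
```

Both routes give identical estimates; the closed form makes explicit that the slope is just the covariance rescaled by the predictor's variance.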

Key insight: Regression has a clear asymmetry that correlation lacks. X predicts Y — there is a designated predictor and a designated outcome. The regression of Y on X is NOT the same as X on Y.
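The asymmetry is easy to check numerically: the slope of Y on X is cov(X, Y)/var(X), while the slope of X on Y is cov(X, Y)/var(Y). The two slopes are not reciprocals of each other; their product equals r² (a small NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 3, 200)
y = 85 - 4.5 * x + rng.normal(0, 8, 200)

sxy = np.cov(x, y, ddof=1)[0, 1]
slope_y_on_x = sxy / np.var(x, ddof=1)  # regression of Y on X
slope_x_on_y = sxy / np.var(y, ddof=1)  # regression of X on Y

r = np.corrcoef(x, y)[0, 1]
# Not reciprocals: the product of the two slopes is r^2, not 1
print(slope_y_on_x, slope_x_on_y, r ** 2)
```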

[Figure: Linear regression, fitted line and residuals. Residuals are the vertical distances from each point to the least-squares line. Axes: Target Gene Expression (log2 FPKM) vs. Drug IC50 (μM).]

From Correlation to Regression

Correlation and simple regression are intimately linked:

| Metric | What It Tells You |
|--------|-------------------|
| Pearson r | Strength and direction of linear association |
| R² = r² | Proportion of variance in Y explained by X |
| β₁ (slope) | How much Y changes per unit change in X |
| p-value of β₁ | Whether the slope is significantly different from zero |

If r = -0.72, then R² = 0.52, meaning 52% of the variation in IC50 is explained by target expression. The remaining 48% is unexplained — due to other genes, pathway redundancy, or noise.
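The identity R² = r² is easy to verify in software (a SciPy sketch on simulated data; `linregress` reports r directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
expression = rng.normal(10, 3, 60)
ic50 = 85 - 4.5 * expression + rng.normal(0, 8, 60)

r, _ = stats.pearsonr(expression, ic50)
fit = stats.linregress(expression, ic50)

# For simple regression, R^2 is exactly the squared Pearson correlation
r_squared = fit.rvalue ** 2
print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
```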

Interpreting Regression Output

A regression produces a table of coefficients:

| Term | Estimate | Std. Error | t-value | p-value |
|------|----------|------------|---------|---------|
| Intercept (β₀) | 85.3 | 6.2 | 13.8 | < 0.001 |
| Expression (β₁) | -4.7 | 0.8 | -5.9 | < 0.001 |

Reading this:

  • Intercept: When expression = 0, predicted IC50 is 85.3 μM (often not biologically meaningful)
  • Slope: Each 1-unit increase in expression decreases IC50 by 4.7 μM (more expression → more sensitive)
  • p-value: The slope is significantly different from zero — expression truly predicts sensitivity
  • R²: How well the line fits overall

Common pitfall: The intercept often represents an extrapolation beyond the data range. If expression ranges from 5-15, interpreting “IC50 when expression = 0” is meaningless. Focus on the slope.

Prediction: Point Estimates and Intervals

Once we have a fitted model, we can predict Y for new X values. But predictions come with two types of uncertainty:

Confidence Interval (for the mean response): “We’re 95% confident the average IC50 for all cell lines with expression = 10 lies in this range.”

Prediction Interval (for a single new observation): “We’re 95% confident the IC50 for one specific new cell line with expression = 10 lies in this range.”

Prediction intervals are always wider than confidence intervals because they include individual-level noise (ε).

Clinical relevance: In precision medicine, you’re usually predicting for one patient (prediction interval), not the population average (confidence interval). The prediction interval honestly reflects your uncertainty.

[Figure: Confidence interval vs. prediction interval. The prediction interval is always wider because it includes individual noise; both bands are narrowest at the mean of X.]

Residual Analysis: Checking Your Model

A regression model makes assumptions. Residuals (observed - predicted) reveal violations:

| Plot | What to Check | Warning Sign |
|------|---------------|--------------|
| Residuals vs. Fitted | Constant variance (homoscedasticity) | Fan/funnel shape |
| Q-Q plot of residuals | Normality of errors | Curved pattern |
| Residuals vs. order | Independence | Systematic trend |
| Scale-Location | Variance trend | Upward slope |

The four assumptions of linear regression:

  1. Linearity: Y is a linear function of X (check scatter plot)
  2. Independence: Observations are independent (study design)
  3. Normality: Residuals are normally distributed (Q-Q plot)
  4. Homoscedasticity: Constant error variance (residuals vs. fitted)

[Figure: Residual diagnostics, good vs. bad patterns. Good: random scatter around 0 (assumptions met). Bad: curved pattern (non-linear relationship). Bad: fan shape (heteroscedasticity). All plotted against fitted values.]

Common pitfall: Many biological relationships are not linear on the original scale but become linear after log-transformation. Always consider transforming IC50 or expression values before fitting.

When Regression Goes Wrong

Extrapolation

Predicting beyond the range of your data is dangerous. If expression ranges from 5-15 in your training data, predicting IC50 for expression = 25 assumes the linear trend continues — it may not.

Confounding

A significant regression doesn’t prove causation. The apparent effect of expression on IC50 could be mediated by a third variable (e.g., tissue type).

Non-linearity

If the true relationship is curved (threshold effect, saturation), a line fits poorly. Residual plots expose this.

Influential Points

A single extreme observation can dramatically change the fitted line. Leverage and Cook’s distance help identify such points.

Linear Regression in BioLang

Simple Linear Regression

set_seed(42)
# NCI-60 pharmacogenomics: predict IC50 from target expression
let n = 60

# Simulate expression and drug sensitivity
let expression = rnorm(n, 10, 3)
let ic50 = 85 - 4.5 * expression + rnorm(n, 0, 8)

# Fit simple linear regression
let model = lm(ic50, expression)

# Print model summary
print("=== Linear Regression Summary ===")
print("Intercept: {model.intercept}")
print("Slope: {model.slope}")
print("R²: {model.r_squared}")
print("Adjusted R²: {model.adj_r_squared}")
print("F-statistic p-value: {model.p_value}")

Interpreting Coefficients

# Detailed coefficient table
print("=== Coefficients ===")
print("  Intercept: {model.intercept}")
print("  Expression β: {model.slope}")
print("  p-value: {model.p_value}")

# Interpretation
let slope = model.slope
print("\nFor each 1-unit increase in expression,")
print("IC50 decreases by {slope |> abs |> round(2)} μM")
print("R² = {model.r_squared |> round(3)}: expression explains")
print("{(model.r_squared * 100) |> round(1)}% of IC50 variation")

Prediction with Intervals

# Predict IC50 for new cell lines
let new_expression = [7.0, 10.0, 13.0]

# Point predictions using model coefficients
print("=== Predictions ===")
for i in 0..3 {
    let predicted = model.slope * new_expression[i] + model.intercept
    print("Expression = {new_expression[i]}: Predicted IC50 = {predicted |> round(1)} μM")
}

Residual Analysis

# Compute residuals and fitted values
let fitted = expression |> map(|x| model.slope * x + model.intercept)
let resid = []
for i in 0..n {
    resid = resid + [ic50[i] - fitted[i]]
}

# 1. Residuals vs Fitted — check for patterns
let resid_table = table({"Fitted": fitted, "Residual": resid})
plot(resid_table, {type: "scatter", x: "Fitted", y: "Residual",
    title: "Residuals vs Fitted"})

# 2. Check residual distribution
print("Residual summary:")
print(summary(resid))

Visualization: Scatter with Regression Line

# Scatter plot with regression line
let plot_data = table({"Expression": expression, "IC50": ic50})
plot(plot_data, {type: "scatter", x: "Expression", y: "IC50",
    title: "Drug Sensitivity vs Target Expression (NCI-60)",
    x_label: "Target Gene Expression (log2 FPKM)",
    y_label: "Drug IC50 (μM)"})

Demonstrating Problems: Extrapolation

set_seed(42)
# Show danger of extrapolation
let x = rnorm(50, 10, 3)
let y = 100 - 3 * x + 0.2 * x ** 2 + rnorm(50, 0, 3)

# Fit linear model in observed range
let model_extrap = lm(y, x)
print("R² in training range: {model_extrap.r_squared |> round(3)}")

# Predict within range — reasonable
let pred_in = model_extrap.slope * 10.0 + model_extrap.intercept
print("Prediction at x=10 (in range): {pred_in |> round(1)}")

# Predict outside range — dangerous!
let pred_out = model_extrap.slope * 25.0 + model_extrap.intercept
print("Prediction at x=25 (extrapolation): {pred_out |> round(1)}")
print("True value at x=25 would be: {(100 - 3*25 + 0.2*25**2) |> round(1)}")
print("Extrapolation error demonstrates the danger!")

Working with Log-Transformed Data

set_seed(42)
# Many biological relationships are linear on the log scale
let dose = rnorm(40, 50, 25) |> map(|d| max(d, 0.1))
let response = 50 / (1 + (dose / 10) ** 1.5) + rnorm(40, 0, 3)

# Linear model on raw scale — poor fit
let model_raw = lm(response, dose)
print("R² (raw scale): {model_raw.r_squared |> round(3)}")

# Log-transform dose — much better
let log_dose = dose |> map(|d| log2(d))
let model_log = lm(response, log_dose)
print("R² (log scale): {model_log.r_squared |> round(3)}")

# Compare residual plots to see the difference

Python:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

# Simple linear regression
X = sm.add_constant(expression)  # adds intercept
model = sm.OLS(ic50, X).fit()
print(model.summary())

# Prediction with intervals
predictions = model.get_prediction(sm.add_constant([7, 10, 13]))
pred_summary = predictions.summary_frame(alpha=0.05)
# obs_ci_lower/obs_ci_upper = prediction interval
# mean_ci_lower/mean_ci_upper = confidence interval

# Residual analysis
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].axhline(0, color='red')
stats.probplot(model.resid, plot=axes[1])

R:

# Simple linear regression
model <- lm(ic50 ~ expression)
summary(model)

# Prediction with intervals
new_data <- data.frame(expression = c(7, 10, 13))
predict(model, new_data, interval = "confidence")
predict(model, new_data, interval = "prediction")

# Residual diagnostics — 4 diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Exercises

Exercise 1: Fit and Interpret

A study measures tumor mutation burden (TMB) and neoantigen count across 80 melanoma samples. Fit a linear regression and answer: How many additional neoantigens does each additional mutation predict?

set_seed(42)
let n = 80
let tmb = rnorm(n, 200, 80)
let neoantigens = 5 + 0.15 * tmb + rnorm(n, 0, 12)

# 1. Fit lm(neoantigens, tmb)
# 2. What is the slope? Interpret it biologically
# 3. What is model.r_squared? Is TMB a good predictor of neoantigen count?
# 4. Predict neoantigen count for TMB = 100, 200, 400

Exercise 2: Residual Diagnostics

Fit a linear model to the dose-response data below and use residual plots to determine whether the linear assumption holds. If not, what transformation improves the fit?

set_seed(42)
let dose = rnorm(60, 25, 12) |> map(|d| max(d, 1))
let effect = 20 * log2(dose) + rnorm(60, 0, 5)

# 1. Fit lm(effect, dose)
# 2. Create residuals vs fitted plot — what pattern do you see?
# 3. Try log-transforming dose and re-fitting
# 4. Compare R² values and residual patterns

Exercise 3: Confidence vs. Prediction Intervals

Demonstrate why prediction intervals are always wider than confidence intervals. Generate data, fit a model, and plot both intervals across the predictor range.

set_seed(42)
let x = rnorm(50, 10, 5)
let y = 10 + 2 * x + rnorm(50, 0, 4)

# 1. Fit lm(y, x)
# 2. Generate predictions for x = 0, 2, 4, ..., 20
# 3. Use model.slope * xi + model.intercept for each
# 4. Compare point estimates at different x values

Exercise 4: The Outlier Effect

Add a single influential outlier to well-behaved data and show how it changes the slope, R², and predictions. Then remove it and compare.

set_seed(42)
let x_clean = rnorm(49, 10, 2)
let y_clean = 5 + 1.5 * x_clean + rnorm(49, 0, 2)

# Add one extreme outlier at x=10, y=50 (should be ~20)
# 1. Fit model with and without the outlier
# 2. Compare slopes and R² values
# 3. How much does prediction at x=12 change?

Key Takeaways

  • Linear regression models Y = β₀ + β₁X + ε, using least squares to find the best-fit line
  • R² tells you the proportion of variance explained; the slope tells you the rate of change
  • Prediction intervals (for individuals) are always wider than confidence intervals (for the mean) — use the right one for your question
  • Residual analysis is mandatory: check linearity, normality, and constant variance before trusting results
  • Log-transforming variables often linearizes biological relationships (dose-response, expression-phenotype)
  • Beware extrapolation — the linear trend may not continue beyond your data range
  • A significant regression does not prove causation — confounders may drive the relationship
  • Always visualize your data with a scatter plot before and after fitting

What’s Next

One predictor is rarely enough. Day 15 introduces multiple regression, where we predict outcomes from many variables simultaneously — and face new challenges like multicollinearity, model selection, and regularization.