Day 16: Logistic Regression — Binary Outcomes
The Problem
Dr. Priya Sharma is an immuno-oncologist analyzing data from 180 melanoma patients who received anti-PD-1 immunotherapy. For each patient, she has three biomarkers — tumor mutational burden (TMB), PD-L1 expression, and microsatellite instability (MSI) status — and one outcome: response (tumor shrank ≥30%) or non-response.
Her first instinct is to use linear regression, predicting response (coded 1/0) from the biomarkers. But the predictions come out as 1.3 for one patient and -0.2 for another. Probabilities can’t be greater than 1 or less than 0.
She needs a method designed for binary outcomes — logistic regression.
Why Linear Regression Fails for Binary Outcomes
When Y is binary (0 or 1), linear regression has fundamental problems:
| Problem | Consequence |
|---|---|
| Predictions outside [0, 1] | Impossible probabilities (negative or > 100%) |
| Non-normal residuals | At each fitted value the residual can take only two values, violating the normality assumption |
| Heteroscedasticity | Variance depends on the predicted value |
| Non-linear relationship | The true probability follows an S-curve, not a line |
Key insight: We need a function that maps any real number to the range [0, 1]. The logistic (sigmoid) function does exactly this.
The Logistic Function
The logistic regression model predicts the probability of the outcome being 1:
$$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}$$
The sigmoid function transforms the linear predictor (which ranges from -∞ to +∞) into a probability (which ranges from 0 to 1):
| Linear predictor (β₀ + β₁X) | Probability P(Y=1) |
|---|---|
| -5 | 0.007 |
| -2 | 0.12 |
| 0 | 0.50 |
| +2 | 0.88 |
| +5 | 0.993 |
The curve is steepest at P = 0.5 (the decision boundary) and flattens at the extremes — exactly how biological responses behave.
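The table above can be reproduced directly. A minimal Python sketch of the sigmoid (the BioLang snippets later in this chapter compute the same thing inline):

```python
import math

def sigmoid(lp):
    """Logistic (sigmoid) function: maps any real linear predictor into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-lp))

# Linear predictor -> probability, matching the table
for lp in (-5, -2, 0, 2, 5):
    print(f"{lp:+d} -> {sigmoid(lp):.3f}")
```

Note that the output is symmetric around 0: sigmoid(-x) = 1 - sigmoid(x), which is why the -5 and +5 rows mirror each other.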
Interpreting Coefficients: Log-Odds and Odds Ratios
Logistic regression coefficients are on the log-odds scale:
$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$$
This is less intuitive than linear regression. The key transformation:
$$\text{Odds Ratio} = e^{\beta}$$
| β (log-odds) | OR = e^β | Interpretation |
|---|---|---|
| 0 | 1.0 | No effect |
| 0.5 | 1.65 | 65% higher odds per unit increase |
| 1.0 | 2.72 | 2.7× higher odds |
| -0.5 | 0.61 | 39% lower odds |
| -1.0 | 0.37 | 63% lower odds |
Key insight: An odds ratio of 2.0 means the odds of response double for each 1-unit increase in the predictor. It does NOT mean the probability doubles — that depends on the baseline probability.
Odds vs. Probability
| Probability | Odds | Interpretation |
|---|---|---|
| 0.50 | 1:1 | Equal chance either way |
| 0.75 | 3:1 | Odds of responding are 3 to 1 |
| 0.20 | 1:4 | Odds against responding are 4 to 1 |
| 0.90 | 9:1 | Strongly favors response |
Common pitfall: When the outcome is common (prevalence > 10%), odds ratios substantially overestimate relative risks. An OR of 3.0 for a 50% baseline event means the probability goes from 50% to 75% — a relative risk of only 1.5. Report both OR and absolute probabilities.
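The arithmetic behind this pitfall is worth checking by hand. A short Python sketch converting an OR into the relative risk it implies at a given baseline:

```python
def or_to_rr(odds_ratio, p0):
    """Relative risk implied by an odds ratio at baseline risk p0."""
    odds0 = p0 / (1 - p0)
    p1 = (odds0 * odds_ratio) / (1 + odds0 * odds_ratio)  # new probability
    return p1 / p0                                         # relative risk

# OR = 3.0 at a 50% baseline: probability rises 0.50 -> 0.75, so RR = 1.5
print(or_to_rr(3.0, 0.50))  # -> 1.5

# Same OR for a rare outcome (2% baseline): the OR approximates the RR
print(round(or_to_rr(3.0, 0.02), 2))  # close to 3
```

This is the "rare disease assumption": OR ≈ RR only when the outcome is uncommon.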
ROC Curves and AUC
The model outputs probabilities. To make yes/no predictions, you need a threshold — above it, predict “responder.” But which threshold?
A Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1 - specificity) across all possible thresholds:
| Metric | Definition | Trade-off |
|---|---|---|
| Sensitivity (TPR) | Proportion of true responders correctly identified | Higher = catch more responders |
| Specificity (TNR) | Proportion of true non-responders correctly identified | Higher = fewer false alarms |
| PPV (Precision) | Proportion of positive predictions that are correct | Higher = trust positive results |
| NPV | Proportion of negative predictions that are correct | Higher = trust negative results |
The Area Under the Curve (AUC) summarizes overall discrimination:
| AUC | Performance |
|---|---|
| 0.50 | Random guessing (useless) |
| 0.60-0.70 | Poor |
| 0.70-0.80 | Acceptable |
| 0.80-0.90 | Good |
| 0.90-1.00 | Excellent |
Clinical relevance: In cancer diagnostics, the threshold choice depends on context. Screening tests favor high sensitivity (don’t miss cancers), while confirmatory tests favor high specificity (don’t cause unnecessary biopsies).
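AUC has a useful probabilistic reading: it is the chance that a randomly chosen responder gets a higher predicted probability than a randomly chosen non-responder. A dependency-free Python sketch of that Mann-Whitney formulation (real analyses would use `sklearn.metrics.roc_auc_score` or pROC):

```python
def auc_score(labels, scores):
    """AUC as P(random positive outranks random negative); ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one non-responder (0.7) outranks one responder (0.65)
y = [0, 0, 0, 1, 1, 1]
p = [0.1, 0.7, 0.35, 0.8, 0.65, 0.9]
print(round(auc_score(y, p), 3))  # 8 of 9 pairs ranked correctly -> 0.889
```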
The GLM Framework
Logistic regression is a special case of the Generalized Linear Model (GLM) — a flexible framework for non-normal outcomes:
| Outcome Type | Distribution | Link Function | Name |
|---|---|---|---|
| Continuous | Normal | Identity | Linear regression |
| Binary (0/1) | Binomial | Logit | Logistic regression |
| Counts | Poisson | Log | Poisson regression |
| Counts (overdispersed) | Negative binomial | Log | NB regression |
| Positive continuous | Gamma | Inverse | Gamma regression |
The GLM framework unifies these under one interface, letting you choose the appropriate model for your data type.
Clinical relevance: RNA-seq read counts follow a negative binomial distribution. Tools like DESeq2 use GLMs with a negative binomial family — the same framework as logistic regression, just with a different distribution assumption.
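What changes between GLM families is the link function: the transform that connects the linear predictor to the outcome scale. A small Python sketch of the inverse links from the table (illustrative only, not a fitting procedure):

```python
import math

# Each GLM applies a link g() to the mean; predictions invert that link.
# Inverse links map the linear predictor eta back to the outcome scale.
inverse_links = {
    "identity (linear)": lambda eta: eta,                       # mean on (-inf, inf)
    "logit (logistic)":  lambda eta: 1 / (1 + math.exp(-eta)),  # probability in (0, 1)
    "log (Poisson)":     lambda eta: math.exp(eta),             # count mean > 0
}

eta = 0.7  # the same linear predictor, interpreted under three families
for name, inv in inverse_links.items():
    print(f"{name}: {inv(eta):.3f}")
```

Identical right-hand side, three different outcome scales: 0.7 as a mean, 0.668 as a probability, 2.014 as an expected count.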
Logistic Regression in BioLang
Fitting a Logistic Model
set_seed(42)
# Immunotherapy response prediction
let n = 180
# Simulate predictors
let tmb = rnorm(n, 10, 5) |> map(|x| max(x, 0))
let pdl1 = rnorm(n, 30, 20) |> map(|x| max(min(x, 100), 0))
let msi = rnorm(n, 0, 1) |> map(|x| if x > 1.0 { 1 } else { 0 }) # ~15% MSI-high
# True response probability (logistic model)
let log_odds = -3 + 0.15 * tmb + 0.02 * pdl1 + 1.5 * msi
let prob = log_odds |> map(|x| 1.0 / (1.0 + exp(-x)))
let response = prob |> map(|p| if runif(1)[0] < p { 1 } else { 0 }) # Bernoulli draw: compare a uniform(0,1) value to p
print("Response rate: {(response |> sum) / n * 100 |> round(1)}%")
# Fit logistic regression
let glm_data = table({"response": response, "TMB": tmb, "PDL1": pdl1, "MSI": msi})
let model = glm("response ~ TMB + PDL1 + MSI", glm_data, "binomial")
print("=== Logistic Regression Results ===")
print("Intercept: {model.intercept |> round(3)}")
Interpreting Odds Ratios
# Convert coefficients to odds ratios
let coef_names = ["TMB", "PD-L1", "MSI"]
let coefficients = model.coefficients
print("=== Odds Ratios ===")
for i in 0..3 {
let beta = coefficients[i]
let or_val = exp(beta)
print("{coef_names[i]}: β = {beta |> round(3)}, OR = {or_val |> round(2)}")
}
# Interpretation:
# TMB OR = 1.16 means each additional mut/Mb increases odds of response by 16%
# MSI OR = 4.5 means MSI-high patients have 4.5x the odds of responding
ROC Curve and AUC
# Compute predicted probabilities from model
let pred_prob = []
for i in 0..n {
let lp = model.intercept + model.coefficients[0] * tmb[i]
+ model.coefficients[1] * pdl1[i] + model.coefficients[2] * msi[i]
pred_prob = pred_prob + [1.0 / (1.0 + exp(-lp))]
}
# ROC curve
let roc_data = table({"actual": response, "predicted": pred_prob})
roc_curve(roc_data)
# Compute AUC
let auc_val = model.auc
print("AUC: {auc_val |> round(3)}")
# Interpretation
if auc_val >= 0.80 {
print("Good discrimination")
} else if auc_val >= 0.70 {
print("Acceptable discrimination")
} else {
print("Poor discrimination — model needs additional predictors")
}
Finding the Optimal Threshold
# Sensitivity/specificity at different thresholds
let thresholds = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print("=== Threshold Analysis ===")
print("Threshold Sensitivity Specificity PPV NPV")
for t in thresholds {
let predicted_class = pred_prob |> map(|p| if p >= t { 1 } else { 0 })
let tp = 0
let fp = 0
let tn = 0
let fn = 0
for i in 0..n {
if predicted_class[i] == 1 && response[i] == 1 { tp = tp + 1 }
if predicted_class[i] == 1 && response[i] == 0 { fp = fp + 1 }
if predicted_class[i] == 0 && response[i] == 0 { tn = tn + 1 }
if predicted_class[i] == 0 && response[i] == 1 { fn = fn + 1 }
}
let sens = tp / (tp + fn)
let spec = tn / (tn + fp)
let ppv = if tp + fp > 0 { tp / (tp + fp) } else { 0 }
let npv = if tn + fn > 0 { tn / (tn + fn) } else { 0 }
print("{t} {sens |> round(3)} {spec |> round(3)} {ppv |> round(3)} {npv |> round(3)}")
}
# Youden's J = sensitivity + specificity - 1; the threshold maximizing J best balances the two
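The threshold loop above stops short of actually picking a threshold. A Python sketch of that last step, maximizing Youden's J over candidate cutoffs (hypothetical toy data; pROC's `coords(..., "best")` does this for real models):

```python
def youden_optimal(labels, probs, thresholds):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1."""
    best = None
    for t in thresholds:
        tp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 1)
        fn = sum(1 for y, p in zip(labels, probs) if p < t and y == 1)
        tn = sum(1 for y, p in zip(labels, probs) if p < t and y == 0)
        fp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j)
    return best

# Toy data: 4 non-responders, 4 responders with overlapping predicted probabilities
y = [0, 0, 0, 0, 1, 1, 1, 1]
p = [0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9]
print(youden_optimal(y, p, [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]))  # -> (0.4, 0.5)
```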
Using the GLM Framework
# Logistic regression via the general GLM interface
let model_glm = glm("response ~ TMB + PDL1 + MSI", glm_data, "binomial")
# Same results, but now you can swap families:
# Poisson regression for count data (e.g., number of mutations)
# let count_data = table({"mutations": mutation_count, "exposure": exposure, "age": age})
# let count_model = glm("mutations ~ exposure + age", count_data, "poisson")
Visualizing Predicted Probabilities
# Box plot: predicted probabilities by actual outcome
let resp_probs = []
let nonresp_probs = []
for i in 0..n {
if response[i] == 1 {
resp_probs = resp_probs + [pred_prob[i]]
} else {
nonresp_probs = nonresp_probs + [pred_prob[i]]
}
}
let bp_table = table({"Non-Responders": nonresp_probs, "Responders": resp_probs})
boxplot(bp_table, {title: "Predicted Probabilities by Actual Outcome"})
# Good model: minimal overlap between the two boxes
Effect of Individual Predictors
# Show how each predictor shifts the probability curve
# Fix other predictors at their means
let tmb_range = [0, 5, 10, 15, 20, 25, 30]
let pdl1_mean = 30
let msi_0 = 0
print("=== TMB Effect on Response Probability ===")
print("(PD-L1 = {pdl1_mean}, MSI = negative)")
for t in tmb_range {
let lp = model.intercept + model.coefficients[0] * t
+ model.coefficients[1] * pdl1_mean + model.coefficients[2] * msi_0
let p = 1.0 / (1.0 + exp(-lp))
print(" TMB = {t}: P(response) = {(p * 100) |> round(1)}%")
}
Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, classification_report
# Statsmodels (full inference)
X = sm.add_constant(df[['TMB', 'PDL1', 'MSI']])
model = sm.Logit(response, X).fit()
print(model.summary())
# Odds ratios
print(np.exp(model.params))
print(np.exp(model.conf_int()))
# ROC curve
fpr, tpr, thresholds = roc_curve(response, model.predict(X))
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
# Scikit-learn (prediction-focused); assumes X_train/y_train from a prior train_test_split
clf = LogisticRegression().fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]
R:
# Logistic regression
model <- glm(response ~ TMB + PDL1 + MSI, data = df, family = binomial)
summary(model)
# Odds ratios with CI
exp(cbind(OR = coef(model), confint(model)))
# ROC curve
library(pROC)
roc_obj <- roc(response, fitted(model))
plot(roc_obj, print.auc = TRUE)
# Optimal threshold (Youden's J)
coords(roc_obj, "best", best.method = "youden")
Exercises
Exercise 1: Build and Interpret a Logistic Model
Predict cancer diagnosis (1 = cancer, 0 = benign) from three biomarkers. Interpret each odds ratio in clinical terms.
set_seed(42)
let n = 200
let biomarker_a = rnorm(n, 50, 15)
let biomarker_b = rnorm(n, 10, 4)
let age = rnorm(n, 55, 12)
let log_odds = -5 + 0.04 * biomarker_a + 0.2 * biomarker_b + 0.03 * age
let prob = log_odds |> map(|x| 1.0 / (1.0 + exp(-x)))
let cancer = prob |> map(|p| if runif(1)[0] < p { 1 } else { 0 }) # Bernoulli draw via uniform(0,1)
# 1. Fit glm("cancer ~ biomarker_a + biomarker_b + age", data, "binomial")
# 2. Compute and interpret odds ratios: exp(coefficient) for each predictor
# 3. Which biomarker has the strongest effect?
# 4. What does the OR for age mean clinically?
Exercise 2: ROC Analysis
Build a logistic model and evaluate it with an ROC curve. Find the threshold that maximizes Youden’s J index (sensitivity + specificity - 1).
# Use the model from Exercise 1
# 1. Generate predicted probabilities from model coefficients
# 2. Plot ROC curve with roc_curve(table)
# 3. Compute sensitivity and specificity at thresholds 0.3, 0.5, 0.7
# 4. Which threshold maximizes Youden's J?
# 5. If this is a screening test, would you prefer a different threshold? Why?
Exercise 3: Comparing Two Models
Build two logistic models: one with TMB alone, another with TMB + PD-L1 + MSI. Compare their AUC values. Does adding predictors improve discrimination?
set_seed(42)
let n = 150
# Simulate data where TMB is a moderate predictor and MSI adds substantial value
# 1. Fit model_simple = glm("response ~ TMB", data, "binomial")
# 2. Fit model_full = glm("response ~ TMB + PDL1 + MSI", data, "binomial")
# 3. Compare AUC values
# 4. Plot both ROC curves
# 5. Is the improvement worth the added complexity?
Exercise 4: The Separation Problem
What happens when a predictor perfectly separates outcomes? Simulate MSI status that perfectly predicts response and observe what logistic regression does.
set_seed(42)
let n = 100
let msi = rnorm(n, 0, 1) |> map(|x| if x > 0.84 { 1 } else { 0 })
let response = msi # perfect separation!
# 1. Try fitting glm("response ~ msi", data, "binomial")
# 2. What happens to the coefficient and its standard error?
# 3. Why is this a problem? (Hint: the MLE doesn't exist)
# 4. How would you handle this in practice?
Key Takeaways
- Logistic regression models binary outcomes by predicting probabilities via the sigmoid function — use it instead of linear regression when Y is 0/1
- Coefficients are on the log-odds scale; exponentiate to get odds ratios (OR): e^β
- An OR of 2.0 means the odds double per unit increase — but this is NOT the same as doubling the probability
- ROC curves show the sensitivity/specificity trade-off across all thresholds; AUC summarizes overall discrimination
- The optimal threshold depends on clinical context: screening favors sensitivity, confirmation favors specificity
- Logistic regression is a special case of the GLM framework — the same approach extends to count data (Poisson), survival data, and more
- Always report odds ratios with confidence intervals, not just p-values
- Watch for separation (a predictor perfectly predicts the outcome), which makes the MLE diverge: coefficients and their standard errors blow up
What’s Next
Sometimes the outcome isn’t just binary — it’s time-to-event: how long until a patient relapses, how long a cell line survives treatment. Day 17 introduces survival analysis, where censoring makes standard methods fail and Kaplan-Meier curves become essential.