Day 16: Logistic Regression — Binary Outcomes

The Problem

Dr. Priya Sharma is an immuno-oncologist analyzing data from 180 melanoma patients who received anti-PD-1 immunotherapy. For each patient, she has three biomarkers — tumor mutational burden (TMB), PD-L1 expression, and microsatellite instability (MSI) status — and one outcome: response (tumor shrank ≥30%) or non-response.

Her first instinct is to use linear regression, predicting response (coded 1/0) from the biomarkers. But the predictions come out as 1.3 for one patient and -0.2 for another. Probabilities can’t be greater than 1 or less than 0.

She needs a method designed for binary outcomes — logistic regression.

Why Linear Regression Fails for Binary Outcomes

When Y is binary (0 or 1), linear regression has fundamental problems:

| Problem | Consequence |
|---|---|
| Predictions outside [0, 1] | Impossible probabilities (negative or > 100%) |
| Non-normal residuals | Residuals are binary, violating the normality assumption |
| Heteroscedasticity | Variance depends on the predicted value |
| Non-linear relationship | The true probability follows an S-curve, not a line |
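To see the first problem concretely, here is a small NumPy sketch (simulated data; the coefficients and predictor range are arbitrary choices for illustration). Ordinary least squares fit to a 0/1 outcome produces fitted values below 0 and above 1 at the extremes of the predictor:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-10, 30, 200)                  # predictor spanning a wide range
p = 1 / (1 + np.exp(-(-3 + 0.3 * x)))          # true S-curve probability
y = (rng.random(200) < p).astype(float)        # observed binary outcome (0/1)

# Ordinary least squares on the 0/1 outcome
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# The straight line overshoots both ends of [0, 1]
print(f"min fitted = {fitted.min():.2f}, max fitted = {fitted.max():.2f}")
```

The fitted line cuts through the steep middle of the S-curve and keeps going, which is exactly what Dr. Sharma saw with her 1.3 and -0.2 "probabilities."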

Key insight: We need a function that maps any real number to the range [0, 1]. The logistic (sigmoid) function does exactly this.

[Figure: "Why Linear Regression Fails for Binary Data" — two panels plotting P(Y=1) against the predictor X: linear regression (bad) produces predictions with P > 1.0 and P < 0.0, while logistic regression (good) stays in [0, 1].]

The Logistic Function

The logistic regression model predicts the probability of the outcome being 1:

$$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}$$

The sigmoid function transforms the linear predictor (which ranges from -∞ to +∞) into a probability (which ranges from 0 to 1):

| Linear predictor (β₀ + β₁X) | Probability P(Y=1) |
|---|---|
| -5 | 0.007 |
| -2 | 0.12 |
| 0 | 0.50 |
| +2 | 0.88 |
| +5 | 0.993 |

The curve is steepest at P = 0.5 (the decision boundary) and flattens at the extremes — exactly how biological responses behave.
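The table values are easy to verify with a few lines of plain Python:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -2, 0, 2, 5):
    print(f"z = {z:+d}  ->  P(Y=1) = {sigmoid(z):.3f}")
# prints 0.007, 0.119, 0.500, 0.881, 0.993 -- matching the table to rounding
```

Note the symmetry: sigmoid(-z) = 1 - sigmoid(z), which is why the table is mirrored around the 0.50 midpoint.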

[Figure: "The Sigmoid (Logistic) Function" — P(Y=1) plotted against the linear predictor z = β₀ + β₁X; the curve inflects at P = 0.5 and asymptotes toward P = 0 (low risk) and P = 1 (high risk).]

Interpreting Coefficients: Log-Odds and Odds Ratios

Logistic regression coefficients are on the log-odds scale:

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$$

This is less intuitive than linear regression. The key transformation:

$$\text{Odds Ratio} = e^{\beta}$$

| β (log-odds) | OR = e^β | Interpretation |
|---|---|---|
| 0 | 1.0 | No effect |
| 0.5 | 1.65 | 65% higher odds per unit increase |
| 1.0 | 2.72 | 2.7× higher odds |
| -0.5 | 0.61 | 39% lower odds |
| -1.0 | 0.37 | 63% lower odds |

Key insight: An odds ratio of 2.0 means the odds of response double for each 1-unit increase in the predictor. It does NOT mean the probability doubles — that depends on the baseline probability.

Odds vs. Probability

| Probability | Odds | Interpretation |
|---|---|---|
| 0.50 | 1:1 | Equal chance either way |
| 0.75 | 3:1 | Three times as likely to respond |
| 0.20 | 1:4 | Four times as likely NOT to respond |
| 0.90 | 9:1 | Strongly favors response |

Common pitfall: When the outcome is common (prevalence > 10%), odds ratios substantially overestimate relative risks. An OR of 3.0 for a 50% baseline event means the probability goes from 50% to 75% — a relative risk of only 1.5. Report both OR and absolute probabilities.
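The arithmetic behind that pitfall, as a short sketch:

```python
def odds(p):
    """Convert probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to probability."""
    return o / (1 + o)

OR = 3.0
for baseline in (0.05, 0.50):
    new_p = prob(odds(baseline) * OR)   # apply the odds ratio on the odds scale
    rr = new_p / baseline               # relative risk
    print(f"baseline {baseline:.0%}: new P = {new_p:.2f}, RR = {rr:.2f}")
```

At a 5% baseline the OR is a decent stand-in for the relative risk (RR ≈ 2.73 vs. OR = 3.0); at a 50% baseline the probability only goes from 0.50 to 0.75, an RR of 1.5 — half what a naive reading of the OR suggests.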

ROC Curves and AUC

The model outputs probabilities. To make yes/no predictions, you need a threshold — above it, predict “responder.” But which threshold?

A Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1 - specificity) across all possible thresholds:

| Metric | Definition | Trade-off |
|---|---|---|
| Sensitivity (TPR) | True responders correctly identified | Higher = catch more responders |
| Specificity (TNR) | True non-responders correctly identified | Higher = fewer false alarms |
| PPV (Precision) | Positive predictions that are correct | Higher = trust positive results |
| NPV | Negative predictions that are correct | Higher = trust negative results |

The Area Under the Curve (AUC) summarizes overall discrimination:

| AUC | Performance |
|---|---|
| 0.50 | Random guessing (useless) |
| 0.60–0.70 | Poor |
| 0.70–0.80 | Acceptable |
| 0.80–0.90 | Good |
| 0.90–1.00 | Excellent |

Clinical relevance: In cancer diagnostics, the threshold choice depends on context. Screening tests favor high sensitivity (don’t miss cancers), while confirmatory tests favor high specificity (don’t cause unnecessary biopsies).
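A minimal Python sketch of the ROC/AUC workflow with scikit-learn (simulated data; the 1.5 coefficient is an arbitrary choice giving moderate discrimination):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
n = 500
x = rng.normal(0, 1, n)
p_true = 1 / (1 + np.exp(-1.5 * x))            # true response probability
y = (rng.random(n) < p_true).astype(int)       # observed binary outcome

score = p_true                                  # predicted probability as the score
auc_val = roc_auc_score(y, score)

# Sensitivity (tpr) and 1 - specificity (fpr) at every threshold
fpr, tpr, thresholds = roc_curve(y, score)
youden_j = tpr - fpr                            # sensitivity + specificity - 1
best = thresholds[np.argmax(youden_j)]
print(f"AUC = {auc_val:.3f}, Youden-optimal threshold = {best:.2f}")
```

Youden's J picks the threshold that best balances sensitivity and specificity; a screening test would deliberately choose a lower threshold than this to push sensitivity up.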

[Figure: "ROC Curve and AUC" — sensitivity (true positive rate) vs. 1 - specificity (false positive rate); the model's ROC curve (AUC = 0.85, good discrimination) bows above the random diagonal (AUC = 0.50), with the optimal threshold marked.]

The GLM Framework

Logistic regression is a special case of the Generalized Linear Model (GLM) — a flexible framework for non-normal outcomes:

| Outcome Type | Distribution | Link Function | Name |
|---|---|---|---|
| Continuous | Normal | Identity | Linear regression |
| Binary (0/1) | Binomial | Logit | Logistic regression |
| Counts | Poisson | Log | Poisson regression |
| Counts (overdispersed) | Negative binomial | Log | NB regression |
| Positive continuous | Gamma | Inverse | Gamma regression |

The GLM framework unifies these under one interface, letting you choose the appropriate model for your data type.

Clinical relevance: RNA-seq read counts follow a negative binomial distribution. Tools like DESeq2 use GLMs with a negative binomial family — the same framework as logistic regression, just with a different distribution assumption.

Logistic Regression in BioLang

Fitting a Logistic Model

set_seed(42)
# Immunotherapy response prediction
let n = 180

# Simulate predictors
let tmb = rnorm(n, 10, 5) |> map(|x| max(x, 0))
let pdl1 = rnorm(n, 30, 20) |> map(|x| max(min(x, 100), 0))
let msi = rnorm(n, 0, 1) |> map(|x| if x > 1.0 { 1 } else { 0 })  # ~15% MSI-high

# True response probability (logistic model)
let log_odds = -3 + 0.15 * tmb + 0.02 * pdl1 + 1.5 * msi
let prob = log_odds |> map(|x| 1.0 / (1.0 + exp(-x)))
let response = prob |> map(|p| if runif(1, 0, 1)[0] < p { 1 } else { 0 })  # uniform draw -> Bernoulli(p)

print("Response rate: {(response |> sum) / n * 100 |> round(1)}%")

# Fit logistic regression
let glm_data = table({"response": response, "TMB": tmb, "PDL1": pdl1, "MSI": msi})
let model = glm("response ~ TMB + PDL1 + MSI", glm_data, "binomial")

print("=== Logistic Regression Results ===")
print("Intercept: {model.intercept |> round(3)}")

Interpreting Odds Ratios

# Convert coefficients to odds ratios
let coef_names = ["TMB", "PD-L1", "MSI"]
let coefficients = model.coefficients

print("=== Odds Ratios ===")
for i in 0..3 {
    let beta = coefficients[i]
    let or_val = exp(beta)
    print("{coef_names[i]}: β = {beta |> round(3)}, OR = {or_val |> round(2)}")
}

# Interpretation:
# TMB OR = 1.16 means each additional mut/Mb increases odds of response by 16%
# MSI OR = 4.5 means MSI-high patients have 4.5x the odds of responding

ROC Curve and AUC

# Compute predicted probabilities from model
let pred_prob = []
for i in 0..n {
    let lp = model.intercept + model.coefficients[0] * tmb[i]
        + model.coefficients[1] * pdl1[i] + model.coefficients[2] * msi[i]
    pred_prob = pred_prob + [1.0 / (1.0 + exp(-lp))]
}

# ROC curve
let roc_data = table({"actual": response, "predicted": pred_prob})
roc_curve(roc_data)

# Compute AUC
let auc_val = model.auc
print("AUC: {auc_val |> round(3)}")

# Interpretation
if auc_val >= 0.80 {
    print("Good discrimination")
} else if auc_val >= 0.70 {
    print("Acceptable discrimination")
} else {
    print("Poor discrimination — model needs additional predictors")
}

Finding the Optimal Threshold

# Sensitivity/specificity at different thresholds
let thresholds = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

print("=== Threshold Analysis ===")
print("Threshold  Sensitivity  Specificity  PPV      NPV")

for t in thresholds {
    let predicted_class = pred_prob |> map(|p| if p >= t { 1 } else { 0 })
    let tp = 0
    let fp = 0
    let tn = 0
    let fn = 0

    for i in 0..n {
        if predicted_class[i] == 1 && response[i] == 1 { tp = tp + 1 }
        if predicted_class[i] == 1 && response[i] == 0 { fp = fp + 1 }
        if predicted_class[i] == 0 && response[i] == 0 { tn = tn + 1 }
        if predicted_class[i] == 0 && response[i] == 1 { fn = fn + 1 }
    }

    let sens = tp / (tp + fn)
    let spec = tn / (tn + fp)
    let ppv = if tp + fp > 0 { tp / (tp + fp) } else { 0 }
    let npv = if tn + fn > 0 { tn / (tn + fn) } else { 0 }

    print("{t}        {sens |> round(3)}       {spec |> round(3)}      {ppv |> round(3)}    {npv |> round(3)}")
}

# Youden's J = sensitivity + specificity - 1: pick the threshold that maximizes it

Using the GLM Framework

# Logistic regression via the general GLM interface
let model_glm = glm("response ~ TMB + PDL1 + MSI", glm_data, "binomial")

# Same results, but now you can swap families:

# Poisson regression for count data (e.g., number of mutations)
# let count_data = table({"mutations": mutation_count, "exposure": exposure, "age": age})
# let count_model = glm("mutations ~ exposure + age", count_data, "poisson")

Visualizing Predicted Probabilities

# Box plot: predicted probabilities by actual outcome
let resp_probs = []
let nonresp_probs = []

for i in 0..n {
    if response[i] == 1 {
        resp_probs = resp_probs + [pred_prob[i]]
    } else {
        nonresp_probs = nonresp_probs + [pred_prob[i]]
    }
}

let bp_table = table({"Non-Responders": nonresp_probs, "Responders": resp_probs})
boxplot(bp_table, {title: "Predicted Probabilities by Actual Outcome"})

# Good model: minimal overlap between the two boxes

Effect of Individual Predictors

# Show how each predictor shifts the probability curve
# Fix other predictors at their means
let tmb_range = [0, 5, 10, 15, 20, 25, 30]
let pdl1_mean = 30
let msi_0 = 0

print("=== TMB Effect on Response Probability ===")
print("(PD-L1 = {pdl1_mean}, MSI = negative)")

for t in tmb_range {
    let lp = model.intercept + model.coefficients[0] * t
        + model.coefficients[1] * pdl1_mean + model.coefficients[2] * msi_0
    let p = 1.0 / (1.0 + exp(-lp))
    print("  TMB = {t}: P(response) = {(p * 100) |> round(1)}%")
}

Python:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, classification_report

# Statsmodels (full inference)
X = sm.add_constant(df[['TMB', 'PDL1', 'MSI']])
model = sm.Logit(response, X).fit()
print(model.summary())

# Odds ratios
print(np.exp(model.params))
print(np.exp(model.conf_int()))

# ROC curve
fpr, tpr, thresholds = roc_curve(response, model.predict(X))
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')

# Scikit-learn (prediction-focused)
clf = LogisticRegression().fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]

R:

# Logistic regression
model <- glm(response ~ TMB + PDL1 + MSI, data = df, family = binomial)
summary(model)

# Odds ratios with CI
exp(cbind(OR = coef(model), confint(model)))

# ROC curve
library(pROC)
roc_obj <- roc(response, fitted(model))
plot(roc_obj, print.auc = TRUE)

# Optimal threshold (Youden's J)
coords(roc_obj, "best", best.method = "youden")

Exercises

Exercise 1: Build and Interpret a Logistic Model

Predict cancer diagnosis (1 = cancer, 0 = benign) from three biomarkers. Interpret each odds ratio in clinical terms.

set_seed(42)
let n = 200

let biomarker_a = rnorm(n, 50, 15)
let biomarker_b = rnorm(n, 10, 4)
let age = rnorm(n, 55, 12)

let log_odds = -5 + 0.04 * biomarker_a + 0.2 * biomarker_b + 0.03 * age
let prob = log_odds |> map(|x| 1.0 / (1.0 + exp(-x)))
let cancer = prob |> map(|p| if runif(1, 0, 1)[0] < p { 1 } else { 0 })  # uniform draw -> Bernoulli(p)

# 1. Fit glm("cancer ~ biomarker_a + biomarker_b + age", data, "binomial")
# 2. Compute and interpret odds ratios: exp(coefficient) for each predictor
# 3. Which biomarker has the strongest effect?
# 4. What does the OR for age mean clinically?

Exercise 2: ROC Analysis

Build a logistic model and evaluate it with an ROC curve. Find the threshold that maximizes Youden’s J index (sensitivity + specificity - 1).

# Use the model from Exercise 1
# 1. Generate predicted probabilities from model coefficients
# 2. Plot ROC curve with roc_curve(table)
# 3. Compute sensitivity and specificity at thresholds 0.3, 0.5, 0.7
# 4. Which threshold maximizes Youden's J?
# 5. If this is a screening test, would you prefer a different threshold? Why?

Exercise 3: Comparing Two Models

Build two logistic models: one with TMB alone, another with TMB + PD-L1 + MSI. Compare their AUC values. Does adding predictors improve discrimination?

set_seed(42)
let n = 150

# Simulate data where TMB is a moderate predictor and MSI adds substantial value
# 1. Fit model_simple = glm("response ~ TMB", data, "binomial")
# 2. Fit model_full = glm("response ~ TMB + PDL1 + MSI", data, "binomial")
# 3. Compare AUC values
# 4. Plot both ROC curves
# 5. Is the improvement worth the added complexity?

Exercise 4: The Separation Problem

What happens when a predictor perfectly separates outcomes? Simulate MSI status that perfectly predicts response and observe what logistic regression does.

set_seed(42)
let n = 100
let msi = rnorm(n, 0, 1) |> map(|x| if x > 0.84 { 1 } else { 0 })
let response = msi  # perfect separation!

# 1. Try fitting glm("response ~ msi", data, "binomial")
# 2. What happens to the coefficient and its standard error?
# 3. Why is this a problem? (Hint: the MLE doesn't exist)
# 4. How would you handle this in practice?

Key Takeaways

  • Logistic regression models binary outcomes by predicting probabilities via the sigmoid function — use it instead of linear regression when Y is 0/1
  • Coefficients are on the log-odds scale; exponentiate to get odds ratios (OR): e^β
  • An OR of 2.0 means the odds double per unit increase — but this is NOT the same as doubling the probability
  • ROC curves show the sensitivity/specificity trade-off across all thresholds; AUC summarizes overall discrimination
  • The optimal threshold depends on clinical context: screening favors sensitivity, confirmation favors specificity
  • Logistic regression is a special case of the GLM framework — the same approach extends to count data (Poisson), survival data, and more
  • Always report odds ratios with confidence intervals, not just p-values
  • Watch for separation (a predictor perfectly predicts the outcome), which drives coefficient estimates toward infinity with enormous standard errors

What’s Next

Sometimes the outcome isn’t just binary — it’s time-to-event: how long until a patient relapses, how long a cell line survives treatment. Day 17 introduces survival analysis, where censoring makes standard methods fail and Kaplan-Meier curves become essential.