Day 28: Capstone — Clinical Variant Report


Difficulty	Advanced
Biology knowledge	Advanced (genomic variants, clinical genetics, gene-disease associations)
Coding knowledge	Advanced (all prior topics: pipes, tables, APIs, error handling, modules)
Time	~4–5 hours
Prerequisites	Days 1–27 completed, BioLang installed (see Appendix A)
Data needed	Generated locally via `init.bl` (simulated VCF, gene panels, reference files)

CLINICAL DISCLAIMER: This chapter is strictly educational. The pipeline demonstrated here is a simplified illustration of clinical genomics concepts. It must never be used for actual patient care, diagnosis, or clinical decision-making. Real clinical variant interpretation requires validated, accredited pipelines (CAP/CLIA), board-certified review, and adherence to professional guidelines (ACMG/AMP). The classification logic shown here is intentionally simplified and does not reflect the full complexity of clinical variant interpretation.

What You’ll Learn

How to load and parse VCF files for variant analysis
How to apply multi-stage quality and frequency filters
How to annotate variants with gene information via APIs
How to implement ACMG-inspired variant classification logic
How to generate structured clinical-style reports
How to build clinical-grade error handling into every pipeline stage
How to integrate skills from all 27 prior days into a single capstone project

The Problem

“A patient’s whole-exome sequencing results are in — can we build an automated clinical report?”

A 42-year-old patient with a family history of hereditary breast and ovarian cancer has undergone whole-exome sequencing. The sequencing facility has delivered a VCF file containing thousands of variants. Your task: filter the noise, identify clinically relevant variants, classify them according to established guidelines, cross-reference with gene-disease databases, and produce a structured report suitable for clinical review.

This is not a single-tool problem. It requires everything you have learned: file I/O (Day 6–7), variant fundamentals (Day 12), table operations (Day 10), API access (Day 9, 24), statistics (Day 14), error handling (Day 25), modules (Day 27), and pipeline thinking (Day 22). This capstone ties it all together.

Clinical Variant Report Pipeline

Section 1: Clinical Context

Before we write code, let us understand the clinical workflow this pipeline supports.

Variant Classification: The ACMG/AMP Framework

The American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published guidelines in 2015 for classifying sequence variants into five tiers:

ACMG/AMP Variant Classification

Our pipeline implements a simplified version of this logic. Real clinical labs use dozens of evidence criteria (PS1–PS4, PM1–PM6, PP1–PP5, BA1, BS1–BS4, BP1–BP7). We focus on a subset that can be evaluated computationally: population frequency, predicted impact, ClinVar concordance, and gene-disease association.

What a Clinical Report Contains

A clinical-grade variant report typically includes:

Patient metadata — demographics, ordering physician, indication for testing
Methodology — sequencing platform, coverage metrics, analysis pipeline version
Reportable findings — pathogenic and likely pathogenic variants with gene, transcript, protein change, zygosity, and classification rationale
Variants of uncertain significance — listed separately for transparency
Quality metrics — coverage, call rate, Ti/Tv ratio
Limitations — regions of low coverage, known blind spots
Disclaimer — signed interpretation by board-certified geneticist

Section 2: Setting Up the Data

Run the initialization script to generate synthetic clinical data:

cd days/day-28
bl init.bl

This creates:

File	Description
`data/patient.vcf`	Simulated patient VCF with 20 variants
`data/gene_db.tsv`	Gene annotation database (gene name, function, OMIM)
`data/clinvar_db.tsv`	ClinVar-like classification reference
`data/cancer_panel.tsv`	Hereditary cancer gene panel (25 genes)
`data/acmg_genes.tsv`	ACMG secondary findings gene list
`data/patient_info.tsv`	Patient metadata

All data is synthetic — no real patient information is used.

Section 3: Loading and Validating VCF Data

The first step is loading the VCF file and validating its structure. We use read_vcf() from Day 12:

let variants = read_vcf("data/patient.vcf")

The read_vcf() function returns a table with columns: chrom, pos, id, ref, alt, qual, filter, info. Let us inspect its shape:

let variant_count = variants |> len()
let columns = variants |> keys()

Validation Checks

Clinical pipelines fail loudly. We validate before proceeding:

fn validate_vcf(variants) {
    let n = variants |> len()
    if n == 0 {
        error("VCF file contains no variants")
    }

    let required_cols = ["chrom", "pos", "ref", "alt", "qual"]
    required_cols |> each(|col| {
        let col_names = variants |> keys()
        if contains(col_names, col) == false {
            error(f"Missing required VCF column: {col}")
        }
    })

    return { variant_count: n, status: "valid" }
}

This pattern — validate inputs, fail with clear messages — should feel familiar from Day 25.

Section 4: Quality Filtering

Raw variant calls include many low-confidence calls. We filter on two standard quality metrics:

QUAL — Phred-scaled quality score. QUAL >= 30 means the variant call has a 1-in-1000 chance of being wrong.
DP — Read depth. DP >= 10 ensures sufficient evidence supports the call.

fn quality_filter(variants, min_qual, min_dp) {
    variants |> where(|row| {
        let q = float(row.qual)
        let d = try {
            let info_str = row.info
            let parts = split(info_str, ";")
            let dp_part = parts |> filter(|p| starts_with(p, "DP="))
            if len(dp_part) > 0 {
                int(replace(dp_part[0], "DP=", ""))
            } else {
                0
            }
        } catch err {
            0
        }
        q >= min_qual and d >= min_dp
    })
}

We extract DP from the INFO field, which is semicolon-delimited. The try/catch ensures malformed INFO fields do not crash the pipeline — they simply get depth zero and are filtered out.

let qc_passed = quality_filter(variants, 30.0, 10)

Section 5: Variant Annotation

Next we annotate variants with gene information and ClinVar classifications. We load our reference databases as tables and join:

let gene_db = read_tsv("data/gene_db.tsv")
let clinvar_db = read_tsv("data/clinvar_db.tsv")

Building an Annotation Key

To join variants with our databases, we need a common key. We construct a variant key from chromosome, position, reference allele, and alternate allele:

fn make_variant_key(row) {
    return f"{row.chrom}:{row.pos}:{row.ref}:{row.alt}"
}

let annotated = qc_passed |> mutate("variant_key", |row| make_variant_key(row))

Joining with Gene and ClinVar Data

let with_genes = join_tables(annotated, gene_db, "gene")
let with_clinvar = join_tables(with_genes, clinvar_db, "variant_key")

The join_tables() function (Day 10) performs an inner join on the shared column. Variants without a match in the reference database are retained with empty annotation fields.

API-Based Annotation (Optional)

For real-world analysis, you would query live databases. Here is how you might enrich a variant with Ensembl VEP:

# Annotate a single variant with Ensembl VEP
# WARNING: rate-limited, use sparingly
fn annotate_with_vep(chrom, pos, ref_allele, alt_allele) {
    let hgvs = f"{chrom}:g.{pos}{ref_allele}>{alt_allele}"
    try {
        let vep = ensembl_vep(hgvs)
        return vep
    } catch err {
        return { error: str(err), consequence: "unknown" }
    }
}

And for gene-disease associations via NCBI:

# Look up gene-disease associations
fn lookup_gene_disease(gene_name) {
    try {
        let gene_info = ncbi_gene(gene_name, "human")
        return {
            gene: gene_name,
            description: gene_info.description,
            source: "NCBI"
        }
    } catch err {
        return { gene: gene_name, description: "lookup failed", source: "none" }
    }
}

In our capstone pipeline, we use the pre-built local databases from init.bl to keep the pipeline deterministic and offline-capable. The API calls above show how you would extend it for production use.

Section 6: Frequency Filtering

Common variants are unlikely to cause rare disease. We filter out variants with an allele frequency (AF) above 1% in population databases:

fn frequency_filter(variants, max_af) {
    variants |> where(|row| {
        let af = try {
            let info_str = row.info
            let parts = split(info_str, ";")
            let af_part = parts |> filter(|p| starts_with(p, "AF="))
            if len(af_part) > 0 {
                float(replace(af_part[0], "AF=", ""))
            } else {
                0.0
            }
        } catch err {
            0.0
        }
        af <= max_af
    })
}

let rare_variants = frequency_filter(with_clinvar, 0.01)

The logic mirrors real clinical pipelines: absent AF is treated as zero (novel variant, possibly significant), and we keep only variants with AF <= 0.01 (1%).

Section 7: Gene Panel Filtering

Clinical exome analysis does not report all variants — it focuses on genes relevant to the clinical indication. For hereditary cancer, we apply the cancer gene panel:

let cancer_panel = read_tsv("data/cancer_panel.tsv")
let acmg_genes = read_tsv("data/acmg_genes.tsv")

Panel Matching

fn panel_filter(variants, panel) {
    let panel_genes = panel |> select("gene") |> map(|row| row.gene)
    variants |> where(|row| {
        let gene = try { row.gene } catch err { "" }
        panel_genes |> filter(|g| g == gene) |> len() > 0
    })
}

let panel_variants = panel_filter(rare_variants, cancer_panel)
let acmg_variants = panel_filter(rare_variants, acmg_genes)

We apply both the disease-specific panel and the ACMG secondary findings list. The ACMG recommends reporting pathogenic/likely pathogenic variants in 81 genes regardless of the clinical indication — an important safety net.

Section 8: ACMG-Inspired Classification

Now we classify each variant. Our simplified scoring system uses four evidence dimensions:

Classification Decision Tree

Here is the BioLang implementation:

fn clinvar_score(clinvar_class) {
    if clinvar_class == "pathogenic" { return 3 }
    if clinvar_class == "likely_pathogenic" { return 2 }
    if clinvar_class == "uncertain" { return 0 }
    if clinvar_class == "likely_benign" { return -1 }
    if clinvar_class == "benign" { return -2 }
    return 0
}

fn frequency_score(af) {
    if af == 0.0 { return 1 }
    if af < 0.001 { return 0 }
    return -1
}

fn impact_score(impact) {
    if impact == "frameshift" { return 2 }
    if impact == "nonsense" { return 2 }
    if impact == "splice" { return 2 }
    if impact == "missense" { return 1 }
    if impact == "synonymous" { return -1 }
    return 0
}

fn gene_disease_score(strength) {
    if strength == "definitive" { return 1 }
    return 0
}

fn classify_variant(score) {
    if score >= 4 { return "Pathogenic" }
    if score == 3 { return "Likely Pathogenic" }
    if score >= 1 { return "VUS" }
    if score == 0 { return "Likely Benign" }
    return "Benign"
}

fn score_variant(row) {
    let cv = try { row.clinvar_class } catch err { "unknown" }
    let af = try {
        let parts = split(row.info, ";")
        let af_part = parts |> filter(|p| starts_with(p, "AF="))
        if len(af_part) > 0 { float(replace(af_part[0], "AF=", "")) } else { 0.0 }
    } catch err {
        0.0
    }
    let imp = try { row.impact } catch err { "unknown" }
    let gd = try { row.gene_disease } catch err { "unknown" }

    let s1 = clinvar_score(cv)
    let s2 = frequency_score(af)
    let s3 = impact_score(imp)
    let s4 = gene_disease_score(gd)
    let total = s1 + s2 + s3 + s4

    return {
        variant_key: try { row.variant_key } catch err { "" },
        gene: try { row.gene } catch err { "" },
        chrom: row.chrom,
        pos: row.pos,
        ref_allele: row.ref,
        alt_allele: row.alt,
        impact: imp,
        clinvar: cv,
        af: af,
        score: total,
        classification: classify_variant(total)
    }
}

Apply classification to all panel variants:

let classified = panel_variants |> map(|row| score_variant(row))

Grouping by Classification

let pathogenic = classified |> filter(|v| v.classification == "Pathogenic")
let likely_path = classified |> filter(|v| v.classification == "Likely Pathogenic")
let vus = classified |> filter(|v| v.classification == "VUS")
let likely_benign = classified |> filter(|v| v.classification == "Likely Benign")
let benign = classified |> filter(|v| v.classification == "Benign")

Section 9: Report Generation

The final step is generating a structured clinical report. We build it as a list of lines and write to both text and TSV formats:

fn format_variant_line(v) {
    return f"  {v.gene} | {v.chrom}:{v.pos} | {v.ref_allele}>{v.alt_allele} | {v.impact} | {v.clinvar} | Score:{v.score}"
}

fn build_report(patient_info, classified, qc_stats) {
    let report = [
        "================================================================",
        "       CLINICAL VARIANT ANALYSIS REPORT (EDUCATIONAL ONLY)",
        "================================================================",
        "",
        "DISCLAIMER: This report is generated by an educational pipeline.",
        "It must NOT be used for clinical decision-making.",
        "",
        "--- PATIENT INFORMATION ---",
        f"Patient ID: {patient_info.patient_id}",
        f"Sample ID: {patient_info.sample_id}",
        f"Indication: {patient_info.indication}",
        f"Report Date: {patient_info.report_date}",
        ""
    ]

    let path_variants = classified |> filter(|v| v.classification == "Pathogenic")
    let lp_variants = classified |> filter(|v| v.classification == "Likely Pathogenic")
    let vus_variants = classified |> filter(|v| v.classification == "VUS")
    let lb_variants = classified |> filter(|v| v.classification == "Likely Benign")
    let b_variants = classified |> filter(|v| v.classification == "Benign")

    report = report + [
        "--- SUMMARY ---",
        f"Total variants analyzed: {qc_stats.total_input}",
        f"Passed quality filter: {qc_stats.passed_qc}",
        f"Rare variants (AF <= 1%): {qc_stats.rare_count}",
        f"In gene panel: {qc_stats.panel_count}",
        f"Classified: {len(classified)}",
        "",
        f"  Pathogenic:        {len(path_variants)}",
        f"  Likely Pathogenic: {len(lp_variants)}",
        f"  VUS:               {len(vus_variants)}",
        f"  Likely Benign:     {len(lb_variants)}",
        f"  Benign:            {len(b_variants)}",
        ""
    ]

    if len(path_variants) > 0 {
        report = report + ["--- PATHOGENIC VARIANTS (Reportable) ---"]
        path_variants |> each(|v| {
            report = report + [format_variant_line(v)]
        })
        report = report + [""]
    }

    if len(lp_variants) > 0 {
        report = report + ["--- LIKELY PATHOGENIC VARIANTS (Reportable) ---"]
        lp_variants |> each(|v| {
            report = report + [format_variant_line(v)]
        })
        report = report + [""]
    }

    if len(vus_variants) > 0 {
        report = report + ["--- VARIANTS OF UNCERTAIN SIGNIFICANCE ---"]
        vus_variants |> each(|v| {
            report = report + [format_variant_line(v)]
        })
        report = report + [""]
    }

    report = report + [
        "--- QUALITY METRICS ---",
        f"Mean QUAL score: {qc_stats.mean_qual}",
        f"Mean depth: {qc_stats.mean_dp}",
        f"Variants filtered (low quality): {qc_stats.total_input - qc_stats.passed_qc}",
        "",
        "--- LIMITATIONS ---",
        "- This analysis covers exonic regions only",
        "- Structural variants and CNVs are not assessed",
        "- Intronic and regulatory variants may be missed",
        "- Classification is based on a simplified scoring model",
        "",
        "--- END OF REPORT ---"
    ]

    return report
}

Section 10: Quality Assurance

Clinical pipelines must track their own quality. We compute QA metrics at each stage:

fn compute_qc_stats(all_variants, qc_passed, rare, panel_matched) {
    let quals = all_variants |> select("qual") |> map(|row| float(row.qual))
    let depths = all_variants |> map(|row| {
        try {
            let parts = split(row.info, ";")
            let dp_part = parts |> filter(|p| starts_with(p, "DP="))
            if len(dp_part) > 0 { int(replace(dp_part[0], "DP=", "")) } else { 0 }
        } catch err {
            0
        }
    })

    return {
        total_input: len(all_variants),
        passed_qc: len(qc_passed),
        rare_count: len(rare),
        panel_count: len(panel_matched),
        mean_qual: mean(quals),
        mean_dp: mean(depths)
    }
}

Section 11: The Complete Pipeline

Here is the entire pipeline, end to end. Each stage flows into the next via pipes and function calls:

# --- Load data ---
let variants = read_vcf("data/patient.vcf")
let gene_db = read_tsv("data/gene_db.tsv")
let clinvar_db = read_tsv("data/clinvar_db.tsv")
let cancer_panel = read_tsv("data/cancer_panel.tsv")
let patient_meta = read_tsv("data/patient_info.tsv")
let patient_info = patient_meta[0]

# --- Validate ---
let validation = validate_vcf(variants)

# --- Quality filter ---
let qc_passed = quality_filter(variants, 30.0, 10)

# --- Annotate ---
let annotated = qc_passed |> mutate("variant_key", |row| make_variant_key(row))
let with_genes = join_tables(annotated, gene_db, "gene")
let with_clinvar = join_tables(with_genes, clinvar_db, "variant_key")

# --- Frequency filter ---
let rare_variants = frequency_filter(with_clinvar, 0.01)

# --- Panel filter ---
let panel_variants = panel_filter(rare_variants, cancer_panel)

# --- Classify ---
let classified = panel_variants |> map(|row| score_variant(row))

# --- QC stats ---
let qc_stats = compute_qc_stats(variants, qc_passed, rare_variants, panel_variants)

# --- Generate report ---
let report_lines = build_report(patient_info, classified, qc_stats)
write_lines(report_lines, "data/output/clinical_report.txt")

# --- Export classified variants ---
let classified_table = classified |> to_table()
write_tsv(classified_table, "data/output/classified_variants.tsv")

Run it:

cd days/day-28
bl init.bl
bl scripts/analysis.bl

Expected output files:

File	Contents
`data/output/clinical_report.txt`	Full clinical report with all sections
`data/output/classified_variants.tsv`	Classified variants in tabular format

Section 12: Extending with Live API Data

In a production pipeline, you would replace the local databases with live API queries. Here is a sketch of how the annotation stage would change:

# Production annotation: query NCBI and Ensembl for each gene
fn annotate_live(variants) {
    variants |> map(|row| {
        let gene_info = try {
            ncbi_gene(row.gene, "human")
        } catch err {
            { description: "unknown", summary: "" }
        }

        let vep_result = try {
            let hgvs = f"{row.chrom}:g.{row.pos}{row.ref}>{row.alt}"
            ensembl_vep(hgvs)
        } catch err {
            { consequence: "unknown" }
        }

        {
            gene: row.gene,
            chrom: row.chrom,
            pos: row.pos,
            ref_allele: row.ref,
            alt_allele: row.alt,
            gene_description: gene_info.description,
            vep_consequence: vep_result.consequence
        }
    })
}

The try/catch around every API call is essential — network failures must not crash a clinical pipeline.

Exercises

Exercise 1: Add a Secondary Findings Module

The ACMG recommends reporting pathogenic variants in 81 genes regardless of the primary indication. Extend the pipeline to:

Load the acmg_genes.tsv panel
Filter rare_variants against the ACMG gene list (separately from the cancer panel)
Classify the ACMG variants using the same scoring function
Add a “Secondary Findings” section to the report

Exercise 2: Variant Prioritization

Add a prioritization function that sorts classified variants by clinical urgency:

Pathogenic variants sorted by score (highest first)
Within the same score, sort by gene-disease association strength
Output a ranked list with rank numbers

Hint: use sort_by() with a custom key function that combines classification tier and score.

Exercise 3: Coverage Gap Report

Add a quality section that identifies genomic regions with insufficient coverage:

Parse the DP values from each variant’s INFO field
Flag any variant with DP < 20 as “low coverage”
Group low-coverage variants by chromosome
Add a “Coverage Gaps” section to the report listing affected genes

Exercise 4: Multi-Sample Comparison

Extend the pipeline to accept two VCF files (e.g., tumor and normal) and:

Identify variants present only in the tumor (somatic)
Identify variants shared between tumor and normal (germline)
Flag somatic variants with high impact for follow-up
Generate a comparative report

Key Takeaways

Clinical pipelines are layered — each filter stage reduces the variant set, and the order matters (quality before frequency before panel).
Error handling is non-negotiable — every I/O operation, every API call, every field access should be wrapped in try/catch in clinical code. A crash is unacceptable when patient data is at stake.
Classification is evidence-based — even our simplified scoring system combines multiple independent lines of evidence. Real ACMG classification uses 28 criteria across 5 evidence categories.
Reports must be transparent — every report includes methodology, limitations, and disclaimers. The pipeline documents what it did and what it could not do.
Modularity pays off — by Day 28, you can build a multi-stage pipeline by composing functions. Each scoring function, each filter, each formatter is independently testable.
Local-first, API-enriched — the pipeline works entirely offline with local databases, but can be extended with live API queries for production use. This mirrors how clinical labs operate: validated local databases with optional external enrichment.

Remember: Real clinical variant interpretation is a collaborative process between computational pipelines and board-certified clinical geneticists. Software identifies candidates; humans make diagnoses.

Summary

In this capstone, you built a complete clinical variant analysis pipeline that:

Loaded and validated VCF data
Applied multi-stage quality, frequency, and panel filters
Annotated variants with gene and ClinVar information
Implemented ACMG-inspired classification logic
Generated a structured clinical report
Tracked quality metrics throughout the pipeline

This project integrated skills from nearly every prior day: file I/O (Days 6–7), VCF parsing (Day 12), tables and joins (Day 10), API access (Days 9, 24), statistics (Day 14), error handling (Day 25), modules (Day 27), and pipeline design (Day 22).

Tomorrow in Day 29, we tackle another capstone: a complete RNA-seq differential expression study.

Keyboard shortcuts

Practical Bioinformatics in 30 Days