Day 19: Biological Data Visualization


Difficulty	Intermediate
Biology knowledge	Intermediate (GWAS, expression, survival analysis, genomic structure)
Coding knowledge	Intermediate (tables, records, pipes, sets)
Time	~3 hours
Prerequisites	Days 1-18 completed, BioLang installed (see Appendix A)
Data needed	Generated by `init.bl` (GWAS CSV, expression matrix CSV)

What You’ll Learn

How to create Manhattan and QQ plots for GWAS results
How to visualize gene expression with violin, density, PCA, and clustered heatmap plots
How to build clinical plots: Kaplan-Meier survival curves, ROC curves, and forest plots
How to render genomic structure with ideograms, circos plots, and lollipop plots
How to create sequence logos and phylogenetic trees
How to produce specialized genomic plots: Venn diagrams, UpSet plots, oncoprints, sashimi plots, and HiC maps
How to export publication-quality SVG figures

The Problem

Standard plots (scatter, histogram, bar) are not enough for genomics. You need Manhattan plots for GWAS, ideograms for chromosomal views, circos plots for structural variants, and survival curves for clinical data. Each biological question has a standard visualization, and building them from raw drawing primitives wastes hours that should be spent on analysis.

BioLang has 21 specialized bio visualization functions built in. Each follows a consistent pattern: the data (a table, list, record, or string, depending on the plot) comes first and the options come second. Each produces either ASCII art (for the terminal) or SVG (for publication); pass format: "svg" to any of them for publication-quality output.

GWAS Visualization

Genome-wide association studies produce millions of p-values, one per variant tested. The standard way to view these results is a Manhattan plot: chromosomes along the x-axis, negative log10 p-values on the y-axis. Significant associations appear as towers rising above a genome-wide significance threshold.

Manhattan Plot

# requires: data/gwas.csv in working directory (generated by init.bl)
let gwas = csv("data/gwas.csv")  # columns: chrom, pos, pvalue
manhattan(gwas, title: "Genome-Wide Association Study")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

The manhattan() function expects a table with chrom, pos, and pvalue columns. It automatically arranges chromosomes along the x-axis, alternates colors, and draws a significance threshold line at p = 5e-8.
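The y-axis transform and the threshold line are simple arithmetic. The following is a Python sketch of that arithmetic only, not BioLang's implementation; the sample rows and the helper name `neg_log10` are made up for illustration.

```python
import math

# Hypothetical GWAS rows (chrom, pos, pvalue); real data would come from data/gwas.csv.
rows = [
    ("chr1", 1_200_000, 0.03),
    ("chr2", 5_400_000, 2e-9),
    ("chr6", 31_000_000, 4e-12),
    ("chr9", 800_000, 1e-5),
]

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def neg_log10(p):
    """Height of a point on the Manhattan plot's y-axis."""
    return -math.log10(p)


# Height at which the horizontal threshold line is drawn (about 7.3).
threshold_height = neg_log10(GENOME_WIDE_P)

# Variants whose towers rise above the threshold line.
significant = [(chrom, pos) for chrom, pos, p in rows if p < GENOME_WIDE_P]

print(round(threshold_height, 2))
print(significant)
```

Because the scale is logarithmic, each unit on the y-axis is a tenfold drop in p-value, which is why strong associations stand out as towers.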

To produce SVG for a publication figure:

let svg = manhattan(gwas, format: "svg", title: "GWAS Results")
save_svg(svg, "figures/manhattan.svg")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

QQ Plot

A QQ plot compares observed p-values against the expected uniform distribution. Points should fall along the diagonal if there is no systematic inflation. Deviation from the diagonal at the tail indicates true associations; deviation across the whole range suggests population stratification or other confounding.

# Check for inflation in p-values
let pvalues = col(gwas, "pvalue") |> collect()
qq_plot(pvalues, title: "QQ Plot — Observed vs Expected")

The qq_plot() function takes a list of p-values (not a table), sorts them, computes expected quantiles, and plots observed vs expected on a -log10 scale.
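To make the sort-and-compare step concrete, here is a minimal Python sketch of how QQ plot coordinates can be computed. It is an illustration under stated assumptions, not BioLang's code: the p-values are invented, and the (i - 0.5) / n plotting position is one common convention among several.

```python
import math

# Hypothetical p-values for illustration only.
pvalues = [0.5, 0.01, 0.2, 0.8, 0.05]

n = len(pvalues)
observed = sorted(pvalues)  # smallest p-value first

# Expected i-th smallest p-value under the uniform null.
# Assumption: the (i - 0.5) / n plotting position; i / (n + 1) is another common choice.
expected = [(i - 0.5) / n for i in range(1, n + 1)]

# QQ plot points on the -log10 scale: (expected, observed).
points = [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]

for x, y in points:
    print(f"{x:.3f}\t{y:.3f}")
```

Points above the diagonal (observed larger than expected on the -log10 scale) are p-values smaller than chance alone would predict.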

Expression Visualization

Gene expression experiments produce continuous measurements across conditions. Violin plots show the full distribution shape, density plots smooth out individual observations, PCA reveals sample clustering, and clustered heatmaps show both gene and sample groupings.

Violin Plot

A violin plot combines a box plot with a kernel density estimate, showing the full shape of the data distribution in each group.

let groups = {
    control: [5.2, 4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.4],
    low_dose: [6.5, 7.1, 6.8, 6.3, 7.0, 6.6, 6.9, 7.2],
    high_dose: [9.2, 8.8, 9.5, 9.0, 8.6, 9.3, 8.9, 9.1]
}
violin(groups, title: "Expression by Treatment Group")

The violin() function takes a record where each key is a group name and each value is a list of numbers. It renders mirrored kernel density estimates for each group.

Density Plot

A density plot is a smoothed histogram, useful for seeing the overall shape of a distribution without binning artifacts.

let values = [2.1, 3.5, 4.2, 5.8, 6.1, 7.3, 3.8, 5.5, 4.9, 6.7, 3.2, 5.1, 4.5, 6.0, 7.8]
density(values, title: "Expression Density")

The density() function takes a list of numbers and uses kernel density estimation (Silverman bandwidth) to produce a smooth curve.
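Kernel density estimation can be sketched in plain Python as follows. This is an illustration of the technique, not BioLang's implementation: it assumes a Gaussian kernel and the 0.9 * min(sd, IQR / 1.34) * n^(-1/5) form of Silverman's rule of thumb, and the text does not say which variant density() uses.

```python
import math
import statistics

values = [2.1, 3.5, 4.2, 5.8, 6.1, 7.3, 3.8, 5.5, 4.9, 6.7, 3.2, 5.1, 4.5, 6.0, 7.8]


def silverman_bandwidth(xs):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(xs)
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)


def gaussian_kde(xs, h):
    """Return a function that estimates the density at any point x."""
    norm = 1.0 / (len(xs) * h * math.sqrt(2 * math.pi))

    def density_at(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density_at


h = silverman_bandwidth(values)
density_at = gaussian_kde(values, h)

# Evaluate the smooth curve on a grid from 0.0 to 10.0 in steps of 0.5.
grid = [i * 0.5 for i in range(21)]
curve = [(x, density_at(x)) for x in grid]
```

A smaller bandwidth gives a bumpier curve that follows individual observations; a larger one smooths more aggressively.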

PCA Plot

Principal component analysis reduces high-dimensional expression data to two dimensions, revealing whether samples cluster by condition, batch, or other factors.

# requires: data/expression_matrix.csv in working directory
let expr = csv("data/expression_matrix.csv")
pca_plot(expr, title: "PCA — Sample Clustering")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

The pca_plot() function takes a numeric table (samples as rows, features as columns) and projects the data onto the first two principal components.

Clustered Heatmap

A clustered heatmap shows expression levels as colors in a grid, with hierarchical clustering applied to both rows and columns. Genes with similar expression patterns cluster together.

let matrix = csv("data/expression_matrix.csv")
clustered_heatmap(matrix, title: "Hierarchical Clustering")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Clinical Visualization

Clinical bioinformatics requires plots that were developed in biostatistics: survival curves for time-to-event data, ROC curves for classifier evaluation, and forest plots for meta-analysis.

Kaplan-Meier Survival Curve

The Kaplan-Meier estimator plots the probability of survival over time. Each step down represents an event (death, relapse, progression). Censored observations (patients lost to follow-up) are marked but do not cause a step.

let survival_data = [
    {time: 12, event: 1}, {time: 24, event: 1}, {time: 36, event: 0},
    {time: 8, event: 1}, {time: 48, event: 0}, {time: 15, event: 1},
    {time: 30, event: 0}, {time: 20, event: 1}, {time: 42, event: 0},
    {time: 6, event: 1},
] |> to_table()
kaplan_meier(survival_data, title: "Overall Survival")

The kaplan_meier() function expects a table with time and event columns. event: 1 means the event occurred; event: 0 means the observation was censored.
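The estimator behind the curve is a running product: at each event time, the survival probability is multiplied by the fraction of at-risk patients who did not have the event. A minimal Python sketch of that calculation on the same ten observations follows; it illustrates the estimator, not BioLang's internals.

```python
survival_data = [
    (12, 1), (24, 1), (36, 0), (8, 1), (48, 0),
    (15, 1), (30, 0), (20, 1), (42, 0), (6, 1),
]  # (time, event): event 1 = occurred, 0 = censored


def kaplan_meier_steps(data):
    """Return [(time, survival)] at each distinct event time."""
    steps = []
    survival = 1.0
    for t in sorted({time for time, event in data if event == 1}):
        # Everyone still under observation just before time t.
        at_risk = sum(1 for time, _ in data if time >= t)
        events = sum(1 for time, event in data if time == t and event == 1)
        survival *= 1 - events / at_risk
        steps.append((t, survival))
    return steps


steps = kaplan_meier_steps(survival_data)
for t, s in steps:
    print(t, round(s, 3))
```

For this data the curve steps down at times 6, 8, 12, 15, 20, and 24, reaching 0.4, and then stays flat because the four remaining observations are all censored.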

ROC Curve

A receiver operating characteristic (ROC) curve evaluates binary classifiers by plotting the true positive rate against the false positive rate at every threshold. The area under the curve (AUC) summarizes overall performance — 0.5 is random guessing, 1.0 is perfect classification.

let predictions = [
    {score: 0.9, label: 1}, {score: 0.8, label: 1}, {score: 0.7, label: 0},
    {score: 0.6, label: 1}, {score: 0.5, label: 0}, {score: 0.4, label: 0},
    {score: 0.3, label: 0}, {score: 0.2, label: 1}, {score: 0.1, label: 0},
] |> to_table()
roc_curve(predictions, title: "Classifier Performance")

The roc_curve() function takes a table with score (predicted probability) and label (0 or 1) columns. It computes and displays the AUC.
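The curve and its AUC can be computed by sweeping the threshold from the highest score down. The Python sketch below does this for the same nine predictions; it is an illustration of the calculation, not BioLang's implementation, and it assumes all scores are distinct.

```python
predictions = [
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0),
]  # (score, label)


def roc_points(preds):
    """(FPR, TPR) after lowering the threshold past each score, highest first.

    Assumes scores are distinct; tied scores would need to be grouped.
    """
    positives = sum(label for _, label in preds)
    negatives = len(preds) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(preds, reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points(predictions)
print(round(auc(points), 3))  # 0.75
```

An AUC of 0.75 means that a randomly chosen positive example outranks a randomly chosen negative one 75% of the time (15 of the 20 positive-negative pairs here).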

Forest Plot

A forest plot displays effect sizes and confidence intervals from multiple studies, used in meta-analysis to visualize whether results are consistent across studies.

let studies = [
    {study: "Smith 2020", effect: 1.5, ci_lower: 1.1, ci_upper: 2.0},
    {study: "Jones 2021", effect: 1.8, ci_lower: 1.3, ci_upper: 2.5},
    {study: "Chen 2022", effect: 1.2, ci_lower: 0.8, ci_upper: 1.8},
    {study: "Patel 2023", effect: 1.6, ci_lower: 1.2, ci_upper: 2.1},
] |> to_table()
forest_plot(studies, title: "Meta-Analysis: Gene X Association")

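The forest_plot() function expects columns study, effect, ci_lower, and ci_upper. Each study is shown as a point with horizontal whiskers for the confidence interval. A vertical line at effect = 1.0 marks the null, which is the no-effect value for ratio measures such as odds ratios and hazard ratios. A study whose whiskers cross that line is not significant on its own.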
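Reading a forest plot mostly comes down to checking whether each interval crosses the null line. The short Python sketch below applies that check to the four studies; the helper name `excludes_null` is hypothetical and the code is an illustration, not part of BioLang.

```python
studies = [
    ("Smith 2020", 1.5, 1.1, 2.0),
    ("Jones 2021", 1.8, 1.3, 2.5),
    ("Chen 2022", 1.2, 0.8, 1.8),
    ("Patel 2023", 1.6, 1.2, 2.1),
]  # (study, effect, ci_lower, ci_upper)

NULL_EFFECT = 1.0  # no-effect value for ratio measures such as odds or hazard ratios


def excludes_null(ci_lower, ci_upper, null=NULL_EFFECT):
    """True when the whole interval lies on one side of the null line."""
    return ci_lower > null or ci_upper < null


for study, effect, lo, hi in studies:
    verdict = "excludes" if excludes_null(lo, hi) else "crosses"
    print(f"{study}: {effect} [{lo}, {hi}] {verdict} the null")

crossing = [study for study, _, lo, hi in studies if not excludes_null(lo, hi)]
```

Only Chen 2022, with an interval of 0.8 to 1.8, crosses the null line; the other three intervals lie entirely above 1.0.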

Genomic Structure Visualization

Genomics often requires viewing data in the context of chromosome structure. Ideograms show banding patterns, circos plots present genome-wide data in a circular layout, and lollipop plots mark mutation positions along a protein or gene.

Ideogram

An ideogram draws a schematic chromosome with cytogenetic banding. Bands are colored by Giemsa staining intensity, giving a bird’s-eye view of chromosome structure.

# Simplified, illustrative bands; these are not real cytoband coordinates for chr17
let bands = [
    {chrom: "chr17", start: 0, end: 25000000, band: "p13.3", stain: "gneg"},
    {chrom: "chr17", start: 25000000, end: 43000000, band: "p11.2", stain: "gpos50"},
    {chrom: "chr17", start: 43000000, end: 83257441, band: "q25.3", stain: "gneg"},
] |> to_table()
ideogram(bands, title: "Chromosome 17")

The ideogram() function expects columns chrom, start, end, band, and stain. Stain values follow cytogenetic conventions: gneg (light), gpos25/gpos50/gpos75/gpos100 (increasingly dark), acen (centromere), gvar (variable).

Circos Plot

A circos plot arranges chromosomes in a circle and draws data tracks on the inside or outside. It is particularly useful for showing structural variants, translocations, or genome-wide trends.

let data = [
    {chrom: "chr1", start: 1000000, end: 2000000, value: 3.5},
    {chrom: "chr2", start: 500000, end: 1500000, value: 2.8},
    {chrom: "chr3", start: 2000000, end: 3000000, value: 4.1},
] |> to_table()
circos(data, title: "Genome-Wide View")

The circos() function takes a table with chrom, start, end, and value columns. In ASCII mode, it renders a simplified circular representation. In SVG mode, it produces a full circular plot.

Lollipop Plot

A lollipop plot shows mutation positions along a gene or protein sequence as vertical stems topped with circles. The height or size of each circle represents mutation frequency.

let mutations = [
    {position: 248, count: 45, label: "R248W"},
    {position: 273, count: 38, label: "R273H"},
    {position: 175, count: 30, label: "R175H"},
    {position: 245, count: 25, label: "G245S"},
    {position: 282, count: 18, label: "R282W"},
] |> to_table()
lollipop(mutations, title: "TP53 Hotspot Mutations")

The lollipop() function expects position and count columns. An optional label column adds text annotations at each position.

Sequence Visualization

Sequence Logo

A sequence logo shows the information content at each position in a set of aligned sequences. Tall letters indicate highly conserved positions; short letters indicate variable positions. This is the standard way to visualize transcription factor binding motifs, splice sites, and other sequence features.

let sequences = [
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
]
sequence_logo(sequences, title: "TATA Box Motif")

The sequence_logo() function takes a list of equal-length strings and computes the information content (bits) at each position.
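Information content per position is the maximum possible entropy (2 bits for DNA) minus the observed Shannon entropy of that column. The Python sketch below computes it for the same eight sequences; it is an illustration that omits the small-sample correction some logo tools apply, and it is not BioLang's implementation.

```python
import math
from collections import Counter

sequences = [
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
]


def information_content(seqs, alphabet_size=4):
    """Bits per column: log2(alphabet size) minus Shannon entropy.

    No small-sample correction is applied in this sketch.
    """
    max_bits = math.log2(alphabet_size)
    bits = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        bits.append(max_bits - entropy)
    return bits


bits = information_content(sequences)
print([round(b, 2) for b in bits])  # [2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 2.0, 2.0]
```

Position 6 is split evenly between A and T, so it carries 1 bit; every other position is fully conserved and carries the full 2 bits.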

Phylogenetic Tree

A phylogenetic tree shows evolutionary relationships between species or sequences. BioLang can render trees from Newick format strings.

let newick = "((Human:0.1,Chimp:0.12):0.08,(Mouse:0.25,Rat:0.23):0.15,Zebrafish:0.45);"
phylo_tree(newick, title: "Species Phylogeny")

The phylo_tree() function parses a Newick-format string and renders a dendrogram.
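To show what parsing Newick involves, here is a small recursive-descent parser in Python. It is a sketch that assumes a well-formed string ending in a semicolon, with no whitespace, quoted names, or comments; the function names are hypothetical and this is not how BioLang is known to implement phylo_tree().

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts: {name, length, children}."""
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
        # Optional node name (empty for unnamed internal nodes).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def tip_distances(node, so_far=0.0):
    """Yield (leaf name, distance from root) pairs."""
    total = so_far + node["length"]
    if not node["children"]:
        yield node["name"], total
    for child in node["children"]:
        yield from tip_distances(child, total)


newick = "((Human:0.1,Chimp:0.12):0.08,(Mouse:0.25,Rat:0.23):0.15,Zebrafish:0.45);"
tree = parse_newick(newick)
tips = dict(tip_distances(tree))
for name, dist in tips.items():
    print(name, round(dist, 2))
```

The root-to-tip distances (for example 0.08 + 0.1 = 0.18 for Human) determine how far right each leaf sits when the tree is drawn with branch lengths.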

Specialized Genomic Plots

Venn Diagram

A Venn diagram shows the overlap between two or three sets. In genomics, this is commonly used to compare gene lists from different experiments, conditions, or methods.

let sets = {
    "Experiment A": set(["BRCA1", "TP53", "EGFR", "MYC", "KRAS"]),
    "Experiment B": set(["TP53", "EGFR", "PTEN", "RB1", "MYC"]),
    "Experiment C": set(["BRCA1", "MYC", "APC", "PTEN", "TP53"]),
}
venn(sets, title: "Gene Overlap Across Experiments")

The venn() function takes a record of sets (up to 3). It computes all intersection sizes and renders the classic overlapping-circles diagram.

UpSet Plot

When you have more than three sets, Venn diagrams become unreadable. UpSet plots show set intersections as a matrix with connected dots, with bar charts showing intersection sizes. They scale to dozens of sets.

upset(sets, title: "Set Intersections")

The upset() function takes the same input as venn() but is designed for any number of sets.
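Both Venn and UpSet plots are built from exclusive intersections: each element is counted once, under the exact combination of sets that contain it. The Python sketch below computes those groups for the three experiments; it illustrates the bookkeeping and is not BioLang's implementation.

```python
from collections import defaultdict

sets = {
    "Experiment A": {"BRCA1", "TP53", "EGFR", "MYC", "KRAS"},
    "Experiment B": {"TP53", "EGFR", "PTEN", "RB1", "MYC"},
    "Experiment C": {"BRCA1", "MYC", "APC", "PTEN", "TP53"},
}


def exclusive_intersections(named_sets):
    """Group every element by the exact combination of sets that contain it."""
    groups = defaultdict(set)
    for element in set().union(*named_sets.values()):
        members = tuple(name for name, s in named_sets.items() if element in s)
        groups[members].add(element)
    return dict(groups)


groups = exclusive_intersections(sets)

# Largest intersections first, as in the bar chart of an UpSet plot.
for members, genes in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(" & ".join(members), len(genes), sorted(genes))
```

Here TP53 and MYC are shared by all three experiments, and every other region holds exactly one gene. The same function works unchanged for any number of sets, which is where UpSet plots become more readable than Venn diagrams.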

Oncoprint

An oncoprint shows the alteration landscape of a cancer cohort. Each row is a gene, each column is a sample, and colored tiles indicate the alteration type: point mutations (missense, nonsense) or copy number changes (amplification, deletion). It is a standard visualization in cancer genomics studies.

let mutations_matrix = [
    {gene: "TP53", sample1: "Missense", sample2: "Nonsense", sample3: "None", sample4: "Missense"},
    {gene: "KRAS", sample1: "None", sample2: "Missense", sample3: "Missense", sample4: "None"},
    {gene: "EGFR", sample1: "Amplification", sample2: "None", sample3: "None", sample4: "Deletion"},
] |> to_table()
oncoprint(mutations_matrix, title: "Mutation Landscape")
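A useful companion to an oncoprint is the per-gene alteration frequency, which many oncoprints use to order their rows. The Python sketch below computes it from the same matrix; whether BioLang's oncoprint() sorts rows this way is not stated, so treat the ordering as an assumption.

```python
mutations_matrix = [
    {"gene": "TP53", "sample1": "Missense", "sample2": "Nonsense", "sample3": "None", "sample4": "Missense"},
    {"gene": "KRAS", "sample1": "None", "sample2": "Missense", "sample3": "Missense", "sample4": "None"},
    {"gene": "EGFR", "sample1": "Amplification", "sample2": "None", "sample3": "None", "sample4": "Deletion"},
]


def alteration_frequency(row):
    """Fraction of samples in which the gene carries any alteration."""
    calls = [value for key, value in row.items() if key != "gene"]
    altered = sum(1 for call in calls if call != "None")
    return altered / len(calls)


frequencies = {row["gene"]: alteration_frequency(row) for row in mutations_matrix}

# Most frequently altered genes first (a common, but assumed, row order).
row_order = sorted(frequencies, key=frequencies.get, reverse=True)
print(frequencies)
print(row_order)
```

TP53 is altered in three of four samples (0.75), while KRAS and EGFR are each altered in two (0.5).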

RNA-seq and Chromatin Plots

Sashimi Plot

A sashimi plot shows RNA-seq splice junctions as arcs connecting exon positions, with read counts on each arc. It is used to identify alternative splicing events and quantify their usage.

let junctions = [
    {chrom: "chr17", start: 43100000, end: 43105000, count: 25},
    {chrom: "chr17", start: 43105000, end: 43110000, count: 18},
    {chrom: "chr17", start: 43100000, end: 43110000, count: 5},
] |> to_table()
sashimi(junctions, title: "Splice Junctions — BRCA1")

HiC Contact Map

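A HiC contact map shows chromatin interaction frequencies as a heatmap. Contact frequency is highest along the diagonal, where neighboring loci interact, and topologically associating domains (TADs) appear as squares of enriched contact along the diagonal (or as triangles when only half of the symmetric matrix is drawn).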

let contacts = [
    [100, 50, 20, 5],
    [50, 100, 40, 10],
    [20, 40, 100, 30],
    [5, 10, 30, 100],
]
hic_map(contacts, title: "Chromatin Contacts")

The hic_map() function takes a nested list (symmetric matrix) of contact frequencies.
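Two quick sanity checks are worth running on a contact matrix before plotting it: that it is symmetric, and that contact frequency falls off with distance from the diagonal. The Python sketch below does both for the 4 x 4 example; the helper names are hypothetical and this is not part of BioLang.

```python
contacts = [
    [100, 50, 20, 5],
    [50, 100, 40, 10],
    [20, 40, 100, 30],
    [5, 10, 30, 100],
]


def is_symmetric(matrix):
    """True if the matrix is square and equal to its transpose."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i + 1, n))


def mean_contact_by_distance(matrix):
    """Average contact frequency at each bin separation |i - j|."""
    n = len(matrix)
    return [
        sum(matrix[i][i + d] for i in range(n - d)) / (n - d)
        for d in range(n)
    ]


print(is_symmetric(contacts))              # True
print(mean_contact_by_distance(contacts))  # [100.0, 40.0, 15.0, 5.0]
```

The averages drop from 100 on the diagonal to 5 at the largest separation, the distance decay that gives contact maps their bright diagonal.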

Additional Genomic Plots

CNV Plot

A copy number variation plot shows log2 ratios across genomic positions. Segments above zero indicate gains (amplifications); segments below zero indicate losses (deletions).

let cnv_data = [
    {chrom: "chr1", start: 1000000, end: 5000000, log2ratio: 0.5},
    {chrom: "chr1", start: 5000000, end: 10000000, log2ratio: -0.8},
    {chrom: "chr2", start: 2000000, end: 8000000, log2ratio: 1.2},
    {chrom: "chr3", start: 1000000, end: 6000000, log2ratio: -0.3},
] |> to_table()
cnv_plot(cnv_data, title: "Copy Number Alterations")
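A log2 ratio can be converted back to an approximate copy number with ploidy * 2^ratio. The Python sketch below applies that conversion and the sign-based gain/loss reading described above to the same four segments. It assumes a diploid reference and a pure sample, and real pipelines usually add thresholds around zero to absorb noise.

```python
cnv_data = [
    ("chr1", 1_000_000, 5_000_000, 0.5),
    ("chr1", 5_000_000, 10_000_000, -0.8),
    ("chr2", 2_000_000, 8_000_000, 1.2),
    ("chr3", 1_000_000, 6_000_000, -0.3),
]  # (chrom, start, end, log2ratio)


def estimated_copies(log2ratio, ploidy=2):
    """Copy number implied by a log2 ratio against a diploid reference."""
    return ploidy * 2 ** log2ratio


def call(log2ratio):
    """Sign-based call; real pipelines add noise thresholds around zero."""
    if log2ratio > 0:
        return "gain"
    if log2ratio < 0:
        return "loss"
    return "neutral"


calls = [
    (chrom, start, end, call(ratio), round(estimated_copies(ratio), 2))
    for chrom, start, end, ratio in cnv_data
]
for row in calls:
    print(row)
```

A log2 ratio of 0.5 corresponds to roughly 2.83 copies and -0.8 to roughly 1.15, so neither is a clean single-copy gain or loss; fractional values like these often reflect a mixture of cells.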

Rainfall Plot

A rainfall plot shows inter-mutation distances on a log scale, revealing clusters of mutations (kataegis) as downward-pointing streaks.

let mutation_positions = [
    {chrom: "chr1", pos: 100000},
    {chrom: "chr1", pos: 100050},
    {chrom: "chr1", pos: 100120},
    {chrom: "chr1", pos: 500000},
    {chrom: "chr2", pos: 200000},
    {chrom: "chr2", pos: 800000},
] |> to_table()
rainfall(mutation_positions, title: "Mutation Clustering")
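The y-value of each point in a rainfall plot is the distance to the previous mutation on the same chromosome. The Python sketch below computes those distances for the six positions; it is an illustration of the calculation, not BioLang's implementation.

```python
import math
from collections import defaultdict

mutation_positions = [
    ("chr1", 100000), ("chr1", 100050), ("chr1", 100120),
    ("chr1", 500000), ("chr2", 200000), ("chr2", 800000),
]


def intermutation_distances(mutations):
    """Distance from each mutation to the previous one on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)

    points = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for prev, cur in zip(positions, positions[1:]):
            points.append((chrom, cur, cur - prev))
    return points


points = intermutation_distances(mutation_positions)
for chrom, pos, dist in points:
    # log10 assumes dist > 0, i.e. no duplicate positions.
    print(chrom, pos, dist, round(math.log10(dist), 2))
```

The first two chr1 distances (50 and 70 bp) sit far below the others (about 400 kb and 600 kb) on a log scale, which is the kind of dip a kataegis cluster produces. The first mutation on each chromosome has no predecessor and gets no point.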

Saving and Exporting

All bio visualization functions support two output modes:

ASCII (default): Prints a text-based rendering to the terminal, useful for quick inspection in a REPL or pipeline
SVG (format: "svg"): Returns an SVG string for publication-quality figures

# ASCII output — prints directly to terminal
manhattan(gwas, title: "Quick Look")

# SVG output — returns a string
let svg = manhattan(gwas, format: "svg", title: "GWAS Results")
save_svg(svg, "figures/manhattan.svg")

# save_plot is an alias for save_svg
save_plot(violin(groups, format: "svg"), "figures/violin.svg")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

The SVG output is designed for journal submission: clean lines, proper labels, and a white background. You can open the SVG in Inkscape, Illustrator, or any browser for further editing.

Bio Plot Reference Table

Plot	Function	Data Input	Use Case
Manhattan	`manhattan()`	Table: chrom, pos, pvalue	GWAS significance
QQ	`qq_plot()`	List of p-values	P-value inflation check
Violin	`violin()`	Record of named lists	Distribution comparison
Density	`density()`	List of values	Smooth distribution
Kaplan-Meier	`kaplan_meier()`	Table: time, event	Survival analysis
ROC	`roc_curve()`	Table: score, label	Classifier evaluation
Forest	`forest_plot()`	Table: study, effect, ci_lower, ci_upper	Meta-analysis
Ideogram	`ideogram()`	Table: chrom, start, end, band, stain	Chromosome view
Circos	`circos()`	Table: chrom, start, end, value	Genome-wide circular
Lollipop	`lollipop()`	Table: position, count	Mutation hotspots
Sequence logo	`sequence_logo()`	List of equal-length strings	Motif conservation
Phylo tree	`phylo_tree()`	Newick string	Evolutionary relationships
Venn	`venn()`	Record of sets	Set overlap (2-3 sets)
UpSet	`upset()`	Record of sets	Set overlap (many sets)
Oncoprint	`oncoprint()`	Table: gene, sample columns	Mutation landscape
Sashimi	`sashimi()`	Table: chrom, start, end, count	Splice junctions
HiC	`hic_map()`	Nested list (matrix)	Chromatin contacts
CNV	`cnv_plot()`	Table: chrom, start, end, log2ratio	Copy number
Rainfall	`rainfall()`	Table: chrom, pos	Mutation clustering
PCA	`pca_plot()`	Table (samples x features)	Dimensionality reduction
Clustered heatmap	`clustered_heatmap()`	Table (matrix)	Hierarchical clustering

Exercises

Manhattan plot: Load data/gwas.csv, create a Manhattan plot, and identify which chromosome has the most significant hit (the lowest p-value).
Survival comparison: Create two Kaplan-Meier curves — one for a treatment group and one for a control group — and observe the difference in median survival time.
Sequence logo: Create a list of 10 aligned 8-mer sequences around a TATA box motif (positions 1-6 should be mostly T-A-T-A-A-A, with most of the variation in positions 5-8). Generate a sequence logo and identify which positions are most conserved.
Gene list overlap: Create three gene lists (at least 5 genes each) with partial overlap. Use venn() to visualize the overlaps, then use upset() on the same data and compare the two views.
Mutation hotspots: Build a lollipop plot showing at least 6 mutation positions in TP53. Include real hotspot names (R175H, G245S, R248W, R273H, R282W, Y220C).

Key Takeaways

BioLang has 21 specialized bio visualization functions, each designed for a specific biological question
GWAS: manhattan() for genome-wide significance, qq_plot() for inflation diagnostics
Expression: violin() for distributions, pca_plot() for sample clustering, clustered_heatmap() for pattern discovery
Clinical: kaplan_meier() for survival, roc_curve() for classifier evaluation, forest_plot() for meta-analysis
Genomic structure: ideogram() for chromosomes, circos() for genome-wide circular views, lollipop() for mutation positions
Sequence: sequence_logo() for motifs, phylo_tree() for evolution
All bio plots support ASCII (terminal) and SVG (publication) output
Use save_svg() or save_plot() to export publication-quality figures
Choose the plot that matches your data type and biological question

What’s Next

Tomorrow: Multi-Species Comparison — fetching orthologs, comparing sequences across species, and visualizing conservation patterns.

Practical Bioinformatics in 30 Days
