Day 19: Biological Data Visualization
| Difficulty | Intermediate |
| Biology knowledge | Intermediate (GWAS, expression, survival analysis, genomic structure) |
| Coding knowledge | Intermediate (tables, records, pipes, sets) |
| Time | ~3 hours |
| Prerequisites | Days 1-18 completed, BioLang installed (see Appendix A) |
| Data needed | Generated by init.bl (GWAS CSV, expression matrix CSV) |
What You’ll Learn
- How to create Manhattan and QQ plots for GWAS results
- How to visualize gene expression with violin, density, PCA, and clustered heatmap plots
- How to build clinical plots: Kaplan-Meier survival curves, ROC curves, and forest plots
- How to render genomic structure with ideograms, circos plots, and lollipop plots
- How to create sequence logos and phylogenetic trees
- How to produce specialized genomic plots: Venn diagrams, UpSet plots, oncoprints, sashimi plots, and HiC maps
- How to export publication-quality SVG figures
The Problem
Standard plots — scatter, histogram, bar — are not enough for genomics. You need Manhattan plots for GWAS, ideograms for chromosomal views, circos plots for structural variants, survival curves for clinical data. Each biological question has a standard visualization, and building them from raw drawing primitives wastes hours that should be spent on analysis.
BioLang has 21 specialized bio visualization functions built in. Each takes a table or list, produces either ASCII art (for the terminal) or SVG (for publication), and follows a consistent pattern: data first, options second. Every function supports format: "svg" for publication-quality output.
GWAS Visualization
Genome-wide association studies produce millions of p-values, one per variant tested. The standard way to view these results is a Manhattan plot: chromosomes along the x-axis, negative log10 p-values on the y-axis. Significant associations appear as towers rising above a genome-wide significance threshold.
Manhattan Plot
# requires: data/gwas.csv in working directory (generated by init.bl)
let gwas = csv("data/gwas.csv") # columns: chrom, pos, pvalue
manhattan(gwas, title: "Genome-Wide Association Study")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.
The manhattan() function expects a table with chrom, pos, and pvalue columns. It automatically arranges chromosomes along the x-axis, alternates colors, and draws a significance threshold line at p = 5e-8.
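The y-axis transform and the threshold line can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not BioLang's implementation; the rows are made-up values that mirror the chrom, pos, and pvalue columns, and the names `rows`, `neg_log10`, and `THRESHOLD_P` are hypothetical.

```python
import math

# Hypothetical rows with the same columns as data/gwas.csv (values are made up).
rows = [
    {"chrom": "chr1", "pos": 1_200_000, "pvalue": 3e-9},
    {"chrom": "chr2", "pos": 5_400_000, "pvalue": 0.02},
    {"chrom": "chr6", "pos": 31_000_000, "pvalue": 5e-8},
]

THRESHOLD_P = 5e-8                        # genome-wide significance threshold
threshold_y = -math.log10(THRESHOLD_P)    # height of the threshold line, about 7.3


def neg_log10(p):
    """Map a p-value to its height on the Manhattan plot."""
    return -math.log10(p)


# Variants strictly below the threshold p-value rise above the line.
significant = [r for r in rows if r["pvalue"] < THRESHOLD_P]

for r in rows:
    print(r["chrom"], r["pos"], round(neg_log10(r["pvalue"]), 2))
print("threshold line at y =", round(threshold_y, 2))
```

Smaller p-values map to taller points, which is why strong associations look like towers.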
To produce SVG for a publication figure:
let svg = manhattan(gwas, format: "svg", title: "GWAS Results")
save_svg(svg, "figures/manhattan.svg")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.
QQ Plot
A QQ plot compares observed p-values against the expected uniform distribution. Points should fall along the diagonal if there is no systematic inflation. Deviation from the diagonal at the tail indicates true associations; deviation across the whole range suggests population stratification or other confounding.
# Check for inflation in p-values
let pvalues = col(gwas, "pvalue") |> collect()
qq_plot(pvalues, title: "QQ Plot — Observed vs Expected")
The qq_plot() function takes a list of p-values (not a table), sorts them, computes expected quantiles, and plots observed vs expected on a -log10 scale.
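To make the sort-and-compare step concrete, here is a Python sketch of how QQ points can be computed. It assumes the common plotting position (i - 0.5) / n for the expected quantiles; the exact formula BioLang uses is not stated here, and `qq_points` is a hypothetical helper name.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs, both on the -log10 scale."""
    n = len(pvalues)
    observed = sorted(pvalues)            # smallest p-value first
    points = []
    for i, p in enumerate(observed, start=1):
        expected_p = (i - 0.5) / n        # assumed plotting position under the uniform null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Made-up p-values for illustration only.
pts = qq_points([0.5, 0.01, 0.2, 0.9, 0.04])
for expected, observed in pts:
    print(round(expected, 3), round(observed, 3))
```

Points where the observed value is well above the expected value sit above the diagonal.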
Expression Visualization
Gene expression experiments produce continuous measurements across conditions. Violin plots show the full distribution shape, density plots smooth out individual observations, PCA reveals sample clustering, and clustered heatmaps show both gene and sample groupings.
Violin Plot
A violin plot combines a box plot with a kernel density estimate, showing the full shape of the data distribution in each group.
let groups = {
control: [5.2, 4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.4],
low_dose: [6.5, 7.1, 6.8, 6.3, 7.0, 6.6, 6.9, 7.2],
high_dose: [9.2, 8.8, 9.5, 9.0, 8.6, 9.3, 8.9, 9.1]
}
violin(groups, title: "Expression by Treatment Group")
The violin() function takes a record where each key is a group name and each value is a list of numbers. It renders mirrored kernel density estimates for each group.
Density Plot
A density plot is a smoothed histogram, useful for seeing the overall shape of a distribution without binning artifacts.
let values = [2.1, 3.5, 4.2, 5.8, 6.1, 7.3, 3.8, 5.5, 4.9, 6.7, 3.2, 5.1, 4.5, 6.0, 7.8]
density(values, title: "Expression Density")
The density() function takes a list of numbers and uses kernel density estimation (Silverman bandwidth) to produce a smooth curve.
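The idea behind the smooth curve can be sketched in Python with a Gaussian kernel. This assumes Silverman's rule of thumb in the form h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5); which variant BioLang applies is not specified, so treat `silverman_bandwidth` and `kde` as hypothetical helpers.

```python
import math
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def kde(values, x, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = 1.0 / (len(values) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)


values = [2.1, 3.5, 4.2, 5.8, 6.1, 7.3, 3.8, 5.5, 4.9, 6.7, 3.2, 5.1, 4.5, 6.0, 7.8]
h = silverman_bandwidth(values)

STEP = 0.1
grid = [-5 + i * STEP for i in range(201)]   # evaluation points from -5 to 15
curve = [kde(values, x, h) for x in grid]

print("bandwidth:", round(h, 3))
print("area under curve:", round(sum(curve) * STEP, 3))
```

A larger bandwidth gives a smoother curve; a smaller one follows individual observations more closely.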
PCA Plot
Principal component analysis reduces high-dimensional expression data to two dimensions, revealing whether samples cluster by condition, batch, or other factors.
# requires: data/expression_matrix.csv in working directory
let expr = csv("data/expression_matrix.csv")
pca_plot(expr, title: "PCA — Sample Clustering")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.
The pca_plot() function takes a numeric table (samples as rows, features as columns) and projects the data onto the first two principal components.
Clustered Heatmap
A clustered heatmap shows expression levels as colors in a grid, with hierarchical clustering applied to both rows and columns. Genes with similar expression patterns cluster together.
let matrix = csv("data/expression_matrix.csv")
clustered_heatmap(matrix, title: "Hierarchical Clustering")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.
Clinical Visualization
Clinical bioinformatics requires plots that were developed in biostatistics: survival curves for time-to-event data, ROC curves for classifier evaluation, and forest plots for meta-analysis.
Kaplan-Meier Survival Curve
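The Kaplan-Meier estimator plots the probability of survival over time. Each step down represents an event (death, relapse, progression). Censored observations (patients lost to follow-up or still event-free when the study ends) are marked on the curve but do not cause a step; they only reduce the number of patients at risk afterward.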
let survival_data = [
{time: 12, event: 1}, {time: 24, event: 1}, {time: 36, event: 0},
{time: 8, event: 1}, {time: 48, event: 0}, {time: 15, event: 1},
{time: 30, event: 0}, {time: 20, event: 1}, {time: 42, event: 0},
{time: 6, event: 1},
] |> to_table()
kaplan_meier(survival_data, title: "Overall Survival")
The kaplan_meier() function expects a table with time and event columns. event: 1 means the event occurred; event: 0 means the observation was censored.
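The estimator itself is short enough to sketch in Python. The version below is an illustration of the product-limit formula on the same ten records, not BioLang's implementation; `kaplan_meier_steps` is a hypothetical name, and exact fractions are used so the step heights are free of rounding noise.

```python
from fractions import Fraction

# Same records as the BioLang example above.
records = [
    {"time": 12, "event": 1}, {"time": 24, "event": 1}, {"time": 36, "event": 0},
    {"time": 8, "event": 1}, {"time": 48, "event": 0}, {"time": 15, "event": 1},
    {"time": 30, "event": 0}, {"time": 20, "event": 1}, {"time": 42, "event": 0},
    {"time": 6, "event": 1},
]


def kaplan_meier_steps(rows):
    """Return [(time, survival)] at each distinct event time."""
    event_times = sorted({r["time"] for r in rows if r["event"] == 1})
    survival = Fraction(1)
    steps = []
    for t in event_times:
        # Everyone with time >= t is still at risk just before t.
        at_risk = sum(1 for r in rows if r["time"] >= t)
        events = sum(1 for r in rows if r["time"] == t and r["event"] == 1)
        survival *= Fraction(at_risk - events, at_risk)
        steps.append((t, survival))
    return steps


steps = kaplan_meier_steps(records)
# Median survival: first time at which the curve reaches 0.5 or below.
median = next((t for t, s in steps if s <= Fraction(1, 2)), None)

for t, s in steps:
    print(t, float(s))
print("median survival time:", median)
```

With the example data the curve steps down to 0.4 at time 24 and stays flat afterward, because the four remaining observations are all censored.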
ROC Curve
A receiver operating characteristic (ROC) curve evaluates binary classifiers by plotting the true positive rate against the false positive rate at every threshold. The area under the curve (AUC) summarizes overall performance — 0.5 is random guessing, 1.0 is perfect classification.
let predictions = [
{score: 0.9, label: 1}, {score: 0.8, label: 1}, {score: 0.7, label: 0},
{score: 0.6, label: 1}, {score: 0.5, label: 0}, {score: 0.4, label: 0},
{score: 0.3, label: 0}, {score: 0.2, label: 1}, {score: 0.1, label: 0},
] |> to_table()
roc_curve(predictions, title: "Classifier Performance")
The roc_curve() function takes a table with score (predicted probability) and label (0 or 1) columns. It computes and displays the AUC.
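AUC has a useful interpretation: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative. The Python sketch below computes it that way for the same nine predictions. It is a brute-force illustration, not necessarily how BioLang computes the value, and `auc` is a hypothetical helper.

```python
# Same (score, label) pairs as the BioLang example above.
predictions = [
    (0.9, 1), (0.8, 1), (0.7, 0),
    (0.6, 1), (0.5, 0), (0.4, 0),
    (0.3, 0), (0.2, 1), (0.1, 0),
]


def auc(pairs):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [score for score, label in pairs if label == 1]
    negatives = [score for score, label in pairs if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print("AUC:", auc(predictions))
```

For the example table, 15 of the 20 positive/negative pairs are ordered correctly, giving an AUC of 0.75.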
Forest Plot
A forest plot displays effect sizes and confidence intervals from multiple studies, used in meta-analysis to visualize whether results are consistent across studies.
let studies = [
{study: "Smith 2020", effect: 1.5, ci_lower: 1.1, ci_upper: 2.0},
{study: "Jones 2021", effect: 1.8, ci_lower: 1.3, ci_upper: 2.5},
{study: "Chen 2022", effect: 1.2, ci_lower: 0.8, ci_upper: 1.8},
{study: "Patel 2023", effect: 1.6, ci_lower: 1.2, ci_upper: 2.1},
] |> to_table()
forest_plot(studies, title: "Meta-Analysis: Gene X Association")
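The forest_plot() function expects columns study, effect, ci_lower, and ci_upper. Each study is shown as a point with horizontal whiskers for the confidence interval. A vertical line at effect = 1.0 marks the null value for ratio measures such as odds ratios or hazard ratios; a confidence interval that crosses this line is consistent with no effect. In the example, only Chen 2022 crosses it.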
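A forest plot is usually read alongside a pooled estimate. The Python sketch below shows one common way to compute it: a fixed-effect, inverse-variance average on the log scale. It rests on assumptions that the example does not state, namely that the effects are ratio measures and that the intervals are 95% confidence intervals; nothing here implies that forest_plot() computes a pooled value, and `pooled_fixed_effect` is a hypothetical name.

```python
import math

# Same studies as the BioLang example: (study, effect, ci_lower, ci_upper).
studies = [
    ("Smith 2020", 1.5, 1.1, 2.0),
    ("Jones 2021", 1.8, 1.3, 2.5),
    ("Chen 2022", 1.2, 0.8, 1.8),
    ("Patel 2023", 1.6, 1.2, 2.1),
]

Z_95 = 1.96  # assumption: the intervals are 95% confidence intervals


def pooled_fixed_effect(rows):
    """Inverse-variance weighted mean of log effects, back-transformed."""
    weighted_sum = 0.0
    weight_total = 0.0
    for _name, effect, lower, upper in rows:
        log_effect = math.log(effect)
        # Recover the standard error from the width of the CI on the log scale.
        se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
        weight = 1.0 / se ** 2
        weighted_sum += weight * log_effect
        weight_total += weight
    pooled_log = weighted_sum / weight_total
    pooled_se = math.sqrt(1.0 / weight_total)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )


pooled, ci_low, ci_high = pooled_fixed_effect(studies)
print(round(pooled, 2), round(ci_low, 2), round(ci_high, 2))
```

Studies with narrower intervals receive more weight, so the pooled interval is tighter than any single study's interval.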
Genomic Structure Visualization
Genomics often requires viewing data in the context of chromosome structure. Ideograms show banding patterns, circos plots present genome-wide data in a circular layout, and lollipop plots mark mutation positions along a protein or gene.
Ideogram
An ideogram draws a schematic chromosome with cytogenetic banding. Bands are colored by Giemsa staining intensity, giving a bird’s-eye view of chromosome structure.
let bands = [
{chrom: "chr17", start: 0, end: 25000000, band: "p13.3", stain: "gneg"},
{chrom: "chr17", start: 25000000, end: 43000000, band: "p11.2", stain: "gpos50"},
{chrom: "chr17", start: 43000000, end: 83257441, band: "q25.3", stain: "gneg"},
] |> to_table()
ideogram(bands, title: "Chromosome 17")
The ideogram() function expects columns chrom, start, end, band, and stain. Stain values follow cytogenetic conventions: gneg (light), gpos25/gpos50/gpos75/gpos100 (increasingly dark), acen (centromere), gvar (variable).
Circos Plot
A circos plot arranges chromosomes in a circle and draws data tracks on the inside or outside. It is particularly useful for showing structural variants, translocations, or genome-wide trends.
let data = [
{chrom: "chr1", start: 1000000, end: 2000000, value: 3.5},
{chrom: "chr2", start: 500000, end: 1500000, value: 2.8},
{chrom: "chr3", start: 2000000, end: 3000000, value: 4.1},
] |> to_table()
circos(data, title: "Genome-Wide View")
The circos() function takes a table with chrom, start, end, and value columns. In ASCII mode, it renders a simplified circular representation. In SVG mode, it produces a full circular plot.
Lollipop Plot
A lollipop plot shows mutation positions along a gene or protein sequence as vertical stems topped with circles. The height or size of each circle represents mutation frequency.
let mutations = [
{position: 248, count: 45, label: "R248W"},
{position: 273, count: 38, label: "R273H"},
{position: 175, count: 30, label: "R175H"},
{position: 245, count: 25, label: "G245S"},
{position: 282, count: 18, label: "R282W"},
] |> to_table()
lollipop(mutations, title: "TP53 Hotspot Mutations")
The lollipop() function expects position and count columns. An optional label column adds text annotations at each position.
Sequence Visualization
Sequence Logo
A sequence logo shows the information content at each position in a set of aligned sequences. Tall letters indicate highly conserved positions; short letters indicate variable positions. This is the standard way to visualize transcription factor binding motifs, splice sites, and other sequence features.
let sequences = [
"TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
"TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
]
sequence_logo(sequences, title: "TATA Box Motif")
The sequence_logo() function takes a list of equal-length strings and computes the information content (bits) at each position.
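The information content at a position is the maximum possible entropy (2 bits for DNA) minus the observed entropy. The Python sketch below applies that formula to the same eight sequences. It omits the small-sample correction that some logo tools apply, and `information_content` is a hypothetical helper, not a BioLang function.

```python
import math
from collections import Counter

# Same aligned sequences as the BioLang example above.
sequences = [
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
    "TATAAAGC", "TATAATGC", "TATAAAGC", "TATAATGC",
]


def information_content(seqs, alphabet_size=4):
    """Bits per column: log2(alphabet_size) minus the column's Shannon entropy."""
    max_bits = math.log2(alphabet_size)
    bits = []
    for i in range(len(seqs[0])):
        counts = Counter(seq[i] for seq in seqs)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        bits.append(max_bits - entropy)
    return bits


for position, b in enumerate(information_content(sequences), start=1):
    print(position, round(b, 2))
```

In this example every position carries 2 bits except position 6, where A and T each appear half the time and the information content drops to 1 bit.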
Phylogenetic Tree
A phylogenetic tree shows evolutionary relationships between species or sequences. BioLang can render trees from Newick format strings.
let newick = "((Human:0.1,Chimp:0.12):0.08,(Mouse:0.25,Rat:0.23):0.15,Zebrafish:0.45);"
phylo_tree(newick, title: "Species Phylogeny")
The phylo_tree() function parses a Newick-format string and renders a dendrogram.
Specialized Genomic Plots
Venn Diagram
A Venn diagram shows the overlap between two or three sets. In genomics, this is commonly used to compare gene lists from different experiments, conditions, or methods.
let sets = {
"Experiment A": set(["BRCA1", "TP53", "EGFR", "MYC", "KRAS"]),
"Experiment B": set(["TP53", "EGFR", "PTEN", "RB1", "MYC"]),
"Experiment C": set(["BRCA1", "MYC", "APC", "PTEN", "TP53"]),
}
venn(sets, title: "Gene Overlap Across Experiments")
The venn() function takes a record of sets (up to 3). It computes all intersection sizes and renders the classic overlapping-circles diagram.
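The numbers inside a Venn diagram are the sizes of the exclusive regions: elements that belong to exactly one combination of sets. The Python sketch below computes them for the same three gene lists. It illustrates the set arithmetic only; `exclusive_regions` is a hypothetical helper, not part of BioLang.

```python
from itertools import combinations

# Same gene sets as the BioLang example above.
gene_sets = {
    "Experiment A": {"BRCA1", "TP53", "EGFR", "MYC", "KRAS"},
    "Experiment B": {"TP53", "EGFR", "PTEN", "RB1", "MYC"},
    "Experiment C": {"BRCA1", "MYC", "APC", "PTEN", "TP53"},
}


def exclusive_regions(named_sets):
    """Map each combination of set names to the elements found in exactly those sets."""
    names = list(named_sets)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            inside = set.intersection(*(named_sets[n] for n in combo))
            outside = set().union(*(named_sets[n] for n in names if n not in combo))
            regions[combo] = inside - outside
    return regions


regions = exclusive_regions(gene_sets)
for combo, genes in regions.items():
    print(" & ".join(combo), "->", sorted(genes))
```

Here TP53 and MYC fall in the three-way intersection, and each of the other six regions holds a single gene. The same region sizes are what an UpSet plot draws as bars.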
UpSet Plot
When you have more than three sets, Venn diagrams become unreadable. UpSet plots show set intersections as a matrix with connected dots, with bar charts showing intersection sizes. They scale to dozens of sets.
upset(sets, title: "Set Intersections")
The upset() function takes the same input as venn() but is designed for any number of sets.
Oncoprint
An oncoprint shows the mutation landscape of a cancer cohort. Each row is a gene, each column is a sample, and colored tiles indicate mutation types (missense, nonsense, amplification, deletion). This is the standard visualization for cancer genomics studies.
let mutations_matrix = [
{gene: "TP53", sample1: "Missense", sample2: "Nonsense", sample3: "None", sample4: "Missense"},
{gene: "KRAS", sample1: "None", sample2: "Missense", sample3: "Missense", sample4: "None"},
{gene: "EGFR", sample1: "Amplification", sample2: "None", sample3: "None", sample4: "Deletion"},
] |> to_table()
oncoprint(mutations_matrix, title: "Mutation Landscape")
RNA-seq Specific Plots
Sashimi Plot
A sashimi plot shows RNA-seq splice junctions as arcs connecting exon positions, with read counts on each arc. It is used to identify alternative splicing events and quantify their usage.
let junctions = [
{chrom: "chr17", start: 43100000, end: 43105000, count: 25},
{chrom: "chr17", start: 43105000, end: 43110000, count: 18},
{chrom: "chr17", start: 43100000, end: 43110000, count: 5},
] |> to_table()
sashimi(junctions, title: "Splice Junctions — BRCA1")
HiC Contact Map
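A HiC contact map shows chromatin interaction frequencies as a heatmap. The strongest contacts lie along the diagonal, because loci that are close together on the chromosome interact most often. Topologically associating domains (TADs) appear as squares of enriched contact along the diagonal, or as triangles when only half of the symmetric matrix is drawn.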
let contacts = [
[100, 50, 20, 5],
[50, 100, 40, 10],
[20, 40, 100, 30],
[5, 10, 30, 100],
]
hic_map(contacts, title: "Chromatin Contacts")
The hic_map() function takes a nested list (symmetric matrix) of contact frequencies.
Additional Genomic Plots
CNV Plot
A copy number variation plot shows log2 ratios across genomic positions. Segments above zero indicate gains (amplifications); segments below zero indicate losses (deletions).
let cnv_data = [
{chrom: "chr1", start: 1000000, end: 5000000, log2ratio: 0.5},
{chrom: "chr1", start: 5000000, end: 10000000, log2ratio: -0.8},
{chrom: "chr2", start: 2000000, end: 8000000, log2ratio: 1.2},
{chrom: "chr3", start: 1000000, end: 6000000, log2ratio: -0.3},
] |> to_table()
cnv_plot(cnv_data, title: "Copy Number Alterations")
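To read the log2 ratios in the cnv_data table above as copy numbers, one common conversion is copy number = 2 * 2^log2ratio. The Python sketch below applies it under two assumptions that real tumor samples often violate: a diploid baseline and a pure sample. `estimated_copy_number` is a hypothetical helper.

```python
# Same segments as the BioLang example: (chrom, start, end, log2ratio).
segments = [
    ("chr1", 1_000_000, 5_000_000, 0.5),
    ("chr1", 5_000_000, 10_000_000, -0.8),
    ("chr2", 2_000_000, 8_000_000, 1.2),
    ("chr3", 1_000_000, 6_000_000, -0.3),
]


def estimated_copy_number(log2ratio, ploidy=2):
    """Convert a log2 ratio to an absolute copy number (assumes a pure, diploid sample)."""
    return ploidy * 2 ** log2ratio


for chrom, start, end, ratio in segments:
    state = "gain" if ratio > 0 else "loss" if ratio < 0 else "neutral"
    print(chrom, start, end, round(estimated_copy_number(ratio), 2), state)
```

A log2 ratio of 0 corresponds to two copies, +1 to four copies, and -1 to a single copy.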
Rainfall Plot
A rainfall plot shows inter-mutation distances on a log scale, revealing clusters of mutations (kataegis) as downward-pointing streaks.
let mutation_positions = [
{chrom: "chr1", pos: 100000},
{chrom: "chr1", pos: 100050},
{chrom: "chr1", pos: 100120},
{chrom: "chr1", pos: 500000},
{chrom: "chr2", pos: 200000},
{chrom: "chr2", pos: 800000},
] |> to_table()
rainfall(mutation_positions, title: "Mutation Clustering")
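The quantity on the y-axis of a rainfall plot is the distance from each mutation to the previous one on the same chromosome. The Python sketch below computes those distances for the same six positions. It is an illustration of the calculation, not BioLang's implementation, and `inter_mutation_distances` is a hypothetical name.

```python
import math
from collections import defaultdict

# Same positions as the BioLang example above.
mutation_positions = [
    {"chrom": "chr1", "pos": 100000},
    {"chrom": "chr1", "pos": 100050},
    {"chrom": "chr1", "pos": 100120},
    {"chrom": "chr1", "pos": 500000},
    {"chrom": "chr2", "pos": 200000},
    {"chrom": "chr2", "pos": 800000},
]


def inter_mutation_distances(rows):
    """Return (chrom, pos, distance to previous mutation on the same chromosome)."""
    by_chrom = defaultdict(list)
    for r in rows:
        by_chrom[r["chrom"]].append(r["pos"])
    result = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        # The first mutation on each chromosome has no predecessor and is skipped.
        for previous, current in zip(positions, positions[1:]):
            result.append((chrom, current, current - previous))
    return result


distances = inter_mutation_distances(mutation_positions)
for chrom, pos, dist in distances:
    print(chrom, pos, dist, round(math.log10(dist), 2))
```

The first two chr1 distances (50 and 70 bp) are orders of magnitude smaller than the rest, which is the kind of tight cluster that shows up as a downward streak.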
Saving and Exporting
All bio visualization functions support two output modes:
- ASCII (default): Prints a text-based rendering to the terminal, useful for quick inspection in a REPL or pipeline
- SVG (format: "svg"): Returns an SVG string for publication-quality figures
# ASCII output — prints directly to terminal
manhattan(gwas, title: "Quick Look")
# SVG output — returns a string
let svg = manhattan(gwas, format: "svg", title: "GWAS Results")
save_svg(svg, "figures/manhattan.svg")
# save_plot is an alias for save_svg
save_plot(violin(groups, format: "svg"), "figures/violin.svg")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.
The SVG output is designed for journal submission: clean lines, proper labels, and a white background. You can preview the SVG in any browser, or open it in Inkscape or Illustrator for further editing.
Bio Plot Reference Table
| Plot | Function | Data Input | Use Case |
|---|---|---|---|
| Manhattan | manhattan() | Table: chrom, pos, pvalue | GWAS significance |
| QQ plot | qq_plot() | List of p-values | P-value inflation check |
| Violin | violin() | Record of named lists | Distribution comparison |
| Density | density() | List of values | Smooth distribution |
| Kaplan-Meier | kaplan_meier() | Table: time, event | Survival analysis |
| ROC | roc_curve() | Table: score, label | Classifier evaluation |
| Forest | forest_plot() | Table: study, effect, ci_lower, ci_upper | Meta-analysis |
| Ideogram | ideogram() | Table: chrom, start, end, band, stain | Chromosome view |
| Circos | circos() | Table: chrom, start, end, value | Genome-wide circular |
| Lollipop | lollipop() | Table: position, count | Mutation hotspots |
| Sequence logo | sequence_logo() | List of equal-length strings | Motif conservation |
| Phylo tree | phylo_tree() | Newick string | Evolutionary relationships |
| Venn | venn() | Record of sets | Set overlap (2-3 sets) |
| UpSet | upset() | Record of sets | Set overlap (many sets) |
| Oncoprint | oncoprint() | Table: gene, sample columns | Mutation landscape |
| Sashimi | sashimi() | Table: chrom, start, end, count | Splice junctions |
| HiC | hic_map() | Nested list (matrix) | Chromatin contacts |
| CNV | cnv_plot() | Table: chrom, start, end, log2ratio | Copy number |
| Rainfall | rainfall() | Table: chrom, pos | Mutation clustering |
| PCA | pca_plot() | Table (samples x features) | Dimensionality reduction |
| Clustered heatmap | clustered_heatmap() | Table (matrix) | Hierarchical clustering |
Exercises
- Manhattan plot: Load data/gwas.csv, create a Manhattan plot, and identify which chromosome has the most significant hit (the lowest p-value).
- Survival comparison: Create two Kaplan-Meier curves, one for a treatment group and one for a control group, and observe the difference in median survival time.
- Sequence logo: Create a list of 10 aligned 8-mer sequences around a TATA box motif (positions should be mostly T-A-T-A-A-A with some variation at positions 5-8). Generate a sequence logo and identify which positions are most conserved.
- Gene list overlap: Create three gene lists (at least 5 genes each) with partial overlap. Use venn() to visualize the overlaps, then use upset() on the same data and compare the two views.
- Mutation hotspots: Build a lollipop plot showing at least 6 mutation positions in TP53. Include real hotspot names (R175H, G245S, R248W, R273H, R282W, Y220C).
Key Takeaways
- BioLang has 21 specialized bio visualization functions, each designed for a specific biological question
- GWAS: manhattan() for genome-wide significance, qq_plot() for inflation diagnostics
- Expression: violin() for distributions, pca_plot() for sample clustering, clustered_heatmap() for pattern discovery
- Clinical: kaplan_meier() for survival, roc_curve() for classifier evaluation, forest_plot() for meta-analysis
- Genomic structure: ideogram() for chromosomes, circos() for genome-wide circular views, lollipop() for mutation positions
- Sequence: sequence_logo() for motifs, phylo_tree() for evolution
- All bio plots support ASCII (terminal) and SVG (publication) output
- Use save_svg() or save_plot() to export publication-quality figures
- Choose the plot that matches your data type and biological question
What’s Next
Tomorrow: Multi-Species Comparison — fetching orthologs, comparing sequences across species, and visualizing conservation patterns.