Introduction
BioLang is a domain-specific language built for bioinformatics. It is not a general-purpose language that happens to have biology libraries — every feature, from its type system to its operators, exists because bioinformatics workflows demand it.
Why a DSL?
Bioinformaticians spend most of their time gluing tools together: reading FASTA files, filtering variants, piping data through transformations, querying biological databases, and building reproducible pipelines. General-purpose languages require dozens of imports, boilerplate, and framework knowledge before you can write your first useful script.
BioLang eliminates that friction. A complete variant filtering workflow:
read_vcf("variants.vcf")
|> filter(|v| v.qual > 30)
|> filter(|v| is_snp(v))
|> filter(|v| is_transition(v))
|> map(|v| {chrom: v.chrom, pos: v.pos, ref: v.ref_allele, alt: v.alt_allele, qual: v.qual})
|> write_csv("filtered_transitions.csv")
No imports. No boilerplate. No framework. DNA, RNA, proteins, intervals, variants, and quality scores are first-class types — not strings pretending to be biology.
Design Principles
Pipe-first. The |> operator is the backbone of BioLang. Bioinformatics is inherently a series of transformations: raw reads become aligned reads become variant calls become annotated reports. Pipes make this natural:
samples
|> par_map(|s| read_fastq(s.path))
|> map(|reads| filter(reads, |r| mean(r.quality) > 20))
|> map(|reads| gc_content(reads))
|> summarize(mean_gc: mean)
Biology is built in. DNA, RNA, protein, and quality score literals are part of the syntax. Genomic intervals and variants have dedicated types with domain-specific methods. You do not need to install a package to reverse-complement a sequence:
let seq = dna"ATCGATCGATCG"
let rc = reverse_complement(seq) # dna"CGATCGATCGAT"
let rna = transcribe(seq) # rna"AUCGAUCGAUCG"
let protein = translate(rna) # protein"IDRS"
Tables are native. Bioinformatics lives in tables — sample sheets, count matrices, variant lists, BED files. BioLang tables support column operations, grouping, summarization, and joins without any library:
let counts = csv("gene_counts.csv")
counts
|> mutate(log_count: |row| log2(row.count + 1))
|> group_by("sample")
|> summarize(mean_expr: |g| mean(g.log_count), n_genes: |g| len(g))
Twelve bio databases at your fingertips. NCBI, Ensembl, UniProt, UCSC, KEGG, STRING, PDB, Reactome, Gene Ontology, COSMIC, BioMart, and NCBI Datasets are built-in — no API keys required for most (NCBI key optional for higher rate limits, COSMIC key required):
let gene = ncbi_gene("TP53")
let pathways = reactome_pathways("TP53")
let interactions = string_network("TP53", species: 9606)
Pipelines are a language construct. Not a separate workflow engine — pipelines with stages, parallel execution, and dependency tracking are part of the language itself:
pipeline qc_pipeline {
stage read_data {
read_fastq("sample.fq")
}
stage filter_reads {
read_data.result |> filter(|r| mean(r.quality) >= 25)
}
stage report {
let gc = filter_reads.result |> map(|r| gc_content(r.seq))
{mean_gc: mean(gc), n_reads: len(filter_reads.result)}
}
}
Streams handle scale. File readers return lazy streams. Process a 100 GB FASTQ without loading it into memory. Parallel maps distribute work across cores:
read_fastq("massive_file.fq")
|> filter(|r| mean(r.quality) > 25)
|> par_map(|r| gc_content(r.seq))
|> summarize(mean_gc: mean, n_reads: len)
Who This Book Is For
This book is for bioinformaticians, computational biologists, and genomics researchers who want to write analysis scripts faster and more clearly. You do not need deep programming experience — if you have used R, Python, or shell scripts for bioinformatics, you will feel at home.
Every example in this book uses real biological data and real analysis tasks. There are no toy examples. By the end, you will be able to:
- Process sequencing data (FASTA, FASTQ, BAM, VCF, BED, GFF)
- Query biological databases directly from your scripts
- Build reproducible analysis pipelines
- Perform statistical analysis on expression and variant data
- Scale to large datasets using streams and parallel operations
- Extend BioLang with plugins in Python, R, or TypeScript
Running the Examples
Install BioLang:
cargo install biolang
Or build from source:
git clone https://github.com/oriclabs/biolang
cd biolang
cargo build --release
Run any example from this book:
bl run example.bl
Or use the interactive REPL:
bl repl
Every code block in this book is valid BioLang. Copy, paste, run.
Sample Data
Many examples in this book reference files like contigs.fa, reads.fq, calls.vcf,
and gene_counts.csv. BioLang ships with sample data for all of these in the
examples/sample-data/ directory:
examples/sample-data/
contigs.fa 4 assembled contigs (FASTA)
reads.fq 8 sequencing reads with quality scores (FASTQ)
calls.vcf 12 variants across 5 chromosomes (VCF)
promoters.bed 10 promoter regions (BED)
chip_peaks.bed 7 ChIP-seq peaks (BED)
annotations.gff 16 gene features for 6 genes (GFF3)
samples.csv 6-sample sheet with conditions and batches (CSV)
gene_counts.csv 15 cancer genes × 6 samples expression matrix (CSV)
counts.tsv 8 genomic region read counts (TSV)
data.sam 6 SAM alignment records with headers (SAM)
To run the quickstart script that exercises all sample data:
bl run examples/quickstart.bl
When a book example references a bare filename like read_fasta("contigs.fa"),
you can substitute the sample data path:
let sequences = read_fasta("examples/sample-data/contigs.fa")
Chapter 1: Getting Started
BioLang is a pipe-first domain-specific language built for bioinformatics workflows. This chapter walks you through installation, the interactive REPL, running scripts, and writing your first real analysis.
Installation
From crates.io
cargo install biolang
This installs the bl binary, which provides both the REPL and the script runner.
From source
git clone https://github.com/oriclabs/biolang.git
cd biolang
cargo build --release
cp target/release/bl ~/.local/bin/
Verify installation
bl --version
Updating
BioLang has built-in update checking. Run bl version to see the current version
and check if a newer release is available:
bl version
# BioLang v0.1.0
# Checking for updates... up to date.
To upgrade to the latest release:
bl upgrade
This downloads the correct binary for your platform from GitHub Releases and
replaces the current bl executable.
BioLang also checks for updates automatically in the background when you run
bl run or bl repl. If a newer version is available, a one-line notice
appears on stderr. This check runs at most once per 24 hours and never blocks
startup. Disable it with:
export BIOLANG_NO_UPDATE_CHECK=1
The REPL
Launch the interactive REPL:
bl
You will see the BioLang prompt:
BioLang v0.1.0
Type :help for commands, :quit to exit.
bl>
Try evaluating a bio literal directly:
bl> dna"ATCGATCG" |> gc_content()
0.5
REPL Commands
The REPL supports several meta-commands, all prefixed with :.
:env – Inspect current bindings
bl> let ref_genome = "GRCh38"
bl> let min_mapq = 30
bl> :env
ref_genome : Str = "GRCh38"
min_mapq : Int = 30
:reset – Clear all bindings
bl> :reset
Environment cleared.
:load and :save – Session persistence
Load a script into the current session, executing every statement:
bl> :load preprocessing.bl
Loaded 42 bindings from preprocessing.bl
Save the current session bindings to a file:
bl> :save my_session.bl
Saved 12 bindings to my_session.bl
:time – Benchmark an expression
bl> :time read_fastq("sample_R1.fastq.gz") |> filter(|r| mean(r.quality) >= 30) |> len()
Result: 1847293
Elapsed: 4.38s
:type – Check the type of an expression
bl> :type dna"ATCG"
DNA
bl> :type {chrom: "chr1", start: 100, end: 200}
Record{chrom: Str, start: Int, end: Int}
:plugins – List available plugins
bl> :plugins
fastq read_fastq, write_fastq
fasta read_fasta, write_fasta
sam read_sam, read_bam
vcf read_vcf, write_vcf
bed read_bed, write_bed
table csv, tsv, write_csv
:profile – Profile an expression
bl> :profile read_fasta("reference.fa") |> filter(|r| seq_len(r.seq) > 1000) |> len()
Total: 2.14s
read: 1.87s (87.4%)
filter: 0.26s (12.1%)
len: 0.01s (0.5%)
Result: 24891
Running Scripts
BioLang scripts use the .bl extension. Run a script with:
bl run gc_analysis.bl
Pass arguments to a script:
bl run qc_report.bl -- --input sample.fastq.gz --min-quality 20
Arguments are available inside the script via the args record:
# qc_report.bl
let input_file = args.input
let min_qual = into(args.min_quality ?? "20", "Int")
let reads = read_fastq(input_file)
|> filter(|r| mean(r.quality) >= min_qual)
print(f"Passing reads: {len(reads)}")
print(f"Mean quality: {reads |> map(|r| mean(r.quality)) |> mean()}")
Your First Script: FASTA GC Content Analyzer
BioLang includes sample data in examples/sample-data/ — see the
Introduction for the full list. The script
below uses examples/sample-data/contigs.fa.
Create a file called gc_scan.bl:
# gc_scan.bl
# Read a FASTA file, compute per-sequence GC content, report statistics.
let sequences = read_fasta("examples/sample-data/contigs.fa")
# Compute GC content for each sequence
let gc_table = sequences
|> map(|seq| {
name: seq.id,
length: seq_len(seq.seq),
gc: gc_content(seq.seq)
})
|> table()
# Summary statistics
let gc_vals = gc_table |> select("gc")
let mean_gc = mean(gc_vals)
let std_gc = stdev(gc_vals)
let min_gc = min(gc_vals)
let max_gc = max(gc_vals)
let n_seqs = len(gc_vals)
print(f"Analyzed {n_seqs} sequences")
print(f"GC content: {mean_gc:.3f} (range: {min_gc:.3f} - {max_gc:.3f})")
print(f"Standard deviation: {std_gc:.4f}")
# Flag outlier contigs (GC > 2 std devs from mean)
let outliers = gc_table
|> filter(|row| abs(row.gc - mean_gc) > 2.0 * std_gc)
|> sort("gc", descending: true)
print(f"\nOutlier contigs ({len(outliers)}):")
outliers |> each(|row| print(f" {row.name}: GC={row.gc:.3f}, length={row.length}"))
Run it:
bl run gc_scan.bl
Example output:
Analyzed 847 sequences
GC content: 0.412 (range: 0.198 - 0.687)
Standard deviation: 0.0531
Outlier contigs (12):
contig_441: GC=0.687, length=3421
contig_002: GC=0.621, length=15789
...
Project Structure
Initialize a BioLang project:
bl init my-rnaseq-pipeline
cd my-rnaseq-pipeline
This creates the following structure:
my-rnaseq-pipeline/
.biolang/
config.yaml # project configuration
plugins/ # local plugin overrides
src/
main.bl # entry point
data/ # input data directory
results/ # output directory
.biolang/config.yaml
name: my-rnaseq-pipeline
version: 0.1.0
entry: src/main.bl
paths:
data: ./data
results: ./results
reference: /shared/references/GRCh38
defaults:
min_quality: 30
threads: 8
Reference the project paths from your scripts, for example via an imported module:
# src/main.bl
# Access project paths via import
import "src/paths.bl" as paths
let min_qual = 30
read_fastq(f"{paths.data}/sample_R1.fastq.gz")
|> filter(|r| mean(r.quality) >= min_qual)
|> write_fastq(f"{paths.results}/filtered_R1.fastq.gz")
Multi-file projects
Use import to split your pipeline across files:
# src/main.bl
import "src/qc.bl" as qc
import "src/alignment.bl" as align
import "src/variant_calling.bl" as vc
let samples = csv("data/sample_sheet.csv")
samples |> each(|sample| {
let cleaned = qc.run(sample.fastq_r1, sample.fastq_r2)
let bam = align.run(cleaned.r1, cleaned.r2, sample.reference)
vc.run(bam, sample.reference, sample.sample_id)
})
# src/qc.bl
let run = |r1, r2| {
let filt_r1 = read_fastq(r1) |> filter(|r| mean(r.quality) >= 30) |> write_fastq(f"{r1}.filtered.fq.gz")
let filt_r2 = read_fastq(r2) |> filter(|r| mean(r.quality) >= 30) |> write_fastq(f"{r2}.filtered.fq.gz")
{r1: filt_r1, r2: filt_r2}
}
BIOLANG_PATH
The BIOLANG_PATH environment variable controls where BioLang searches for imported
modules and plugins. It accepts a colon-separated (or semicolon on Windows) list of
directories:
export BIOLANG_PATH="/home/user/biolang-libs:/shared/team-modules"
Resolution order for import "module.bl":
- Relative to the importing file
- Project .biolang/plugins/ directory
- Each directory in BIOLANG_PATH
- System-wide library path (~/.biolang/lib/)
This is useful for sharing utility modules across projects:
# This resolves via BIOLANG_PATH if not found locally
import "genomics_utils.bl" as gutils
let kmers = dna"ATCGATCGATCG" |> gutils.kmer_frequencies(k: 3)
print(kmers)
What’s Next
You now have BioLang installed, know how to use the REPL for interactive exploration, and can write and run scripts. In the next chapter, we will explore bio literals – the first-class DNA, RNA, protein, and quality score types that make BioLang unique for bioinformatics work.
Chapter 2: Bio Literals
BioLang has first-class types for biological sequences. These are not strings – they carry domain semantics, validate their contents, and support biological operations directly. This chapter covers DNA, RNA, Protein, and Quality literals, along with the operations that make them powerful.
DNA Literals
A DNA literal is created with the dna prefix:
let seq = dna"ATCGATCGATCG"
DNA literals accept only valid IUPAC nucleotide codes: A, T, C, G, and
ambiguity codes N, R, Y, S, W, K, M, B, D, H, V.
# Ambiguous positions are valid
let probe = dna"ATCNNGATCG"
# Case is normalized to uppercase
let lower = dna"atcgatcg"
print(lower) # => dna"ATCGATCG"
Invalid bases cause a compile-time error:
# This is a compile error -- U is not a DNA base
let bad = dna"AUCG"
# Error: invalid DNA base 'U' at position 1
RNA Literals
RNA uses the rna prefix and contains A, U, C, G:
let mrna = rna"AUGCUUAAGGCUAG"
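As with DNA, invalid bases are rejected before the script runs. A sketch; the exact error wording is illustrative:

```
# This is a compile error -- T is not an RNA base
let bad = rna"AUTG"
# Error: invalid RNA base 'T' at position 2
```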
Protein Literals
Protein sequences use standard single-letter amino acid codes:
let p53_fragment = protein"MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPS"
Protein literals validate against the 20 standard amino acids plus X (unknown),
* (stop), U (selenocysteine), and O (pyrrolysine):
let with_stop = protein"MVLSPADKTNVK*"
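Characters outside this alphabet are rejected the same way. A sketch; the exact error wording is illustrative:

```
# This is a compile error -- '7' is not an amino acid code
let bad = protein"MVLK7"
# Error: invalid amino acid '7' at position 4
```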
Quality Score Literals
Phred+33 quality scores are first-class values:
let quals = qual"IIIIIHHHGGFFEEDCBA"
Each character maps to a Phred quality score. You can work with them numerically:
let q = qual"IIIIIHHHGG"
print(mean(q)) # => 39.3
print(min(q)) # => 38 (G = 38)
print(max(q)) # => 40 (I = 40)
# Filter reads by mean quality
let reads = read_fastq("sample.fastq.gz")
let hq_reads = reads |> filter(|r| mean(r.quality) >= 30)
Sequence Operations
Complement and Reverse Complement
let forward = dna"ATCGATCG"
let comp = forward |> complement()
print(comp) # => dna"TAGCTAGC"
let rc = forward |> reverse_complement()
print(rc) # => dna"CGATCGAT"
Reverse complement is the workhorse of strand-aware bioinformatics:
# Check if a primer binds to either strand
let template = dna"ATCGATCGAATTCGCTAGC"
let primer = dna"GCTAGC"
let fwd_match = template |> find_motif(primer)
let rev_match = template |> find_motif(primer |> reverse_complement())
print(f"Forward hits: {len(fwd_match)}, Reverse hits: {len(rev_match)}")
Transcription and Translation
let gene = dna"ATGAAAGCTTTTCGATAG"
# Transcribe DNA to RNA (T -> U)
let mrna = gene |> transcribe()
print(mrna) # => rna"AUGAAAGCUUUUCGAUAG"
# Translate RNA (or DNA) to protein
let protein_seq = gene |> translate()
print(protein_seq) # => protein"MKAFR*"
# Translate with a different codon table (e.g., mitochondrial)
let mito_protein = gene |> translate(table: 2)
GC Content
let contig = dna"ATCGATCGGGCCCATATATGCGCGC"
let gc = contig |> gc_content()
print(f"GC: {gc:.2%}") # => GC: 60.00%
# Sliding window GC content for detecting isochores
let gc_windows = contig |> window(size: 10, step: 5) |> map(|w| gc_content(w))
print(gc_windows)
Subsequences with slice and len
let genome_fragment = dna"ATCGATCGATCGATCGATCG"
print(seq_len(genome_fragment)) # => 20
print(genome_fragment |> slice(0, 6)) # => dna"ATCGAT"
# Extract a coding region by coordinates
let cds_start = 3
let cds_end = 15
let cds = genome_fragment |> slice(cds_start, cds_end)
Motif Searching
find_motif returns a list of match positions:
let seq = dna"ATCGAATTCGATCGAATTCG"
# Find EcoRI recognition site
let ecori_sites = seq |> find_motif(dna"GAATTC")
print(ecori_sites) # => [3, 13]
# Regex search on sequences (returns match records)
let matches = seq |> regex_find("GA[AT]TTC")
matches |> each(|m| print(f"Match at {m.start}: {m.text}"))
Example: Find All ORFs in a DNA Sequence
An open reading frame (ORF) starts with ATG and ends at the first in-frame stop codon (TAA, TAG, TGA). This script finds all ORFs in all six reading frames.
# orf_finder.bl
# Find all open reading frames in a FASTA sequence.
let find_orfs = |seq, min_length| {
let orfs = []
let codons = seq |> chunk(3)
let in_orf = false
let orf_start = 0
codons |> enumerate() |> each(|i, codon| {
if !in_orf && codon == dna"ATG" then {
in_orf = true
orf_start = i * 3
}
if in_orf && (codon == dna"TAA" || codon == dna"TAG" || codon == dna"TGA") then {
let orf_end = (i + 1) * 3
let orf_len = orf_end - orf_start
if orf_len >= min_length then {
orfs = orfs ++ [{start: orf_start, end: orf_end, length: orf_len, seq: seq |> slice(orf_start, orf_end)}]
}
in_orf = false
}
})
orfs
}
let sequences = read_fasta("contigs.fa")
# Search all 6 reading frames
let all_orfs = sequences |> flat_map(|entry| {
let fwd = entry.seq
let rev = entry.seq |> reverse_complement()
0..3 |> flat_map(|frame| {
let fwd_orfs = fwd |> slice(frame, seq_len(fwd)) |> find_orfs(300)
|> map(|orf| {
...orf,
seq_id: entry.id,
strand: "+",
frame: frame,
start: orf.start + frame,
end: orf.end + frame
})
# Coordinates for rev_orfs are relative to the reverse-complement strand
let rev_orfs = rev |> slice(frame, seq_len(rev)) |> find_orfs(300)
|> map(|orf| {
...orf,
seq_id: entry.id,
strand: "-",
frame: frame
})
fwd_orfs ++ rev_orfs
})
})
print(f"Found {len(all_orfs)} ORFs (>= 300 bp)")
all_orfs
|> sort("length", descending: true)
|> take(20)
|> each(|orf| print(f" {orf.seq_id} {orf.strand}:{orf.start}-{orf.end} ({orf.length} bp)"))
Example: Codon Usage Table
Given a coding sequence, calculate the frequency of each codon and display the codon usage bias.
# codon_usage.bl
# Build a codon usage table from a coding sequence.
let cds_records = read_fasta("cds_sequences.fa")
# Aggregate codons across all coding sequences
let codon_counts = cds_records
|> flat_map(|rec| rec.seq |> chunk(3))
|> filter(|codon| seq_len(codon) == 3)
|> group_by(|codon| to_string(codon))
|> map(|group| {codon: group.key, count: len(group.values)})
let total = codon_counts |> map(|c| c.count) |> sum()
let usage_table = codon_counts
|> map(|c| {
...c,
amino_acid: into(c.codon, "DNA") |> translate() |> to_string(),
frequency: c.count / total,
per_thousand: (c.count / total) * 1000.0
})
|> sort("amino_acid")
print(f"Codon usage from {len(cds_records)} sequences ({total} codons)\n")
print("Codon AA Count Freq /1000")
print("----- -- -------- ----- -----")
usage_table |> each(|row|
print(f"{row.codon} {row.amino_acid} {row.count:>8} {row.frequency:.4f} {row.per_thousand:.1f}")
)
# Identify rare codons (< 10 per thousand)
let rare = usage_table |> filter(|c| c.per_thousand < 10.0)
print(f"\nRare codons ({len(rare)}):")
rare |> each(|c| print(f" {c.codon} ({c.amino_acid}): {c.per_thousand:.1f}/1000"))
Example: Restriction Enzyme Site Finder
Find all restriction enzyme recognition sites in a sequence and predict fragment sizes for a virtual digest.
# digest.bl
# Virtual restriction digest of a DNA sequence.
# Common restriction enzymes: name -> recognition site
let enzymes = [
{name: "EcoRI", site: dna"GAATTC", cut_offset: 1},
{name: "BamHI", site: dna"GGATCC", cut_offset: 1},
{name: "HindIII", site: dna"AAGCTT", cut_offset: 1},
{name: "NotI", site: dna"GCGGCCGC", cut_offset: 2},
{name: "XhoI", site: dna"CTCGAG", cut_offset: 1},
{name: "SalI", site: dna"GTCGAC", cut_offset: 1},
{name: "NcoI", site: dna"CCATGG", cut_offset: 1},
{name: "PstI", site: dna"CTGCAG", cut_offset: 5}
]
let entry = read_fasta("plasmid.fa") |> first()
let seq = entry.seq
print(f"Sequence length: {seq_len(seq)} bp\n")
# Find all cut sites for each enzyme
let results = enzymes |> map(|enzyme| {
let sites = seq |> find_motif(enzyme.site)
let cut_positions = sites |> map(|pos| pos + enzyme.cut_offset)
{name: enzyme.name, site: to_string(enzyme.site), n_cuts: len(sites), positions: cut_positions}
})
# Report
results |> each(|r| {
if r.n_cuts > 0 then {
print(f"{r.name} ({r.site}): {r.n_cuts} site(s) at {r.positions}")
} else {
print(f"{r.name} ({r.site}): no sites")
}
})
# Virtual double digest with EcoRI + BamHI
let ecori_result = results |> find(|r| r.name == "EcoRI")
let ecori_cuts = ecori_result.positions
let bamhi_result = results |> find(|r| r.name == "BamHI")
let bamhi_cuts = bamhi_result.positions
let all_cuts = (ecori_cuts ++ bamhi_cuts) |> sort() |> unique()
# Calculate fragment sizes (circular plasmid)
let fragments = if len(all_cuts) == 0 then [seq_len(seq)]
else {
let sorted_cuts = all_cuts |> sort()
let frags = sorted_cuts
|> window(2)
|> map(|pair| pair[1] - pair[0])
# Add the wrap-around fragment for circular DNA
let wrap = seq_len(seq) - last(sorted_cuts) + first(sorted_cuts)
frags ++ [wrap]
}
print(f"\nEcoRI + BamHI double digest: {len(fragments)} fragments")
fragments
|> sort(descending: true)
|> enumerate()
|> each(|i, size| print(f" Fragment {i + 1}: {size} bp"))
Summary
Bio literals are the foundation of BioLang. They carry biological meaning, enforce validity at parse time, and chain naturally with biological operations through pipes. Instead of calling string manipulation functions on raw text, you work directly with typed sequences that know about complements, codons, reading frames, and motifs.
In the next chapter, we will cover the full type system – variables, records, type coercion, and nil handling – which builds on these bio types to represent complex biological data structures.
Chapter 3: Variables and Types
BioLang has a rich type system designed around bioinformatics data. This chapter covers variable bindings, the full set of types, records, type checking and coercion, and nil handling.
let Bindings
Variables are introduced with let and are immutable by default:
let sample_id = "TCGA-A1-A0SP"
let min_depth = 30
let reference = dna"ATCGATCGATCG"
Attempting to rebind a let variable in the same scope is an error:
let threshold = 0.05
let threshold = 0.01 # Error: 'threshold' is already bound in this scope
Reassignment
Use = without let to reassign an existing variable:
let total_reads = 0
total_reads = total_reads + 1847293
let status = "pending"
status = "complete"
Primitive Types
| Type | Examples | Description |
|---|---|---|
| Int | 42, -1, 1_000_000 | 64-bit integer |
| Float | 3.14, 1e-10, 0.05 | 64-bit float |
| Str | "hello", f"GC: {gc}" | UTF-8 string |
| Bool | true, false | Boolean |
| Nil | nil | Absence of value |
let num_samples = 48
let p_value = 3.2e-8
let experiment = "RNA-seq time course"
let is_paired_end = true
let adapter_seq = nil # not yet determined
Bio Types
These types carry domain semantics beyond raw data:
| Type | Literal | Description |
|---|---|---|
| DNA | dna"ATCG" | DNA sequence (IUPAC) |
| RNA | rna"AUGC" | RNA sequence |
| Protein | protein"MVLK" | Amino acid sequence |
| Quality | qual"IIIHH" | Phred+33 quality scores |
| Interval | interval("chr1", 1000, 2000) | Genomic interval |
| Variant | variant("chr17", 7674220, "C", "T") | Sequence variant |
let primer_fwd = dna"TGTAAAACGACGGCCAGT"
let stop_codon = rna"UAG"
let signal_peptide = protein"MKWVTFISLLFLFSSAYS"
let read_quals = qual"IIIIIIIIHHHHHGGGGFFFF"
# Genomic coordinates
let tp53_exon1 = interval("chr17", 7668421, 7669690)
let brca1_snp = variant("chr17", 43094464, "G", "A")
Interval Type
Intervals represent genomic regions with chromosome, start, and end:
let region = interval("chr1", 1000000, 2000000)
print(region.chrom) # => "chr1"
print(region.start) # => 1000000
print(region.end) # => 2000000
print(region.length) # => 1000000
# Interval arithmetic
let exon1 = interval("chr1", 1000, 1500)
let exon2 = interval("chr1", 1400, 2000)
print(overlaps(exon1, exon2)) # => true
print(intersection(exon1, exon2)) # => interval("chr1", 1400, 1500)
Variant Type
Variants represent sequence changes at specific positions:
let snv = variant("chr7", 55249071, "C", "T")
print(snv.chrom) # => "chr7"
print(snv.pos) # => 55249071
print(snv.ref_allele) # => "C"
print(snv.alt_allele) # => "T"
# Classify variant type
print(variant_type(snv)) # => "SNV"
let indel = variant("chr7", 55249071, "CT", "C")
print(variant_type(indel)) # => "DEL"
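Insertions classify analogously; the "INS" label in this sketch is an assumption mirroring the "DEL" example:

```
let ins = variant("chr7", 55249071, "C", "CTT")
print(variant_type(ins)) # => "INS"
```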
Records
Records are BioLang’s primary structured data type. They map naturally to the tabular, metadata-rich world of bioinformatics:
let sample = {
sample_id: "S001",
patient: "P-4821",
tissue: "tumor",
reads: 45_000_000,
mean_coverage: 32.5,
contamination: 0.02
}
print(sample.sample_id) # => "S001"
print(sample.mean_coverage) # => 32.5
String Keys
Record keys can be identifiers or strings:
let annotation = {
"gene_name": "EGFR",
"Gene Ontology": "GO:0005524",
chrom: "chr7"
}
print(annotation."Gene Ontology") # => "GO:0005524"
Record Spread
The spread operator ... merges one record into another. Later keys override
earlier ones:
let defaults = {
aligner: "bwa-mem2",
min_mapq: 30,
mark_duplicates: true,
reference: "GRCh38"
}
let sample_config = {
...defaults,
sample_id: "S001",
min_mapq: 20 # override the default
}
print(sample_config.aligner) # => "bwa-mem2"
print(sample_config.min_mapq) # => 20
This pattern is essential for bioinformatics pipelines where you have standard parameters with per-sample overrides.
Nested Records
Records can nest to arbitrary depth:
let variant_annotation = {
variant: variant("chr17", 7674220, "C", "T"),
gene: {
name: "TP53",
transcript: "NM_000546.6",
exon: 7,
consequence: "missense_variant"
},
population: {
gnomad_af: 0.00003,
clinvar: "Pathogenic",
cosmic_count: 4521
},
prediction: {
sift: {score: 0.001, label: "deleterious"},
polyphen: {score: 0.998, label: "probably_damaging"}
}
}
print(variant_annotation.gene.consequence)
print(variant_annotation.prediction.sift.score)
Type Checking
BioLang provides introspection functions for runtime type checking:
let seq = dna"ATCG"
print(type(seq)) # => "DNA"
print(is_dna(seq)) # => true
print(is_rna(seq)) # => false
let rec = {name: "BRCA1", chrom: "chr17"}
print(type(rec)) # => "Record"
print(is_record(rec)) # => true
print(is_str(rec)) # => false
Available type check functions:
is_int(val)
is_float(val)
is_str(val)
is_bool(val)
is_nil(val)
is_dna(val)
is_rna(val)
is_protein(val)
is_quality(val)
is_list(val)
is_set(val)
is_record(val)
is_table(val)
is_interval(val)
is_variant(val)
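These predicates compose naturally with filters when cleaning heterogeneous data. A hedged sketch using only functions shown in this chapter:

```
let values = [42, dna"ATCG", "chr1", nil, 3.14]
let seqs = values |> filter(|v| is_dna(v) || is_rna(v))
let n_missing = values |> filter(|v| is_nil(v)) |> len()
print(f"{len(seqs)} sequence(s), {n_missing} nil value(s)")
```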
Type Coercion with into
into(value, "TargetType") converts between compatible types:
# String to DNA
let seq = into("ATCGATCG", "DNA")
# DNA to RNA (transcription)
let mrna = into(dna"ATCGATCG", "RNA")
# Int to Float
let ratio = into(42, "Float")
# String to Int
let depth = into("30", "Int")
# List of records to Table
let rows = [{gene: "TP53", pval: 0.001}, {gene: "BRCA1", pval: 0.05}]
let tbl = into(rows, "Table")
# Variant to Interval (single-base or span)
let region = into(variant("chr1", 1000, "A", "T"), "Interval")
Type Aliases
Define aliases to make code self-documenting:
type Locus = Record
type SampleMeta = Record
type QCMetrics = Record
let build_locus = |chrom, start, end, gene| {
chrom: chrom, start: start, end: end, gene: gene
}
let tp53 = build_locus("chr17", 7668421, 7687490, "TP53")
Nil Handling
Bioinformatics data is full of missing values – absent annotations, failed QC metrics, unmapped reads. BioLang has two operators for nil handling.
?? Null Coalesce
Returns the left side if non-nil, otherwise the right:
let depth = sample.coverage ?? 0
let gene_name = annotation.symbol ?? annotation.id ?? "unknown"
# Useful for setting defaults in pipeline parameters
let min_qual = args.min_quality ?? 30
let threads = args.threads ?? 4
?. Optional Chain
Safely accesses nested fields, returning nil if any intermediate value is nil:
let clinvar_status = variant_record?.annotation?.clinvar?.significance
# Returns nil if annotation, clinvar, or significance is missing
# Combine with ??
let pathogenicity = variant_record?.annotation?.clinvar?.significance ?? "Unknown"
Example: Building a Sample Metadata Registry
# sample_registry.bl
# Parse a sample sheet and build a typed metadata registry.
let defaults = {
platform: "Illumina",
chemistry: "NovaSeq 6000",
read_length: 150,
paired: true,
reference: "GRCh38"
}
let raw_sheet = csv("sample_sheet.csv")
let registry = raw_sheet |> map(|row| {
...defaults,
sample_id: row.Sample_ID,
patient_id: row.Patient ?? "unknown",
tissue: row.Tissue,
condition: row.Condition ?? "normal",
fastq_r1: row.FASTQ_R1,
fastq_r2: row.FASTQ_R2, # nil when single-end
paired: !is_nil(row.FASTQ_R2),
library_prep: row.Library ?? "WGS",
date: row.Date
})
# Validate: every sample must have R1
let invalid = registry |> filter(|s| is_nil(s.fastq_r1))
if len(invalid) > 0 then {
print(f"ERROR: {len(invalid)} samples missing FASTQ R1:")
invalid |> each(|s| print(f" {s.sample_id}"))
exit(1)
}
# Group by condition for downstream analysis
let by_condition = registry |> group_by(|s| s.condition)
by_condition |> each(|group| {
print(f"{group.key}: {len(group.values)} samples")
group.values |> each(|s| print(f" {s.sample_id} ({s.tissue})"))
})
# Summary
let tumor_count = registry |> filter(|s| s.condition == "tumor") |> len()
let normal_count = registry |> filter(|s| s.condition == "normal") |> len()
print(f"\nRegistry: {len(registry)} samples ({tumor_count} tumor, {normal_count} normal)")
Example: Parsing and Validating Variant Records
# validate_variants.bl
# Read a VCF file, parse variants into typed records, validate, classify.
let raw_variants = read_vcf("calls.vcf.gz")
# Build typed variant records with full annotation
let typed_variants = raw_variants |> map(|v| {
var: variant(v.chrom, v.pos, v.ref, v.alt),
qual: into(v.qual ?? "0", "Float"),
depth: into(v.info?.DP ?? "0", "Int"),
allele_freq: into(v.info?.AF ?? "0.0", "Float"),
genotype: v.samples?.GT ?? "./.",
filter_status: v.filter
})
# Validate variant quality
let validate = |v| {
let issues = []
if v.qual < 30.0 then issues = issues ++ ["low_qual"]
if v.depth < 10 then issues = issues ++ ["low_depth"]
if v.allele_freq < 0.05 then issues = issues ++ ["low_af"]
if v.genotype == "./." then issues = issues ++ ["missing_gt"]
{...v, issues: issues, is_valid: len(issues) == 0}
}
let validated = typed_variants |> map(validate)
# Classify variants
let classify = |v| {
let vtype = variant_type(v.var)
let size = if vtype == "SNV" then 1
else abs(len(v.var.alt_allele) - len(v.var.ref_allele))
{...v, variant_type: vtype, size: size}
}
let classified = validated |> map(classify)
# Report summary
let total = len(classified)
let valid = classified |> filter(|v| v.is_valid) |> len()
let by_type = classified |> group_by(|v| v.variant_type)
print(f"Total variants: {total}")
print(f"Passing filters: {valid} ({valid / total * 100.0:.1f}%)")
print(f"\nBy type:")
by_type |> each(|g| print(f" {g.key}: {len(g.values)}"))
# Flag high-impact variants
let high_impact = classified
|> filter(|v| v.is_valid && v.variant_type != "SNV" && v.size > 50)
|> sort("size", descending: true)
print(f"\nHigh-impact structural variants (>50bp): {len(high_impact)}")
high_impact |> take(10) |> each(|v| {
print(f" {v.var.chrom}:{v.var.pos} {v.variant_type} size={v.size} AF={v.allele_freq:.3f}")
})
Summary
BioLang’s type system combines standard programming types with biological data types
and record-based data modeling. Records with spread make it natural to build layered
configurations and annotated data structures. Nil handling with ?? and ?. keeps
pipelines resilient to the missing data that pervades genomics workflows.
The next chapter covers collections and tables – the workhorses for processing the large tabular datasets that dominate bioinformatics.
Chapter 4: Collections and Tables
Bioinformatics data is inherently tabular and set-oriented. BioLang provides Lists, Sets, Tables, and Ranges as first-class collection types, with rich operations for filtering, grouping, and summarizing biological datasets.
Lists
Lists are ordered, heterogeneous sequences:
let chromosomes = ["chr1", "chr2", "chr3", "chrX", "chrY"]
let read_lengths = [150, 151, 149, 150, 148, 150]
let mixed = ["TP53", 42, true, dna"ATCG"]
Lists support standard access and manipulation:
let genes = ["BRCA1", "TP53", "EGFR", "KRAS", "BRAF"]
print(genes[0]) # => "BRCA1"
print(genes[-1]) # => "BRAF"
print(len(genes)) # => 5
# Concatenation
let more = genes ++ ["PIK3CA", "PTEN"]
# Nested lists (e.g., paired-end read pairs)
let pairs = [
["sample1_R1.fq.gz", "sample1_R2.fq.gz"],
["sample2_R1.fq.gz", "sample2_R2.fq.gz"]
]
print(pairs[0][1]) # => "sample1_R2.fq.gz"
Sets
Sets are unordered collections of unique elements. They are the right tool for gene lists, sample IDs, and any membership-testing workload:
let panel_genes = set(["BRCA1", "BRCA2", "TP53", "EGFR", "KRAS", "BRAF"])
let mutated_genes = set(["TP53", "KRAS", "APC", "PIK3CA"])
# Set operations
let overlap = intersection(panel_genes, mutated_genes)
print(overlap) # => set(["TP53", "KRAS"])
let all_genes = union(panel_genes, mutated_genes)
let panel_only = difference(panel_genes, mutated_genes)
let mutated_only = difference(mutated_genes, panel_genes)
print(f"On panel and mutated: {len(overlap)}")
print(f"Mutated, not on panel: {len(mutated_only)}")
# Membership
print(contains(panel_genes, "TP53")) # => true
A real use case – finding shared variants between tumor and normal:
let tumor_variants = read_vcf("tumor.vcf.gz")
|> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
|> set()
let normal_variants = read_vcf("normal.vcf.gz")
|> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
|> set()
let somatic = difference(tumor_variants, normal_variants)
let germline = intersection(tumor_variants, normal_variants)
print(f"Somatic: {len(somatic)}, Germline: {len(germline)}")
Tables
Tables are the central data structure for tabular bioinformatics data. They are created from lists of records:
let samples = table([
{sample_id: "S001", tissue: "tumor", reads: 42_000_000, coverage: 35.2},
{sample_id: "S002", tissue: "normal", reads: 38_000_000, coverage: 31.4},
{sample_id: "S003", tissue: "tumor", reads: 51_000_000, coverage: 42.1},
{sample_id: "S004", tissue: "normal", reads: 44_000_000, coverage: 36.8}
])
Column Access
let coverages = samples |> select("coverage")
print(coverages) # => [35.2, 31.4, 42.1, 36.8]
# Multiple columns
let subset = samples |> select(["sample_id", "coverage"])
Row Iteration
samples |> each(|row| {
print(f"{row.sample_id}: {row.coverage}x ({row.tissue})")
})
Table Operations
select – Choose Columns
let qc_summary = full_table |> select(["sample_id", "total_reads", "pct_mapped", "mean_coverage"])
mutate – Add or Transform Columns
let enriched = samples |> mutate(
reads_millions: |row| row.reads / 1_000_000.0,
is_high_coverage: |row| row.coverage >= 30.0,
label: |row| f"{row.sample_id}_{row.tissue}"
)
summarize – Aggregate Statistics
let stats = samples |> summarize(
n_samples: count(),
total_reads: sum(reads),
mean_coverage: mean(coverage),
min_coverage: min(coverage),
max_coverage: max(coverage)
)
print(f"Samples: {stats.n_samples}, Mean coverage: {stats.mean_coverage:.1f}x")
group_by – Split-Apply-Combine
let by_tissue = samples
|> group_by("tissue")
|> summarize(
n: count(),
mean_cov: mean(coverage),
mean_reads: mean(reads)
)
by_tissue |> each(|row|
print(f"{row.tissue}: n={row.n}, cov={row.mean_cov:.1f}x, reads={row.mean_reads:.0f}")
)
sort – Order Rows
# Sort by coverage descending
let ranked = samples |> sort("coverage", descending: true)
# Multi-column sort
let ordered = samples |> sort(["tissue", "coverage"], descending: [false, true])
filter – Select Rows by Condition
let high_cov = samples |> filter(|row| row.coverage >= 30.0)
let tumor_only = samples |> filter(|row| row.tissue == "tumor")
# Chained filters
let good_tumor = samples
|> filter(|row| row.tissue == "tumor")
|> filter(|row| row.coverage >= 30.0)
|> filter(|row| row.reads >= 40_000_000)
Joins (Conceptual)
Table joins are planned for a future release. Currently, use map with lookups:
let annotations = table([
{sample_id: "S001", patient: "P101", stage: "III"},
{sample_id: "S002", patient: "P101", stage: "III"},
{sample_id: "S003", patient: "P202", stage: "II"}
])
# Manual left join via lookup
let ann_map = annotations |> group_by("sample_id") |> map(|g| {key: g.key, val: g.values[0]})
let joined = samples |> map(|row| {
let ann = ann_map |> find(|a| a.key == row.sample_id)
{...row, patient: ann?.val?.patient ?? "unknown", stage: ann?.val?.stage ?? "unknown"}
})
Ranges
Ranges generate sequences of integers, useful for genomic coordinate windows:
let indices = 0..10 # [0, 1, 2, ..., 9]
let inclusive = 0..=10 # [0, 1, 2, ..., 10]
let stepped = 0..100..10 # [0, 10, 20, ..., 90]
# Chromosome positions in 1 Mb windows
let chrom_length = 248_956_422 # chr1
let windows = 0..chrom_length..1_000_000
|> map(|start| {
chrom: "chr1",
start: start,
end: min(start + 1_000_000, chrom_length)
})
print(f"Generated {len(windows)} windows for chr1")
Example: Sample QC Summary Table from FASTQ Stats
Read multiple FASTQ files, compute QC metrics, and build a summary table.
# fastq_qc_summary.bl
# Build a QC summary table from raw FASTQ files.
let sample_sheet = csv("samples.csv")
let compute_stats = |reads| {
let quals = reads |> map(|r| mean(r.quality))
let lengths = reads |> map(|r| seq_len(r.seq))
let gc_vals = reads |> map(|r| gc_content(r.seq))
{
total_reads: len(reads),
total_bases: sum(lengths),
mean_length: mean(lengths),
mean_quality: mean(quals),
q30_pct: (reads |> filter(|r| mean(r.quality) >= 30) |> len()) / len(reads) * 100.0,
gc_pct: mean(gc_vals) * 100.0
}
}
let qc_results = sample_sheet |> map(|row| {
let r1 = read_fastq(row.fastq_r1)
let r1_stats = compute_stats(r1)
let r2_stats = if !is_nil(row.fastq_r2) then
compute_stats(read_fastq(row.fastq_r2))
else nil
{
sample_id: row.sample_id,
total_reads: r1_stats.total_reads + (r2_stats?.total_reads ?? 0),
total_bases: r1_stats.total_bases + (r2_stats?.total_bases ?? 0),
mean_length: r1_stats.mean_length,
mean_quality: r1_stats.mean_quality,
q30_pct: r1_stats.q30_pct,
gc_pct: r1_stats.gc_pct
}
})
let qc_table = table(qc_results)
# Add pass/fail column
let qc_flagged = qc_table |> mutate(
qc_pass: |row| row.mean_quality >= 30.0 && row.q30_pct >= 80.0 && row.total_reads >= 10_000_000
)
# Summary by pass/fail
let summary = qc_flagged |> group_by("qc_pass") |> summarize(
count: count(),
mean_reads: mean(total_reads),
mean_q: mean(mean_quality)
)
print("QC Summary:")
print("=" * 60)
qc_flagged |> each(|row| {
let flag = if row.qc_pass then "PASS" else "FAIL"
print(f" [{flag}] {row.sample_id}: {row.total_reads / 1e6:.1f}M reads, Q={row.mean_quality:.1f}, Q30={row.q30_pct:.1f}%")
})
let pass_count = qc_flagged |> filter(|r| r.qc_pass) |> len()
let fail_count = qc_flagged |> filter(|r| !r.qc_pass) |> len()
print(f"\n{pass_count} passed, {fail_count} failed out of {len(qc_flagged)} samples")
# Write results
qc_flagged |> write_csv("qc_summary.csv")
Example: Gene Expression Matrix – Filter and Normalize
Load a counts matrix, filter low-count genes, and apply TPM normalization.
# expression_normalize.bl
# Filter low-count genes and compute TPM from a raw counts matrix.
let counts = csv("gene_counts.csv") # rows = genes, columns = samples
let gene_lengths = csv("gene_lengths.csv") # gene_id, length
# Build a length lookup
let length_map = gene_lengths
|> group_by("gene_id")
|> map(|g| {key: g.key, length: g.values[0].length})
# Convert to table with gene metadata
let expr_table = table(counts)
let sample_cols = colnames(expr_table) |> filter(|c| c != "gene_id")
# Filter: keep genes with >= 10 counts in at least 3 samples
let filtered = expr_table |> filter(|row| {
let expressed = sample_cols |> filter(|col| row[col] >= 10) |> len()
expressed >= 3
})
print(f"Genes before filter: {len(expr_table)}")
print(f"Genes after filter: {len(filtered)}")
# RPK: reads per kilobase — transform each row
let rpk_table = filtered |> map(|gene_row| {
let gene_len = length_map |> find(|g| g.key == gene_row.gene_id)
let len_kb = (gene_len?.length ?? 1000) / 1000.0
let updated = sample_cols |> reduce({...gene_row}, |acc, col| {
...acc, [col]: gene_row[col] / len_kb
})
updated
}) |> table()
# TPM: normalize RPK to sum to 1 million per sample
let rpk_sums = sample_cols |> map(|col| {
col: col,
total: rpk_table |> select(col) |> sum()
})
let tpm_table = rpk_table |> map(|row| {
sample_cols |> reduce({...row}, |acc, col| {
let col_sum = rpk_sums |> find(|s| s.col == col)
{...acc, [col]: (row[col] / col_sum.total) * 1e6}
})
}) |> table()
# Summary statistics per gene
let gene_stats = tpm_table |> mutate(
mean_tpm: |row| sample_cols |> map(|c| row[c]) |> mean(),
max_tpm: |row| sample_cols |> map(|c| row[c]) |> max(),
cv: |row| {
let vals = sample_cols |> map(|c| row[c])
stdev(vals) / (mean(vals) + 1e-10)
}
)
# Top variable genes
let top_variable = gene_stats
|> sort("cv", descending: true)
|> take(20)
print("\nTop 20 most variable genes (by CV):")
top_variable |> each(|g|
print(f" {g.gene_id}: mean TPM={g.mean_tpm:.1f}, CV={g.cv:.2f}")
)
tpm_table |> write_csv("tpm_normalized.csv")
Example: Merge Variant Calls from Multiple Samples
Combine VCF calls from multiple samples into a unified genotype matrix.
# merge_variants.bl
# Merge variant calls from multiple samples into a unified table.
let sample_vcfs = [
{sample: "tumor_A", vcf: "tumor_A.vcf.gz"},
{sample: "tumor_B", vcf: "tumor_B.vcf.gz"},
{sample: "normal_A", vcf: "normal_A.vcf.gz"},
{sample: "normal_B", vcf: "normal_B.vcf.gz"}
]
# Read all variants with sample tag
let all_calls = sample_vcfs |> flat_map(|s| {
read_vcf(s.vcf) |> map(|v| {
key: f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}",
chrom: v.chrom,
pos: v.pos,
ref_allele: v.ref,
alt_allele: v.alt,
sample: s.sample,
qual: into(v.qual ?? "0", "Float"),
depth: into(v.info?.DP ?? "0", "Int"),
af: into(v.info?.AF ?? "0.0", "Float"),
genotype: v.samples?.GT ?? "./."
})
})
# Collect unique variant sites
let unique_sites = all_calls
|> map(|v| v.key)
|> unique()
|> sort()
print(f"Total calls: {len(all_calls)}")
print(f"Unique variant sites: {len(unique_sites)}")
# Build genotype matrix: one row per variant site, one column per sample
let sample_names = sample_vcfs |> map(|s| s.sample)
let calls_by_key = all_calls |> group_by(|v| v.key)
let merged = unique_sites |> map(|site_key| {
let site_calls = calls_by_key |> find(|g| g.key == site_key)
let first = site_calls.values[0]
let base = {
chrom: first.chrom,
pos: first.pos,
ref_allele: first.ref_allele,
alt_allele: first.alt_allele,
n_samples: len(site_calls.values)
}
# Add per-sample genotype columns
let gt_fields = sample_names |> reduce({}, |acc, sname| {
let call = site_calls.values |> find(|v| v.sample == sname)
{...acc, [f"{sname}_GT"]: call?.genotype ?? "./.", [f"{sname}_AF"]: call?.af ?? 0.0}
})
{...base, ...gt_fields}
})
let merged_table = table(merged)
# Find variants present in tumor but absent in normal (somatic candidates)
let somatic = merged_table |> filter(|row| {
let tumor_present = row.tumor_A_GT != "./." || row.tumor_B_GT != "./."
let normal_absent = row.normal_A_GT == "./." && row.normal_B_GT == "./."
tumor_present && normal_absent
})
let shared_tumor = merged_table |> filter(|row|
row.tumor_A_GT != "./." && row.tumor_B_GT != "./."
)
print(f"\nSomatic candidates: {len(somatic)}")
print(f"Shared across tumors: {len(shared_tumor)}")
print(f"\nTop somatic candidates:")
somatic
|> sort("n_samples", descending: true)
|> take(10)
|> each(|v| print(f" {v.chrom}:{v.pos} {v.ref_allele}>{v.alt_allele} (in {v.n_samples} sample(s))"))
merged_table |> write_csv("merged_genotypes.csv")
Summary
Collections and tables give you the tools to handle the data shapes that
dominate bioinformatics: sample sheets, QC matrices, expression counts,
and multi-sample variant calls. Combined with group_by, summarize,
and mutate, you can express complex data wrangling pipelines concisely.
The next chapter introduces pipes and transforms – the core of BioLang’s workflow composition model.
Chapter 5: Pipes and Transforms
The pipe operator is the defining feature of BioLang. Every bioinformatics workflow is a chain of transformations: read data, filter, transform, summarize, output. Pipes make these chains explicit, readable, and composable.
The Pipe Operator |>
The pipe a |> f(b) desugars to f(a, b). The left-hand value becomes the first
argument of the right-hand function:
# These are equivalent:
gc_content(dna"ATCGATCGATCG")
dna"ATCGATCGATCG" |> gc_content()
# And these:
filter(reads, |r| mean(r.quality) >= 30)
reads |> filter(|r| mean(r.quality) >= 30)
Why Pipe-First Matters for Bioinformatics
Without pipes, nested function calls read inside-out:
# Hard to follow -- you read from the innermost call outward
write_csv(
sort(
summarize(
group_by(
filter(read_vcf("calls.vcf.gz"), |v| v.qual >= 30),
|v| v.chrom
),
n: count(),
mean_qual: mean(qual)
),
"n",
descending: true
),
"chrom_summary.csv"
)
With pipes, the same workflow reads top-to-bottom, matching how you think about the analysis:
read_vcf("calls.vcf.gz")
|> filter(|v| v.qual >= 30)
|> group_by(|v| v.chrom)
|> summarize(n: count(), mean_qual: mean(qual))
|> sort("n", descending: true)
|> write_csv("chrom_summary.csv")
Each line is one transformation step. You can read the pipeline like a protocol.
Chaining Transforms
The real power emerges when you chain multiple operations. Here is a complete read-to-report workflow:
# Read a BAM file, compute per-chromosome alignment statistics
read_bam("sample.sorted.bam")
|> filter(|r| r.mapq >= 30 && !r.is_duplicate && !r.is_unmapped)
|> group_by(|r| r.chrom)
|> summarize(
mapped_reads: count(),
mean_mapq: mean(mapq),
mean_insert: mean(insert_size)
)
|> sort("mapped_reads", descending: true)
|> mutate(pct: |row| row.mapped_reads / sum(mapped_reads) * 100.0)
|> write_csv("alignment_stats.csv")
The Tap Pipe |>>
Sometimes you need a side effect in the middle of a chain – logging, writing
an intermediate file, printing a progress message – without disrupting the data
flow. The tap pipe |>> evaluates the right side for its effect but passes the
left side through unchanged:
read_fastq("sample_R1.fastq.gz")
|>> |reads| print(f"Raw reads: {len(reads)}")
|> filter(|r| mean(r.quality) >= 25)
|>> |reads| print(f"After quality filter: {len(reads)}")
|> filter(|r| seq_len(r.seq) >= 50)
|>> |reads| print(f"After length filter: {len(reads)}")
|> write_fastq("filtered_R1.fastq.gz")
Output:
Raw reads: 2847291
After quality filter: 2541083
After length filter: 2539847
The tap pipe is invaluable for debugging pipelines:
read_vcf("raw_calls.vcf.gz")
|>> |vs| print(f"Step 0 - Raw: {len(vs)} variants")
|> filter(|v| v.filter == "PASS")
|>> |vs| print(f"Step 1 - PASS only: {len(vs)}")
|> filter(|v| v.qual >= 30)
|>> |vs| print(f"Step 2 - Qual >= 30: {len(vs)}")
|> filter(|v| into(v.info?.DP ?? "0", "Int") >= 10)
|>> |vs| print(f"Step 3 - Depth >= 10: {len(vs)}")
|> write_vcf("filtered_calls.vcf.gz")
You can also use tap to write intermediate checkpoints:
read_fasta("assembly.fa")
|> filter(|s| seq_len(s.seq) >= 1000)
|>> |seqs| write_fasta(seqs, "contigs_gt1kb.fa")
|> filter(|s| gc_content(s.seq) >= 0.3 && gc_content(s.seq) <= 0.7)
|>> |seqs| write_fasta(seqs, "contigs_normal_gc.fa")
|> sort(|s| seq_len(s.seq), descending: true)
|> take(100)
|> write_fasta("top100_contigs.fa")
Higher-Order Functions in Pipe Chains
map – Transform Each Element
# Extract GC content per contig
let gc_profile = read_fasta("contigs.fa")
|> map(|seq| {id: seq.id, gc: gc_content(seq.seq), length: seq_len(seq.seq)})
filter – Select Elements by Predicate
# Keep only high-quality mapped reads
let good_reads = read_bam("aligned.bam")
|> filter(|r| r.mapq >= 30)
|> filter(|r| !r.is_duplicate)
|> filter(|r| r.is_proper_pair)
reduce – Accumulate a Single Value
# Total bases across all chromosomes
let total_bases = read_fasta("reference.fa")
|> map(|s| seq_len(s.seq))
|> reduce(0, |acc, n| acc + n)
print(f"Reference genome size: {total_bases / 1e9:.2f} Gb")
sort – Order Elements
# Rank genes by expression level
let top_genes = csv("tpm.csv")
|> sort("tpm", descending: true)
|> take(50)
|> map(|row| row.gene_name)
flat_map – Map and Flatten
Essential when each input produces multiple outputs:
# Extract all exons from a gene annotation
let all_exons = read_gff("genes.gff3")
|> filter(|f| f.type == "gene")
|> flat_map(|gene| gene.children |> filter(|c| c.type == "exon"))
# Find all k-mers across multiple sequences
let all_kmers = read_fasta("contigs.fa")
|> flat_map(|seq| seq.seq |> window(21) |> map(|w| to_string(w)))
|> group_by(|kmer| kmer)
|> map(|g| {kmer: g.key, count: len(g.values)})
|> sort("count", descending: true)
take_while – Stream Until Condition Fails
# Read sorted variants until we pass a genomic position
let nearby_variants = read_vcf("sorted.vcf.gz")
|> filter(|v| v.chrom == "chr17")
|> take_while(|v| v.pos <= 7_700_000)
|> filter(|v| v.pos >= 7_660_000)
scan – Running Accumulation
Like reduce, but emits every intermediate value:
# Cumulative read count across sorted BAM regions
let cumulative = read_bam("sorted.bam")
|> group_by(|r| r.chrom)
|> map(|g| {chrom: g.key, count: len(g.values)})
|> sort("chrom")
|> scan(0, |acc, row| acc + row.count)
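On a small list the difference from reduce is easiest to see. Assuming scan emits one accumulated value per input element, the running sums are kept rather than collapsed:

```
let partial_sums = [1, 2, 3, 4] |> scan(0, |acc, n| acc + n)
# => [1, 3, 6, 10] -- reduce with the same arguments would return only 10
```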
window and chunk – Sliding and Fixed Blocks
# Sliding window GC content
let gc_track = dna"ATCGATCGATCGATCGCCCCGGGG"
|> window(size: 10, step: 5)
|> map(|w| gc_content(w))
# Chunk reads into batches for parallel processing
let batches = read_fastq("sample.fastq.gz")
|> chunk(100_000)
|> map(|batch| {n_reads: len(batch), mean_gc: batch |> map(|r| gc_content(r.seq)) |> mean()})
zip – Pair Two Collections
# Pair forward and reverse reads
let r1 = read_fastq("sample_R1.fastq.gz")
let r2 = read_fastq("sample_R2.fastq.gz")
let paired = zip(r1, r2) |> map(|pair| {
id: pair[0].id,
r1_qual: mean(pair[0].quality),
r2_qual: mean(pair[1].quality),
combined_length: seq_len(pair[0].seq) + seq_len(pair[1].seq)
})
enumerate – Track Position
# Number contigs by size rank
read_fasta("assembly.fa")
|> sort(|s| seq_len(s.seq), descending: true)
|> enumerate()
|> map(|i, seq| {rank: i + 1, id: seq.id, length: seq_len(seq.seq)})
|> take(10)
|> each(|row| print(f"#{row.rank}: {row.id} ({row.length} bp)"))
Example: FASTQ Quality Filtering Pipeline
A complete quality filtering workflow with logging at each step.
# quality_filter.bl
# Multi-step FASTQ quality filtering pipeline.
let input_r1 = "raw_data/sample_R1.fastq.gz"
let input_r2 = "raw_data/sample_R2.fastq.gz"
# Run R1 and R2 through the same filtering pipeline
let process_reads = |input_path, label| {
read_fastq(input_path)
|>> |reads| print(f"[{label}] Input: {len(reads)} reads")
# Step 1: Remove short reads
|> filter(|r| seq_len(r.seq) >= 50)
|>> |reads| print(f"[{label}] After length filter: {len(reads)}")
# Step 2: Filter by mean quality
|> filter(|r| mean(r.quality) >= 25)
|>> |reads| print(f"[{label}] After quality filter: {len(reads)}")
# Step 3: Remove low-complexity reads
|> filter(|r| {
let gc = gc_content(r.seq)
gc > 0.1 && gc < 0.9
})
|>> |reads| print(f"[{label}] After complexity filter: {len(reads)}")
}
let filtered_r1 = process_reads(input_r1, "R1")
let filtered_r2 = process_reads(input_r2, "R2")
# Keep only concordant pairs (both mates survived)
let r1_ids = filtered_r1 |> map(|r| r.id) |> set()
let r2_ids = filtered_r2 |> map(|r| r.id) |> set()
let shared_ids = intersection(r1_ids, r2_ids)
let paired_r1 = filtered_r1 |> filter(|r| contains(shared_ids, r.id))
let paired_r2 = filtered_r2 |> filter(|r| contains(shared_ids, r.id))
print(f"\nFinal paired reads: {len(paired_r1)} pairs")
paired_r1 |> write_fastq("filtered/sample_R1.fastq.gz")
paired_r2 |> write_fastq("filtered/sample_R2.fastq.gz")
# Orphan reads (mate was filtered out)
let orphan_r1 = filtered_r1 |> filter(|r| !contains(shared_ids, r.id))
let orphan_r2 = filtered_r2 |> filter(|r| !contains(shared_ids, r.id))
let orphans = orphan_r1 ++ orphan_r2
print(f"Orphan reads: {len(orphans)}")
orphans |> write_fastq("filtered/sample_orphans.fastq.gz")
Example: Variant Annotation Pipeline
Read VCF, classify variants, annotate against known databases, filter, and report.
# annotate_variants.bl
# Variant annotation and classification pipeline.
let known_pathogenic = csv("clinvar_pathogenic.csv")
|> map(|r| f"{r.chrom}:{r.pos}:{r.ref}>{r.alt}")
|> set()
let gnomad_af = csv("gnomad_af.csv")
|> group_by(|r| f"{r.chrom}:{r.pos}:{r.ref}>{r.alt}")
|> map(|g| {key: g.key, af: into(g.values[0].af, "Float")})
read_vcf("somatic_calls.vcf.gz")
# Step 1: Basic quality filter
|> filter(|v| v.qual >= 30 && v.filter == "PASS")
|>> |vs| print(f"After quality filter: {len(vs)}")
# Step 2: Classify variant type
|> map(|v| {
...v,
key: f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}",
vtype: variant_type(variant(v.chrom, v.pos, v.ref, v.alt))
})
# Step 3: Annotate with population frequency
|> map(|v| {
let gnomad = gnomad_af |> find(|g| g.key == v.key)
{...v, gnomad_af: gnomad?.af ?? 0.0}
})
# Step 4: Annotate with ClinVar pathogenicity
|> map(|v| {
...v,
is_known_pathogenic: contains(known_pathogenic, v.key)
})
# Step 5: Assign priority tier
|> map(|v| {
let tier = if v.is_known_pathogenic then "Tier1_Known"
else if v.gnomad_af < 0.001 && v.vtype != "SNV" then "Tier2_RareIndel"
else if v.gnomad_af < 0.01 then "Tier3_Rare"
else "Tier4_Common"
{...v, tier: tier}
})
|>> |vs| {
let tier_counts = vs |> group_by(|v| v.tier) |> map(|g| f"{g.key}: {len(g.values)}")
print(f"Tier distribution: {tier_counts}")
}
# Step 6: Filter to actionable variants
|> filter(|v| v.tier == "Tier1_Known" || v.tier == "Tier2_RareIndel")
|>> |vs| print(f"Actionable variants: {len(vs)}")
# Step 7: Sort by priority and position
|> sort(["tier", "chrom", "pos"])
# Step 8: Generate report
|> map(|v| {
chrom: v.chrom,
pos: v.pos,
ref_allele: v.ref,
alt_allele: v.alt,
type: v.vtype,
qual: v.qual,
gnomad_af: v.gnomad_af,
tier: v.tier,
clinvar: if v.is_known_pathogenic then "Pathogenic" else "."
})
|> write_csv("annotated_actionable.csv")
Example: Multi-Sample Comparison with Zip and Reduce
Compare variant calls across matched tumor-normal pairs and compute concordance.
# multi_sample_compare.bl
# Compare variant calls across tumor-normal pairs.
let pairs = [
{tumor: "patient1_tumor.vcf.gz", normal: "patient1_normal.vcf.gz", patient: "P001"},
{tumor: "patient2_tumor.vcf.gz", normal: "patient2_normal.vcf.gz", patient: "P002"},
{tumor: "patient3_tumor.vcf.gz", normal: "patient3_normal.vcf.gz", patient: "P003"}
]
# Process each pair
let pair_results = pairs |> map(|pair| {
let tumor_vars = read_vcf(pair.tumor)
|> filter(|v| v.qual >= 30)
|> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
|> set()
let normal_vars = read_vcf(pair.normal)
|> filter(|v| v.qual >= 30)
|> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
|> set()
let somatic = difference(tumor_vars, normal_vars)
let germline = intersection(tumor_vars, normal_vars)
let normal_only = difference(normal_vars, tumor_vars)
{
patient: pair.patient,
tumor_total: len(tumor_vars),
normal_total: len(normal_vars),
somatic: len(somatic),
germline: len(germline),
normal_only: len(normal_only),
somatic_variants: somatic
}
})
# Summary table
let summary = table(pair_results |> map(|r| {
patient: r.patient,
tumor_total: r.tumor_total,
normal_total: r.normal_total,
somatic: r.somatic,
germline: r.germline,
somatic_rate: r.somatic / r.tumor_total * 100.0
}))
print("Patient Variant Comparison")
print("=" * 70)
summary |> each(|row|
print(f" {row.patient}: tumor={row.tumor_total}, somatic={row.somatic} ({row.somatic_rate:.1f}%), germline={row.germline}")
)
# Find recurrent somatic variants (present in 2+ patients)
let all_somatic = pair_results
|> flat_map(|r| r.somatic_variants |> collect() |> map(|v| {variant: v, patient: r.patient}))
|> group_by(|x| x.variant)
|> filter(|g| len(g.values) >= 2)
|> map(|g| {
variant: g.key,
n_patients: len(g.values),
patients: g.values |> map(|v| v.patient)
})
|> sort("n_patients", descending: true)
print(f"\nRecurrent somatic variants (in >= 2 patients): {len(all_somatic)}")
all_somatic |> take(20) |> each(|v|
print(f" {v.variant} -- found in {v.n_patients} patients: {v.patients}")
)
# Cross-patient aggregates via reduce
let totals = pair_results |> reduce(
{total_somatic: 0, total_germline: 0, patients: 0},
|acc, r| {
total_somatic: acc.total_somatic + r.somatic,
total_germline: acc.total_germline + r.germline,
patients: acc.patients + 1
}
)
print(f"\nCohort totals across {totals.patients} patients:")
print(f" Mean somatic variants: {totals.total_somatic / totals.patients:.0f}")
print(f" Mean germline variants: {totals.total_germline / totals.patients:.0f}")
# Zip tumor-normal read depths for concordance check
let p1_tumor_depths = read_vcf(pairs[0].tumor) |> map(|v| into(v.info?.DP ?? "0", "Int"))
let p1_normal_depths = read_vcf(pairs[0].normal) |> map(|v| into(v.info?.DP ?? "0", "Int"))
let depth_comparison = zip(p1_tumor_depths, p1_normal_depths)
|> map(|pair| {tumor_dp: pair[0], normal_dp: pair[1], ratio: pair[0] / (pair[1] + 1.0)})
|> filter(|d| d.ratio > 3.0) # flag positions with 3x tumor vs normal depth
print(f"\nPatient 1 depth outliers (tumor/normal > 3x): {len(depth_comparison)}")
Summary
Pipes and transforms are the backbone of BioLang. The |> operator turns nested
function calls into linear, readable workflows. The tap pipe |>> lets you
observe and debug without disrupting the data flow. Higher-order functions like
map, filter, reduce, flat_map, zip, and window cover the full range
of data transformation patterns you encounter in bioinformatics pipelines.
Every example in this chapter represents a real analysis pattern: quality filtering, variant annotation, multi-sample comparison. As your pipelines grow more complex, pipes keep them readable and composable.
Chapter 6: Functions and Closures
Functions are the primary unit of reuse in BioLang. Whether you are writing a normalization routine for RNA-seq counts, a scoring function for sequence alignment, or a filter predicate for variant calls, functions let you name a computation and invoke it wherever it is needed.
Defining Functions
A function is introduced with the fn keyword, followed by a name, a
parameter list, and a body enclosed in braces.
fn gc_content(seq) {
let gc = seq |> filter(|b| b == "G" || b == "C") |> len()
gc / seq_len(seq)
}
let ratio = gc_content("ATGCGCTA")
# ratio => 0.5
The last expression in the body is the implicit return value. There is no need
for a return keyword in the common case.
Explicit Return
When you need to exit early – for example after detecting an invalid input –
use return.
fn median_quality(quals) {
if len(quals) == 0 then {
return 0.0
}
let sorted = quals |> sort()
let mid = len(sorted) / 2
if len(sorted) % 2 == 0 then {
(sorted[mid - 1] + sorted[mid]) / 2.0
} else {
sorted[mid]
}
}
Default Parameters
Parameters can carry default values. Callers may omit them.
fn trim_reads(records, min_qual: 20, min_len: 36) {
records
|> filter(|r| mean(r.quality) >= min_qual)
|> filter(|r| seq_len(r.seq) >= min_len)
}
# Use defaults
let trimmed = trim_reads(raw_reads)
# Override quality threshold only
let strict = trim_reads(raw_reads, min_qual: 30)
Default parameters must appear after positional parameters.
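A sketch of the rule – window_gc is illustrative, not a built-in:

```
# Valid: the positional parameter comes first, defaults after
fn window_gc(seq, size: 100, step: 50) {
seq |> window(size: size, step: step) |> map(|w| gc_content(w))
}
# Invalid: a defaulted parameter may not precede a positional one
# fn bad_window_gc(size: 100, seq) { ... }
```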
Named Return Values
You can declare named return fields to make the return shape explicit.
fn count_variants(vcf_path) -> (snps: Int, indels: Int, other: Int) {
let records = read_vcf(vcf_path)
let snps = 0
let indels = 0
let other = 0
records |> each(|v| {
let vt = variant_type(v)
if vt == "SNV" then snps = snps + 1
else if vt == "Indel" then indels = indels + 1
else other = other + 1
})
}
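The declared shape tells the caller what to expect. Assuming named return values are read like record fields, usage would look like this sketch:

```
let counts = count_variants("cohort.vcf.gz")
print(f"SNVs: {counts.snps}, indels: {counts.indels}, other: {counts.other}")
```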
Closures and Lambdas
A closure is an anonymous function written with pipe delimiters. Short closures
use a single expression; longer ones use { } for a block body.
# Single-expression closure
let is_high_qual = |v| v.qual >= 30.0
# Block closure
let normalize = |counts, size_factors| {
let total = counts |> reduce(0, |a, b| a + b)
counts |> map(|c| (c / total) * 1e6 / size_factors)
}
Closures capture variables from the surrounding scope, which makes them ideal for building specialized predicates on the fly.
fn make_depth_filter(min_depth) {
|record| record.info.DP >= min_depth
}
let deep_enough = make_depth_filter(30)
let filtered = variants |> filter(deep_enough)
Higher-Order Functions
A higher-order function accepts another function (or closure) as an argument, returns one, or does both.
fn apply_qc_chain(reads, filters) {
filters |> reduce(reads, |current, f| current |> filter(f))
}
let filters = [
|r| mean(r.quality) >= 20,
|r| seq_len(r.seq) >= 50,
|r| gc_content(r.seq) < 0.8
]
let passing = apply_qc_chain(raw_reads, filters)
Because the pipe operator inserts its left-hand value as the first argument, you can chain higher-order functions fluently:
let top_genes = counts
|> filter(|g| g.biotype == "protein_coding")
|> sort(|g| g.tpm, descending: true)
|> take(100)
|> map(|g| g.gene_name)
Where Clauses
A where clause attaches a precondition to a function. If the predicate is
false at call time, a runtime error is raised.
fn normalize_counts(counts) where len(counts) > 0 {
let total = counts |> reduce(0, |a, b| a + b)
counts |> map(|c| c / total)
}
fn log2_fold_change(treated, control) where control > 0.0 {
log2(treated / control)
}
fn call_consensus(reads) where len(reads) >= 3 {
let bases = reads |> map(|r| r.base)
let top = bases
|> group_by(|b| b)
|> sort(|g| len(g.values), descending: true)
|> first()
top.key
}
Where clauses serve as executable documentation: they state the contract and enforce it at the boundary.
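The failure mode is worth seeing once. A minimal sketch – mean_depth is illustrative:

```
fn mean_depth(depths) where len(depths) > 0 {
sum(depths) / len(depths)
}
let ok = mean_depth([30, 42, 28]) # precondition holds, function runs
# mean_depth([]) # raises a runtime error: the where clause is false
```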
The @memoize Decorator
Bioinformatics computations are often called repeatedly with the same inputs –
the same gene queried in an annotation database, the same k-mer scored in a
seed-extension aligner. The @memoize decorator caches results keyed by
argument values.
@memoize
fn gc_of_kmer(kmer) {
gc_content(dna(kmer))
}
# First call computes and caches the result.
# Subsequent calls with the same k-mer return instantly.
let g1 = gc_of_kmer("ATCGGC") # cache miss
let g2 = gc_of_kmer("ATCGGC") # cache hit
Memoization is especially valuable when combined with recursion.
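The classic illustration is a naive recurrence with overlapping subproblems – here as a sketch:

```
@memoize
fn fib(n) {
if n <= 1 then { return n }
fib(n - 1) + fib(n - 2)
}
# Without @memoize the call tree is exponential in n;
# with it, each value of n is computed exactly once.
let f = fib(40)
```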
Other Decorators
Decorators are annotations placed before a function definition. Besides
@memoize, BioLang supports user-defined decorators.
@log_timing
fn align_reads(fastq, ref_genome) {
# ... alignment logic ...
}
@validate_inputs
fn merge_bams(bam_paths) where len(bam_paths) > 0 {
# ... merge logic ...
}
Decorators wrap the function transparently – the original signature and behavior are preserved, with the decorator adding cross-cutting logic.
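The definition syntax for user-defined decorators is not shown here; one plausible model – purely an assumption – is a higher-order function that receives the decorated function and returns a wrapped version:

```
# Hypothetical sketch: log_timing wraps a single-argument function.
# now() is assumed to return a timestamp; it is not a documented built-in.
fn log_timing(f) {
|x| {
let start = now()
let result = f(x)
print(f"elapsed: {now() - start}")
result
}
}
```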
Recursion
Recursive functions call themselves. BioLang supports standard recursion, which is useful for tree-structured biological data such as phylogenies and ontology DAGs.
fn tree_leaf_count(node) {
if len(node.children) == 0 then {
return 1
}
node.children |> map(tree_leaf_count) |> reduce(0, |a, b| a + b)
}
Combine @memoize with recursion for dynamic-programming-style algorithms:
@memoize
fn needleman_wunsch_score(seq1, seq2, i, j, gap_penalty: -2) {
if i == 0 then { return j * gap_penalty }
if j == 0 then { return i * gap_penalty }
let match_score = if seq1[i - 1] == seq2[j - 1] then 1 else -1
let diag = needleman_wunsch_score(seq1, seq2, i - 1, j - 1) + match_score
let up = needleman_wunsch_score(seq1, seq2, i - 1, j) + gap_penalty
let left = needleman_wunsch_score(seq1, seq2, i, j - 1) + gap_penalty
max(diag, max(up, left))
}
let score = needleman_wunsch_score("AGTACG", "ACATAG", 6, 6)
Example: Memoized K-mer Scoring Function
Counting k-mer matches across many sequences calls the same function millions of times. Caching the GC computation removes redundant work.
@memoize
fn kmer_gc(kmer_str) {
gc_content(dna(kmer_str))
}
fn score_sequence_kmers(seq, k: 21) {
let kmer_list = seq |> kmers(k)
let gc_scores = kmer_list |> map(|km| kmer_gc(to_string(km)))
{
n_kmers: len(gc_scores),
mean_gc: mean(gc_scores),
high_gc_count: gc_scores |> filter(|g| g > 0.6) |> len()
}
}
let result = score_sequence_kmers(dna"ATCGATCGCCCCGGGGATATATATCGCGCGATCG")
print(f"K-mers: {result.n_kmers}, Mean GC: {result.mean_gc:.3f}")
Example: Reusable QC Filter Factory
A factory function returns a closure configured with experiment-specific thresholds. Different sequencing protocols produce different acceptable ranges.
fn make_qc_filter(min_qual: 20, min_len: 50, max_n_frac: 0.05) {
|read| {
let avg_q = mean(read.quality)
let n_frac = (read.seq |> filter(|b| b == "N") |> len()) / seq_len(read.seq)
avg_q >= min_qual && seq_len(read.seq) >= min_len && n_frac <= max_n_frac
}
}
# Whole-genome: permissive
let wgs_filter = make_qc_filter(min_qual: 15, min_len: 36)
# Amplicon: strict
let amp_filter = make_qc_filter(min_qual: 30, min_len: 100, max_n_frac: 0.01)
let wgs_reads = read_fastq("sample_wgs.fastq.gz") |> filter(wgs_filter)
let amp_reads = read_fastq("sample_amp.fastq.gz") |> filter(amp_filter)
print("WGS passing: " + to_string(len(wgs_reads)))
print("Amplicon passing: " + to_string(len(amp_reads)))
Example: Recursive Phylogenetic Tree Traversal
Phylogenetic trees are naturally recursive. This example collects all leaf taxa whose branch length from the root exceeds a threshold – useful for detecting fast-evolving lineages.
fn collect_distant_leaves(node, dist_so_far, threshold) {
let current_dist = dist_so_far + (node.branch_length ?? 0.0)
if len(node.children) == 0 then {
# Leaf node
if current_dist > threshold then {
return [{name: node.name, distance: current_dist}]
} else {
return []
}
}
# Internal node: recurse into children and concatenate results
node.children
|> flat_map(|child| collect_distant_leaves(child, current_dist, threshold))
}
# Build a simple tree as nested records
let tree = {
name: "root", branch_length: 0.0, children: [
{name: "primates", branch_length: 0.08, children: [
{name: "human", branch_length: 0.1, children: []},
{name: "chimp", branch_length: 0.12, children: []}
]},
{name: "rodents", branch_length: 0.20, children: [
{name: "mouse", branch_length: 0.35, children: []},
{name: "rat", branch_length: 0.33, children: []}
]}
]
}
let fast_evolving = collect_distant_leaves(tree, 0.0, 0.30)
fast_evolving |> each(|taxon|
print(taxon.name + " => total branch length " + to_string(taxon.distance))
)
# mouse => total branch length 0.55
# rat => total branch length 0.53
Summary
| Feature | Syntax | Use Case |
|---|---|---|
| Function | fn name(params) { } | Named, reusable computation |
| Default params | fn f(x, k: 10) { } | Flexible call sites |
| Where clause | fn f(x) where x > 0 { } | Precondition contracts |
| Closure | |x| expr | Inline predicates, callbacks |
| Block closure | |x| { stmts } | Multi-step anonymous logic |
| @memoize | @memoize fn f() { } | Cache repeated computations |
| Recursion | Self-call in body | Trees, dynamic programming |
Functions and closures are the building blocks that the rest of BioLang – pipelines, parallel maps, and the standard library – is built upon. Master them and every subsequent chapter becomes easier.
Control Flow
BioLang provides a rich set of control-flow constructs designed for the messy realities of biological data. Variants need classification, samples need multi-criteria gating, and searches over sequences must handle both the found and not-found cases gracefully.
if/else Expressions
if/else in BioLang is an expression – it returns a value. This lets you
embed conditional logic directly in assignments and pipes.
let label = if gc_content(seq) > 0.6 {
"GC-rich"
} else if gc_content(seq) < 0.4 {
"AT-rich"
} else {
"balanced"
}
Because it is an expression, you can use it inline:
let allele_freq = if total_depth > 0 { alt_count / total_depth } else { 0.0 }
match Expressions
match compares a value against a series of patterns. Like if, it is an
expression and returns the result of the matching arm.
let codon = "ATG"
let amino_acid = match codon {
"ATG" => "Met"
"TGG" => "Trp"
"TAA" | "TAG" | "TGA" => "Stop"
_ => lookup_codon_table(codon)
}
Patterns can destructure records:
fn describe_alignment(aln) {
match aln {
{mapped: true, mapq: q} if q >= 30 => "high-confidence"
{mapped: true, mapq: q} if q >= 10 => "low-confidence"
{mapped: true} => "very-low-quality"
{mapped: false} => "unmapped"
}
}
for Loops
The basic for loop iterates over any collection or stream.
let reads = read_fastq("sample.fastq.gz")
let total_bases = 0
for read in reads {
total_bases = total_bases + len(read.seq)
}
print("Total bases: " + str(total_bases))
Tuple Destructuring
When iterating over paired data, destructure directly in the loop header.
let sample_ids = ["S1", "S2", "S3"]
let fastq_paths = [
"data/S1_R1.fastq.gz",
"data/S2_R1.fastq.gz",
"data/S3_R1.fastq.gz"
]
for (sample, path) in zip(sample_ids, fastq_paths) {
let stats = read_fastq(path) |> read_stats()
print(sample + ": " + str(stats.total_reads) + " reads")
}
Destructuring also works with enumerate:
let exons = read_bed("BRCA1_exons.bed")
for (i, exon) in enumerate(exons) {
print("Exon " + str(i + 1) + ": " + exon.chrom + ":" + str(exon.start) + "-" + str(exon.end))
}
for/else
The else block executes only when the loop completes without hitting a
break. This is perfect for search-and-report patterns.
let target = dna"TATAAA"
let promoters = read_bed("promoter_regions.bed")
for region in promoters {
let seq = extract_sequence(ref_genome, region)
if contains(seq, target) {
print("TATA box found in " + region.name)
break
}
} else {
print("No TATA box found in any promoter region")
}
while Loops
Use while when the number of iterations is unknown ahead of time.
fn find_orf(seq) {
let i = 0
let in_orf = false
let orf_start = 0
let orfs = []
while i + 2 < len(seq) {
let codon = slice(seq, i, i + 3)
if codon == "ATG" && !in_orf {
in_orf = true
orf_start = i
}
if (codon == "TAA" || codon == "TAG" || codon == "TGA") && in_orf {
orfs = orfs + [{start: orf_start, end: i + 3, length: i + 3 - orf_start}]
in_orf = false
}
i = i + 3
}
orfs
}
let orfs = find_orf("ATGAAACCCTAGATGTTTGAATAA")
# [{start: 0, end: 12, length: 12}, {start: 12, end: 24, length: 12}]
break and continue
break exits the innermost loop. continue skips to the next iteration.
let variants = read_vcf("somatic.vcf")
let first_pathogenic = None
for v in variants {
# Skip low-quality calls
if v.qual < 30.0 {
continue
}
let info = parse_info(v.info_str)
if info?.CLNSIG == "Pathogenic" {
first_pathogenic = v
break
}
}
given/otherwise
given/otherwise is a declarative chain of conditions. It reads top to bottom;
the first true condition wins. otherwise is the fallback.
fn classify_read_pair(r1_len, r2_len, insert_size) {
given {
insert_size < 0 => "invalid_pair"
insert_size > 1000 => "structural_variant_candidate"
r1_len < 36 || r2_len < 36 => "short_read"
insert_size < 150 => "short_insert"
otherwise => "normal"
}
}
given is an expression, so you can assign its result:
let risk = given {
allele_freq >= 0.5 && coverage >= 30 => "high_confidence_somatic"
allele_freq >= 0.2 => "moderate_evidence"
allele_freq >= 0.05 => "low_frequency"
otherwise => "below_detection"
}
guard Clauses
guard asserts that a condition is true. If it is not, the else block
executes – typically an early return or error.
fn calculate_tmb(variants, exome_size_mb) {
guard exome_size_mb > 0 else {
return {error: "Exome size must be positive"}
}
guard len(variants) > 0 else {
return {tmb: 0.0, classification: "low"}
}
let somatic = variants |> filter(|v| v.filter == "PASS")
let tmb = len(somatic) / exome_size_mb
let classification = given {
tmb >= 20.0 => "high"
tmb >= 10.0 => "intermediate"
otherwise => "low"
}
{tmb: tmb, classification: classification}
}
Guards keep the main logic at the top indentation level by pushing error handling to the margin. Use them liberally for input validation.
unless
unless is syntactic sugar for if !condition. It reads naturally for
negative checks.
fn process_bam(path) {
let header = read_bam_header(path)
unless contains(header.sort_order, "coordinate") {
print("WARNING: BAM is not coordinate-sorted, sorting first")
path = sort_bam(path)
}
unless file_exists(path + ".bai") {
print("Indexing BAM")
index_bam(path)
}
# Proceed with analysis
let stats = bam_stats(path)
stats
}
Example: Classify Variants by Type
Read a VCF and produce a summary table of variant types using match.
let variants = read_vcf("sample.vcf")
let classified = variants |> map(|v| {
let vtype = match true {
is_snp(v) && is_transition(v) => "transition"
is_snp(v) && is_transversion(v) => "transversion"
is_snp(v) => "snp_other"
is_indel(v) && len(v.alt) > len(v.ref) => "insertion"
is_indel(v) => "deletion"
_ => "complex"
}
{...v, classification: vtype}
})
let summary = classified
|> group_by(|v| v.classification)
|> map(|group| {type: group.0, count: len(group.1)})
|> sort(|a, b| b.count - a.count)
for row in summary {
print(row.type + ": " + str(row.count))
}
# transition: 45231
# transversion: 22890
# deletion: 3412
# insertion: 2876
# complex: 134
Example: Search for Motif with for/else
Scan upstream regions for a transcription-factor binding motif. Report the first hit or state that none was found.
let motif = dna"CANNTG" # E-box motif (N = any base)
let genes = read_gff("annotations.gff")
|> filter(|f| f.type == "gene" && f.biotype == "protein_coding")
for gene in genes {
let upstream = interval(gene.chrom, gene.start - 2000, gene.start)
let seq = extract_sequence(ref_genome, upstream)
let hits = find_motif(seq, motif)
if len(hits) > 0 {
print("E-box found upstream of " + gene.name + " at offset " + str(hits[0].position))
break
}
} else {
print("No E-box motif found in any upstream region scanned")
}
Example: Multi-Criteria Sample QC with given/otherwise
Gate samples through a series of quality thresholds and assign a disposition.
fn qc_disposition(stats) {
given {
stats.total_reads < 1_000_000 =>
{pass: false, reason: "insufficient_reads", reads: stats.total_reads}
stats.pct_mapped < 70.0 =>
{pass: false, reason: "low_mapping_rate", pct: stats.pct_mapped}
stats.mean_depth < 10.0 =>
{pass: false, reason: "low_depth", depth: stats.mean_depth}
stats.pct_duplicate > 40.0 =>
{pass: false, reason: "high_duplication", pct: stats.pct_duplicate}
stats.contamination > 0.03 =>
{pass: false, reason: "contamination", frac: stats.contamination}
otherwise =>
{pass: true, reason: "all_checks_passed"}
}
}
let samples = ["S001", "S002", "S003", "S004"]
let bam_dir = "data/aligned"
for sample in samples {
let stats = bam_stats(bam_dir + "/" + sample + ".bam")
let result = qc_disposition(stats)
if result.pass {
print(sample + ": PASS")
} else {
print(sample + ": FAIL (" + result.reason + ")")
}
}
Example: Input Validation with guard
A variant-calling wrapper that validates every precondition before running the expensive computation.
fn call_variants(bam_path, ref_path, bed_path, min_depth: 10) {
guard file_exists(bam_path) else {
return {error: "BAM file not found: " + bam_path}
}
guard file_exists(ref_path) else {
return {error: "Reference FASTA not found: " + ref_path}
}
guard file_exists(bed_path) else {
return {error: "Target BED not found: " + bed_path}
}
let header = read_bam_header(bam_path)
guard header.sort_order == "coordinate" else {
return {error: "BAM must be coordinate-sorted"}
}
guard file_exists(bam_path + ".bai") else {
return {error: "BAM index (.bai) not found"}
}
# All preconditions met -- run the caller
let regions = read_bed(bed_path)
let raw_calls = pileup_caller(bam_path, ref_path, regions, min_depth)
let filtered = raw_calls
|> filter(|v| v.qual >= 30.0 && v.depth >= min_depth)
{
total_calls: len(raw_calls),
passing_calls: len(filtered),
variants: filtered
}
}
Summary
| Construct | Returns Value? | Best For |
|---|---|---|
| if/else | Yes | Binary or ternary decisions |
| match | Yes | Multi-arm pattern dispatch |
| for | No | Iteration over collections |
| for/else | No | Search with not-found fallback |
| while | No | Indeterminate iteration |
| given/otherwise | Yes | Declarative condition chains |
| guard ... else | No | Early-exit preconditions |
| unless | No | Negative-condition readability |
Choose the construct that best communicates intent. Use guard for validation
at function boundaries, given for multi-criteria classification, match for
type-driven dispatch, and for/else when a search must report failure.
Error Handling
Biological data is messy. FASTA files contain malformed headers, VCF records have missing INFO fields, and remote databases time out. BioLang provides layered error-handling mechanisms so you can write code that degrades gracefully instead of crashing mid-pipeline.
try/catch Blocks
Wrap any code that might fail in try. If an error occurs, control transfers
to the catch block, where you can inspect the error and decide how to
proceed.
let records = try {
read_vcf("variants.vcf")
} catch e {
print("Failed to parse VCF: " + e.message)
[]
}
try/catch is an expression – it returns the value of whichever branch
executes:
let depth = try {
let info = parse_info(variant.info_str)
int(info["DP"])
} catch e {
0 # default when DP is absent or unparseable
}
Catching Specific Errors
You can pattern-match on error types inside catch:
try {
let result = fetch_uniprot(accession)
process_protein(result)
} catch e {
match e.kind {
"network_error" => {
print("Network unreachable, using cached data")
load_cache("uniprot", accession)
}
"parse_error" => {
print("Malformed response for " + accession + ", skipping")
None
}
_ => {
print("Unexpected error: " + e.message)
None
}
}
}
Error Propagation
When a function cannot handle an error meaningfully, let it propagate to the caller.
fn load_reference(path) {
guard file_exists(path) else {
error("Reference file not found: " + path)
}
let records = read_fasta(path)
guard len(records) > 0 else {
error("Reference file is empty: " + path)
}
records
}
# Caller decides how to handle it
try {
let ref_seqs = load_reference("hg38.fa")
run_alignment(reads, ref_seqs)
} catch e {
print("Pipeline aborted: " + e.message)
}
The error() function raises an exception that unwinds to the nearest catch.
retry
Network calls to biological databases are unreliable. The retry construct
repeats a block up to a specified number of attempts, with configurable delay
between tries.
let gene_info = retry(3, delay: 2000) {
fetch_ncbi_gene("BRCA1")
}
If all attempts fail, the last error propagates. Combine retry with try/catch
for a fully defensive pattern:
let annotation = try {
retry(5, delay: 1000) {
fetch_ensembl_annotation("ENSG00000139618")
}
} catch e {
print("Ensembl unavailable after 5 attempts: " + e.message)
{gene_name: "BRCA2", source: "fallback_cache"}
}
Exponential Backoff
For high-traffic APIs, increase delay between retries using a helper:
fn fetch_with_backoff(url, max_attempts: 5) {
let attempt = 0
let result = None
while attempt < max_attempts {
try {
result = http_get(url)
break
} catch e {
attempt = attempt + 1
if attempt >= max_attempts {
error("All " + str(max_attempts) + " attempts failed: " + e.message)
}
let wait = 1000 * pow(2, attempt - 1) # 1s, 2s, 4s, 8s with the default five attempts
print("Attempt " + str(attempt) + " failed, retrying in " + str(wait) + "ms")
sleep(wait)
}
}
result
}
let blast_result = fetch_with_backoff("https://blast.ncbi.nlm.nih.gov/blast/Blast.cgi?RID=" + rid)
Null Coalescing: ??
The ?? operator returns the left-hand side if it is not None; otherwise it
returns the right-hand side. This is indispensable for fields that may be
absent.
let depth = variant.info?.DP ?? 0
let gene_name = annotation?.gene_name ?? "unknown"
let af = variant.info?.AF ?? variant.info?.MAF ?? 0.0
?? chains naturally. The first non-None value wins:
let symbol = record.hugo_symbol ?? record.gene_id ?? record.locus_tag ?? "uncharacterised"
Optional Chaining: ?.
The ?. operator short-circuits a field access chain when any intermediate
value is None. Instead of crashing, the entire expression evaluates to None.
# Without optional chaining -- crashes if info is None or CLNSIG is absent
let sig = variant.info.CLNSIG
# With optional chaining -- returns None safely
let sig = variant.info?.CLNSIG
# Deep chains
let protein_change = variant.annotation?.consequences?.protein?.hgvs
Combine with ?? for a complete default:
let consequence = variant.annotation?.consequences?.most_severe ?? "unknown"
Defensive Patterns for Bio Data
Missing Fields in Records
Biological file formats are under-specified. A BED file might have 3 columns or 12. A VCF INFO field might omit half the keys. Build defensive accessors.
fn safe_info(variant, key, default: None) {
try {
let info = parse_info(variant.info_str)
info[key] ?? default
} catch e {
default
}
}
let dp = safe_info(v, "DP", default: 0)
let af = safe_info(v, "AF", default: 0.0)
let clnsig = safe_info(v, "CLNSIG", default: "not_reported")
Malformed Records
Skip bad records instead of aborting the whole file:
fn robust_parse(lines, parser) {
let good = []
let bad = 0
for (i, line) in enumerate(lines) {
try {
good = good + [parser(line)]
} catch e {
bad = bad + 1
print("Skipped line " + str(i + 1) + ": " + e.message)
}
}
if bad > 0 {
print("Warning: " + str(bad) + " malformed records skipped")
}
good
}
Nil-safe Aggregation
When computing statistics over optional fields, filter out None values first:
let qualities = variants
|> map(|v| v.qual)
|> filter(|q| q != None)
let mean_qual = if len(qualities) > 0 { mean(qualities) } else { 0.0 }
Example: Robust FASTA Parser with try/catch
Parse a FASTA file that may contain corrupted entries mixed with valid ones.
fn parse_fasta_robust(path) {
let raw = read_lines(path)
let records = []
let current_header = None
let current_seq = ""
let errors = []
for line in raw {
if starts_with(line, ">") {
# Flush previous record
if current_header != None {
try {
guard len(current_seq) > 0 else {
error("Empty sequence for " + current_header)
}
guard is_valid_dna(current_seq) else {
error("Invalid characters in " + current_header)
}
records = records + [{header: current_header, seq: current_seq}]
} catch e {
errors = errors + [{header: current_header, reason: e.message}]
}
}
current_header = slice(line, 1, len(line)) |> trim()
current_seq = ""
} else {
current_seq = current_seq + trim(line)
}
}
# Flush last record
if current_header != None {
try {
guard len(current_seq) > 0 else { error("Empty sequence") }
guard is_valid_dna(current_seq) else { error("Invalid characters") }
records = records + [{header: current_header, seq: current_seq}]
} catch e {
errors = errors + [{header: current_header, reason: e.message}]
}
}
print("Parsed " + str(len(records)) + " records, " + str(len(errors)) + " errors")
for err in errors {
print(" SKIP: " + err.header + " (" + err.reason + ")")
}
records
}
let sequences = parse_fasta_robust("mixed_quality.fasta")
Example: Retrying API Calls to NCBI
Query NCBI’s E-utilities for gene summaries, handling the rate limit and transient failures that are common on public APIs.
fn fetch_gene_summaries(gene_ids) {
let results = []
for id in gene_ids {
let summary = try {
retry(3, delay: 1500) {
let url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
+ "?db=gene&id=" + str(id) + "&retmode=json"
let resp = http_get(url)
let data = parse_json(resp.body)
guard data?.result != None else {
error("No result field in response")
}
data.result[str(id)]
}
} catch e {
print("Failed to fetch gene " + str(id) + ": " + e.message)
{gene_id: id, name: "unknown", description: "fetch_failed"}
}
results = results + [summary]
# Respect NCBI rate limit: max 3 requests per second without API key
sleep(400)
}
results
}
let gene_ids = [672, 675, 7157, 4609] # BRCA1, BRCA2, TP53, MYC
let summaries = fetch_gene_summaries(gene_ids)
for s in summaries {
print(str(s.gene_id ?? s.uid) + ": " + (s.name ?? "unknown") + " - " + (s.description ?? "N/A"))
}
Example: Processing VCF with Missing INFO Fields
Real VCF files have inconsistent INFO columns. Some records carry AF, others
do not. Some have CLNSIG, most do not. Use ?. and ?? to process them
uniformly.
let variants = read_vcf("clinvar.vcf")
let annotated = variants |> map(|v| {
let info = parse_info(v.info_str)
let af = info?.AF ?? info?.CAF ?? 0.0
let clinical_sig = info?.CLNSIG ?? "not_reported"
let gene = if info?.GENEINFO != None { split(info.GENEINFO, ":") |> first() } else { "intergenic" }
let review_status = info?.CLNREVSTAT ?? "no_assertion"
let origin = info?.ORIGIN ?? "unknown"
{
chrom: v.chrom,
pos: v.pos,
ref: v.ref,
alt: v.alt,
gene: gene,
clinical_significance: clinical_sig,
allele_frequency: af,
review_status: review_status,
origin: origin
}
})
# Filter to pathogenic variants with at least two-star review
let pathogenic = annotated
|> filter(|v| v.clinical_significance == "Pathogenic"
|| v.clinical_significance == "Likely_pathogenic")
|> filter(|v| v.review_status != "no_assertion")
print("Pathogenic/likely pathogenic variants: " + str(len(pathogenic)))
for v in pathogenic |> take(10) {
print(v.gene + " " + v.chrom + ":" + str(v.pos)
+ " " + v.ref + ">" + v.alt
+ " AF=" + str(v.allele_frequency))
}
# Write a clean report
write_csv(pathogenic, "pathogenic_report.csv")
Summary
| Mechanism | Syntax | Purpose |
|---|---|---|
| try/catch | try { } catch e { } | Graceful failure handling |
| error() | error("msg") | Raise an exception |
| retry | retry(n, delay: ms) { } | Transient-failure recovery |
| ?? | val ?? default | None substitution |
| ?. | obj?.field | Safe field access |
| guard | guard cond else { } | Precondition assertion |
Layer these mechanisms: guard at function entry to reject invalid inputs,
?. and ?? for data access, try/catch around I/O and parsing, and retry
for network calls. Together they produce pipelines that finish with partial
results instead of crashing with none.
File I/O
Bioinformatics begins and ends with files. Sequencers emit FASTQ, aligners produce BAM, callers output VCF, annotators read GFF. BioLang provides first-class readers and writers for every major format, with consistent interfaces and stream support for files that exceed available memory.
Sample data: BioLang ships with sample files in examples/sample-data/ covering FASTA, FASTQ, VCF, BED, GFF, CSV, TSV, and SAM formats. You can substitute these paths wherever the examples below use bare filenames. Run bl run examples/quickstart.bl to verify all files are present.
Reading FASTA
read_fasta() returns a list of records, each with header and seq fields.
let contigs = read_fasta("assembly.fasta")
for c in contigs {
print(c.header + ": " + str(len(c.seq)) + " bp")
}
# Find the longest contig
let longest = contigs |> sort(|a, b| len(b.seq) - len(a.seq)) |> first()
print("Longest: " + longest.header + " (" + str(len(longest.seq)) + " bp)")
FASTA records carry the raw sequence as a string. Use bio literals for
compile-time validated sequences, and read_fasta for runtime file data.
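To move between the two worlds, a runtime string can be promoted to a typed sequence with the dna() constructor (the same one used in the memoized k-mer example earlier). A small sketch — the assumption here is that dna() rejects invalid characters at runtime rather than at compile time:
let contigs = read_fasta("assembly.fasta")
let first = dna(contigs[0].seq) # promote runtime string to a DNA value
let rc = reverse_complement(first)
print("GC: " + str(gc_content(first)))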
Reading FASTQ
read_fastq() returns records with id, seq, quality, and comment
fields. Quality scores are integer Phred values.
let reads = read_fastq("sample_R1.fastq.gz")
let gc_mean = reads |> map(|r| gc_content(r.seq)) |> mean()
let summary = {
total: len(reads),
mean_length: reads |> map(|r| len(r.seq)) |> mean(),
mean_qual: reads |> map(|r| mean(r.quality)) |> mean(),
gc_pct: gc_mean * 100
}
print("Reads: " + str(summary.total))
print("Mean length: " + str(round(summary.mean_length, 1)))
print("Mean quality: " + str(round(summary.mean_qual, 1)))
print("GC%: " + str(round(summary.gc_pct, 1)))
Compressed files (.gz) are decompressed transparently.
Reading VCF
read_vcf() returns variant records with fields chrom, pos, id, ref,
alt, qual, filter, info_str, and genotypes.
let variants = read_vcf("gatk_output.vcf")
let passing = variants |> filter(|v| v.filter == "PASS")
let snps = passing |> filter(|v| is_snp(v))
let indels = passing |> filter(|v| is_indel(v))
print("PASS variants: " + str(len(passing)))
print(" SNPs: " + str(len(snps)))
print(" Indels: " + str(len(indels)))
Reading BED
read_bed() returns interval records with chrom, start, end, and
optional name, score, strand fields depending on the number of columns.
let targets = read_bed("exome_targets.bed")
let total_bp = targets |> map(|t| t.end - t.start) |> reduce(0, |a, b| a + b)
print("Total target region: " + str(total_bp) + " bp")
print("Number of targets: " + str(len(targets)))
# Group by chromosome
let by_chrom = targets |> group_by(|t| t.chrom)
for (chrom, regions) in by_chrom {
let bp = regions |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
print(chrom + ": " + str(len(regions)) + " targets, " + str(bp) + " bp")
}
Reading GFF/GTF
read_gff() parses GFF3 and GTF formats. Records have chrom, source,
type, start, end, score, strand, phase, and attributes (a map).
let features = read_gff("gencode.v44.annotation.gff3")
let genes = features |> filter(|f| f.type == "gene")
let protein_coding = genes |> filter(|g| g.attributes.gene_type == "protein_coding")
print("Total genes: " + str(len(genes)))
print("Protein-coding: " + str(len(protein_coding)))
# Gene length distribution
let lengths = protein_coding |> map(|g| g.end - g.start)
print("Median gene length: " + str(median(lengths)) + " bp")
print("Mean gene length: " + str(round(mean(lengths), 0)) + " bp")
Reading CSV and TSV
csv() and tsv() return tables where column headers become field names.
let metadata = csv("sample_metadata.csv")
# Group samples by condition
let groups = metadata |> group_by(|row| row.condition)
for (condition, samples) in groups {
let ids = samples |> map(|s| s.sample_id) |> join(", ")
print(condition + ": " + ids)
}
# DESeq2-style count matrix
let counts = tsv("gene_counts.tsv")
let gene_names = counts |> map(|row| row.gene_id)
print("Genes in count matrix: " + str(len(gene_names)))
Writing Files
Writer functions take a list of records and an output path. The format is determined by the function name.
# Write FASTA
let filtered = contigs |> filter(|c| len(c.seq) >= 500)
write_fasta(filtered, "assembly_filtered.fasta")
# Write FASTQ
let trimmed = reads |> map(|r| trim_quality(r, 20, 0))
write_fastq(trimmed, "trimmed_R1.fastq.gz")
# Write CSV
let stats_table = samples |> map(|s| {
sample: s.id,
total_reads: s.read_count,
pct_mapped: round(s.mapped / s.read_count * 100, 2),
mean_depth: round(s.depth, 1)
})
write_csv(stats_table, "qc_summary.csv")
The .gz extension triggers gzip compression automatically.
Stream-Based Reading
For files too large to fit in memory – a 300 GB whole-genome BAM, a multi-sample VCF with millions of records – use streaming. Stream readers return a lazy sequence that is consumed once, record by record.
# Stream a large FASTQ without loading it all
let stream = read_fastq("wgs_sample.fastq.gz")
# Streams work with all HOFs -- processing is lazy
let high_qual_count = stream
|> filter(|r| mean(r.quality) >= 30)
|> filter(|r| len(r.seq) >= 100)
|> len()
print("High-quality long reads: " + str(high_qual_count))
Streams are consumed once. If you need multiple passes, collect into a list or read the file again:
# Two-pass: first pass counts, second pass filters
let total = read_fastq("reads.fastq.gz") |> len()
let passing = read_fastq("reads.fastq.gz")
|> filter(|r| mean(r.quality) >= 20)
|> len()
print("Pass rate: " + str(round(passing / total * 100, 2)) + "%")
Example: Convert FASTQ to FASTA
Strip quality scores from a FASTQ file to produce a FASTA for downstream assembly or BLAST.
let reads = read_fastq("nanopore_reads.fastq.gz")
let fasta_records = reads |> map(|r| {
header: r.id + " " + (r.description ?? ""),
seq: r.seq
})
write_fasta(fasta_records, "nanopore_reads.fasta")
print("Converted " + str(len(fasta_records)) + " reads to FASTA")
For large files, this streams through without holding everything in memory
because map on a stream produces a stream.
Example: Extract Coding Sequences from GFF + FASTA
Combine a genome FASTA with GFF annotation to extract CDS sequences for each protein-coding gene.
let genome = read_fasta("GRCh38.fasta")
let features = read_gff("gencode.v44.annotation.gff3")
# Build a lookup from chromosome name to sequence
let chrom_seqs = {}
for contig in genome {
let name = split(contig.header, " ") |> first()
chrom_seqs = {...chrom_seqs, [name]: contig.seq}
}
# Collect CDS exons grouped by transcript
let cds_features = features
|> filter(|f| f.type == "CDS")
|> group_by(|f| f.attributes.transcript_id)
let cds_records = []
for (tx_id, exons) in cds_features {
let sorted_exons = exons |> sort(|a, b| a.start - b.start)
let chrom = sorted_exons[0].chrom
let strand = sorted_exons[0].strand
guard chrom_seqs[chrom] != None else { continue }
let seq = sorted_exons
|> map(|e| slice(chrom_seqs[chrom], e.start - 1, e.end))
|> join("")
let final_seq = if strand == "-" { reverse_complement(seq) } else { seq }
cds_records = cds_records + [{
header: tx_id + " " + chrom + " " + strand,
seq: final_seq
}]
}
write_fasta(cds_records, "coding_sequences.fasta")
print("Extracted " + str(len(cds_records)) + " coding sequences")
Example: Filter VCF by Quality and Write Passing Variants
Apply multiple quality filters to a VCF and write both passing and failing records to separate files.
let variants = read_vcf("raw_calls.vcf")
let results = variants |> map(|v| {
let info = parse_info(v.info_str)
let dp = info?.DP ?? 0
let mq = info?.MQ ?? 0.0
let af = info?.AF ?? 0.0
let pass = v.qual >= 30.0
&& dp >= 10
&& mq >= 40.0
&& v.filter != "LowQual"
{...v, qc_pass: pass, depth: dp, map_qual: mq, allele_freq: af}
})
let passing = results |> filter(|v| v.qc_pass)
let failing = results |> filter(|v| !v.qc_pass)
write_vcf(passing, "filtered_PASS.vcf")
write_vcf(failing, "filtered_FAIL.vcf")
print("Total: " + str(len(results)))
print("PASS: " + str(len(passing)))
print("FAIL: " + str(len(failing)))
# Summarise failure reasons
let low_qual = failing |> filter(|v| v.qual < 30.0) |> len()
let low_depth = failing |> filter(|v| v.depth < 10) |> len()
let low_mq = failing |> filter(|v| v.map_qual < 40.0) |> len()
print("Reasons (not mutually exclusive):")
print(" Low QUAL (<30): " + str(low_qual))
print(" Low depth (<10): " + str(low_depth))
print(" Low MQ (<40): " + str(low_mq))
Example: Merge BED Files and Compute Coverage
Load target regions from multiple BED files, merge overlapping intervals, and compute total coverage.
let bed_files = [
"targets_panel_v1.bed",
"targets_panel_v2.bed",
"custom_hotspots.bed"
]
# Read and concatenate all regions
let all_regions = bed_files
|> flat_map(|path| {
let regions = read_bed(path)
print("Loaded " + str(len(regions)) + " regions from " + path)
regions
})
print("Total regions before merge: " + str(len(all_regions)))
# Sort by chromosome and start position
let sorted = all_regions
|> sort(|a, b| {
if a.chrom != b.chrom {
compare(a.chrom, b.chrom)
} else {
a.start - b.start
}
})
# Merge overlapping intervals
fn merge_intervals(intervals) {
guard len(intervals) > 0 else { return [] }
let merged = [intervals[0]]
for region in intervals |> drop(1) {
let last_region = merged |> last()
if region.chrom == last_region.chrom && region.start <= last_region.end {
# Overlapping -- extend the current interval
let new_end = max(last_region.end, region.end)
merged = merged |> take(len(merged) - 1)
merged = merged + [{...last_region, end: new_end}]
} else {
merged = merged + [region]
}
}
merged
}
let merged = merge_intervals(sorted)
print("Regions after merge: " + str(len(merged)))
let total_bp = merged |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
print("Total coverage: " + str(total_bp) + " bp (" + str(round(total_bp / 1e6, 2)) + " Mb)")
# Per-chromosome summary
let by_chrom = merged |> group_by(|r| r.chrom)
for (chrom, regions) in by_chrom |> sort(|a, b| compare(a.0, b.0)) {
let bp = regions |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
print(" " + chrom + ": " + str(len(regions)) + " regions, " + str(bp) + " bp")
}
# Write merged BED
write_bed(merged, "merged_targets.bed")
Format Reference
| Function | Format | Returns |
|---|---|---|
| read_fasta(path) | FASTA | [{header, seq}] |
| read_fastq(path) | FASTQ | [{id, seq, quality, comment}] |
| read_vcf(path) | VCF | [{chrom, pos, id, ref, alt, qual, filter, info_str, genotypes}] |
| read_bed(path) | BED | [{chrom, start, end, name?, score?, strand?}] |
| read_gff(path) | GFF3/GTF | [{chrom, source, type, start, end, score, strand, phase, attributes}] |
| csv(path) | CSV | [{col1: val, col2: val, ...}] |
| tsv(path) | TSV | [{col1: val, col2: val, ...}] |
| write_fasta(records, path) | FASTA | writes file |
| write_fastq(records, path) | FASTQ | writes file |
| write_vcf(records, path) | VCF | writes file |
| write_bed(records, path) | BED | writes file |
| write_csv(records, path) | CSV | writes file |
All readers accept .gz paths and decompress automatically. All writers
compress when the output path ends in .gz.
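For example, compression is driven purely by the file extension — the same reader and writer calls handle both plain and gzipped files:

```
# Reads gzipped FASTQ transparently; the output is compressed
# because the target path ends in .gz
read_fastq("sample_R1.fastq.gz")
|> filter(|r| mean(r.quality) > 20)
|> write_fastq("sample_R1.hq.fastq.gz")
```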
Genomic Intervals and Variants
Two data types sit at the heart of genomic analysis: intervals (regions on a chromosome) and variants (differences from a reference). BioLang provides built-in types and operations for both, including interval trees for fast overlap queries and variant classification functions that handle the full spectrum from SNPs to complex structural events.
Genomic Intervals
An interval represents a contiguous region on a chromosome. Create one with
the interval() constructor.
let promoter = interval("chr1", 11869, 14409)
let exon = interval("chr1", 11869, 12227, strand: "+")
Interval Fields
Every interval has chrom, start, and end. Optional fields include
strand, name, and score.
let region = interval("chr7", 55181000, 55182000, strand: "+", name: "EGFR_exon20")
print(region.chrom) # chr7
print(region.start) # 55181000
print(region.end) # 55182000
print(region.strand) # +
print(region.name) # EGFR_exon20
let width = region.end - region.start
print("Width: " + str(width) + " bp")
BioLang uses 0-based, half-open coordinates (BED convention) throughout.
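Concretely, under this convention the width is always `end - start`, and adjacent intervals share a boundary coordinate without overlapping:

```
let first_100 = interval("chr1", 0, 100)   # covers the first 100 bases
let next_100 = interval("chr1", 100, 200)  # begins where the previous ends
print(first_100.end - first_100.start)     # 100
print(overlaps(first_100, next_100))       # false -- half-open, no shared base
```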
Interval Arithmetic
Test relationships between intervals directly.
let a = interval("chr1", 100, 200)
let b = interval("chr1", 150, 250)
let c = interval("chr1", 300, 400)
# Overlap test
print(overlaps(a, b)) # true
print(overlaps(a, c)) # false
# Containment
let outer = interval("chr1", 100, 500)
print(contains(outer, c)) # true
# Distance between non-overlapping intervals
print(distance(a, c)) # 100
# Merge overlapping intervals into one
let merged = merge_interval(a, b)
# interval("chr1", 100, 250)
# Intersection
let common = intersect(a, b)
# interval("chr1", 150, 200)
Interval Trees
When you have thousands of intervals and need to query overlaps repeatedly, build an interval tree. Construction is O(n log n); each query is O(log n + k) where k is the number of hits.
let exons = read_bed("gencode_exons.bed")
|> map(|e| interval(e.chrom, e.start, e.end, name: e.name))
let tree = interval_tree(exons)
query_overlaps
Find all intervals in the tree that overlap a query region.
let query = interval("chr17", 43044294, 43044295) # single position in BRCA1
let hits = query_overlaps(tree, query)
for hit in hits {
print("Overlaps: " + hit.name + " " + str(hit.start) + "-" + str(hit.end))
}
query_nearest
Find the closest interval to a query, even if there is no overlap.
let snp_pos = interval("chr7", 55181378, 55181379)
let nearest = query_nearest(tree, snp_pos)
print("Nearest feature: " + nearest.name
+ " at distance " + str(distance(snp_pos, nearest)) + " bp")
coverage
Compute per-base coverage across a set of intervals – how many intervals cover each position.
let reads_as_intervals = aligned_reads
|> map(|r| interval(r.chrom, r.start, r.end))
let tree = interval_tree(reads_as_intervals)
let target = interval("chr17", 43044000, 43045000)
let cov = coverage(tree, target)
print("Mean depth over target: " + str(mean(cov)))
print("Bases >= 30x: " + str(cov |> filter(|d| d >= 30) |> len()))
Variants
A variant represents a difference from the reference genome. Create one with
the variant() constructor.
let snp = variant("chr7", 55181378, "T", "A") # EGFR T790M
let del = variant("chr17", 43045684, "TCAA", "T") # BRCA1 deletion
let ins = variant("chr2", 29416089, "A", "AGCTGCTG") # ALK insertion
Variant Classification
BioLang provides functions to classify variants by their molecular type.
let v = variant("chr7", 55181378, "T", "A")
print(is_snp(v)) # true
print(is_indel(v)) # false
print(is_transition(v)) # false (T>A is transversion)
print(is_transversion(v)) # true
print(variant_type(v)) # "Snp"
The full classification set:
let examples = [
variant("chr1", 100, "A", "G"), # transition (purine to purine)
variant("chr1", 200, "A", "T"), # transversion
variant("chr1", 300, "ACG", "A"), # deletion
variant("chr1", 400, "A", "ATCG"), # insertion
variant("chr1", 500, "ACG", "TGA"), # MNP (multi-nucleotide polymorphism)
]
for v in examples {
let vtype = variant_type(v)
let detail = given {
is_snp(v) && is_transition(v) => "transition"
is_snp(v) && is_transversion(v) => "transversion"
is_indel(v) => "indel (" + str(abs(len(v.alt) - len(v.ref))) + "bp)"
otherwise => vtype
}
print(v.chrom + ":" + str(v.pos) + " " + v.ref + ">" + v.alt + " => " + detail)
}
Genotype Queries
When a variant carries genotype information (from a VCF), query zygosity.
let variants = read_vcf("sample.vcf")
let het_snps = variants
|> filter(|v| is_snp(v) && is_het(v))
let hom_alt = variants
|> filter(|v| is_hom_alt(v))
print("Heterozygous SNPs: " + str(len(het_snps)))
print("Homozygous alt calls: " + str(len(hom_alt)))
# Allele balance for heterozygous calls
let allele_balances = het_snps |> map(|v| {
let info = parse_vcf_info(v.info_str)
let ad = info?.AD ?? [0, 0]
let total = ad |> reduce(0, |a, b| a + b)
if total > 0 { ad[1] / total } else { 0.0 }
})
print("Median allele balance (het): " + str(round(median(allele_balances), 3)))
parse_vcf_info
The VCF INFO column packs key-value pairs into a semicolon-delimited string.
parse_vcf_info() turns it into a record.
let info_str = "DP=42;MQ=60.0;AF=0.48;CLNSIG=Pathogenic;GENEINFO=BRCA1:672"
let info = parse_vcf_info(info_str)
print(info.DP) # 42
print(info.AF) # 0.48
print(info.CLNSIG) # "Pathogenic"
print(info.GENEINFO) # "BRCA1:672"
Example: Find Genes Overlapping ChIP-seq Peaks
Identify which genes have promoter regions that overlap with transcription factor binding peaks from a ChIP-seq experiment.
# Load gene annotations
let genes = read_gff("gencode.v44.annotation.gff3")
|> filter(|f| f.type == "gene" && f.attributes.gene_type == "protein_coding")
# Define promoter regions: 2kb upstream of TSS
let promoters = genes |> map(|g| {
let tss = if g.strand == "+" { g.start } else { g.end }
let prom_start = if g.strand == "+" { tss - 2000 } else { tss }
let prom_end = if g.strand == "+" { tss } else { tss + 2000 }
interval(g.chrom, max(0, prom_start), prom_end,
name: g.attributes.gene_name)
})
# Load ChIP-seq peaks
let peaks = read_bed("H3K27ac_peaks.bed")
|> map(|p| interval(p.chrom, p.start, p.end, name: p.name ?? "peak"))
# Build interval tree from promoters for fast lookup
let promoter_tree = interval_tree(promoters)
# Query each peak against promoters
let genes_with_peaks = []
for peak in peaks {
let hits = query_overlaps(promoter_tree, peak)
for hit in hits {
genes_with_peaks = genes_with_peaks + [{
gene: hit.name,
peak_chrom: peak.chrom,
peak_start: peak.start,
peak_end: peak.end,
overlap_bp: intersect(peak, hit) |> width()
}]
}
}
# Deduplicate by gene name
let unique_genes = genes_with_peaks
|> map(|g| g.gene)
|> unique()
print("ChIP-seq peaks: " + str(len(peaks)))
print("Genes with H3K27ac at promoter: " + str(len(unique_genes)))
# Top genes by peak overlap size
let top = genes_with_peaks
|> sort(|a, b| b.overlap_bp - a.overlap_bp)
|> take(20)
for entry in top {
print(entry.gene + ": " + str(entry.overlap_bp) + " bp overlap at "
+ entry.peak_chrom + ":" + str(entry.peak_start) + "-" + str(entry.peak_end))
}
write_csv(genes_with_peaks, "genes_with_h3k27ac_peaks.csv")
Example: Classify Variants and Generate a Ti/Tv Report
Transition/transversion ratio (Ti/Tv) is a key quality metric for SNP calling. Whole-genome sequencing typically yields Ti/Tv around 2.0-2.1; exome around 2.8-3.0. Deviations suggest systematic errors.
let variants = read_vcf("deepvariant_output.vcf")
# Keep only PASS SNPs
let pass_snps = variants
|> filter(|v| v.filter == "PASS" && is_snp(v))
# Classify each SNP
let classified = pass_snps |> map(|v| {
let cls = if is_transition(v) { "Ti" } else { "Tv" }
let change = v.ref + ">" + v.alt
{chrom: v.chrom, pos: v.pos, change: change, class: cls, qual: v.qual}
})
let ti_count = classified |> filter(|v| v.class == "Ti") |> len()
let tv_count = classified |> filter(|v| v.class == "Tv") |> len()
let ratio = if tv_count > 0 { ti_count / tv_count } else { 0.0 }
print("Transitions: " + str(ti_count))
print("Transversions: " + str(tv_count))
print("Ti/Tv ratio: " + str(round(ratio, 3)))
# Per-chromosome Ti/Tv
let by_chrom = classified |> group_by(|v| v.chrom)
let chrom_report = by_chrom |> map(|group| {
let chrom = group.0
let vars = group.1
let ti = vars |> filter(|v| v.class == "Ti") |> len()
let tv = vars |> filter(|v| v.class == "Tv") |> len()
{
chrom: chrom,
ti: ti,
tv: tv,
ratio: if tv > 0 { round(ti / tv, 3) } else { 0.0 },
total: ti + tv
}
})
let sorted_report = chrom_report |> sort(|a, b| compare(a.chrom, b.chrom))
print("\nPer-chromosome Ti/Tv:")
print("Chrom\tTi\tTv\tRatio\tTotal")
for row in sorted_report {
print(row.chrom + "\t" + str(row.ti) + "\t" + str(row.tv)
+ "\t" + str(row.ratio) + "\t" + str(row.total))
}
# Substitution spectrum: count each of the 12 possible changes
let spectrum = classified
|> group_by(|v| v.change)
|> map(|g| {change: g.0, count: len(g.1)})
|> sort(|a, b| b.count - a.count)
print("\nSubstitution spectrum:")
for entry in spectrum {
let bar = str_repeat("*", entry.count / 100)
print(entry.change + "\t" + str(entry.count) + "\t" + bar)
}
write_csv(sorted_report, "titv_per_chromosome.csv")
write_csv(spectrum, "substitution_spectrum.csv")
Example: Annotate Variants with Overlapping Regulatory Regions
Given a set of variants and a regulatory region BED file (e.g., ENCODE cCREs), annotate each variant with the regulatory elements it falls within.
# Load regulatory regions from ENCODE
let regulatory = read_bed("ENCODE_cCREs.bed")
|> map(|r| interval(r.chrom, r.start, r.end, name: r.name ?? "cCRE"))
let reg_tree = interval_tree(regulatory)
# Load variants
let variants = read_vcf("gwas_hits.vcf")
# Annotate each variant with overlapping regulatory regions
let annotated = variants |> map(|v| {
let pos_interval = interval(v.chrom, v.pos - 1, v.pos + len(v.ref) - 1)
let overlapping = query_overlaps(reg_tree, pos_interval)
let nearest = query_nearest(reg_tree, pos_interval)
let reg_names = overlapping |> map(|r| r.name) |> join(";")
let nearest_name = nearest?.name ?? "none"
let nearest_dist = if len(overlapping) > 0 {
0
} else {
distance(pos_interval, nearest)
}
{
chrom: v.chrom,
pos: v.pos,
ref: v.ref,
alt: v.alt,
in_regulatory: len(overlapping) > 0,
regulatory_elements: if len(reg_names) > 0 { reg_names } else { "none" },
num_overlapping: len(overlapping),
nearest_element: nearest_name,
distance_to_nearest: nearest_dist
}
})
# Summary statistics
let in_reg = annotated |> filter(|v| v.in_regulatory) |> len()
let total = len(annotated)
print("Variants in regulatory regions: " + str(in_reg) + "/" + str(total)
+ " (" + str(round(in_reg / total * 100, 1)) + "%)")
# Distribution of distance to nearest regulatory element
let distances = annotated
|> filter(|v| !v.in_regulatory)
|> map(|v| v.distance_to_nearest)
print("Non-regulatory variants:")
print(" Median distance to nearest cCRE: " + str(median(distances)) + " bp")
print(" Within 1kb of cCRE: " + str(distances |> filter(|d| d <= 1000) |> len()))
print(" Within 10kb of cCRE: " + str(distances |> filter(|d| d <= 10000) |> len()))
# Variants overlapping multiple regulatory elements
let multi = annotated |> filter(|v| v.num_overlapping > 1)
print("\nVariants in multiple regulatory regions: " + str(len(multi)))
for v in multi |> take(10) {
print(" " + v.chrom + ":" + str(v.pos) + " overlaps " + str(v.num_overlapping)
+ " elements: " + v.regulatory_elements)
}
write_csv(annotated, "variants_regulatory_annotation.csv")
Summary
Interval Operations
| Function | Description |
|---|---|
| `interval(chrom, start, end)` | Create an interval |
| `overlaps(a, b)` | Test for overlap |
| `contains(a, b)` | Test containment |
| `distance(a, b)` | Gap between non-overlapping intervals |
| `intersect(a, b)` | Overlapping portion |
| `merge_interval(a, b)` | Union of overlapping pair |
| `interval_tree(list)` | Build tree for fast queries |
| `query_overlaps(tree, q)` | All intervals overlapping q |
| `query_nearest(tree, q)` | Closest interval to q |
| `coverage(tree, region)` | Per-base depth array |
Variant Operations
| Function | Description |
|---|---|
| `variant(chrom, pos, ref, alt)` | Create a variant |
| `variant_type(v)` | "Snp", "Indel", "Mnp", "Other" |
| `is_snp(v)` | Single nucleotide change |
| `is_indel(v)` | Insertion or deletion |
| `is_transition(v)` | Purine-purine or pyrimidine-pyrimidine |
| `is_transversion(v)` | Purine-pyrimidine or vice versa |
| `is_het(v)` | Heterozygous genotype |
| `is_hom_ref(v)` | Homozygous reference |
| `is_hom_alt(v)` | Homozygous alternate |
| `parse_vcf_info(str)` | Parse INFO column to record |
Intervals and variants are the coordinate system of genomics. Master these types and you can express peak-calling, variant annotation, coverage analysis, and regulatory overlap queries concisely and efficiently.
Chapter 11: Pipelines
Bioinformatics workflows are rarely a single step. A typical analysis chains together quality control, alignment, variant calling, filtering, and annotation into a directed acyclic graph of stages. BioLang provides first-class pipeline blocks that make these multi-stage workflows explicit, reproducible, and parallel-aware.
Pipeline Blocks
A pipeline block declares a named workflow composed of stage blocks. Stages
execute sequentially by default, and each stage can reference the output of the
previous stage.
pipeline fastq_qc {
stage raw_stats {
read_fastq("sample_R1.fastq.gz")
|> fastq_stats()
}
stage filter {
read_fastq("sample_R1.fastq.gz")
|> filter(|r| mean(r.quality) >= 30)
|> write_fastq("sample_R1.filtered.fastq.gz")
}
stage filtered_stats {
read_fastq("sample_R1.filtered.fastq.gz")
|> fastq_stats()
}
}
When you invoke fastq_qc, BioLang runs raw_stats, then filter, then
filtered_stats in order. The return value of the pipeline is the result of
the last stage.
Passing Data Between Stages
Stages can capture the result of a previous stage by name. The stage name becomes a binding available to all subsequent stages.
pipeline align_and_sort {
stage alignment {
let ref = "GRCh38.fa"
let r1 = "tumor_R1.fastq.gz"
let r2 = "tumor_R2.fastq.gz"
bwa_mem(ref, r1, r2, threads: 16)
}
stage sorted {
# alignment is the output path from the previous stage
samtools_sort(alignment, threads: 8)
}
stage indexed {
samtools_index(sorted)
sorted # return the sorted BAM path
}
}
Each stage name resolves to whatever its block evaluated to. This creates an implicit dependency chain without manual wiring.
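For comparison, here is a sketch of the same chain written as ordinary let-bindings, without a pipeline block (using the same `bwa_mem`/`samtools_*` wrappers as above) — stages simply make this wiring implicit and give each intermediate a name:

```
# Manual wiring: every intermediate must be bound and threaded by hand
let alignment = bwa_mem("GRCh38.fa", "tumor_R1.fastq.gz", "tumor_R2.fastq.gz", threads: 16)
let sorted = samtools_sort(alignment, threads: 8)
samtools_index(sorted)
```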
Parallel Execution
When you need to process multiple samples, parallel for fans out work across
available cores. BioLang schedules iterations concurrently, respecting system
resources.
let samples = [
{id: "tumor_01", r1: "t01_R1.fq.gz", r2: "t01_R2.fq.gz"},
{id: "tumor_02", r1: "t02_R1.fq.gz", r2: "t02_R2.fq.gz"},
{id: "tumor_03", r1: "t03_R1.fq.gz", r2: "t03_R2.fq.gz"},
{id: "normal_01", r1: "n01_R1.fq.gz", r2: "n01_R2.fq.gz"},
]
parallel for sample in samples {
let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 4)
let sorted = samtools_sort(bam, threads: 4)
samtools_index(sorted)
{id: sample.id, bam: sorted}
}
The result is a list of records, one per sample, collected in the order the iterations complete.
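Because completion order is nondeterministic, a downstream report may want a stable order. A minimal sketch, assuming `parallel for` can be used as an expression (as the collected-result behavior above implies) and reusing the `samples` list; the `.sorted.bam` paths are illustrative:

```
let results = parallel for sample in samples {
    {id: sample.id, bam: sample.id + ".sorted.bam"}  # hypothetical output path
}
# Re-sort by sample id so downstream reports are deterministic
let ordered = results |> sort(|a, b| compare(a.id, b.id))
```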
Stage Dependencies and DAG Execution
BioLang infers a dependency graph from which stage names appear in each block. Independent stages run concurrently when the runtime detects no data dependencies.
pipeline variant_qc {
stage aligned {
bwa_mem("GRCh38.fa", "sample_R1.fq.gz", "sample_R2.fq.gz")
|> samtools_sort(threads: 8)
}
# These two stages depend only on aligned, not on each other.
# BioLang runs them concurrently.
stage depth {
samtools_depth(aligned) |> mean()
}
stage flagstat {
samtools_flagstat(aligned)
}
# This stage depends on both depth and flagstat
stage report {
{
mean_depth: depth,
mapping_rate: flagstat.mapped_pct,
sample: "sample_01",
}
}
}
The runtime builds a DAG: aligned has no dependencies, depth and flagstat
both depend on aligned (and thus run in parallel once it finishes), and
report depends on both.
Defer for Cleanup
Intermediate files pile up fast. defer registers a cleanup action that runs
when the pipeline completes, whether it succeeds or fails. Multiple defer
blocks execute in reverse order (LIFO).
pipeline trimmed_alignment {
stage trim {
let trimmed_r1 = "tmp/trimmed_R1.fq.gz"
let trimmed_r2 = "tmp/trimmed_R2.fq.gz"
trim_adapters("sample_R1.fq.gz", "sample_R2.fq.gz",
out_r1: trimmed_r1, out_r2: trimmed_r2,
adapter: "AGATCGGAAGAG")
defer { remove(trimmed_r1); remove(trimmed_r2) }
{r1: trimmed_r1, r2: trimmed_r2}
}
stage align {
let bam = bwa_mem("GRCh38.fa", trim.r1, trim.r2, threads: 16)
let unsorted = bam
let sorted = samtools_sort(bam, threads: 8)
defer { remove(unsorted) }
sorted
}
stage mark_dups {
let deduped = gatk_mark_duplicates(align)
let metrics = align + ".dup_metrics"
defer { remove(metrics) }
deduped
}
}
When trimmed_alignment finishes, the deferred blocks fire: the duplicate
metrics file is removed first, then the unsorted BAM, then the trimmed FASTQs.
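A minimal sketch of the LIFO ordering in isolation — two `defer` blocks registered in one stage fire in reverse order when the pipeline completes:

```
pipeline defer_demo {
    stage only {
        defer { print("registered first, runs last") }
        defer { print("registered second, runs first") }
        "done"
    }
}
```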
Example: RNA-seq Processing Pipeline
A complete RNA-seq workflow from raw reads to normalized expression counts.
pipeline rnaseq {
stage qc {
let report = fastqc("rnaseq_R1.fq.gz", "rnaseq_R2.fq.gz",
out_dir: "qc/")
let stats = read_fastq("rnaseq_R1.fq.gz") |> fastq_stats()
defer { log("QC complete: " + str(stats.total_reads) + " reads") }
{report: report, stats: stats}
}
stage trim {
let trimmed = trim_adapters(
"rnaseq_R1.fq.gz", "rnaseq_R2.fq.gz",
out_r1: "trimmed_R1.fq.gz",
out_r2: "trimmed_R2.fq.gz",
quality: 20,
min_length: 36
)
defer { remove("trimmed_R1.fq.gz"); remove("trimmed_R2.fq.gz") }
trimmed
}
stage align {
star_align(
index: "star_index/GRCh38",
r1: "trimmed_R1.fq.gz",
r2: "trimmed_R2.fq.gz",
out_prefix: "aligned/rnaseq_",
threads: 16,
two_pass: true
)
}
stage quantify {
featurecounts(
bam: align,
annotation: "gencode.v44.gtf",
paired: true,
strand: "reverse",
threads: 8
)
}
stage normalize {
let raw_counts = read_tsv(quantify)
let col_medians = raw_counts
|> columns(exclude: ["gene_id", "gene_name"])
|> each(|col| median(col))
let size_factors = col_medians |> map(|m| median(col_medians) / m)
raw_counts
|> map_columns(exclude: ["gene_id", "gene_name"],
|col, i| col * size_factors[i])
|> write_tsv("normalized_counts.tsv")
}
}
Example: Variant Calling Pipeline
Germline variant calling from FASTQ to annotated VCF.
pipeline germline_variants {
stage align {
let ref = "GRCh38.fa"
bwa_mem(ref, "sample_R1.fq.gz", "sample_R2.fq.gz",
threads: 16, read_group: "@RG\\tID:S1\\tSM:sample")
|> samtools_sort(threads: 8)
|> samtools_index()
}
stage mark_dups {
gatk_mark_duplicates(align)
}
stage bqsr {
let known_sites = [
"dbsnp_156.vcf.gz",
"Mills_and_1000G.indels.vcf.gz",
]
gatk_base_recalibrator(mark_dups, ref: "GRCh38.fa",
known_sites: known_sites)
|> gatk_apply_bqsr(mark_dups, ref: "GRCh38.fa")
}
stage call {
gatk_haplotype_caller(bqsr, ref: "GRCh38.fa",
emit_ref_confidence: "GVCF",
intervals: "exome_targets.bed")
}
stage genotype {
gatk_genotype_gvcfs(call, ref: "GRCh38.fa")
}
stage filter {
let snps = gatk_select_variants(genotype, type: "SNP")
|> gatk_filter(filters: [
{name: "QD", expr: "QD < 2.0"},
{name: "FS", expr: "FS > 60.0"},
{name: "MQ", expr: "MQ < 40.0"},
])
let indels = gatk_select_variants(genotype, type: "INDEL")
|> gatk_filter(filters: [
{name: "QD", expr: "QD < 2.0"},
{name: "FS", expr: "FS > 200.0"},
])
bcftools_concat(snps, indels) |> bcftools_sort()
}
stage annotate {
ensembl_vep(filter,
species: "human",
assembly: "GRCh38",
cache_dir: "vep_cache/",
plugins: ["CADD", "SpliceAI"])
|> write("sample.annotated.vcf.gz")
}
}
Example: Multi-Sample Parallel Processing with Merge
Processing a cohort in parallel, then merging results in a final stage.
let cohort = read_tsv("sample_manifest.tsv")
|> map(|row| {
id: row.sample_id,
r1: row.fastq_r1,
r2: row.fastq_r2,
type: row.sample_type,
})
pipeline cohort_analysis {
stage per_sample {
parallel for sample in cohort {
let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2,
threads: 4,
read_group: "@RG\\tID:" + sample.id + "\\tSM:" + sample.id)
|> samtools_sort(threads: 4)
samtools_index(bam)
let gvcf = gatk_haplotype_caller(bam, ref: "GRCh38.fa",
emit_ref_confidence: "GVCF")
let stats = samtools_flagstat(bam)
let depth = samtools_depth(bam) |> mean()
{
id: sample.id,
bam: bam,
gvcf: gvcf,
mapped_pct: stats.mapped_pct,
mean_depth: depth,
}
}
}
stage qc_gate {
# Fail the pipeline if any sample has poor mapping
let failed = per_sample
|> filter(|s| s.mapped_pct < 90.0 || s.mean_depth < 20.0)
if len(failed) > 0 {
error("QC failed for: " + (failed |> map(|s| s.id) |> join(", ")))
}
per_sample
}
stage joint_genotype {
let gvcfs = qc_gate |> map(|s| s.gvcf)
gatk_combine_gvcfs(gvcfs, ref: "GRCh38.fa")
|> gatk_genotype_gvcfs(ref: "GRCh38.fa")
}
stage filter_and_annotate {
joint_genotype
|> gatk_vqsr(ref: "GRCh38.fa",
resources: ["hapmap.vcf.gz", "1000G_omni.vcf.gz"])
|> ensembl_vep(species: "human", assembly: "GRCh38")
|> write("cohort.annotated.vcf.gz")
}
stage summary {
let variant_counts = read_vcf("cohort.annotated.vcf.gz")
|> group_by(|v| parse_vcf_info(v.info_str).consequence)
|> map_values(|vs| len(vs))
let sample_stats = qc_gate
|> map(|s| {id: s.id, depth: s.mean_depth, mapped: s.mapped_pct})
{
n_samples: len(cohort),
variant_counts: variant_counts,
sample_stats: sample_stats,
}
|> write_json("cohort_summary.json")
}
}
Parameterized Pipelines (Templates)
A pipeline can accept parameters, turning it into a reusable template. When you
declare pipeline name(param1, param2) { ... }, BioLang defines a callable
function instead of executing immediately. You invoke it like any other function.
# Define a reusable alignment template
pipeline align_sample(sample_id, r1, r2, reference) {
stage aligned {
shell("bwa-mem2 mem -t 16 " + reference + " " + r1 + " " + r2
+ " | samtools sort -@ 8 -o " + sample_id + ".sorted.bam")
sample_id + ".sorted.bam"
}
stage indexed {
shell("samtools index " + aligned)
aligned
}
}
# Call with different inputs — same pipeline, different data
let tumor = align_sample("tumor", "tumor_R1.fq.gz", "tumor_R2.fq.gz", "GRCh38.fa")
let normal = align_sample("normal", "normal_R1.fq.gz", "normal_R2.fq.gz", "GRCh38.fa")
print("Tumor BAM: " + tumor)
print("Normal BAM: " + normal)
You can loop over a sample list to process a whole cohort with the same template:
let samples = [
{id: "S1", r1: "S1_R1.fq.gz", r2: "S1_R2.fq.gz"},
{id: "S2", r1: "S2_R1.fq.gz", r2: "S2_R2.fq.gz"},
{id: "S3", r1: "S3_R1.fq.gz", r2: "S3_R2.fq.gz"},
]
let bams = samples |> map(|s| align_sample(s.id, s.r1, s.r2, "GRCh38.fa"))
print("Aligned " + str(len(bams)) + " samples")
How it works:
- `pipeline name { ... }` (no params) – executes immediately; the result is bound to `name`
- `pipeline name(params) { ... }` – defines a callable function; nothing runs until you call it
- The parameter list uses the same syntax as function parameters
Pipeline Composition
Pipelines are values. You can call one pipeline from within another or store its result for downstream use.
pipeline align_one(sample) {
stage bam {
bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 8)
|> samtools_sort(threads: 4)
}
stage index {
samtools_index(bam)
bam
}
}
pipeline somatic_pair(tumor, normal) {
stage tumor_bam { align_one(tumor) }
stage normal_bam { align_one(normal) }
stage call {
mutect2(tumor: tumor_bam, normal: normal_bam,
ref: "GRCh38.fa",
pon: "pon.vcf.gz",
germline: "af-only-gnomad.vcf.gz")
}
stage filter {
gatk_filter_mutect(call, ref: "GRCh38.fa")
}
}
# Run the composed pipeline
let result = somatic_pair(
{r1: "tumor_R1.fq.gz", r2: "tumor_R2.fq.gz"},
{r1: "normal_R1.fq.gz", r2: "normal_R2.fq.gz"}
)
print("Somatic variants written to: " + result)
Summary
Pipeline blocks give you named, composable workflows with automatic dependency
tracking. Stages pass data forward by name, parallel for fans out across
samples, and defer keeps your working directory clean. The runtime infers a
DAG from stage references, running independent stages concurrently while
respecting data dependencies.
nf-core Integration
Why nf-core in a DSL
nf-core is the largest curated collection of bioinformatics pipelines, maintained by a global community and covering over 100 workflows – from bulk RNA-seq to rare-disease variant calling. Every pipeline follows strict standards: peer review, continuous integration, container packaging, and a machine-readable parameter schema.
BioLang exposes nf-core as a set of built-in functions. There is nothing to import
and nothing to install. You can browse, search, inspect, and compare nf-core
pipelines from the same scripts that process your FASTQ files and call variants.
The five builtins – nfcore_list, nfcore_search, nfcore_info, nfcore_releases,
and nfcore_params – turn the nf-core catalog into a queryable data source that
fits naturally into pipes, filters, and record operations.
Browsing the Catalog
nfcore_list() returns every pipeline in the nf-core organization. Each entry is
a record with fields like name, description, stars, and topics.
# List all nf-core pipelines
nfcore_list()
The result is a list of records. You can pipe it through any BioLang operation:
# Show names and star counts, sorted by popularity
nfcore_list(sort_by: "stars")
|> each(|p| {name: p.name, stars: p.stars})
To limit the result set, pass limit. This is useful in exploratory sessions
where you want a quick overview rather than the full catalog:
# Top 10 pipelines by most recent release date
nfcore_list(sort_by: "release", limit: 10)
|> each(|p| {name: p.name, updated: p.updated})
Sorting accepts three values:
"stars"– community popularity (default)"name"– alphabetical"release"– most recently updated first
Searching Pipelines
When you know what kind of analysis you need but not which pipeline implements it,
use nfcore_search. It matches against pipeline names, descriptions, and topic tags.
# Find RNA-seq related pipelines
nfcore_search("rnaseq")
Search accepts an optional limit to cap results:
# Find variant-calling pipelines, show top 5
nfcore_search("variant", 5)
The returned records have the same shape as nfcore_list entries, so you can
chain the same downstream operations:
# Search for methylation pipelines and extract their topics
nfcore_search("methylation")
|> each(|p| {name: p.name, topics: p.topics})
Pipeline Details
nfcore_info returns a single record with comprehensive metadata for a named
pipeline. The returned record contains:
| Field | Type | Description |
|---|---|---|
| `name` | String | Short pipeline name |
| `full_name` | String | GitHub-qualified name (nf-core/…) |
| `description` | String | One-line summary |
| `stars` | Int | GitHub star count |
| `url` | String | Repository URL |
| `license` | String | License identifier (usually MIT) |
| `topics` | List | GitHub topic tags |
| `open_issues` | Int | Current open issue count |
| `created` | String | Repository creation date |
| `updated` | String | Last push date |
# Inspect the Sarek germline/somatic variant calling pipeline
let sarek = nfcore_info("sarek")
print("Pipeline: " + sarek.full_name)
print("License: " + sarek.license)
print("Stars: " + to_string(sarek.stars))
print("Topics: " + join(sarek.topics, ", "))
You can use the metadata to make decisions in scripts. For example, checking whether a pipeline is actively maintained before committing to it:
# Treat a low open-issue count as a rough signal of active maintenance
let info = nfcore_info("taxprofiler")
if info.open_issues < 50 {
print(info.name + " looks actively maintained")
}
Releases and Versions
Production workflows should pin to a specific release rather than tracking the
development branch. nfcore_releases returns the full release history as a list
of records, each with tag and published_at fields.
# List all rnaseq releases
nfcore_releases("rnaseq")
The list is ordered newest-first, so the head element is the latest stable version:
# Get the latest stable release tag for rnaseq
let latest = nfcore_releases("rnaseq")
|> first()
print("Latest release: " + latest.tag + " (" + latest.published_at + ")")
You can also search for a specific version range or find how many releases a pipeline has had – a rough proxy for maturity:
# Count total releases for sarek
let release_count = nfcore_releases("sarek") |> len()
print("sarek has " + str(release_count) + " releases")
# Find all 3.x releases of rnaseq
nfcore_releases("rnaseq")
|> filter(|r| starts_with(r.tag, "3."))
|> each(|r| r.tag)
Parameter Schemas
Every nf-core pipeline publishes a JSON schema describing its configurable
parameters – input paths, reference genomes, trimming options, resource limits,
and more. nfcore_params fetches and parses this schema, returning a record
whose keys are parameter group names and whose values are records of individual
parameters.
# Fetch the full parameter schema for rnaseq
let params = nfcore_params("rnaseq")
The top-level keys are group names such as "input_output_options",
"reference_genome_options", or "trimming_options". Each group is a record
of parameter entries:
# List all parameter groups
nfcore_params("rnaseq")
|> keys()
# Inspect the reference genome options
nfcore_params("rnaseq").reference_genome_options
|> keys()
This is particularly useful for validating that your configuration covers the required parameters before submitting a long-running pipeline:
# Check whether a specific parameter exists
let params = nfcore_params("sarek")
let genome_opts = params.reference_genome_options
print(keys(genome_opts))
Example: Finding the Right Pipeline
A common task: you need to process single-cell RNA-seq data but are unsure which nf-core pipeline to use. Search the catalog, compare candidates, and pick the best fit.
# Step 1: Search for single-cell pipelines
let candidates = nfcore_search("single-cell")
# Step 2: Show names, stars, and descriptions side by side
candidates
|> each(|p| {
name: p.name,
stars: p.stars,
description: p.description
})
|> sort(|a, b| b.stars - a.stars)
|> each(|p| print(p.name + " (" + str(p.stars) + " stars): " + p.description))
# Step 3: Get detailed info on the top candidate
let top = candidates
|> sort(|a, b| b.stars - a.stars)
|> first()
let info = nfcore_info(top.name)
print("Selected: " + info.full_name)
print("License: " + info.license)
print("Topics: " + join(info.topics, ", "))
# Step 4: Check the latest release
let latest = nfcore_releases(top.name) |> first()
print("Latest release: " + latest.tag + " published " + latest.published_at)
# Step 5: Preview the parameter groups
nfcore_params(top.name)
|> keys()
|> each(|group| print(" - " + group))
This entire exploration runs in a single script – no switching between a browser, the nf-core CLI tool, and your pipeline configuration files.
Example: Auditing Pipeline Parameters
Before launching a whole-genome sequencing analysis with Sarek, you want to confirm that your reference genome configuration covers every required parameter and that no deprecated options have crept into your config.
# Fetch the Sarek parameter schema
let params = nfcore_params("sarek")
# Extract all reference genome parameters
let ref_params = params.reference_genome_options
# Your current configuration
let my_config = {
genome: "GRCh38",
igenomes_base: "s3://ngi-igenomes/igenomes",
fasta: nil,
fasta_fai: nil
}
# Check for parameters in the schema that you have not configured
let schema_keys = keys(ref_params)
let config_keys = keys(my_config)
let missing = schema_keys
|> filter(|k| !contains(config_keys, k))
if len(missing) > 0 {
print("WARNING: unset reference genome parameters:")
missing |> each(|k| print(" - " + k))
} else {
print("All reference genome parameters are covered.")
}
# Check for keys in your config that are not in the schema (possibly deprecated)
let extra = config_keys
|> filter(|k| !contains(schema_keys, k))
if len(extra) > 0 {
print("WARNING: config keys not in current schema (possibly deprecated):")
extra |> each(|k| print(" - " + k))
}
This catches configuration drift early – before you discover it four hours into a whole-genome analysis run.
Example: Building a Pipeline Registry
For a core facility managing dozens of active projects, it helps to maintain a catalog of available pipelines organized by topic. This script builds a topic index from the full nf-core catalog.
# Fetch all pipelines
let all_pipelines = nfcore_list(sort_by: "stars")
# Build a topic-to-pipeline mapping
# Flatten: for each pipeline, emit one record per topic tag
let entries = all_pipelines
|> flat_map |p| p.topics |> map |t| { topic: t, name: p.name, stars: p.stars }
# Group by topic
let topics = entries
|> group_by |e| e.topic
# Print a summary: topics with 3 or more pipelines
keys(topics)
|> filter |t| length(topics[t]) >= 3
|> sort
|>> each |t| {
let pipelines = topics[t]
|> sort_by |a, b| b.stars - a.stars
|> map |p| p.name
print(t + " (" + to_string(length(pipelines)) + " pipelines): " + join(pipelines, ", "))
}
You can extend this to generate a Markdown report, push to a shared wiki, or feed into a lab notebook:
# Export the top 20 pipelines as a tab-separated table
let header = "name\tstars\ttopics\tlatest_release"
print(header)
nfcore_list(sort_by: "stars", limit: 20)
|>> each |p| {
let latest = nfcore_releases(p.name) |> first
let tag = if latest != None then latest.tag else "unreleased"
let topic_str = join(p.topics, ";")
print(p.name + "\t" + to_string(p.stars) + "\t" + topic_str + "\t" + tag)
}
Redirect stdout to a file and you have a pipeline inventory that updates every time you run the script – no manual curation needed.
Parsing Nextflow Files
BioLang can parse Nextflow .nf files directly with nf_parse(). This function
reads the file and extracts its structure without requiring a Nextflow runtime.
# Parse a Nextflow pipeline file
let parsed = nf_parse("main.nf")
The result is a record with five fields:
| Field | Type | Content |
|---|---|---|
| params | Record | Parameter name-value pairs |
| processes | List[Record] | Process name, inputs, outputs, script, container, cpus, memory |
| includes | List[Record] | Include name, alias, source path |
| workflow | List[String] | Workflow block lines |
| dsl | String | DSL version (“DSL1” or “DSL2”) |
You can pipe the parsed structure through any BioLang operation:
# List all processes and their containers
let parsed = nf_parse("variant_calling.nf")
parsed.processes
|> each |p| {
print(p.name + ": " + if p.container != None then p.container else "no container")
}
# Extract all parameters with their default values
let parsed = nf_parse("rnaseq.nf")
keys(parsed.params)
|> sort
|>> each |k| print(k + " = " + to_string(parsed.params[k]))
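The dsl and includes fields are just as pipeable. As a sketch — the table above documents what each include record contains, but the field names name and source used here are assumptions — you can flag a legacy pipeline and list its module imports before converting:
# Check the DSL version and list module includes
let parsed = nf_parse("legacy_pipeline.nf")
if parsed.dsl == "DSL1" then
print("WARNING: DSL1 pipeline - consider migrating before conversion")
parsed.includes
|>> each |inc| print("include " + inc.name + " from " + inc.source)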
Generating BioLang Code
The nf_to_bl function takes a parsed Nextflow record and generates BioLang
pipeline code as a string:
# Parse and convert a Nextflow pipeline
let parsed = nf_parse("rnaseq.nf")
let bl_code = nf_to_bl(parsed)
# Print the generated code
print(bl_code)
# Save to a file
write_file("rnaseq.bl", bl_code)
The generated code maps Nextflow constructs to BioLang equivalents:
- params.* become top-level variables
- process blocks become pipeline { stage ... } blocks
- Container, CPU, and memory directives are preserved as stage annotations
- Workflow call ordering is preserved as comments
This is a starting point – edit the generated skeleton to add BioLang-specific features like pipe chains, error handling, and parallel execution.
# Full workflow: parse, generate, and review
let parsed = nf_parse("sarek_main.nf")
# Show what was extracted
print("Found " + to_string(length(parsed.processes)) + " processes")
print("Found " + to_string(length(keys(parsed.params))) + " parameters")
# Generate BioLang code
let bl_code = nf_to_bl(parsed)
write_file("sarek.bl", bl_code)
print("Generated sarek.bl")
Chapter 13: BioContainers Integration
Reproducible bioinformatics demands pinned software versions. A variant calling pipeline that works today must produce identical results next year, on a different machine, with the same tool versions. The BioContainers project addresses this by packaging over 9,000 bioinformatics tools as container images hosted on registries like quay.io and Docker Hub.
BioLang gives you native access to the BioContainers registry. Four built-in functions let you search for tools, discover popular packages, inspect version histories, and retrieve exact container image URIs – all without leaving your script. No imports are needed.
Searching for Tools
biocontainers_search queries the BioContainers registry by name or keyword.
It returns a list of records, each describing a matching tool.
# Find all samtools-related containers
let results = biocontainers_search("samtools")
# results => [
# {name: "samtools", description: "Tools for manipulating NGS alignments...",
# organization: "biocontainers", version_count: 42,
# latest_version: "1.19--h50ea8bc_0",
# latest_image: "quay.io/biocontainers/samtools:1.19--h50ea8bc_0"},
# {name: "htslib", description: "C library for high-throughput sequencing...", ...},
# ...
# ]
results |> each(|tool| {
print(tool.name + " (" + str(tool.version_count) + " versions)")
print(" Latest: " + tool.latest_version)
})
The default limit is 25 results. Pass a second argument to narrow or widen the search.
# Top 5 matches for short-read aligners
let aligners = biocontainers_search("bwa", 5)
aligners |> each(|a| print(a.name + " => " + a.latest_image))
Search terms are matched against tool names and descriptions, so broader queries work too.
# Find tools related to RNA-seq quantification
let quant_tools = biocontainers_search("salmon rna-seq")
quant_tools
|> filter(|t| t.version_count > 5)
|> each(|t| print(t.name + ": " + t.description))
Popular Tools
biocontainers_popular returns the most-pulled tools in the registry. This is
useful for discovering which tools the community relies on and for auditing
whether your pipeline uses well-maintained software.
let top20 = biocontainers_popular()
print("Top 20 BioContainers tools:")
top20 |> each(|t| print(" " + t.name + " - " + t.latest_version))
Pass a limit to retrieve more.
# Top 50 most popular bioinformatics containers
let top50 = biocontainers_popular(50)
# Which of our pipeline tools are in the top 50?
let our_tools = ["samtools", "bcftools", "bwa-mem2", "gatk4", "picard"]
let popular_names = top50 |> map(|t| t.name)
our_tools |> each(|tool| {
let found = popular_names |> find(|n| n == tool)
if found != None then
print(tool + " is in the top 50")
else
print(tool + " is NOT in the top 50")
})
Tool Details
biocontainers_info returns a detailed record for a single tool, including its
full version history with per-version container images.
let info = biocontainers_info("samtools")
# info => {
# name: "samtools",
# description: "Tools for manipulating NGS alignments...",
# organization: "biocontainers",
# aliases: ["samtools"],
# versions: [
# {version: "1.19--h50ea8bc_0",
# images: [{registry: "quay.io", image: "quay.io/biocontainers/samtools:1.19--h50ea8bc_0",
# type: "Docker", size: 14250000}]},
# {version: "1.18--h50ea8bc_0", images: [...]},
# ...
# ]
# }
print(info.name + " by " + info.organization)
print(str(len(info.versions)) + " versions available")
# List the 5 most recent versions
info.versions
|> take(5)
|> each(|v| {
let image = v.images |> first()
print(" " + v.version + " (" + image.registry + ", "
+ str(image.size / 1000000) + " MB)")
})
The images list for each version may contain entries from multiple registries
or image types (Docker, Singularity). Filter by registry or type if your
infrastructure requires a specific format.
# Find Singularity images for deepvariant
let dv = biocontainers_info("deepvariant")
dv.versions |> each(|v| {
let singularity = v.images |> filter(|img| img.type == "Singularity")
if len(singularity) > 0 then
print(v.version + " has Singularity image")
})
Version Management
biocontainers_versions returns a flat list of all versions for a tool, each
with a list of full image URI strings. This is the function to use when you
need to pin a specific version in a pipeline manifest.
let versions = biocontainers_versions("gatk4")
# versions => [
# {version: "4.5.0.0--py310hdfd78af_0",
# images: ["quay.io/biocontainers/gatk4:4.5.0.0--py310hdfd78af_0"]},
# {version: "4.4.0.0--py310hdfd78af_0",
# images: ["quay.io/biocontainers/gatk4:4.4.0.0--py310hdfd78af_0"]},
# ...
# ]
# Find the latest GATK 4.4.x release
let gatk44 = versions
|> filter(|v| starts_with(v.version, "4.4"))
|> first()
print("Pinning GATK to: " + gatk44.images[0])
You can use this to check whether a specific version exists before committing to it in a pipeline definition.
# Verify that bcftools 1.18 is available
let bc_versions = biocontainers_versions("bcftools")
let target = bc_versions |> find(|v| starts_with(v.version, "1.18"))
if target != None then
print("bcftools 1.18 available: " + target.images[0])
else
print("bcftools 1.18 not found in BioContainers")
Example: Building a Reproducible Tool Manifest
A variant calling pipeline needs exact container images for every tool. Use the BioContainers builtins to resolve each tool to a pinned image URI and export the manifest.
# Tools required for a germline variant calling pipeline
let required = [
{name: "bwa-mem2", min_version: "2.2"},
{name: "samtools", min_version: "1.18"},
{name: "gatk4", min_version: "4.4"},
{name: "bcftools", min_version: "1.18"},
]
let manifest = required |> map(|req| {
let versions = biocontainers_versions(req.name)
# Find the newest version that satisfies the minimum
let matching = versions
|> filter(|v| starts_with(v.version, req.min_version))
if len(matching) == 0 then {
print("WARNING: no " + req.name + " >= " + req.min_version + " found")
{tool: req.name, version: "MISSING", image: "MISSING"}
} else {
let chosen = matching |> first()
{tool: req.name, version: chosen.version, image: chosen.images[0]}
}
})
# Print the resolved manifest
print("Variant Calling Pipeline - Tool Manifest")
print("=========================================")
manifest |> each(|m| {
print(m.tool + ":")
print(" version: " + m.version)
print(" image: " + m.image)
})
# Export as structured data
manifest |> write_json("pipeline_manifest.json")
This script produces a lockfile-style manifest that can be checked into version control alongside the pipeline definition.
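A natural companion step is to load that manifest back at launch time and fail fast if any tool was left unresolved. A minimal sketch, assuming a read_json builtin that mirrors write_json:
# Load the pinned manifest and refuse to launch with gaps
let manifest = read_json("pipeline_manifest.json")
let unresolved = manifest |> filter(|m| m.image == "MISSING")
if len(unresolved) > 0 then {
print("ERROR: unresolved tools in manifest:")
unresolved |> each(|m| print("  " + m.tool))
} else {
print("All " + str(len(manifest)) + " tools resolved")
}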
Example: Tool Discovery for a New Analysis
When starting a new analysis type, you need to survey what tools are available. Here we explore the methylation analysis landscape.
# What methylation tools exist in BioContainers?
let methyl_tools = biocontainers_search("methylation", 50)
print(str(len(methyl_tools)) + " methylation-related tools found")
print("")
# Split by version count to separate mature from newer tools
let mature = methyl_tools
|> filter(|t| t.version_count >= 5)
|> sort_by(|t| -t.version_count)
let new_tools = methyl_tools
|> filter(|t| t.version_count < 3)
print("Mature tools (" + str(len(mature)) + "):")
mature |> each(|t| {
print(" " + t.name + " - " + str(t.version_count) + " versions"
+ " (latest: " + t.latest_version + ")")
print(" " + t.description)
})
print("")
print("Newer tools (" + str(len(new_tools)) + "):")
new_tools |> take(10) |> each(|t| {
print(" " + t.name + " - " + t.latest_version)
})
# Deep dive into one candidate: Bismark
let bismark = biocontainers_info("bismark")
print("")
print("Bismark detail:")
print(" " + bismark.description)
print(" " + str(len(bismark.versions)) + " releases")
print(" Aliases: " + join(bismark.aliases, ", "))
# Check image sizes across versions
bismark.versions |> take(5) |> each(|v| {
let docker = v.images |> filter(|img| img.type == "Docker") |> first()
if docker != None then
print(" " + v.version + ": " + str(docker.size / 1000000) + " MB")
})
Example: Container Image Audit
For an existing pipeline, verify that every tool has a valid BioContainers image and flag any that are outdated.
# Current pipeline tools and their pinned versions
let pinned = [
{tool: "bwa-mem2", version: "2.2.1--hd03093a_2"},
{tool: "samtools", version: "1.17--h50ea8bc_0"},
{tool: "gatk4", version: "4.3.0.0--py310hdfd78af_0"},
{tool: "bcftools", version: "1.17--h3cc50cf_1"},
{tool: "multiqc", version: "1.14--pyhdfd78af_0"},
{tool: "fastp", version: "0.23.2--hb7a2d85_2"},
]
let audit = pinned |> map(|entry| {
let info = biocontainers_info(entry.tool)
let all_versions = info.versions |> map(|v| v.version)
# Check if pinned version still exists
let exists = all_versions |> find(|v| v == entry.version) != None
# Check if there is a newer version
let latest = info.versions |> first()
let is_latest = latest.version == entry.version
# Count how many versions behind
let versions_behind = if is_latest then
0
else {
let idx = all_versions
|> enumerate()
|> find(|pair| pair.value == entry.version)
if idx != None then idx.index else -1
}
{
tool: entry.tool,
pinned: entry.version,
latest: latest.version,
exists: exists,
is_latest: is_latest,
versions_behind: versions_behind,
}
})
# Report
print("Pipeline Container Audit")
print("========================")
let missing = audit |> filter(|a| not a.exists)
let outdated = audit |> filter(|a| a.exists and not a.is_latest)
let current = audit |> filter(|a| a.is_latest)
if len(missing) > 0 then {
print("")
print("MISSING (pinned version no longer in registry):")
missing |> each(|a| print(" " + a.tool + " " + a.pinned
+ " => latest: " + a.latest))
}
if len(outdated) > 0 then {
print("")
print("OUTDATED:")
outdated |> each(|a| print(" " + a.tool + " " + a.pinned
+ " => " + a.latest
+ " (" + str(a.versions_behind) + " versions behind)"))
}
if len(current) > 0 then {
print("")
print("CURRENT:")
current |> each(|a| print(" " + a.tool + " " + a.pinned))
}
print("")
print(str(len(current)) + " current, "
+ str(len(outdated)) + " outdated, "
+ str(len(missing)) + " missing")
audit |> write_json("container_audit.json")
Summary
BioLang’s four BioContainers builtins – biocontainers_search,
biocontainers_popular, biocontainers_info, and biocontainers_versions –
bring the full BioContainers registry into your scripts as native data. Use
them to discover tools, pin container images for reproducibility, audit
existing pipelines, and explore new analysis domains. Combined with BioLang’s
pipes and collection operations, a few lines of code replace manual registry
browsing and ad hoc version tracking.
Galaxy ToolShed & Workflow Parsing
The Galaxy ToolShed is the app store for the Galaxy bioinformatics platform. It hosts thousands of community-contributed tools covering everything from read alignment and variant calling to phylogenetics and metabolomics. Each repository in the ToolShed carries metadata – owner, description, download counts, and revision history – making it a rich discovery resource even if you never run Galaxy itself.
BioLang provides read-only access to the Galaxy ToolShed through four built-in
functions. There is nothing to import and nothing to install. You can search
repositories, browse popular tools, list categories, and inspect individual
tools from the same scripts that process your sequence data. A fifth builtin,
galaxy_to_bl, takes a Galaxy workflow record and generates BioLang pipeline
code from it.
Searching Repositories
galaxy_search queries the ToolShed by name or keyword. It returns a list of
records, each describing a matching repository.
# Find BWA-related tools in the Galaxy ToolShed
galaxy_search("bwa")
|> each(|t| print(t.name + " by " + t.owner + " (" + str(t.downloads) + " downloads)"))
Each record in the result contains:
| Field | Type | Description |
|---|---|---|
| name | String | Repository name |
| owner | String | ToolShed username of the maintainer |
| description | String | Short summary |
| downloads | Int | Total download count |
| url | String | ToolShed repository URL |
The default result set includes all matches. Pass a second argument to cap the number of results, which is useful in exploratory sessions where you want a quick overview.
# Top 5 matches for short-read aligners
galaxy_search("bwa", 5)
|> each(|t| print(t.name + " (" + t.owner + "): " + t.description))
Search terms are matched against repository names and descriptions, so broader queries work too.
# Find tools related to RNA-seq
galaxy_search("rna-seq", 10)
|> filter(|t| t.downloads > 1000)
|> each(|t| print(t.name + ": " + str(t.downloads) + " downloads"))
Popular Tools
galaxy_popular returns the most-downloaded repositories in the ToolShed,
sorted by download count descending. This is useful for discovering which
tools the Galaxy community relies on most heavily.
galaxy_popular(10)
|> map(|t| { name: t.name, owner: t.owner, downloads: t.downloads })
Without an argument, galaxy_popular() returns a default set. Pass a limit
to widen or narrow the list.
# Top 25 Galaxy tools by download count
galaxy_popular(25)
|> each(|t| print(t.name + " by " + t.owner + " - " + str(t.downloads) + " downloads"))
Categories
The ToolShed organizes repositories into categories such as “Assembly”,
“Variant Analysis”, “RNA”, and “Visualization”. galaxy_categories returns
them all as a list of records.
galaxy_categories()
|> each(|c| print(c.name + ": " + c.description))
You can use this to explore what kinds of tools exist before drilling into a specific area.
# Find categories related to sequencing
galaxy_categories()
|> filter(|c| contains(lower(c.name), "sequenc") or contains(lower(c.description), "sequenc"))
|> each(|c| print(c.name))
Repository Details
galaxy_tool returns a detailed record for a single repository, identified
by owner and name.
let tool = galaxy_tool("devteam", "bwa")
print("Tool: " + tool.name)
print("Owner: " + tool.owner)
print("Downloads: " + str(tool.downloads))
print("Last updated: " + tool.last_updated)
print("Description: " + tool.description)
The returned record includes fields beyond what the search results provide,
such as last_updated and the full repository URL. Use this to verify that
a tool is actively maintained before building a workflow around it.
# Check maintenance status of a tool before adopting it
let tool = galaxy_tool("iuc", "samtools_sort")
print(tool.name + " last updated: " + tool.last_updated)
print("Downloads: " + str(tool.downloads))
Cross-Registry Discovery
One of BioLang’s strengths is that nf-core, BioContainers, and Galaxy ToolShed are all accessible from the same script. When evaluating a tool, you can check all three registries in a single pass to understand the full landscape – whether curated pipelines exist, whether container images are available, and whether Galaxy wrappers have been written.
# Search for BWA across all three registries
let tool = "bwa"
let nf_results = nfcore_search(tool)
let bc_results = biocontainers_search(tool, 5)
let gx_results = galaxy_search(tool, 5)
print("=== Cross-registry search for: " + tool + " ===")
print("nf-core pipelines mentioning " + tool + ": " + str(len(nf_results)))
print("BioContainers: " + str(len(bc_results)))
print("Galaxy tools: " + str(len(gx_results)))
This pattern scales to any tool or keyword. Here is a more thorough version that prints Galaxy and BioContainers details side by side.
# Compare tool availability across registries
let tools = ["samtools", "bcftools", "hisat2", "salmon"]
tools |> each(|name| {
print("--- " + name + " ---")
let gx = galaxy_search(name, 3)
let bc = biocontainers_search(name, 3)
if len(gx) > 0 then
print(" Galaxy: " + gx[0].name + " by " + gx[0].owner
+ " (" + str(gx[0].downloads) + " downloads)")
else
print(" Galaxy: not found")
if len(bc) > 0 then
print(" BioContainers: " + bc[0].name
+ " (latest: " + bc[0].latest_version + ")")
else
print(" BioContainers: not found")
print("")
})
Workflow Code Generation
Galaxy workflows are JSON documents that describe a directed acyclic graph of
tool invocations. BioLang can parse these workflows and generate equivalent
BioLang pipeline code via the galaxy_to_bl builtin. This is useful when
migrating from Galaxy to a script-based workflow, or when you want to use a
Galaxy workflow as a starting point for a BioLang pipeline.
galaxy_to_bl takes a record that represents a Galaxy workflow and returns a
string of BioLang code. The expected input format mirrors the structure of a
Galaxy workflow export.
# Construct a Galaxy workflow record
let workflow = {
name: "RNA-seq Analysis",
annotation: "Basic RNA-seq pipeline",
steps: [
{ name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads"], outputs: ["report"] },
{ name: "HISAT2", tool_id: "hisat2/2.2.1", inputs: ["reads"], outputs: ["aligned_bam"] },
{ name: "featureCounts", tool_id: "featurecounts/2.0.6", inputs: ["aligned_bam"], outputs: ["counts"] }
]
}
# Generate BioLang pipeline code
let bl_code = galaxy_to_bl(workflow)
print(bl_code)
The generated code maps each Galaxy step to a BioLang function call, preserving the data flow between steps. You can write it directly to a file and use it as a starting point for further customization.
# Save the generated code
let bl_code = galaxy_to_bl(workflow)
write_text("rnaseq.bl", bl_code)
print("Pipeline written to rnaseq.bl")
For more complex workflows with branching and multiple outputs, the generated code includes comments marking each step and its connections.
# A branching QC + alignment workflow
let workflow = {
name: "QC and Align",
annotation: "Parallel QC with alignment",
steps: [
{ name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads"], outputs: ["qc_report"] },
{ name: "Trimmomatic", tool_id: "trimmomatic/0.39", inputs: ["reads"], outputs: ["trimmed"] },
{ name: "BWA-MEM2", tool_id: "bwa_mem2/2.2.1", inputs: ["trimmed"], outputs: ["aligned"] },
{ name: "MultiQC", tool_id: "multiqc/1.14", inputs: ["qc_report"], outputs: ["summary"] }
]
}
let bl_code = galaxy_to_bl(workflow)
print(bl_code)
Configuration
The Galaxy ToolShed URL defaults to the main public instance at
https://toolshed.g2.bx.psu.edu. You can override it for private or
institutional ToolShed installations.
Set the URL in ~/.biolang/apis.yaml:
galaxy:
toolshed_url: "https://toolshed.example.org"
Or via the environment variable BIOLANG_GALAXY_TOOLSHED_URL:
export BIOLANG_GALAXY_TOOLSHED_URL="https://toolshed.example.org"
The environment variable takes precedence over the config file. If neither is set, the public ToolShed is used.
Example: Building a Tool Inventory
A core facility managing Galaxy and non-Galaxy workflows needs to know which tools are available in the ToolShed and whether container images exist for running them outside Galaxy. This script searches Galaxy for tools matching a keyword, then cross-references each one with BioContainers to check container availability.
# Build a tool inventory for variant analysis
let keyword = "variant"
# Step 1: Search Galaxy for variant-related tools
let gx_tools = galaxy_search(keyword, 20)
print("Found " + str(len(gx_tools)) + " Galaxy tools for: " + keyword)
print("")
# Step 2: For each Galaxy tool, check BioContainers
let inventory = gx_tools |> map(|t| {
# Search BioContainers using the tool name
let bc = biocontainers_search(t.name, 1)
let container_status = if len(bc) > 0 then
"available (" + bc[0].latest_version + ")"
else
"not found"
{
name: t.name,
owner: t.owner,
galaxy_downloads: t.downloads,
container: container_status
}
})
# Step 3: Report
print("Tool Inventory: " + keyword)
print("========================================")
inventory
|> sort_by(|a| -a.galaxy_downloads)
|> each(|t| {
print(t.name + " (" + t.owner + ")")
print(" Galaxy downloads: " + str(t.galaxy_downloads))
print(" BioContainers: " + t.container)
})
# Step 4: Summary statistics
let with_container = inventory |> filter(|t| not starts_with(t.container, "not"))
let without_container = inventory |> filter(|t| starts_with(t.container, "not"))
print("")
print("Summary:")
print(" Total tools: " + str(len(inventory)))
print(" With container image: " + str(len(with_container)))
print(" Without container: " + str(len(without_container)))
You can extend this pattern to cover additional registries or to export the inventory as a structured file.
# Export as JSON for downstream use
inventory |> write_json("variant_tool_inventory.json")
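The same loop extends to nf-core as a third registry. This sketch reuses the nfcore_search builtin from the nf-core chapter to annotate each inventory entry with the number of curated pipelines that mention it:
# Add an nf-core column to the inventory
let extended = inventory |> map(|t| {
let nf = nfcore_search(t.name)
{...t, nfcore_pipelines: len(nf)}
})
extended |> write_json("variant_tool_inventory_full.json")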
Example: Migrating a Galaxy Workflow
When moving a project from Galaxy to BioLang, you can combine the ToolShed lookup with code generation in a single script. Look up each tool to verify it exists, then generate the BioLang equivalent.
# Define the workflow steps from a Galaxy export
let workflow = {
name: "Whole Genome Variant Calling",
annotation: "BWA-MEM2 alignment with GATK HaplotypeCaller",
steps: [
{ name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads_r1", "reads_r2"], outputs: ["qc_report"] },
{ name: "BWA-MEM2", tool_id: "bwa_mem2/2.2.1", inputs: ["reads_r1", "reads_r2"], outputs: ["aligned_bam"] },
{ name: "MarkDuplicates", tool_id: "picard_markduplicates/3.1.1", inputs: ["aligned_bam"], outputs: ["dedup_bam"] },
{ name: "HaplotypeCaller", tool_id: "gatk4_haplotypecaller/4.5", inputs: ["dedup_bam"], outputs: ["raw_vcf"] },
{ name: "FilterVcf", tool_id: "gatk4_filtermutectcalls/4.5", inputs: ["raw_vcf"], outputs: ["filtered_vcf"] }
]
}
# Verify each tool exists in the ToolShed
print("Verifying tools...")
workflow.steps |> each(|step| {
# Extract base tool name from tool_id
let parts = split(step.tool_id, "/")
let tool_name = parts[0]
let results = galaxy_search(tool_name, 1)
if len(results) > 0 then
print(" " + step.name + ": found (" + results[0].owner + ")")
else
print(" " + step.name + ": WARNING - not found in ToolShed")
})
# Generate the BioLang pipeline
print("")
let bl_code = galaxy_to_bl(workflow)
print("Generated pipeline:")
print(bl_code)
# Save to file
write_text("wgs_variant_calling.bl", bl_code)
print("Pipeline saved to wgs_variant_calling.bl")
Summary
BioLang’s Galaxy ToolShed builtins – galaxy_search, galaxy_popular,
galaxy_categories, galaxy_tool, and galaxy_to_bl – bring the Galaxy
ecosystem into your scripts as native data. Use them to discover tools,
evaluate popularity and maintenance status, cross-reference with BioContainers
and nf-core, and generate BioLang pipeline code from Galaxy workflows. The
ToolShed becomes one more queryable registry alongside the others, all
accessible from the same pipes, filters, and record operations you use for
everything else.
Chapter 12: Statistics and Linear Algebra
Bioinformatics is quantitative at its core. Whether you are testing for differential expression, running principal component analysis on a transcriptome, or modeling drug kinetics, you need robust numerical tools. BioLang provides built-in statistics functions, matrix operations, and an ODE solver so these analyses live alongside your data processing code.
Descriptive Statistics
All descriptive stats operate on lists of numbers.
let coverage = tsv("sample_depth.tsv") |> col("depth")
let summary = {
mean: mean(coverage),
median: median(coverage),
sd: stdev(coverage),
var: variance(coverage),
q25: quantile(coverage, 0.25),
q75: quantile(coverage, 0.75),
iqr: quantile(coverage, 0.75) - quantile(coverage, 0.25),
}
print("Coverage: mean=" + str(summary.mean) + " median=" + str(summary.median))
These functions handle missing values gracefully: None entries are skipped
with a warning.
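For example, a depth list with a missing entry still summarizes over the observed values (a sketch; the exact warning text is illustrative):
let depths = [32, None, 41, 28]
let avg = mean(depths) # warns that 1 None entry was skipped
# avg is the mean of [32, 41, 28]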
Hypothesis Testing
t-test
Compare expression levels between two conditions.
let tumor = tsv("expr.tsv") |> filter(|r| r.group == "tumor") |> col("BRCA1")
let normal = tsv("expr.tsv") |> filter(|r| r.group == "normal") |> col("BRCA1")
let result = ttest(tumor, normal)
# result => {statistic: 4.32, pvalue: 0.00018, df: 28, mean_diff: 2.15}
if result.pvalue < 0.05 then
print("BRCA1 significantly different (p=" + str(result.pvalue) + ")")
Wilcoxon Rank-Sum
For non-normally distributed data such as read counts.
let treated = [1240, 890, 1560, 2100, 780, 1890, 1345]
let control = [450, 520, 380, 610, 490, 430, 550]
let result = wilcoxon(treated, control)
# result => {statistic: 49.0, pvalue: 0.0012}
Chi-squared Test
Test whether observed genotype frequencies match Hardy-Weinberg expectations.
let observed = [210, 480, 310] # AA, Aa, aa genotypes
let n = sum(observed)
let p = (2 * observed[0] + observed[1]) / (2 * n)
let q = 1.0 - p
let expected = [p * p * n, 2 * p * q * n, q * q * n]
let result = chi_square(observed, expected)
# result => {statistic: 0.92, pvalue: 0.34, df: 1}
if result.pvalue > 0.05 then
print("Population is in Hardy-Weinberg equilibrium")
ANOVA
Compare expression across multiple tissue types.
let brain = tsv("expr.tsv") |> filter(|r| r.tissue == "brain") |> col("TP53")
let liver = tsv("expr.tsv") |> filter(|r| r.tissue == "liver") |> col("TP53")
let lung = tsv("expr.tsv") |> filter(|r| r.tissue == "lung") |> col("TP53")
let kidney = tsv("expr.tsv") |> filter(|r| r.tissue == "kidney") |> col("TP53")
let result = anova(brain, liver, lung, kidney)
# result => {f_statistic: 8.74, pvalue: 0.00003, df_between: 3, df_within: 76}
Fisher’s Exact Test
Test enrichment of a mutation in cases versus controls.
# Contingency table: [[mutation+/case, mutation-/case], [mutation+/ctrl, mutation-/ctrl]]
let table = [[45, 155], [12, 188]]
let result = fisher_exact(table)
# result => {odds_ratio: 4.55, pvalue: 0.00001}
Multiple Testing Correction
When testing thousands of genes, you must correct for multiple comparisons.
p_adjust supports Bonferroni, Benjamini-Hochberg (BH), and
Benjamini-Yekutieli (BY) methods.
let genes = tsv("gene_expression.tsv")
let gene_names = genes |> col("gene_name")
let tumor_cols = colnames(genes) |> filter(|c| starts_with(c, "tumor_"))
let normal_cols = colnames(genes) |> filter(|c| starts_with(c, "normal_"))
let pvalues = range(0, genes.num_rows)
|> map(|i| {
let t = tumor_cols |> map(|c| col(genes, c)[i])
let n = normal_cols |> map(|c| col(genes, c)[i])
ttest(t, n).pvalue
})
let adjusted_bh = p_adjust(pvalues, "BH")
let adjusted_bonf = p_adjust(pvalues, "bonferroni")
let significant = range(0, len(adjusted_bh))
|> filter(|i| adjusted_bh[i] < 0.05)
|> map(|i| {gene: gene_names[i], p: pvalues[i], q: adjusted_bh[i]})
print(str(len(significant)) + " genes significant at FDR < 0.05")
Correlation
Compute Pearson or Spearman correlation between expression profiles.
let expr = tsv("tpm_matrix.tsv")
let gene_a = expr |> col("EGFR")
let gene_b = expr |> col("ERBB2")
let pearson = cor(gene_a, gene_b)
# pearson => {r: 0.72, pvalue: 0.00001}
let spearman = spearman(gene_a, gene_b)
# spearman => {rho: 0.68, pvalue: 0.00004}
Linear Models
Fit a linear model to predict expression from covariates.
let data = tsv("sample_metadata.tsv")
let expression = data |> col("gene_expr")
let age = data |> col("age")
let batch = data |> col("batch_id")
let model = lm(expression, [age, batch])
# model => {
# coefficients: [{name: "intercept", estimate: 5.2, se: 0.8, p: 0.001},
# {name: "x1", estimate: 0.03, se: 0.01, p: 0.02},
# {name: "x2", estimate: -1.1, se: 0.4, p: 0.008}],
# r_squared: 0.45,
# adj_r_squared: 0.42,
# f_statistic: 12.3,
# pvalue: 0.00002
# }
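The coefficient records pipe like any other list. A sketch that reports only the covariates significant at p < 0.05, using the field names shown in the comment above:
# Report significant covariates from the fitted model
model.coefficients
|> filter(|c| c.p < 0.05 and c.name != "intercept")
|> each(|c| print(c.name + ": estimate=" + str(c.estimate) + ", p=" + str(c.p)))
print("R-squared: " + str(model.r_squared))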
Matrix Creation and Operations
Create matrices from data and perform standard operations.
# Create a 3x3 count matrix (genes x samples)
let counts = matrix([
[120, 340, 250],
[890, 1200, 1050],
[45, 30, 55]
])
# Transpose: samples x genes
let transposed = transpose(counts)
# Matrix multiplication
let product = mat_mul(transpose(counts), counts) # 3x3 covariance-like
# Element-wise operations with mat_map
let log_counts = mat_map(counts, |x| log2(x + 1))
# Basic properties
print("Trace: " + str(trace(product)))
print("Rank: " + str(rank(counts)))
print("Frobenius norm: " + str(norm(counts)))
Linear Algebra
Determinant and Inverse
let cov = matrix([
[1.0, 0.8, 0.3],
[0.8, 1.0, 0.5],
[0.3, 0.5, 1.0]
])
let det = determinant(cov)
print("Determinant: " + str(det)) # non-zero => invertible
let precision = inverse(cov) # precision matrix
Eigenvalues and Eigenvectors
let eigen = eigenvalues(cov)
# eigen => {values: [2.1, 0.7, 0.2], vectors: [[...], [...], [...]]}
# Proportion of variance explained by each component
let total = sum(eigen.values)
let prop = eigen.values |> map(|v| v / total)
print("PC1 explains " + str(prop[0] * 100) + "% of variance")
Singular Value Decomposition
let result = svd(counts)
# result => {u: matrix, s: [singular values], v: matrix}
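The squared singular values play the same role the eigenvalues did above: they measure how much variance each component captures. A quick scree sketch using the s field from the result record:
# Proportion of variance per singular component
let sv_sq = result.s |> map(|s| s * s)
let total = sum(sv_sq)
sv_sq |> each(|v| print(str(v / total * 100) + "%"))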
Solving Linear Systems
# Solve Ax = b
let a = matrix([[2, 1], [1, 3]])
let b = [5, 10]
let x = solve(a, b)
# x => [1.0, 3.0]
ODE Solving
ode_solve uses a 4th-order Runge-Kutta integrator. The signature is:
ode_solve(derivative_fn, initial_state, [t_start, t_end, dt])
It returns {t: [...], y: [...]} where y is a list of state vectors at
each time point.
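A minimal sanity check of the integrator, using only the signature above: exponential decay dy/dt = -k*y has the closed form y(t) = y0 * exp(-k*t), so the RK4 endpoint should closely match the analytic value. (last and exp are assumed builtins in this sketch.)
# Exponential decay: dy/dt = -k * y, with y(0) = 100
let k = 0.5
let decay = |t, y| [-k * y[0]]
let sol = ode_solve(decay, [100.0], [0.0, 10.0, 0.01])
let y_end = sol.y |> last() |> |s| s[0]
print("RK4 endpoint: " + str(y_end))
print("Analytic: " + str(100.0 * exp(-k * 10.0)))
With a step of dt = 0.01, RK4's fourth-order error makes the two values agree to several decimal places.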
Example: Differential Expression Analysis
Run t-tests across all genes and correct for multiple testing.
let expr = tsv("counts_normalized.tsv")
let metadata = tsv("sample_info.tsv")
let case_ids = metadata |> filter(|r| r.condition == "treated") |> col("sample_id")
let ctrl_ids = metadata |> filter(|r| r.condition == "control") |> col("sample_id")
let gene_names = expr |> col("gene_id")
let results = gene_names |> map(|gene| {
let gi = index_of(gene_names, gene)
let case_vals = case_ids |> map(|id| expr[id][gi])
let ctrl_vals = ctrl_ids |> map(|id| expr[id][gi])
let test = ttest(case_vals, ctrl_vals)
let fc = mean(case_vals) / mean(ctrl_vals)
{gene: gene, log2fc: log2(fc), pvalue: test.pvalue}
})
let p_vals = results |> map(|r| r.pvalue)
let q_vals = p_adjust(p_vals, "BH")
let de_genes = range(0, len(results))
|> map(|i| {...results[i], q_value: q_vals[i]})
|> filter(|r| r.q_value < 0.05 && abs(r.log2fc) > 1.0)
|> sort_by(|r| r.q_value)
print(str(len(de_genes)) + " differentially expressed genes")
de_genes |> take(20) |> each(|g| print(g.gene + " log2FC=" + str(g.log2fc)))
de_genes |> write_tsv("de_results.tsv")
Example: PCA on Gene Expression
Use the covariance matrix eigendecomposition to project samples into PC space.
let expr = tsv("tpm_matrix.tsv")
let sample_ids = expr |> colnames() |> filter(|n| n != "gene_id")
let n_genes = expr.num_rows
let n_samples = len(sample_ids)
# Build samples x genes matrix
let data = sample_ids
|> map(|id| expr |> col(id))
|> matrix()
# Center each gene (column) to zero mean
let centered = range(0, n_genes)
|> fold(data, |mat, j| {
let col = mat_col(mat, j)
let mu = mean(col)
mat_set_col(mat, j, col |> map(|x| x - mu))
})
# Covariance matrix (samples x samples)
let cov = mat_mul(centered, transpose(centered))
|> mat_map(|x| x / (n_genes - 1))
# Eigendecomposition
let eigen = eigenvalues(cov)
let total_var = sum(eigen.values)
# Project onto first 2 PCs
let pc1 = mat_col(eigen.vectors, 0)
let pc2 = mat_col(eigen.vectors, 1)
let pca_coords = range(0, n_samples)
|> map(|i| {
sample: sample_ids[i],
pc1: pc1[i],
pc2: pc2[i],
})
print("PC1: " + str(eigen.values[0] / total_var * 100) + "% variance")
print("PC2: " + str(eigen.values[1] / total_var * 100) + "% variance")
pca_coords |> write_tsv("pca_coordinates.tsv")
Example: Pharmacokinetic Modeling
Model oral drug absorption and elimination using a two-compartment ODE system.
# Parameters for a typical oral drug
let ka = 1.0 # absorption rate (1/hr)
let ke = 0.2 # elimination rate (1/hr)
let dose = 500.0 # mg
# State: [gut_amount, plasma_amount], both in mg; divide plasma_amount by the
# volume of distribution to get a concentration
# gut depletes by absorption, plasma gains from gut and loses by elimination
let pk_model = |t, y| [
    -ka * y[0], # dA_gut/dt = -ka * A_gut
    ka * y[0] - ke * y[1], # dA_plasma/dt = ka * A_gut - ke * A_plasma
]
let initial = [dose, 0.0]
let solution = ode_solve(pk_model, initial, [0.0, 24.0, 0.1])
# Find Cmax and Tmax
let plasma = solution.y |> map(|state| state[1])
let cmax = max(plasma)
let tmax_idx = index_of(plasma, cmax)
let tmax = solution.t[tmax_idx]
print("Tmax: " + str(tmax) + " hr")
print("Cmax: " + str(cmax) + " mg")
# Half-life
let t_half = log(2) / ke
print("Half-life: " + str(t_half) + " hr")
# AUC by trapezoidal rule
let auc = range(0, len(plasma) - 1)
|> map(|i| (plasma[i] + plasma[i + 1]) / 2.0 * (solution.t[i + 1] - solution.t[i]))
|> sum()
print("AUC(0-24h): " + str(auc) + " mg*hr")
Example: Population Genetics (Hardy-Weinberg)
Simulate genotype frequencies and test for equilibrium across loci.
# Observed genotype counts at 50 SNP loci
let loci = tsv("genotype_counts.tsv")
# columns: locus, AA, Aa, aa
let hwe_results = loci |> map(|loc| {
let obs = [loc.AA, loc.Aa, loc.aa]
let n = sum(obs)
let p = (2 * loc.AA + loc.Aa) / (2 * n)
let q = 1.0 - p
let exp = [p * p * n, 2 * p * q * n, q * q * n]
let test = chi_square(obs, exp)
{
locus: loc.locus,
p_freq: p,
q_freq: q,
chi2: test.statistic,
pvalue: test.pvalue,
}
})
# Correct for multiple testing
let p_vals = hwe_results |> map(|r| r.pvalue)
let q_vals = p_adjust(p_vals, "BH")
let departures = range(0, len(hwe_results))
|> map(|i| {...hwe_results[i], q_value: q_vals[i]})
|> filter(|r| r.q_value < 0.05)
print(str(len(departures)) + " loci depart from HWE (FDR < 0.05):")
departures |> each(|r| print(" " + r.locus + " chi2=" + str(r.chi2)
+ " q=" + str(r.q_value)))
Summary
BioLang’s numerical toolbox covers the statistical tests and linear algebra operations that appear throughout computational biology. Descriptive stats, hypothesis tests with multiple-testing correction, matrix decompositions, and ODE integration are all available as built-in functions, keeping your analysis in a single language from raw data to publication-ready results.
Chapter 13: Biological Database APIs
Bioinformatics analysis rarely lives in isolation. You query NCBI for gene annotations, check UniProt for protein function, pull variant consequences from Ensembl VEP, and look up pathways in KEGG. BioLang provides built-in client functions for these and other databases so you can fetch, cross-reference, and integrate biological data without leaving your script.
All bio API functions are builtins. No imports are needed.
NCBI
The National Center for Biotechnology Information hosts PubMed, Gene, Nucleotide, and dozens of other databases. BioLang provides four NCBI functions: ncbi_search, ncbi_summary, ncbi_gene, and ncbi_sequence.
ncbi_search
Search any NCBI database (PubMed, Gene, Nucleotide, Protein, etc.).
# Search PubMed for recent CRISPR-Cas9 papers
# ncbi_search(db, query, max_results?)
let ids = ncbi_search("pubmed", "CRISPR-Cas9 delivery 2024", 20)
# ids => ["39012345", "39012346", ...]
# Get summaries for the IDs
let summaries = ncbi_summary(ids, "pubmed")
summaries |> each(|s| {
print(s.uid)
})
ncbi_gene
Fetch detailed gene information by gene ID or symbol.
# ncbi_gene(symbol_or_query, max_results?)
let tp53 = ncbi_gene("TP53")
# If a single gene matches, returns a record:
# {id, symbol, name, description, organism, chromosome, location, summary}
print(tp53.name + " on chr" + tp53.chromosome + ": " + tp53.location)
ncbi_sequence
Retrieve nucleotide or protein sequences from NCBI.
# ncbi_sequence(accession) — returns FASTA text
let fasta = ncbi_sequence("NM_000546.6")
print("FASTA (first 100 chars): " + fasta[0..100])
Set the NCBI_API_KEY environment variable to get 10 requests/second instead
of the default 3.
Ensembl
ensembl_symbol and ensembl_gene
Look up a gene via the Ensembl REST API.
# ensembl_symbol(species, symbol) — lookup by symbol
let brca1 = ensembl_symbol("human", "BRCA1")
# Returns: {id, symbol, description, species, biotype, start, end, strand, chromosome}
print("Ensembl ID: " + brca1.id)
print("Location: " + brca1.chromosome + ":" + str(brca1.start) + "-" + str(brca1.end))
# ensembl_gene(ensembl_id) — lookup by Ensembl ID
let same = ensembl_gene("ENSG00000012048")
print("Symbol: " + same.symbol)
ensembl_vep
Predict the functional consequences of variants using the Variant Effect Predictor.
# ensembl_vep(hgvs) — predict variant consequences
let variants = [
"17:g.43091434C>T", # BRCA1 splice donor
"7:g.140753336A>T", # BRAF V600E
"12:g.25245350C>A", # KRAS G12V
]
let predictions = variants |> map(|v| ensembl_vep(v))
# Each result is a list of records with:
# {allele_string, most_severe_consequence, transcript_consequences: [...]}
predictions |> each(|pred| {
if len(pred) > 0 then {
let r = pred[0]
print(r.allele_string + " => " + r.most_severe_consequence)
}
})
UniProt
uniprot_search
Search the UniProt protein database.
# uniprot_search(query, limit?)
let kinases = uniprot_search("kinase AND organism_id:9606", 50)
# Returns list of records: {accession, name, organism, sequence_length, gene_names, function}
print(str(len(kinases)) + " human kinases found")
kinases |> take(5) |> each(|k| print(k.accession + " " + k.gene_names))
uniprot_entry
Get full details for a single UniProt accession.
let entry = uniprot_entry("P04637") # TP53
# Returns: {accession, name, organism, sequence_length, gene_names, function}
print(entry.name + ": " + str(entry.sequence_length) + " aa")
print("Genes: " + entry.gene_names)
print("Function: " + entry.function)
# Get protein features separately
let features = uniprot_features("P04637")
# Returns list of: {type, location, description}
let domains = features |> filter(|f| f.type == "Domain")
domains |> each(|d| print(" " + d.description + ": " + d.location))
UCSC Genome Browser
ucsc_sequence
Retrieve genomic sequences from the UCSC Genome Browser.
# Get the sequence of the BRCA1 promoter region
# ucsc_sequence(genome, chrom, start, end)
let promoter = ucsc_sequence("hg38", "chr17", 43170245, 43172245)
# Returns DNA sequence as a string
print("BRCA1 promoter length: " + str(len(promoter)) + " bp")
let gc = gc_content(dna(promoter))
print("GC content: " + str(gc))
KEGG
kegg_find
Search KEGG databases (pathway, enzyme, compound, etc.).
# kegg_find(db, query) — search KEGG databases
let pathways = kegg_find("pathway", "apoptosis human")
# Returns list of: {id, description}
kegg_get
Retrieve a specific KEGG entry.
# kegg_get(entry_id) — returns raw KEGG text
let apoptosis = kegg_get("hsa04210")
# Returns the KEGG flat-file text for the entry
print("Entry text length: " + str(len(apoptosis)) + " chars")
STRING
string_network
Query protein-protein interaction networks from the STRING database.
# string_network(identifiers, species)
let network = string_network("TP53", 9606)
# Returns interaction network data
print(network)
PDB
pdb_entry
Retrieve protein structure metadata from the Protein Data Bank.
let structure = pdb_entry("1TUP") # TP53 DNA-binding domain
# structure => {
# id: "1TUP",
# title: "Crystal structure of the p53 core domain...",
# resolution: 2.2,
# method: "X-RAY DIFFRACTION",
# chains: [{id: "A", entity: "Tumor suppressor p53", length: 195}, ...],
# ligands: [...],
# }
print(structure.title)
print("Resolution: " + str(structure.resolution) + " Å")
Reactome
reactome_pathways
Find pathways associated with a gene or set of genes.
# reactome_pathways(gene_or_genes)
let pathways = reactome_pathways("BRCA1")
# Returns pathway records
pathways |> each(|p| print(p))
Gene Ontology
go_term
Look up a GO term by its identifier.
let term = go_term("GO:0006915")
# term => {id: "GO:0006915", name: "apoptotic process",
# namespace: "biological_process",
# definition: "A programmed cell death process..."}
go_annotations
Get GO annotations for a gene.
# go_annotations(gene_or_accession)
let annotations = go_annotations("TP53")
# Returns list of: {go_id, term, aspect}
annotations |> take(10) |> each(|a| print(" " + a.go_id + " [" + a.aspect + "]: " + a.term))
COSMIC
cosmic_gene
Query the Catalogue Of Somatic Mutations In Cancer. Requires COSMIC_API_KEY.
# cosmic_gene(gene) — requires COSMIC_API_KEY env var
let cosmic = cosmic_gene("BRAF")
# Returns mutation data for the gene
print(cosmic)
NCBI Datasets
datasets_gene
Use the NCBI Datasets API for gene metadata.
# datasets_gene(symbol_or_id)
let info = datasets_gene("EGFR")
print(info)
Environment Variables
Some APIs require or benefit from API keys:
| Variable | Effect |
|---|---|
| NCBI_API_KEY | NCBI: 10 req/sec instead of 3 |
| COSMIC_API_KEY | Required for COSMIC queries |
Set these in your shell or in a .env file before running your script.
Example: Gene Annotation Pipeline
Fetch gene info from NCBI, cross-reference with UniProt, and pull pathways from KEGG.
let gene_list = ["TP53", "BRCA1", "EGFR", "KRAS", "PIK3CA"]
let annotated = gene_list |> map(|symbol| {
let ncbi = ncbi_gene(symbol)
let up_hits = uniprot_search("gene:" + symbol + " AND organism_id:9606", 1)
let prot_len = if len(up_hits) > 0 then up_hits[0].sequence_length else 0
let pathways = kegg_find("pathway", symbol + " human")
|> take(3)
{
symbol: symbol,
name: ncbi.name,
chromosome: ncbi.chromosome,
location: ncbi.location,
protein_length: prot_len,
pathways: pathways |> map(|p| p.description),
}
})
annotated |> each(|g| {
print(g.symbol + " (" + g.name + ") - chr" + g.chromosome)
print(" Protein: " + str(g.protein_length) + " aa")
print(" Pathways: " + str(g.pathways))
})
annotated |> write_json("gene_annotations.json")
Example: Variant Interpretation
Predict variant effects with Ensembl VEP and check for known cancer mutations in COSMIC.
let variants = tsv("candidate_variants.tsv")
|> map(|r| r.chrom + ":" + str(r.pos) + ":" + r.ref + ":" + r.alt)
let interpreted = variants |> map(|v| {
let vep = ensembl_vep(v)
let worst = vep[0].transcript_consequences
|> sort_by(|c| c.impact_rank)
|> first()
let gene = worst.gene_symbol
# Check COSMIC if it is a missense or nonsense variant
let cosmic_info = if worst.consequence == "missense_variant" or
worst.consequence == "stop_gained" then
cosmic_gene(gene)
|> |cg| cg.mutations
|> find(|m| m.aa_change == worst.amino_acid_change)
else
None
{
variant: v,
gene: gene,
consequence: worst.consequence,
impact: worst.impact,
sift: worst.sift_prediction,
polyphen: worst.polyphen_prediction,
cosmic_count: if cosmic_info != None then cosmic_info.count else 0,
cosmic_known: cosmic_info != None,
}
})
# Flag high-impact or COSMIC-known variants
let flagged = interpreted
|> filter(|v| v.impact == "HIGH" or v.cosmic_known)
|> sort_by(|v| -v.cosmic_count)
print(str(len(flagged)) + " variants flagged for review")
flagged |> write_tsv("flagged_variants.tsv")
Example: Protein Interaction Network
Build a STRING interaction network for a set of differentially expressed genes and annotate with Reactome pathways.
let de_genes = tsv("de_results.tsv")
|> filter(|r| r.q_value < 0.01 and abs(r.log2fc) > 2.0)
|> map(|r| r.gene)
# Get STRING interactions for the top 50 DE genes
let top_genes = de_genes |> take(50)
let network = string_network(top_genes, 9606)
print(str(len(network.nodes)) + " nodes, "
+ str(len(network.edges)) + " edges")
# Find hub genes (most connections)
let degree = network.nodes |> map(|node| {
let edges = network.edges
|> filter(|e| e.source == node.id or e.target == node.id)
{gene: node.id, degree: len(edges)}
})
|> sort_by(|d| -d.degree)
print("Hub genes:")
degree |> take(10) |> each(|d| print(" " + d.gene + ": " + str(d.degree) + " interactions"))
# Pathway enrichment for hub genes
let hub_genes = degree |> take(10) |> map(|d| d.gene)
let pathways = reactome_pathways(hub_genes)
|> filter(|p| p.p_value < 0.05)
|> sort_by(|p| p.p_value)
print("Enriched pathways:")
pathways |> take(10) |> each(|p| print(" " + p.name + " (p=" + str(p.p_value) + ")"))
Summary
BioLang’s built-in bio API functions let you query the major biological databases directly from your scripts. Combine them with pipes, maps, and filters to build annotation pipelines that cross-reference genes, variants, proteins, and pathways in a few lines of code. Set API keys via environment variables to increase rate limits where supported.
Chapter 14: Streams and Scale
Genomics data is large. A single whole-genome sequencing run produces hundreds of gigabytes of FASTQ. A population-scale VCF can hold billions of variant records. Loading everything into memory is not an option. BioLang addresses this with streams: lazy, single-pass iterators that process data record by record without materializing the entire dataset.
What Is a Stream?
A stream is a StreamValue – a lazy sequence of items produced on demand.
File readers in BioLang return streams by default. Streams are consumed once:
after you iterate through a stream, it is exhausted.
# read_fastq returns a stream, not a list
let reads = read_fastq("whole_genome_R1.fastq.gz")
# This processes one record at a time -- constant memory
let high_quality = reads
|> filter(|r| mean_phred(r.quality) >= 30)
|> map(|r| {id: r.id, length: r.length, gc: gc_content(r.seq)})
|> write_tsv("read_stats.tsv")
No matter how large the FASTQ file is, this script uses the same amount of
memory. Each record flows through filter, then map, then is written out
before the next record is read.
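A consequence of single-pass semantics worth internalizing: consuming a stream twice does not work. A hedged sketch (whether a second pass yields zero items or an error is implementation-defined):
let reads = read_fastq("sample.fastq.gz")
let n = reads |> count() # consumes the stream
# A second pass over reads would see an exhausted stream.
# To traverse the data twice, either re-open the file or materialize it:
let in_memory = read_fastq("sample.fastq.gz") |> collect()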
Stream Operations
Streams support the same functional operations as lists, but they execute lazily.
map
Transform each element.
read_vcf("variants.vcf.gz")
|> map(|v| {chrom: v.chrom, pos: v.pos, qual: v.qual, af: v.info.AF})
|> write_tsv("variant_summary.tsv")
filter
Keep only elements matching a predicate.
read_bam("aligned.bam")
|> filter(|r| r.mapping_quality >= 30 and not r.is_duplicate)
|> count()
|> |n| print(str(n) + " high-quality unique alignments")
take and take_while
take(n) consumes a fixed number of elements from a stream, then stops.
take_while(pred) consumes elements while a predicate on each element holds.
# Read the first million reads from a FASTQ for quick QC
let sample = read_fastq("deep_sequencing.fastq.gz")
|> take(1000000)
let quals = sample |> map(|r| mean_phred(r.quality))
print("Sampled stats: " + str(mean(quals)) + " mean Q")
each
Execute a side effect for every element (the stream analogue of a for loop).
let bad_count = 0
read_fastq("sample.fastq.gz")
|> each(|r| {
if mean_phred(r.quality) < 20 then
bad_count = bad_count + 1
})
print(str(bad_count) + " low-quality reads")
fold / reduce
Accumulate a value across the entire stream.
let total_bases = read_fastq("sample.fastq.gz")
|> fold(0, |acc, r| acc + r.length)
print("Total bases: " + str(total_bases))
Chunked Processing
When you need batch operations on a stream, chunk groups elements into
fixed-size lists. This is useful for writing output in blocks or batching API
calls.
# Process a FASTQ in chunks of 10,000 reads
read_fastq("large_sample.fastq.gz")
|> chunk(10000)
|> each(|batch| {
let mean_q = batch |> map(|r| mean_phred(r.quality)) |> mean()
print("Batch: " + str(len(batch)) + " reads, "
+ "mean Q=" + str(mean_q))
})
Chunking is also useful for writing to multiple output files.
# Split a FASTQ into files of 1 million reads each
let file_num = 0
read_fastq("huge_sample.fastq.gz")
|> chunk(1000000)
|> each(|batch| {
let path = "split/chunk_" + str(file_num) + ".fastq.gz"
batch |> write_fastq(path)
file_num = file_num + 1
})
print("Wrote " + str(file_num) + " chunk files")
Window Operations
window creates a sliding window over a list (or materialized portion of a
stream). This is essential for computing positional statistics across a genome.
# Sliding window GC content
let sequence = read_fasta("chr1.fa") |> first() |> |r| r.seq
let win_size = 1000
let step = 500
let gc_track = window(sequence, win_size, step)
|> enumerate()
|> map(|pair| {
pos: pair[0] * step,
gc: gc_content(pair[1]),
})
gc_track |> write_tsv("chr1_gc_content.tsv")
For windowed operations on a stream (where you cannot look back), either
materialize the relevant slice first (sort_by collects the stream, so window
applies afterward) or use sliding_window, which maintains a rolling buffer.
# Compute rolling average base quality across a BAM
read_bam("sample.bam")
|> filter(|r| r.chrom == "chr17")
|> sort_by(|r| r.pos)
|> window(100)
|> map(|win| {
pos: win[50].pos,
mean_mapq: mean(win |> map(|r| r.mapping_quality)),
})
|> write_tsv("chr17_mapq_rolling.tsv")
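To stay in stream mode instead, the same rolling statistic can be sketched with sliding_window; its signature is assumed here to mirror window(size), and the BAM is assumed to be coordinate-sorted so no in-memory sort is needed.
read_bam("sample.sorted.bam")
|> filter(|r| r.chrom == "chr17")
|> sliding_window(100)
|> map(|win| {
    pos: win[50].pos,
    mean_mapq: mean(win |> map(|r| r.mapping_quality)),
})
|> write_tsv("chr17_mapq_rolling_stream.tsv")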
Parallel Operations
par_map
Apply a function to each element of a list in parallel, distributing work across available CPU cores.
let chromosomes = ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6",
"chr7", "chr8", "chr9", "chr10", "chr11", "chr12",
"chr13", "chr14", "chr15", "chr16", "chr17", "chr18",
"chr19", "chr20", "chr21", "chr22", "chrX", "chrY"]
let per_chrom_stats = par_map(chromosomes, |chrom| {
let reads = read_bam("sample.bam") |> filter(|r| r.chrom == chrom)
let count = reads |> count()
let depth = samtools_depth("sample.bam", chrom) |> mean()
{chrom: chrom, reads: count, mean_depth: depth}
})
per_chrom_stats |> sort_by(|s| s.chrom) |> write_tsv("chrom_stats.tsv")
par_filter
Filter elements in parallel. Useful when the predicate itself is expensive.
let variants = tsv("all_variants.tsv")
let pathogenic = par_filter(variants, |v| {
let vep = ensembl_vep(v.chrom + ":" + str(v.pos) + ":"
+ v.ref + ":" + v.alt)
let worst = vep[0].transcript_consequences |> first()
worst.impact == "HIGH"
})
print(str(len(pathogenic)) + " high-impact variants")
Unit Helpers
BioLang provides genomic unit helpers that make size comparisons readable.
let region_size = mb(3) # 3,000,000
let read_length = bp(150) # 150
let win = kb(50) # 50,000
let genome_size = gb(3) # 3,000,000,000
# Use in calculations
let coverage = 1800000000 / genome_size # ~0.6x
print("Coverage: " + str(coverage) + "x")
# Filter regions by size
let large_svs = read_vcf("structural_variants.vcf.gz")
|> filter(|v| v.info.SVLEN != None and abs(v.info.SVLEN) > mb(1))
print(str(large_svs |> count()) + " SVs larger than 1 Mb")
These helpers are simple multipliers but they prevent off-by-three-zeros bugs that plague genomics scripts.
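Because the helpers are plain multipliers, they compose with ordinary arithmetic. A quick coverage estimate, for example:
let n_reads = 600000000 # read pairs counted individually
let read_len = bp(150)
let coverage = n_reads * read_len / gb(3)
print("Expected coverage: " + str(coverage) + "x") # 30x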
Example: Process a 100GB FASTQ
Stream 4 billion reads, filter by quality, and compute statistics without loading the file into memory.
let input = "novaseq_R1.fastq.gz" # 100 GB compressed
let stats = {
total: 0,
passed: 0,
total_bases: 0,
gc_bases: 0,
qual_sum: 0,
}
read_fastq(input)
|> each(|r| {
stats.total = stats.total + 1
let q = mean_phred(r.quality)
if q >= 30 then {
stats.passed = stats.passed + 1
stats.total_bases = stats.total_bases + r.length
let gc = gc_content(dna(r.seq)) * r.length
stats.gc_bases = stats.gc_bases + gc
stats.qual_sum = stats.qual_sum + q
}
})
let pct_pass = stats.passed / stats.total * 100
let gc = stats.gc_bases / stats.total_bases * 100
let mean_q = stats.qual_sum / stats.passed
print("Total reads: " + str(stats.total))
print("Passed Q>=30: " + str(stats.passed) + " (" + str(pct_pass) + "%)")
print("Mean quality: " + str(mean_q))
print("GC content: " + str(gc) + "%")
This script processes the entire file in a single pass. Memory usage stays constant regardless of file size.
Example: Parallel Variant Annotation Across Chromosomes
Split variant annotation work by chromosome to use all cores.
let vcf_path = "cohort.vcf.gz"
let chromosomes = range(1, 23) |> map(|n| "chr" + str(n))
let chromosomes = concat(chromosomes, ["chrX", "chrY"])
let annotated = par_map(chromosomes, |chrom| {
let variants = read_vcf(vcf_path)
|> filter(|v| v.chrom == chrom)
|> collect()
let results = variants |> map(|v| {
let vep = ensembl_vep(v.chrom + ":" + str(v.pos) + ":" + v.ref + ":" + v.alt)
let worst = vep[0].transcript_consequences
|> sort_by(|c| c.impact_rank)
|> first()
{
...v,
consequence: worst.consequence,
impact: worst.impact,
gene: worst.gene_symbol,
}
})
print(chrom + ": " + str(len(results)) + " variants annotated")
results
})
|> flatten()
annotated |> write_tsv("annotated_variants.tsv")
print("Annotated " + str(len(annotated)) + " variants across "
+ str(len(chromosomes)) + " chromosomes")
Example: Sliding Window GC Content
Compute GC content in 1 kb windows across an entire chromosome.
let chr_seq = read_fasta("GRCh38_chr22.fa") |> first() |> |r| r.seq
let chr_len = len(chr_seq)
let win = kb(1)
let step = bp(500)
print("Chromosome 22: " + str(chr_len) + " bp")
print("Computing GC in " + str(win) + " bp windows, step " + str(step))
let gc_profile = window(chr_seq, win, step)
|> enumerate()
|> map(|pair| {
start: pair[0] * step,
end: pair[0] * step + win,
gc: gc_content(pair[1]),
})
# Identify GC-rich and GC-poor regions
let gc_rich = gc_profile |> filter(|w| w.gc > 0.60)
let gc_poor = gc_profile |> filter(|w| w.gc < 0.35)
print("GC-rich windows (>60%): " + str(len(gc_rich)))
print("GC-poor windows (<35%): " + str(len(gc_poor)))
gc_profile |> write_tsv("chr22_gc_profile.tsv")
Example: Batch Processing Thousands of Samples
Process a large sample cohort using chunked parallelism to avoid overwhelming system resources.
let manifest = tsv("sample_manifest.tsv") # 2,000 samples
print("Processing " + str(len(manifest)) + " samples")
let batch_size = 16 # process 16 samples at a time
let all_results = manifest
|> chunk(batch_size)
|> enumerate()
|> map(|pair| {
let batch_idx = pair[0]
let batch = pair[1]
print("Batch " + str(batch_idx + 1) + "/"
+ str(ceil(len(manifest) / batch_size)))
par_map(batch, |sample| {
# Align
let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 2)
|> samtools_sort(threads: 2)
samtools_index(bam)
# QC metrics
let flagstat = samtools_flagstat(bam)
let depth = samtools_depth(bam) |> mean()
let insert_size = picard_insert_size(bam)
{
id: sample.sample_id,
bam: bam,
mapped_pct: flagstat.mapped_pct,
mean_depth: depth,
median_insert: insert_size.median,
status: if flagstat.mapped_pct > 90 and depth > 20
then "PASS" else "FAIL",
}
})
})
|> flatten()
let passed = all_results |> filter(|r| r.status == "PASS")
let failed = all_results |> filter(|r| r.status == "FAIL")
print("Results: " + str(len(passed)) + " PASS, "
+ str(len(failed)) + " FAIL")
if len(failed) > 0 then {
print("Failed samples:")
failed |> each(|r| print(" " + r.id + " depth=" + str(r.mean_depth)
+ " mapped=" + str(r.mapped_pct) + "%"))
}
all_results |> write_tsv("cohort_qc_results.tsv")
Memory Patterns to Know
| Pattern | Memory | Use When |
|---|---|---|
| read_fastq(f) |> filter(...) |> count() | O(1) | Counting, simple stats |
| read_vcf(f) |> collect() | O(n) | Need random access |
| read_bam(f) |> chunk(k) |> each(...) | O(k) | Batch writing, API calls |
| par_map(list, f) | O(n) | List already in memory |
| stream |> fold(init, f) | O(1) | Aggregation |
The rule of thumb: stay in stream mode as long as possible. Call collect()
only when you genuinely need the full dataset in memory (for sorting, random
access, or passing to a function that requires a list).
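As a concrete example of that rule: a median needs the whole dataset, so collect the per-read summary values while the reads themselves stay in stream mode (median is assumed available among the statistics builtins):
let quals = read_fastq("sample.fastq.gz")
|> map(|r| mean_phred(r.quality)) # one number per read
|> collect() # O(number of reads), far smaller than the reads themselves
print("Median Q: " + str(median(quals)))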
Summary
Streams let BioLang handle files of any size in constant memory. Combine them
with chunk for batch processing, window for positional analysis, and
par_map/par_filter for multi-core parallelism. The unit helpers bp(),
kb(), mb(), and gb() keep genomic arithmetic readable. Together, these
tools let you write scripts that scale from a test FASTQ to a
population-scale dataset without changing the code.
Chapter 15: SQLite & Notifications
BioLang includes built-in SQLite support and notification builtins so your pipelines can persist results and alert you when they finish — no external tools required.
SQLite
Bioinformatics workflows constantly produce tabular results: QC metrics, variant counts, sample manifests. SQLite gives you a zero-config embedded database to store, query, and compare results across runs.
Opening a Database
# File-based (creates if missing)
let db = sqlite("project_results.db")
# In-memory (temporary, fast)
let scratch = sqlite()
Querying
sql() executes any SQL statement. SELECT queries return a Table; write
statements return the number of affected rows. Use ? placeholders for
parameterized queries.
# Create a table
sql(db, "CREATE TABLE IF NOT EXISTS qc (
sample TEXT PRIMARY KEY,
total_reads INTEGER,
pass_reads INTEGER,
pass_rate REAL
)")
# Insert data
sql(db, "INSERT INTO qc VALUES (?, ?, ?, ?)",
["tumor_01", 5000000, 4850000, 97.0])
# Query — returns a Table
let results = sql(db, "SELECT * FROM qc WHERE pass_rate > ?", [95.0])
print(results)
Since sql() returns a standard BioLang Table, you can pipe it directly:
sql(db, "SELECT symbol, chrom, start, end FROM genes WHERE chrom = ?", ["chr17"])
|> filter(|g| g.start > 40000000)
|> sort_by(|g| g.start)
|> print()
Bulk Insert
sql_insert() inserts an entire Table or list of records in a single
transaction. This is much faster than individual INSERT statements.
# From a Table (e.g., read from TSV)
let stats = read_tsv("variant_stats.tsv")
sql_insert(db, "variants", stats)
# From a list of records
let samples = [
{id: "S1", depth: 42.3, mapped_pct: 98.1},
{id: "S2", depth: 38.7, mapped_pct: 97.5},
]
let inserted = sql_insert(db, "qc_results", samples)
print(str(inserted) + " rows inserted")
Metadata
# List all tables
sql_tables(db)
# ["qc", "variants", "qc_results"]
# Inspect a table's schema
sql_schema(db, "qc")
# cid | name | type | notnull | pk
# -----+-------------+---------+---------+-----
# 0 | sample | TEXT | false | true
# 1 | total_reads | INTEGER | false | false
# 2 | pass_reads | INTEGER | false | false
# 3 | pass_rate | REAL | false | false
Pipeline Example
Store QC results from every sample run, then query across all runs:
let db = sqlite("lab_results.db")
sql(db, "CREATE TABLE IF NOT EXISTS qc (
sample TEXT, run_date TEXT, total INTEGER, passing INTEGER, rate REAL
)")
pipeline sample_qc(sample_id, fastq_path) {
stage stats {
let reads = read_fastq(fastq_path) |> collect()
let total = len(reads)
let passing = reads |> filter(|r| mean_phred(r.quality) >= 25) |> len()
let rate = passing / total * 100.0
sql(db, "INSERT INTO qc VALUES (?, date('now'), ?, ?, ?)",
[sample_id, total, passing, rate])
{total: total, passing: passing, rate: rate}
}
}
# After processing many samples, query trends
let low_quality = sql(db,
"SELECT sample, rate FROM qc WHERE rate < ? ORDER BY rate",
[95.0])
print("Samples below 95% pass rate:")
print(low_quality)
SQLite Builtins Summary
| Builtin | Returns | Description |
|---|---|---|
| sqlite(path?) | DbHandle | Open/create database |
| sql(db, query, params?) | Table or Int | Execute SQL |
| sql_insert(db, table, data) | Int | Bulk insert (transactional) |
| sql_tables(db) | List | List table names |
| sql_schema(db, table) | Table | Column metadata |
| sql_close(db) | Nil | Close connection (optional) |
| is_db(value) | Bool | Type check |
Notifications
Long-running pipelines (alignment, variant calling, cohort analysis) can take hours. BioLang’s notification builtins send you a message when your analysis finishes or fails — Slack, Teams, Telegram, Discord, or email.
The notify() Builtin
notify() is the smart router. It reads BIOLANG_NOTIFY to determine which
provider to use, then sends the message.
# Simple string
notify("Alignment complete: 24 samples processed")
# Structured record — provider formats it natively
notify({
title: "QC Pipeline Complete",
status: "success",
fields: {
samples: 24,
pass_rate: "96%",
output: "/data/results/cohort.vcf.gz"
}
})
If BIOLANG_NOTIFY is not set, notify() prints to stderr as a fallback.
Provider-Specific Builtins
Each provider has a dedicated builtin for direct use:
# Slack (env: SLACK_WEBHOOK)
slack("Variant calling finished: 1,234 SNPs, 456 indels")
# Microsoft Teams (env: TEAMS_WEBHOOK)
teams("RNA-seq pipeline complete: 12 samples normalized")
# Telegram (env: TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID)
telegram("Alignment done: tumor.sorted.bam")
# Discord (env: DISCORD_WEBHOOK)
discord("FASTQ QC complete: 98% pass rate")
# Email (env: SMTP_HOST, SMTP_USER, SMTP_PASS)
email("lab@example.com", "Pipeline Complete", "Analysis finished successfully")
All webhook-based builtins accept an optional first argument for an explicit webhook URL, so you can skip env vars:
slack("https://hooks.slack.com/services/xxx", "Pipeline done")
teams("https://outlook.office.com/webhook/xxx", "Pipeline done")
Structured Messages
Pass a record instead of a string for rich formatting. Each provider renders it natively (Slack Block Kit, Teams Adaptive Cards, Discord embeds):
slack({
title: "Cohort Analysis Complete",
status: "success",
fields: {
samples: 48,
variants: "2.3M",
ti_tv: 2.1,
output: "cohort.annotated.vcf.gz"
}
})
Pipeline Integration
Use notify() in pipeline stages or with defer for guaranteed delivery:
pipeline variant_pipeline {
defer {
notify("Pipeline variant_pipeline finished")
}
stage align {
shell("bwa-mem2 mem -t 16 GRCh38.fa R1.fq.gz R2.fq.gz | samtools sort -o aligned.bam")
"aligned.bam"
}
stage call {
shell("gatk HaplotypeCaller -R GRCh38.fa -I " + align + " -O raw.vcf.gz")
let variants = read_vcf("raw.vcf.gz") |> collect()
let snps = variants |> filter(|v| v.is_snp) |> len()
let indels = variants |> filter(|v| v.is_indel) |> len()
notify({
title: "Variant Calling Complete",
fields: {SNPs: snps, Indels: indels}
})
"raw.vcf.gz"
}
}
Setup
Set environment variables once in your shell profile:
# Pick your provider
export BIOLANG_NOTIFY=slack
# Slack
export SLACK_WEBHOOK=https://hooks.slack.com/services/T.../B.../xxx
# Telegram
export TELEGRAM_BOT_TOKEN=123456:ABC-DEF
export TELEGRAM_CHAT_ID=-1001234567890
# Email (for notify() with BIOLANG_NOTIFY=email)
export SMTP_HOST=smtp.gmail.com
export SMTP_USER=you@gmail.com
export SMTP_PASS=app-password
export NOTIFY_EMAIL_TO=lab@example.com
Notification Builtins Summary
| Builtin | Provider | Env Vars |
|---|---|---|
| `notify(msg)` | Auto | `BIOLANG_NOTIFY` + provider vars |
| `slack(msg)` | Slack | `SLACK_WEBHOOK` |
| `teams(msg)` | Teams | `TEAMS_WEBHOOK` |
| `telegram(msg)` | Telegram | `TELEGRAM_BOT_TOKEN`, `TELEGRAM_CHAT_ID` |
| `discord(msg)` | Discord | `DISCORD_WEBHOOK` |
| `email(to, subj, body)` | SMTP | `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS` |
Knowledge Graphs
BioLang includes a built-in graph data structure for modeling biological networks — protein interactions, gene regulatory networks, metabolic pathways, and more.
Creating Graphs
# Empty undirected graph
let g = graph()
# Directed graph
let g = graph(true)
Adding Nodes and Edges
Nodes are identified by string IDs and can carry arbitrary attributes:
let g = graph()
let g = add_node(g, "BRCA1", {biotype: "protein_coding", chrom: "chr17"})
let g = add_node(g, "TP53", {biotype: "protein_coding", chrom: "chr17"})
let g = add_edge(g, "BRCA1", "TP53", {score: 0.99, source: "STRING"})
Adding an edge auto-creates nodes that don’t exist yet:
let g = graph()
let g = add_edge(g, "A", "B") # both A and B are created
has_node(g, "A") # true
Querying the Graph
# Build a small PPI network
let g = graph()
let g = add_edge(g, "BRCA1", "TP53", {score: 0.99})
let g = add_edge(g, "TP53", "MDM2", {score: 0.97})
let g = add_edge(g, "BRCA1", "BARD1", {score: 0.95})
let g = add_edge(g, "MDM2", "CDKN2A", {score: 0.85})
# Direct neighbors
neighbors(g, "TP53") # ["BRCA1", "MDM2"]
# Node degree
degree(g, "BRCA1") # 2
# Shortest path (BFS)
shortest_path(g, "BARD1", "CDKN2A") # ["BARD1", "BRCA1", "TP53", "MDM2", "CDKN2A"]
# All nodes and edges
nodes(g) # ["BARD1", "BRCA1", "CDKN2A", "MDM2", "TP53"]
edges(g) # Table with from, to, weight columns
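The `shortest_path` builtin is plain breadth-first search. Its semantics can be modeled in Python, with an adjacency dict standing in for the graph value (BioLang's internal representation is an assumption here):

```python
from collections import deque

def shortest_path(adj, start, goal):
    """BFS shortest path over an adjacency dict; returns None if unreachable."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                if nxt == goal:
                    # Walk parent pointers back to reconstruct the path
                    path = [goal]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None

# The undirected PPI toy network from the example above
adj = {
    "BRCA1": ["TP53", "BARD1"], "TP53": ["BRCA1", "MDM2"],
    "MDM2": ["TP53", "CDKN2A"], "BARD1": ["BRCA1"], "CDKN2A": ["MDM2"],
}
print(shortest_path(adj, "BARD1", "CDKN2A"))
# → ["BARD1", "BRCA1", "TP53", "MDM2", "CDKN2A"]
```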
Graph Analysis
Connected Components
Find disconnected subnetworks:
let g = graph()
let g = add_edge(g, "BRCA1", "TP53")
let g = add_edge(g, "BRCA1", "BARD1")
let g = add_node(g, "ISOLATED_GENE")
let g = add_edge(g, "CDK2", "CCND1")
let components = connected_components(g)
print("Number of components: " + str(len(components)))
# 3: [BRCA1, TP53, BARD1], [ISOLATED_GENE], [CDK2, CCND1]
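`connected_components` amounts to repeated flood fill over unvisited nodes; a Python model under the same adjacency-dict assumption as above:

```python
def connected_components(nodes, adj):
    """Group nodes into components by iterative flood fill (undirected graph)."""
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components

nodes = ["BRCA1", "TP53", "BARD1", "ISOLATED_GENE", "CDK2", "CCND1"]
adj = {"BRCA1": ["TP53", "BARD1"], "TP53": ["BRCA1"], "BARD1": ["BRCA1"],
       "CDK2": ["CCND1"], "CCND1": ["CDK2"]}
print(connected_components(nodes, adj))
# → [['BARD1', 'BRCA1', 'TP53'], ['ISOLATED_GENE'], ['CCND1', 'CDK2']]
```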
Induced Subgraph
Extract a subgraph containing only specified nodes and their connecting edges:
let g = graph()
let g = add_edge(g, "A", "B")
let g = add_edge(g, "B", "C")
let g = add_edge(g, "C", "D")
let sub = subgraph(g, ["A", "B", "C"])
has_edge(sub, "A", "B") # true
has_edge(sub, "C", "D") # false (D not in subgraph)
Node Attributes
let g = graph()
let g = add_node(g, "EGFR", {
biotype: "protein_coding",
chrom: "chr7",
pathway: "EGFR signaling"
})
let attrs = node_attr(g, "EGFR")
print(attrs.pathway) # "EGFR signaling"
Removing Nodes and Edges
let g = graph()
let g = add_edge(g, "A", "B")
let g = add_edge(g, "B", "C")
# Remove a single edge
let g = remove_edge(g, "A", "B")
has_edge(g, "A", "B") # false
# Remove a node (also removes all connected edges)
let g = remove_node(g, "B")
has_node(g, "B") # false
Directed vs Undirected
By default, graphs are undirected — edges go both ways:
let g = graph()
let g = add_edge(g, "A", "B")
neighbors(g, "B") # ["A"] — B sees A as neighbor
For directed graphs (e.g., regulatory networks):
let g = graph(true)
let g = add_edge(g, "TF", "TARGET")
neighbors(g, "TF") # ["TARGET"]
neighbors(g, "TARGET") # [] — directed, no reverse edge
Real-World Example: STRING Network Analysis
# Fetch protein interactions from STRING
let network = string_network("BRCA1", 9606)
# Build graph from API results
let g = graph()
let g = network |> reduce(g, |g, edge|
add_edge(g, edge.preferredName_A, edge.preferredName_B, {score: edge.score})
)
# Find hub genes (highest degree)
let gene_degrees = nodes(g) |> map(|n| {gene: n, deg: degree(g, n)})
gene_degrees
|> sort_by(|r| r.deg, desc: true)
|> take(10)
|> each(|r| print(r.gene + ": " + str(r.deg) + " interactions"))
# Check connectivity
let components = connected_components(g)
print("Connected components: " + str(len(components)))
Builtin Reference
| Function | Args | Description |
|---|---|---|
| `graph()` | `[directed]` | Create empty graph (default: undirected) |
| `add_node(g, id, [attrs])` | graph, Str, [Record] | Add node with optional attributes |
| `add_edge(g, from, to, [attrs])` | graph, Str, Str, [Record] | Add edge (auto-creates nodes) |
| `remove_node(g, id)` | graph, Str | Remove node and its edges |
| `remove_edge(g, from, to)` | graph, Str, Str | Remove first matching edge |
| `neighbors(g, id)` | graph, Str | List of neighbor node IDs |
| `degree(g, id)` | graph, Str | Number of edges for a node |
| `shortest_path(g, from, to)` | graph, Str, Str | BFS shortest path (nil if none) |
| `connected_components(g)` | graph | List of node-ID lists per component |
| `nodes(g)` | graph | Sorted list of all node IDs |
| `edges(g)` | graph | Table with from, to, weight columns |
| `has_node(g, id)` | graph, Str | Bool |
| `has_edge(g, from, to)` | graph, Str, Str | Bool |
| `subgraph(g, ids)` | graph, List[Str] | Induced subgraph |
| `node_attr(g, id)` | graph, Str | Attributes record for a node |
PDB Structures & Enrichment Analysis
BioLang provides direct access to the RCSB Protein Data Bank and built-in gene set enrichment analysis.
PDB Structures
Fetching Entries
let entry = pdb_entry("4HHB")
print(entry.title) # "THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN"
print(entry.method) # "X-RAY DIFFRACTION"
print(entry.resolution) # 1.74
print(entry.organism) # "Homo sapiens"
print(entry.release_date) # "1984-07-17"
Searching
let ids = pdb_search("insulin receptor")
print("Found " + str(len(ids)) + " structures")
ids |> take(5) |> each(|id| print(id))
Chains and Entities
# Get all polymer entities (chains) in a structure
let chains = pdb_chains("4HHB")
chains |> each(|c|
print(c.description + " (" + c.entity_type + "): " + str(len(c.sequence)) + " residues")
)
# Get a specific entity
let entity = pdb_entity("4HHB", 1)
print(entity.description)
print(entity.sequence)
# Get sequence as Protein value
let seq = pdb_sequence("4HHB", 1)
print(len(seq)) # sequence length
print(molecular_weight(seq)) # if available
Real-World Example: Compare Hemoglobin Chains
let alpha = pdb_entity("4HHB", 1)
let beta = pdb_entity("4HHB", 2)
print("Alpha chain: " + str(len(alpha.sequence)) + " residues")
print("Beta chain: " + str(len(beta.sequence)) + " residues")
print("Alpha type: " + alpha.entity_type)
PubMed Integration
Search PubMed and retrieve abstracts:
# Search for recent CRISPR papers
let results = pubmed_search("CRISPR Cas9 therapy", 5)
print("Total results: " + str(results.count))
# Fetch abstract for a specific paper
let abstract = pubmed_abstract(results.ids[0])
print(abstract)
Literature Review Pipeline
# Search for papers about a gene of interest
let gene = "BRCA1"
let results = pubmed_search(gene + " cancer therapy 2024", 20)
print("Found " + str(results.count) + " papers for " + gene)
results.ids
|> take(5)
|> each(|pmid|
let text = pubmed_abstract(pmid)
print("PMID " + pmid + ":")
print(text)
print("---")
)
Enrichment Analysis
Over-Representation Analysis (ORA)
Test whether your gene list is enriched for specific biological pathways:
# Load gene sets (GMT format from MSigDB, GO, KEGG, etc.)
let gene_sets = read_gmt("h.all.v2024.1.Hs.symbols.gmt")
# Your differentially expressed genes
let de_genes = ["BRCA1", "TP53", "CDK2", "CCND1", "RB1", "E2F1", "PCNA", "MCM2"]
# Run ORA with background size (total genes measured)
let results = enrich(de_genes, gene_sets, 20000)
# Filter significant results
results
|> filter(|r| r.fdr < 0.05)
|> each(|r| print(r.term + ": p=" + str(r.p_value) + " FDR=" + str(r.fdr)))
The enrich() function uses the hypergeometric test with Benjamini-Hochberg FDR correction.
Output columns: term, overlap, p_value, fdr, genes
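The underlying arithmetic is standard: with an overlap of k genes between an n-gene query and a K-gene set drawn from a background of N, the hypergeometric upper tail gives the p-value, and BH adjusts across sets. A pure-Python sketch of that math (BioLang's exact numerics may differ):

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): probability of at least k overlap genes by chance."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values, preserving input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end  # 1-based rank of this p-value
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj

# 5 of 8 query genes land in a 200-gene set against a 20,000-gene background
p = hypergeom_sf(k=5, N=20000, K=200, n=8)
print(p)
```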
Gene Set Enrichment Analysis (GSEA)
For ranked gene lists (e.g., by fold change or t-statistic):
# Prepare ranked table with gene and score columns
let ranked = read_tsv("de_results.tsv")
|> select(["gene", "log2fc"])
|> rename({log2fc: "score"})
|> sort_by(|r| r.score, desc: true)
# Load gene sets
let gene_sets = read_gmt("c2.cp.kegg.v2024.1.Hs.symbols.gmt")
# Run GSEA (1000 permutations)
let results = gsea(ranked, gene_sets)
# Top enriched pathways
results
|> filter(|r| r.fdr < 0.25)
|> each(|r| print(r.term + ": NES=" + str(r.nes) + " FDR=" + str(r.fdr)))
Output columns: term, es (enrichment score), nes (normalized ES), p_value, fdr, leading_edge
GMT File Format
The GMT (Gene Matrix Transposed) format is tab-delimited:
PATHWAY_NAME\tDescription\tGENE1\tGENE2\tGENE3\t...
Standard sources:
- MSigDB (Broad Institute) — Hallmark, KEGG, Reactome, GO sets
- Enrichr — curated gene set libraries
- WikiPathways — community-curated pathways
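Parsing GMT reduces to one split per row: name, description, then genes. A Python equivalent of what `read_gmt` does (exact blank-field handling is an assumption):

```python
def read_gmt(lines):
    """Parse GMT lines into {set_name: [genes]}; column 2 (description) is dropped."""
    sets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        name, _description, genes = fields[0], fields[1], fields[2:]
        sets[name] = [g for g in genes if g]  # skip trailing empty fields
    return sets

gmt = [
    "HALLMARK_APOPTOSIS\thttp://example\tTP53\tBAX\tCASP3",
    "HALLMARK_E2F_TARGETS\tna\tE2F1\tMCM2\tPCNA\tCDK2",
]
sets = read_gmt(gmt)
print(sets["HALLMARK_APOPTOSIS"])
# → ['TP53', 'BAX', 'CASP3']
```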
Real-World Example: RNA-seq Enrichment Pipeline
# 1. Read differential expression results
let de = read_tsv("deseq2_results.tsv")
# 2. Get significant genes
let sig_genes = de
|> filter(|r| r.padj < 0.05 and abs(r.log2FoldChange) > 1)
|> map(|r| r.gene)
|> collect()
print("Significant DE genes: " + str(len(sig_genes)))
# 3. Load pathway gene sets
let hallmark = read_gmt("h.all.v2024.1.Hs.symbols.gmt")
let kegg = read_gmt("c2.cp.kegg.v2024.1.Hs.symbols.gmt")
# 4. Run ORA on both
let h_results = enrich(sig_genes, hallmark, 20000)
let k_results = enrich(sig_genes, kegg, 20000)
# 5. Report significant pathways
print("\n=== Hallmark Pathways ===")
h_results |> filter(|r| r.fdr < 0.05) |> each(|r|
print(r.term + " (overlap=" + str(r.overlap) + ", FDR=" + str(r.fdr) + ")")
)
print("\n=== KEGG Pathways ===")
k_results |> filter(|r| r.fdr < 0.05) |> each(|r|
print(r.term + " (overlap=" + str(r.overlap) + ", FDR=" + str(r.fdr) + ")")
)
Builtin Reference
PDB
| Function | Args | Returns | Description |
|---|---|---|---|
| `pdb_entry(id)` | Str | Record | Fetch PDB entry metadata |
| `pdb_search(query)` | Str | List[Str] | Full-text search, returns PDB IDs |
| `pdb_entity(id, entity_id)` | Str, Int | Record | Get specific polymer entity |
| `pdb_sequence(id, entity_id)` | Str, Int | Protein | Get entity sequence as Protein value |
| `pdb_chains(id)` | Str | List[Record] | Get all polymer entities for a structure |
PubMed
| Function | Args | Returns | Description |
|---|---|---|---|
| `pubmed_search(query, [max])` | Str, [Int] | Record{count, ids} | Search PubMed articles |
| `pubmed_abstract(pmid)` | Str | Str | Fetch abstract text |
Enrichment
| Function | Args | Returns | Description |
|---|---|---|---|
| `read_gmt(path)` | Str | Map{name -> List[Str]} | Parse GMT gene set file |
| `enrich(genes, sets, bg)` | List, Map, Int | Table | ORA with hypergeometric test + BH FDR |
| `ora(genes, sets, bg)` | List, Map, Int | Table | Alias for `enrich()` |
| `gsea(ranked, sets)` | Table, Map | Table | GSEA with permutation test + BH FDR |
Chapter 17: Literate Notebooks
BioLang notebooks (.bln files) combine Markdown prose with executable code blocks.
They’re ideal for documenting analyses, creating reproducible reports, and sharing
methods with collaborators.
Running a Notebook
bl notebook analysis.bln
Prose sections are rendered with ANSI formatting in the terminal. Code blocks are executed in order, with output interleaved between prose.
The .bln Format
A .bln file is plain text. Everything outside a code block is Markdown prose.
Code blocks can use either fenced syntax or dash delimiters.
Fenced Code Blocks
Use standard Markdown fences with `biolang`, `bl`, or no language tag:
## Load Data
Read the FASTQ file and compute basic stats.
```biolang
let reads = read_fastq("sample.fq")
print(f"Loaded {len(reads)} reads")
```
## Filter
Keep high-quality reads.
```bl
let filtered = reads |> filter(|r| mean_phred(r.quality) > 25)
print(f"Kept {len(filtered)} reads")
```
Dash Delimiters
The original .bln syntax uses --- on its own line:
## Load Data
---
let reads = read_fastq("sample.fq")
print(f"Loaded {len(reads)} reads")
---
## Results
The output above shows the read count.
Both styles can be mixed in the same notebook. Fenced blocks with other language
tags (e.g. ```python) are treated as prose and not executed.
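A minimal splitter for the two delimiter styles might look like the Python sketch below (BioLang's actual parser surely handles more edge cases, such as nested fences):

```python
def split_bln(text):
    """Split .bln text into ("prose"|"code", content) segments.

    Recognizes ``` fences tagged biolang/bl/untagged, and bare --- delimiters.
    """
    segments, buf, mode, tag = [], [], "prose", ""
    def flush(kind):
        if buf:
            segments.append((kind, "\n".join(buf)))
            buf.clear()
    for line in text.splitlines():
        s = line.strip()
        if mode == "prose" and (s == "---" or s.startswith("```")):
            tag = "" if s == "---" else s.lstrip("`")
            flush("prose")
            mode = "code"
        elif mode == "code" and s in ("---", "```"):
            # Blocks fenced with other language tags count as prose
            flush("code" if tag in ("", "biolang", "bl") else "prose")
            mode = "prose"
        else:
            buf.append(line)
    flush(mode)
    return segments

doc = "## Load\nprose here\n```biolang\nlet x = 1\n```\nmore prose\n---\nlet y = 2\n---\n"
kinds = [k for k, _ in split_bln(doc)]
print(kinds)
# → ['prose', 'code', 'prose', 'code']
```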
State Carries Over
Variables defined in one code block are available in all later blocks. This lets you build up an analysis step by step, explaining each part in prose.
Cell Directives
Add special comment annotations at the top of a code block to control behavior:
| Directive | Effect |
|---|---|
| `# @hide` | Execute but don’t display the code |
| `# @skip` | Don’t execute this block |
| `# @echo` | Print the code before executing |
| `# @hide-output` | Execute (and show code) but suppress printed output |
Directives are stripped from the code before execution. Multiple directives can be combined:
```biolang
# @hide
let config = {min_quality: 30, reference: "GRCh38"}
```
```biolang
# @echo
let reads = read_fastq("sample.fq")
|> filter(|r| mean_phred(r.quality) >= config.min_quality)
print(f"Filtered: {len(reads)} reads")
```
With # @hide, the config block runs silently. With # @echo, readers see both
the code and its output.
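Directive handling is simple to model: peel `# @...` annotations off the top of the cell, then hand the remainder to the evaluator. A Python sketch of that stripping step (not BioLang's actual implementation):

```python
DIRECTIVES = {"@hide", "@skip", "@echo", "@hide-output"}

def parse_directives(code):
    """Pull `# @...` annotations off the top of a cell; return (flags, stripped code)."""
    flags, body = set(), code.splitlines()
    while body:
        s = body[0].strip()
        if s.startswith("# @") and s[2:] in DIRECTIVES:
            flags.add(s[2:])
            body.pop(0)
        else:
            break  # first non-directive line ends the directive header
    return flags, "\n".join(body)

flags, code = parse_directives("# @hide\n# @skip\nlet config = 1")
print(flags, "->", code)
```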
HTML Export
Generate a self-contained HTML report:
bl notebook analysis.bln --export html > report.html
The HTML output includes:
- Inline CSS with a dark theme
- BioLang syntax highlighting (keywords, strings, comments, pipes)
- Rendered Markdown prose
- Code output blocks
- No external dependencies – a single standalone file
This is useful for sharing results via email or publishing to a website.
Jupyter Interop
Import from Jupyter
Convert an existing .ipynb notebook to .bln:
bl notebook experiment.ipynb --from-ipynb > experiment.bln
- Markdown cells become prose sections
- Code cells become fenced `biolang` code blocks
- Outputs are discarded (they’ll be regenerated)
Export to Jupyter
Convert a .bln to .ipynb:
bl notebook analysis.bln --to-ipynb > analysis.ipynb
- Prose becomes Markdown cells
- Code blocks become code cells with `"language": "biolang"` metadata
- Uses nbformat v4, compatible with JupyterLab and VS Code
Example: GC Content Report
Here’s a complete notebook that analyzes GC content:
# GC Content Analysis
This notebook computes per-contig GC content and flags outliers.
## Setup
```biolang
# @hide
let threshold = 2.0
```
## Load Data
```biolang
let seqs = read_fasta("contigs.fa")
print(f"Loaded {len(seqs)} sequences")
```
## Statistics
```biolang
# @echo
let gc_values = seqs |> map(|s| gc_content(s.seq))
let mu = mean(gc_values)
let sigma = stdev(gc_values)
print(f"Mean GC: {mu:.3f} +/- {sigma:.4f}")
```
## Outliers
Contigs more than 2 std devs from the mean may indicate
contamination or horizontal gene transfer.
```biolang
let outliers = seqs
|> filter(|s| abs(gc_content(s.seq) - mu) > threshold * sigma)
print(f"Found {len(outliers)} outlier contigs")
```
Run it:
bl notebook gc_analysis.bln
Export it:
bl notebook gc_analysis.bln --export html > gc_report.html
Tips
- Use `# @hide` for setup – configuration, imports, and helper functions that readers don’t need to see
- Use `# @echo` for key steps – shows both code and output for the important analysis logic
- Use `# @skip` during development – temporarily disable expensive cells
- The HTML export is self-contained – share the `.html` file directly
- Convert existing Jupyter notebooks with `--from-ipynb` to migrate analyses
Chapter 15: Plugins
BioLang ships with a rich set of built-in functions, but bioinformatics is vast. You may need to wrap a niche aligner, call an R statistical package, or connect to a lab-specific LIMS API. The plugin system lets you extend BioLang with external tools written in Python, R, TypeScript/Deno, or native binaries, all communicating over a simple JSON protocol on stdin/stdout.
Plugin Architecture
A BioLang plugin is a standalone program that:
- Reads JSON requests from stdin
- Processes the request
- Writes JSON responses to stdout
BioLang manages the plugin lifecycle: it starts the subprocess when the plugin is first called, keeps it alive for subsequent calls, and shuts it down when the script ends. This avoids process startup overhead on repeated invocations.
The Plugin Manifest
Every plugin has a plugin.json file that describes its interface.
{
"name": "blast-wrapper",
"version": "1.0.0",
"description": "BLAST+ sequence search wrapper",
"kind": "python",
"entry": "blast_plugin.py",
"functions": [
{
"name": "blastn",
"description": "Nucleotide BLAST search",
"params": {
"query": {"type": "string", "description": "Path to query FASTA"},
"db": {"type": "string", "description": "BLAST database name or path"},
"evalue": {"type": "number", "default": 1e-5, "description": "E-value threshold"},
"max_hits": {"type": "integer", "default": 10, "description": "Maximum hits to return"},
"threads": {"type": "integer", "default": 4, "description": "CPU threads"}
},
"returns": {
"type": "array",
"items": {
"type": "object",
"properties": {
"subject": "string",
"identity": "number",
"evalue": "number",
"bitscore": "number",
"alignment_length": "integer"
}
}
}
},
{
"name": "blastp",
"description": "Protein BLAST search",
"params": {
"query": {"type": "string", "description": "Path to query FASTA"},
"db": {"type": "string", "description": "BLAST database name or path"},
"evalue": {"type": "number", "default": 1e-5},
"max_hits": {"type": "integer", "default": 10}
},
"returns": {
"type": "array"
}
},
{
"name": "makeblastdb",
"description": "Create a BLAST database from a FASTA file",
"params": {
"input": {"type": "string", "description": "Input FASTA path"},
"dbtype": {"type": "string", "enum": ["nucl", "prot"], "description": "Sequence type"},
"out": {"type": "string", "description": "Output database prefix"}
},
"returns": {
"type": "object"
}
}
]
}
The kind field tells BioLang how to launch the plugin:
| Kind | Interpreter |
|---|---|
| `python` | `python3` (or `python` on Windows) |
| `r` | `Rscript` |
| `deno` | `deno run` |
| `typescript` | `deno run` (same as `deno`) |
| `native` | Direct executable |
Communication Protocol
BioLang sends JSON messages to the plugin’s stdin and reads JSON responses from stdout. Each message is a single line of JSON terminated by a newline.
Request Format
{"id": "req_001", "function": "blastn", "params": {"query": "input.fa", "db": "nt", "evalue": 1e-10}}
Response Format
Success:
{"id": "req_001", "result": [{"subject": "NM_000546.6", "identity": 99.5, "evalue": 0.0, "bitscore": 1842}]}
Error:
{"id": "req_001", "error": {"code": "BLAST_FAILED", "message": "Database nt not found"}}
Lifecycle Messages
BioLang sends a shutdown message when the script ends:
{"id": "shutdown", "function": "__shutdown__", "params": {}}
The plugin should clean up resources and exit.
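To see the wire protocol end to end, here is a host-side round trip in Python against a throwaway inline echo plugin. The host code stands in for BioLang's runtime; the framing (one JSON object per line, matched by `id`, `__shutdown__` to exit) follows the spec above:

```python
import json
import subprocess
import sys
import textwrap

# Throwaway plugin: echoes its params back as the result.
PLUGIN = textwrap.dedent("""
    import json, sys
    for line in sys.stdin:
        req = json.loads(line)
        if req["function"] == "__shutdown__":
            break
        resp = {"id": req["id"], "result": req["params"]}
        sys.stdout.write(json.dumps(resp) + "\\n")
        sys.stdout.flush()
""")

proc = subprocess.Popen([sys.executable, "-c", PLUGIN],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def call(fn, params, req_id):
    """Send one newline-delimited JSON request; read the one-line response."""
    proc.stdin.write(json.dumps({"id": req_id, "function": fn, "params": params}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

resp = call("blastn", {"query": "input.fa", "evalue": 1e-10}, "req_001")

# Lifecycle: send shutdown; the plugin exits without replying.
proc.stdin.write(json.dumps({"id": "shutdown", "function": "__shutdown__", "params": {}}) + "\n")
proc.stdin.flush()
proc.wait()
print(resp)
```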
Installing Plugins
From a Directory
bl add ./my-plugins/blast-wrapper
This copies (or symlinks) the plugin into ~/.biolang/plugins/blast-wrapper/.
From a Git Repository
bl add https://github.com/lab/biolang-deseq2.git
BioLang clones the repository into the plugins directory.
Removing Plugins
bl remove blast-wrapper
Listing Installed Plugins
From the command line:
bl plugins
From the REPL:
:plugins
Both show the plugin name, version, kind, and available functions.
Using Plugins in Scripts
Once installed, import a plugin by name and call its functions.
import "blast-wrapper"
# Build a custom database
makeblastdb(input: "my_sequences.fa", dbtype: "nucl", out: "custom_db")
# Search against it
let hits = blastn(query: "query.fa", db: "custom_db",
evalue: 1e-20, max_hits: 5)
hits |> each(|h| print(h.subject + " identity=" + str(h.identity)
+ "% evalue=" + str(h.evalue)))
Plugin functions integrate seamlessly with pipes and other BioLang features.
read_fasta("contigs.fa")
|> filter(|r| len(r.seq) > kb(1))
|> write_fasta("long_contigs.fa")
blastn(query: "long_contigs.fa", db: "nt", evalue: 1e-30, threads: 8)
|> filter(|h| h.identity > 95.0)
|> sort_by(|h| -h.bitscore)
|> write_tsv("blast_hits.tsv")
Example: Python BLAST Wrapper Plugin
Here is the complete Python implementation for the blast-wrapper plugin
described above.
Directory Structure
blast-wrapper/
plugin.json # manifest (shown earlier)
blast_plugin.py # implementation
blast_plugin.py
import json
import subprocess
import sys
import csv
from io import StringIO
def blastn(params):
cmd = [
"blastn",
"-query", params["query"],
"-db", params["db"],
"-evalue", str(params.get("evalue", 1e-5)),
"-max_target_seqs", str(params.get("max_hits", 10)),
"-num_threads", str(params.get("threads", 4)),
"-outfmt", "6 sseqid pident evalue bitscore length",
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
return {"error": {"code": "BLAST_FAILED", "message": result.stderr.strip()}}
hits = []
reader = csv.reader(StringIO(result.stdout), delimiter="\t")
for row in reader:
hits.append({
"subject": row[0],
"identity": float(row[1]),
"evalue": float(row[2]),
"bitscore": float(row[3]),
"alignment_length": int(row[4]),
})
return {"result": hits}
def blastp(params):
cmd = [
"blastp",
"-query", params["query"],
"-db", params["db"],
"-evalue", str(params.get("evalue", 1e-5)),
"-max_target_seqs", str(params.get("max_hits", 10)),
"-outfmt", "6 sseqid pident evalue bitscore length",
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
return {"error": {"code": "BLAST_FAILED", "message": result.stderr.strip()}}
hits = []
reader = csv.reader(StringIO(result.stdout), delimiter="\t")
for row in reader:
hits.append({
"subject": row[0],
"identity": float(row[1]),
"evalue": float(row[2]),
"bitscore": float(row[3]),
"alignment_length": int(row[4]),
})
return {"result": hits}
def makeblastdb(params):
cmd = [
"makeblastdb",
"-in", params["input"],
"-dbtype", params["dbtype"],
"-out", params["out"],
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
return {"error": {"code": "MAKEBLASTDB_FAILED", "message": result.stderr.strip()}}
return {"result": {"database": params["out"], "status": "created"}}
DISPATCH = {
"blastn": blastn,
"blastp": blastp,
"makeblastdb": makeblastdb,
}
def main():
for line in sys.stdin:
line = line.strip()
if not line:
continue
request = json.loads(line)
func_name = request["function"]
if func_name == "__shutdown__":
break
handler = DISPATCH.get(func_name)
if handler is None:
response = {"id": request["id"],
"error": {"code": "UNKNOWN_FUNCTION",
"message": f"No function: {func_name}"}}
else:
resp = handler(request["params"])
response = {"id": request["id"], **resp}
sys.stdout.write(json.dumps(response) + "\n")
sys.stdout.flush()
if __name__ == "__main__":
main()
Install and use:
bl add ./blast-wrapper
# In a BioLang script:
import "blast-wrapper"
let hits = blastn(query: "primers.fa", db: "refseq_genomic",
evalue: 1e-5, max_hits: 20)
hits |> filter(|h| h.identity == 100.0)
|> each(|h| print("Perfect match: " + h.subject))
Example: R DESeq2 Plugin for Differential Expression
Directory Structure
deseq2-plugin/
plugin.json
deseq2_plugin.R
plugin.json
{
"name": "deseq2",
"version": "1.0.0",
"description": "DESeq2 differential expression analysis",
"kind": "r",
"entry": "deseq2_plugin.R",
"functions": [
{
"name": "deseq2_de",
"description": "Run DESeq2 differential expression on a count matrix",
"params": {
"counts_file": {"type": "string", "description": "Path to raw counts TSV (genes x samples)"},
"metadata_file": {"type": "string", "description": "Path to sample metadata TSV"},
"design": {"type": "string", "default": "~ condition", "description": "DESeq2 design formula"},
"contrast": {"type": "array", "description": "Contrast vector, e.g. ['condition', 'treated', 'control']"},
"alpha": {"type": "number", "default": 0.05, "description": "Adjusted p-value threshold"},
"lfc_threshold": {"type": "number", "default": 0.0, "description": "Log2 fold change threshold for lfcShrink"}
},
"returns": {
"type": "object",
"properties": {
"results_file": "string",
"n_significant": "integer",
"n_up": "integer",
"n_down": "integer"
}
}
},
{
"name": "deseq2_normalize",
"description": "Return DESeq2-normalized counts",
"params": {
"counts_file": {"type": "string"},
"metadata_file": {"type": "string"},
"design": {"type": "string", "default": "~ condition"}
},
"returns": {
"type": "object",
"properties": {
"normalized_file": "string"
}
}
}
]
}
deseq2_plugin.R
library(jsonlite)
library(DESeq2)
deseq2_de <- function(params) {
counts <- read.delim(params$counts_file, row.names = 1)
metadata <- read.delim(params$metadata_file, row.names = 1)
# Ensure sample order matches
metadata <- metadata[colnames(counts), , drop = FALSE]
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = metadata,
design = as.formula(params$design)
)
dds <- DESeq(dds)
contrast <- params$contrast
res <- results(dds, contrast = contrast, alpha = params$alpha)
if (params$lfc_threshold > 0) {
res <- lfcShrink(dds, contrast = contrast, res = res, type = "ashr")
}
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
out_file <- sub("\\.tsv$", "_deseq2_results.tsv", params$counts_file)
write.table(res_df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
sig <- res_df[!is.na(res_df$padj) & res_df$padj < params$alpha, ]
list(
result = list(
results_file = out_file,
n_significant = nrow(sig),
n_up = sum(sig$log2FoldChange > 0),
n_down = sum(sig$log2FoldChange < 0)
)
)
}
deseq2_normalize <- function(params) {
counts <- read.delim(params$counts_file, row.names = 1)
metadata <- read.delim(params$metadata_file, row.names = 1)
metadata <- metadata[colnames(counts), , drop = FALSE]
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = metadata,
design = as.formula(params$design)
)
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)
out_file <- sub("\\.tsv$", "_normalized.tsv", params$counts_file)
write.table(as.data.frame(norm_counts), out_file, sep = "\t", quote = FALSE)
list(result = list(normalized_file = out_file))
}
dispatch <- list(
deseq2_de = deseq2_de,
deseq2_normalize = deseq2_normalize
)
# Main loop: read JSON from stdin, dispatch, write JSON to stdout
con <- file("stdin", "r")
while (TRUE) {
line <- readLines(con, n = 1)
if (length(line) == 0) break
request <- fromJSON(line)
if (request$`function` == "__shutdown__") break
handler <- dispatch[[request$`function`]]
if (is.null(handler)) {
response <- list(
id = request$id,
error = list(code = "UNKNOWN_FUNCTION",
message = paste("No function:", request$`function`))
)
} else {
resp <- tryCatch(
handler(request$params),
error = function(e) list(error = list(code = "R_ERROR", message = e$message))
)
response <- c(list(id = request$id), resp)
}
cat(toJSON(response, auto_unbox = TRUE), "\n")
flush(stdout())
}
close(con)
Using the DESeq2 Plugin
bl add ./deseq2-plugin
# In a BioLang script:
import "deseq2"
# Run differential expression
let de = deseq2_de(
counts_file: "raw_counts.tsv",
metadata_file: "sample_metadata.tsv",
design: "~ condition",
contrast: ["condition", "treated", "control"],
alpha: 0.05,
lfc_threshold: 1.0
)
print(str(de.n_significant) + " DE genes ("
+ str(de.n_up) + " up, " + str(de.n_down) + " down)")
# Read the results back into BioLang for further analysis
let results = read_tsv(de.results_file)
|> filter(|r| r.padj != nil and r.padj < 0.01)
|> sort_by(|r| r.padj)
# Cross-reference top hits with bio APIs
results |> take(10) |> each(|r| {
let gene_info = ncbi_gene(r.gene)
print(r.gene + " log2FC=" + str(r.log2FoldChange)
+ " padj=" + str(r.padj)
+ " — " + gene_info.name)
})
# Get normalized counts for downstream analysis
let norm = deseq2_normalize(
counts_file: "raw_counts.tsv",
metadata_file: "sample_metadata.tsv"
)
print("Normalized counts: " + norm.normalized_file)
Writing a TypeScript/Deno Plugin
For plugins that connect to REST APIs or need async I/O, Deno is a good choice.
lims-connector/plugin.json
{
"name": "lims-connector",
"version": "1.0.0",
"description": "Connect to lab LIMS for sample metadata",
"kind": "deno",
"entry": "lims_plugin.ts",
"functions": [
{
"name": "lims_samples",
"description": "Fetch sample metadata from LIMS",
"params": {
"project": {"type": "string", "description": "LIMS project ID"},
"status": {"type": "string", "default": "all", "description": "Filter by status"}
},
"returns": {"type": "array"}
},
{
"name": "lims_submit_results",
"description": "Push analysis results back to LIMS",
"params": {
"sample_id": {"type": "string"},
"results_file": {"type": "string"},
"qc_status": {"type": "string", "enum": ["pass", "fail", "warning"]}
},
"returns": {"type": "object"}
}
]
}
lims_plugin.ts
import { readLines } from "https://deno.land/std/io/mod.ts";
const LIMS_URL = Deno.env.get("LIMS_API_URL") || "https://lims.example.com/api";
const LIMS_TOKEN = Deno.env.get("LIMS_API_TOKEN") || "";
interface Request {
id: string;
function: string;
params: Record<string, unknown>;
}
async function limsSamples(params: Record<string, unknown>) {
const project = params.project as string;
const status = (params.status as string) || "all";
const url = `${LIMS_URL}/projects/${project}/samples?status=${status}`;
const resp = await fetch(url, {
headers: { "Authorization": `Bearer ${LIMS_TOKEN}` },
});
if (!resp.ok) {
return { error: { code: "LIMS_ERROR", message: await resp.text() } };
}
const data = await resp.json();
return { result: data.samples };
}
async function limsSubmitResults(params: Record<string, unknown>) {
const sampleId = params.sample_id as string;
const resultsFile = params.results_file as string;
const qcStatus = params.qc_status as string;
const results = await Deno.readTextFile(resultsFile);
const resp = await fetch(`${LIMS_URL}/samples/${sampleId}/results`, {
method: "POST",
headers: {
"Authorization": `Bearer ${LIMS_TOKEN}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ results: JSON.parse(results), qc_status: qcStatus }),
});
if (!resp.ok) {
return { error: { code: "LIMS_ERROR", message: await resp.text() } };
}
return { result: { sample_id: sampleId, status: "submitted" } };
}
const dispatch: Record<string, (p: Record<string, unknown>) => Promise<unknown>> = {
lims_samples: limsSamples,
lims_submit_results: limsSubmitResults,
};
for await (const line of readLines(Deno.stdin)) {
if (!line.trim()) continue;
const request: Request = JSON.parse(line);
if (request.function === "__shutdown__") break;
const handler = dispatch[request.function];
let response;
if (!handler) {
response = {
id: request.id,
error: { code: "UNKNOWN_FUNCTION", message: `No function: ${request.function}` },
};
} else {
const resp = await handler(request.params);
response = { id: request.id, ...resp };
}
console.log(JSON.stringify(response));
}
Using the LIMS Plugin
import "lims-connector"
# Fetch samples from the LIMS
let samples = lims_samples(project: "WGS-2024-042", status: "sequenced")
# Process each sample
let results = samples |> par_map(|sample| {
let bam = bwa_mem("GRCh38.fa", sample.fastq_r1, sample.fastq_r2, threads: 8)
|> samtools_sort(threads: 4)
let stats = samtools_flagstat(bam)
let depth = samtools_depth(bam) |> mean()
let qc = if stats.mapped_pct > 90 and depth > 30 then "pass"
else if stats.mapped_pct > 80 then "warning"
else "fail"
let result = {sample_id: sample.id, mapped_pct: stats.mapped_pct,
mean_depth: depth, bam: bam}
result |> write_json(sample.id + "_results.json")
# Push results back to LIMS
lims_submit_results(sample_id: sample.id,
results_file: sample.id + "_results.json",
qc_status: qc)
result
})
Plugin Discovery
BioLang searches for plugins in this order:
1. `./plugins/` in the current working directory
2. `~/.biolang/plugins/` in the user home directory
3. Paths listed in the `BIOLANG_PLUGIN_PATH` environment variable
Each directory is scanned for subdirectories containing a plugin.json file.
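Discovery therefore amounts to scanning a few root directories for subdirectories with a `plugin.json`. A Python model of that scan (earlier roots winning on name clashes is an assumption):

```python
import json
import tempfile
from pathlib import Path

def discover_plugins(roots):
    """Return {plugin_name: manifest_path}; earlier roots win on name clashes."""
    found = {}
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue
        for sub in sorted(root.iterdir()):
            manifest = sub / "plugin.json"
            if sub.is_dir() and manifest.is_file():
                name = json.loads(manifest.read_text())["name"]
                found.setdefault(name, manifest)  # keep first occurrence
    return found

# Demo against a temporary plugin tree
with tempfile.TemporaryDirectory() as tmp:
    plugdir = Path(tmp) / "plugins" / "blast-wrapper"
    plugdir.mkdir(parents=True)
    (plugdir / "plugin.json").write_text('{"name": "blast-wrapper", "kind": "python"}')
    plugins = discover_plugins([Path(tmp) / "plugins", Path(tmp) / "missing"])
    print(sorted(plugins))
```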
Plugin Best Practices
Keep plugins focused. A plugin should wrap one tool or one API. If you have BLAST and HMMER, make them separate plugins.
Use stderr for logging. BioLang captures stdout for the JSON protocol. Diagnostic messages, progress indicators, and debug output should go to stderr.
Handle errors gracefully. Return error responses instead of crashing. The
error response format lets BioLang surface meaningful messages to the user.
Document parameters. The description fields in plugin.json are shown
by :plugins in the REPL and bl plugins on the command line.
Pin versions. If your plugin depends on an external tool, document the required version in the manifest description so users know what to install.
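Putting these practices together, a minimal Python plugin might look like the following sketch. The request and response shapes mirror the dispatch loop shown earlier; the gc_content handler and the HANDLER_ERROR code are hypothetical examples, not part of any fixed BioLang API:

```python
import json
import sys

def gc_content(params):
    """Example handler: GC fraction of a sequence passed in params."""
    seq = params["seq"].upper()
    gc = sum(seq.count(b) for b in "GC")
    return {"result": gc / len(seq) if seq else 0.0}

DISPATCH = {"gc_content": gc_content}

def handle(request):
    """Build a response for one request, never raising to the caller."""
    handler = DISPATCH.get(request["function"])
    if handler is None:
        return {"id": request["id"],
                "error": {"code": "UNKNOWN_FUNCTION",
                          "message": "No function: " + request["function"]}}
    try:
        return {"id": request["id"], **handler(request["params"])}
    except Exception as exc:  # surface failures as error responses, don't crash
        return {"id": request["id"],
                "error": {"code": "HANDLER_ERROR", "message": str(exc)}}

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        if request.get("function") == "__shutdown__":
            break
        # Diagnostics go to stderr; stdout is reserved for the JSON protocol
        print("[plugin] handling " + str(request.get("function")), file=sys.stderr)
        print(json.dumps(handle(request)))
        sys.stdout.flush()

if __name__ == "__main__":
    main()
```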
Summary
The plugin system extends BioLang with any language that can read and write
JSON on stdio. Write a plugin.json manifest, implement the stdin/stdout
dispatch loop, and install with bl add. Plugins integrate fully with pipes,
parallel operations, and the rest of BioLang. Python plugins wrap command-line
bioinformatics tools, R plugins bring statistical packages like DESeq2, and
Deno/TypeScript plugins connect to REST APIs and external services.
Appendix A: Builtin Reference
BioLang ships with a comprehensive standard library of builtins designed for bioinformatics workflows. Every function listed here is available without imports – they are part of the language runtime.
Sequence Operations
Builtins that operate on bio-typed sequences (dna, rna, protein).
| Builtin | Description |
|---|---|
complement(seq) -> Dna \| Rna | Watson-Crick complement of a nucleotide sequence |
reverse_complement(seq) -> Dna \| Rna | Reverse complement – the opposing strand |
transcribe(seq) -> Rna | Transcribe DNA to RNA (T to U) |
translate(seq) -> Protein | Translate an RNA or DNA coding sequence to amino acids |
gc_content(seq) -> Float | GC fraction of a nucleotide sequence (0.0 – 1.0) |
find_motif(seq, pattern) -> List[Int] | All start positions where pattern occurs in seq |
hamming_distance(a, b) -> Int | Number of mismatched positions between equal-length sequences |
edit_distance(a, b) -> Int | Edit distance between two sequences |
find_orfs(seq, min_len?) -> List[Record] | Open reading frames with start, stop, and frame fields |
restriction_sites(seq, enzyme?) -> List[Record] | Recognition sites for restriction enzymes |
tm(seq) -> Float | Melting temperature estimate for a short oligonucleotide |
slice(seq, start, end) -> Dna \| Rna \| Protein | Extract a subsequence by 0-based coordinates |
# Example: quick primer analysis
let primer = dna"ATCGATCGATCG"
let rc = reverse_complement(primer)
let temp = tm(primer)
print("Primer Tm = " + str(temp) + "C, reverse complement = " + str(rc))
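The tm builtin returns an estimate, and this reference does not pin down the model it uses. For intuition, the classic Wallace rule (2 °C per A/T, 4 °C per G/C, reasonable for oligos under roughly 14 nt) can be sketched in Python; this is one common approximation, not necessarily tm's formula:

```python
def wallace_tm(seq):
    """Wallace rule: Tm ~ 2*(A+T) + 4*(G+C) degrees C, for short oligos."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc
```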
Collection Operations
General-purpose operations on lists, records, and sets.
| Builtin | Description |
|---|---|
len(coll) -> Int | Number of elements in a list, string, or sequence |
push(list, item) -> List | Append an element, returning a new list |
pop(list) -> List | Remove the last element, returning a new list |
concat(a, b) -> List | Concatenate two lists |
flatten(nested) -> List | Flatten one level of nesting |
reverse(list) -> List | Reverse element order |
contains(coll, item) -> Bool | True if item is present |
index_of(list, item) -> Int \| None | First index of item, or None |
last(list) -> Any | Last element |
first(list) -> Any | First element |
head(list, n) -> List | First n elements |
tail(list, n) -> List | Last n elements |
unique(list) -> List | Remove duplicates, preserving order |
zip(a, b) -> List | Pair elements from two lists into a list of tuples |
enumerate(list) -> List | Pair each element with its 0-based index |
chunk(list, size) -> List[List] | Split into fixed-size sublists |
window(list, size) -> List[List] | Sliding window of the given size |
scan(list, init, fn) -> List | Running accumulation (like reduce but keeps intermediates) |
range(start, end, step?) -> List | Integer range |
set(list) -> Set | Convert a list to a deduplicated set |
keys(record) -> List[Str] | Field names of a record |
values(record) -> List | Field values of a record |
has_key(record, key) -> Bool | True if the record contains the named field |
sort_by(list, fn) -> List | Sort by a key function |
group_by(list, fn) -> Record | Group elements by a key function into a record of lists |
partition(list, fn) -> [List, List] | Split into elements that pass and fail a predicate |
# Example: enumerate quality-filtered reads
let good_reads = reads
|> filter(|r| mean_phred(r.quality) > 30)
|> enumerate()
|> head(5)
Higher-Order Functions
Functions that accept other functions as arguments – the backbone of BioLang’s pipeline style.
| Builtin | Description |
|---|---|
map(coll, fn) -> List | Apply fn to every element |
filter(coll, fn) -> List | Keep elements where fn returns true |
reduce(coll, init, fn) -> Any | Fold elements into a single value |
sort(coll, fn?) -> List | Sort, optionally by comparator |
each(coll, fn) -> None | Execute fn for side effects on every element |
flat_map(coll, fn) -> List | Map then flatten one level |
take_while(coll, fn) -> List | Take leading elements while predicate holds |
any(coll, fn) -> Bool | True if fn returns true for at least one element |
all(coll, fn) -> Bool | True if fn returns true for every element |
none(coll, fn) -> Bool | True if fn returns true for no elements |
find(coll, fn) -> Any \| None | First element satisfying fn |
find_index(coll, fn) -> Int \| None | Index of first element satisfying fn |
par_map(coll, fn) -> List | Parallel map across available cores |
par_filter(coll, fn) -> List | Parallel filter across available cores |
mat_map(matrix, fn) -> Matrix | Apply fn element-wise to a matrix |
try_call(fn, args) -> Result | Call fn with args, capturing errors instead of panicking |
# Example: parallel GC content across a genome's chromosomes
let gc_values = chromosomes
|> par_map(|chr| {name: chr.name, gc: gc_content(chr.seq)})
|> sort_by(|r| r.gc)
Table Operations
Tabular data manipulation inspired by dataframe semantics – designed for sample sheets, variant tables, and expression matrices.
| Builtin | Description |
|---|---|
table(data) -> Table | Create a table from a list of records |
select(tbl, ...cols) -> Table | Pick columns by name |
mutate(tbl, name, fn) -> Table | Add or transform a column |
summarize(tbl, ...aggs) -> Table | Aggregate columns (sum, mean, etc.) |
group_by(tbl, col) -> GroupedTable | Group rows by a column value (table variant) |
join(a, b, on, how?) -> Table | Join two tables on a key column |
csv(path) -> Table | Read a CSV file into a table |
tsv(path) -> Table | Read a TSV file into a table |
write_csv(tbl, path) -> None | Write a table to CSV |
write_tsv(tbl, path) -> None | Write a table to TSV |
nrow(tbl) -> Int | Number of rows |
ncol(tbl) -> Int | Number of columns |
colnames(tbl) -> List[Str] | Column name list |
row_names(tbl) -> List[Str] | Row name list (if set) |
# Example: summarize variant counts per chromosome
tsv("variants.tsv")
|> group_by("chrom")
|> summarize(count: len, mean_qual: |rows| mean(rows |> map(|r| r.quality)))
|> write_csv("chrom_summary.csv")
Bio File I/O
Read and write standard bioinformatics file formats. Readers return lazy streams that integrate with pipes; writers flush on completion.
| Builtin | Description |
|---|---|
read_fasta(path) -> List[Record] | Parse FASTA; each record has id, desc, seq fields |
read_fastq(path) -> List[Record] | Parse FASTQ; each record has id, seq, quality, length fields |
read_vcf(path) -> List[Record] | Parse VCF; each record has chrom, pos, ref, alt, qual, info fields |
read_bed(path) -> List[Record] | Parse BED; each record has chrom, start, end and optional fields |
read_gff(path) -> List[Record] | Parse GFF/GTF; each record has seqid, type, start, end, attributes |
write_fasta(records, path) -> None | Write records to FASTA format |
write_fastq(records, path) -> None | Write records to FASTQ format |
write_bed(records, path) -> None | Write records to BED format |
# Example: filter FASTQ reads by quality and write survivors
read_fastq("sample_R1.fastq.gz")
|> filter(|r| mean_phred(r.quality) > 30)
|> write_fastq("sample_R1.filtered.fq")
Genomic Intervals
Interval arithmetic for coordinate-based genomic analysis. Intervals carry
chrom, start, end, and optional strand and data fields.
| Builtin | Description |
|---|---|
interval(chrom, start, end, strand?) -> Interval | Create a genomic interval |
interval_tree(intervals) -> IntervalTree | Build an index for fast overlap queries |
query_overlaps(tree, query) -> List[Interval] | All intervals overlapping the query |
query_nearest(tree, query, k?) -> List[Interval] | K nearest intervals to the query |
coverage(intervals) -> List[Record] | Per-base or per-region coverage depth |
merge_intervals(intervals, dist?) -> List[Interval] | Merge overlapping or nearby intervals |
intersect_intervals(a, b) -> List[Interval] | Pairwise intersection of two interval sets |
subtract_intervals(a, b) -> List[Interval] | Regions in a not covered by b |
# Example: find promoter-peak overlaps
let promoters = read_bed("examples/sample-data/promoters.bed") |> map(|r| interval(r.chrom, r.start, r.end))
let peaks = read_bed("examples/sample-data/chip_peaks.bed") |> map(|r| interval(r.chrom, r.start, r.end))
let tree = interval_tree(peaks)
let hits = promoters |> flat_map(|p| query_overlaps(tree, p))
print("Found " + str(len(hits)) + " promoter-peak overlaps")
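merge_intervals follows the standard sort-and-sweep algorithm: sort by start, then fold each interval into the previous one when it starts within the previous end plus dist. As a language-neutral illustration, here is a Python sketch assuming BED-style half-open coordinates (not BioLang's internal code):

```python
def merge_intervals(intervals, dist=0):
    """Merge intervals on the same chromosome that overlap or lie within
    `dist` bases of each other. Intervals are (chrom, start, end) tuples."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + dist:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))  # extend previous
        else:
            merged.append((chrom, start, end))
    return merged
```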
Variants
Builtins for working with genetic variant records. Variant objects carry
chrom, pos, ref, alt, qual, and info fields.
| Builtin | Description |
|---|---|
variant(chrom, pos, ref, alt) -> Variant | Construct a variant record |
is_snp(v) -> Bool | True if single-nucleotide polymorphism |
is_indel(v) -> Bool | True if insertion or deletion |
is_transition(v) -> Bool | True if purine-purine or pyrimidine-pyrimidine substitution |
is_transversion(v) -> Bool | True if purine-pyrimidine substitution |
variant_type(v) -> Str | Classification string: “snp”, “ins”, “del”, “mnv”, “complex” |
is_het(v) -> Bool | True if heterozygous genotype |
is_hom_ref(v) -> Bool | True if homozygous reference |
is_hom_alt(v) -> Bool | True if homozygous alternate |
is_multiallelic(v) -> Bool | True if more than one alt allele |
parse_vcf_info(info_str) -> Record | Parse a VCF INFO field string into a record |
variant_summary(variants) -> Record | Aggregate counts of SNPs, indels, Ti/Tv ratio, het/hom ratio |
# Example: compute Ti/Tv ratio for a VCF
let vars = read_vcf("examples/sample-data/calls.vcf") |> filter(|v| v.qual > 30)
let summary = variant_summary(vars)
print("Ti/Tv = " + str(summary.ti_tv_ratio) + ", SNPs = " + str(summary.snp_count))
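The transition/transversion tests in the table reduce to a purine/pyrimidine membership check: A and G are purines, C and T are pyrimidines. A Python sketch of that classification (the helper names are hypothetical, chosen to mirror the builtins):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transition: purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T)."""
    if ref == alt:
        return False
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def is_transversion(ref, alt):
    """Transversion: any purine <-> pyrimidine substitution."""
    return ref != alt and not is_transition(ref, alt)
```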
Statistics
Statistical functions for quality control, expression analysis, and hypothesis testing.
| Builtin | Description |
|---|---|
mean(xs) -> Float | Arithmetic mean |
median(xs) -> Float | Median value |
stdev(xs) -> Float | Sample standard deviation |
variance(xs) -> Float | Sample variance |
quantile(xs, q) -> Float | Quantile at fraction q (0.0 – 1.0) |
min(xs) -> Num | Minimum value |
max(xs) -> Num | Maximum value |
sum(xs) -> Num | Sum of all elements |
cor(xs, ys) -> Float | Pearson correlation coefficient |
ttest(xs, ys) -> Record | Two-sample t-test; returns {statistic, pvalue} |
chi_square(observed, expected) -> Record | Chi-squared test; returns {statistic, pvalue, df} |
wilcoxon(xs, ys) -> Record | Wilcoxon rank-sum test |
anova(groups) -> Record | One-way ANOVA across groups |
fisher_exact(table) -> Record | Fisher’s exact test on a 2x2 contingency table |
p_adjust(pvals, method?) -> List[Float] | Multiple testing correction (default: Benjamini-Hochberg) |
lm(xs, ys) -> Record | Simple linear regression; returns {slope, intercept, r_squared} |
ks_test(xs, ys) -> Record | Kolmogorov-Smirnov test |
mean_phred(quals) -> Float | Mean Phred quality score from a quality string |
# Example: differential expression significance
let control = [5.2, 4.8, 5.1, 4.9]
let treatment = [8.1, 7.5, 8.3, 7.9]
let result = ttest(control, treatment)
print("p-value = " + str(result.pvalue))
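p_adjust defaults to Benjamini-Hochberg. The procedure sorts the p-values, scales each by n divided by its rank, then enforces monotonicity from the largest rank downward. A Python sketch of the procedure (an illustration of the method, not BioLang's internal code):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment (the p_adjust default method)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_index in range(n - 1, -1, -1):  # walk from largest p downward
        i = order[rank_index]
        value = pvals[i] * n / (rank_index + 1)
        running_min = min(running_min, value)  # keep adjusted values monotone
        adjusted[i] = running_min
    return adjusted
```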
Linear Algebra
Matrix operations for expression matrices, PCA, distance calculations, and numerical biology.
| Builtin | Description |
|---|---|
matrix(data) -> Matrix | Create a matrix from a list of lists (row-major) |
transpose(m) -> Matrix | Transpose rows and columns |
mat_mul(a, b) -> Matrix | Matrix multiplication |
determinant(m) -> Float | Determinant of a square matrix |
inverse(m) -> Matrix | Matrix inverse |
eigenvalues(m) -> List[Float] | Eigenvalues of a square matrix |
svd(m) -> Record | Singular value decomposition; returns {u, s, vt} |
solve(a, b) -> Matrix | Solve the linear system Ax = b |
trace(m) -> Float | Sum of diagonal elements |
norm(m, p?) -> Float | Matrix or vector norm (default: Frobenius / L2) |
rank(m) -> Int | Numerical rank |
identity(n) -> Matrix | n x n identity matrix |
zeros(rows, cols) -> Matrix | Matrix of zeros |
ones(rows, cols) -> Matrix | Matrix of ones |
diag(values) -> Matrix | Diagonal matrix from a list of values |
mat_map(m, fn) -> Matrix | Apply fn element-wise |
# Example: PCA on a gene expression matrix
let expr = tsv("examples/sample-data/counts.tsv") |> table()
let m = matrix(expr |> select("gene_a", "gene_b", "gene_c"))
let decomp = svd(m)
print("Top 3 singular values: " + str(head(decomp.s, 3)))
Math
Standard mathematical functions available for scoring, normalization, and modeling.
| Builtin | Description |
|---|---|
abs(x) -> Num | Absolute value |
ceil(x) -> Int | Round up to nearest integer |
floor(x) -> Int | Round down to nearest integer |
round(x, digits?) -> Float | Round to digits decimal places (default: 0) |
sqrt(x) -> Float | Square root |
log(x) -> Float | Natural logarithm |
log2(x) -> Float | Base-2 logarithm (common in fold-change analysis) |
log10(x) -> Float | Base-10 logarithm |
exp(x) -> Float | Euler’s number raised to x |
pow(base, exp) -> Float | Exponentiation |
sin(x) -> Float | Sine |
cos(x) -> Float | Cosine |
tan(x) -> Float | Tangent |
ode_solve(fn, y0, t_span, dt?) -> List[Record] | Numerical ODE integration (Runge-Kutta) |
# Example: log2 fold-change between conditions
let control = 12.5
let treatment = 50.0
let lfc = log2(treatment / control)
print("Log2 fold-change = " + str(lfc))
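ode_solve integrates numerically with a Runge-Kutta scheme. For intuition, the classic fourth-order method for a single variable can be sketched in Python; this illustrates the technique, and BioLang's internals may differ:

```python
def rk4_solve(f, y0, t0, t1, dt):
    """Integrate dy/dt = f(t, y) from t0 to t1 with classic 4th-order Runge-Kutta."""
    t, y = t0, y0
    points = [(t, y)]
    while t < t1 - 1e-12:
        h = min(dt, t1 - t)  # shrink the final step to land exactly on t1
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
        points.append((t, y))
    return points
```

The (t, y) pairs returned here correspond to the list of records ode_solve produces.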
String Operations
Text manipulation for parsing identifiers, annotations, and formatted output.
| Builtin | Description |
|---|---|
split(s, delim) -> List[Str] | Split string on delimiter |
join(list, delim) -> Str | Join list elements into a string |
trim(s) -> Str | Strip leading and trailing whitespace |
upper(s) -> Str | Convert to uppercase |
lower(s) -> Str | Convert to lowercase |
starts_with(s, prefix) -> Bool | True if s begins with prefix |
ends_with(s, suffix) -> Bool | True if s ends with suffix |
replace(s, from, to) -> Str | Replace all occurrences |
matches(s, pattern) -> Bool | True if regex pattern matches |
format(template, ...args) -> Str | Printf-style formatting |
BioLang also supports f-strings for inline interpolation as an alternative to format.
# Example: parse a FASTA header
let header = ">sp|P12345|MYG_HUMAN Myoglobin OS=Homo sapiens"
let parts = split(header, "|")
let acc = parts[1]
print("Accession: " + acc)
Type Operations
Runtime type inspection and conversion – useful for dynamic dispatch in pipelines that handle mixed bio types.
| Builtin | Description |
|---|---|
type(val) -> Str | Runtime type name as a string |
is_dna(val) -> Bool | True if val is a DNA sequence |
is_rna(val) -> Bool | True if val is an RNA sequence |
is_protein(val) -> Bool | True if val is a protein sequence |
is_interval(val) -> Bool | True if val is a genomic interval |
is_variant(val) -> Bool | True if val is a variant record |
is_record(val) -> Bool | True if val is a record |
is_list(val) -> Bool | True if val is a list |
is_table(val) -> Bool | True if val is a table |
is_nil(val) -> Bool | True if val is nil |
is_int(val) -> Bool | True if val is an integer |
is_float(val) -> Bool | True if val is a float |
is_str(val) -> Bool | True if val is a string |
is_bool(val) -> Bool | True if val is a boolean |
into(val, target_type) -> Any | Convert between compatible types |
# Example: route processing based on sequence type
let rec = read_fasta("input.fa") |> first()
let seq = rec.seq
if is_dna(seq) then
print("DNA, GC = " + str(gc_content(seq)))
else if is_protein(seq) then
print("Protein, length = " + str(len(seq)))
Bio APIs
Remote database queries for annotation enrichment. All API builtins are async-aware and return structured records.
| Builtin | Description |
|---|---|
ncbi_search(db, query, max?) -> List[Record] | Search NCBI Entrez databases |
ncbi_gene(gene_id) -> Record | Fetch NCBI Gene record |
ncbi_sequence(acc) -> Record | Fetch sequence by accession from NCBI Nucleotide/Protein |
ensembl_gene(ensembl_id) -> Record | Ensembl gene lookup by Ensembl ID |
ensembl_symbol(species, symbol) -> Record | Ensembl gene lookup by species and symbol |
ensembl_vep(variants) -> List[Record] | Variant Effect Predictor annotation |
uniprot_search(query, max?) -> List[Record] | Search UniProt by keyword or accession |
uniprot_entry(acc) -> Record | Full UniProt entry |
ucsc_sequence(genome, chrom, start, end) -> Dna | Fetch genomic sequence from UCSC DAS |
kegg_get(entry) -> Record | Retrieve a KEGG database entry |
kegg_find(db, query) -> List[Record] | Search KEGG databases |
string_network(proteins, species?) -> Record | STRING protein-protein interaction network |
pdb_entry(pdb_id) -> Record | Fetch PDB structure metadata |
reactome_pathways(gene) -> List[Record] | Reactome pathway memberships for a gene |
go_term(go_id) -> Record | Gene Ontology term details |
go_annotations(gene, species?) -> List[Record] | GO annotations for a gene |
cosmic_gene(symbol) -> Record | COSMIC cancer gene census entry |
datasets_gene(symbol, taxon?) -> Record | NCBI Datasets gene data |
biomart_query(dataset, attributes, filters?) -> Table | BioMart query returning a table |
nfcore_list(sort_by?, limit?) -> List[Record] | List nf-core pipelines |
nfcore_search(query, limit?) -> List[Record] | Search nf-core pipelines by name/topic |
nfcore_info(name) -> Record | Detailed nf-core pipeline metadata |
nfcore_releases(name) -> List[Record] | Release history for an nf-core pipeline |
nfcore_params(name) -> Record | Parameter schema for an nf-core pipeline |
biocontainers_search(query, limit?) -> List[Record] | Search BioContainers registry |
biocontainers_popular(limit?) -> List[Record] | Popular BioContainers tools |
biocontainers_info(tool) -> Record | Detailed tool info with versions |
biocontainers_versions(tool) -> List[Record] | All versions with container image URIs |
galaxy_search(query, limit?) -> List[Record] | Search Galaxy ToolShed repositories |
galaxy_popular(limit?) -> List[Record] | Popular Galaxy ToolShed tools |
galaxy_categories() -> List[Record] | Galaxy ToolShed tool categories |
galaxy_tool(owner, name) -> Record | Galaxy ToolShed repository details |
nf_parse(path) -> Record | Parse a Nextflow .nf file into a structured Record |
nf_to_bl(record) -> Str | Generate BioLang pipeline code from parsed Nextflow |
galaxy_to_bl(record) -> Str | Generate BioLang pipeline code from Galaxy workflow |
api_endpoints() -> Record | Show current API endpoint URLs |
# Example: annotate a gene list with pathway data
let genes = ["BRCA1", "TP53", "EGFR"]
genes |> each(|g| {
let pathways = reactome_pathways(g)
print(g + ": " + str(len(pathways)) + " pathways")
})
Utility
General-purpose helpers for debugging, timing, unit conversion, and serialization.
| Builtin | Description |
|---|---|
print(val) -> None | Print a value followed by a newline |
assert(cond, msg?) -> None | Panic with msg if cond is false |
sleep(ms) -> None | Pause execution for ms milliseconds |
now() -> Float | Current timestamp in seconds since epoch |
elapsed(start) -> Float | Seconds elapsed since start (from now()) |
bp(n) -> Int | Identity; documents that n is in base pairs |
kb(n) -> Int | Convert kilobases to base pairs (n * 1000) |
mb(n) -> Int | Convert megabases to base pairs (n * 1_000_000) |
gb(n) -> Int | Convert gigabases to base pairs (n * 1_000_000_000) |
to_json(val) -> Str | Serialize any value to a JSON string |
from_json(s) -> Any | Parse a JSON string into a BioLang value |
env(name) -> Str \| None | Read an environment variable |
exit(code?) -> Never | Terminate the process with an exit code (default: 0) |
# Example: time a heavy operation
let t0 = now()
let result = read_fasta("genome.fa")
|> flat_map(|r| find_orfs(r.seq, 300))
print("Found " + str(len(result)) + " ORFs in " + str(elapsed(t0)) + "s")
Appendix B: Operator Reference
BioLang operators are listed below from highest precedence (binds tightest) to lowest. Within the same precedence level, operators are left-associative unless noted otherwise.
Precedence Table
1. Access and Call (highest)
| Symbol | Name | Description | Example |
|---|---|---|---|
. | Field access | Access a record field or method | variant.chrom |
() | Call | Invoke a function or builtin | gc_content(seq) |
[] | Index | Index into a list, string, or sequence | reads[0] |
2. Exponentiation
| Symbol | Name | Description | Example |
|---|---|---|---|
** | Power | Raise to a power (right-associative) | 2 ** 10 gives 1024 |
3. Unary
| Symbol | Name | Description | Example |
|---|---|---|---|
- | Negate | Arithmetic negation | -log2(fc) |
! | Logical NOT | Boolean negation | !is_snp(v) |
not | Logical NOT (word) | Same as !, reads more naturally in predicates | not is_indel(v) |
~ | Bitwise NOT | Bitwise complement | ~flags |
4. Multiplicative
| Symbol | Name | Description | Example |
|---|---|---|---|
* | Multiply | Arithmetic multiplication | coverage * depth |
/ | Divide | Arithmetic division | gc_count / len(seq) |
% | Modulo | Remainder after division; useful for reading frame math | pos % 3 |
5. Additive
| Symbol | Name | Description | Example |
|---|---|---|---|
+ | Add | Arithmetic addition; also concatenates strings | start + kb(1) |
- | Subtract | Arithmetic subtraction | end - start |
6. Type Cast
| Symbol | Name | Description | Example |
|---|---|---|---|
as | Cast | Convert between compatible types | qual as Int |
7. Ranges
| Symbol | Name | Description | Example |
|---|---|---|---|
.. | Range (exclusive) | Half-open range; useful for genomic coordinates | 0..len(seq) |
..= | Range (inclusive) | Closed range | 1..=22 for autosomal chromosomes |
8. Bit Shifts
| Symbol | Name | Description | Example |
|---|---|---|---|
<< | Left shift | Shift bits left | 1 << 4 |
>> | Right shift | Shift bits right; useful for SAM flag decoding | flags >> 8 |
9. Bitwise AND
| Symbol | Name | Description | Example |
|---|---|---|---|
& | Bitwise AND | Test individual flag bits in BAM flags | flags & 0x4 to check unmapped |
10. Bitwise XOR
| Symbol | Name | Description | Example |
|---|---|---|---|
^ | Bitwise XOR | Exclusive or | a ^ b |
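The shift and mask operators above are the workhorses of SAM/BAM flag decoding. The bit positions below come from the SAM specification; they are shown as a Python sketch because the masks are identical in any language:

```python
# SAM FLAG bits, per the SAM specification
FLAG_PAIRED = 0x1      # read is paired
FLAG_UNMAPPED = 0x4    # read is unmapped
FLAG_REVERSE = 0x10    # read maps to the reverse strand
FLAG_SECONDARY = 0x100 # secondary alignment (equals 1 << 8)

def decode_flags(flags):
    """Decode a SAM FLAG integer into named booleans using bitwise AND."""
    return {
        "paired": bool(flags & FLAG_PAIRED),
        "unmapped": bool(flags & FLAG_UNMAPPED),
        "reverse_strand": bool(flags & FLAG_REVERSE),
        "secondary": bool(flags & (1 << 8)),
    }
```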
11. Comparison
| Symbol | Name | Description | Example |
|---|---|---|---|
== | Equal | Structural equality | variant_type(v) == "snp" |
!= | Not equal | Structural inequality | v.chrom != "chrM" |
< | Less than | Numeric or lexicographic comparison | v.qual < 30 |
> | Greater than | Numeric or lexicographic comparison | gc_content(seq) > 0.6 |
<= | Less or equal | Numeric or lexicographic comparison | len(seq) <= kb(1) |
>= | Greater or equal | Numeric or lexicographic comparison | depth >= 10 |
~ | Regex match | True if the left string matches the right pattern | header ~ "^>chr[0-9]+" |
12. Logical AND
| Symbol | Name | Description | Example |
|---|---|---|---|
&& | AND | Short-circuiting logical and | is_snp(v) && v.qual > 30 |
and | AND (word) | Same as &&, reads naturally in filters | is_snp(v) and v.qual > 30 |
13. Logical OR
| Symbol | Name | Description | Example |
|---|---|---|---|
\|\| | OR | Short-circuiting logical or | is_het(v) \|\| is_hom_alt(v) |
or | OR (word) | Same as \|\| | is_het(v) or is_hom_alt(v) |
14. Null Coalesce
| Symbol | Name | Description | Example |
|---|---|---|---|
?? | Null coalesce | Use the right-hand value when the left is nil | v.info["AF"] ?? 0.0 |
15. Pipe
| Symbol | Name | Description | Example |
|---|---|---|---|
\|> | Pipe | Pass the left-hand value as the first argument to the right-hand function | reads \|> filter(\|r\| mean_phred(r.quality) > 30) |
\|>> | Tap pipe | Like pipe, but passes the value through unchanged; used for side effects | reads \|>> print \|> len() |
The pipe operators are the idiomatic way to build multi-step analysis pipelines in BioLang:
read_fastq("sample.fq")
|> filter(|r| mean_phred(r.quality) > 25)
|>> |reads| print("After QC: " + str(len(reads)) + " reads")
|> map(|r| {id: r.id, gc: gc_content(r.seq)})
|> write_csv("gc_report.csv")
16. Assignment (lowest)
| Symbol | Name | Description | Example |
|---|---|---|---|
= | Assign | Bind a value to a name | let gc = gc_content(seq) |
+= | Add-assign | Increment in place | count += 1 |
-= | Subtract-assign | Decrement in place | remaining -= 1 |
*= | Multiply-assign | Multiply in place | score *= weight |
/= | Divide-assign | Divide in place | total /= n |
?= | Nil-assign | Assign only if the target is currently nil | cache ?= compute() |
17. Structural (not expression operators)
These symbols appear in specific syntactic positions and do not participate in general expression precedence.
| Symbol | Name | Description | Example |
|---|---|---|---|
=> | Fat arrow | Separates patterns from bodies in match and given arms | match v { "snp" => handle_snp() } |
-> | Arrow | Return type annotation in function signatures | fn gc(seq: Dna) -> Float |
Newlines and Statement Termination
BioLang uses newlines as statement terminators – there are no semicolons. A newline ends the current expression unless it is suppressed.
Newline Suppression Rules
A newline does not terminate a statement when it appears:
- After a binary operator. The expression continues on the next line.

      let total = a +
                  b +
                  c

- After an opening delimiter: (, [, or {. The expression continues until the matching closing delimiter.

      let record = {
          chrom: "chr1",
          start: 1000,
          end: 2000
      }

- After a pipe operator (|> or |>>). The pipeline continues on the next line.

      reads
      |> filter(|r| mean_phred(r.quality) > 30)
      |> map(|r| r.seq)

- After a comma inside argument lists, list literals, and record literals.

      let xs = [
          1,
          2,
          3
      ]

- After keywords that expect a continuation: then, else, do, and, or.
These rules mean that multi-line pipelines and data structures work naturally without any explicit continuation characters.
Blank Lines
Blank lines (lines containing only whitespace) are ignored between statements. Use them freely to organize code into logical sections.
Comments
| Syntax | Name | Description |
|---|---|---|
# | Line comment | Everything from # to end of line is ignored |
## | Doc comment | Attached to the following declaration; extractable by documentation tools |
# This is a regular comment
## Compute GC content for a DNA sequence.
## Returns a float between 0.0 and 1.0.
fn gc(seq: Dna) -> Float
gc_content(seq)
end
Doc comments (##) attach to the immediately following fn, let, or
type declaration and are preserved in the AST for tooling and
auto-generated documentation.