Introduction

BioLang is a domain-specific language built for bioinformatics. It is not a general-purpose language that happens to have biology libraries — every feature, from its type system to its operators, exists because bioinformatics workflows demand it.

Why a DSL?

Bioinformaticians spend most of their time gluing tools together: reading FASTA files, filtering variants, piping data through transformations, querying biological databases, and building reproducible pipelines. General-purpose languages require dozens of imports, boilerplate, and framework knowledge before you can write your first useful script.

BioLang eliminates that friction. A complete variant filtering workflow:

read_vcf("variants.vcf")
    |> filter(|v| v.qual > 30)
    |> filter(|v| is_snp(v))
    |> filter(|v| is_transition(v))
    |> map(|v| {chrom: v.chrom, pos: v.pos, ref: v.ref_allele, alt: v.alt_allele, qual: v.qual})
    |> write_csv("filtered_transitions.csv")

No imports. No boilerplate. No framework. DNA, RNA, proteins, intervals, variants, and quality scores are first-class types — not strings pretending to be biology.

Design Principles

Pipe-first. The |> operator is the backbone of BioLang. Bioinformatics is inherently a series of transformations: raw reads become aligned reads become variant calls become annotated reports. Pipes make this natural:

samples
    |> par_map(|s| read_fastq(s.path))
    |> map(|reads| filter(reads, |r| mean(r.quality) > 20))
    |> map(|reads| gc_content(reads))
    |> summarize(mean_gc: mean)

Biology is built in. DNA, RNA, protein, and quality score literals are part of the syntax. Genomic intervals and variants have dedicated types with domain-specific methods. You do not need to install a package to reverse-complement a sequence:

let seq = dna"ATCGATCGATCG"
let rc = reverse_complement(seq)    # dna"CGATCGATCGAT"
let rna = transcribe(seq)           # rna"AUCGAUCGAUCG"
let protein = translate(rna)        # protein"IDRS"

Tables are native. Bioinformatics lives in tables — sample sheets, count matrices, variant lists, BED files. BioLang tables support column operations, grouping, summarization, and joins without any library:

let counts = csv("gene_counts.csv")
counts
    |> mutate(log_count: |row| log2(row.count + 1))
    |> group_by("sample")
    |> summarize(mean_expr: |g| mean(g.log_count), n_genes: |g| len(g))

Twelve bio databases at your fingertips. NCBI, Ensembl, UniProt, UCSC, KEGG, STRING, PDB, Reactome, Gene Ontology, COSMIC, BioMart, and NCBI Datasets are built-in — no API keys required for most (NCBI key optional for higher rate limits, COSMIC key required):

let gene = ncbi_gene("TP53")
let pathways = reactome_pathways("TP53")
let interactions = string_network("TP53", species: 9606)

Pipelines are a language construct. Not a separate workflow engine — pipelines with stages, parallel execution, and dependency tracking are part of the language itself:

pipeline qc_pipeline {
    stage read_data {
        read_fastq("sample.fq")
    }
    stage filter_reads {
        read_data.result |> filter(|r| mean(r.quality) >= 25)
    }
    stage report {
        let gc = filter_reads.result |> map(|r| gc_content(r.seq))
        {mean_gc: mean(gc), n_reads: len(filter_reads.result)}
    }
}

Streams handle scale. File readers return lazy streams. Process a 100 GB FASTQ without loading it into memory. Parallel maps distribute work across cores:

read_fastq("massive_file.fq")
    |> filter(|r| mean(r.quality) > 25)
    |> par_map(|r| gc_content(r.seq))
    |> summarize(mean_gc: mean, n_reads: len)
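The laziness here is the same mechanism as generators in general-purpose languages: each stage pulls one record at a time, so memory stays constant no matter how large the input is. A conceptual sketch in Python (not BioLang internals):

```python
# Lazy streaming: each stage pulls one item at a time, so memory use is
# constant regardless of the size of the source.
def numbers():
    for i in range(1_000_000):
        yield i

def keep_even(stream):
    # A generator expression: nothing is computed until consumed.
    return (x for x in stream if x % 2 == 0)

# Only a terminal step drives the pipeline; here we stop after 5 items,
# so the remaining ~999,990 source values are never produced.
first_five = []
for x in keep_even(numbers()):
    first_five.append(x)
    if len(first_five) == 5:
        break
print(first_five)  # [0, 2, 4, 6, 8]
```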

Who This Book Is For

This book is for bioinformaticians, computational biologists, and genomics researchers who want to write analysis scripts faster and more clearly. You do not need deep programming experience — if you have used R, Python, or shell scripts for bioinformatics, you will feel at home.

Every example in this book uses real biological data and real analysis tasks. There are no toy examples. By the end, you will be able to:

  • Process sequencing data (FASTA, FASTQ, BAM, VCF, BED, GFF)
  • Query biological databases directly from your scripts
  • Build reproducible analysis pipelines
  • Perform statistical analysis on expression and variant data
  • Scale to large datasets using streams and parallel operations
  • Extend BioLang with plugins in Python, R, or TypeScript

Running the Examples

Install BioLang:

cargo install biolang

Or build from source:

git clone https://github.com/oriclabs/biolang
cd biolang
cargo build --release

Run any example from this book:

bl run example.bl

Or use the interactive REPL:

bl repl

Every code block in this book is valid BioLang. Copy, paste, run.

Sample Data

Many examples in this book reference files like contigs.fa, reads.fq, calls.vcf, and gene_counts.csv. BioLang ships with sample data for all of these in the examples/sample-data/ directory:

examples/sample-data/
  contigs.fa          4 assembled contigs (FASTA)
  reads.fq            8 sequencing reads with quality scores (FASTQ)
  calls.vcf           12 variants across 5 chromosomes (VCF)
  promoters.bed       10 promoter regions (BED)
  chip_peaks.bed      7 ChIP-seq peaks (BED)
  annotations.gff     16 gene features for 6 genes (GFF3)
  samples.csv         6-sample sheet with conditions and batches (CSV)
  gene_counts.csv     15 cancer genes × 6 samples expression matrix (CSV)
  counts.tsv          8 genomic region read counts (TSV)
  data.sam            6 SAM alignment records with headers (SAM)

To run the quickstart script that exercises all sample data:

bl run examples/quickstart.bl

When a book example references a bare filename like read_fasta("contigs.fa"), you can substitute the sample data path:

let sequences = read_fasta("examples/sample-data/contigs.fa")

Chapter 1: Getting Started

BioLang is a pipe-first domain-specific language built for bioinformatics workflows. This chapter walks you through installation, the interactive REPL, running scripts, and writing your first real analysis.

Installation

From crates.io

cargo install biolang

This installs the bl binary, which provides both the REPL and the script runner.

From source

git clone https://github.com/oriclabs/biolang.git
cd biolang
cargo build --release
cp target/release/bl ~/.local/bin/

Verify installation

bl --version

Updating

BioLang has built-in update checking. Run bl version to see the current version and check if a newer release is available:

bl version
# BioLang v0.1.0
# Checking for updates... up to date.

To upgrade to the latest release:

bl upgrade

This downloads the correct binary for your platform from GitHub Releases and replaces the current bl executable.

BioLang also checks for updates automatically in the background when you run bl run or bl repl. If a newer version is available, a one-line notice appears on stderr. This check runs at most once per 24 hours and never blocks startup. Disable it with:

export BIOLANG_NO_UPDATE_CHECK=1

The REPL

Launch the interactive REPL:

bl

You will see the BioLang prompt:

BioLang v0.1.0
Type :help for commands, :quit to exit.
bl>

Try evaluating a bio literal directly:

bl> dna"ATCGATCG" |> gc_content()
0.5

REPL Commands

The REPL supports several meta-commands, all prefixed with :.

:env – Inspect current bindings

bl> let ref_genome = "GRCh38"
bl> let min_mapq = 30
bl> :env
ref_genome : Str = "GRCh38"
min_mapq   : Int = 30

:reset – Clear all bindings

bl> :reset
Environment cleared.

:load and :save – Session persistence

Load a script into the current session, executing every statement:

bl> :load preprocessing.bl
Loaded 42 bindings from preprocessing.bl

Save the current session bindings to a file:

bl> :save my_session.bl
Saved 12 bindings to my_session.bl

:time – Benchmark an expression

bl> :time read_fastq("sample_R1.fastq.gz") |> filter(|r| mean(r.quality) >= 30) |> len()
Result: 1847293
Elapsed: 4.38s

:type – Check the type of an expression

bl> :type dna"ATCG"
DNA
bl> :type {chrom: "chr1", start: 100, end: 200}
Record{chrom: Str, start: Int, end: Int}

:plugins – List available plugins

bl> :plugins
fastq      read_fastq, write_fastq
fasta      read_fasta, write_fasta
sam        read_sam, read_bam
vcf        read_vcf, write_vcf
bed        read_bed, write_bed
table      csv, tsv, write_csv

:profile – Profile an expression

bl> :profile read_fasta("reference.fa") |> filter(|r| seq_len(r.seq) > 1000) |> len()
Total:     2.14s
  read:    1.87s (87.4%)
  filter:  0.26s (12.1%)
  len:     0.01s (0.5%)
Result: 24891

Running Scripts

BioLang scripts use the .bl extension. Run a script with:

bl run gc_analysis.bl

Pass arguments to a script:

bl run qc_report.bl -- --input sample.fastq.gz --min-quality 20

Arguments are available inside the script via the args record:

# qc_report.bl
let input_file = args.input
let min_qual = into(args.min_quality ?? "20", "Int")

let reads = read_fastq(input_file)
  |> filter(|r| mean(r.quality) >= min_qual)

print(f"Passing reads: {len(reads)}")
print(f"Mean quality: {reads |> map(|r| mean(r.quality)) |> mean()}")

Your First Script: FASTA GC Content Analyzer

BioLang includes sample data in examples/sample-data/ — see the Introduction for the full list. The script below uses examples/sample-data/contigs.fa.

Create a file called gc_scan.bl:

# gc_scan.bl
# Read a FASTA file, compute per-sequence GC content, report statistics.

let sequences = read_fasta("examples/sample-data/contigs.fa")

# Compute GC content for each sequence
let gc_table = sequences
  |> map(|seq| {
    name: seq.id,
    length: seq_len(seq.seq),
    gc: gc_content(seq.seq)
  })
  |> table()

# Summary statistics
let gc_vals = gc_table |> select("gc")
let mean_gc = mean(gc_vals)
let std_gc = stdev(gc_vals)
let min_gc = min(gc_vals)
let max_gc = max(gc_vals)
let n_seqs = len(gc_vals)

print(f"Analyzed {n_seqs} sequences")
print(f"GC content: {mean_gc:.3f} (range: {min_gc:.3f} - {max_gc:.3f})")
print(f"Standard deviation: {std_gc:.4f}")

# Flag outlier contigs (GC > 2 std devs from mean)
let outliers = gc_table
  |> filter(|row| abs(row.gc - mean_gc) > 2.0 * std_gc)
  |> sort("gc", descending: true)

print(f"\nOutlier contigs ({len(outliers)}):")
outliers |> each(|row| print(f"  {row.name}: GC={row.gc:.3f}, length={row.length}"))

Run it:

bl run gc_scan.bl

Example output:

Analyzed 847 sequences
GC content: 0.412 (range: 0.198 - 0.687)
Standard deviation: 0.0531

Outlier contigs (12):
  contig_441: GC=0.687, length=3421
  contig_002: GC=0.621, length=15789
  ...

Project Structure

Initialize a BioLang project:

bl init my-rnaseq-pipeline
cd my-rnaseq-pipeline

This creates the following structure:

my-rnaseq-pipeline/
  .biolang/
    config.yaml        # project configuration
    plugins/           # local plugin overrides
  src/
    main.bl            # entry point
  data/                # input data directory
  results/             # output directory

.biolang/config.yaml

name: my-rnaseq-pipeline
version: 0.1.0
entry: src/main.bl

paths:
  data: ./data
  results: ./results
  reference: /shared/references/GRCh38

defaults:
  min_quality: 30
  threads: 8

Access shared project paths in your scripts via an imported module:

# src/main.bl
# Access project paths via import
import "src/paths.bl" as paths

let min_qual = 30

read_fastq(f"{paths.data}/sample_R1.fastq.gz")
  |> filter(|r| mean(r.quality) >= min_qual)
  |> write_fastq(f"{paths.results}/filtered_R1.fastq.gz")

Multi-file projects

Use import to split your pipeline across files:

# src/main.bl
import "src/qc.bl" as qc
import "src/alignment.bl" as align
import "src/variant_calling.bl" as vc

let samples = csv("data/sample_sheet.csv")

samples |> each(|sample| {
  let cleaned = qc.run(sample.fastq_r1, sample.fastq_r2)
  let bam = align.run(cleaned.r1, cleaned.r2, sample.reference)
  vc.run(bam, sample.reference, sample.sample_id)
})

# src/qc.bl
let run = |r1, r2| {
  let filt_r1 = read_fastq(r1) |> filter(|r| mean(r.quality) >= 30) |> write_fastq(f"{r1}.filtered.fq.gz")
  let filt_r2 = read_fastq(r2) |> filter(|r| mean(r.quality) >= 30) |> write_fastq(f"{r2}.filtered.fq.gz")
  {r1: filt_r1, r2: filt_r2}
}

BIOLANG_PATH

The BIOLANG_PATH environment variable controls where BioLang searches for imported modules and plugins. It accepts a colon-separated (or semicolon on Windows) list of directories:

export BIOLANG_PATH="/home/user/biolang-libs:/shared/team-modules"

Resolution order for import "module.bl":

  1. Relative to the importing file
  2. Project .biolang/plugins/ directory
  3. Each directory in BIOLANG_PATH
  4. System-wide library path (~/.biolang/lib/)

This is useful for sharing utility modules across projects:

# This resolves via BIOLANG_PATH if not found locally
import "genomics_utils.bl" as gutils

let kmers = dna"ATCGATCGATCG" |> gutils.kmer_frequencies(k: 3)
print(kmers)

What’s Next

You now have BioLang installed, know how to use the REPL for interactive exploration, and can write and run scripts. In the next chapter, we will explore bio literals – the first-class DNA, RNA, protein, and quality score types that make BioLang unique for bioinformatics work.

Chapter 2: Bio Literals

BioLang has first-class types for biological sequences. These are not strings – they carry domain semantics, validate their contents, and support biological operations directly. This chapter covers DNA, RNA, Protein, and Quality literals, along with the operations that make them powerful.

DNA Literals

A DNA literal is created with the dna prefix:

let seq = dna"ATCGATCGATCG"

DNA literals accept only valid IUPAC nucleotide codes: A, T, C, G, and ambiguity codes N, R, Y, S, W, K, M, B, D, H, V.

# Ambiguous positions are valid
let probe = dna"ATCNNGATCG"

# Case is normalized to uppercase
let lower = dna"atcgatcg"
print(lower)   # => dna"ATCGATCG"
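The ambiguity codes come from the standard IUPAC nucleotide nomenclature, where each letter names a set of possible bases. A Python (not BioLang) reference table:

```python
# Standard IUPAC nucleotide ambiguity codes (defined by IUPAC nomenclature,
# not specific to BioLang): each code names the set of bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def base_matches(code: str, base: str) -> bool:
    """Does a concrete base satisfy an IUPAC ambiguity code?"""
    return base in IUPAC[code]

print(base_matches("N", "T"))  # True  (N matches any base)
print(base_matches("R", "C"))  # False (R = A or G)
```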

Invalid bases cause a compile-time error:

# This is a compile error -- U is not a DNA base
let bad = dna"AUCG"
# Error: invalid DNA base 'U' at position 1

RNA Literals

RNA uses the rna prefix and contains A, U, C, G:

let mrna = rna"AUGCUUAAGGCUAG"

Protein Literals

Protein sequences use standard single-letter amino acid codes:

let p53_fragment = protein"MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPS"

Protein literals validate against the 20 standard amino acids plus X (unknown), * (stop), U (selenocysteine), and O (pyrrolysine):

let with_stop = protein"MVLSPADKTNVK*"

Quality Score Literals

Phred+33 quality scores are first-class values:

let quals = qual"IIIIIHHHGGFFEEDCBA"

Each character maps to a Phred quality score. You can work with them numerically:

let q = qual"IIIIIHHHGG"
print(mean(q))       # => 39.3
print(min(q))        # => 38  (G = 38)
print(max(q))        # => 40  (I = 40)

# Filter reads by mean quality
let reads = read_fastq("sample.fastq.gz")
let hq_reads = reads |> filter(|r| mean(r.quality) >= 30)

Sequence Operations

Complement and Reverse Complement

let forward = dna"ATCGATCG"

let comp = forward |> complement()
print(comp)   # => dna"TAGCTAGC"

let rc = forward |> reverse_complement()
print(rc)     # => dna"CGATCGAT"
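What reverse_complement computes is base-wise complementation followed by reversal. A Python (not BioLang) sketch for unambiguous bases:

```python
# Reverse complement = complement each base, then reverse the sequence.
# Covers only the four unambiguous DNA bases for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATCGATCG"))  # CGATCGAT
print(reverse_complement("GAATTC"))    # GAATTC (EcoRI site is palindromic)
```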

Reverse complement is the workhorse of strand-aware bioinformatics:

# Check if a primer binds to either strand
let template = dna"ATCGATCGAATTCGCTAGC"
let primer = dna"GCTAGC"

let fwd_match = template |> find_motif(primer)
let rev_match = template |> find_motif(primer |> reverse_complement())

print(f"Forward hits: {len(fwd_match)}, Reverse hits: {len(rev_match)}")

Transcription and Translation

let gene = dna"ATGAAAGCTTTTCGATAG"

# Transcribe DNA to RNA (T -> U)
let mrna = gene |> transcribe()
print(mrna)   # => rna"AUGAAAGCUUUUCGAUAG"

# Translate RNA (or DNA) to protein
let protein_seq = gene |> translate()
print(protein_seq)   # => protein"MKAFR*"

# Translate with a different codon table (e.g., mitochondrial)
let mito_protein = gene |> translate(table: 2)
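Mechanically, translation chunks the sequence into codons and maps each through a codon table. A Python (not BioLang) sketch using only the six codons the example gene needs — a real table has all 64:

```python
# Translation: split into triplets, look each codon up in a table.
# Partial codon table for illustration only; the standard table has 64 entries.
CODON_TABLE = {
    "ATG": "M", "AAA": "K", "GCT": "A",
    "TTT": "F", "CGA": "R", "TAG": "*",
}

def translate(dna: str) -> str:
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

print(translate("ATGAAAGCTTTTCGATAG"))  # MKAFR*
```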

GC Content

let contig = dna"ATCGATCGGGCCCATATATGCGCGC"

let gc = contig |> gc_content()
print(f"GC: {gc:.2%}")   # => GC: 60.00%

# Sliding window GC content for detecting isochores
let gc_windows = contig |> window(size: 10, step: 5) |> map(|w| gc_content(w))
print(gc_windows)
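Both operations reduce to counting: GC content is the G+C fraction of the sequence, and windowed GC applies the same fraction over overlapping slices. Sketched in Python (not BioLang):

```python
# GC content = fraction of G and C bases; windowed GC slides a fixed-size
# slice across the sequence with a given step.
def gc_content(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)

def windows(seq: str, size: int, step: int) -> list[str]:
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

contig = "ATCGATCGGGCCCATATATGCGCGC"
print(gc_content(contig))                                         # 0.6
print([round(gc_content(w), 1) for w in windows(contig, 10, 5)])  # [0.6, 0.7, 0.4, 0.6]
```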

Subsequences with slice and len

let genome_fragment = dna"ATCGATCGATCGATCGATCG"

print(seq_len(genome_fragment))           # => 20
print(genome_fragment |> slice(0, 6)) # => dna"ATCGAT"

# Extract a coding region by coordinates
let cds_start = 3
let cds_end = 15
let cds = genome_fragment |> slice(cds_start, cds_end)

Motif Searching

find_motif returns a list of match positions:

let seq = dna"ATCGAATTCGATCGAATTCG"

# Find EcoRI recognition site
let ecori_sites = seq |> find_motif(dna"GAATTC")
print(ecori_sites)   # => [3, 13]

# Regex search on sequences (returns match records)
let matches = seq |> regex_find("GA[AT]TTC")
matches |> each(|m| print(f"Match at {m.start}: {m.text}"))

Example: Find All ORFs in a DNA Sequence

An open reading frame (ORF) starts with ATG and ends at the first in-frame stop codon (TAA, TAG, TGA). This script finds all ORFs in all six reading frames.

# orf_finder.bl
# Find all open reading frames in a FASTA sequence.

let find_orfs = |seq, min_length| {
  let orfs = []
  let codons = seq |> chunk(3)
  let in_orf = false
  let orf_start = 0

  codons |> enumerate() |> each(|i, codon| {
    if !in_orf && codon == dna"ATG" then {
      in_orf = true
      orf_start = i * 3
    }
    if in_orf && (codon == dna"TAA" || codon == dna"TAG" || codon == dna"TGA") then {
      let orf_end = (i + 1) * 3
      let orf_len = orf_end - orf_start
      if orf_len >= min_length then {
        orfs = orfs ++ [{start: orf_start, end: orf_end, length: orf_len, seq: seq |> slice(orf_start, orf_end)}]
      }
      in_orf = false
    }
  })
  orfs
}

let sequences = read_fasta("contigs.fa")

# Search all 6 reading frames
let all_orfs = sequences |> flat_map(|entry| {
  let fwd = entry.seq
  let rev = entry.seq |> reverse_complement()

  0..3 |> flat_map(|frame| {
    let fwd_orfs = fwd |> slice(frame, seq_len(fwd)) |> find_orfs(300)
      |> map(|orf| {
        ...orf,
        seq_id: entry.id,
        strand: "+",
        frame: frame,
        start: orf.start + frame,
        end: orf.end + frame
      })
    let rev_orfs = rev |> slice(frame, seq_len(rev)) |> find_orfs(300)
      |> map(|orf| {
        ...orf,
        seq_id: entry.id,
        strand: "-",
        frame: frame
      })
    fwd_orfs ++ rev_orfs
  })
})

print(f"Found {len(all_orfs)} ORFs (>= 300 bp)")
all_orfs
  |> sort("length", descending: true)
  |> take(20)
  |> each(|orf| print(f"  {orf.seq_id} {orf.strand}:{orf.start}-{orf.end} ({orf.length} bp)"))

Example: Codon Usage Table

Given a coding sequence, calculate the frequency of each codon and display the codon usage bias.

# codon_usage.bl
# Build a codon usage table from a coding sequence.

let cds_records = read_fasta("cds_sequences.fa")

# Aggregate codons across all coding sequences
let codon_counts = cds_records
  |> flat_map(|rec| rec.seq |> chunk(3))
  |> filter(|codon| seq_len(codon) == 3)
  |> group_by(|codon| to_string(codon))
  |> map(|group| {codon: group.key, count: len(group.values)})

let total = codon_counts |> map(|c| c.count) |> sum()

let usage_table = codon_counts
  |> map(|c| {
    ...c,
    amino_acid: into(c.codon, "DNA") |> translate() |> to_string(),
    frequency: c.count / total,
    per_thousand: (c.count / total) * 1000.0
  })
  |> sort("amino_acid")

print(f"Codon usage from {len(cds_records)} sequences ({total} codons)\n")
print("Codon  AA  Count     Freq   /1000")
print("-----  --  --------  -----  -----")
usage_table |> each(|row|
  print(f"{row.codon}    {row.amino_acid}   {row.count:>8}  {row.frequency:.4f}  {row.per_thousand:.1f}")
)

# Identify rare codons (< 10 per thousand)
let rare = usage_table |> filter(|c| c.per_thousand < 10.0)
print(f"\nRare codons ({len(rare)}):")
rare |> each(|c| print(f"  {c.codon} ({c.amino_acid}): {c.per_thousand:.1f}/1000"))

Example: Restriction Enzyme Site Finder

Find all restriction enzyme recognition sites in a sequence and predict fragment sizes for a virtual digest.

# digest.bl
# Virtual restriction digest of a DNA sequence.

# Common restriction enzymes: name -> recognition site
let enzymes = [
  {name: "EcoRI",   site: dna"GAATTC",  cut_offset: 1},
  {name: "BamHI",   site: dna"GGATCC",  cut_offset: 1},
  {name: "HindIII", site: dna"AAGCTT",  cut_offset: 1},
  {name: "NotI",    site: dna"GCGGCCGC", cut_offset: 2},
  {name: "XhoI",    site: dna"CTCGAG",  cut_offset: 1},
  {name: "SalI",    site: dna"GTCGAC",  cut_offset: 1},
  {name: "NcoI",    site: dna"CCATGG",  cut_offset: 1},
  {name: "PstI",    site: dna"CTGCAG",  cut_offset: 5}
]

let entry = read_fasta("plasmid.fa") |> first()
let seq = entry.seq
print(f"Sequence length: {seq_len(seq)} bp\n")

# Find all cut sites for each enzyme
let results = enzymes |> map(|enzyme| {
  let sites = seq |> find_motif(enzyme.site)
  let cut_positions = sites |> map(|pos| pos + enzyme.cut_offset)
  {name: enzyme.name, site: to_string(enzyme.site), n_cuts: len(sites), positions: cut_positions}
})

# Report
results |> each(|r| {
  if r.n_cuts > 0 then {
    print(f"{r.name} ({r.site}): {r.n_cuts} site(s) at {r.positions}")
  } else {
    print(f"{r.name} ({r.site}): no sites")
  }
})

# Virtual double digest with EcoRI + BamHI
let ecori_result = results |> find(|r| r.name == "EcoRI")
let ecori_cuts = ecori_result.positions
let bamhi_result = results |> find(|r| r.name == "BamHI")
let bamhi_cuts = bamhi_result.positions
let all_cuts = (ecori_cuts ++ bamhi_cuts) |> sort() |> unique()

# Calculate fragment sizes (circular plasmid)
let fragments = if len(all_cuts) == 0 then [seq_len(seq)]
  else {
    let sorted_cuts = all_cuts |> sort()
    let frags = sorted_cuts
      |> window(2)
      |> map(|pair| pair[1] - pair[0])
    # Add the wrap-around fragment for circular DNA
    let wrap = seq_len(seq) - last(sorted_cuts) + first(sorted_cuts)
    frags ++ [wrap]
  }

print(f"\nEcoRI + BamHI double digest: {len(fragments)} fragments")
fragments
  |> sort(descending: true)
  |> enumerate()
  |> each(|i, size| print(f"  Fragment {i + 1}: {size} bp"))
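The circular-fragment arithmetic in the script above is worth isolating: fragment sizes are the gaps between consecutive sorted cuts, plus one wrap-around fragment spanning the origin. A standalone sketch in Python (not BioLang):

```python
# Fragment sizes of a circular molecule: differences between consecutive
# sorted cut positions, plus the fragment that wraps around the origin.
def circular_fragments(cuts: list[int], length: int) -> list[int]:
    if not cuts:
        return [length]  # no cuts: the molecule stays intact
    cuts = sorted(cuts)
    frags = [b - a for a, b in zip(cuts, cuts[1:])]
    frags.append(length - cuts[-1] + cuts[0])  # wrap-around fragment
    return frags

print(circular_fragments([100, 400, 900], 1000))  # [300, 500, 200]
```

Note that the fragment sizes always sum to the full molecule length, which is a useful sanity check for a virtual digest.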

Summary

Bio literals are the foundation of BioLang. They carry biological meaning, enforce validity at parse time, and chain naturally with biological operations through pipes. Instead of calling string manipulation functions on raw text, you work directly with typed sequences that know about complements, codons, reading frames, and motifs.

In the next chapter, we will cover the full type system – variables, records, type coercion, and nil handling – which builds on these bio types to represent complex biological data structures.

Chapter 3: Variables and Types

BioLang has a rich type system designed around bioinformatics data. This chapter covers variable bindings, the full set of types, records, type checking and coercion, and nil handling.

let Bindings

Variables are introduced with let and are immutable by default:

let sample_id = "TCGA-A1-A0SP"
let min_depth = 30
let reference = dna"ATCGATCGATCG"

Attempting to rebind a let variable in the same scope is an error:

let threshold = 0.05
let threshold = 0.01   # Error: 'threshold' is already bound in this scope

Reassignment

Use = without let to reassign an existing variable:

let total_reads = 0
total_reads = total_reads + 1847293

let status = "pending"
status = "complete"

Primitive Types

Type   Example               Description
Int    42, -1, 1_000_000     64-bit integer
Float  3.14, 1e-10, 0.05     64-bit float
Str    "hello", f"GC: {gc}"  UTF-8 string
Bool   true, false           Boolean
Nil    nil                   Absence of value

let num_samples = 48
let p_value = 3.2e-8
let experiment = "RNA-seq time course"
let is_paired_end = true
let adapter_seq = nil   # not yet determined

Bio Types

These types carry domain semantics beyond raw data:

Type      Literal                              Description
DNA       dna"ATCG"                            DNA sequence (IUPAC)
RNA       rna"AUGC"                            RNA sequence
Protein   protein"MVLK"                        Amino acid sequence
Quality   qual"IIIHH"                          Phred+33 quality scores
Interval  interval("chr1", 1000, 2000)         Genomic interval
Variant   variant("chr17", 7674220, "C", "T")  Sequence variant

let primer_fwd = dna"TGTAAAACGACGGCCAGT"
let stop_codon = rna"UAG"
let signal_peptide = protein"MKWVTFISLLFLFSSAYS"
let read_quals = qual"IIIIIIIIHHHHHGGGGFFFF"

# Genomic coordinates
let tp53_exon1 = interval("chr17", 7668421, 7669690)
let brca1_snp = variant("chr17", 43094464, "G", "A")

Interval Type

Intervals represent genomic regions with chromosome, start, and end:

let region = interval("chr1", 1000000, 2000000)
print(region.chrom)   # => "chr1"
print(region.start)   # => 1000000
print(region.end)     # => 2000000
print(region.length)  # => 1000000

# Interval arithmetic
let exon1 = interval("chr1", 1000, 1500)
let exon2 = interval("chr1", 1400, 2000)
print(overlaps(exon1, exon2))       # => true
print(intersection(exon1, exon2))   # => interval("chr1", 1400, 1500)
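The arithmetic behind overlaps and intersection is simple max/min comparison. The sketch below is Python, not BioLang, and assumes half-open [start, end) coordinates — consistent with length = end − start shown above:

```python
# Interval overlap and intersection under half-open [start, end) semantics.
# Intervals are (start, end) tuples; same chromosome is assumed.
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def intersection(a, b):
    return (max(a[0], b[0]), min(a[1], b[1])) if overlaps(a, b) else None

exon1, exon2 = (1000, 1500), (1400, 2000)
print(overlaps(exon1, exon2))      # True
print(intersection(exon1, exon2))  # (1400, 1500)
```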

Variant Type

Variants represent sequence changes at specific positions:

let snv = variant("chr7", 55249071, "C", "T")
print(snv.chrom)    # => "chr7"
print(snv.pos)      # => 55249071
print(snv.ref_allele)  # => "C"
print(snv.alt_allele)  # => "T"

# Classify variant type
print(variant_type(snv))   # => "SNV"

let indel = variant("chr7", 55249071, "CT", "C")
print(variant_type(indel)) # => "DEL"
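This classification can be derived from allele lengths alone. A Python (not BioLang) sketch of one plausible rule set — the MNV label for equal-length multi-base changes is an assumption here, not documented BioLang behavior:

```python
# Classify a variant from its ref/alt alleles: equal single-base alleles
# are SNVs, a longer alt is an insertion, a longer ref is a deletion.
def variant_type(ref: str, alt: str) -> str:
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "INS"
    if len(ref) > len(alt):
        return "DEL"
    return "MNV"  # equal-length multi-base substitution (assumed label)

print(variant_type("C", "T"))   # SNV
print(variant_type("CT", "C"))  # DEL
```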

Records

Records are BioLang’s primary structured data type. They map naturally to the tabular, metadata-rich world of bioinformatics:

let sample = {
  sample_id: "S001",
  patient: "P-4821",
  tissue: "tumor",
  reads: 45_000_000,
  mean_coverage: 32.5,
  contamination: 0.02
}

print(sample.sample_id)       # => "S001"
print(sample.mean_coverage)   # => 32.5

String Keys

Record keys can be identifiers or strings:

let annotation = {
  "gene_name": "EGFR",
  "Gene Ontology": "GO:0005524",
  chrom: "chr7"
}

print(annotation."Gene Ontology")   # => "GO:0005524"

Record Spread

The spread operator ... merges one record into another. Later keys override earlier ones:

let defaults = {
  aligner: "bwa-mem2",
  min_mapq: 30,
  mark_duplicates: true,
  reference: "GRCh38"
}

let sample_config = {
  ...defaults,
  sample_id: "S001",
  min_mapq: 20   # override the default
}

print(sample_config.aligner)   # => "bwa-mem2"
print(sample_config.min_mapq)  # => 20

This pattern is essential for bioinformatics pipelines where you have standard parameters with per-sample overrides.
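The same layered-defaults idea exists in most languages; for comparison, here it is with plain Python dicts, where later keys in a merge also win:

```python
# Layered configuration: start from shared defaults, override per sample.
# Later keys win, exactly as with BioLang's record spread.
defaults = {"aligner": "bwa-mem2", "min_mapq": 30, "reference": "GRCh38"}

sample_config = {**defaults, "sample_id": "S001", "min_mapq": 20}

print(sample_config["aligner"])   # bwa-mem2 (inherited default)
print(sample_config["min_mapq"])  # 20 (per-sample override)
```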

Nested Records

Records can nest to arbitrary depth:

let variant_annotation = {
  variant: variant("chr17", 7674220, "C", "T"),
  gene: {
    name: "TP53",
    transcript: "NM_000546.6",
    exon: 7,
    consequence: "missense_variant"
  },
  population: {
    gnomad_af: 0.00003,
    clinvar: "Pathogenic",
    cosmic_count: 4521
  },
  prediction: {
    sift: {score: 0.001, label: "deleterious"},
    polyphen: {score: 0.998, label: "probably_damaging"}
  }
}

print(variant_annotation.gene.consequence)
print(variant_annotation.prediction.sift.score)

Type Checking

BioLang provides introspection functions for runtime type checking:

let seq = dna"ATCG"
print(type(seq))       # => "DNA"
print(is_dna(seq))     # => true
print(is_rna(seq))     # => false

let rec = {name: "BRCA1", chrom: "chr17"}
print(type(rec))       # => "Record"
print(is_record(rec))  # => true
print(is_str(rec))     # => false

Available type check functions:

is_int(val)
is_float(val)
is_str(val)
is_bool(val)
is_nil(val)
is_dna(val)
is_rna(val)
is_protein(val)
is_quality(val)
is_list(val)
is_set(val)
is_record(val)
is_table(val)
is_interval(val)
is_variant(val)

Type Coercion with into

into(value, "TargetType") converts between compatible types:

# String to DNA
let seq = into("ATCGATCG", "DNA")

# DNA to RNA (transcription)
let mrna = into(dna"ATCGATCG", "RNA")

# Int to Float
let ratio = into(42, "Float")

# String to Int
let depth = into("30", "Int")

# List of records to Table
let rows = [{gene: "TP53", pval: 0.001}, {gene: "BRCA1", pval: 0.05}]
let tbl = into(rows, "Table")

# Variant to Interval (single-base or span)
let region = into(variant("chr1", 1000, "A", "T"), "Interval")

Type Aliases

Define aliases to make code self-documenting:

type Locus = Record
type SampleMeta = Record
type QCMetrics = Record

let build_locus = |chrom, start, end, gene| {
  chrom: chrom, start: start, end: end, gene: gene
}

let tp53 = build_locus("chr17", 7668421, 7687490, "TP53")

Nil Handling

Bioinformatics data is full of missing values – absent annotations, failed QC metrics, unmapped reads. BioLang has two operators for nil handling.

?? Null Coalesce

Returns the left side if non-nil, otherwise the right:

let depth = sample.coverage ?? 0
let gene_name = annotation.symbol ?? annotation.id ?? "unknown"

# Useful for setting defaults in pipeline parameters
let min_qual = args.min_quality ?? 30
let threads = args.threads ?? 4

?. Optional Chain

Safely accesses nested fields, returning nil if any intermediate value is nil:

let clinvar_status = variant_record?.annotation?.clinvar?.significance
# Returns nil if annotation, clinvar, or significance is missing

# Combine with ??
let pathogenicity = variant_record?.annotation?.clinvar?.significance ?? "Unknown"

Example: Building a Sample Metadata Registry

# sample_registry.bl
# Parse a sample sheet and build a typed metadata registry.

let defaults = {
  platform: "Illumina",
  chemistry: "NovaSeq 6000",
  read_length: 150,
  paired: true,
  reference: "GRCh38"
}

let raw_sheet = csv("sample_sheet.csv")

let registry = raw_sheet |> map(|row| {
  ...defaults,
  sample_id: row.Sample_ID,
  patient_id: row.Patient ?? "unknown",
  tissue: row.Tissue,
  condition: row.Condition ?? "normal",
  fastq_r1: row.FASTQ_R1,
  fastq_r2: row.FASTQ_R2 ?? nil,
  paired: !is_nil(row.FASTQ_R2),
  library_prep: row.Library ?? "WGS",
  date: row.Date
})

# Validate: every sample must have R1
let invalid = registry |> filter(|s| is_nil(s.fastq_r1))
if len(invalid) > 0 then {
  print(f"ERROR: {len(invalid)} samples missing FASTQ R1:")
  invalid |> each(|s| print(f"  {s.sample_id}"))
  exit(1)
}

# Group by condition for downstream analysis
let by_condition = registry |> group_by(|s| s.condition)
by_condition |> each(|group| {
  print(f"{group.key}: {len(group.values)} samples")
  group.values |> each(|s| print(f"  {s.sample_id} ({s.tissue})"))
})

# Summary
let tumor_count = registry |> filter(|s| s.condition == "tumor") |> len()
let normal_count = registry |> filter(|s| s.condition == "normal") |> len()
print(f"\nRegistry: {len(registry)} samples ({tumor_count} tumor, {normal_count} normal)")

Example: Parsing and Validating Variant Records

# validate_variants.bl
# Read a VCF file, parse variants into typed records, validate, classify.

let raw_variants = read_vcf("calls.vcf.gz")

# Build typed variant records with full annotation
let typed_variants = raw_variants |> map(|v| {
  var: variant(v.chrom, v.pos, v.ref, v.alt),
  qual: into(v.qual ?? "0", "Float"),
  depth: into(v.info?.DP ?? "0", "Int"),
  allele_freq: into(v.info?.AF ?? "0.0", "Float"),
  genotype: v.samples?.GT ?? "./.",
  filter_status: v.filter
})

# Validate variant quality
let validate = |v| {
  let issues = []
  if v.qual < 30.0 then issues = issues ++ ["low_qual"]
  if v.depth < 10 then issues = issues ++ ["low_depth"]
  if v.allele_freq < 0.05 then issues = issues ++ ["low_af"]
  if v.genotype == "./." then issues = issues ++ ["missing_gt"]
  {...v, issues: issues, is_valid: len(issues) == 0}
}

let validated = typed_variants |> map(validate)

# Classify variants
let classify = |v| {
  let vtype = variant_type(v.var)
  let size = if vtype == "SNV" then 1
    else abs(len(v.var.alt_allele) - len(v.var.ref_allele))
  {...v, variant_type: vtype, size: size}
}

let classified = validated |> map(classify)

# Report summary
let total = len(classified)
let valid = classified |> filter(|v| v.is_valid) |> len()
let by_type = classified |> group_by(|v| v.variant_type)

print(f"Total variants: {total}")
print(f"Passing filters: {valid} ({valid / total * 100.0:.1f}%)")
print(f"\nBy type:")
by_type |> each(|g| print(f"  {g.key}: {len(g.values)}"))

# Flag high-impact variants
let high_impact = classified
  |> filter(|v| v.is_valid && v.variant_type != "SNV" && v.size > 50)
  |> sort("size", descending: true)

print(f"\nHigh-impact structural variants (>50bp): {len(high_impact)}")
high_impact |> take(10) |> each(|v| {
  print(f"  {v.var.chrom}:{v.var.pos} {v.variant_type} size={v.size} AF={v.allele_freq:.3f}")
})

Summary

BioLang’s type system combines standard programming types with biological data types and record-based data modeling. Records with spread make it natural to build layered configurations and annotated data structures. Nil handling with ?? and ?. keeps pipelines resilient to the missing data that pervades genomics workflows.

The next chapter covers collections and tables – the workhorses for processing the large tabular datasets that dominate bioinformatics.

Chapter 4: Collections and Tables

Bioinformatics data is inherently tabular and set-oriented. BioLang provides Lists, Sets, Tables, and Ranges as first-class collection types, with rich operations for filtering, grouping, and summarizing biological datasets.

Lists

Lists are ordered, heterogeneous sequences:

let chromosomes = ["chr1", "chr2", "chr3", "chrX", "chrY"]
let read_lengths = [150, 151, 149, 150, 148, 150]
let mixed = ["TP53", 42, true, dna"ATCG"]

Lists support standard access and manipulation:

let genes = ["BRCA1", "TP53", "EGFR", "KRAS", "BRAF"]

print(genes[0])        # => "BRCA1"
print(genes[-1])       # => "BRAF"
print(len(genes))      # => 5

# Concatenation
let more = genes ++ ["PIK3CA", "PTEN"]

# Nested lists (e.g., paired-end read pairs)
let pairs = [
  ["sample1_R1.fq.gz", "sample1_R2.fq.gz"],
  ["sample2_R1.fq.gz", "sample2_R2.fq.gz"]
]
print(pairs[0][1])   # => "sample1_R2.fq.gz"

Sets

Sets are unordered collections of unique elements. They are the right tool for gene lists, sample IDs, and any membership-testing workload:

let panel_genes = set(["BRCA1", "BRCA2", "TP53", "EGFR", "KRAS", "BRAF"])
let mutated_genes = set(["TP53", "KRAS", "APC", "PIK3CA"])

# Set operations
let overlap = intersection(panel_genes, mutated_genes)
print(overlap)   # => set(["TP53", "KRAS"])

let all_genes = union(panel_genes, mutated_genes)
let panel_only = difference(panel_genes, mutated_genes)
let mutated_only = difference(mutated_genes, panel_genes)

print(f"On panel and mutated: {len(overlap)}")
print(f"Mutated, not on panel: {len(mutated_only)}")

# Membership
print(contains(panel_genes, "TP53"))   # => true

A real use case – finding shared variants between tumor and normal:

let tumor_variants = read_vcf("tumor.vcf.gz")
  |> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
  |> set()

let normal_variants = read_vcf("normal.vcf.gz")
  |> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
  |> set()

let somatic = difference(tumor_variants, normal_variants)
let germline = intersection(tumor_variants, normal_variants)
print(f"Somatic: {len(somatic)}, Germline: {len(germline)}")

Tables

Tables are the central data structure for tabular bioinformatics data. They are created from lists of records:

let samples = table([
  {sample_id: "S001", tissue: "tumor",  reads: 42_000_000, coverage: 35.2},
  {sample_id: "S002", tissue: "normal", reads: 38_000_000, coverage: 31.4},
  {sample_id: "S003", tissue: "tumor",  reads: 51_000_000, coverage: 42.1},
  {sample_id: "S004", tissue: "normal", reads: 44_000_000, coverage: 36.8}
])

Column Access

let coverages = samples |> select("coverage")
print(coverages)   # => [35.2, 31.4, 42.1, 36.8]

# Multiple columns
let subset = samples |> select(["sample_id", "coverage"])

Row Iteration

samples |> each(|row| {
  print(f"{row.sample_id}: {row.coverage}x ({row.tissue})")
})

Table Operations

select – Choose Columns

let qc_summary = full_table |> select(["sample_id", "total_reads", "pct_mapped", "mean_coverage"])

mutate – Add or Transform Columns

let enriched = samples |> mutate(
  reads_millions: |row| row.reads / 1_000_000.0,
  is_high_coverage: |row| row.coverage >= 30.0,
  label: |row| f"{row.sample_id}_{row.tissue}"
)

summarize – Aggregate Statistics

let stats = samples |> summarize(
  n_samples: count(),
  total_reads: sum(reads),
  mean_coverage: mean(coverage),
  min_coverage: min(coverage),
  max_coverage: max(coverage)
)

print(f"Samples: {stats.n_samples}, Mean coverage: {stats.mean_coverage:.1f}x")

group_by – Split-Apply-Combine

let by_tissue = samples
  |> group_by("tissue")
  |> summarize(
    n: count(),
    mean_cov: mean(coverage),
    mean_reads: mean(reads)
  )

by_tissue |> each(|row|
  print(f"{row.tissue}: n={row.n}, cov={row.mean_cov:.1f}x, reads={row.mean_reads:.0f}")
)

sort – Order Rows

# Sort by coverage descending
let ranked = samples |> sort("coverage", descending: true)

# Multi-column sort
let ordered = samples |> sort(["tissue", "coverage"], descending: [false, true])

filter – Select Rows by Condition

let high_cov = samples |> filter(|row| row.coverage >= 30.0)
let tumor_only = samples |> filter(|row| row.tissue == "tumor")

# Chained filters
let good_tumor = samples
  |> filter(|row| row.tissue == "tumor")
  |> filter(|row| row.coverage >= 30.0)
  |> filter(|row| row.reads >= 40_000_000)

Joins (Conceptual)

Table joins are planned for a future release. Currently, use map with lookups:

let annotations = table([
  {sample_id: "S001", patient: "P101", stage: "III"},
  {sample_id: "S002", patient: "P101", stage: "III"},
  {sample_id: "S003", patient: "P202", stage: "II"}
])

# Manual left join via lookup
let ann_map = annotations |> group_by("sample_id") |> map(|g| {key: g.key, val: g.values[0]})
let joined = samples |> map(|row| {
  let ann = ann_map |> find(|a| a.key == row.sample_id)
  {...row, patient: ann?.val?.patient ?? "unknown", stage: ann?.val?.stage ?? "unknown"}
})

Ranges

Ranges generate sequences of integers, useful for genomic coordinate windows:

let indices = 0..10          # [0, 1, 2, ..., 9]
let inclusive = 0..=10       # [0, 1, 2, ..., 10]
let stepped = 0..100..10    # [0, 10, 20, ..., 90]

# Chromosome positions in 1 Mb windows
let chrom_length = 248_956_422   # chr1
let windows = 0..chrom_length..1_000_000
  |> map(|start| {
    chrom: "chr1",
    start: start,
    end: min(start + 1_000_000, chrom_length)
  })

print(f"Generated {len(windows)} windows for chr1")

Example: Sample QC Summary Table from FASTQ Stats

Read multiple FASTQ files, compute QC metrics, and build a summary table.

# fastq_qc_summary.bl
# Build a QC summary table from raw FASTQ files.

let sample_sheet = csv("samples.csv")

let compute_stats = |reads| {
  let quals = reads |> map(|r| mean(r.quality))
  let lengths = reads |> map(|r| seq_len(r.seq))
  let gc_vals = reads |> map(|r| gc_content(r.seq))
  {
    total_reads: len(reads),
    total_bases: sum(lengths),
    mean_length: mean(lengths),
    mean_quality: mean(quals),
    q30_pct: (reads |> filter(|r| mean(r.quality) >= 30) |> len()) / len(reads) * 100.0,
    gc_pct: mean(gc_vals) * 100.0
  }
}

let qc_results = sample_sheet |> map(|row| {
  let r1 = read_fastq(row.fastq_r1)
  let r1_stats = compute_stats(r1)
  let r2_stats = if !is_nil(row.fastq_r2) then
    compute_stats(read_fastq(row.fastq_r2))
  else nil

  {
    sample_id: row.sample_id,
    total_reads: r1_stats.total_reads + (r2_stats?.total_reads ?? 0),
    total_bases: r1_stats.total_bases + (r2_stats?.total_bases ?? 0),
    mean_length: r1_stats.mean_length,
    mean_quality: r1_stats.mean_quality,
    q30_pct: r1_stats.q30_pct,
    gc_pct: r1_stats.gc_pct
  }
})

let qc_table = table(qc_results)

# Add pass/fail column
let qc_flagged = qc_table |> mutate(
  qc_pass: |row| row.mean_quality >= 30.0 && row.q30_pct >= 80.0 && row.total_reads >= 10_000_000
)

# Summary by pass/fail
let summary = qc_flagged |> group_by("qc_pass") |> summarize(
  count: count(),
  mean_reads: mean(total_reads),
  mean_q: mean(mean_quality)
)

print("QC Summary:")
print("=" * 60)
qc_flagged |> each(|row| {
  let flag = if row.qc_pass then "PASS" else "FAIL"
  print(f"  [{flag}] {row.sample_id}: {row.total_reads / 1e6:.1f}M reads, Q={row.mean_quality:.1f}, Q30={row.q30_pct:.1f}%")
})

let pass_count = qc_flagged |> filter(|r| r.qc_pass) |> len()
let fail_count = qc_flagged |> filter(|r| !r.qc_pass) |> len()
print(f"\n{pass_count} passed, {fail_count} failed out of {len(qc_flagged)} samples")

# Write results
qc_flagged |> write_csv("qc_summary.csv")

Example: Gene Expression Matrix – Filter and Normalize

Load a counts matrix, filter low-count genes, and apply TPM normalization.

# expression_normalize.bl
# Filter low-count genes and compute TPM from a raw counts matrix.

let counts = csv("gene_counts.csv")   # rows = genes, columns = samples
let gene_lengths = csv("gene_lengths.csv")  # gene_id, length

# Build a length lookup
let length_map = gene_lengths
  |> group_by("gene_id")
  |> map(|g| {key: g.key, length: g.values[0].length})

# Convert to table with gene metadata
let expr_table = table(counts)
let sample_cols = colnames(expr_table) |> filter(|c| c != "gene_id")

# Filter: keep genes with >= 10 counts in at least 3 samples
let filtered = expr_table |> filter(|row| {
  let expressed = sample_cols |> filter(|col| row[col] >= 10) |> len()
  expressed >= 3
})

print(f"Genes before filter: {len(expr_table)}")
print(f"Genes after filter: {len(filtered)}")

# RPK: reads per kilobase — transform each row
let rpk_table = filtered |> map(|gene_row| {
  let gene_len = length_map |> find(|g| g.key == gene_row.gene_id)
  let len_kb = (gene_len?.length ?? 1000) / 1000.0
  let updated = sample_cols |> reduce({...gene_row}, |acc, col| {
    ...acc, [col]: gene_row[col] / len_kb
  })
  updated
}) |> table()

# TPM: normalize RPK to sum to 1 million per sample
let rpk_sums = sample_cols |> map(|col| {
  col: col,
  total: rpk_table |> select(col) |> sum()
})

let tpm_table = rpk_table |> map(|row| {
  sample_cols |> reduce({...row}, |acc, col| {
    let col_sum = rpk_sums |> find(|s| s.col == col)
    {...acc, [col]: (row[col] / col_sum.total) * 1e6}
  })
}) |> table()

# Summary statistics per gene
let gene_stats = tpm_table |> mutate(
  mean_tpm: |row| sample_cols |> map(|c| row[c]) |> mean(),
  max_tpm: |row| sample_cols |> map(|c| row[c]) |> max(),
  cv: |row| {
    let vals = sample_cols |> map(|c| row[c])
    stdev(vals) / (mean(vals) + 1e-10)
  }
)

# Top variable genes
let top_variable = gene_stats
  |> sort("cv", descending: true)
  |> take(20)

print("\nTop 20 most variable genes (by CV):")
top_variable |> each(|g|
  print(f"  {g.gene_id}: mean TPM={g.mean_tpm:.1f}, CV={g.cv:.2f}")
)

tpm_table |> write_csv("tpm_normalized.csv")

Example: Merge Variant Calls from Multiple Samples

Combine VCF calls from multiple samples into a unified genotype matrix.

# merge_variants.bl
# Merge variant calls from multiple samples into a unified table.

let sample_vcfs = [
  {sample: "tumor_A",  vcf: "tumor_A.vcf.gz"},
  {sample: "tumor_B",  vcf: "tumor_B.vcf.gz"},
  {sample: "normal_A", vcf: "normal_A.vcf.gz"},
  {sample: "normal_B", vcf: "normal_B.vcf.gz"}
]

# Read all variants with sample tag
let all_calls = sample_vcfs |> flat_map(|s| {
  read_vcf(s.vcf) |> map(|v| {
    key: f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}",
    chrom: v.chrom,
    pos: v.pos,
    ref_allele: v.ref,
    alt_allele: v.alt,
    sample: s.sample,
    qual: into(v.qual ?? "0", "Float"),
    depth: into(v.info?.DP ?? "0", "Int"),
    af: into(v.info?.AF ?? "0.0", "Float"),
    genotype: v.samples?.GT ?? "./."
  })
})

# Collect unique variant sites
let unique_sites = all_calls
  |> map(|v| v.key)
  |> unique()
  |> sort()

print(f"Total calls: {len(all_calls)}")
print(f"Unique variant sites: {len(unique_sites)}")

# Build genotype matrix: one row per variant site, one column per sample
let sample_names = sample_vcfs |> map(|s| s.sample)
let calls_by_key = all_calls |> group_by(|v| v.key)

let merged = unique_sites |> map(|site_key| {
  let site_calls = calls_by_key |> find(|g| g.key == site_key)
  let first = site_calls.values[0]

  let base = {
    chrom: first.chrom,
    pos: first.pos,
    ref_allele: first.ref_allele,
    alt_allele: first.alt_allele,
    n_samples: len(site_calls.values)
  }

  # Add per-sample genotype columns
  let gt_fields = sample_names |> reduce({}, |acc, sname| {
    let call = site_calls.values |> find(|v| v.sample == sname)
    {...acc, [f"{sname}_GT"]: call?.genotype ?? "./.", [f"{sname}_AF"]: call?.af ?? 0.0}
  })

  {...base, ...gt_fields}
})

let merged_table = table(merged)

# Find variants present in tumor but absent in normal (somatic candidates)
let somatic = merged_table |> filter(|row| {
  let tumor_present = row.tumor_A_GT != "./." || row.tumor_B_GT != "./."
  let normal_absent = row.normal_A_GT == "./." && row.normal_B_GT == "./."
  tumor_present && normal_absent
})

let shared_tumor = merged_table |> filter(|row|
  row.tumor_A_GT != "./." && row.tumor_B_GT != "./."
)

print(f"\nSomatic candidates: {len(somatic)}")
print(f"Shared across tumors: {len(shared_tumor)}")
print(f"\nTop somatic candidates:")
somatic
  |> sort("n_samples", descending: true)
  |> take(10)
  |> each(|v| print(f"  {v.chrom}:{v.pos} {v.ref_allele}>{v.alt_allele} (in {v.n_samples} sample(s))"))

merged_table |> write_csv("merged_genotypes.csv")

Summary

Collections and tables give you the tools to handle the data shapes that dominate bioinformatics: sample sheets, QC matrices, expression counts, and multi-sample variant calls. Combined with group_by, summarize, and mutate, you can express complex data wrangling pipelines concisely.

The next chapter introduces pipes and transforms – the core of BioLang’s workflow composition model.

Chapter 5: Pipes and Transforms

The pipe operator is the defining feature of BioLang. Every bioinformatics workflow is a chain of transformations: read data, filter, transform, summarize, output. Pipes make these chains explicit, readable, and composable.

The Pipe Operator |>

The pipe a |> f(b) desugars to f(a, b). The left-hand value becomes the first argument of the right-hand function:

# These are equivalent:
gc_content(dna"ATCGATCGATCG")
dna"ATCGATCGATCG" |> gc_content()

# And these:
filter(reads, |r| mean(r.quality) >= 30)
reads |> filter(|r| mean(r.quality) >= 30)

Why Pipe-First Matters for Bioinformatics

Without pipes, nested function calls read inside-out:

# Hard to follow -- you read from the innermost call outward
write_csv(
  sort(
    summarize(
      group_by(
        filter(read_vcf("calls.vcf.gz"), |v| v.qual >= 30),
        |v| v.chrom
      ),
      n: count(),
      mean_qual: mean(qual)
    ),
    "n",
    descending: true
  ),
  "chrom_summary.csv"
)

With pipes, the same workflow reads top-to-bottom, matching how you think about the analysis:

read_vcf("calls.vcf.gz")
  |> filter(|v| v.qual >= 30)
  |> group_by(|v| v.chrom)
  |> summarize(n: count(), mean_qual: mean(qual))
  |> sort("n", descending: true)
  |> write_csv("chrom_summary.csv")

Each line is one transformation step. You can read the pipeline like a protocol.

Chaining Transforms

The real power emerges when you chain multiple operations. Here is a complete read-to-report workflow:

# Read a BAM file, compute per-chromosome alignment statistics
read_bam("sample.sorted.bam")
  |> filter(|r| r.mapq >= 30 && !r.is_duplicate && !r.is_unmapped)
  |> group_by(|r| r.chrom)
  |> summarize(
    mapped_reads: count(),
    mean_mapq: mean(mapq),
    mean_insert: mean(insert_size)
  )
  |> sort("mapped_reads", descending: true)
  |> mutate(pct: |row| row.mapped_reads / sum(mapped_reads) * 100.0)
  |> write_csv("alignment_stats.csv")

The Tap Pipe |>>

Sometimes you need a side effect in the middle of a chain – logging, writing an intermediate file, printing a progress message – without disrupting the data flow. The tap pipe |>> evaluates the right side for its effect but passes the left side through unchanged:

read_fastq("sample_R1.fastq.gz")
  |>> |reads| print(f"Raw reads: {len(reads)}")
  |> filter(|r| mean(r.quality) >= 25)
  |>> |reads| print(f"After quality filter: {len(reads)}")
  |> filter(|r| seq_len(r.seq) >= 50)
  |>> |reads| print(f"After length filter: {len(reads)}")
  |> write_fastq("filtered_R1.fastq.gz")

Output:

Raw reads: 2847291
After quality filter: 2541083
After length filter: 2539847

The tap pipe is invaluable for debugging pipelines:

read_vcf("raw_calls.vcf.gz")
  |>> |vs| print(f"Step 0 - Raw: {len(vs)} variants")
  |> filter(|v| v.filter == "PASS")
  |>> |vs| print(f"Step 1 - PASS only: {len(vs)}")
  |> filter(|v| v.qual >= 30)
  |>> |vs| print(f"Step 2 - Qual >= 30: {len(vs)}")
  |> filter(|v| into(v.info?.DP ?? "0", "Int") >= 10)
  |>> |vs| print(f"Step 3 - Depth >= 10: {len(vs)}")
  |> write_vcf("filtered_calls.vcf.gz")

You can also use tap to write intermediate checkpoints:

read_fasta("assembly.fa")
  |> filter(|s| seq_len(s.seq) >= 1000)
  |>> |seqs| write_fasta(seqs, "contigs_gt1kb.fa")
  |> filter(|s| gc_content(s.seq) >= 0.3 && gc_content(s.seq) <= 0.7)
  |>> |seqs| write_fasta(seqs, "contigs_normal_gc.fa")
  |> sort(|s| seq_len(s.seq), descending: true)
  |> take(100)
  |> write_fasta("top100_contigs.fa")

Higher-Order Functions in Pipe Chains

map – Transform Each Element

# Extract GC content per contig
let gc_profile = read_fasta("contigs.fa")
  |> map(|seq| {id: seq.id, gc: gc_content(seq.seq), length: seq_len(seq.seq)})

filter – Select Elements by Predicate

# Keep only high-quality mapped reads
let good_reads = read_bam("aligned.bam")
  |> filter(|r| r.mapq >= 30)
  |> filter(|r| !r.is_duplicate)
  |> filter(|r| r.is_proper_pair)

reduce – Accumulate a Single Value

# Total bases across all chromosomes
let total_bases = read_fasta("reference.fa")
  |> map(|s| seq_len(s.seq))
  |> reduce(0, |acc, n| acc + n)

print(f"Reference genome size: {total_bases / 1e9:.2f} Gb")

sort – Order Elements

# Rank genes by expression level
let top_genes = csv("tpm.csv")
  |> sort("tpm", descending: true)
  |> take(50)
  |> map(|row| row.gene_name)

flat_map – Map and Flatten

Essential when each input produces multiple outputs:

# Extract all exons from a gene annotation
let all_exons = read_gff("genes.gff3")
  |> filter(|f| f.type == "gene")
  |> flat_map(|gene| gene.children |> filter(|c| c.type == "exon"))

# Find all k-mers across multiple sequences
let all_kmers = read_fasta("contigs.fa")
  |> flat_map(|seq| seq.seq |> window(21) |> map(|w| to_string(w)))
  |> group_by(|kmer| kmer)
  |> map(|g| {kmer: g.key, count: len(g.values)})
  |> sort("count", descending: true)

take_while – Stream Until Condition Fails

# Read sorted variants until we pass a genomic position
let nearby_variants = read_vcf("sorted.vcf.gz")
  |> filter(|v| v.chrom == "chr17")
  |> take_while(|v| v.pos <= 7700000)
  |> filter(|v| v.pos >= 7660000)

scan – Running Accumulation

Like reduce, but emits every intermediate value:

# Cumulative read count across sorted BAM regions
let cumulative = read_bam("sorted.bam")
  |> group_by(|r| r.chrom)
  |> map(|g| {chrom: g.key, count: len(g.values)})
  |> sort("chrom")
  |> scan(0, |acc, row| acc + row.count)
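The semantics are easiest to see on a small list. A sketch, assuming the seed value itself is not emitted and only the running totals are:

let running = [10, 20, 30] |> scan(0, |acc, x| acc + x)
# => [10, 30, 60] -- one cumulative value per input element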

window and chunk – Sliding and Fixed Blocks

# Sliding window GC content
let gc_track = dna"ATCGATCGATCGATCGCCCCGGGG"
  |> window(size: 10, step: 5)
  |> map(|w| gc_content(w))

# Chunk reads into batches for parallel processing
let batches = read_fastq("sample.fastq.gz")
  |> chunk(100_000)
  |> map(|batch| {n_reads: len(batch), mean_gc: batch |> map(|r| gc_content(r.seq)) |> mean()})

zip – Pair Two Collections

# Pair forward and reverse reads
let r1 = read_fastq("sample_R1.fastq.gz")
let r2 = read_fastq("sample_R2.fastq.gz")

let paired = zip(r1, r2) |> map(|pair| {
  id: pair[0].id,
  r1_qual: mean(pair[0].quality),
  r2_qual: mean(pair[1].quality),
  combined_length: seq_len(pair[0].seq) + seq_len(pair[1].seq)
})

enumerate – Track Position

# Number contigs by size rank
read_fasta("assembly.fa")
  |> sort(|s| seq_len(s.seq), descending: true)
  |> enumerate()
  |> map(|i, seq| {rank: i + 1, id: seq.id, length: seq_len(seq.seq)})
  |> take(10)
  |> each(|row| print(f"#{row.rank}: {row.id} ({row.length} bp)"))

Example: FASTQ Quality Filtering Pipeline

A complete quality filtering workflow with logging at each step.

# quality_filter.bl
# Multi-step FASTQ quality filtering pipeline.

let input_r1 = "raw_data/sample_R1.fastq.gz"
let input_r2 = "raw_data/sample_R2.fastq.gz"

# Define one filtering pipeline and apply it to both R1 and R2
let process_reads = |input_path, label| {
  read_fastq(input_path)
    |>> |reads| print(f"[{label}] Input: {len(reads)} reads")

    # Step 1: Remove short reads
    |> filter(|r| seq_len(r.seq) >= 50)
    |>> |reads| print(f"[{label}] After length filter: {len(reads)}")

    # Step 2: Filter by mean quality
    |> filter(|r| mean(r.quality) >= 25)
    |>> |reads| print(f"[{label}] After quality filter: {len(reads)}")

    # Step 3: Remove low-complexity reads
    |> filter(|r| {
      let gc = gc_content(r.seq)
      gc > 0.1 && gc < 0.9
    })
    |>> |reads| print(f"[{label}] After complexity filter: {len(reads)}")
}

let filtered_r1 = process_reads(input_r1, "R1")
let filtered_r2 = process_reads(input_r2, "R2")

# Keep only concordant pairs (both mates survived)
let r1_ids = filtered_r1 |> map(|r| r.id) |> set()
let r2_ids = filtered_r2 |> map(|r| r.id) |> set()
let shared_ids = intersection(r1_ids, r2_ids)

let paired_r1 = filtered_r1 |> filter(|r| contains(shared_ids, r.id))
let paired_r2 = filtered_r2 |> filter(|r| contains(shared_ids, r.id))

print(f"\nFinal paired reads: {len(paired_r1)} pairs")

paired_r1 |> write_fastq("filtered/sample_R1.fastq.gz")
paired_r2 |> write_fastq("filtered/sample_R2.fastq.gz")

# Orphan reads (mate was filtered out)
let orphan_r1 = filtered_r1 |> filter(|r| !contains(shared_ids, r.id))
let orphan_r2 = filtered_r2 |> filter(|r| !contains(shared_ids, r.id))
let orphans = orphan_r1 ++ orphan_r2
print(f"Orphan reads: {len(orphans)}")
orphans |> write_fastq("filtered/sample_orphans.fastq.gz")

Example: Variant Annotation Pipeline

Read VCF, classify variants, annotate against known databases, filter, and report.

# annotate_variants.bl
# Variant annotation and classification pipeline.

let known_pathogenic = csv("clinvar_pathogenic.csv")
  |> map(|r| f"{r.chrom}:{r.pos}:{r.ref}>{r.alt}")
  |> set()

let gnomad_af = csv("gnomad_af.csv")
  |> group_by(|r| f"{r.chrom}:{r.pos}:{r.ref}>{r.alt}")
  |> map(|g| {key: g.key, af: into(g.values[0].af, "Float")})

read_vcf("somatic_calls.vcf.gz")

  # Step 1: Basic quality filter
  |> filter(|v| v.qual >= 30 && v.filter == "PASS")
  |>> |vs| print(f"After quality filter: {len(vs)}")

  # Step 2: Classify variant type
  |> map(|v| {
    ...v,
    key: f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}",
    vtype: variant_type(variant(v.chrom, v.pos, v.ref, v.alt))
  })

  # Step 3: Annotate with population frequency
  |> map(|v| {
    let gnomad = gnomad_af |> find(|g| g.key == v.key)
    {...v, gnomad_af: gnomad?.af ?? 0.0}
  })

  # Step 4: Annotate with ClinVar pathogenicity
  |> map(|v| {
    ...v,
    is_known_pathogenic: contains(known_pathogenic, v.key)
  })

  # Step 5: Assign priority tier
  |> map(|v| {
    let tier = if v.is_known_pathogenic then "Tier1_Known"
      else if v.gnomad_af < 0.001 && v.vtype != "SNV" then "Tier2_RareIndel"
      else if v.gnomad_af < 0.01 then "Tier3_Rare"
      else "Tier4_Common"
    {...v, tier: tier}
  })
  |>> |vs| {
    let tier_counts = vs |> group_by(|v| v.tier) |> map(|g| f"{g.key}: {len(g.values)}")
    print(f"Tier distribution: {tier_counts}")
  }

  # Step 6: Filter to actionable variants
  |> filter(|v| v.tier == "Tier1_Known" || v.tier == "Tier2_RareIndel")
  |>> |vs| print(f"Actionable variants: {len(vs)}")

  # Step 7: Sort by priority and position
  |> sort(["tier", "chrom", "pos"])

  # Step 8: Generate report
  |> map(|v| {
    chrom: v.chrom,
    pos: v.pos,
    ref_allele: v.ref,
    alt_allele: v.alt,
    type: v.vtype,
    qual: v.qual,
    gnomad_af: v.gnomad_af,
    tier: v.tier,
    clinvar: if v.is_known_pathogenic then "Pathogenic" else "."
  })
  |> write_csv("annotated_actionable.csv")

Example: Multi-Sample Comparison with Zip and Reduce

Compare variant calls across matched tumor-normal pairs and compute concordance.

# multi_sample_compare.bl
# Compare variant calls across tumor-normal pairs.

let pairs = [
  {tumor: "patient1_tumor.vcf.gz", normal: "patient1_normal.vcf.gz", patient: "P001"},
  {tumor: "patient2_tumor.vcf.gz", normal: "patient2_normal.vcf.gz", patient: "P002"},
  {tumor: "patient3_tumor.vcf.gz", normal: "patient3_normal.vcf.gz", patient: "P003"}
]

# Process each pair
let pair_results = pairs |> map(|pair| {
  let tumor_vars = read_vcf(pair.tumor)
    |> filter(|v| v.qual >= 30)
    |> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
    |> set()

  let normal_vars = read_vcf(pair.normal)
    |> filter(|v| v.qual >= 30)
    |> map(|v| f"{v.chrom}:{v.pos}:{v.ref}>{v.alt}")
    |> set()

  let somatic = difference(tumor_vars, normal_vars)
  let germline = intersection(tumor_vars, normal_vars)
  let normal_only = difference(normal_vars, tumor_vars)

  {
    patient: pair.patient,
    tumor_total: len(tumor_vars),
    normal_total: len(normal_vars),
    somatic: len(somatic),
    germline: len(germline),
    normal_only: len(normal_only),
    somatic_variants: somatic
  }
})

# Summary table
let summary = table(pair_results |> map(|r| {
  patient: r.patient,
  tumor_total: r.tumor_total,
  normal_total: r.normal_total,
  somatic: r.somatic,
  germline: r.germline,
  somatic_rate: r.somatic / r.tumor_total * 100.0
}))

print("Patient Variant Comparison")
print("=" * 70)
summary |> each(|row|
  print(f"  {row.patient}: tumor={row.tumor_total}, somatic={row.somatic} ({row.somatic_rate:.1f}%), germline={row.germline}")
)

# Find recurrent somatic variants (present in 2+ patients)
let all_somatic = pair_results
  |> flat_map(|r| r.somatic_variants |> collect() |> map(|v| {variant: v, patient: r.patient}))
  |> group_by(|x| x.variant)
  |> filter(|g| len(g.values) >= 2)
  |> map(|g| {
    variant: g.key,
    n_patients: len(g.values),
    patients: g.values |> map(|v| v.patient)
  })
  |> sort("n_patients", descending: true)

print(f"\nRecurrent somatic variants (in >= 2 patients): {len(all_somatic)}")
all_somatic |> take(20) |> each(|v|
  print(f"  {v.variant} -- found in {v.n_patients} patients: {v.patients}")
)

# Cross-patient aggregates via reduce
let totals = pair_results |> reduce(
  {total_somatic: 0, total_germline: 0, patients: 0},
  |acc, r| {
    total_somatic: acc.total_somatic + r.somatic,
    total_germline: acc.total_germline + r.germline,
    patients: acc.patients + 1
  }
)

print(f"\nCohort totals across {totals.patients} patients:")
print(f"  Mean somatic variants: {totals.total_somatic / totals.patients:.0f}")
print(f"  Mean germline variants: {totals.total_germline / totals.patients:.0f}")

# Zip tumor-normal read depths for concordance check
let p1_tumor_depths = read_vcf(pairs[0].tumor) |> map(|v| into(v.info?.DP ?? "0", "Int"))
let p1_normal_depths = read_vcf(pairs[0].normal) |> map(|v| into(v.info?.DP ?? "0", "Int"))

let depth_comparison = zip(p1_tumor_depths, p1_normal_depths)
  |> map(|pair| {tumor_dp: pair[0], normal_dp: pair[1], ratio: pair[0] / (pair[1] + 1.0)})
  |> filter(|d| d.ratio > 3.0)   # flag positions where tumor depth exceeds 3x normal

print(f"\nPatient 1 depth outliers (tumor/normal > 3x): {len(depth_comparison)}")

Summary

Pipes and transforms are the backbone of BioLang. The |> operator turns nested function calls into linear, readable workflows. The tap pipe |>> lets you observe and debug without disrupting the data flow. Higher-order functions like map, filter, reduce, flat_map, zip, and window cover the full range of data transformation patterns you encounter in bioinformatics pipelines.

Every example in this chapter represents a real analysis pattern: quality filtering, variant annotation, multi-sample comparison. As your pipelines grow more complex, pipes keep them readable and composable.

The next chapter covers functions and closures – the primary units of reuse in BioLang programs.

Chapter 6: Functions and Closures

Functions are the primary unit of reuse in BioLang. Whether you are writing a normalisation routine for RNA-seq counts, a scoring function for sequence alignment, or a filter predicate for variant calls, functions let you name a computation and invoke it wherever it is needed.

Defining Functions

A function is introduced with the fn keyword, followed by a name, a parameter list, and a body enclosed in braces.

fn gc_content(seq) {
    let gc = seq |> filter(|b| b == "G" || b == "C") |> len()
    gc / seq_len(seq)
}

let ratio = gc_content("ATGCGCTA")
# ratio => 0.5

The last expression in the body is the implicit return value. There is no need for a return keyword in the common case.

Explicit Return

When you need to exit early – for example after detecting an invalid input – use return.

fn median_quality(quals) {
    if len(quals) == 0 then {
        return 0.0
    }
    let sorted = quals |> sort()
    let mid = len(sorted) / 2
    if len(sorted) % 2 == 0 then {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

Default Parameters

Parameters can carry default values. Callers may omit them.

fn trim_reads(records, min_qual: 20, min_len: 36) {
    records
        |> filter(|r| mean(r.quality) >= min_qual)
        |> filter(|r| seq_len(r.seq) >= min_len)
}

# Use defaults
let trimmed = trim_reads(raw_reads)

# Override quality threshold only
let strict = trim_reads(raw_reads, min_qual: 30)

Default parameters must appear after positional parameters.
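For example, a hypothetical window_gc helper with two defaulted tuning parameters:

fn window_gc(seq, size: 100, step: 50) {
    seq |> window(size: size, step: step) |> map(|w| gc_content(w))
}

let default_track = window_gc(contig)           # both defaults
let fine_track = window_gc(contig, step: 10)    # override step only

# Invalid: a defaulted parameter before a positional one
# fn window_gc(size: 100, seq) { ... }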

Named Return Values

You can declare named return fields to make the return shape explicit.

fn count_variants(vcf_path) -> (snps: Int, indels: Int, other: Int) {
    let records = read_vcf(vcf_path)
    let snps = 0
    let indels = 0
    let other = 0
    records |> each(|v| {
        let vt = variant_type(variant(v.chrom, v.pos, v.ref, v.alt))
        if vt == "SNV" then snps = snps + 1
        else if vt == "Indel" then indels = indels + 1
        else other = other + 1
    })
}
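Assuming named return fields surface to the caller as a record (the chapter does not show the call site, so treat this as a sketch):

let counts = count_variants("calls.vcf.gz")
print(f"SNVs: {counts.snps}, indels: {counts.indels}, other: {counts.other}")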

Closures and Lambdas

A closure is an anonymous function written with pipe delimiters. Short closures use a single expression; longer ones use { } for a block body.

# Single-expression closure
let is_high_qual = |v| v.qual >= 30.0

# Block closure
let normalise = |counts, size_factor| {
    let total = counts |> reduce(0, |a, b| a + b)
    counts |> map(|c| (c / total) * 1e6 / size_factor)
}

Closures capture variables from the surrounding scope, which makes them ideal for building specialised predicates on the fly.

fn make_depth_filter(min_depth) {
    |record| record.info.DP >= min_depth
}

let deep_enough = make_depth_filter(30)
let filtered = variants |> filter(deep_enough)

Higher-Order Functions

A higher-order function accepts another function (or closure) as an argument, returns one, or both.

fn apply_qc_chain(reads, filters) {
    filters |> reduce(reads, |current, f| current |> filter(f))
}

let filters = [
    |r| mean(r.quality) >= 20,
    |r| seq_len(r.seq) >= 50,
    |r| gc_content(r.seq) < 0.8
]

let passing = apply_qc_chain(raw_reads, filters)

Because pipe inserts as the first argument, you can chain higher-order functions fluently:

let top_genes = counts
    |> filter(|g| g.biotype == "protein_coding")
    |> sort(|g| g.tpm, descending: true)
    |> take(100)
    |> map(|g| g.gene_name)

Where Clauses

A where clause attaches a precondition to a function. If the predicate is false at call time, a runtime error is raised.

fn normalise_counts(counts) where len(counts) > 0 {
    let total = counts |> reduce(0, |a, b| a + b)
    counts |> map(|c| c / total)
}

fn log2_fold_change(treated, control) where control > 0.0 {
    log2(treated / control)
}

fn call_consensus(reads) where len(reads) >= 3 {
    let bases = reads |> map(|r| r.base)
    bases |> group_by(|b| b)
         |> sort(|g| len(g.values), descending: true)
         |> map(|g| g.key)
         |> first()
}

Where clauses serve as executable documentation: they state the contract and enforce it at the boundary.
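A violated where clause raises an ordinary runtime error, so callers can trap it like any other failure. A minimal sketch (the exact message text is illustrative; try/catch is covered in the Error Handling chapter):

let lfc = try {
    log2_fold_change(8.0, 0.0)   # control > 0.0 is false, so the call errors
} catch e {
    print("Precondition failed: " + e.message)
    None
}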

The @memoize Decorator

Bioinformatics computations are often called repeatedly with the same inputs – the same gene queried in an annotation database, the same k-mer scored in a seed-extension aligner. The @memoize decorator caches results keyed by argument values.

@memoize
fn gc_of_kmer(kmer) {
    gc_content(dna(kmer))
}

# First call computes and caches the result.
# Subsequent calls with the same k-mer return instantly.
let g1 = gc_of_kmer("ATCGGC")  # cache miss
let g2 = gc_of_kmer("ATCGGC")  # cache hit

Memoization is especially valuable when combined with recursion.
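As a toy illustration (not part of the standard library), a naively recursive Fibonacci becomes linear-time once memoized, because each sub-result is computed exactly once:

@memoize
fn fib(n) {
    if n < 2 then { return n }
    fib(n - 1) + fib(n - 2)
}

let f = fib(40)   # completes quickly; without @memoize this is exponential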

Other Decorators

Decorators are annotations placed before a function definition. Besides @memoize, BioLang supports user-defined decorators.

@log_timing
fn align_reads(fastq, ref_genome) {
    # ... alignment logic ...
}

@validate_inputs
fn merge_bams(bam_paths) where len(bam_paths) > 0 {
    # ... merge logic ...
}

Decorators wrap the function transparently – the original signature and behaviour are preserved, with the decorator adding cross-cutting logic.

Recursion

Recursive functions call themselves. BioLang supports standard recursion, which is useful for tree-structured biological data such as phylogenies and ontology DAGs.

fn tree_leaf_count(node) {
    if len(node.children) == 0 then {
        return 1
    }
    node.children |> map(tree_leaf_count) |> reduce(0, |a, b| a + b)
}

Combine @memoize with recursion for dynamic-programming-style algorithms:

@memoize
fn needleman_wunsch_score(seq1, seq2, i, j, gap_penalty: -2) {
    if i == 0 then { return j * gap_penalty }
    if j == 0 then { return i * gap_penalty }

    let match_score = if seq1[i - 1] == seq2[j - 1] then 1 else -1

    # Pass gap_penalty through so a non-default penalty applies to every recursive call
    let diag   = needleman_wunsch_score(seq1, seq2, i - 1, j - 1, gap_penalty: gap_penalty) + match_score
    let up     = needleman_wunsch_score(seq1, seq2, i - 1, j, gap_penalty: gap_penalty) + gap_penalty
    let left   = needleman_wunsch_score(seq1, seq2, i, j - 1, gap_penalty: gap_penalty) + gap_penalty

    max(diag, max(up, left))
}

let score = needleman_wunsch_score("AGTACG", "ACATAG", 6, 6)

Example: Memoized K-mer Scoring Function

Counting k-mer matches across many sequences calls the same function millions of times. Caching the GC computation removes redundant work.

@memoize
fn kmer_gc(kmer_str) {
    gc_content(dna(kmer_str))
}

fn score_sequence_kmers(seq, k: 21) {
    let kmer_list = seq |> kmers(k)
    let gc_scores = kmer_list |> map(|km| kmer_gc(to_string(km)))
    {
        n_kmers: len(gc_scores),
        mean_gc: mean(gc_scores),
        high_gc_count: gc_scores |> filter(|g| g > 0.6) |> len()
    }
}

let result = score_sequence_kmers(dna"ATCGATCGCCCCGGGGATATATATCGCGCGATCG")
print(f"K-mers: {result.n_kmers}, Mean GC: {result.mean_gc:.3f}")

Example: Reusable QC Filter Factory

A factory function returns a closure configured with experiment-specific thresholds. Different sequencing protocols produce different acceptable ranges.

fn make_qc_filter(min_qual: 20, min_len: 50, max_n_frac: 0.05) {
    |read| {
        let avg_q = mean(read.quality)
        let n_frac = (read.seq |> filter(|b| b == "N") |> len()) / seq_len(read.seq)
        avg_q >= min_qual && seq_len(read.seq) >= min_len && n_frac <= max_n_frac
    }
}

# Whole-genome: permissive
let wgs_filter = make_qc_filter(min_qual: 15, min_len: 36)

# Amplicon: strict
let amp_filter = make_qc_filter(min_qual: 30, min_len: 100, max_n_frac: 0.01)

let wgs_reads = read_fastq("sample_wgs.fastq.gz") |> filter(wgs_filter)
let amp_reads = read_fastq("sample_amp.fastq.gz") |> filter(amp_filter)

print("WGS passing: " + to_string(len(wgs_reads)))
print("Amplicon passing: " + to_string(len(amp_reads)))

Example: Recursive Phylogenetic Tree Traversal

Phylogenetic trees are naturally recursive. This example collects all leaf taxa whose branch length from the root exceeds a threshold – useful for detecting fast-evolving lineages.

fn collect_distant_leaves(node, dist_so_far, threshold) {
    let current_dist = dist_so_far + (node.branch_length ?? 0.0)

    if len(node.children) == 0 then {
        # Leaf node
        if current_dist > threshold then {
            return [{name: node.name, distance: current_dist}]
        } else {
            return []
        }
    }

    # Internal node: recurse into children and concatenate results
    node.children
        |> flat_map(|child| collect_distant_leaves(child, current_dist, threshold))
}

# Build a simple tree as nested records
let tree = {
    name: "root", branch_length: 0.0, children: [
        {name: "primates", branch_length: 0.08, children: [
            {name: "human", branch_length: 0.1, children: []},
            {name: "chimp", branch_length: 0.12, children: []}
        ]},
        {name: "rodents", branch_length: 0.20, children: [
            {name: "mouse", branch_length: 0.35, children: []},
            {name: "rat", branch_length: 0.33, children: []}
        ]}
    ]
}

let fast_evolving = collect_distant_leaves(tree, 0.0, 0.30)

fast_evolving |> each(|taxon|
    print(taxon.name + " => total branch length " + to_string(taxon.distance))
)
# mouse => total branch length 0.55
# rat => total branch length 0.53

Summary

Feature          Syntax                     Use Case
Function         fn name(params) { }        Named, reusable computation
Default params   fn f(x, k: 10) { }         Flexible call sites
Where clause     fn f(x) where x > 0 { }    Precondition contracts
Closure          |x| expr                   Inline predicates, callbacks
Block closure    |x| { stmts }              Multi-step anonymous logic
@memoize         @memoize fn f() { }        Cache repeated computations
Recursion        Self-call in body          Trees, dynamic programming

Functions and closures are the foundation on which the rest of BioLang – pipelines, parallel maps, and the standard library – is built. Master them and every subsequent chapter becomes easier.

Control Flow

BioLang provides a rich set of control-flow constructs designed for the messy realities of biological data. Variants need classification, samples need multi-criteria gating, and searches over sequences must handle both the found and not-found cases gracefully.

if/else Expressions

if/else in BioLang is an expression – it returns a value. This lets you embed conditional logic directly in assignments and pipes.

let label = if gc_content(seq) > 0.6 {
    "GC-rich"
} else if gc_content(seq) < 0.4 {
    "AT-rich"
} else {
    "balanced"
}

Because it is an expression, you can use it inline:

let allele_freq = if total_depth > 0 { alt_count / total_depth } else { 0.0 }

match Expressions

match compares a value against a series of patterns. Like if, it is an expression and returns the result of the matching arm.

let codon = "ATG"

let amino_acid = match codon {
    "ATG" => "Met"
    "TGG" => "Trp"
    "TAA" | "TAG" | "TGA" => "Stop"
    _ => lookup_codon_table(codon)
}

Patterns can destructure records:

fn describe_alignment(aln) {
    match aln {
        {mapped: true, mapq: q} if q >= 30 => "high-confidence"
        {mapped: true, mapq: q} if q >= 10 => "low-confidence"
        {mapped: true}                     => "very-low-quality"
        {mapped: false}                    => "unmapped"
    }
}

for Loops

The basic for loop iterates over any collection or stream.

let reads = read_fastq("sample.fastq.gz")
let total_bases = 0

for read in reads {
    total_bases = total_bases + len(read.seq)
}

print("Total bases: " + str(total_bases))

Tuple Destructuring

When iterating over paired data, destructure directly in the loop header.

let sample_ids = ["S1", "S2", "S3"]
let fastq_paths = [
    "data/S1_R1.fastq.gz",
    "data/S2_R1.fastq.gz",
    "data/S3_R1.fastq.gz"
]

for (sample, path) in zip(sample_ids, fastq_paths) {
    let stats = read_fastq(path) |> read_stats()
    print(sample + ": " + str(stats.total_reads) + " reads")
}

Destructuring also works with enumerate:

let exons = read_bed("BRCA1_exons.bed")

for (i, exon) in enumerate(exons) {
    print("Exon " + str(i + 1) + ": " + exon.chrom + ":" + str(exon.start) + "-" + str(exon.end))
}

for/else

The else block executes only when the loop completes without hitting a break. This is perfect for search-and-report patterns.

let target = dna"TATAAA"
let promoters = read_bed("promoter_regions.bed")

for region in promoters {
    let seq = extract_sequence(ref_genome, region)
    if contains(seq, target) {
        print("TATA box found in " + region.name)
        break
    }
} else {
    print("No TATA box found in any promoter region")
}

while Loops

Use while when the number of iterations is unknown ahead of time.

fn find_orf(seq) {
    # Scans reading frame 0 only; apply to shifted slices to cover the other frames
    let i = 0
    let in_orf = false
    let orf_start = 0
    let orfs = []

    while i + 2 < len(seq) {
        let codon = slice(seq, i, i + 3)
        if codon == "ATG" && !in_orf {
            in_orf = true
            orf_start = i
        }
        if (codon == "TAA" || codon == "TAG" || codon == "TGA") && in_orf {
            orfs = orfs + [{start: orf_start, end: i + 3, length: i + 3 - orf_start}]
            in_orf = false
        }
        i = i + 3
    }
    orfs
}

let orfs = find_orf("ATGAAACCCTAGATGTTTGAATAA")
# [{start: 0, end: 12, length: 12}, {start: 12, end: 24, length: 12}]

break and continue

break exits the innermost loop. continue skips to the next iteration.

let variants = read_vcf("somatic.vcf")
let first_pathogenic = None

for v in variants {
    # Skip low-quality calls
    if v.qual < 30.0 {
        continue
    }

    let info = parse_info(v.info_str)
    if info?.CLNSIG == "Pathogenic" {
        first_pathogenic = v
        break
    }
}

given/otherwise

given/otherwise is a declarative chain of conditions. It reads top to bottom; the first true condition wins. otherwise is the fallback.

fn classify_read_pair(r1_len, r2_len, insert_size) {
    given {
        insert_size < 0         => "invalid_pair"
        insert_size > 1000      => "structural_variant_candidate"
        r1_len < 36 || r2_len < 36 => "short_read"
        insert_size < 150       => "short_insert"
        otherwise               => "normal"
    }
}

given is an expression, so you can assign its result:

let risk = given {
    allele_freq >= 0.5 && coverage >= 30 => "high_confidence_somatic"
    allele_freq >= 0.2                   => "moderate_evidence"
    allele_freq >= 0.05                  => "low_frequency"
    otherwise                            => "below_detection"
}

guard Clauses

guard asserts that a condition is true. If it is not, the else block executes – typically an early return or error.

fn calculate_tmb(variants, exome_size_mb) {
    guard exome_size_mb > 0 else {
        return {error: "Exome size must be positive"}
    }
    guard len(variants) > 0 else {
        return {tmb: 0.0, classification: "low"}
    }

    let somatic = variants |> filter(|v| v.filter == "PASS")
    let tmb = len(somatic) / exome_size_mb

    let classification = given {
        tmb >= 20.0  => "high"
        tmb >= 10.0  => "intermediate"
        otherwise    => "low"
    }

    {tmb: tmb, classification: classification}
}

Guards keep the main logic at the top indentation level by pushing error handling to the margin. Use them liberally for input validation.

unless

unless is syntactic sugar for if !condition. It reads naturally for negative checks.

fn process_bam(path) {
    let header = read_bam_header(path)

    unless contains(header.sort_order, "coordinate") {
        print("WARNING: BAM is not coordinate-sorted, sorting first")
        path = sort_bam(path)
    }

    unless file_exists(path + ".bai") {
        print("Indexing BAM")
        index_bam(path)
    }

    # Proceed with analysis
    let stats = bam_stats(path)
    stats
}

Example: Classify Variants by Type

Read a VCF and produce a summary table of variant types using match.

let variants = read_vcf("sample.vcf")

let classified = variants |> map(|v| {
    let vtype = match true {
        is_snp(v) && is_transition(v)   => "transition"
        is_snp(v) && is_transversion(v) => "transversion"
        is_snp(v)                       => "snp_other"
        is_indel(v) && len(v.alt) > len(v.ref) => "insertion"
        is_indel(v)                     => "deletion"
        _                               => "complex"
    }
    {...v, classification: vtype}
})

let summary = classified
    |> group_by(|v| v.classification)
    |> map(|group| {type: group.0, count: len(group.1)})
    |> sort(|a, b| b.count - a.count)

for row in summary {
    print(row.type + ": " + str(row.count))
}
# transition: 45231
# transversion: 22890
# deletion: 3412
# insertion: 2876
# complex: 134

Example: Search for Motif with for/else

Scan upstream regions for a transcription-factor binding motif. Report the first hit or state that none was found.

let motif = dna"CANNTG"  # E-box motif (N = any base)
let genes = read_gff("annotations.gff")
    |> filter(|f| f.type == "gene" && f.biotype == "protein_coding")

for gene in genes {
    # Upstream window (assumes + strand; for minus-strand genes the upstream region lies past gene.end)
    let upstream = interval(gene.chrom, gene.start - 2000, gene.start)
    let seq = extract_sequence(ref_genome, upstream)
    let hits = find_motif(seq, motif)

    if len(hits) > 0 {
        print("E-box found upstream of " + gene.name + " at offset " + str(hits[0].position))
        break
    }
} else {
    print("No E-box motif found in any upstream region scanned")
}

Example: Multi-Criteria Sample QC with given/otherwise

Gate samples through a series of quality thresholds and assign a disposition.

fn qc_disposition(stats) {
    given {
        stats.total_reads < 1_000_000 =>
            {pass: false, reason: "insufficient_reads", reads: stats.total_reads}

        stats.pct_mapped < 70.0 =>
            {pass: false, reason: "low_mapping_rate", pct: stats.pct_mapped}

        stats.mean_depth < 10.0 =>
            {pass: false, reason: "low_depth", depth: stats.mean_depth}

        stats.pct_duplicate > 40.0 =>
            {pass: false, reason: "high_duplication", pct: stats.pct_duplicate}

        stats.contamination > 0.03 =>
            {pass: false, reason: "contamination", frac: stats.contamination}

        otherwise =>
            {pass: true, reason: "all_checks_passed"}
    }
}

let samples = ["S001", "S002", "S003", "S004"]
let bam_dir = "data/aligned"

for sample in samples {
    let stats = bam_stats(bam_dir + "/" + sample + ".bam")
    let result = qc_disposition(stats)

    if result.pass {
        print(sample + ": PASS")
    } else {
        print(sample + ": FAIL (" + result.reason + ")")
    }
}

Example: Input Validation with guard

A variant-calling wrapper that validates every precondition before running the expensive computation.

fn call_variants(bam_path, ref_path, bed_path, min_depth: 10) {
    guard file_exists(bam_path) else {
        return {error: "BAM file not found: " + bam_path}
    }
    guard file_exists(ref_path) else {
        return {error: "Reference FASTA not found: " + ref_path}
    }
    guard file_exists(bed_path) else {
        return {error: "Target BED not found: " + bed_path}
    }

    let header = read_bam_header(bam_path)
    guard header.sort_order == "coordinate" else {
        return {error: "BAM must be coordinate-sorted"}
    }
    guard file_exists(bam_path + ".bai") else {
        return {error: "BAM index (.bai) not found"}
    }

    # All preconditions met -- run the caller
    let regions = read_bed(bed_path)
    let raw_calls = pileup_caller(bam_path, ref_path, regions, min_depth)

    let filtered = raw_calls
        |> filter(|v| v.qual >= 30.0 && v.depth >= min_depth)

    {
        total_calls: len(raw_calls),
        passing_calls: len(filtered),
        variants: filtered
    }
}

Summary

Construct         Returns Value?   Best For
if/else           Yes              Binary or ternary decisions
match             Yes              Multi-arm pattern dispatch
for               No               Iteration over collections
for/else          No               Search with not-found fallback
while             No               Indeterminate iteration
given/otherwise   Yes              Declarative condition chains
guard ... else    No               Early-exit preconditions
unless            No               Negative-condition readability

Choose the construct that best communicates intent. Use guard for validation at function boundaries, given for multi-criteria classification, match for type-driven dispatch, and for/else when a search must report failure.

Error Handling

Biological data is messy. FASTA files contain malformed headers, VCF records have missing INFO fields, and remote databases time out. BioLang provides layered error-handling mechanisms so you can write code that degrades gracefully instead of crashing mid-pipeline.

try/catch Blocks

Wrap any code that might fail in try. If an error occurs, control transfers to the catch block, where you can inspect the error and decide how to proceed.

let records = try {
    read_vcf("variants.vcf")
} catch e {
    print("Failed to parse VCF: " + e.message)
    []
}

try/catch is an expression – it returns the value of whichever branch executes:

let depth = try {
    let info = parse_info(variant.info_str)
    int(info["DP"])
} catch e {
    0  # default when DP is absent or unparseable
}

Catching Specific Errors

You can pattern-match on error types inside catch:

try {
    let result = fetch_uniprot(accession)
    process_protein(result)
} catch e {
    match e.kind {
        "network_error" => {
            print("Network unreachable, using cached data")
            load_cache("uniprot", accession)
        }
        "parse_error" => {
            print("Malformed response for " + accession + ", skipping")
            None
        }
        _ => {
            print("Unexpected error: " + e.message)
            None
        }
    }
}

Error Propagation

When a function cannot handle an error meaningfully, let it propagate to the caller.

fn load_reference(path) {
    guard file_exists(path) else {
        error("Reference file not found: " + path)
    }
    let records = read_fasta(path)
    guard len(records) > 0 else {
        error("Reference file is empty: " + path)
    }
    records
}

# Caller decides how to handle it
try {
    let ref_seqs = load_reference("hg38.fa")
    run_alignment(reads, ref_seqs)
} catch e {
    print("Pipeline aborted: " + e.message)
}

The error() function raises an exception that unwinds to the nearest catch.

retry

Network calls to biological databases are unreliable. The retry construct repeats a block up to a specified number of attempts, with configurable delay between tries.

let gene_info = retry(3, delay: 2000) {
    fetch_ncbi_gene("BRCA1")
}

If all attempts fail, the last error propagates. Combine retry with try/catch for a fully defensive pattern:

let annotation = try {
    retry(5, delay: 1000) {
        fetch_ensembl_annotation("ENSG00000139618")
    }
} catch e {
    print("Ensembl unavailable after 5 attempts: " + e.message)
    {gene_name: "BRCA2", source: "fallback_cache"}
}

Exponential Backoff

For high-traffic APIs, increase delay between retries using a helper:

fn fetch_with_backoff(url, max_attempts: 5) {
    let attempt = 0
    let result = None

    while attempt < max_attempts {
        try {
            result = http_get(url)
            break
        } catch e {
            attempt = attempt + 1
            if attempt >= max_attempts {
                error("All " + str(max_attempts) + " attempts failed: " + e.message)
            }
            let wait = 1000 * pow(2, attempt - 1)  # 1s, 2s, 4s, 8s, 16s
            print("Attempt " + str(attempt) + " failed, retrying in " + str(wait) + "ms")
            sleep(wait)
        }
    }
    result
}

let blast_result = fetch_with_backoff("https://blast.ncbi.nlm.nih.gov/blast/Blast.cgi?RID=" + rid)

Null Coalescing: ??

The ?? operator returns the left-hand side if it is not None; otherwise it returns the right-hand side. This is indispensable for fields that may be absent.

let depth = variant.info?.DP ?? 0
let gene_name = annotation?.gene_name ?? "unknown"
let af = variant.info?.AF ?? variant.info?.MAF ?? 0.0

?? chains naturally. The first non-None value wins:

let symbol = record.hugo_symbol ?? record.gene_id ?? record.locus_tag ?? "uncharacterised"

Optional Chaining: ?.

The ?. operator short-circuits a field access chain when any intermediate value is None. Instead of crashing, the entire expression evaluates to None.

# Without optional chaining -- crashes if info is None or CLNSIG is absent
let sig = variant.info.CLNSIG

# With optional chaining -- returns None safely
let sig = variant.info?.CLNSIG

# Deep chains
let protein_change = variant.annotation?.consequences?.protein?.hgvs

Combine with ?? for a complete default:

let consequence = variant.annotation?.consequences?.most_severe ?? "unknown"

Defensive Patterns for Bio Data

Missing Fields in Records

Biological file formats are under-specified. A BED file might have 3 columns or 12. A VCF INFO field might omit half the keys. Build defensive accessors.

fn safe_info(variant, key, default: None) {
    try {
        let info = parse_info(variant.info_str)
        info[key] ?? default
    } catch e {
        default
    }
}

let dp = safe_info(v, "DP", default: 0)
let af = safe_info(v, "AF", default: 0.0)
let clnsig = safe_info(v, "CLNSIG", default: "not_reported")

Malformed Records

Skip bad records instead of aborting the whole file:

fn robust_parse(lines, parser) {
    let good = []
    let bad = 0

    for (i, line) in enumerate(lines) {
        try {
            good = good + [parser(line)]
        } catch e {
            bad = bad + 1
            print("Skipped line " + str(i + 1) + ": " + e.message)
        }
    }

    if bad > 0 {
        print("Warning: " + str(bad) + " malformed records skipped")
    }
    good
}

None-safe Aggregation

When computing statistics over optional fields, filter out None values first:

let qualities = variants
    |> map(|v| v.qual)
    |> filter(|q| q != None)

let mean_qual = if len(qualities) > 0 { mean(qualities) } else { 0.0 }

Example: Robust FASTA Parser with try/catch

Parse a FASTA file that may contain corrupted entries mixed with valid ones.

fn parse_fasta_robust(path) {
    let raw = read_lines(path)
    let records = []
    let current_header = None
    let current_seq = ""
    let errors = []

    for line in raw {
        if starts_with(line, ">") {
            # Flush previous record
            if current_header != None {
                try {
                    guard len(current_seq) > 0 else {
                        error("Empty sequence for " + current_header)
                    }
                    guard is_valid_dna(current_seq) else {
                        error("Invalid characters in " + current_header)
                    }
                    records = records + [{header: current_header, seq: current_seq}]
                } catch e {
                    errors = errors + [{header: current_header, reason: e.message}]
                }
            }
            current_header = slice(line, 1, len(line)) |> trim()
            current_seq = ""
        } else {
            current_seq = current_seq + trim(line)
        }
    }

    # Flush last record
    if current_header != None {
        try {
            guard len(current_seq) > 0 else { error("Empty sequence") }
            guard is_valid_dna(current_seq) else { error("Invalid characters") }
            records = records + [{header: current_header, seq: current_seq}]
        } catch e {
            errors = errors + [{header: current_header, reason: e.message}]
        }
    }

    print("Parsed " + str(len(records)) + " records, " + str(len(errors)) + " errors")
    for err in errors {
        print("  SKIP: " + err.header + " (" + err.reason + ")")
    }
    records
}

let sequences = parse_fasta_robust("mixed_quality.fasta")

Example: Retrying API Calls to NCBI

Query NCBI’s E-utilities for gene summaries, handling the rate limit and transient failures that are common on public APIs.

fn fetch_gene_summaries(gene_ids) {
    let results = []

    for id in gene_ids {
        let summary = try {
            retry(3, delay: 1500) {
                let url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
                    + "?db=gene&id=" + str(id) + "&retmode=json"
                let resp = http_get(url)
                let data = parse_json(resp.body)
                guard data?.result != None else {
                    error("No result field in response")
                }
                data.result[str(id)]
            }
        } catch e {
            print("Failed to fetch gene " + str(id) + ": " + e.message)
            {gene_id: id, name: "unknown", description: "fetch_failed"}
        }

        results = results + [summary]

        # Respect NCBI rate limit: max 3 requests per second without API key
        sleep(400)
    }
    results
}

let gene_ids = [672, 675, 7157, 4609]  # BRCA1, BRCA2, TP53, MYC
let summaries = fetch_gene_summaries(gene_ids)

for s in summaries {
    print(str(s.gene_id ?? s.uid) + ": " + (s.name ?? "unknown") + " - " + (s.description ?? "N/A"))
}

Example: Processing VCF with Missing INFO Fields

Real VCF files have inconsistent INFO columns. Some records carry AF, others do not. Some have CLNSIG, most do not. Use ?. and ?? to process them uniformly.

let variants = read_vcf("clinvar.vcf")

let annotated = variants |> map(|v| {
    let info = parse_info(v.info_str)

    let af = info?.AF ?? info?.CAF ?? 0.0
    let clinical_sig = info?.CLNSIG ?? "not_reported"
    let gene = if info?.GENEINFO != None { split(info.GENEINFO, ":") |> first() } else { "intergenic" }
    let review_status = info?.CLNREVSTAT ?? "no_assertion"
    let origin = info?.ORIGIN ?? "unknown"

    {
        chrom: v.chrom,
        pos: v.pos,
        ref: v.ref,
        alt: v.alt,
        gene: gene,
        clinical_significance: clinical_sig,
        allele_frequency: af,
        review_status: review_status,
        origin: origin
    }
})

# Filter to pathogenic variants that carry at least some review assertion
let pathogenic = annotated
    |> filter(|v| v.clinical_significance == "Pathogenic"
               || v.clinical_significance == "Likely_pathogenic")
    |> filter(|v| v.review_status != "no_assertion")

print("Pathogenic/likely pathogenic variants: " + str(len(pathogenic)))

for v in pathogenic |> take(10) {
    print(v.gene + " " + v.chrom + ":" + str(v.pos)
        + " " + v.ref + ">" + v.alt
        + " AF=" + str(v.allele_frequency))
}

# Write a clean report
write_csv(pathogenic, "pathogenic_report.csv")

Summary

Mechanism    Syntax                    Purpose
try/catch    try { } catch e { }       Graceful failure handling
error()      error("msg")              Raise an exception
retry        retry(n, delay: ms) { }   Transient-failure recovery
??           val ?? default            None substitution
?.           obj?.field                Safe field access
guard        guard cond else { }       Precondition assertion

Layer these mechanisms: guard at function entry to reject invalid inputs, ?. and ?? for data access, try/catch around I/O and parsing, and retry for network calls. Together they produce pipelines that finish with partial results instead of crashing with none.

File I/O

Bioinformatics begins and ends with files. Sequencers emit FASTQ, aligners produce BAM, callers output VCF, annotators read GFF. BioLang provides first-class readers and writers for every major format, with consistent interfaces and stream support for files that exceed available memory.

Sample data: BioLang ships with sample files in examples/sample-data/ covering FASTA, FASTQ, VCF, BED, GFF, CSV, TSV, and SAM formats. You can substitute these paths wherever the examples below use bare filenames. Run bl run examples/quickstart.bl to verify all files are present.

Reading FASTA

read_fasta() returns a list of records, each with header and seq fields.

let contigs = read_fasta("assembly.fasta")

for c in contigs {
    print(c.header + ": " + str(len(c.seq)) + " bp")
}

# Find the longest contig
let longest = contigs |> sort(|a, b| len(b.seq) - len(a.seq)) |> first()
print("Longest: " + longest.header + " (" + str(len(longest.seq)) + " bp)")

FASTA records carry the raw sequence as a string. Use bio literals for compile-time validated sequences, and read_fasta for runtime file data.
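To apply sequence operations such as reverse_complement to data read from disk, convert the string into the DNA type with the dna() constructor. A sketch, continuing from the longest-contig example above (dna() is assumed to reject invalid characters, so wrap it in try/catch for untrusted input):

let longest_dna = dna(longest.seq)
let rc = reverse_complement(longest_dna)
print("Reverse complement computed: " + str(len(rc)) + " bp")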

Reading FASTQ

read_fastq() returns records with id, seq, quality, and comment fields. Quality scores are integer Phred values.

let reads = read_fastq("sample_R1.fastq.gz")

let gc_mean = reads |> map(|r| gc_content(r.seq)) |> mean()
let summary = {
    total: len(reads),
    mean_length: reads |> map(|r| len(r.seq)) |> mean(),
    mean_qual: reads |> map(|r| mean(r.quality)) |> mean(),
    gc_pct: gc_mean * 100
}

print("Reads: " + str(summary.total))
print("Mean length: " + str(round(summary.mean_length, 1)))
print("Mean quality: " + str(round(summary.mean_qual, 1)))
print("GC%: " + str(round(summary.gc_pct, 1)))

Compressed files (.gz) are decompressed transparently.

Reading VCF

read_vcf() returns variant records with fields chrom, pos, id, ref, alt, qual, filter, info_str, and genotypes.

let variants = read_vcf("gatk_output.vcf")

let passing = variants |> filter(|v| v.filter == "PASS")
let snps = passing |> filter(|v| is_snp(v))
let indels = passing |> filter(|v| is_indel(v))

print("PASS variants: " + str(len(passing)))
print("  SNPs: " + str(len(snps)))
print("  Indels: " + str(len(indels)))

Reading BED

read_bed() returns interval records with chrom, start, end, and optional name, score, strand fields depending on the number of columns.

let targets = read_bed("exome_targets.bed")

let total_bp = targets |> map(|t| t.end - t.start) |> reduce(0, |a, b| a + b)
print("Total target region: " + str(total_bp) + " bp")
print("Number of targets: " + str(len(targets)))

# Group by chromosome
let by_chrom = targets |> group_by(|t| t.chrom)
for (chrom, regions) in by_chrom {
    let bp = regions |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
    print(chrom + ": " + str(len(regions)) + " targets, " + str(bp) + " bp")
}

Reading GFF/GTF

read_gff() parses GFF3 and GTF formats. Records have chrom, source, type, start, end, score, strand, phase, and attributes (a map).

let features = read_gff("gencode.v44.annotation.gff3")

let genes = features |> filter(|f| f.type == "gene")
let protein_coding = genes |> filter(|g| g.attributes.gene_type == "protein_coding")

print("Total genes: " + str(len(genes)))
print("Protein-coding: " + str(len(protein_coding)))

# Gene length distribution
let lengths = protein_coding |> map(|g| g.end - g.start)
print("Median gene length: " + str(median(lengths)) + " bp")
print("Mean gene length: " + str(round(mean(lengths), 0)) + " bp")

Reading CSV and TSV

csv() and tsv() return tables where column headers become field names.

let metadata = csv("sample_metadata.csv")

# Group samples by condition
let groups = metadata |> group_by(|row| row.condition)

for (condition, samples) in groups {
    let ids = samples |> map(|s| s.sample_id) |> join(", ")
    print(condition + ": " + ids)
}

# DESeq2-style count matrix
let counts = tsv("gene_counts.tsv")
let gene_names = counts |> map(|row| row.gene_id)
print("Genes in count matrix: " + str(len(gene_names)))

Writing Files

Writer functions take a list of records and an output path. The format is determined by the function name.

# Write FASTA
let filtered = contigs |> filter(|c| len(c.seq) >= 500)
write_fasta(filtered, "assembly_filtered.fasta")

# Write FASTQ
let trimmed = reads |> map(|r| trim_quality(r, 20, 0))
write_fastq(trimmed, "trimmed_R1.fastq.gz")

# Write CSV
let stats_table = samples |> map(|s| {
    sample: s.id,
    total_reads: s.read_count,
    pct_mapped: round(s.mapped / s.read_count * 100, 2),
    mean_depth: round(s.depth, 1)
})
write_csv(stats_table, "qc_summary.csv")

The .gz extension triggers gzip compression automatically.
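The same writer call therefore produces plain or compressed output depending only on the output path; for example:

write_fastq(trimmed, "trimmed_R1.fastq")      # plain-text FASTQ
write_fastq(trimmed, "trimmed_R1.fastq.gz")   # gzip-compressed FASTQ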

Stream-Based Reading

For files too large to fit in memory – a 300 GB whole-genome BAM, a multi-sample VCF with millions of records – use streaming. Stream readers return a lazy sequence that is consumed once, record by record.

# Stream a large FASTQ without loading it all
let stream = read_fastq("wgs_sample.fastq.gz")

# Streams work with all HOFs -- processing is lazy
let high_qual_count = stream
    |> filter(|r| mean(r.quality) >= 30)
    |> filter(|r| len(r.seq) >= 100)
    |> len()

print("High-quality long reads: " + str(high_qual_count))

Streams are consumed once. If you need multiple passes, collect into a list or read the file again:

# Two-pass: first pass counts, second pass filters
let total = read_fastq("reads.fastq.gz") |> len()
let passing = read_fastq("reads.fastq.gz")
    |> filter(|r| mean(r.quality) >= 20)
    |> len()

print("Pass rate: " + str(round(passing / total * 100, 2)) + "%")
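The consume-once behaviour mirrors generators in general-purpose languages. A Python sketch, with a hypothetical `read_reads()` standing in for a FASTQ stream, shows the same one-pass semantics:

```python
def read_reads():
    # Hypothetical stand-in for a FASTQ stream: yields (sequence, mean_quality)
    for rec in [("ACGT", 35.0), ("GGCC", 12.0), ("TTAA", 28.0)]:
        yield rec

stream = read_reads()
total = sum(1 for _ in stream)        # first pass consumes the stream
assert total == 3
assert sum(1 for _ in stream) == 0    # exhausted: a second pass sees nothing

# For a second pass, re-create the stream (i.e. re-read the file)
passing = sum(1 for seq, q in read_reads() if q >= 20)
assert passing == 2
```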

Example: Convert FASTQ to FASTA

Strip quality scores from a FASTQ file to produce a FASTA for downstream assembly or BLAST.

let reads = read_fastq("nanopore_reads.fastq.gz")

let fasta_records = reads |> map(|r| {
    # FASTQ records carry an optional comment field (see Format Reference)
    let comment = r.comment ?? ""
    {
        header: if len(comment) > 0 { r.id + " " + comment } else { r.id },
        seq: r.seq
    }
})

write_fasta(fasta_records, "nanopore_reads.fasta")

# fasta_records is a stream and has been consumed by write_fasta,
# so count from the file just written
let n = read_fasta("nanopore_reads.fasta") |> len()
print("Converted " + str(n) + " reads to FASTA")

For large files, this streams through without holding everything in memory because map on a stream produces a stream.

Example: Extract Coding Sequences from GFF + FASTA

Combine a genome FASTA with GFF annotation to extract CDS sequences for each protein-coding gene.

let genome = read_fasta("GRCh38.fasta")
let features = read_gff("gencode.v44.annotation.gff3")

# Build a lookup from chromosome name to sequence
let chrom_seqs = {}
for contig in genome {
    let name = split(contig.header, " ") |> first()
    chrom_seqs = {...chrom_seqs, [name]: contig.seq}
}

# Collect CDS exons grouped by transcript
let cds_features = features
    |> filter(|f| f.type == "CDS")
    |> group_by(|f| f.attributes.transcript_id)

let cds_records = []

for (tx_id, exons) in cds_features {
    let sorted_exons = exons |> sort(|a, b| a.start - b.start)
    let chrom = sorted_exons[0].chrom
    let strand = sorted_exons[0].strand

    guard chrom_seqs[chrom] != None else { continue }

    let seq = sorted_exons
        |> map(|e| slice(chrom_seqs[chrom], e.start - 1, e.end))
        |> join("")

    let final_seq = if strand == "-" { reverse_complement(seq) } else { seq }

    cds_records = cds_records + [{
        header: tx_id + " " + chrom + " " + strand,
        seq: final_seq
    }]
}

write_fasta(cds_records, "coding_sequences.fasta")
print("Extracted " + str(len(cds_records)) + " coding sequences")
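The coordinate handling in this example (GFF's 1-based, inclusive exons sliced from a 0-indexed sequence, then reverse-complemented on the minus strand) can be sketched in Python, assuming plain strings for chromosome sequences:

```python
# Minimal sketch of CDS stitching, assuming GFF-style 1-based inclusive
# exon coordinates and a plain string per chromosome.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def stitch_cds(chrom_seq, exons, strand):
    # exons: list of (start, end) pairs, 1-based and inclusive
    exons = sorted(exons)
    seq = "".join(chrom_seq[start - 1:end] for start, end in exons)
    return reverse_complement(seq) if strand == "-" else seq

chrom = "AAATGGCCTAGTTT"
# exon 4-6 = "TGG", exon 8-10 = "CTA"  (1-based, inclusive)
assert stitch_cds(chrom, [(8, 10), (4, 6)], "+") == "TGGCTA"
assert stitch_cds(chrom, [(4, 6), (8, 10)], "-") == "TAGCCA"
```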

Example: Filter VCF by Quality and Write Passing Variants

Apply multiple quality filters to a VCF and write both passing and failing records to separate files.

let variants = read_vcf("raw_calls.vcf")

let results = variants |> map(|v| {
    let info = parse_vcf_info(v.info_str)
    let dp = info?.DP ?? 0
    let mq = info?.MQ ?? 0.0
    let af = info?.AF ?? 0.0

    let pass = v.qual >= 30.0
        && dp >= 10
        && mq >= 40.0
        && v.filter != "LowQual"

    {...v, qc_pass: pass, depth: dp, map_qual: mq, allele_freq: af}
})

let passing = results |> filter(|v| v.qc_pass)
let failing = results |> filter(|v| !v.qc_pass)

write_vcf(passing, "filtered_PASS.vcf")
write_vcf(failing, "filtered_FAIL.vcf")

print("Total: " + str(len(results)))
print("PASS: " + str(len(passing)))
print("FAIL: " + str(len(failing)))

# Summarise failure reasons
let low_qual = failing |> filter(|v| v.qual < 30.0) |> len()
let low_depth = failing |> filter(|v| v.depth < 10) |> len()
let low_mq = failing |> filter(|v| v.map_qual < 40.0) |> len()

print("Reasons (not mutually exclusive):")
print("  Low QUAL (<30): " + str(low_qual))
print("  Low depth (<10): " + str(low_depth))
print("  Low MQ (<40): " + str(low_mq))

Example: Merge BED Files and Compute Coverage

Load target regions from multiple BED files, merge overlapping intervals, and compute total coverage.

let bed_files = [
    "targets_panel_v1.bed",
    "targets_panel_v2.bed",
    "custom_hotspots.bed"
]

# Read and concatenate all regions
let all_regions = bed_files
    |> flat_map(|path| {
        let regions = read_bed(path)
        print("Loaded " + str(len(regions)) + " regions from " + path)
        regions
    })

print("Total regions before merge: " + str(len(all_regions)))

# Sort by chromosome and start position
let sorted = all_regions
    |> sort(|a, b| {
        if a.chrom != b.chrom {
            compare(a.chrom, b.chrom)
        } else {
            a.start - b.start
        }
    })

# Merge overlapping intervals
fn merge_intervals(intervals) {
    guard len(intervals) > 0 else { return [] }

    let merged = [intervals[0]]

    for region in intervals |> drop(1) {
        let last_region = merged |> last()
        if region.chrom == last_region.chrom && region.start <= last_region.end {
            # Overlapping -- extend the current interval
            let new_end = max(last_region.end, region.end)
            merged = merged |> take(len(merged) - 1)
            merged = merged + [{...last_region, end: new_end}]
        } else {
            merged = merged + [region]
        }
    }
    merged
}

let merged = merge_intervals(sorted)
print("Regions after merge: " + str(len(merged)))

let total_bp = merged |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
print("Total coverage: " + str(total_bp) + " bp (" + str(round(total_bp / 1e6, 2)) + " Mb)")

# Per-chromosome summary
let by_chrom = merged |> group_by(|r| r.chrom)
for (chrom, regions) in by_chrom |> sort(|a, b| compare(a.0, b.0)) {
    let bp = regions |> map(|r| r.end - r.start) |> reduce(0, |a, b| a + b)
    print("  " + chrom + ": " + str(len(regions)) + " regions, " + str(bp) + " bp")
}

# Write merged BED
write_bed(merged, "merged_targets.bed")
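The sort-then-sweep merge used above is a standard algorithm. For comparison, an equivalent Python sketch over plain (chrom, start, end) tuples:

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals; input need not be sorted."""
    merged = []
    for chrom, start, end in sorted(intervals):
        # start <= previous end also merges touching intervals,
        # matching the BioLang version above
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

regions = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 0, 50)]
assert merge_intervals(regions) == [("chr1", 100, 250), ("chr1", 300, 400), ("chr2", 0, 50)]
assert sum(e - s for _, s, e in merge_intervals(regions)) == 300  # total covered bp
```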

Format Reference

| Function | Format | Returns |
|---|---|---|
| read_fasta(path) | FASTA | [{header, seq}] |
| read_fastq(path) | FASTQ | [{id, seq, quality, comment}] |
| read_vcf(path) | VCF | [{chrom, pos, id, ref, alt, qual, filter, info_str, genotypes}] |
| read_bed(path) | BED | [{chrom, start, end, name?, score?, strand?}] |
| read_gff(path) | GFF3/GTF | [{chrom, source, type, start, end, score, strand, phase, attributes}] |
| csv(path) | CSV | [{col1: val, col2: val, ...}] |
| tsv(path) | TSV | [{col1: val, col2: val, ...}] |
| write_fasta(records, path) | FASTA | writes file |
| write_fastq(records, path) | FASTQ | writes file |
| write_vcf(records, path) | VCF | writes file |
| write_bed(records, path) | BED | writes file |
| write_csv(records, path) | CSV | writes file |

All readers accept .gz paths and decompress automatically. All writers compress when the output path ends in .gz.

Genomic Intervals and Variants

Two data types sit at the heart of genomic analysis: intervals (regions on a chromosome) and variants (differences from a reference). BioLang provides built-in types and operations for both, including interval trees for fast overlap queries and variant classification functions that handle the full spectrum from SNPs to complex structural events.

Genomic Intervals

An interval represents a contiguous region on a chromosome. Create one with the interval() constructor.

let promoter = interval("chr1", 11869, 14409)
let exon = interval("chr1", 11869, 12227, strand: "+")

Interval Fields

Every interval has chrom, start, and end. Optional fields include strand, name, and score.

let region = interval("chr7", 55181000, 55182000, strand: "+", name: "EGFR_exon20")

print(region.chrom)   # chr7
print(region.start)   # 55181000
print(region.end)     # 55182000
print(region.strand)  # +
print(region.name)    # EGFR_exon20

let width = region.end - region.start
print("Width: " + str(width) + " bp")

BioLang intervals use 0-based, half-open coordinates (the BED convention): start is included, end is excluded, and width is simply end - start. Note that read_gff records keep GFF's native 1-based, inclusive coordinates, which is why the CDS extraction example subtracts 1 from start before slicing.
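A quick Python illustration of half-open arithmetic and the conversion from 1-based, inclusive coordinates (plain integers, nothing BioLang-specific):

```python
# 0-based, half-open: start included, end excluded, width = end - start.
start, end = 11869, 14409          # BED-style coordinates
width = end - start
assert width == 2540               # no +1 correction needed

# Converting 1-based, inclusive (GFF/VCF style) to 0-based, half-open:
gff_start, gff_end = 11870, 14409  # the same region, 1-based inclusive
assert (gff_start - 1, gff_end) == (start, end)
assert gff_end - gff_start + 1 == width
```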

Interval Arithmetic

Test relationships between intervals directly.

let a = interval("chr1", 100, 200)
let b = interval("chr1", 150, 250)
let c = interval("chr1", 300, 400)

# Overlap test
print(overlaps(a, b))   # true
print(overlaps(a, c))   # false

# Containment
let outer = interval("chr1", 100, 500)
print(contains(outer, c))  # true

# Distance between non-overlapping intervals
print(distance(a, c))   # 100

# Merge overlapping intervals into one
let merged = merge_interval(a, b)
# interval("chr1", 100, 250)

# Intersection
let common = intersect(a, b)
# interval("chr1", 150, 200)
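For readers who want the underlying arithmetic, here is a Python sketch of these predicates over (chrom, start, end) tuples in half-open coordinates:

```python
def overlaps(a, b):
    # Half-open intervals overlap iff each starts before the other ends
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def contains(outer, inner):
    return outer[0] == inner[0] and outer[1] <= inner[1] and inner[2] <= outer[2]

def distance(a, b):
    # Gap between non-overlapping intervals; 0 if they overlap
    if overlaps(a, b):
        return 0
    return max(a[1], b[1]) - min(a[2], b[2])

def intersect(a, b):
    assert overlaps(a, b)
    return (a[0], max(a[1], b[1]), min(a[2], b[2]))

a, b, c = ("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 300, 400)
assert overlaps(a, b) and not overlaps(a, c)
assert contains(("chr1", 100, 500), c)
assert distance(a, c) == 100
assert intersect(a, b) == ("chr1", 150, 200)
```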

Interval Trees

When you have thousands of intervals and need to query overlaps repeatedly, build an interval tree. Construction is O(n log n); each query is O(log n + k) where k is the number of hits.

let exons = read_bed("gencode_exons.bed")
    |> map(|e| interval(e.chrom, e.start, e.end, name: e.name))

let tree = interval_tree(exons)

query_overlaps

Find all intervals in the tree that overlap a query region.

let query = interval("chr17", 43044294, 43044295)  # single position in BRCA1
let hits = query_overlaps(tree, query)

for hit in hits {
    print("Overlaps: " + hit.name + " " + str(hit.start) + "-" + str(hit.end))
}

query_nearest

Find the closest interval to a query, even if there is no overlap.

let snp_pos = interval("chr7", 55181378, 55181379)
let nearest = query_nearest(tree, snp_pos)

print("Nearest feature: " + nearest.name
    + " at distance " + str(distance(snp_pos, nearest)) + " bp")

coverage

Compute per-base coverage across a set of intervals – how many intervals cover each position.

let reads_as_intervals = aligned_reads
    |> map(|r| interval(r.chrom, r.start, r.end))

let tree = interval_tree(reads_as_intervals)
let target = interval("chr17", 43044000, 43045000)

let cov = coverage(tree, target)
print("Mean depth over target: " + str(mean(cov)))
print("Bases >= 30x: " + str(cov |> filter(|d| d >= 30) |> len()))
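The per-base depth computation can be sketched in Python without an interval tree; a linear scan is enough to show the semantics:

```python
def coverage(reads, target):
    """Per-base depth over target=(chrom, start, end), half-open coordinates."""
    chrom, t_start, t_end = target
    depth = [0] * (t_end - t_start)
    for r_chrom, r_start, r_end in reads:
        if r_chrom != chrom:
            continue
        # Clip the read to the target window, then bump each covered base
        lo = max(r_start, t_start) - t_start
        hi = min(r_end, t_end) - t_start
        for i in range(lo, hi):
            depth[i] += 1
    return depth

reads = [("chr17", 0, 6), ("chr17", 4, 10), ("chr1", 0, 100)]
assert coverage(reads, ("chr17", 2, 8)) == [1, 1, 2, 2, 1, 1]
```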

Variants

A variant represents a difference from the reference genome. Create one with the variant() constructor.

let snp = variant("chr7", 55181378, "T", "A")           # EGFR T790M
let del = variant("chr17", 43045684, "TCAA", "T")       # BRCA1 deletion
let ins = variant("chr2", 29416089, "A", "AGCTGCTG")    # ALK insertion

Variant Classification

BioLang provides functions to classify variants by their molecular type.

let v = variant("chr7", 55181378, "T", "A")

print(is_snp(v))           # true
print(is_indel(v))         # false
print(is_transition(v))    # false  (T>A is transversion)
print(is_transversion(v))  # true
print(variant_type(v))     # "Snp"

The full classification set:

let examples = [
    variant("chr1", 100, "A", "G"),      # transition (purine to purine)
    variant("chr1", 200, "A", "T"),      # transversion
    variant("chr1", 300, "ACG", "A"),    # deletion
    variant("chr1", 400, "A", "ATCG"),   # insertion
    variant("chr1", 500, "ACG", "TGA"),  # MNP (multi-nucleotide polymorphism)
]

for v in examples {
    let vtype = variant_type(v)
    let detail = given {
        is_snp(v) && is_transition(v)   => "transition"
        is_snp(v) && is_transversion(v) => "transversion"
        is_indel(v)                     => "indel (" + str(abs(len(v.alt) - len(v.ref))) + "bp)"
        otherwise                       => vtype
    }
    print(v.chrom + ":" + str(v.pos) + " " + v.ref + ">" + v.alt + " => " + detail)
}
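The transition/transversion rule reduces to a purine/pyrimidine class check; a Python sketch:

```python
PURINES = {"A", "G"}  # pyrimidines are C and T

def classify_snp(ref, alt):
    """Classify a single-nucleotide change as transition or transversion."""
    assert len(ref) == len(alt) == 1 and ref != alt
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

assert classify_snp("A", "G") == "transition"    # purine -> purine
assert classify_snp("C", "T") == "transition"    # pyrimidine -> pyrimidine
assert classify_snp("T", "A") == "transversion"  # the T>A example above
```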

Genotype Queries

When a variant carries genotype information (from a VCF), query zygosity.

let variants = read_vcf("sample.vcf")

let het_snps = variants
    |> filter(|v| is_snp(v) && is_het(v))

let hom_alt = variants
    |> filter(|v| is_hom_alt(v))

print("Heterozygous SNPs: " + str(len(het_snps)))
print("Homozygous alt calls: " + str(len(hom_alt)))

# Allele balance for heterozygous calls
let allele_balances = het_snps |> map(|v| {
    let info = parse_vcf_info(v.info_str)
    let ad = info?.AD ?? [0, 0]
    let total = ad |> reduce(0, |a, b| a + b)
    if total > 0 { ad[1] / total } else { 0.0 }
})

print("Median allele balance (het): " + str(round(median(allele_balances), 3)))

parse_vcf_info

The VCF INFO column packs key-value pairs into a semicolon-delimited string. parse_vcf_info() turns it into a record.

let info_str = "DP=42;MQ=60.0;AF=0.48;CLNSIG=Pathogenic;GENEINFO=BRCA1:672"
let info = parse_vcf_info(info_str)

print(info.DP)       # 42
print(info.AF)       # 0.48
print(info.CLNSIG)   # "Pathogenic"
print(info.GENEINFO) # "BRCA1:672"
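The parsing itself is straightforward. A Python sketch that coerces numeric values and treats bare keys as boolean flags (the flag behaviour is an assumption, not documented above):

```python
def parse_vcf_info(info_str):
    """Parse a semicolon-delimited VCF INFO string into a dict.
    Numeric values are coerced; entries with no '=' become True flags."""
    info = {}
    for field in info_str.split(";"):
        if "=" not in field:
            info[field] = True
            continue
        key, value = field.split("=", 1)
        try:
            info[key] = int(value)
        except ValueError:
            try:
                info[key] = float(value)
            except ValueError:
                info[key] = value
    return info

info = parse_vcf_info("DP=42;MQ=60.0;AF=0.48;CLNSIG=Pathogenic;GENEINFO=BRCA1:672")
assert info["DP"] == 42 and info["MQ"] == 60.0 and info["AF"] == 0.48
assert info["CLNSIG"] == "Pathogenic" and info["GENEINFO"] == "BRCA1:672"
```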

Example: Find Genes Overlapping with ChIP-seq Peaks

Identify which genes have promoter regions that overlap with transcription factor binding peaks from a ChIP-seq experiment.

# Load gene annotations
let genes = read_gff("gencode.v44.annotation.gff3")
    |> filter(|f| f.type == "gene" && f.attributes.gene_type == "protein_coding")

# Define promoter regions: 2kb upstream of TSS
let promoters = genes |> map(|g| {
    let tss = if g.strand == "+" { g.start } else { g.end }
    let prom_start = if g.strand == "+" { tss - 2000 } else { tss }
    let prom_end = if g.strand == "+" { tss } else { tss + 2000 }
    interval(g.chrom, max(0, prom_start), prom_end,
             name: g.attributes.gene_name)
})

# Load ChIP-seq peaks
let peaks = read_bed("H3K27ac_peaks.bed")
    |> map(|p| interval(p.chrom, p.start, p.end, name: p.name ?? "peak"))

# Build interval tree from promoters for fast lookup
let promoter_tree = interval_tree(promoters)

# Query each peak against promoters
let genes_with_peaks = []

for peak in peaks {
    let hits = query_overlaps(promoter_tree, peak)
    for hit in hits {
        genes_with_peaks = genes_with_peaks + [{
            gene: hit.name,
            peak_chrom: peak.chrom,
            peak_start: peak.start,
            peak_end: peak.end,
            overlap_bp: intersect(peak, hit) |> width()
        }]
    }
}

# Deduplicate by gene name
let unique_genes = genes_with_peaks
    |> map(|g| g.gene)
    |> unique()

print("ChIP-seq peaks: " + str(len(peaks)))
print("Genes with H3K27ac at promoter: " + str(len(unique_genes)))

# Top genes by peak overlap size
let top = genes_with_peaks
    |> sort(|a, b| b.overlap_bp - a.overlap_bp)
    |> take(20)

for entry in top {
    print(entry.gene + ": " + str(entry.overlap_bp) + " bp overlap at "
        + entry.peak_chrom + ":" + str(entry.peak_start) + "-" + str(entry.peak_end))
}

write_csv(genes_with_peaks, "genes_with_h3k27ac_peaks.csv")
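The strand-dependent promoter arithmetic in this example is easy to get wrong. A Python sketch of the same computation in half-open coordinates:

```python
def promoter(start, end, strand, upstream=2000):
    """Region 2 kb upstream of the TSS, clamped at 0 (half-open coordinates)."""
    # TSS is the gene start on the + strand, the gene end on the - strand
    tss = start if strand == "+" else end
    if strand == "+":
        return (max(0, tss - upstream), tss)
    return (tss, tss + upstream)

assert promoter(11869, 14409, "+") == (9869, 11869)
assert promoter(11869, 14409, "-") == (14409, 16409)
assert promoter(500, 3000, "+") == (0, 500)   # clamped at chromosome start
```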

Example: Classify Variants and Generate a Ti/Tv Report

Transition/transversion ratio (Ti/Tv) is a key quality metric for SNP calling. Whole-genome sequencing typically yields Ti/Tv around 2.0-2.1; exome around 2.8-3.0. Deviations suggest systematic errors.

let variants = read_vcf("deepvariant_output.vcf")

# Keep only PASS SNPs
let pass_snps = variants
    |> filter(|v| v.filter == "PASS" && is_snp(v))

# Classify each SNP
let classified = pass_snps |> map(|v| {
    let cls = if is_transition(v) { "Ti" } else { "Tv" }
    let change = v.ref + ">" + v.alt
    {chrom: v.chrom, pos: v.pos, change: change, class: cls, qual: v.qual}
})

let ti_count = classified |> filter(|v| v.class == "Ti") |> len()
let tv_count = classified |> filter(|v| v.class == "Tv") |> len()
let ratio = if tv_count > 0 { ti_count / tv_count } else { 0.0 }

print("Transitions: " + str(ti_count))
print("Transversions: " + str(tv_count))
print("Ti/Tv ratio: " + str(round(ratio, 3)))

# Per-chromosome Ti/Tv
let by_chrom = classified |> group_by(|v| v.chrom)

let chrom_report = by_chrom |> map(|group| {
    let chrom = group.0
    let vars = group.1
    let ti = vars |> filter(|v| v.class == "Ti") |> len()
    let tv = vars |> filter(|v| v.class == "Tv") |> len()
    {
        chrom: chrom,
        ti: ti,
        tv: tv,
        ratio: if tv > 0 { round(ti / tv, 3) } else { 0.0 },
        total: ti + tv
    }
})

let sorted_report = chrom_report |> sort(|a, b| compare(a.chrom, b.chrom))

print("\nPer-chromosome Ti/Tv:")
print("Chrom\tTi\tTv\tRatio\tTotal")
for row in sorted_report {
    print(row.chrom + "\t" + str(row.ti) + "\t" + str(row.tv)
        + "\t" + str(row.ratio) + "\t" + str(row.total))
}

# Substitution spectrum: count each of the 12 possible changes
let spectrum = classified
    |> group_by(|v| v.change)
    |> map(|g| {change: g.0, count: len(g.1)})
    |> sort(|a, b| b.count - a.count)

print("\nSubstitution spectrum:")
for entry in spectrum {
    let bar = str_repeat("*", entry.count / 100)
    print(entry.change + "\t" + str(entry.count) + "\t" + bar)
}

write_csv(sorted_report, "titv_per_chromosome.csv")
write_csv(spectrum, "substitution_spectrum.csv")

Example: Annotate Variants with Overlapping Regulatory Regions

Given a set of variants and a regulatory region BED file (e.g., ENCODE cCREs), annotate each variant with the regulatory elements it falls within.

# Load regulatory regions from ENCODE
let regulatory = read_bed("ENCODE_cCREs.bed")
    |> map(|r| interval(r.chrom, r.start, r.end, name: r.name ?? "cCRE"))

let reg_tree = interval_tree(regulatory)

# Load variants
let variants = read_vcf("gwas_hits.vcf")

# Annotate each variant with overlapping regulatory regions
let annotated = variants |> map(|v| {
    let pos_interval = interval(v.chrom, v.pos - 1, v.pos + len(v.ref) - 1)
    let overlapping = query_overlaps(reg_tree, pos_interval)
    let nearest = query_nearest(reg_tree, pos_interval)

    let reg_names = overlapping |> map(|r| r.name) |> join(";")
    let nearest_name = nearest?.name ?? "none"
    let nearest_dist = if len(overlapping) > 0 {
        0
    } else {
        distance(pos_interval, nearest)
    }

    {
        chrom: v.chrom,
        pos: v.pos,
        ref: v.ref,
        alt: v.alt,
        in_regulatory: len(overlapping) > 0,
        regulatory_elements: if len(reg_names) > 0 { reg_names } else { "none" },
        num_overlapping: len(overlapping),
        nearest_element: nearest_name,
        distance_to_nearest: nearest_dist
    }
})

# Summary statistics
let in_reg = annotated |> filter(|v| v.in_regulatory) |> len()
let total = len(annotated)
print("Variants in regulatory regions: " + str(in_reg) + "/" + str(total)
    + " (" + str(round(in_reg / total * 100, 1)) + "%)")

# Distribution of distance to nearest regulatory element
let distances = annotated
    |> filter(|v| !v.in_regulatory)
    |> map(|v| v.distance_to_nearest)

print("Non-regulatory variants:")
print("  Median distance to nearest cCRE: " + str(median(distances)) + " bp")
print("  Within 1kb of cCRE: " + str(distances |> filter(|d| d <= 1000) |> len()))
print("  Within 10kb of cCRE: " + str(distances |> filter(|d| d <= 10000) |> len()))

# Variants overlapping multiple regulatory elements
let multi = annotated |> filter(|v| v.num_overlapping > 1)
print("\nVariants in multiple regulatory regions: " + str(len(multi)))
for v in multi |> take(10) {
    print("  " + v.chrom + ":" + str(v.pos) + " overlaps " + str(v.num_overlapping)
        + " elements: " + v.regulatory_elements)
}

write_csv(annotated, "variants_regulatory_annotation.csv")

Summary

Interval Operations

| Function | Description |
|---|---|
| interval(chrom, start, end) | Create an interval |
| overlaps(a, b) | Test for overlap |
| contains(a, b) | Test containment |
| distance(a, b) | Gap between non-overlapping intervals |
| intersect(a, b) | Overlapping portion |
| merge_interval(a, b) | Union of overlapping pair |
| interval_tree(list) | Build tree for fast queries |
| query_overlaps(tree, q) | All intervals overlapping q |
| query_nearest(tree, q) | Closest interval to q |
| coverage(tree, region) | Per-base depth array |

Variant Operations

| Function | Description |
|---|---|
| variant(chrom, pos, ref, alt) | Create a variant |
| variant_type(v) | "Snp", "Indel", "Mnp", or "Other" |
| is_snp(v) | Single nucleotide change |
| is_indel(v) | Insertion or deletion |
| is_transition(v) | Purine-purine or pyrimidine-pyrimidine |
| is_transversion(v) | Purine-pyrimidine or vice versa |
| is_het(v) | Heterozygous genotype |
| is_hom_ref(v) | Homozygous reference |
| is_hom_alt(v) | Homozygous alternate |
| parse_vcf_info(str) | Parse INFO column to record |

Intervals and variants are the coordinate system of genomics. Master these types and you can express peak-calling, variant annotation, coverage analysis, and regulatory overlap queries concisely and efficiently.

Chapter 11: Pipelines

Bioinformatics workflows are rarely a single step. A typical analysis chains together quality control, alignment, variant calling, filtering, and annotation into a directed acyclic graph of stages. BioLang provides first-class pipeline blocks that make these multi-stage workflows explicit, reproducible, and parallel-aware.

Pipeline Blocks

A pipeline block declares a named workflow composed of stage blocks. Stages execute sequentially by default, and each stage can reference the output of the previous stage.

pipeline fastq_qc {
  stage raw_stats {
    read_fastq("sample_R1.fastq.gz")
      |> fastq_stats()
  }

  stage filter {
    read_fastq("sample_R1.fastq.gz")
      |> filter(|r| mean(r.quality) >= 30)
      |> write_fastq("sample_R1.filtered.fastq.gz")
  }

  stage filtered_stats {
    read_fastq("sample_R1.filtered.fastq.gz")
      |> fastq_stats()
  }
}

When you invoke fastq_qc, BioLang runs raw_stats, then filter, then filtered_stats in order. The return value of the pipeline is the result of the last stage.

Passing Data Between Stages

Stages can capture the result of a previous stage by name. The stage name becomes a binding available to all subsequent stages.

pipeline align_and_sort {
  stage alignment {
    let ref = "GRCh38.fa"
    let r1 = "tumor_R1.fastq.gz"
    let r2 = "tumor_R2.fastq.gz"
    bwa_mem(ref, r1, r2, threads: 16)
  }

  stage sorted {
    # alignment is the output path from the previous stage
    samtools_sort(alignment, threads: 8)
  }

  stage indexed {
    samtools_index(sorted)
    sorted  # return the sorted BAM path
  }
}

Each stage name resolves to whatever its block evaluated to. This creates an implicit dependency chain without manual wiring.

Parallel Execution

When you need to process multiple samples, parallel for fans out work across available cores. BioLang schedules iterations concurrently, respecting system resources.

let samples = [
  {id: "tumor_01", r1: "t01_R1.fq.gz", r2: "t01_R2.fq.gz"},
  {id: "tumor_02", r1: "t02_R1.fq.gz", r2: "t02_R2.fq.gz"},
  {id: "tumor_03", r1: "t03_R1.fq.gz", r2: "t03_R2.fq.gz"},
  {id: "normal_01", r1: "n01_R1.fq.gz", r2: "n01_R2.fq.gz"},
]

parallel for sample in samples {
  let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 4)
  let sorted = samtools_sort(bam, threads: 4)
  samtools_index(sorted)
  {id: sample.id, bam: sorted}
}

The result is a list of records, one per sample, collected in the order the iterations complete.
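The fan-out pattern corresponds to a worker pool in general-purpose languages. A Python sketch using concurrent.futures, collecting results in completion order as described above (`process` is a stand-in for the align/sort/index body):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(sample):
    # Stand-in for the bwa_mem / samtools_sort / samtools_index steps
    return {"id": sample["id"], "bam": sample["id"] + ".sorted.bam"}

samples = [{"id": "tumor_01"}, {"id": "tumor_02"}, {"id": "normal_01"}]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process, s) for s in samples]
    # One record per sample, in the order the iterations complete
    results = [f.result() for f in as_completed(futures)]

assert sorted(r["id"] for r in results) == ["normal_01", "tumor_01", "tumor_02"]
```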

Stage Dependencies and DAG Execution

BioLang infers a dependency graph from which stage names appear in each block. Independent stages run concurrently when the runtime detects no data dependencies.

pipeline variant_qc {
  stage aligned {
    bwa_mem("GRCh38.fa", "sample_R1.fq.gz", "sample_R2.fq.gz")
      |> samtools_sort(threads: 8)
  }

  # These two stages depend only on aligned, not on each other.
  # BioLang runs them concurrently.
  stage depth {
    samtools_depth(aligned) |> mean()
  }

  stage flagstat {
    samtools_flagstat(aligned)
  }

  # This stage depends on both depth and flagstat
  stage report {
    {
      mean_depth: depth,
      mapping_rate: flagstat.mapped_pct,
      sample: "sample_01",
    }
  }
}

The runtime builds a DAG: aligned has no dependencies, depth and flagstat both depend on aligned (and thus run in parallel once it finishes), and report depends on both.
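The inference described here amounts to a topological execution of the stage graph. A toy Python scheduler makes the ordering concrete, using function parameter names as the stage references; this illustrates the scheduling order only, not BioLang's implementation:

```python
import inspect

def run_dag(stages):
    """Run named stages; each stage's dependencies are the stage names
    it declares as parameters. Ready stages are independent of each other."""
    results, remaining = {}, dict(stages)
    while remaining:
        # A stage is ready once every stage it references has a result
        ready = [n for n, fn in remaining.items()
                 if set(inspect.signature(fn).parameters) <= set(results)]
        assert ready, "cycle or missing dependency"
        for name in ready:  # independent ready stages could run concurrently
            fn = remaining.pop(name)
            deps = {p: results[p] for p in inspect.signature(fn).parameters}
            results[name] = fn(**deps)
    return results

out = run_dag({
    "aligned":  lambda: "sample.bam",
    "depth":    lambda aligned: 31.5,
    "flagstat": lambda aligned: {"mapped_pct": 98.2},
    "report":   lambda depth, flagstat: {"mean_depth": depth,
                                         "mapping_rate": flagstat["mapped_pct"]},
})
assert out["report"] == {"mean_depth": 31.5, "mapping_rate": 98.2}
```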

Defer for Cleanup

Intermediate files pile up fast. defer registers a cleanup action that runs when the pipeline completes, whether it succeeds or fails. Multiple defer blocks execute in reverse order (LIFO).

pipeline trimmed_alignment {
  stage trim {
    let trimmed_r1 = "tmp/trimmed_R1.fq.gz"
    let trimmed_r2 = "tmp/trimmed_R2.fq.gz"
    trim_adapters("sample_R1.fq.gz", "sample_R2.fq.gz",
                  out_r1: trimmed_r1, out_r2: trimmed_r2,
                  adapter: "AGATCGGAAGAG")
    defer { remove(trimmed_r1); remove(trimmed_r2) }
    {r1: trimmed_r1, r2: trimmed_r2}
  }

  stage align {
    let bam = bwa_mem("GRCh38.fa", trim.r1, trim.r2, threads: 16)
    let unsorted = bam
    let sorted = samtools_sort(bam, threads: 8)
    defer { remove(unsorted) }
    sorted
  }

  stage mark_dups {
    let deduped = gatk_mark_duplicates(align)
    let metrics = align + ".dup_metrics"
    defer { remove(metrics) }
    deduped
  }
}

When trimmed_alignment finishes, the deferred blocks fire: the duplicate metrics file is removed first, then the unsorted BAM, then the trimmed FASTQs.
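The LIFO semantics are the same as a stack of cleanup actions unwound in a finally block, whether the body succeeds or raises. A Python sketch:

```python
cleanup_log = []

def run_pipeline():
    deferred = []
    try:
        deferred.append(lambda: cleanup_log.append("trimmed fastqs"))
        deferred.append(lambda: cleanup_log.append("unsorted bam"))
        deferred.append(lambda: cleanup_log.append("dup metrics"))
        return "deduped.bam"
    finally:
        # LIFO: the last-registered action fires first, on success or failure
        for action in reversed(deferred):
            action()

assert run_pipeline() == "deduped.bam"
assert cleanup_log == ["dup metrics", "unsorted bam", "trimmed fastqs"]
```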

Example: RNA-seq Processing Pipeline

A complete RNA-seq workflow from raw reads to normalized expression counts.

pipeline rnaseq {
  stage qc {
    let report = fastqc("rnaseq_R1.fq.gz", "rnaseq_R2.fq.gz",
                         out_dir: "qc/")
    let stats = read_fastq("rnaseq_R1.fq.gz") |> fastq_stats()
    defer { log("QC complete: " + str(stats.total_reads) + " reads") }
    {report: report, stats: stats}
  }

  stage trim {
    let trimmed = trim_adapters(
      "rnaseq_R1.fq.gz", "rnaseq_R2.fq.gz",
      out_r1: "trimmed_R1.fq.gz",
      out_r2: "trimmed_R2.fq.gz",
      quality: 20,
      min_length: 36
    )
    defer { remove("trimmed_R1.fq.gz"); remove("trimmed_R2.fq.gz") }
    trimmed
  }

  stage align {
    star_align(
      index: "star_index/GRCh38",
      r1: "trimmed_R1.fq.gz",
      r2: "trimmed_R2.fq.gz",
      out_prefix: "aligned/rnaseq_",
      threads: 16,
      two_pass: true
    )
  }

  stage quantify {
    featurecounts(
      bam: align,
      annotation: "gencode.v44.gtf",
      paired: true,
      strand: "reverse",
      threads: 8
    )
  }

  stage normalize {
    let raw_counts = read_tsv(quantify)
    let size_factors = raw_counts
      |> columns(exclude: ["gene_id", "gene_name"])
      |> each(|col| median(col))
      |> |medians| median(medians) / medians   # per-sample scale factor: global median / sample median

    raw_counts
      |> map_columns(exclude: ["gene_id", "gene_name"],
                     |col, i| col * size_factors[i])
      |> write_tsv("normalized_counts.tsv")
  }
}
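The normalize stage implements a median-of-medians scaling, a simplification of DESeq2's median-of-ratios method. The arithmetic in Python:

```python
from statistics import median

def size_factors(count_columns):
    """One factor per sample: global median of column medians / column median."""
    medians = [median(col) for col in count_columns]
    global_median = median(medians)
    return [global_median / m for m in medians]

counts = [[10, 20, 30], [20, 40, 60], [5, 10, 15]]  # three samples (columns)
factors = size_factors(counts)
assert factors == [1.0, 0.5, 2.0]

# After scaling, every sample's median count matches the global median
normalized = [[c * f for c in col] for col, f in zip(counts, factors)]
assert [median(col) for col in normalized] == [20, 20, 20]
```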

Example: Variant Calling Pipeline

Germline variant calling from FASTQ to annotated VCF.

pipeline germline_variants {
  stage align {
    let ref = "GRCh38.fa"
    bwa_mem(ref, "sample_R1.fq.gz", "sample_R2.fq.gz",
            threads: 16, read_group: "@RG\\tID:S1\\tSM:sample")
      |> samtools_sort(threads: 8)
      |> samtools_index()
  }

  stage mark_dups {
    gatk_mark_duplicates(align)
  }

  stage bqsr {
    let known_sites = [
      "dbsnp_156.vcf.gz",
      "Mills_and_1000G.indels.vcf.gz",
    ]
    gatk_base_recalibrator(mark_dups, ref: "GRCh38.fa",
                           known_sites: known_sites)
      |> gatk_apply_bqsr(mark_dups, ref: "GRCh38.fa")
  }

  stage call {
    gatk_haplotype_caller(bqsr, ref: "GRCh38.fa",
                          emit_ref_confidence: "GVCF",
                          intervals: "exome_targets.bed")
  }

  stage genotype {
    gatk_genotype_gvcfs(call, ref: "GRCh38.fa")
  }

  stage filter {
    let snps = gatk_select_variants(genotype, type: "SNP")
      |> gatk_filter(filters: [
           {name: "QD", expr: "QD < 2.0"},
           {name: "FS", expr: "FS > 60.0"},
           {name: "MQ", expr: "MQ < 40.0"},
         ])

    let indels = gatk_select_variants(genotype, type: "INDEL")
      |> gatk_filter(filters: [
           {name: "QD", expr: "QD < 2.0"},
           {name: "FS", expr: "FS > 200.0"},
         ])

    bcftools_concat(snps, indels) |> bcftools_sort()
  }

  stage annotate {
    ensembl_vep(filter,
                species: "human",
                assembly: "GRCh38",
                cache_dir: "vep_cache/",
                plugins: ["CADD", "SpliceAI"])
      |> write("sample.annotated.vcf.gz")
  }
}

Example: Multi-Sample Parallel Processing with Merge

Processing a cohort in parallel, then merging results in a final stage.

let cohort = read_tsv("sample_manifest.tsv")
  |> map(|row| {
       id: row.sample_id,
       r1: row.fastq_r1,
       r2: row.fastq_r2,
       type: row.sample_type,
     })

pipeline cohort_analysis {
  stage per_sample {
    parallel for sample in cohort {
      let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2,
                        threads: 4,
                        read_group: "@RG\\tID:" + sample.id + "\\tSM:" + sample.id)
        |> samtools_sort(threads: 4)

      samtools_index(bam)

      let gvcf = gatk_haplotype_caller(bam, ref: "GRCh38.fa",
                                        emit_ref_confidence: "GVCF")

      let stats = samtools_flagstat(bam)
      let depth = samtools_depth(bam) |> mean()

      {
        id: sample.id,
        bam: bam,
        gvcf: gvcf,
        mapped_pct: stats.mapped_pct,
        mean_depth: depth,
      }
    }
  }

  stage qc_gate {
    # Fail the pipeline if any sample has poor mapping
    let failed = per_sample
      |> filter(|s| s.mapped_pct < 90.0 || s.mean_depth < 20.0)

    if len(failed) > 0 {
      error("QC failed for: " + (failed |> map(|s| s.id) |> join(", ")))
    }

    per_sample
  }

  stage joint_genotype {
    let gvcfs = qc_gate |> map(|s| s.gvcf)
    gatk_combine_gvcfs(gvcfs, ref: "GRCh38.fa")
      |> gatk_genotype_gvcfs(ref: "GRCh38.fa")
  }

  stage filter_and_annotate {
    joint_genotype
      |> gatk_vqsr(ref: "GRCh38.fa",
                   resources: ["hapmap.vcf.gz", "1000G_omni.vcf.gz"])
      |> ensembl_vep(species: "human", assembly: "GRCh38")
      |> write("cohort.annotated.vcf.gz")
  }

  stage summary {
    let variant_counts = read_vcf("cohort.annotated.vcf.gz")
      |> group_by(|v| parse_vcf_info(v.info_str)?.consequence ?? "unknown")
      |> map_values(|vs| len(vs))

    let sample_stats = qc_gate
      |> map(|s| {id: s.id, depth: s.mean_depth, mapped: s.mapped_pct})

    {
      n_samples: len(cohort),
      variant_counts: variant_counts,
      sample_stats: sample_stats,
    }
      |> write_json("cohort_summary.json")
  }
}

Parameterized Pipelines (Templates)

A pipeline can accept parameters, turning it into a reusable template. When you declare pipeline name(param1, param2) { ... }, BioLang defines a callable function instead of executing immediately. You invoke it like any other function.

# Define a reusable alignment template
pipeline align_sample(sample_id, r1, r2, reference) {
  stage aligned {
    shell("bwa-mem2 mem -t 16 " + reference + " " + r1 + " " + r2
          + " | samtools sort -@ 8 -o " + sample_id + ".sorted.bam")
    sample_id + ".sorted.bam"
  }

  stage indexed {
    shell("samtools index " + aligned)
    aligned
  }
}

# Call with different inputs — same pipeline, different data
let tumor = align_sample("tumor", "tumor_R1.fq.gz", "tumor_R2.fq.gz", "GRCh38.fa")
let normal = align_sample("normal", "normal_R1.fq.gz", "normal_R2.fq.gz", "GRCh38.fa")
print("Tumor BAM: " + tumor)
print("Normal BAM: " + normal)

You can loop over a sample list to process a whole cohort with the same template:

let samples = [
  {id: "S1", r1: "S1_R1.fq.gz", r2: "S1_R2.fq.gz"},
  {id: "S2", r1: "S2_R1.fq.gz", r2: "S2_R2.fq.gz"},
  {id: "S3", r1: "S3_R1.fq.gz", r2: "S3_R2.fq.gz"},
]

let bams = samples |> map(|s| align_sample(s.id, s.r1, s.r2, "GRCh38.fa"))
print("Aligned " + str(len(bams)) + " samples")

How it works:

  • pipeline name { ... } (no params) — executes immediately, result bound to name
  • pipeline name(params) { ... } — defines a callable function; nothing runs until you call it
  • The parameter list uses the same syntax as function parameters

Pipeline Composition

Pipelines are values. You can call one pipeline from within another or store its result for downstream use.

pipeline align_one(sample) {
  stage bam {
    bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 8)
      |> samtools_sort(threads: 4)
  }
  stage index {
    samtools_index(bam)
    bam
  }
}

pipeline somatic_pair(tumor, normal) {
  stage tumor_bam { align_one(tumor) }
  stage normal_bam { align_one(normal) }

  stage call {
    mutect2(tumor: tumor_bam, normal: normal_bam,
            ref: "GRCh38.fa",
            pon: "pon.vcf.gz",
            germline: "af-only-gnomad.vcf.gz")
  }

  stage filter {
    gatk_filter_mutect(call, ref: "GRCh38.fa")
  }
}

# Run the composed pipeline
let result = somatic_pair(
  {r1: "tumor_R1.fq.gz", r2: "tumor_R2.fq.gz"},
  {r1: "normal_R1.fq.gz", r2: "normal_R2.fq.gz"}
)
print("Somatic variants written to: " + result)

Summary

Pipeline blocks give you named, composable workflows with automatic dependency tracking. Stages pass data forward by name, parallel for fans out across samples, and defer keeps your working directory clean. The runtime infers a DAG from stage references, running independent stages concurrently while respecting data dependencies.
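For instance, two stages that do not reference each other become independent branches of the inferred DAG; only a stage that names an upstream result has to wait (helper calls as in the earlier examples):

pipeline qc_and_align(sample) {
  # Independent stages: neither references the other, so both may run concurrently
  stage qc {
    shell("fastqc " + sample.r1 + " " + sample.r2)
  }
  stage bam {
    bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 8)
      |> samtools_sort(threads: 4)
  }

  # References bam, so it runs only after the bam stage completes
  stage indexed {
    samtools_index(bam)
    bam
  }
}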

nf-core Integration

Why nf-core in a DSL

nf-core is the largest curated collection of bioinformatics pipelines, maintained by a global community and covering over 100 workflows – from bulk RNA-seq to rare-disease variant calling. Every pipeline follows strict standards: peer review, continuous integration, container packaging, and a machine-readable parameter schema.

BioLang exposes nf-core as a set of built-in functions. There is nothing to import and nothing to install. You can browse, search, inspect, and compare nf-core pipelines from the same scripts that process your FASTQ files and call variants. The five builtins – nfcore_list, nfcore_search, nfcore_info, nfcore_releases, and nfcore_params – turn the nf-core catalog into a queryable data source that fits naturally into pipes, filters, and record operations.

Browsing the Catalog

nfcore_list() returns every pipeline in the nf-core organization. Each entry is a record with fields like name, description, stars, and topics.

# List all nf-core pipelines
nfcore_list()

The result is a list of records. You can pipe it through any BioLang operation:

# Show names and star counts, sorted by popularity
nfcore_list(sort_by: "stars")
  |> each |p| { name: p.name, stars: p.stars }

To limit the result set, pass limit. This is useful in exploratory sessions where you want a quick overview rather than the full catalog:

# Top 10 pipelines by most recent release date
nfcore_list(sort_by: "release", limit: 10)
  |> each |p| { name: p.name, updated: p.updated }

Sorting accepts three values:

  • "stars" – community popularity (default)
  • "name" – alphabetical
  • "release" – most recently updated first

Searching Pipelines

When you know what kind of analysis you need but not which pipeline implements it, use nfcore_search. It matches against pipeline names, descriptions, and topic tags.

# Find RNA-seq related pipelines
nfcore_search("rnaseq")

Search accepts an optional limit to cap results:

# Find variant-calling pipelines, show top 5
nfcore_search("variant", 5)

The returned records have the same shape as nfcore_list entries, so you can chain the same downstream operations:

# Search for methylation pipelines and extract their topics
nfcore_search("methylation")
  |> each |p| { name: p.name, topics: p.topics }

Pipeline Details

nfcore_info returns a single record with comprehensive metadata for a named pipeline. The returned record contains:

Field        Type    Description
name         String  Short pipeline name
full_name    String  GitHub-qualified name (nf-core/…)
description  String  One-line summary
stars        Int     GitHub star count
url          String  Repository URL
license      String  License identifier (usually MIT)
topics       List    GitHub topic tags
open_issues  Int     Current open issue count
created      String  Repository creation date
updated      String  Last push date

# Inspect the Sarek germline/somatic variant calling pipeline
let sarek = nfcore_info("sarek")
print("Pipeline: " + sarek.full_name)
print("License:  " + sarek.license)
print("Stars:    " + to_string(sarek.stars))
print("Topics:   " + join(sarek.topics, ", "))

You can use the metadata to make decisions in scripts. For example, checking whether a pipeline is actively maintained before committing to it:

# Use the open issue count as a rough proxy for maintenance health
let info = nfcore_info("taxprofiler")
if info.open_issues < 50 then
  print(info.name + " looks actively maintained")

Releases and Versions

Production workflows should pin to a specific release rather than tracking the development branch. nfcore_releases returns the full release history as a list of records, each with tag and published_at fields.

# List all rnaseq releases
nfcore_releases("rnaseq")

The list is ordered newest-first, so the head element is the latest stable version:

# Get the latest stable release tag for rnaseq
let latest = nfcore_releases("rnaseq")
  |> first
print("Latest release: " + latest.tag + " (" + latest.published_at + ")")

You can also filter for a specific version series, or count how many releases a pipeline has had – a rough proxy for maturity:

# Count total releases for sarek
let release_count = nfcore_releases("sarek") |> length
print("sarek has " + to_string(release_count) + " releases")
# Find all 3.x releases of rnaseq
nfcore_releases("rnaseq")
  |> filter |r| starts_with(r.tag, "3.")
  |> each |r| r.tag

Parameter Schemas

Every nf-core pipeline publishes a JSON schema describing its configurable parameters – input paths, reference genomes, trimming options, resource limits, and more. nfcore_params fetches and parses this schema, returning a record whose keys are parameter group names and whose values are records of individual parameters.

# Fetch the full parameter schema for rnaseq
let params = nfcore_params("rnaseq")

The top-level keys are group names such as "input_output_options", "reference_genome_options", or "trimming_options". Each group is a record of parameter entries:

# List all parameter groups
nfcore_params("rnaseq")
  |> keys

# Inspect the reference genome options
nfcore_params("rnaseq").reference_genome_options
  |> keys

This is particularly useful for validating that your configuration covers the required parameters before submitting a long-running pipeline:

# Check whether a specific parameter exists in the schema
let params = nfcore_params("sarek")
let genome_opts = params.reference_genome_options
if contains(keys(genome_opts), "fasta") then
  print("fasta is a recognized reference genome parameter")
else
  print("fasta is not in the sarek schema")

Example: Finding the Right Pipeline

A common task: you need to process single-cell RNA-seq data but are unsure which nf-core pipeline to use. Search the catalog, compare candidates, and pick the best fit.

# Step 1: Search for single-cell pipelines
let candidates = nfcore_search("single-cell")

# Step 2: Show names, stars, and descriptions side by side
candidates
  |> each |p| {
    name: p.name,
    stars: p.stars,
    description: p.description
  }
  |> sort_by |a, b| b.stars - a.stars
  |>> each |p| print(p.name + " (" + to_string(p.stars) + " stars): " + p.description)

# Step 3: Get detailed info on the top candidate
let top = candidates
  |> sort_by |a, b| b.stars - a.stars
  |> first

let info = nfcore_info(top.name)
print("Selected: " + info.full_name)
print("License:  " + info.license)
print("Topics:   " + join(info.topics, ", "))

# Step 4: Check the latest release
let latest = nfcore_releases(top.name) |> first
print("Latest release: " + latest.tag + " published " + latest.published_at)

# Step 5: Preview the parameter groups
nfcore_params(top.name)
  |> keys
  |>> each |group| print("  - " + group)

This entire exploration runs in a single script – no switching between a browser, the nf-core CLI tool, and your pipeline configuration files.

Example: Auditing Pipeline Parameters

Before launching a whole-genome sequencing analysis with Sarek, you want to confirm that your reference genome configuration covers every required parameter and that no deprecated options have crept into your config.

# Fetch the Sarek parameter schema
let params = nfcore_params("sarek")

# Extract all reference genome parameters
let ref_params = params.reference_genome_options

# Your current configuration
let my_config = {
  genome: "GRCh38",
  igenomes_base: "s3://ngi-igenomes/igenomes",
  fasta: nil,
  fasta_fai: nil
}

# Check for parameters in the schema that you have not configured
let schema_keys = keys(ref_params)
let config_keys = keys(my_config)

let missing = schema_keys
  |> filter |k| not(contains(config_keys, k))

if length(missing) > 0 then
  print("WARNING: unset reference genome parameters:")
  missing |>> each |k| print("  - " + k)
else
  print("All reference genome parameters are covered.")

# Check for keys in your config that are not in the schema (possibly deprecated)
let extra = config_keys
  |> filter |k| not(contains(schema_keys, k))

if length(extra) > 0 then
  print("WARNING: config keys not in current schema (possibly deprecated):")
  extra |>> each |k| print("  - " + k)

This catches configuration drift early – before you discover it four hours into a whole-genome analysis run.

Example: Building a Pipeline Registry

For a core facility managing dozens of active projects, it helps to maintain a catalog of available pipelines organized by topic. This script builds a topic index from the full nf-core catalog.

# Fetch all pipelines
let all_pipelines = nfcore_list(sort_by: "stars")

# Build a topic-to-pipeline mapping
# Flatten: for each pipeline, emit one record per topic tag
let entries = all_pipelines
  |> flat_map |p| p.topics |> each |t| { topic: t, name: p.name, stars: p.stars }

# Group by topic
let topics = entries
  |> group_by |e| e.topic

# Print a summary: topics with 3 or more pipelines
keys(topics)
  |> filter |t| length(topics[t]) >= 3
  |> sort
  |>> each |t| {
    let pipelines = topics[t]
      |> sort_by |a, b| b.stars - a.stars
      |> each |p| p.name
    print(t + " (" + to_string(length(pipelines)) + " pipelines): " + join(pipelines, ", "))
  }

You can extend this to generate a Markdown report, push to a shared wiki, or feed into a lab notebook:

# Export the top 20 pipelines as a tab-separated table
let header = "name\tstars\ttopics\tlatest_release"
print(header)

nfcore_list(sort_by: "stars", limit: 20)
  |>> each |p| {
    let latest = nfcore_releases(p.name) |> first
    let tag = if latest != nil then latest.tag else "unreleased"
    let topic_str = join(p.topics, ";")
    print(p.name + "\t" + to_string(p.stars) + "\t" + topic_str + "\t" + tag)
  }

Redirect stdout to a file and you have a pipeline inventory that updates every time you run the script – no manual curation needed.

Parsing Nextflow Files

BioLang can parse Nextflow .nf files directly with nf_parse(). This function reads the file and extracts its structure without requiring a Nextflow runtime.

# Parse a Nextflow pipeline file
let parsed = nf_parse("main.nf")

The result is a record with five fields:

Field      Type          Content
params     Record        Parameter name-value pairs
processes  List[Record]  Process name, inputs, outputs, script, container, cpus, memory
includes   List[Record]  Include name, alias, source path
workflow   List[String]  Workflow block lines
dsl        String        DSL version ("DSL1" or "DSL2")

You can pipe the parsed structure through any BioLang operation:

# List all processes and their containers
let parsed = nf_parse("variant_calling.nf")

parsed.processes
  |> each |p| {
    print(p.name + ": " + if p.container != nil then p.container else "no container")
  }

# Extract all parameters with their default values
let parsed = nf_parse("rnaseq.nf")

keys(parsed.params)
  |> sort
  |>> each |k| print(k + " = " + to_string(parsed.params[k]))

Generating BioLang Code

The nf_to_bl function takes a parsed Nextflow record and generates BioLang pipeline code as a string:

# Parse and convert a Nextflow pipeline
let parsed = nf_parse("rnaseq.nf")
let bl_code = nf_to_bl(parsed)

# Print the generated code
print(bl_code)

# Save to a file
write_file("rnaseq.bl", bl_code)

The generated code maps Nextflow constructs to BioLang equivalents:

  • params.* become top-level variables
  • process blocks become pipeline { stage ... } blocks
  • Container, CPU, and memory directives are preserved as stage annotations
  • Workflow call ordering is preserved as comments
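As a sketch, a minimal one-process Nextflow file might come out looking something like this (illustrative only – the exact shape of the generated code depends on the generator, and the names here are hypothetical):

# params.reads becomes a top-level variable
let reads = "data/*.fastq.gz"

pipeline main {
  # container: quay.io/biocontainers/fastqc:0.12.1, cpus: 2, memory: 4 GB
  stage fastqc {
    shell("fastqc " + reads)
  }
}

# Workflow call order (preserved as comments):
#   FASTQC(reads)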

This is a starting point – edit the generated skeleton to add BioLang-specific features like pipe chains, error handling, and parallel execution.

# Full workflow: parse, generate, and review
let parsed = nf_parse("sarek_main.nf")

# Show what was extracted
print("Found " + to_string(length(parsed.processes)) + " processes")
print("Found " + to_string(length(keys(parsed.params))) + " parameters")

# Generate BioLang code
let bl_code = nf_to_bl(parsed)
write_file("sarek.bl", bl_code)
print("Generated sarek.bl")

BioContainers Integration

Reproducible bioinformatics demands pinned software versions. A variant calling pipeline that works today must produce identical results next year, on a different machine, with the same tool versions. The BioContainers project addresses this by packaging over 9,000 bioinformatics tools as container images hosted on registries like quay.io and Docker Hub.

BioLang gives you native access to the BioContainers registry. Four built-in functions let you search for tools, discover popular packages, inspect version histories, and retrieve exact container image URIs – all without leaving your script. No imports are needed.

Searching for Tools

biocontainers_search queries the BioContainers registry by name or keyword. It returns a list of records, each describing a matching tool.

# Find all samtools-related containers
let results = biocontainers_search("samtools")
# results => [
#   {name: "samtools", description: "Tools for manipulating NGS alignments...",
#    organization: "biocontainers", version_count: 42,
#    latest_version: "1.19--h50ea8bc_0",
#    latest_image: "quay.io/biocontainers/samtools:1.19--h50ea8bc_0"},
#   {name: "htslib", description: "C library for high-throughput sequencing...", ...},
#   ...
# ]

results |> each(|tool| {
  print(tool.name + " (" + str(tool.version_count) + " versions)")
  print("  Latest: " + tool.latest_version)
})

The default limit is 25 results. Pass a second argument to narrow or widen the search.

# Top 5 matches for short-read aligners
let aligners = biocontainers_search("bwa", 5)

aligners |> each(|a| print(a.name + " => " + a.latest_image))

Search terms are matched against tool names and descriptions, so broader queries work too.

# Find tools related to RNA-seq quantification
let quant_tools = biocontainers_search("salmon rna-seq")

quant_tools
  |> filter(|t| t.version_count > 5)
  |> each(|t| print(t.name + ": " + t.description))

biocontainers_popular returns the most-pulled tools in the registry. This is useful for discovering which tools the community relies on and for auditing whether your pipeline uses well-maintained software.

let top20 = biocontainers_popular()

print("Top 20 BioContainers tools:")
top20 |> each(|t| print("  " + t.name + " - " + t.latest_version))

Pass a limit to retrieve more.

# Top 50 most popular bioinformatics containers
let top50 = biocontainers_popular(50)

# Which of our pipeline tools are in the top 50?
let our_tools = ["samtools", "bcftools", "bwa-mem2", "gatk4", "picard"]
let popular_names = top50 |> map(|t| t.name)

our_tools |> each(|tool| {
  let found = popular_names |> find(|n| n == tool)
  if found != None then
    print(tool + " is in the top 50")
  else
    print(tool + " is NOT in the top 50")
})

Tool Details

biocontainers_info returns a detailed record for a single tool, including its full version history with per-version container images.

let info = biocontainers_info("samtools")
# info => {
#   name: "samtools",
#   description: "Tools for manipulating NGS alignments...",
#   organization: "biocontainers",
#   aliases: ["samtools"],
#   versions: [
#     {version: "1.19--h50ea8bc_0",
#      images: [{registry: "quay.io", image: "quay.io/biocontainers/samtools:1.19--h50ea8bc_0",
#                type: "Docker", size: 14250000}]},
#     {version: "1.18--h50ea8bc_0", images: [...]},
#     ...
#   ]
# }

print(info.name + " by " + info.organization)
print(str(len(info.versions)) + " versions available")

# List the 5 most recent versions
info.versions
  |> take(5)
  |> each(|v| {
    let image = v.images |> first()
    print("  " + v.version + " (" + image.registry + ", "
          + str(image.size / 1000000) + " MB)")
  })

The images list for each version may contain entries from multiple registries or image types (Docker, Singularity). Filter by registry or type if your infrastructure requires a specific format.

# Find Singularity images for deepvariant
let dv = biocontainers_info("deepvariant")

dv.versions |> each(|v| {
  let singularity = v.images |> filter(|img| img.type == "Singularity")
  if len(singularity) > 0 then
    print(v.version + " has Singularity image")
})

Version Management

biocontainers_versions returns a flat list of all versions for a tool, each with a list of full image URI strings. This is the function to use when you need to pin a specific version in a pipeline manifest.

let versions = biocontainers_versions("gatk4")
# versions => [
#   {version: "4.5.0.0--py310hdfd78af_0",
#    images: ["quay.io/biocontainers/gatk4:4.5.0.0--py310hdfd78af_0"]},
#   {version: "4.4.0.0--py310hdfd78af_0",
#    images: ["quay.io/biocontainers/gatk4:4.4.0.0--py310hdfd78af_0"]},
#   ...
# ]

# Find the latest GATK 4.4.x release
let gatk44 = versions
  |> filter(|v| starts_with(v.version, "4.4"))
  |> first()

print("Pinning GATK to: " + gatk44.images[0])

You can use this to check whether a specific version exists before committing to it in a pipeline definition.

# Verify that bcftools 1.18 is available
let bc_versions = biocontainers_versions("bcftools")
let target = bc_versions |> find(|v| starts_with(v.version, "1.18"))

if target != None then
  print("bcftools 1.18 available: " + target.images[0])
else
  print("bcftools 1.18 not found in BioContainers")

Example: Building a Reproducible Tool Manifest

A variant calling pipeline needs exact container images for every tool. Use the BioContainers builtins to resolve each tool to a pinned image URI and export the manifest.

# Tools required for a germline variant calling pipeline
let required = [
  {name: "bwa-mem2",  min_version: "2.2"},
  {name: "samtools",  min_version: "1.18"},
  {name: "gatk4",     min_version: "4.4"},
  {name: "bcftools",  min_version: "1.18"},
]

let manifest = required |> map(|req| {
  let versions = biocontainers_versions(req.name)

  # Find the newest version that satisfies the minimum
  let matching = versions
    |> filter(|v| starts_with(v.version, req.min_version))

  if len(matching) == 0 then {
    print("WARNING: no " + req.name + " >= " + req.min_version + " found")
    {tool: req.name, version: "MISSING", image: "MISSING"}
  } else {
    let chosen = matching |> first()
    {tool: req.name, version: chosen.version, image: chosen.images[0]}
  }
})

# Print the resolved manifest
print("Variant Calling Pipeline - Tool Manifest")
print("=========================================")
manifest |> each(|m| {
  print(m.tool + ":")
  print("  version: " + m.version)
  print("  image:   " + m.image)
})

# Export as structured data
manifest |> write_json("pipeline_manifest.json")

This script produces a lockfile-style manifest that can be checked into version control alongside the pipeline definition.

Example: Tool Discovery for a New Analysis

When starting a new analysis type, you need to survey what tools are available. Here we explore the methylation analysis landscape.

# What methylation tools exist in BioContainers?
let methyl_tools = biocontainers_search("methylation", 50)

print(str(len(methyl_tools)) + " methylation-related tools found")
print("")

# Group by version count to find well-maintained tools
let mature = methyl_tools
  |> filter(|t| t.version_count >= 5)
  |> sort_by(|t| -t.version_count)

let new_tools = methyl_tools
  |> filter(|t| t.version_count < 3)

print("Mature tools (" + str(len(mature)) + "):")
mature |> each(|t| {
  print("  " + t.name + " - " + str(t.version_count) + " versions"
        + " (latest: " + t.latest_version + ")")
  print("    " + t.description)
})

print("")
print("Newer tools (" + str(len(new_tools)) + "):")
new_tools |> take(10) |> each(|t| {
  print("  " + t.name + " - " + t.latest_version)
})

# Deep dive into the top candidate
let bismark = biocontainers_info("bismark")
print("")
print("Bismark detail:")
print("  " + bismark.description)
print("  " + str(len(bismark.versions)) + " releases")
print("  Aliases: " + join(bismark.aliases, ", "))

# Check image sizes across versions
bismark.versions |> take(5) |> each(|v| {
  let docker = v.images |> filter(|img| img.type == "Docker") |> first()
  if docker != None then
    print("  " + v.version + ": " + str(docker.size / 1000000) + " MB")
})

Example: Container Image Audit

For an existing pipeline, verify that every tool has a valid BioContainers image and flag any that are outdated.

# Current pipeline tools and their pinned versions
let pinned = [
  {tool: "bwa-mem2",  version: "2.2.1--hd03093a_2"},
  {tool: "samtools",  version: "1.17--h50ea8bc_0"},
  {tool: "gatk4",     version: "4.3.0.0--py310hdfd78af_0"},
  {tool: "bcftools",  version: "1.17--h3cc50cf_1"},
  {tool: "multiqc",   version: "1.14--pyhdfd78af_0"},
  {tool: "fastp",     version: "0.23.2--hb7a2d85_2"},
]

let audit = pinned |> map(|entry| {
  let info = biocontainers_info(entry.tool)
  let all_versions = info.versions |> map(|v| v.version)

  # Check if pinned version still exists
  let exists = (all_versions |> find(|v| v == entry.version)) != None

  # Check if there is a newer version
  let latest = info.versions |> first()
  let is_latest = latest.version == entry.version

  # Count how many versions behind
  let versions_behind = if is_latest then
    0
  else {
    let idx = all_versions
      |> enumerate()
      |> find(|pair| pair.value == entry.version)
    if idx != None then idx.index else -1
  }

  {
    tool: entry.tool,
    pinned: entry.version,
    latest: latest.version,
    exists: exists,
    is_latest: is_latest,
    versions_behind: versions_behind,
  }
})

# Report
print("Pipeline Container Audit")
print("========================")

let missing = audit |> filter(|a| not a.exists)
let outdated = audit |> filter(|a| a.exists and not a.is_latest)
let current = audit |> filter(|a| a.is_latest)

if len(missing) > 0 then {
  print("")
  print("MISSING (pinned version no longer in registry):")
  missing |> each(|a| print("  " + a.tool + " " + a.pinned
                             + " => latest: " + a.latest))
}

if len(outdated) > 0 then {
  print("")
  print("OUTDATED:")
  outdated |> each(|a| print("  " + a.tool + " " + a.pinned
                              + " => " + a.latest
                              + " (" + str(a.versions_behind) + " versions behind)"))
}

if len(current) > 0 then {
  print("")
  print("CURRENT:")
  current |> each(|a| print("  " + a.tool + " " + a.pinned))
}

print("")
print(str(len(current)) + " current, "
      + str(len(outdated)) + " outdated, "
      + str(len(missing)) + " missing")

audit |> write_json("container_audit.json")

Summary

BioLang’s four BioContainers builtins – biocontainers_search, biocontainers_popular, biocontainers_info, and biocontainers_versions – bring the full BioContainers registry into your scripts as native data. Use them to discover tools, pin container images for reproducibility, audit existing pipelines, and explore new analysis domains. Combined with BioLang’s pipes and collection operations, a few lines of code replace manual registry browsing and ad hoc version tracking.

Galaxy ToolShed & Workflow Parsing

The Galaxy ToolShed is the app store for the Galaxy bioinformatics platform. It hosts thousands of community-contributed tools covering everything from read alignment and variant calling to phylogenetics and metabolomics. Each repository in the ToolShed carries metadata – owner, description, download counts, and revision history – making it a rich discovery resource even if you never run Galaxy itself.

BioLang provides read-only access to the Galaxy ToolShed through four built-in functions. There is nothing to import and nothing to install. You can search repositories, browse popular tools, list categories, and inspect individual tools from the same scripts that process your sequence data. A fifth builtin, galaxy_to_bl, takes a Galaxy workflow record and generates BioLang pipeline code from it.

Searching Repositories

galaxy_search queries the ToolShed by name or keyword. It returns a list of records, each describing a matching repository.

# Find BWA-related tools in the Galaxy ToolShed
galaxy_search("bwa")
  |> each(|t| print(t.name + " by " + t.owner + " (" + str(t.downloads) + " downloads)"))

Each record in the result contains:

Field        Type    Description
name         String  Repository name
owner        String  ToolShed username of the maintainer
description  String  Short summary
downloads    Int     Total download count
url          String  ToolShed repository URL

The default result set includes all matches. Pass a second argument to cap the number of results, which is useful in exploratory sessions where you want a quick overview.

# Top 5 matches for short-read aligners
galaxy_search("bwa", 5)
  |> each(|t| print(t.name + " (" + t.owner + "): " + t.description))

Search terms are matched against repository names and descriptions, so broader queries work too.

# Find tools related to RNA-seq
galaxy_search("rna-seq", 10)
  |> filter(|t| t.downloads > 1000)
  |> each(|t| print(t.name + ": " + str(t.downloads) + " downloads"))

galaxy_popular returns the most-downloaded repositories in the ToolShed, sorted by download count descending. This is useful for discovering which tools the Galaxy community relies on most heavily.

galaxy_popular(10)
  |> map(|t| { name: t.name, owner: t.owner, downloads: t.downloads })

Without an argument, galaxy_popular() returns a default set. Pass a limit to widen or narrow the list.

# Top 25 Galaxy tools by download count
galaxy_popular(25)
  |> each(|t| print(t.name + " by " + t.owner + " - " + str(t.downloads) + " downloads"))

Categories

The ToolShed organizes repositories into categories such as “Assembly”, “Variant Analysis”, “RNA”, and “Visualization”. galaxy_categories returns them all as a list of records.

galaxy_categories()
  |> each(|c| print(c.name + ": " + c.description))

You can use this to explore what kinds of tools exist before drilling into a specific area.

# Find categories related to sequencing
galaxy_categories()
  |> filter(|c| contains(lower(c.name), "sequenc") or contains(lower(c.description), "sequenc"))
  |> each(|c| print(c.name))

Repository Details

galaxy_tool returns a detailed record for a single repository, identified by owner and name.

let tool = galaxy_tool("devteam", "bwa")
print("Tool: " + tool.name)
print("Owner: " + tool.owner)
print("Downloads: " + str(tool.downloads))
print("Last updated: " + tool.last_updated)
print("Description: " + tool.description)

The returned record includes fields beyond what the search results provide, such as last_updated. Use this to verify that a tool is actively maintained before building a workflow around it.

# Check maintenance status of a tool before adopting it
let tool = galaxy_tool("iuc", "samtools_sort")
print(tool.name + " last updated: " + tool.last_updated)
print("Downloads: " + str(tool.downloads))

Cross-Registry Discovery

One of BioLang’s strengths is that nf-core, BioContainers, and Galaxy ToolShed are all accessible from the same script. When evaluating a tool, you can check all three registries in a single pass to understand the full landscape – whether curated pipelines exist, whether container images are available, and whether Galaxy wrappers have been written.

# Search for BWA across all three registries
let tool = "bwa"

let nf_results = nfcore_search(tool)
let bc_results = biocontainers_search(tool, 5)
let gx_results = galaxy_search(tool, 5)

print("=== Cross-registry search for: " + tool + " ===")
print("nf-core pipelines mentioning " + tool + ": " + str(len(nf_results)))
print("BioContainers: " + str(len(bc_results)))
print("Galaxy tools: " + str(len(gx_results)))

This pattern scales to any tool or keyword. Here is a more thorough version that prints details from each registry side by side.

# Compare tool availability across registries
let tools = ["samtools", "bcftools", "hisat2", "salmon"]

tools |> each(|name| {
  print("--- " + name + " ---")

  let gx = galaxy_search(name, 3)
  let bc = biocontainers_search(name, 3)

  if len(gx) > 0 then
    print("  Galaxy: " + gx[0].name + " by " + gx[0].owner
          + " (" + str(gx[0].downloads) + " downloads)")
  else
    print("  Galaxy: not found")

  if len(bc) > 0 then
    print("  BioContainers: " + bc[0].name
          + " (latest: " + bc[0].latest_version + ")")
  else
    print("  BioContainers: not found")

  print("")
})

Workflow Code Generation

Galaxy workflows are JSON documents that describe a directed acyclic graph of tool invocations. BioLang can parse these workflows and generate equivalent BioLang pipeline code via the galaxy_to_bl builtin. This is useful when migrating from Galaxy to a script-based workflow, or when you want to use a Galaxy workflow as a starting point for a BioLang pipeline.

galaxy_to_bl takes a record that represents a Galaxy workflow and returns a string of BioLang code. The expected input format mirrors the structure of a Galaxy workflow export.

# Construct a Galaxy workflow record
let workflow = {
  name: "RNA-seq Analysis",
  annotation: "Basic RNA-seq pipeline",
  steps: [
    { name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads"], outputs: ["report"] },
    { name: "HISAT2", tool_id: "hisat2/2.2.1", inputs: ["reads"], outputs: ["aligned_bam"] },
    { name: "featureCounts", tool_id: "featurecounts/2.0.6", inputs: ["aligned_bam"], outputs: ["counts"] }
  ]
}

# Generate BioLang pipeline code
let bl_code = galaxy_to_bl(workflow)
print(bl_code)

The generated code maps each Galaxy step to a BioLang function call, preserving the data flow between steps. You can write it directly to a file and use it as a starting point for further customization.
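For the three-step record above, the generated pipeline has roughly this shape (illustrative – the stage names come from the step outputs, and the tool calls shown here are hypothetical):

# Generated from: RNA-seq Analysis
# Basic RNA-seq pipeline
pipeline rna_seq_analysis(reads) {
  # FastQC (fastqc/0.74): reads -> report
  stage report { fastqc(reads) }

  # HISAT2 (hisat2/2.2.1): reads -> aligned_bam
  stage aligned_bam { hisat2(reads) }

  # featureCounts (featurecounts/2.0.6): aligned_bam -> counts
  stage counts { featurecounts(aligned_bam) }
}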

# Save the generated code
let bl_code = galaxy_to_bl(workflow)
write_text("rnaseq.bl", bl_code)
print("Pipeline written to rnaseq.bl")

For more complex workflows with branching and multiple outputs, the generated code includes comments marking each step and its connections.

# A branching QC + alignment workflow
let workflow = {
  name: "QC and Align",
  annotation: "Parallel QC with alignment",
  steps: [
    { name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads"], outputs: ["qc_report"] },
    { name: "Trimmomatic", tool_id: "trimmomatic/0.39", inputs: ["reads"], outputs: ["trimmed"] },
    { name: "BWA-MEM2", tool_id: "bwa_mem2/2.2.1", inputs: ["trimmed"], outputs: ["aligned"] },
    { name: "MultiQC", tool_id: "multiqc/1.14", inputs: ["qc_report"], outputs: ["summary"] }
  ]
}

let bl_code = galaxy_to_bl(workflow)
print(bl_code)

Configuration

The Galaxy ToolShed URL defaults to the main public instance at https://toolshed.g2.bx.psu.edu. You can override it for private or institutional ToolShed installations.

Set the URL in ~/.biolang/apis.yaml:

galaxy:
  toolshed_url: "https://toolshed.example.org"

Or via the environment variable BIOLANG_GALAXY_TOOLSHED_URL:

export BIOLANG_GALAXY_TOOLSHED_URL="https://toolshed.example.org"

The environment variable takes precedence over the config file. If neither is set, the public ToolShed is used.

Example: Building a Tool Inventory

A core facility managing Galaxy and non-Galaxy workflows needs to know which tools are available in the ToolShed and whether container images exist for running them outside Galaxy. This script searches Galaxy for tools matching a keyword, then cross-references each one with BioContainers to check container availability.

# Build a tool inventory for variant analysis
let keyword = "variant"

# Step 1: Search Galaxy for variant-related tools
let gx_tools = galaxy_search(keyword, 20)
print("Found " + str(len(gx_tools)) + " Galaxy tools for: " + keyword)
print("")

# Step 2: For each Galaxy tool, check BioContainers
let inventory = gx_tools |> map(|t| {
  # Search BioContainers using the tool name
  let bc = biocontainers_search(t.name, 1)

  let container_status = if len(bc) > 0 then
    "available (" + bc[0].latest_version + ")"
  else
    "not found"

  {
    name: t.name,
    owner: t.owner,
    galaxy_downloads: t.downloads,
    container: container_status
  }
})

# Step 3: Report
print("Tool Inventory: " + keyword)
print("========================================")
inventory
  |> sort_by(|a| -a.galaxy_downloads)
  |> each(|t| {
    print(t.name + " (" + t.owner + ")")
    print("  Galaxy downloads: " + str(t.galaxy_downloads))
    print("  BioContainers:    " + t.container)
  })

# Step 4: Summary statistics
let with_container = inventory |> filter(|t| not starts_with(t.container, "not"))
let without_container = inventory |> filter(|t| starts_with(t.container, "not"))

print("")
print("Summary:")
print("  Total tools:          " + str(len(inventory)))
print("  With container image: " + str(len(with_container)))
print("  Without container:    " + str(len(without_container)))

You can extend this pattern to cover additional registries or to export the inventory as a structured file.

# Export as JSON for downstream use
inventory |> write_json("variant_tool_inventory.json")

Example: Migrating a Galaxy Workflow

When moving a project from Galaxy to BioLang, you can combine the ToolShed lookup with code generation in a single script. Look up each tool to verify it exists, then generate the BioLang equivalent.

# Define the workflow steps from a Galaxy export
let workflow = {
  name: "Whole Genome Variant Calling",
  annotation: "BWA-MEM2 alignment with GATK HaplotypeCaller",
  steps: [
    { name: "FastQC", tool_id: "fastqc/0.74", inputs: ["reads_r1", "reads_r2"], outputs: ["qc_report"] },
    { name: "BWA-MEM2", tool_id: "bwa_mem2/2.2.1", inputs: ["reads_r1", "reads_r2"], outputs: ["aligned_bam"] },
    { name: "MarkDuplicates", tool_id: "picard_markduplicates/3.1.1", inputs: ["aligned_bam"], outputs: ["dedup_bam"] },
    { name: "HaplotypeCaller", tool_id: "gatk4_haplotypecaller/4.5", inputs: ["dedup_bam"], outputs: ["raw_vcf"] },
    { name: "FilterVcf", tool_id: "gatk4_variantfiltration/4.5", inputs: ["raw_vcf"], outputs: ["filtered_vcf"] }
  ]
}

# Verify each tool exists in the ToolShed
print("Verifying tools...")
workflow.steps |> each(|step| {
  # Extract base tool name from tool_id
  let parts = split(step.tool_id, "/")
  let tool_name = parts[0]
  let results = galaxy_search(tool_name, 1)
  if len(results) > 0 then
    print("  " + step.name + ": found (" + results[0].owner + ")")
  else
    print("  " + step.name + ": WARNING - not found in ToolShed")
})

# Generate the BioLang pipeline
print("")
let bl_code = galaxy_to_bl(workflow)
print("Generated pipeline:")
print(bl_code)

# Save to file
write_text("wgs_variant_calling.bl", bl_code)
print("Pipeline saved to wgs_variant_calling.bl")

Summary

BioLang’s Galaxy ToolShed builtins – galaxy_search, galaxy_popular, galaxy_categories, galaxy_tool, and galaxy_to_bl – bring the Galaxy ecosystem into your scripts as native data. Use them to discover tools, evaluate popularity and maintenance status, cross-reference with BioContainers and nf-core, and generate BioLang pipeline code from Galaxy workflows. The ToolShed becomes one more queryable registry alongside the others, all accessible from the same pipes, filters, and record operations you use for everything else.

Chapter 12: Statistics and Linear Algebra

Bioinformatics is quantitative at its core. Whether you are testing for differential expression, running principal component analysis on a transcriptome, or modeling drug kinetics, you need robust numerical tools. BioLang provides built-in statistics functions, matrix operations, and an ODE solver so these analyses live alongside your data processing code.

Descriptive Statistics

All descriptive stats operate on lists of numbers.

let coverage = tsv("sample_depth.tsv") |> col("depth")

let summary = {
  mean: mean(coverage),
  median: median(coverage),
  sd: stdev(coverage),
  var: variance(coverage),
  q25: quantile(coverage, 0.25),
  q75: quantile(coverage, 0.75),
  iqr: quantile(coverage, 0.75) - quantile(coverage, 0.25),
}

print("Coverage: mean=" + str(summary.mean) + " median=" + str(summary.median))

These functions handle missing values gracefully: None entries are skipped with a warning.
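For example, a None produced by a missing table cell does not derail an aggregate (a sketch of the behavior described above):

let depths = [30, None, 42, 28, None, 35]
let avg = mean(depths)   # averages the four non-None values => 33.75
print("Mean depth: " + str(avg))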

Hypothesis Testing

t-test

Compare expression levels between two conditions.

let tumor = tsv("expr.tsv") |> filter(|r| r.group == "tumor") |> col("BRCA1")
let normal = tsv("expr.tsv") |> filter(|r| r.group == "normal") |> col("BRCA1")

let result = ttest(tumor, normal)
# result => {statistic: 4.32, pvalue: 0.00018, df: 28, mean_diff: 2.15}

if result.pvalue < 0.05 then
  print("BRCA1 significantly different (p=" + str(result.pvalue) + ")")

Wilcoxon Rank-Sum

For non-normally distributed data such as read counts.

let treated = [1240, 890, 1560, 2100, 780, 1890, 1345]
let control = [450, 520, 380, 610, 490, 430, 550]

let result = wilcoxon(treated, control)
# result => {statistic: 49.0, pvalue: 0.0012}

Chi-squared Test

Test whether observed genotype frequencies match Hardy-Weinberg expectations.

let observed = [210, 480, 310]  # AA, Aa, aa genotypes
let n = sum(observed)
let p = (2 * observed[0] + observed[1]) / (2 * n)
let q = 1.0 - p
let expected = [p * p * n, 2 * p * q * n, q * q * n]

let result = chi_square(observed, expected)
# result => {statistic: 0.92, pvalue: 0.34, df: 1}

if result.pvalue > 0.05 then
  print("Population is in Hardy-Weinberg equilibrium")

ANOVA

Compare expression across multiple tissue types.

let brain = tsv("expr.tsv") |> filter(|r| r.tissue == "brain") |> col("TP53")
let liver = tsv("expr.tsv") |> filter(|r| r.tissue == "liver") |> col("TP53")
let lung = tsv("expr.tsv") |> filter(|r| r.tissue == "lung") |> col("TP53")
let kidney = tsv("expr.tsv") |> filter(|r| r.tissue == "kidney") |> col("TP53")

let result = anova(brain, liver, lung, kidney)
# result => {f_statistic: 8.74, pvalue: 0.00003, df_between: 3, df_within: 76}

Fisher’s Exact Test

Test enrichment of a mutation in cases versus controls.

# Contingency table: [[mutation+/case, mutation-/case], [mutation+/ctrl, mutation-/ctrl]]
let table = [[45, 155], [12, 188]]
let result = fisher_exact(table)
# result => {odds_ratio: 4.55, pvalue: 0.00001}

Multiple Testing Correction

When testing thousands of genes, you must correct for multiple comparisons. p_adjust supports Bonferroni, Benjamini-Hochberg (BH), and Benjamini-Yekutieli (BY) methods.

let genes = tsv("gene_expression.tsv")
let gene_names = genes |> col("gene_name")
let tumor_cols = colnames(genes) |> filter(|c| starts_with(c, "tumor_"))
let normal_cols = colnames(genes) |> filter(|c| starts_with(c, "normal_"))

let pvalues = range(0, genes.num_rows)
  |> map(|i| {
       let t = tumor_cols |> map(|c| col(genes, c)[i])
       let n = normal_cols |> map(|c| col(genes, c)[i])
       ttest(t, n).pvalue
     })

let adjusted_bh = p_adjust(pvalues, "BH")
let adjusted_bonf = p_adjust(pvalues, "bonferroni")

let significant = range(0, len(adjusted_bh))
  |> filter(|i| adjusted_bh[i] < 0.05)
  |> map(|i| {gene: gene_names[i], p: pvalues[i], q: adjusted_bh[i]})

print(str(len(significant)) + " genes significant at FDR < 0.05")

Correlation

Compute Pearson or Spearman correlation between expression profiles.

let expr = tsv("tpm_matrix.tsv")
let gene_a = expr |> col("EGFR")
let gene_b = expr |> col("ERBB2")

let pearson = cor(gene_a, gene_b)
# pearson => {r: 0.72, pvalue: 0.00001}

let spearman = spearman(gene_a, gene_b)
# spearman => {rho: 0.68, pvalue: 0.00004}

Linear Models

Fit a linear model to predict expression from covariates.

let data = tsv("sample_metadata.tsv")
let expression = data |> col("gene_expr")
let age = data |> col("age")
let batch = data |> col("batch_id")

let model = lm(expression, [age, batch])
# model => {
#   coefficients: [{name: "intercept", estimate: 5.2, se: 0.8, p: 0.001},
#                  {name: "x1", estimate: 0.03, se: 0.01, p: 0.02},
#                  {name: "x2", estimate: -1.1, se: 0.4, p: 0.008}],
#   r_squared: 0.45,
#   adj_r_squared: 0.42,
#   f_statistic: 12.3,
#   pvalue: 0.00002
# }
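To pull a specific effect out of the fit, filter the coefficient records; continuing from the model above:

let age_effect = model.coefficients |> find(|c| c.name == "x1")
print("Age effect: " + str(age_effect.estimate)
      + " (p=" + str(age_effect.p) + ")")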

Matrix Creation and Operations

Create matrices from data and perform standard operations.

# Create a 3x3 count matrix (genes x samples)
let counts = matrix([
  [120, 340, 250],
  [890, 1200, 1050],
  [45, 30, 55]
])

# Transpose: samples x genes
let transposed = transpose(counts)

# Matrix multiplication
let product = mat_mul(transpose(counts), counts)  # 3x3 covariance-like

# Element-wise operations with mat_map
let log_counts = mat_map(counts, |x| log2(x + 1))

# Basic properties
print("Trace: " + str(trace(product)))
print("Rank: " + str(rank(counts)))
print("Frobenius norm: " + str(norm(counts)))

Linear Algebra

Determinant and Inverse

let cov = matrix([
  [1.0, 0.8, 0.3],
  [0.8, 1.0, 0.5],
  [0.3, 0.5, 1.0]
])

let det = determinant(cov)
print("Determinant: " + str(det))  # non-zero => invertible

let precision = inverse(cov)  # precision matrix

Eigenvalues and Eigenvectors

let eigen = eigenvalues(cov)
# eigen => {values: [2.1, 0.7, 0.2], vectors: [[...], [...], [...]]}

# Proportion of variance explained by each component
let total = sum(eigen.values)
let prop = eigen.values |> map(|v| v / total)
print("PC1 explains " + str(prop[0] * 100) + "% of variance")

Singular Value Decomposition

let result = svd(counts)
# result => {u: matrix, s: [singular values], v: matrix}
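The singular values play the same role the eigenvalues did above: squared and normalized, they give the proportion of variance each component captures. Continuing from the result above:

let sq = result.s |> map(|s| s * s)
let total = sum(sq)
let prop = sq |> map(|x| x / total)
print("Component 1 captures " + str(prop[0] * 100) + "% of variance")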

Solving Linear Systems

# Solve Ax = b
let a = matrix([[2, 1], [1, 3]])
let b = [5, 11]
let x = solve(a, b)
# x => [0.8, 3.4]

ODE Solving

ode_solve uses a 4th-order Runge-Kutta integrator. The signature is:

ode_solve(derivative_fn, initial_state, [t_start, t_end, dt])

It returns {t: [...], y: [...]} where y is a list of state vectors at each time point.
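A quick way to sanity-check the integrator is a system with a known closed form. First-order decay dy/dt = -k*y has the analytic solution y(t) = y0 * exp(-k*t), so the numeric endpoint can be compared directly (a minimal sketch using the signature above):

# dy/dt = -k * y, starting from y0 = 100
let k = 0.5
let decay = |t, y| [-k * y[0]]

let sol = ode_solve(decay, [100.0], [0.0, 10.0, 0.01])
let final = sol.y[len(sol.y) - 1][0]
print("Numeric endpoint: " + str(final))  # analytic value: 100 * e^(-0.5 * 10) ~ 0.674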

Example: Differential Expression Analysis

Run t-tests across all genes and correct for multiple testing.

let expr = tsv("counts_normalized.tsv")
let metadata = tsv("sample_info.tsv")

let case_ids = metadata |> filter(|r| r.condition == "treated") |> col("sample_id")
let ctrl_ids = metadata |> filter(|r| r.condition == "control") |> col("sample_id")

let gene_names = expr |> col("gene_id")
let results = gene_names |> map(|gene| {
  let case_vals = case_ids |> map(|id| expr[id][index_of(gene_names, gene)])
  let ctrl_vals = ctrl_ids |> map(|id| expr[id][index_of(gene_names, gene)])
  let test = ttest(case_vals, ctrl_vals)
  let fc = mean(case_vals) / mean(ctrl_vals)
  {gene: gene, log2fc: log2(fc), pvalue: test.pvalue}
})

let p_vals = results |> map(|r| r.pvalue)
let q_vals = p_adjust(p_vals, "BH")

let de_genes = range(0, len(results))
  |> map(|i| {...results[i], q_value: q_vals[i]})
  |> filter(|r| r.q_value < 0.05 && abs(r.log2fc) > 1.0)
  |> sort_by(|r| r.q_value)

print(str(len(de_genes)) + " differentially expressed genes")
de_genes |> take(20) |> each(|g| print(g.gene + " log2FC=" + str(g.log2fc)))
de_genes |> write_tsv("de_results.tsv")

Example: PCA on Gene Expression

Use the covariance matrix eigendecomposition to project samples into PC space.

let expr = tsv("tpm_matrix.tsv")
let sample_ids = expr |> colnames() |> filter(|n| n != "gene_id")
let n_genes = expr.num_rows
let n_samples = len(sample_ids)

# Build samples x genes matrix
let data = sample_ids
  |> map(|id| expr |> col(id))
  |> matrix()

# Center each gene (column) to zero mean
let centered = range(0, n_genes)
  |> fold(data, |mat, j| {
       let col = mat_col(mat, j)
       let mu = mean(col)
       mat_set_col(mat, j, col |> map(|x| x - mu))
     })

# Covariance matrix (samples x samples)
let cov = mat_mul(centered, transpose(centered))
  |> mat_map(|x| x / (n_genes - 1))

# Eigendecomposition
let eigen = eigenvalues(cov)
let total_var = sum(eigen.values)

# Project onto first 2 PCs
let pc1 = mat_col(eigen.vectors, 0)
let pc2 = mat_col(eigen.vectors, 1)

let pca_coords = range(0, n_samples)
  |> map(|i| {
       sample: sample_ids[i],
       pc1: pc1[i],
       pc2: pc2[i],
     })

print("PC1: " + str(eigen.values[0] / total_var * 100) + "% variance")
print("PC2: " + str(eigen.values[1] / total_var * 100) + "% variance")
pca_coords |> write_tsv("pca_coordinates.tsv")

Example: Pharmacokinetic Modeling

Model oral drug absorption and elimination using a two-compartment ODE system.

# Parameters for a typical oral drug
let ka = 1.0      # absorption rate (1/hr)
let ke = 0.2      # elimination rate (1/hr)
let dose = 500.0  # mg

# State: [gut_amount, plasma_amount]
# gut depletes by absorption; plasma gains from gut and loses by elimination
let pk_model = |t, y| [
  -ka * y[0],              # dA_gut/dt = -ka * A_gut
  ka * y[0] - ke * y[1],   # dA_plasma/dt = ka * A_gut - ke * A_plasma
]

let initial = [dose, 0.0]
let solution = ode_solve(pk_model, initial, [0.0, 24.0, 0.1])

# Find Cmax and Tmax
let plasma = solution.y |> map(|state| state[1])
let cmax = max(plasma)
let tmax_idx = index_of(plasma, cmax)
let tmax = solution.t[tmax_idx]

print("Tmax: " + str(tmax) + " hr")
print("Cmax: " + str(cmax) + " mg")

# Half-life
let t_half = log(2) / ke
print("Half-life: " + str(t_half) + " hr")

# AUC by trapezoidal rule
let auc = range(0, len(plasma) - 1)
  |> map(|i| (plasma[i] + plasma[i + 1]) / 2.0 * (solution.t[i + 1] - solution.t[i]))
  |> sum()
print("AUC(0-24h): " + str(auc) + " mg*hr")

Example: Population Genetics (Hardy-Weinberg)

Simulate genotype frequencies and test for equilibrium across loci.

# Observed genotype counts at 50 SNP loci
let loci = tsv("genotype_counts.tsv")
# columns: locus, AA, Aa, aa

let hwe_results = loci |> map(|loc| {
  let obs = [loc.AA, loc.Aa, loc.aa]
  let n = sum(obs)
  let p = (2 * loc.AA + loc.Aa) / (2 * n)
  let q = 1.0 - p

  let exp = [p * p * n, 2 * p * q * n, q * q * n]
  let test = chi_square(obs, exp)

  {
    locus: loc.locus,
    p_freq: p,
    q_freq: q,
    chi2: test.statistic,
    pvalue: test.pvalue,
  }
})

# Correct for multiple testing
let p_vals = hwe_results |> map(|r| r.pvalue)
let q_vals = p_adjust(p_vals, "BH")

let departures = range(0, len(hwe_results))
  |> map(|i| {...hwe_results[i], q_value: q_vals[i]})
  |> filter(|r| r.q_value < 0.05)

print(str(len(departures)) + " loci depart from HWE (FDR < 0.05):")
departures |> each(|r| print("  " + r.locus + " chi2=" + str(r.chi2)
                              + " q=" + str(r.q_value)))

Summary

BioLang’s numerical toolbox covers the statistical tests and linear algebra operations that appear throughout computational biology. Descriptive stats, hypothesis tests with multiple-testing correction, matrix decompositions, and ODE integration are all available as built-in functions, keeping your analysis in a single language from raw data to publication-ready results.

Chapter 13: Biological Database APIs

Bioinformatics analysis rarely lives in isolation. You query NCBI for gene annotations, check UniProt for protein function, pull variant consequences from Ensembl VEP, and look up pathways in KEGG. BioLang provides built-in client functions for all of these databases so you can fetch, cross-reference, and integrate biological data without leaving your script.

All bio API functions are builtins. No imports are needed.

NCBI

The National Center for Biotechnology Information hosts PubMed, Gene, Nucleotide, and dozens of other databases. BioLang provides three NCBI functions.

ncbi_search

Search any NCBI database (PubMed, Gene, Nucleotide, Protein, etc.).

# Search PubMed for recent CRISPR-Cas9 papers
# ncbi_search(db, query, max_results?)
let ids = ncbi_search("pubmed", "CRISPR-Cas9 delivery 2024", 20)
# ids => ["39012345", "39012346", ...]

# Get summaries for the IDs
let summaries = ncbi_summary(ids, "pubmed")
summaries |> each(|s| {
  print(s.uid)
})

ncbi_gene

Fetch detailed gene information by gene ID or symbol.

# ncbi_gene(symbol_or_query, max_results?)
let tp53 = ncbi_gene("TP53")
# If a single gene matches, returns a record:
# {id, symbol, name, description, organism, chromosome, location, summary}

print(tp53.name + " on chr" + tp53.chromosome + ": " + tp53.location)

ncbi_sequence

Retrieve nucleotide or protein sequences from NCBI.

# ncbi_sequence(accession) — returns FASTA text
let fasta = ncbi_sequence("NM_000546.6")
print("FASTA (first 100 chars): " + fasta[0..100])

Set the NCBI_API_KEY environment variable to get 10 requests/second instead of the default 3.

Ensembl

ensembl_symbol and ensembl_gene

Look up a gene via the Ensembl REST API.

# ensembl_symbol(species, symbol) — lookup by symbol
let brca1 = ensembl_symbol("human", "BRCA1")
# Returns: {id, symbol, description, species, biotype, start, end, strand, chromosome}

print("Ensembl ID: " + brca1.id)
print("Location: " + brca1.chromosome + ":" + str(brca1.start) + "-" + str(brca1.end))

# ensembl_gene(ensembl_id) — lookup by Ensembl ID
let same = ensembl_gene("ENSG00000012048")
print("Symbol: " + same.symbol)

ensembl_vep

Predict the functional consequences of variants using the Variant Effect Predictor.

# ensembl_vep(hgvs) — predict variant consequences
let variants = [
  "17:g.43091434C>T",   # BRCA1 splice donor
  "7:g.140753336A>T",   # BRAF V600E
  "12:g.25245350C>A",   # KRAS G12V
]

let predictions = variants |> map(|v| ensembl_vep(v))

# Each result is a list of records with:
# {allele_string, most_severe_consequence, transcript_consequences: [...]}
predictions |> each(|pred| {
  if len(pred) > 0 then {
    let r = pred[0]
    print(r.allele_string + " => " + r.most_severe_consequence)
  }
})

UniProt

Search the UniProt protein database.

# uniprot_search(query, limit?)
let kinases = uniprot_search("kinase AND organism_id:9606", 50)
# Returns list of records: {accession, name, organism, sequence_length, gene_names, function}

print(str(len(kinases)) + " human kinases found")
kinases |> take(5) |> each(|k| print(k.accession + " " + k.gene_names))

uniprot_entry

Get full details for a single UniProt accession.

let entry = uniprot_entry("P04637")  # TP53
# Returns: {accession, name, organism, sequence_length, gene_names, function}

print(entry.name + ": " + str(entry.sequence_length) + " aa")
print("Genes: " + entry.gene_names)
print("Function: " + entry.function)

# Get protein features separately
let features = uniprot_features("P04637")
# Returns list of: {type, location, description}
let domains = features |> filter(|f| f.type == "Domain")
domains |> each(|d| print("  " + d.description + ": " + d.location))

UCSC Genome Browser

ucsc_sequence

Retrieve genomic sequences from the UCSC Genome Browser.

# Get the sequence of the BRCA1 promoter region
# ucsc_sequence(genome, chrom, start, end)
let promoter = ucsc_sequence("hg38", "chr17", 43170245, 43172245)
# Returns DNA sequence as a string

print("BRCA1 promoter length: " + str(len(promoter)) + " bp")
let gc = gc_content(dna(promoter))
print("GC content: " + str(gc))

KEGG

kegg_find

Search KEGG databases (pathway, enzyme, compound, etc.).

# kegg_find(db, query) — search KEGG databases
let pathways = kegg_find("pathway", "apoptosis human")
# Returns list of: {id, description}
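The records pipe into the usual list operations, so reporting the top hits is one line:

pathways |> take(5) |> each(|p| print(p.id + ": " + p.description))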

kegg_get

Retrieve a specific KEGG entry.

# kegg_get(entry_id) — returns raw KEGG text
let apoptosis = kegg_get("hsa04210")
# Returns the KEGG flat-file text for the entry
print("Entry text length: " + str(len(apoptosis)) + " chars")

STRING

string_network

Query protein-protein interaction networks from the STRING database.

# string_network(identifiers, species)
let network = string_network("TP53", 9606)
# network => {nodes: [...], edges: [...]}

print(network)

PDB

pdb_entry

Retrieve protein structure metadata from the Protein Data Bank.

let structure = pdb_entry("1TUP")  # TP53 DNA-binding domain
# structure => {
#   id: "1TUP",
#   title: "Crystal structure of the p53 core domain...",
#   resolution: 2.2,
#   method: "X-RAY DIFFRACTION",
#   chains: [{id: "A", entity: "Tumor suppressor p53", length: 195}, ...],
#   ligands: [...],
# }

print(structure.title)
print("Resolution: " + str(structure.resolution) + " A")

Reactome

reactome_pathways

Find pathways associated with a gene or set of genes.

# reactome_pathways(gene_or_genes)
let pathways = reactome_pathways("BRCA1")
# Returns pathway records

pathways |> each(|p| print(p))

Gene Ontology

go_term

Look up a GO term by its identifier.

let term = go_term("GO:0006915")
# term => {id: "GO:0006915", name: "apoptotic process",
#           namespace: "biological_process",
#           definition: "A programmed cell death process..."}

go_annotations

Get GO annotations for a gene.

# go_annotations(gene_or_accession)
let annotations = go_annotations("TP53")
# Returns list of: {go_id, term, aspect}

annotations |> take(10) |> each(|a| print("  " + a.go_id + " [" + a.aspect + "]: " + a.term))

COSMIC

cosmic_gene

Query the Catalogue Of Somatic Mutations In Cancer. Requires COSMIC_API_KEY.

# cosmic_gene(gene) — requires COSMIC_API_KEY env var
let cosmic = cosmic_gene("BRAF")
# Returns mutation data for the gene

print(cosmic)

NCBI Datasets

datasets_gene

Use the NCBI Datasets API for gene metadata.

# datasets_gene(symbol_or_id)
let info = datasets_gene("EGFR")
print(info)

Environment Variables

Some APIs require or benefit from API keys:

Variable          Effect
NCBI_API_KEY      NCBI: 10 requests/sec instead of 3
COSMIC_API_KEY    Required for COSMIC queries

Set these in your shell or in a .env file before running your script.

Example: Gene Annotation Pipeline

Fetch gene info from NCBI, cross-reference with UniProt, and pull pathways from KEGG.

let gene_list = ["TP53", "BRCA1", "EGFR", "KRAS", "PIK3CA"]

let annotated = gene_list |> map(|symbol| {
  let ncbi = ncbi_gene(symbol)
  let up_hits = uniprot_search("gene:" + symbol + " AND organism_id:9606", 1)
  let prot_len = if len(up_hits) > 0 then up_hits[0].sequence_length else 0

  let pathways = kegg_find("pathway", symbol + " human")
    |> take(3)

  {
    symbol:         symbol,
    name:           ncbi.name,
    chromosome:     ncbi.chromosome,
    location:       ncbi.location,
    protein_length: prot_len,
    pathways:       pathways |> map(|p| p.description),
  }
})

annotated |> each(|g| {
  print(g.symbol + " (" + g.name + ") - chr" + g.chromosome)
  print("  Protein: " + str(g.protein_length) + " aa")
  print("  Pathways: " + str(g.pathways))
})

annotated |> write_json("gene_annotations.json")

Example: Variant Interpretation

Predict variant effects with Ensembl VEP and check for known cancer mutations in COSMIC.

let variants = tsv("candidate_variants.tsv")
  |> map(|r| r.chrom + ":" + str(r.pos) + ":" + r.ref + ":" + r.alt)

let interpreted = variants |> map(|v| {
  let vep = ensembl_vep(v)

  let worst = vep[0].transcript_consequences
    |> sort_by(|c| c.impact_rank)
    |> first()

  let gene = worst.gene_symbol

  # Check COSMIC if it is a missense or nonsense variant
  let cosmic_info = if worst.consequence == "missense_variant" or
                       worst.consequence == "stop_gained" then
    cosmic_gene(gene)
      |> |cg| cg.mutations
      |> find(|m| m.aa_change == worst.amino_acid_change)
  else
    None

  {
    variant: v,
    gene: gene,
    consequence: worst.consequence,
    impact: worst.impact,
    sift: worst.sift_prediction,
    polyphen: worst.polyphen_prediction,
    cosmic_count: if cosmic_info != None then cosmic_info.count else 0,
    cosmic_known: cosmic_info != None,
  }
})

# Flag high-impact or COSMIC-known variants
let flagged = interpreted
  |> filter(|v| v.impact == "HIGH" or v.cosmic_known)
  |> sort_by(|v| -v.cosmic_count)

print(str(len(flagged)) + " variants flagged for review")
flagged |> write_tsv("flagged_variants.tsv")

Example: Protein Interaction Network

Build a STRING interaction network for a set of differentially expressed genes and annotate with Reactome pathways.

let de_genes = tsv("de_results.tsv")
  |> filter(|r| r.q_value < 0.01 and abs(r.log2fc) > 2.0)
  |> map(|r| r.gene)

# Get STRING interactions for the top 50 DE genes
let top_genes = de_genes |> take(50)
let network = string_network(top_genes, 9606)

print(str(len(network.nodes)) + " nodes, "
      + str(len(network.edges)) + " edges")

# Find hub genes (most connections)
let degree = network.nodes |> map(|node| {
  let edges = network.edges
    |> filter(|e| e.source == node.id or e.target == node.id)
  {gene: node.id, degree: len(edges)}
})
  |> sort_by(|d| -d.degree)

print("Hub genes:")
degree |> take(10) |> each(|d| print("  " + d.gene + ": " + str(d.degree) + " interactions"))

# Pathway enrichment for hub genes
let hub_genes = degree |> take(10) |> map(|d| d.gene)
let pathways = reactome_pathways(hub_genes)
  |> filter(|p| p.p_value < 0.05)
  |> sort_by(|p| p.p_value)

print("Enriched pathways:")
pathways |> take(10) |> each(|p| print("  " + p.name + " (p=" + str(p.p_value) + ")"))

Summary

BioLang’s built-in bio API functions let you query the major biological databases directly from your scripts. Combine them with pipes, maps, and filters to build annotation pipelines that cross-reference genes, variants, proteins, and pathways in a few lines of code. Set API keys via environment variables to increase rate limits where supported.

Chapter 14: Streams and Scale

Genomics data is large. A single whole-genome sequencing run produces hundreds of gigabytes of FASTQ. A population-scale VCF can hold billions of variant records. Loading everything into memory is not an option. BioLang addresses this with streams: lazy, single-pass iterators that process data record by record without materializing the entire dataset.

What Is a Stream?

A stream is a StreamValue – a lazy sequence of items produced on demand. File readers in BioLang return streams by default. Streams are consumed once: after you iterate through a stream, it is exhausted.

# read_fastq returns a stream, not a list
let reads = read_fastq("whole_genome_R1.fastq.gz")

# This processes one record at a time -- constant memory
reads
  |> filter(|r| mean_phred(r.quality) >= 30)
  |> map(|r| {id: r.id, length: r.length, gc: gc_content(r.seq)})
  |> write_tsv("read_stats.tsv")

No matter how large the FASTQ file is, this script uses the same amount of memory. Each record flows through filter, then map, then is written out before the next record is read.
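Because streams are single-pass, a second traversal of a consumed stream sees nothing. When an analysis genuinely needs multiple passes, materialize the stream with collect() (a sketch; collect() also appears in the parallel-annotation example later in this chapter):

let reads = read_fastq("sample.fastq.gz")
let n = reads |> count()   # consumes the stream; reads is now exhausted

# Materialize into a list for repeated use -- this loads everything into memory
let read_list = read_fastq("sample.fastq.gz") |> collect()
let n_reads = len(read_list)
let mean_len = read_list |> map(|r| r.length) |> mean()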

Stream Operations

Streams support the same functional operations as lists, but they execute lazily.

map

Transform each element.

read_vcf("variants.vcf.gz")
  |> map(|v| {chrom: v.chrom, pos: v.pos, qual: v.qual, af: v.info.AF})
  |> write_tsv("variant_summary.tsv")

filter

Keep only elements matching a predicate.

read_bam("aligned.bam")
  |> filter(|r| r.mapping_quality >= 30 and not r.is_duplicate)
  |> count()
  |> |n| print(str(n) + " high-quality unique alignments")

take_while

Consume from a stream while a condition holds, then stop.

# Read the first million reads from a FASTQ for quick QC
let sample = read_fastq("deep_sequencing.fastq.gz")
  |> take_while(|r, i| i < 1000000)

let quals = sample |> map(|r| mean_phred(r.quality))
print("Sampled stats: " + str(mean(quals)) + " mean Q")

each

Execute a side effect for every element (the stream analogue of a for loop).

let bad_count = 0
read_fastq("sample.fastq.gz")
  |> each(|r| {
       if mean_phred(r.quality) < 20 then
         bad_count = bad_count + 1
     })
print(str(bad_count) + " low-quality reads")

fold / reduce

Accumulate a value across the entire stream.

let total_bases = read_fastq("sample.fastq.gz")
  |> fold(0, |acc, r| acc + r.length)
print("Total bases: " + str(total_bases))

Chunked Processing

When you need batch operations on a stream, chunk groups elements into fixed-size lists. This is useful for writing output in blocks or batching API calls.

# Process a FASTQ in chunks of 10,000 reads
read_fastq("large_sample.fastq.gz")
  |> chunk(10000)
  |> each(|batch| {
       let mean_q = batch |> map(|r| mean_phred(r.quality)) |> mean()
       print("Batch: " + str(len(batch)) + " reads, "
             + "mean Q=" + str(mean_q))
     })

Chunking is also useful for writing to multiple output files.

# Split a FASTQ into files of 1 million reads each
let file_num = 0
read_fastq("huge_sample.fastq.gz")
  |> chunk(1000000)
  |> each(|batch| {
       let path = "split/chunk_" + str(file_num) + ".fastq.gz"
       batch |> write_fastq(path)
       file_num = file_num + 1
     })
print("Wrote " + str(file_num) + " chunk files")

Window Operations

window creates a sliding window over a list (or materialized portion of a stream). This is essential for computing positional statistics across a genome.

# Sliding window GC content
let sequence = read_fasta("chr1.fa") |> first() |> |r| r.seq
let win_size = 1000
let step = 500

let gc_track = window(sequence, win_size, step)
  |> enumerate()
  |> map(|pair| {
       pos: pair[0] * step,
       gc: gc_content(pair[1]),
     })

gc_track |> write_tsv("chr1_gc_content.tsv")

For windowed operations on streams (where you cannot look back), use sliding_window which maintains a buffer.

# Compute rolling average base quality across a BAM
read_bam("sample.bam")
  |> filter(|r| r.chrom == "chr17")
  |> sort_by(|r| r.pos)
  |> sliding_window(100)
  |> map(|win| {
       pos: win[50].pos,
       mean_mapq: mean(win |> map(|r| r.mapping_quality)),
     })
  |> write_tsv("chr17_mapq_rolling.tsv")

Parallel Operations

par_map

Apply a function to each element of a list in parallel, distributing work across available CPU cores.

let chromosomes = ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6",
                   "chr7", "chr8", "chr9", "chr10", "chr11", "chr12",
                   "chr13", "chr14", "chr15", "chr16", "chr17", "chr18",
                   "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"]

let per_chrom_stats = par_map(chromosomes, |chrom| {
  let reads = read_bam("sample.bam") |> filter(|r| r.chrom == chrom)
  let count = reads |> count()
  let depth = samtools_depth("sample.bam", chrom) |> mean()
  {chrom: chrom, reads: count, mean_depth: depth}
})

per_chrom_stats |> sort_by(|s| s.chrom) |> write_tsv("chrom_stats.tsv")

par_filter

Filter elements in parallel. Useful when the predicate itself is expensive.

let variants = tsv("all_variants.tsv")

let pathogenic = par_filter(variants, |v| {
  let vep = ensembl_vep(v.chrom + ":" + str(v.pos) + ":"
                        + v.ref + ":" + v.alt)
  let worst = vep[0].transcript_consequences |> first()
  worst.impact == "HIGH"
})

print(str(len(pathogenic)) + " high-impact variants")

Unit Helpers

BioLang provides genomic unit helpers that make size comparisons readable.

let region_size = mb(3)          # 3,000,000
let read_length = bp(150)        # 150
let window = kb(50)              # 50,000
let genome_size = gb(3)          # 3,000,000,000

# Use in calculations
let coverage = 1800000000 / genome_size  # ~0.6x
print("Coverage: " + str(coverage) + "x")

# Filter regions by size
let large_svs = read_vcf("structural_variants.vcf.gz")
  |> filter(|v| v.info.SVLEN != None and abs(v.info.SVLEN) > mb(1))

print(str(large_svs |> count()) + " SVs larger than 1 Mb")

These helpers are simple multipliers but they prevent off-by-three-zeros bugs that plague genomics scripts.

Example: Process a 100GB FASTQ

Stream 4 billion reads, filter by quality, and compute statistics without loading the file into memory.

let input = "novaseq_R1.fastq.gz"  # 100 GB compressed

let stats = {
  total: 0,
  passed: 0,
  total_bases: 0,
  gc_bases: 0,
  qual_sum: 0,
}

read_fastq(input)
  |> each(|r| {
       stats.total = stats.total + 1
       let q = mean_phred(r.quality)
       if q >= 30 then {
         stats.passed = stats.passed + 1
         stats.total_bases = stats.total_bases + r.length
         let gc = gc_content(dna(r.seq)) * r.length
         stats.gc_bases = stats.gc_bases + gc
         stats.qual_sum = stats.qual_sum + q
       }
     })

let pct_pass = stats.passed / stats.total * 100
let gc = stats.gc_bases / stats.total_bases * 100
let mean_q = stats.qual_sum / stats.passed

print("Total reads:  " + str(stats.total))
print("Passed Q>=30: " + str(stats.passed) + " (" + str(pct_pass) + "%)")
print("Mean quality: " + str(mean_q))
print("GC content:   " + str(gc) + "%")

This script processes the entire file in a single pass. Memory usage stays constant regardless of file size.

Example: Parallel Variant Annotation Across Chromosomes

Split variant annotation work by chromosome to use all cores.

let vcf_path = "cohort.vcf.gz"
let chromosomes = range(1, 23) |> map(|n| "chr" + str(n))
let chromosomes = concat(chromosomes, ["chrX", "chrY"])

let annotated = par_map(chromosomes, |chrom| {
  let variants = read_vcf(vcf_path)
    |> filter(|v| v.chrom == chrom)
    |> collect()

  let results = variants |> map(|v| {
    let vep = ensembl_vep(v.chrom + ":" + str(v.pos) + ":" + v.ref + ":" + v.alt)
    let worst = vep.consequences
      |> sort_by(|c| c.impact_rank)
      |> first()
    {
      ...v,
      consequence: worst.consequence,
      impact: worst.impact,
      gene: worst.gene_symbol,
    }
  })

  print(chrom + ": " + str(len(results)) + " variants annotated")
  results
})
  |> flatten()

annotated |> write_tsv("annotated_variants.tsv")
print("Annotated " + str(len(annotated)) + " variants across "
      + str(len(chromosomes)) + " chromosomes")

Example: Sliding Window GC Content

Compute GC content in 1 kb windows across an entire chromosome.

let chr_record = read_fasta("GRCh38_chr22.fa") |> first()
let chr_seq = chr_record.seq
let chr_len = len(chr_seq)
let win = kb(1)
let step = bp(500)

print("Chromosome 22: " + str(chr_len) + " bp")
print("Computing GC in " + str(win) + " bp windows, step " + str(step))

let gc_profile = window(chr_seq, win, step)
  |> enumerate()
  |> map(|pair| {
       start: pair[0] * step,
       end: pair[0] * step + win,
       gc: gc_content(pair[1]),
     })

# Identify GC-rich and GC-poor regions
let gc_rich = gc_profile |> filter(|w| w.gc > 0.60)
let gc_poor = gc_profile |> filter(|w| w.gc < 0.35)

print("GC-rich windows (>60%): " + str(len(gc_rich)))
print("GC-poor windows (<35%): " + str(len(gc_poor)))

gc_profile |> write_tsv("chr22_gc_profile.tsv")

Example: Batch Processing Thousands of Samples

Process a large sample cohort using chunked parallelism to avoid overwhelming system resources.

let manifest = tsv("sample_manifest.tsv")  # 2,000 samples
print("Processing " + str(len(manifest)) + " samples")

let batch_size = 16  # process 16 samples at a time

let all_results = manifest
  |> chunk(batch_size)
  |> enumerate()
  |> map(|pair| {
       let batch_idx = pair[0]
       let batch = pair[1]
       print("Batch " + str(batch_idx + 1) + "/"
             + str(ceil(len(manifest) / batch_size)))

       par_map(batch, |sample| {
         # Align
         let bam = bwa_mem("GRCh38.fa", sample.r1, sample.r2, threads: 2)
           |> samtools_sort(threads: 2)
         samtools_index(bam)

         # QC metrics
         let flagstat = samtools_flagstat(bam)
         let depth = samtools_depth(bam) |> mean()
         let insert_size = picard_insert_size(bam)

         {
           id: sample.sample_id,
           bam: bam,
           mapped_pct: flagstat.mapped_pct,
           mean_depth: depth,
           median_insert: insert_size.median,
           status: if flagstat.mapped_pct > 90 and depth > 20
                   then "PASS" else "FAIL",
         }
       })
     })
  |> flatten()

let passed = all_results |> filter(|r| r.status == "PASS")
let failed = all_results |> filter(|r| r.status == "FAIL")

print("Results: " + str(len(passed)) + " PASS, "
      + str(len(failed)) + " FAIL")

if len(failed) > 0 then {
  print("Failed samples:")
  failed |> each(|r| print("  " + r.id + " depth=" + str(r.mean_depth)
                           + " mapped=" + str(r.mapped_pct) + "%"))
}

all_results |> write_tsv("cohort_qc_results.tsv")

Memory Patterns to Know

 Pattern                                 | Memory | Use When
-----------------------------------------+--------+--------------------------
 read_fastq(f) |> filter(...) |> count() | O(1)   | Counting, simple stats
 read_vcf(f) |> collect()                | O(n)   | Need random access
 read_bam(f) |> chunk(k) |> each(...)    | O(k)   | Batch writing, API calls
 par_map(list, f)                        | O(n)   | List already in memory
 stream |> fold(init, f)                 | O(1)   | Aggregation

The rule of thumb: stay in stream mode as long as possible. Call collect() only when you genuinely need the full dataset in memory (for sorting, random access, or passing to a function that requires a list).
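For example, counting can stay in stream mode, while a top-N query genuinely needs the list in memory (paths are placeholders):

# O(1): nothing is retained beyond the running count
let n_pass = read_vcf("cohort.vcf.gz")
  |> filter(|v| v.qual > 30)
  |> count()

# O(n): sort_by needs the full list, so collect() first
let top10 = read_vcf("cohort.vcf.gz")
  |> collect()
  |> sort_by(|v| -v.qual)
  |> take(10)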

Summary

Streams let BioLang handle files of any size in constant memory. Combine them with chunk for batch processing, window for positional analysis, and par_map/par_filter for multi-core parallelism. The unit helpers bp(), kb(), mb(), and gb() keep genomic arithmetic readable. Together, these tools let you write scripts that scale from a test FASTQ to a population-scale dataset without changing the code.

SQLite & Notifications

BioLang includes built-in SQLite support and notification builtins so your pipelines can persist results and alert you when they finish — no external tools required.

SQLite

Bioinformatics workflows constantly produce tabular results: QC metrics, variant counts, sample manifests. SQLite gives you a zero-config embedded database to store, query, and compare results across runs.

Opening a Database

# File-based (creates if missing)
let db = sqlite("project_results.db")

# In-memory (temporary, fast)
let scratch = sqlite()

Querying

sql() executes any SQL statement. SELECT queries return a Table; write statements return the number of affected rows. Use ? placeholders for parameterized queries.

# Create a table
sql(db, "CREATE TABLE IF NOT EXISTS qc (
  sample TEXT PRIMARY KEY,
  total_reads INTEGER,
  pass_reads INTEGER,
  pass_rate REAL
)")

# Insert data
sql(db, "INSERT INTO qc VALUES (?, ?, ?, ?)",
    ["tumor_01", 5000000, 4850000, 97.0])

# Query — returns a Table
let results = sql(db, "SELECT * FROM qc WHERE pass_rate > ?", [95.0])
print(results)

Since sql() returns a standard BioLang Table, you can pipe it directly:

sql(db, "SELECT symbol, chrom, start, end FROM genes WHERE chrom = ?", ["chr17"])
  |> filter(|g| g.start > 40000000)
  |> sort_by(|g| g.start)
  |> print()

Bulk Insert

sql_insert() inserts an entire Table or list of records in a single transaction. This is much faster than individual INSERT statements.

# From a Table (e.g., read from TSV)
let stats = read_tsv("variant_stats.tsv")
sql_insert(db, "variants", stats)

# From a list of records
let samples = [
  {id: "S1", depth: 42.3, mapped_pct: 98.1},
  {id: "S2", depth: 38.7, mapped_pct: 97.5},
]
let inserted = sql_insert(db, "qc_results", samples)
print(str(inserted) + " rows inserted")

Metadata

# List all tables
sql_tables(db)
# ["qc", "variants", "qc_results"]

# Inspect a table's schema
sql_schema(db, "qc")
#  cid | name        | type    | notnull | pk
# -----+-------------+---------+---------+-----
#  0   | sample      | TEXT    | false   | true
#  1   | total_reads | INTEGER | false   | false
#  2   | pass_reads  | INTEGER | false   | false
#  3   | pass_rate   | REAL    | false   | false

Pipeline Example

Store QC results from every sample run, then query across all runs:

let db = sqlite("lab_results.db")

sql(db, "CREATE TABLE IF NOT EXISTS qc (
  sample TEXT, run_date TEXT, total INTEGER, passing INTEGER, rate REAL
)")

pipeline sample_qc(sample_id, fastq_path) {
  stage stats {
    let reads = read_fastq(fastq_path) |> collect()
    let total = len(reads)
    let passing = reads |> filter(|r| mean_phred(r.quality) >= 25) |> len()
    let rate = passing / total * 100.0

    sql(db, "INSERT INTO qc VALUES (?, date('now'), ?, ?, ?)",
        [sample_id, total, passing, rate])

    {total: total, passing: passing, rate: rate}
  }
}

# After processing many samples, query trends
let low_quality = sql(db,
  "SELECT sample, rate FROM qc WHERE rate < ? ORDER BY rate",
  [95.0])
print("Samples below 95% pass rate:")
print(low_quality)

SQLite Builtins Summary

 Builtin                     | Returns      | Description
-----------------------------+--------------+------------------------------
 sqlite(path?)               | DbHandle     | Open/create database
 sql(db, query, params?)     | Table or Int | Execute SQL
 sql_insert(db, table, data) | Int          | Bulk insert (transactional)
 sql_tables(db)              | List         | List table names
 sql_schema(db, table)       | Table        | Column metadata
 sql_close(db)               | Nil          | Close connection (optional)
 is_db(value)                | Bool         | Type check

Notifications

Long-running pipelines (alignment, variant calling, cohort analysis) can take hours. BioLang’s notification builtins send you a message when your analysis finishes or fails — Slack, Teams, Telegram, Discord, or email.

The notify() Builtin

notify() is the smart router. It reads BIOLANG_NOTIFY to determine which provider to use, then sends the message.

# Simple string
notify("Alignment complete: 24 samples processed")

# Structured record — provider formats it natively
notify({
  title: "QC Pipeline Complete",
  status: "success",
  fields: {
    samples: 24,
    pass_rate: "96%",
    output: "/data/results/cohort.vcf.gz"
  }
})

If BIOLANG_NOTIFY is not set, notify() prints to stderr as a fallback.

Provider-Specific Builtins

Each provider has a dedicated builtin for direct use:

# Slack (env: SLACK_WEBHOOK)
slack("Variant calling finished: 1,234 SNPs, 456 indels")

# Microsoft Teams (env: TEAMS_WEBHOOK)
teams("RNA-seq pipeline complete: 12 samples normalized")

# Telegram (env: TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID)
telegram("Alignment done: tumor.sorted.bam")

# Discord (env: DISCORD_WEBHOOK)
discord("FASTQ QC complete: 98% pass rate")

# Email (env: SMTP_HOST, SMTP_USER, SMTP_PASS)
email("lab@example.com", "Pipeline Complete", "Analysis finished successfully")

All webhook-based builtins accept an optional first argument for an explicit webhook URL, so you can skip env vars:

slack("https://hooks.slack.com/services/xxx", "Pipeline done")
teams("https://outlook.office.com/webhook/xxx", "Pipeline done")

Structured Messages

Pass a record instead of a string for rich formatting. Each provider renders it natively (Slack Block Kit, Teams Adaptive Cards, Discord embeds):

slack({
  title: "Cohort Analysis Complete",
  status: "success",
  fields: {
    samples: 48,
    variants: "2.3M",
    ti_tv: 2.1,
    output: "cohort.annotated.vcf.gz"
  }
})

Pipeline Integration

Use notify() in pipeline stages or with defer for guaranteed delivery:

pipeline variant_pipeline {
  defer {
    notify("Pipeline variant_pipeline finished")
  }

  stage align {
    shell("bwa-mem2 mem -t 16 GRCh38.fa R1.fq.gz R2.fq.gz | samtools sort -o aligned.bam")
    "aligned.bam"
  }

  stage call {
    shell("gatk HaplotypeCaller -R GRCh38.fa -I " + align + " -O raw.vcf.gz")
    let variants = read_vcf("raw.vcf.gz") |> collect()
    let snps = variants |> filter(|v| v.is_snp) |> len()
    let indels = variants |> filter(|v| v.is_indel) |> len()

    notify({
      title: "Variant Calling Complete",
      fields: {SNPs: snps, Indels: indels}
    })

    "raw.vcf.gz"
  }
}

Setup

Set environment variables once in your shell profile:

# Pick your provider
export BIOLANG_NOTIFY=slack

# Slack
export SLACK_WEBHOOK=https://hooks.slack.com/services/T.../B.../xxx

# Telegram
export TELEGRAM_BOT_TOKEN=123456:ABC-DEF
export TELEGRAM_CHAT_ID=-1001234567890

# Email (for notify() with BIOLANG_NOTIFY=email)
export SMTP_HOST=smtp.gmail.com
export SMTP_USER=you@gmail.com
export SMTP_PASS=app-password
export NOTIFY_EMAIL_TO=lab@example.com

Notification Builtins Summary

 Builtin               | Provider | Env Vars
-----------------------+----------+--------------------------------------
 notify(msg)           | Auto     | BIOLANG_NOTIFY + provider vars
 slack(msg)            | Slack    | SLACK_WEBHOOK
 teams(msg)            | Teams    | TEAMS_WEBHOOK
 telegram(msg)         | Telegram | TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID
 discord(msg)          | Discord  | DISCORD_WEBHOOK
 email(to, subj, body) | SMTP     | SMTP_HOST, SMTP_USER, SMTP_PASS

Knowledge Graphs

BioLang includes a built-in graph data structure for modeling biological networks — protein interactions, gene regulatory networks, metabolic pathways, and more.

Creating Graphs

# Empty undirected graph
let g = graph()

# Directed graph
let g = graph(true)

Adding Nodes and Edges

Nodes are identified by string IDs and can carry arbitrary attributes:

let g = graph()
let g = add_node(g, "BRCA1", {biotype: "protein_coding", chrom: "chr17"})
let g = add_node(g, "TP53", {biotype: "protein_coding", chrom: "chr17"})
let g = add_edge(g, "BRCA1", "TP53", {score: 0.99, source: "STRING"})

Adding an edge auto-creates nodes that don’t exist yet:

let g = graph()
let g = add_edge(g, "A", "B")  # both A and B are created
has_node(g, "A")                # true

Querying the Graph

# Build a small PPI network
let g = graph()
let g = add_edge(g, "BRCA1", "TP53", {score: 0.99})
let g = add_edge(g, "TP53", "MDM2", {score: 0.97})
let g = add_edge(g, "BRCA1", "BARD1", {score: 0.95})
let g = add_edge(g, "MDM2", "CDKN2A", {score: 0.85})

# Direct neighbors
neighbors(g, "TP53")           # ["BRCA1", "MDM2"]

# Node degree
degree(g, "BRCA1")             # 2

# Shortest path (BFS)
shortest_path(g, "BARD1", "CDKN2A")  # ["BARD1", "BRCA1", "TP53", "MDM2", "CDKN2A"]

# All nodes and edges
nodes(g)                       # ["BARD1", "BRCA1", "CDKN2A", "MDM2", "TP53"]
edges(g)                       # Table with from, to, weight columns

Graph Analysis

Connected Components

Find disconnected subnetworks:

let g = graph()
let g = add_edge(g, "BRCA1", "TP53")
let g = add_edge(g, "BRCA1", "BARD1")
let g = add_node(g, "ISOLATED_GENE")
let g = add_edge(g, "CDK2", "CCND1")

let components = connected_components(g)
print("Number of components: " + str(len(components)))
# 3: [BRCA1, TP53, BARD1], [ISOLATED_GENE], [CDK2, CCND1]

Induced Subgraph

Extract a subgraph containing only specified nodes and their connecting edges:

let g = graph()
let g = add_edge(g, "A", "B")
let g = add_edge(g, "B", "C")
let g = add_edge(g, "C", "D")

let sub = subgraph(g, ["A", "B", "C"])
has_edge(sub, "A", "B")    # true
has_edge(sub, "C", "D")    # false (D not in subgraph)

Node Attributes

let g = graph()
let g = add_node(g, "EGFR", {
    biotype: "protein_coding",
    chrom: "chr7",
    pathway: "EGFR signaling"
})

let attrs = node_attr(g, "EGFR")
print(attrs.pathway)    # "EGFR signaling"

Removing Nodes and Edges

let g = graph()
let g = add_edge(g, "A", "B")
let g = add_edge(g, "B", "C")

# Remove a single edge
let g = remove_edge(g, "A", "B")
has_edge(g, "A", "B")    # false

# Remove a node (also removes all connected edges)
let g = remove_node(g, "B")
has_node(g, "B")          # false

Directed vs Undirected

By default, graphs are undirected — edges go both ways:

let g = graph()
let g = add_edge(g, "A", "B")
neighbors(g, "B")    # ["A"] — B sees A as neighbor

For directed graphs (e.g., regulatory networks):

let g = graph(true)
let g = add_edge(g, "TF", "TARGET")
neighbors(g, "TF")      # ["TARGET"]
neighbors(g, "TARGET")  # [] — directed, no reverse edge

Real-World Example: STRING Network Analysis

# Fetch protein interactions from STRING
let network = string_network("BRCA1", 9606)

# Build graph from API results
let g = graph()
let g = network |> reduce(g, |g, edge|
    add_edge(g, edge.preferredName_A, edge.preferredName_B, {score: edge.score})
)

# Find hub genes (highest degree)
let gene_degrees = nodes(g) |> map(|n| {gene: n, deg: degree(g, n)})
gene_degrees
  |> sort_by(|r| r.deg, desc: true)
  |> take(10)
  |> each(|r| print(r.gene + ": " + str(r.deg) + " interactions"))

# Check connectivity
let components = connected_components(g)
print("Connected components: " + str(len(components)))

Builtin Reference

 Function                       | Args                      | Description
--------------------------------+---------------------------+------------------------------------------
 graph()                        | [directed]                | Create empty graph (default: undirected)
 add_node(g, id, [attrs])       | graph, Str, [Record]      | Add node with optional attributes
 add_edge(g, from, to, [attrs]) | graph, Str, Str, [Record] | Add edge (auto-creates nodes)
 remove_node(g, id)             | graph, Str                | Remove node and its edges
 remove_edge(g, from, to)       | graph, Str, Str           | Remove first matching edge
 neighbors(g, id)               | graph, Str                | List of neighbor node IDs
 degree(g, id)                  | graph, Str                | Number of edges for a node
 shortest_path(g, from, to)     | graph, Str, Str           | BFS shortest path (nil if none)
 connected_components(g)        | graph                     | List of node-ID lists per component
 nodes(g)                       | graph                     | Sorted list of all node IDs
 edges(g)                       | graph                     | Table with from, to, weight columns
 has_node(g, id)                | graph, Str                | Bool
 has_edge(g, from, to)          | graph, Str, Str           | Bool
 subgraph(g, ids)               | graph, List[Str]          | Induced subgraph
 node_attr(g, id)               | graph, Str                | Attributes record for a node

PDB Structures & Enrichment Analysis

BioLang provides direct access to the RCSB Protein Data Bank and built-in gene set enrichment analysis.

PDB Structures

Fetching Entries

let entry = pdb_entry("4HHB")
print(entry.title)          # "THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN"
print(entry.method)         # "X-RAY DIFFRACTION"
print(entry.resolution)     # 1.74
print(entry.organism)       # "Homo sapiens"
print(entry.release_date)   # "1984-07-17"

Searching

let ids = pdb_search("insulin receptor")
print("Found " + str(len(ids)) + " structures")
ids |> take(5) |> each(|id| print(id))

Chains and Entities

# Get all polymer entities (chains) in a structure
let chains = pdb_chains("4HHB")
chains |> each(|c|
    print(c.description + " (" + c.entity_type + "): " + str(len(c.sequence)) + " residues")
)

# Get a specific entity
let entity = pdb_entity("4HHB", 1)
print(entity.description)
print(entity.sequence)

# Get sequence as Protein value
let seq = pdb_sequence("4HHB", 1)
print(len(seq))             # sequence length
print(molecular_weight(seq)) # if available

Real-World Example: Compare Hemoglobin Chains

let alpha = pdb_entity("4HHB", 1)
let beta = pdb_entity("4HHB", 2)
print("Alpha chain: " + str(len(alpha.sequence)) + " residues")
print("Beta chain: " + str(len(beta.sequence)) + " residues")
print("Alpha type: " + alpha.entity_type)

PubMed Integration

Search PubMed and retrieve abstracts:

# Search for recent CRISPR papers
let results = pubmed_search("CRISPR Cas9 therapy", 5)
print("Total results: " + str(results.count))

# Fetch abstract for a specific paper
let abstract = pubmed_abstract(results.ids[0])
print(abstract)

Literature Review Pipeline

# Search for papers about a gene of interest
let gene = "BRCA1"
let results = pubmed_search(gene + " cancer therapy 2024", 20)

print("Found " + str(results.count) + " papers for " + gene)
results.ids
  |> take(5)
  |> each(|pmid| {
    let text = pubmed_abstract(pmid)
    print("PMID " + pmid + ":")
    print(text)
    print("---")
  })

Enrichment Analysis

Over-Representation Analysis (ORA)

Test whether your gene list is enriched for specific biological pathways:

# Load gene sets (GMT format from MSigDB, GO, KEGG, etc.)
let gene_sets = read_gmt("h.all.v2024.1.Hs.symbols.gmt")

# Your differentially expressed genes
let de_genes = ["BRCA1", "TP53", "CDK2", "CCND1", "RB1", "E2F1", "PCNA", "MCM2"]

# Run ORA with background size (total genes measured)
let results = enrich(de_genes, gene_sets, 20000)

# Filter significant results
results
  |> filter(|r| r.fdr < 0.05)
  |> each(|r| print(r.term + ": p=" + str(r.p_value) + " FDR=" + str(r.fdr)))

The enrich() function uses the hypergeometric test with Benjamini-Hochberg FDR correction.

Output columns: term, overlap, p_value, fdr, genes
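Concretely: with N background genes, K genes in the set, n query genes, and an observed overlap of x, the reported p_value is the hypergeometric upper tail,

p = \sum_{k=x}^{\min(n, K)} \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}

and fdr is the Benjamini-Hochberg adjustment of these p-values across all tested gene sets.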

Gene Set Enrichment Analysis (GSEA)

For ranked gene lists (e.g., by fold change or t-statistic):

# Prepare ranked table with gene and score columns
let ranked = read_tsv("de_results.tsv")
  |> select(["gene", "log2fc"])
  |> rename({log2fc: "score"})
  |> sort_by(|r| r.score, desc: true)

# Load gene sets
let gene_sets = read_gmt("c2.cp.kegg.v2024.1.Hs.symbols.gmt")

# Run GSEA (1000 permutations)
let results = gsea(ranked, gene_sets)

# Top enriched pathways
results
  |> filter(|r| r.fdr < 0.25)
  |> each(|r| print(r.term + ": NES=" + str(r.nes) + " FDR=" + str(r.fdr)))

Output columns: term, es (enrichment score), nes (normalized ES), p_value, fdr, leading_edge

GMT File Format

The GMT (Gene Matrix Transposed) format is tab-delimited:

PATHWAY_NAME\tDescription\tGENE1\tGENE2\tGENE3\t...

Standard sources:

  • MSigDB (Broad Institute) — Hallmark, KEGG, Reactome, GO sets
  • Enrichr — curated gene set libraries
  • WikiPathways — community-curated pathways
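As a concrete (hypothetical) example, a two-set GMT file and the result of parsing it with read_gmt():

# contents of example.gmt (columns are tab-separated):
#   CELL_CYCLE    Cell cycle genes    CDK2    CCND1   RB1   E2F1
#   DNA_REPAIR    DNA repair genes    BRCA1   BRCA2   RAD51

let sets = read_gmt("example.gmt")
len(sets)             # 2
sets["DNA_REPAIR"]    # ["BRCA1", "BRCA2", "RAD51"]

Per the Map{name -> List[Str]} return type, only the gene lists are kept; the description column is not retained.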

Real-World Example: RNA-seq Enrichment Pipeline

# 1. Read differential expression results
let de = read_tsv("deseq2_results.tsv")

# 2. Get significant genes
let sig_genes = de
  |> filter(|r| r.padj < 0.05 and abs(r.log2FoldChange) > 1)
  |> map(|r| r.gene)
  |> collect()

print("Significant DE genes: " + str(len(sig_genes)))

# 3. Load pathway gene sets
let hallmark = read_gmt("h.all.v2024.1.Hs.symbols.gmt")
let kegg = read_gmt("c2.cp.kegg.v2024.1.Hs.symbols.gmt")

# 4. Run ORA on both
let h_results = enrich(sig_genes, hallmark, 20000)
let k_results = enrich(sig_genes, kegg, 20000)

# 5. Report significant pathways
print("\n=== Hallmark Pathways ===")
h_results |> filter(|r| r.fdr < 0.05) |> each(|r|
    print(r.term + " (overlap=" + str(r.overlap) + ", FDR=" + str(r.fdr) + ")")
)

print("\n=== KEGG Pathways ===")
k_results |> filter(|r| r.fdr < 0.05) |> each(|r|
    print(r.term + " (overlap=" + str(r.overlap) + ", FDR=" + str(r.fdr) + ")")
)

Builtin Reference

PDB

 Function                    | Args     | Returns      | Description
-----------------------------+----------+--------------+------------------------------------------
 pdb_entry(id)               | Str      | Record       | Fetch PDB entry metadata
 pdb_search(query)           | Str      | List[Str]    | Full-text search, returns PDB IDs
 pdb_entity(id, entity_id)   | Str, Int | Record       | Get specific polymer entity
 pdb_sequence(id, entity_id) | Str, Int | Protein      | Get entity sequence as Protein value
 pdb_chains(id)              | Str      | List[Record] | Get all polymer entities for a structure

PubMed

 Function                    | Args       | Returns            | Description
-----------------------------+------------+--------------------+------------------------
 pubmed_search(query, [max]) | Str, [Int] | Record{count, ids} | Search PubMed articles
 pubmed_abstract(pmid)       | Str        | Str                | Fetch abstract text

Enrichment

 Function                | Args           | Returns                | Description
-------------------------+----------------+------------------------+---------------------------------------
 read_gmt(path)          | Str            | Map{name -> List[Str]} | Parse GMT gene set file
 enrich(genes, sets, bg) | List, Map, Int | Table                  | ORA with hypergeometric test + BH FDR
 ora(genes, sets, bg)    | List, Map, Int | Table                  | Alias for enrich()
 gsea(ranked, sets)      | Table, Map     | Table                  | GSEA with permutation test + BH FDR

Chapter 17: Literate Notebooks

BioLang notebooks (.bln files) combine Markdown prose with executable code blocks. They’re ideal for documenting analyses, creating reproducible reports, and sharing methods with collaborators.

Running a Notebook

bl notebook analysis.bln

Prose sections are rendered with ANSI formatting in the terminal. Code blocks are executed in order, with output interleaved between prose.

The .bln Format

A .bln file is plain text. Everything outside a code block is Markdown prose. Code blocks can use either fenced syntax or dash delimiters.

Fenced Code Blocks

Use standard Markdown fences with biolang, bl, or no language tag:

## Load Data

Read the FASTQ file and compute basic stats.

```biolang
let reads = read_fastq("sample.fq")
print(f"Loaded {len(reads)} reads")
```

## Filter

Keep high-quality reads.

```bl
let filtered = reads |> filter(|r| mean_phred(r.quality) > 25)
print(f"Kept {len(filtered)} reads")
```

Dash Delimiters

The original .bln syntax uses --- on its own line:

## Load Data
---
let reads = read_fastq("sample.fq")
print(f"Loaded {len(reads)} reads")
---
## Results
The output above shows the read count.

Both styles can be mixed in the same notebook. Fenced blocks with other language tags (e.g. ```python) are treated as prose and not executed.

State Carries Over

Variables defined in one code block are available in all later blocks. This lets you build up an analysis step by step, explaining each part in prose.
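A minimal two-block notebook (dash-delimited style) where the second block reuses `reads` from the first:

## Load
---
let reads = read_fastq("sample.fq") |> collect()
---
## Report
---
print(f"Loaded {len(reads)} reads")   # `reads` was defined in the previous block
---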

Cell Directives

Add special comment annotations at the top of a code block to control behavior:

 Directive      | Effect
----------------+-----------------------------------------------------
 # @hide        | Execute but don’t display the code
 # @skip        | Don’t execute this block
 # @echo        | Print the code before executing
 # @hide-output | Execute (and show code) but suppress printed output

Directives are stripped from the code before execution. Multiple directives can be combined:

```biolang
# @hide
let config = {min_quality: 30, reference: "GRCh38"}
```

```biolang
# @echo
let reads = read_fastq("sample.fq")
  |> filter(|r| mean_phred(r.quality) >= config.min_quality)
print(f"Filtered: {len(reads)} reads")
```

With # @hide, the config block runs silently. With # @echo, readers see both the code and its output.

HTML Export

Generate a self-contained HTML report:

bl notebook analysis.bln --export html > report.html

The HTML output includes:

  • Inline CSS with a dark theme
  • BioLang syntax highlighting (keywords, strings, comments, pipes)
  • Rendered Markdown prose
  • Code output blocks
  • No external dependencies – a single standalone file

This is useful for sharing results via email or publishing to a website.

Jupyter Interop

Import from Jupyter

Convert an existing .ipynb notebook to .bln:

bl notebook experiment.ipynb --from-ipynb > experiment.bln

  • Markdown cells become prose sections
  • Code cells become fenced biolang blocks
  • Outputs are discarded (they’ll be regenerated)

Export to Jupyter

Convert a .bln to .ipynb:

bl notebook analysis.bln --to-ipynb > analysis.ipynb

  • Prose becomes Markdown cells
  • Code blocks become code cells with "language": "biolang" metadata
  • Uses nbformat v4, compatible with JupyterLab and VS Code

Example: GC Content Report

Here’s a complete notebook that analyzes GC content:

# GC Content Analysis

This notebook computes per-contig GC content and flags outliers.

## Setup

```biolang
# @hide
let threshold = 2.0
```

## Load Data

```biolang
let seqs = read_fasta("contigs.fa")
print(f"Loaded {len(seqs)} sequences")
```

## Statistics

```biolang
# @echo
let gc_values = seqs |> map(|s| gc_content(s.seq))
let mu = mean(gc_values)
let sigma = stdev(gc_values)
print(f"Mean GC: {mu:.3f} +/- {sigma:.4f}")
```

## Outliers

Contigs more than 2 std devs from the mean may indicate
contamination or horizontal gene transfer.

```biolang
let outliers = seqs
  |> filter(|s| abs(gc_content(s.seq) - mu) > threshold * sigma)
print(f"Found {len(outliers)} outlier contigs")
```

Run it:

bl notebook gc_analysis.bln

Export it:

bl notebook gc_analysis.bln --export html > gc_report.html

Tips

  • Use # @hide for setup – configuration, imports, and helper functions that readers don’t need to see
  • Use # @echo for key steps – shows both code and output for the important analysis logic
  • Use # @skip during development – temporarily disable expensive cells
  • The HTML export is self-contained – share the .html file directly
  • Convert existing Jupyter notebooks with --from-ipynb to migrate analyses

Chapter 15: Plugins

BioLang ships with a rich set of built-in functions, but bioinformatics is vast. You may need to wrap a niche aligner, call an R statistical package, or connect to a lab-specific LIMS API. The plugin system lets you extend BioLang with external tools written in Python, R, TypeScript/Deno, or native binaries, all communicating over a simple JSON protocol on stdin/stdout.

Plugin Architecture

A BioLang plugin is a standalone program that:

  1. Reads JSON requests from stdin
  2. Processes the request
  3. Writes JSON responses to stdout

BioLang manages the plugin lifecycle: it starts the subprocess when the plugin is first called, keeps it alive for subsequent calls, and shuts it down when the script ends. This avoids process startup overhead on repeated invocations.
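In practice this means repeated calls are cheap. For example, using the blast-wrapper plugin shown later in this chapter:

import "blast-wrapper"

let a = blastn(query: "a.fa", db: "custom_db")   # subprocess starts on the first call
let b = blastn(query: "b.fa", db: "custom_db")   # the same subprocess is reused
# on script exit, BioLang sends __shutdown__ and the subprocess terminates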

The Plugin Manifest

Every plugin has a plugin.json file that describes its interface.

{
  "name": "blast-wrapper",
  "version": "1.0.0",
  "description": "BLAST+ sequence search wrapper",
  "kind": "python",
  "entry": "blast_plugin.py",
  "functions": [
    {
      "name": "blastn",
      "description": "Nucleotide BLAST search",
      "params": {
        "query": {"type": "string", "description": "Path to query FASTA"},
        "db": {"type": "string", "description": "BLAST database name or path"},
        "evalue": {"type": "number", "default": 1e-5, "description": "E-value threshold"},
        "max_hits": {"type": "integer", "default": 10, "description": "Maximum hits to return"},
        "threads": {"type": "integer", "default": 4, "description": "CPU threads"}
      },
      "returns": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "subject": "string",
            "identity": "number",
            "evalue": "number",
            "bitscore": "number",
            "alignment_length": "integer"
          }
        }
      }
    },
    {
      "name": "blastp",
      "description": "Protein BLAST search",
      "params": {
        "query": {"type": "string", "description": "Path to query FASTA"},
        "db": {"type": "string", "description": "BLAST database name or path"},
        "evalue": {"type": "number", "default": 1e-5},
        "max_hits": {"type": "integer", "default": 10}
      },
      "returns": {
        "type": "array"
      }
    },
    {
      "name": "makeblastdb",
      "description": "Create a BLAST database from a FASTA file",
      "params": {
        "input": {"type": "string", "description": "Input FASTA path"},
        "dbtype": {"type": "string", "enum": ["nucl", "prot"], "description": "Sequence type"},
        "out": {"type": "string", "description": "Output database prefix"}
      },
      "returns": {
        "type": "object"
      }
    }
  ]
}

The kind field tells BioLang how to launch the plugin:

 Kind       | Interpreter
------------+--------------------------------
 python     | python3 (or python on Windows)
 r          | Rscript
 deno       | deno run
 typescript | deno run (same as deno)
 native     | Direct executable

Communication Protocol

BioLang sends JSON messages to the plugin’s stdin and reads JSON responses from stdout. Each message is a single line of JSON terminated by a newline.

Request Format

{"id": "req_001", "function": "blastn", "params": {"query": "input.fa", "db": "nt", "evalue": 1e-10}}

Response Format

Success:

{"id": "req_001", "result": [{"subject": "NM_000546.6", "identity": 99.5, "evalue": 0.0, "bitscore": 1842}]}

Error:

{"id": "req_001", "error": {"code": "BLAST_FAILED", "message": "Database nt not found"}}

Lifecycle Messages

BioLang sends a shutdown message when the script ends:

{"id": "shutdown", "function": "__shutdown__", "params": {}}

The plugin should clean up resources and exit.

Installing Plugins

From a Directory

bl add ./my-plugins/blast-wrapper

This copies (or symlinks) the plugin into ~/.biolang/plugins/blast-wrapper/.

From a Git Repository

bl add https://github.com/lab/biolang-deseq2.git

BioLang clones the repository into the plugins directory.

Removing Plugins

bl remove blast-wrapper

Listing Installed Plugins

From the command line:

bl plugins

From the REPL:

:plugins

Both show the plugin name, version, kind, and available functions.

Using Plugins in Scripts

Once installed, import a plugin by name and call its functions.

import "blast-wrapper"

# Build a custom database
makeblastdb(input: "my_sequences.fa", dbtype: "nucl", out: "custom_db")

# Search against it
let hits = blastn(query: "query.fa", db: "custom_db",
                  evalue: 1e-20, max_hits: 5)

hits |> each(|h| print(h.subject + " identity=" + str(h.identity)
                       + "% evalue=" + str(h.evalue)))

Plugin functions integrate seamlessly with pipes and other BioLang features.

read_fasta("contigs.fa")
  |> filter(|r| len(r.seq) > kb(1))
  |> write_fasta("long_contigs.fa")

blastn(query: "long_contigs.fa", db: "nt", evalue: 1e-30, threads: 8)
  |> filter(|h| h.identity > 95.0)
  |> sort_by(|h| -h.bitscore)
  |> write_tsv("blast_hits.tsv")

Example: Python BLAST Wrapper Plugin

Here is the complete Python implementation for the blast-wrapper plugin described above.

Directory Structure

blast-wrapper/
  plugin.json        # manifest (shown earlier)
  blast_plugin.py    # implementation

blast_plugin.py

import json
import subprocess
import sys
import csv
from io import StringIO


def _run_blast(tool, params):
    # Shared driver for blastn/blastp: both use the same tabular output format.
    cmd = [
        tool,
        "-query", params["query"],
        "-db", params["db"],
        "-evalue", str(params.get("evalue", 1e-5)),
        "-max_target_seqs", str(params.get("max_hits", 10)),
        "-outfmt", "6 sseqid pident evalue bitscore length",
    ]
    if tool == "blastn":
        cmd += ["-num_threads", str(params.get("threads", 4))]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return {"error": {"code": "BLAST_FAILED", "message": result.stderr.strip()}}

    hits = []
    reader = csv.reader(StringIO(result.stdout), delimiter="\t")
    for row in reader:
        hits.append({
            "subject": row[0],
            "identity": float(row[1]),
            "evalue": float(row[2]),
            "bitscore": float(row[3]),
            "alignment_length": int(row[4]),
        })
    return {"result": hits}


def blastn(params):
    return _run_blast("blastn", params)


def blastp(params):
    return _run_blast("blastp", params)


def makeblastdb(params):
    cmd = [
        "makeblastdb",
        "-in", params["input"],
        "-dbtype", params["dbtype"],
        "-out", params["out"],
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return {"error": {"code": "MAKEBLASTDB_FAILED", "message": result.stderr.strip()}}
    return {"result": {"database": params["out"], "status": "created"}}


DISPATCH = {
    "blastn": blastn,
    "blastp": blastp,
    "makeblastdb": makeblastdb,
}


def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)

        func_name = request["function"]
        if func_name == "__shutdown__":
            break

        handler = DISPATCH.get(func_name)
        if handler is None:
            response = {"id": request["id"],
                        "error": {"code": "UNKNOWN_FUNCTION",
                                  "message": f"No function: {func_name}"}}
        else:
            # Return an error response instead of crashing the plugin process
            try:
                resp = handler(request["params"])
            except Exception as exc:
                resp = {"error": {"code": "PLUGIN_ERROR", "message": str(exc)}}
            response = {"id": request["id"], **resp}

        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()


if __name__ == "__main__":
    main()

Install and use:

bl add ./blast-wrapper

# In a BioLang script:
import "blast-wrapper"

let hits = blastn(query: "primers.fa", db: "refseq_genomic",
                  evalue: 1e-5, max_hits: 20)
hits |> filter(|h| h.identity == 100.0)
    |> each(|h| print("Perfect match: " + h.subject))

Example: R DESeq2 Plugin for Differential Expression

Directory Structure

deseq2-plugin/
  plugin.json
  deseq2_plugin.R

plugin.json

{
  "name": "deseq2",
  "version": "1.0.0",
  "description": "DESeq2 differential expression analysis",
  "kind": "r",
  "entry": "deseq2_plugin.R",
  "functions": [
    {
      "name": "deseq2_de",
      "description": "Run DESeq2 differential expression on a count matrix",
      "params": {
        "counts_file": {"type": "string", "description": "Path to raw counts TSV (genes x samples)"},
        "metadata_file": {"type": "string", "description": "Path to sample metadata TSV"},
        "design": {"type": "string", "default": "~ condition", "description": "DESeq2 design formula"},
        "contrast": {"type": "array", "description": "Contrast vector, e.g. ['condition', 'treated', 'control']"},
        "alpha": {"type": "number", "default": 0.05, "description": "Adjusted p-value threshold"},
        "lfc_threshold": {"type": "number", "default": 0.0, "description": "Log2 fold change threshold for lfcShrink"}
      },
      "returns": {
        "type": "object",
        "properties": {
          "results_file": "string",
          "n_significant": "integer",
          "n_up": "integer",
          "n_down": "integer"
        }
      }
    },
    {
      "name": "deseq2_normalize",
      "description": "Return DESeq2-normalized counts",
      "params": {
        "counts_file": {"type": "string"},
        "metadata_file": {"type": "string"},
        "design": {"type": "string", "default": "~ condition"}
      },
      "returns": {
        "type": "object",
        "properties": {
          "normalized_file": "string"
        }
      }
    }
  ]
}

deseq2_plugin.R

library(jsonlite)
library(DESeq2)

deseq2_de <- function(params) {
  counts <- read.delim(params$counts_file, row.names = 1)
  metadata <- read.delim(params$metadata_file, row.names = 1)

  # Ensure sample order matches
  metadata <- metadata[colnames(counts), , drop = FALSE]

  dds <- DESeqDataSetFromMatrix(
    countData = counts,
    colData = metadata,
    design = as.formula(params$design)
  )

  dds <- DESeq(dds)

  contrast <- params$contrast
  res <- results(dds, contrast = contrast, alpha = params$alpha)

  if (params$lfc_threshold > 0) {
    res <- lfcShrink(dds, contrast = contrast, res = res, type = "ashr")
  }

  res_df <- as.data.frame(res)
  res_df$gene <- rownames(res_df)
  out_file <- sub("\\.tsv$", "_deseq2_results.tsv", params$counts_file)
  write.table(res_df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

  sig <- res_df[!is.na(res_df$padj) & res_df$padj < params$alpha, ]

  list(
    result = list(
      results_file = out_file,
      n_significant = nrow(sig),
      n_up = sum(sig$log2FoldChange > 0),
      n_down = sum(sig$log2FoldChange < 0)
    )
  )
}

deseq2_normalize <- function(params) {
  counts <- read.delim(params$counts_file, row.names = 1)
  metadata <- read.delim(params$metadata_file, row.names = 1)
  metadata <- metadata[colnames(counts), , drop = FALSE]

  dds <- DESeqDataSetFromMatrix(
    countData = counts,
    colData = metadata,
    design = as.formula(params$design)
  )
  dds <- estimateSizeFactors(dds)
  norm_counts <- counts(dds, normalized = TRUE)

  out_file <- sub("\\.tsv$", "_normalized.tsv", params$counts_file)
  write.table(as.data.frame(norm_counts), out_file, sep = "\t", quote = FALSE)

  list(result = list(normalized_file = out_file))
}

dispatch <- list(
  deseq2_de = deseq2_de,
  deseq2_normalize = deseq2_normalize
)

# Main loop: read JSON from stdin, dispatch, write JSON to stdout
con <- file("stdin", "r")
while (TRUE) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break

  request <- fromJSON(line)
  if (request$`function` == "__shutdown__") break

  handler <- dispatch[[request$`function`]]
  if (is.null(handler)) {
    response <- list(
      id = request$id,
      error = list(code = "UNKNOWN_FUNCTION",
                   message = paste("No function:", request$`function`))
    )
  } else {
    resp <- tryCatch(
      handler(request$params),
      error = function(e) list(error = list(code = "R_ERROR", message = e$message))
    )
    response <- c(list(id = request$id), resp)
  }

  cat(toJSON(response, auto_unbox = TRUE), "\n", sep = "")
  flush(stdout())
}
close(con)

Using the DESeq2 Plugin

bl add ./deseq2-plugin

# In a BioLang script:
import "deseq2"

# Run differential expression
let de = deseq2_de(
  counts_file: "raw_counts.tsv",
  metadata_file: "sample_metadata.tsv",
  design: "~ condition",
  contrast: ["condition", "treated", "control"],
  alpha: 0.05,
  lfc_threshold: 1.0
)

print(str(de.n_significant) + " DE genes ("
      + str(de.n_up) + " up, " + str(de.n_down) + " down)")

# Read the results back into BioLang for further analysis
let results = tsv(de.results_file)
  |> filter(|r| r.padj != None and r.padj < 0.01)
  |> sort_by(|r| r.padj)

# Cross-reference top hits with bio APIs
results |> take(10) |> each(|r| {
  let gene_info = ncbi_gene(r.gene)
  print(r.gene + " log2FC=" + str(r.log2FoldChange)
        + " padj=" + str(r.padj)
        + " — " + gene_info.name)
})

# Get normalized counts for downstream analysis
let norm = deseq2_normalize(
  counts_file: "raw_counts.tsv",
  metadata_file: "sample_metadata.tsv"
)
print("Normalized counts: " + norm.normalized_file)

Writing a TypeScript/Deno Plugin

For plugins that connect to REST APIs or need async I/O, Deno is a good choice.

lims-connector/plugin.json

{
  "name": "lims-connector",
  "version": "1.0.0",
  "description": "Connect to lab LIMS for sample metadata",
  "kind": "deno",
  "entry": "lims_plugin.ts",
  "functions": [
    {
      "name": "lims_samples",
      "description": "Fetch sample metadata from LIMS",
      "params": {
        "project": {"type": "string", "description": "LIMS project ID"},
        "status": {"type": "string", "default": "all", "description": "Filter by status"}
      },
      "returns": {"type": "array"}
    },
    {
      "name": "lims_submit_results",
      "description": "Push analysis results back to LIMS",
      "params": {
        "sample_id": {"type": "string"},
        "results_file": {"type": "string"},
        "qc_status": {"type": "string", "enum": ["pass", "fail", "warning"]}
      },
      "returns": {"type": "object"}
    }
  ]
}

lims_plugin.ts

import { readLines } from "https://deno.land/std@0.177.0/io/mod.ts";

const LIMS_URL = Deno.env.get("LIMS_API_URL") || "https://lims.example.com/api";
const LIMS_TOKEN = Deno.env.get("LIMS_API_TOKEN") || "";

interface Request {
  id: string;
  function: string;
  params: Record<string, unknown>;
}

async function limsSamples(params: Record<string, unknown>) {
  const project = params.project as string;
  const status = (params.status as string) || "all";

  const url = `${LIMS_URL}/projects/${project}/samples?status=${status}`;
  const resp = await fetch(url, {
    headers: { "Authorization": `Bearer ${LIMS_TOKEN}` },
  });

  if (!resp.ok) {
    return { error: { code: "LIMS_ERROR", message: await resp.text() } };
  }

  const data = await resp.json();
  return { result: data.samples };
}

async function limsSubmitResults(params: Record<string, unknown>) {
  const sampleId = params.sample_id as string;
  const resultsFile = params.results_file as string;
  const qcStatus = params.qc_status as string;

  const results = await Deno.readTextFile(resultsFile);

  const resp = await fetch(`${LIMS_URL}/samples/${sampleId}/results`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${LIMS_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ results: JSON.parse(results), qc_status: qcStatus }),
  });

  if (!resp.ok) {
    return { error: { code: "LIMS_ERROR", message: await resp.text() } };
  }

  return { result: { sample_id: sampleId, status: "submitted" } };
}

const dispatch: Record<string, (p: Record<string, unknown>) => Promise<unknown>> = {
  lims_samples: limsSamples,
  lims_submit_results: limsSubmitResults,
};

for await (const line of readLines(Deno.stdin)) {
  if (!line.trim()) continue;

  const request: Request = JSON.parse(line);
  if (request.function === "__shutdown__") break;

  const handler = dispatch[request.function];
  let response;
  if (!handler) {
    response = {
      id: request.id,
      error: { code: "UNKNOWN_FUNCTION", message: `No function: ${request.function}` },
    };
  } else {
    const resp = await handler(request.params);
    response = { id: request.id, ...resp };
  }

  console.log(JSON.stringify(response));
}

Using the LIMS Plugin

import "lims-connector"

# Fetch samples from the LIMS
let samples = lims_samples(project: "WGS-2024-042", status: "sequenced")

# Process each sample
let results = samples |> par_map(|sample| {
  let bam = bwa_mem("GRCh38.fa", sample.fastq_r1, sample.fastq_r2, threads: 8)
    |> samtools_sort(threads: 4)
  let stats = samtools_flagstat(bam)
  let depth = samtools_depth(bam) |> mean()

  let qc = if stats.mapped_pct > 90 and depth > 30 then "pass"
           else if stats.mapped_pct > 80 then "warning"
           else "fail"

  let result = {sample_id: sample.id, mapped_pct: stats.mapped_pct,
                mean_depth: depth, bam: bam}
  result |> write_json(sample.id + "_results.json")

  # Push results back to LIMS
  lims_submit_results(sample_id: sample.id,
                      results_file: sample.id + "_results.json",
                      qc_status: qc)
  result
})

Plugin Discovery

BioLang searches for plugins in this order:

  1. ./plugins/ in the current working directory
  2. ~/.biolang/plugins/ in the user home directory
  3. Paths listed in the BIOLANG_PLUGIN_PATH environment variable

Each directory is scanned for subdirectories containing a plugin.json file.
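
The search order can be sketched in Python. This is illustrative only: the use of `os.pathsep` as the `BIOLANG_PLUGIN_PATH` delimiter and the "earlier roots win" collision rule are assumptions, not documented behavior.

```python
import json
import os
from pathlib import Path

def plugin_search_roots() -> list:
    """Candidate plugin directories, in BioLang's documented search order."""
    roots = [Path("plugins"), Path.home() / ".biolang" / "plugins"]
    extra = os.environ.get("BIOLANG_PLUGIN_PATH", "")
    roots += [Path(p) for p in extra.split(os.pathsep) if p]
    return roots

def discover_plugins(roots: list) -> dict:
    """Scan each root for subdirectories holding a plugin.json manifest."""
    found = {}
    for root in roots:
        if not root.is_dir():
            continue
        for entry in sorted(root.iterdir()):
            manifest = entry / "plugin.json"
            if entry.is_dir() and manifest.is_file():
                meta = json.loads(manifest.read_text())
                found.setdefault(meta["name"], meta)  # earlier roots win
    return found
```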

Plugin Best Practices

Keep plugins focused. A plugin should wrap one tool or one API. If you have BLAST and HMMER, make them separate plugins.

Use stderr for logging. BioLang captures stdout for the JSON protocol. Diagnostic messages, progress indicators, and debug output should go to stderr.
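
A minimal Python sketch of the split (the helper names are hypothetical, not part of any plugin API): protocol JSON goes to stdout, one object per line; everything else goes to stderr.

```python
import json
import sys

def log(message: str) -> None:
    """Diagnostics go to stderr; stdout is reserved for protocol JSON."""
    print(message, file=sys.stderr)

def respond(request_id: str, result) -> None:
    """Emit exactly one JSON object per line on stdout, then flush."""
    sys.stdout.write(json.dumps({"id": request_id, "result": result}) + "\n")
    sys.stdout.flush()
```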

Handle errors gracefully. Return error responses instead of crashing. The error response format lets BioLang surface meaningful messages to the user.

Document parameters. The description fields in plugin.json are shown by :plugins in the REPL and bl plugins on the command line.

Pin versions. If your plugin depends on an external tool, document the required version in the manifest description so users know what to install.

Summary

The plugin system extends BioLang with any language that can read and write JSON on stdio. Write a plugin.json manifest, implement the stdin/stdout dispatch loop, and install with bl add. Plugins integrate fully with pipes, parallel operations, and the rest of BioLang. Python plugins wrap command-line bioinformatics tools, R plugins bring statistical packages like DESeq2, and Deno/TypeScript plugins connect to REST APIs and external services.

Appendix A: Builtin Reference

BioLang ships with a comprehensive standard library of builtins designed for bioinformatics workflows. Every function listed here is available without imports – they are part of the language runtime.


Sequence Operations

Builtins that operate on bio-typed sequences (dna, rna, protein).

| Builtin | Description |
|---------|-------------|
| `complement(seq) -> Dna \| Rna` | Watson-Crick complement of a nucleotide sequence |
| `reverse_complement(seq) -> Dna \| Rna` | Reverse complement – the opposing strand |
| `transcribe(seq) -> Rna` | Transcribe DNA to RNA (T to U) |
| `translate(seq) -> Protein` | Translate an RNA or DNA coding sequence to amino acids |
| `gc_content(seq) -> Float` | GC fraction of a nucleotide sequence (0.0 – 1.0) |
| `find_motif(seq, pattern) -> List[Int]` | All start positions where pattern occurs in seq |
| `hamming_distance(a, b) -> Int` | Number of mismatched positions between equal-length sequences |
| `edit_distance(a, b) -> Int` | Edit distance between two sequences |
| `find_orfs(seq, min_len?) -> List[Record]` | Open reading frames with start, stop, and frame fields |
| `restriction_sites(seq, enzyme?) -> List[Record]` | Recognition sites for restriction enzymes |
| `tm(seq) -> Float` | Melting temperature estimate for a short oligonucleotide |
| `slice(seq, start, end) -> Dna \| Rna \| Protein` | Extract a subsequence by 0-based coordinates |

# Example: quick primer analysis
let primer = dna"ATCGATCGATCG"
let rc     = reverse_complement(primer)
let temp   = tm(primer)
print("Primer Tm = " + str(temp) + "C, reverse complement = " + str(rc))
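
The heuristic behind the builtin `tm()` is not specified here; for short oligos, the Wallace rule (Tm ≈ 2·(A+T) + 4·(G+C)) is one common estimate. A Python sketch, assuming the Wallace rule:

```python
def wallace_tm(seq: str) -> float:
    """Wallace-rule melting temperature: 2*(A+T) + 4*(G+C) degrees C.
    A reasonable estimate only for short oligos (roughly under 14 nt)."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2.0 * at + 4.0 * gc
```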

Collection Operations

General-purpose operations on lists, records, and sets.

| Builtin | Description |
|---------|-------------|
| `len(coll) -> Int` | Number of elements in a list, string, or sequence |
| `push(list, item) -> List` | Append an element, returning a new list |
| `pop(list) -> List` | Remove the last element, returning a new list |
| `concat(a, b) -> List` | Concatenate two lists |
| `flatten(nested) -> List` | Flatten one level of nesting |
| `reverse(list) -> List` | Reverse element order |
| `contains(coll, item) -> Bool` | True if item is present |
| `index_of(list, item) -> Int \| None` | First index of item, or None |
| `last(list) -> Any` | Last element |
| `first(list) -> Any` | First element |
| `head(list, n) -> List` | First n elements |
| `tail(list, n) -> List` | Last n elements |
| `unique(list) -> List` | Remove duplicates, preserving order |
| `zip(a, b) -> List` | Pair elements from two lists into a list of tuples |
| `enumerate(list) -> List` | Pair each element with its 0-based index |
| `chunk(list, size) -> List[List]` | Split into fixed-size sublists |
| `window(list, size) -> List[List]` | Sliding window of the given size |
| `scan(list, init, fn) -> List` | Running accumulation (like reduce but keeps intermediates) |
| `range(start, end, step?) -> List` | Integer range |
| `set(list) -> Set` | Convert a list to a deduplicated set |
| `keys(record) -> List[Str]` | Field names of a record |
| `values(record) -> List` | Field values of a record |
| `has_key(record, key) -> Bool` | True if the record contains the named field |
| `sort_by(list, fn) -> List` | Sort by a key function |
| `group_by(list, fn) -> Record` | Group elements by a key function into a record of lists |
| `partition(list, fn) -> [List, List]` | Split into elements that pass and fail a predicate |

# Example: enumerate quality-filtered reads
let good_reads = reads
  |> filter(|r| mean_phred(r.quality) > 30)
  |> enumerate()
  |> head(5)
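
The semantics of `chunk` and `window` are easy to pin down with a Python equivalent (assuming the final chunk may be shorter than `size` and that windows are dense, stepping by one):

```python
def chunk(items: list, size: int) -> list:
    """Fixed-size sublists; the last chunk may be shorter."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def window(items: list, size: int) -> list:
    """Every contiguous run of `size` elements, stepping by one."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]
```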

Higher-Order Functions

Functions that accept other functions as arguments – the backbone of BioLang’s pipeline style.

| Builtin | Description |
|---------|-------------|
| `map(coll, fn) -> List` | Apply fn to every element |
| `filter(coll, fn) -> List` | Keep elements where fn returns true |
| `reduce(coll, init, fn) -> Any` | Fold elements into a single value |
| `sort(coll, fn?) -> List` | Sort, optionally by comparator |
| `each(coll, fn) -> None` | Execute fn for side effects on every element |
| `flat_map(coll, fn) -> List` | Map then flatten one level |
| `take_while(coll, fn) -> List` | Take leading elements while predicate holds |
| `any(coll, fn) -> Bool` | True if fn returns true for at least one element |
| `all(coll, fn) -> Bool` | True if fn returns true for every element |
| `none(coll, fn) -> Bool` | True if fn returns true for no elements |
| `find(coll, fn) -> Any \| None` | First element satisfying fn |
| `find_index(coll, fn) -> Int \| None` | Index of first element satisfying fn |
| `par_map(coll, fn) -> List` | Parallel map across available cores |
| `par_filter(coll, fn) -> List` | Parallel filter across available cores |
| `mat_map(matrix, fn) -> Matrix` | Apply fn element-wise to a matrix |
| `try_call(fn, args) -> Result` | Call fn with args, capturing errors instead of panicking |

# Example: parallel GC content across a genome's chromosomes
let gc_values = chromosomes
  |> par_map(|chr| {name: chr.name, gc: gc_content(chr.seq)})
  |> sort_by(|r| r.gc)

Table Operations

Tabular data manipulation inspired by dataframe semantics – designed for sample sheets, variant tables, and expression matrices.

| Builtin | Description |
|---------|-------------|
| `table(data) -> Table` | Create a table from a list of records |
| `select(tbl, ...cols) -> Table` | Pick columns by name |
| `mutate(tbl, name, fn) -> Table` | Add or transform a column |
| `summarize(tbl, ...aggs) -> Table` | Aggregate columns (sum, mean, etc.) |
| `group_by(tbl, col) -> GroupedTable` | Group rows by a column value (table variant) |
| `join(a, b, on, how?) -> Table` | Join two tables on a key column |
| `csv(path) -> Table` | Read a CSV file into a table |
| `tsv(path) -> Table` | Read a TSV file into a table |
| `write_csv(tbl, path) -> None` | Write a table to CSV |
| `write_tsv(tbl, path) -> None` | Write a table to TSV |
| `nrow(tbl) -> Int` | Number of rows |
| `ncol(tbl) -> Int` | Number of columns |
| `colnames(tbl) -> List[Str]` | Column name list |
| `row_names(tbl) -> List[Str]` | Row name list (if set) |

# Example: summarize variant counts per chromosome
tsv("variants.tsv")
  |> group_by("chrom")
  |> summarize(count: len, mean_qual: |rows| mean(rows |> map(|r| r.quality)))
  |> write_csv("chrom_summary.csv")

Bio File I/O

Read and write standard bioinformatics file formats. Readers return lazy streams that integrate with pipes; writers flush on completion.

| Builtin | Description |
|---------|-------------|
| `read_fasta(path) -> List[Record]` | Parse FASTA; each record has id, desc, seq fields |
| `read_fastq(path) -> List[Record]` | Parse FASTQ; each record has id, seq, quality, length fields |
| `read_vcf(path) -> List[Record]` | Parse VCF; each record has chrom, pos, ref, alt, qual, info fields |
| `read_bed(path) -> List[Record]` | Parse BED; each record has chrom, start, end and optional fields |
| `read_gff(path) -> List[Record]` | Parse GFF/GTF; each record has seqid, type, start, end, attributes |
| `write_fasta(records, path) -> None` | Write records to FASTA format |
| `write_fastq(records, path) -> None` | Write records to FASTQ format |
| `write_bed(records, path) -> None` | Write records to BED format |

# Example: filter FASTQ reads by quality and write survivors
read_fastq("sample_R1.fastq.gz")
  |> filter(|r| mean_phred(r.quality) > 30)
  |> write_fastq("sample_R1.filtered.fq")
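
`mean_phred` decodes FASTQ quality strings. Assuming the usual Phred+33 (Sanger/Illumina 1.8+) encoding, the decoding works like this Python sketch:

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 by default).
    Each character encodes a score as its code point minus the offset."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores)
```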

Genomic Intervals

Interval arithmetic for coordinate-based genomic analysis. Intervals carry chrom, start, end, and optional strand and data fields.

| Builtin | Description |
|---------|-------------|
| `interval(chrom, start, end, strand?) -> Interval` | Create a genomic interval |
| `interval_tree(intervals) -> IntervalTree` | Build an index for fast overlap queries |
| `query_overlaps(tree, query) -> List[Interval]` | All intervals overlapping the query |
| `query_nearest(tree, query, k?) -> List[Interval]` | K nearest intervals to the query |
| `coverage(intervals) -> List[Record]` | Per-base or per-region coverage depth |
| `merge_intervals(intervals, dist?) -> List[Interval]` | Merge overlapping or nearby intervals |
| `intersect_intervals(a, b) -> List[Interval]` | Pairwise intersection of two interval sets |
| `subtract_intervals(a, b) -> List[Interval]` | Regions in a not covered by b |

# Example: find promoter-peak overlaps
let promoters = read_bed("examples/sample-data/promoters.bed") |> map(|r| interval(r.chrom, r.start, r.end))
let peaks     = read_bed("examples/sample-data/chip_peaks.bed") |> map(|r| interval(r.chrom, r.start, r.end))
let tree      = interval_tree(peaks)
let hits      = promoters |> flat_map(|p| query_overlaps(tree, p))
print("Found " + str(len(hits)) + " promoter-peak overlaps")
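
The overlap semantics behind `query_overlaps` are the standard half-open interval test. A linear-scan Python sketch (it illustrates the predicate, not the indexed tree the builtin uses):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (BED-style)
    end: int    # exclusive

def overlaps(a: Interval, b: Interval) -> bool:
    """Half-open overlap: intervals touching only at a boundary do not overlap."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def query_overlaps(intervals: list, query: Interval) -> list:
    """O(n) scan; an interval tree answers the same query in O(log n + k)."""
    return [iv for iv in intervals if overlaps(iv, query)]
```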

Variants

Builtins for working with genetic variant records. Variant objects carry chrom, pos, ref, alt, qual, and info fields.

| Builtin | Description |
|---------|-------------|
| `variant(chrom, pos, ref, alt) -> Variant` | Construct a variant record |
| `is_snp(v) -> Bool` | True if single-nucleotide polymorphism |
| `is_indel(v) -> Bool` | True if insertion or deletion |
| `is_transition(v) -> Bool` | True if purine-purine or pyrimidine-pyrimidine substitution |
| `is_transversion(v) -> Bool` | True if purine-pyrimidine substitution |
| `variant_type(v) -> Str` | Classification string: `"snp"`, `"ins"`, `"del"`, `"mnv"`, `"complex"` |
| `is_het(v) -> Bool` | True if heterozygous genotype |
| `is_hom_ref(v) -> Bool` | True if homozygous reference |
| `is_hom_alt(v) -> Bool` | True if homozygous alternate |
| `is_multiallelic(v) -> Bool` | True if more than one alt allele |
| `parse_vcf_info(info_str) -> Record` | Parse a VCF INFO field string into a record |
| `variant_summary(variants) -> Record` | Aggregate counts of SNPs, indels, Ti/Tv ratio, het/hom ratio |

# Example: compute Ti/Tv ratio for a VCF
let vars = read_vcf("examples/sample-data/calls.vcf") |> filter(|v| v.qual > 30)
let summary = variant_summary(vars)
print("Ti/Tv = " + str(summary.ti_tv_ratio) + ", SNPs = " + str(summary.snp_count))
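
The transition/transversion classification used by `is_transition` and `variant_summary` follows directly from base chemistry: A and G are purines, C and T are pyrimidines. In Python:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """Purine-purine or pyrimidine-pyrimidine substitution."""
    return (ref in PURINES and alt in PURINES) or (
        ref in PYRIMIDINES and alt in PYRIMIDINES
    )

def ti_tv_ratio(snps: list) -> float:
    """Transition/transversion ratio over (ref, alt) SNP pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    return ti / (len(snps) - ti)
```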

Statistics

Statistical functions for quality control, expression analysis, and hypothesis testing.

| Builtin | Description |
|---------|-------------|
| `mean(xs) -> Float` | Arithmetic mean |
| `median(xs) -> Float` | Median value |
| `stdev(xs) -> Float` | Sample standard deviation |
| `variance(xs) -> Float` | Sample variance |
| `quantile(xs, q) -> Float` | Quantile at fraction q (0.0 – 1.0) |
| `min(xs) -> Num` | Minimum value |
| `max(xs) -> Num` | Maximum value |
| `sum(xs) -> Num` | Sum of all elements |
| `cor(xs, ys) -> Float` | Pearson correlation coefficient |
| `ttest(xs, ys) -> Record` | Two-sample t-test; returns {statistic, pvalue} |
| `chi_square(observed, expected) -> Record` | Chi-squared test; returns {statistic, pvalue, df} |
| `wilcoxon(xs, ys) -> Record` | Wilcoxon rank-sum test |
| `anova(groups) -> Record` | One-way ANOVA across groups |
| `fisher_exact(table) -> Record` | Fisher's exact test on a 2x2 contingency table |
| `p_adjust(pvals, method?) -> List[Float]` | Multiple testing correction (default: Benjamini-Hochberg) |
| `lm(xs, ys) -> Record` | Simple linear regression; returns {slope, intercept, r_squared} |
| `ks_test(xs, ys) -> Record` | Kolmogorov-Smirnov test |
| `mean_phred(quals) -> Float` | Mean Phred quality score from a quality string |

# Example: differential expression significance
let control   = [5.2, 4.8, 5.1, 4.9]
let treatment = [8.1, 7.5, 8.3, 7.9]
let result    = ttest(control, treatment)
print("p-value = " + str(result.pvalue))
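
`p_adjust` defaults to Benjamini-Hochberg. The procedure sorts the p-values, scales each by n/rank, then enforces monotonicity from the largest p-value down. A Python sketch:

```python
def p_adjust_bh(pvals: list) -> list:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n - 1, -1, -1):  # walk from largest p to smallest
        i = order[rank]
        running_min = min(running_min, pvals[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```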

Linear Algebra

Matrix operations for expression matrices, PCA, distance calculations, and numerical biology.

| Builtin | Description |
|---------|-------------|
| `matrix(data) -> Matrix` | Create a matrix from a list of lists (row-major) |
| `transpose(m) -> Matrix` | Transpose rows and columns |
| `mat_mul(a, b) -> Matrix` | Matrix multiplication |
| `determinant(m) -> Float` | Determinant of a square matrix |
| `inverse(m) -> Matrix` | Matrix inverse |
| `eigenvalues(m) -> List[Float]` | Eigenvalues of a square matrix |
| `svd(m) -> Record` | Singular value decomposition; returns {u, s, vt} |
| `solve(a, b) -> Matrix` | Solve the linear system Ax = b |
| `trace(m) -> Float` | Sum of diagonal elements |
| `norm(m, p?) -> Float` | Matrix or vector norm (default: Frobenius / L2) |
| `rank(m) -> Int` | Numerical rank |
| `identity(n) -> Matrix` | n x n identity matrix |
| `zeros(rows, cols) -> Matrix` | Matrix of zeros |
| `ones(rows, cols) -> Matrix` | Matrix of ones |
| `diag(values) -> Matrix` | Diagonal matrix from a list of values |
| `mat_map(m, fn) -> Matrix` | Apply fn element-wise |

# Example: PCA on a gene expression matrix
let expr = tsv("examples/sample-data/counts.tsv") |> table()
let m    = matrix(expr |> select("gene_a", "gene_b", "gene_c"))
let decomp = svd(m)
print("Top 3 singular values: " + str(head(decomp.s, 3)))
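
The SVD-based PCA above can be approximated without a linear algebra library: center the data, form the covariance matrix, and run power iteration to pull out the leading eigenvector. A pure-Python sketch of the math (not how BioLang implements `svd`):

```python
def top_principal_component(rows: list, iters: int = 200) -> list:
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]  # centered data
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v
```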

Math

Standard mathematical functions available for scoring, normalization, and modeling.

| Builtin | Description |
|---------|-------------|
| `abs(x) -> Num` | Absolute value |
| `ceil(x) -> Int` | Round up to nearest integer |
| `floor(x) -> Int` | Round down to nearest integer |
| `round(x, digits?) -> Float` | Round to digits decimal places (default: 0) |
| `sqrt(x) -> Float` | Square root |
| `log(x) -> Float` | Natural logarithm |
| `log2(x) -> Float` | Base-2 logarithm (common in fold-change analysis) |
| `log10(x) -> Float` | Base-10 logarithm |
| `exp(x) -> Float` | Euler's number raised to x |
| `pow(base, exp) -> Float` | Exponentiation |
| `sin(x) -> Float` | Sine |
| `cos(x) -> Float` | Cosine |
| `tan(x) -> Float` | Tangent |
| `ode_solve(fn, y0, t_span, dt?) -> List[Record]` | Numerical ODE integration (Runge-Kutta) |

# Example: log2 fold-change between conditions
let control   = 12.5
let treatment = 50.0
let lfc = log2(treatment / control)
print("Log2 fold-change = " + str(lfc))

String Operations

Text manipulation for parsing identifiers, annotations, and formatted output.

| Builtin | Description |
|---------|-------------|
| `split(s, delim) -> List[Str]` | Split string on delimiter |
| `join(list, delim) -> Str` | Join list elements into a string |
| `trim(s) -> Str` | Strip leading and trailing whitespace |
| `upper(s) -> Str` | Convert to uppercase |
| `lower(s) -> Str` | Convert to lowercase |
| `starts_with(s, prefix) -> Bool` | True if s begins with prefix |
| `ends_with(s, suffix) -> Bool` | True if s ends with suffix |
| `replace(s, from, to) -> Str` | Replace all occurrences |
| `matches(s, pattern) -> Bool` | True if regex pattern matches |
| `format(template, ...args) -> Str` | Printf-style formatting |

BioLang also supports f-strings for inline interpolation:

# Example: parse a FASTA header
let header = ">sp|P12345|MYG_HUMAN Myoglobin OS=Homo sapiens"
let parts  = split(header, "|")
let acc    = parts[1]
print(f"Accession: {acc}")

Type Operations

Runtime type inspection and conversion – useful for dynamic dispatch in pipelines that handle mixed bio types.

| Builtin | Description |
|---------|-------------|
| `type(val) -> Str` | Runtime type name as a string |
| `is_dna(val) -> Bool` | True if val is a DNA sequence |
| `is_rna(val) -> Bool` | True if val is an RNA sequence |
| `is_protein(val) -> Bool` | True if val is a protein sequence |
| `is_interval(val) -> Bool` | True if val is a genomic interval |
| `is_variant(val) -> Bool` | True if val is a variant record |
| `is_record(val) -> Bool` | True if val is a record |
| `is_list(val) -> Bool` | True if val is a list |
| `is_table(val) -> Bool` | True if val is a table |
| `is_nil(val) -> Bool` | True if val is nil |
| `is_int(val) -> Bool` | True if val is an integer |
| `is_float(val) -> Bool` | True if val is a float |
| `is_str(val) -> Bool` | True if val is a string |
| `is_bool(val) -> Bool` | True if val is a boolean |
| `into(val, target_type) -> Any` | Convert between compatible types |

# Example: route processing based on sequence type
let seq = read_fasta("input.fa") |> first() |> |r| r.seq
if is_dna(seq) then
  print("DNA, GC = " + str(gc_content(seq)))
else if is_protein(seq) then
  print("Protein, length = " + str(len(seq)))

Bio APIs

Remote database queries for annotation enrichment. All API builtins are async-aware and return structured records.

| Builtin | Description |
|---------|-------------|
| `ncbi_search(db, query, max?) -> List[Record]` | Search NCBI Entrez databases |
| `ncbi_gene(gene_id) -> Record` | Fetch NCBI Gene record |
| `ncbi_sequence(acc) -> Record` | Fetch sequence by accession from NCBI Nucleotide/Protein |
| `ensembl_gene(ensembl_id) -> Record` | Ensembl gene lookup by Ensembl ID |
| `ensembl_symbol(species, symbol) -> Record` | Ensembl gene lookup by species and symbol |
| `ensembl_vep(variants) -> List[Record]` | Variant Effect Predictor annotation |
| `uniprot_search(query, max?) -> List[Record]` | Search UniProt by keyword or accession |
| `uniprot_entry(acc) -> Record` | Full UniProt entry |
| `ucsc_sequence(genome, chrom, start, end) -> Dna` | Fetch genomic sequence from UCSC DAS |
| `kegg_get(entry) -> Record` | Retrieve a KEGG database entry |
| `kegg_find(db, query) -> List[Record]` | Search KEGG databases |
| `string_network(proteins, species?) -> Record` | STRING protein-protein interaction network |
| `pdb_entry(pdb_id) -> Record` | Fetch PDB structure metadata |
| `reactome_pathways(gene) -> List[Record]` | Reactome pathway memberships for a gene |
| `go_term(go_id) -> Record` | Gene Ontology term details |
| `go_annotations(gene, species?) -> List[Record]` | GO annotations for a gene |
| `cosmic_gene(symbol) -> Record` | COSMIC cancer gene census entry |
| `datasets_gene(symbol, taxon?) -> Record` | NCBI Datasets gene data |
| `biomart_query(dataset, attributes, filters?) -> Table` | BioMart query returning a table |
| `nfcore_list(sort_by?, limit?) -> List[Record]` | List nf-core pipelines |
| `nfcore_search(query, limit?) -> List[Record]` | Search nf-core pipelines by name/topic |
| `nfcore_info(name) -> Record` | Detailed nf-core pipeline metadata |
| `nfcore_releases(name) -> List[Record]` | Release history for an nf-core pipeline |
| `nfcore_params(name) -> Record` | Parameter schema for an nf-core pipeline |
| `biocontainers_search(query, limit?) -> List[Record]` | Search BioContainers registry |
| `biocontainers_popular(limit?) -> List[Record]` | Popular BioContainers tools |
| `biocontainers_info(tool) -> Record` | Detailed tool info with versions |
| `biocontainers_versions(tool) -> List[Record]` | All versions with container image URIs |
| `galaxy_search(query, limit?) -> List[Record]` | Search Galaxy ToolShed repositories |
| `galaxy_popular(limit?) -> List[Record]` | Popular Galaxy ToolShed tools |
| `galaxy_categories() -> List[Record]` | Galaxy ToolShed tool categories |
| `galaxy_tool(owner, name) -> Record` | Galaxy ToolShed repository details |
| `nf_parse(path) -> Record` | Parse a Nextflow .nf file into a structured Record |
| `nf_to_bl(record) -> Str` | Generate BioLang pipeline code from parsed Nextflow |
| `galaxy_to_bl(record) -> Str` | Generate BioLang pipeline code from Galaxy workflow |
| `api_endpoints() -> Record` | Show current API endpoint URLs |

# Example: annotate a gene list with pathway data
let genes = ["BRCA1", "TP53", "EGFR"]
genes |> each(|g| {
  let pathways = reactome_pathways(g)
  print(g + ": " + str(len(pathways)) + " pathways")
})

Utility

General-purpose helpers for debugging, timing, unit conversion, and serialization.

| Builtin | Description |
|---------|-------------|
| `print(val) -> None` | Print a value followed by a newline |
| `assert(cond, msg?) -> None` | Panic with msg if cond is false |
| `sleep(ms) -> None` | Pause execution for ms milliseconds |
| `now() -> Float` | Current timestamp in seconds since epoch |
| `elapsed(start) -> Float` | Seconds elapsed since start (from now()) |
| `bp(n) -> Int` | Identity; documents that n is in base pairs |
| `kb(n) -> Int` | Convert kilobases to base pairs (n * 1000) |
| `mb(n) -> Int` | Convert megabases to base pairs (n * 1_000_000) |
| `gb(n) -> Int` | Convert gigabases to base pairs (n * 1_000_000_000) |
| `to_json(val) -> Str` | Serialize any value to a JSON string |
| `from_json(s) -> Any` | Parse a JSON string into a BioLang value |
| `env(name) -> Str \| None` | Read an environment variable |
| `exit(code?) -> Never` | Terminate the process with an exit code (default: 0) |

# Example: time a heavy operation
let t0 = now()
let result = read_fasta("genome.fa")
  |> flat_map(|r| find_orfs(r.seq, 300))
print("Found " + str(len(result)) + " ORFs in " + str(elapsed(t0)) + "s")

Appendix B: Operator Reference

BioLang operators are listed below from highest precedence (binds tightest) to lowest. Within the same precedence level, operators are left-associative unless noted otherwise.


Precedence Table

1. Access and Call (highest)

| Symbol | Name | Description | Example |
|---|---|---|---|
| `.` | Field access | Access a record field or method | `variant.chrom` |
| `()` | Call | Invoke a function or builtin | `gc_content(seq)` |
| `[]` | Index | Index into a list, string, or sequence | `reads[0]` |

2. Exponentiation

| Symbol | Name | Description | Example |
|---|---|---|---|
| `**` | Power | Raise to a power (right-associative) | `2 ** 10` gives `1024` |

3. Unary

| Symbol | Name | Description | Example |
|---|---|---|---|
| `-` | Negate | Arithmetic negation | `-log2(fc)` |
| `!` | Logical NOT | Boolean negation | `!is_snp(v)` |
| `not` | Logical NOT (word) | Same as `!`; reads more naturally in predicates | `not is_indel(v)` |
| `~` | Bitwise NOT | Bitwise complement | `~flags` |

4. Multiplicative

| Symbol | Name | Description | Example |
|---|---|---|---|
| `*` | Multiply | Arithmetic multiplication | `coverage * depth` |
| `/` | Divide | Arithmetic division | `gc_count / len(seq)` |
| `%` | Modulo | Remainder after division; useful for reading-frame math | `pos % 3` |

5. Additive

| Symbol | Name | Description | Example |
|---|---|---|---|
| `+` | Add | Arithmetic addition; also concatenates strings | `start + kb(1)` |
| `-` | Subtract | Arithmetic subtraction | `end - start` |

6. Type Cast

| Symbol | Name | Description | Example |
|---|---|---|---|
| `as` | Cast | Convert between compatible types | `qual as Int` |

7. Ranges

| Symbol | Name | Description | Example |
|---|---|---|---|
| `..` | Range (exclusive) | Half-open range; useful for genomic coordinates | `0..len(seq)` |
| `..=` | Range (inclusive) | Closed range | `1..=22` for autosomal chromosomes |
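Half-open ranges pair naturally with slicing; a sketch that assumes BioLang sequences accept a range inside `[]`:

```
let seq = dna"ATCGATCGATCG"
let head = seq[0..4]    # half-open: bases 0 through 3
```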

8. Bit Shifts

| Symbol | Name | Description | Example |
|---|---|---|---|
| `<<` | Left shift | Shift bits left | `1 << 4` |
| `>>` | Right shift | Shift bits right; useful for SAM flag decoding | `flags >> 8` |

9. Bitwise AND

| Symbol | Name | Description | Example |
|---|---|---|---|
| `&` | Bitwise AND | Test individual flag bits in BAM flags | `flags & 0x4` to check unmapped |
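Because `&` (level 9) binds tighter than comparison (level 11), SAM flag tests need no parentheses; a sketch using the standard flag bits:

```
# SAM flags: 0x4 = read unmapped, 0x10 = reverse strand
let flags = 0x14
let unmapped = flags & 0x4 != 0     # parses as (flags & 0x4) != 0
let reverse = flags & 0x10 != 0
```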

10. Bitwise XOR

| Symbol | Name | Description | Example |
|---|---|---|---|
| `^` | Bitwise XOR | Exclusive or | `a ^ b` |

11. Comparison

| Symbol | Name | Description | Example |
|---|---|---|---|
| `==` | Equal | Structural equality | `variant_type(v) == "snp"` |
| `!=` | Not equal | Structural inequality | `v.chrom != "chrM"` |
| `<` | Less than | Numeric or lexicographic comparison | `v.qual < 30` |
| `>` | Greater than | | `gc_content(seq) > 0.6` |
| `<=` | Less or equal | | `len(seq) <= kb(1)` |
| `>=` | Greater or equal | | `depth >= 10` |
| `~` | Regex match | True if the left string matches the right pattern | `header ~ "^>chr[0-9]+"` |
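The regex-match form of `~` drops straight into filter predicates; a sketch over illustrative FASTA headers:

```
let headers = [">chr1 primary", ">chrM mitochondrial", ">scaffold_12"]
let autosomes = headers |> filter(|h| h ~ "^>chr[0-9]+")
# keeps only ">chr1 primary": "chrM" and "scaffold_12" lack a digit after "chr"
```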

12. Logical AND

| Symbol | Name | Description | Example |
|---|---|---|---|
| `&&` | AND | Short-circuiting logical and | `is_snp(v) && v.qual > 30` |
| `and` | AND (word) | Same as `&&`; reads naturally in filters | `is_snp(v) and v.qual > 30` |

13. Logical OR

| Symbol | Name | Description | Example |
|---|---|---|---|
| `\|\|` | OR | Short-circuiting logical or | `is_het(v) \|\| is_hom_alt(v)` |
| `or` | OR (word) | Same as `\|\|` | `is_het(v) or is_hom_alt(v)` |

14. Null Coalesce

| Symbol | Name | Description | Example |
|---|---|---|---|
| `??` | Null coalesce | Use the right-hand value when the left is nil | `v.info["AF"] ?? 0.0` |
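`??` (level 14) sits below comparison (level 11) in precedence, so a coalesced value used in a condition needs parentheses; a sketch for optional INFO fields:

```
# Without parentheses this would parse as v.info["AF"] ?? (0.0 > 0.01)
variants
    |> filter(|v| (v.info["AF"] ?? 0.0) > 0.01)
```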

15. Pipe

| Symbol | Name | Description | Example |
|---|---|---|---|
| `\|>` | Pipe | Pass the left-hand value as the first argument to the right-hand function | `reads \|> filter(\|r\| mean_phred(r.quality) > 30)` |
| `\|>>` | Tap pipe | Like pipe, but passes the value through unchanged; used for side effects | `reads \|>> print \|> len()` |

The pipe operators are the idiomatic way to build multi-step analysis pipelines in BioLang:

read_fastq("sample.fq")
  |> filter(|r| mean_phred(r.quality) > 25)
  |>> |reads| print("After QC: " + str(len(reads)) + " reads")
  |> map(|r| {id: r.id, gc: gc_content(r.seq)})
  |> write_csv("gc_report.csv")

16. Assignment (lowest)

| Symbol | Name | Description | Example |
|---|---|---|---|
| `=` | Assign | Bind a value to a name | `let gc = gc_content(seq)` |
| `+=` | Add-assign | Increment in place | `count += 1` |
| `-=` | Subtract-assign | Decrement in place | `remaining -= 1` |
| `*=` | Multiply-assign | Multiply in place | `score *= weight` |
| `/=` | Divide-assign | Divide in place | `total /= n` |
| `?=` | Nil-assign | Assign only if the target is currently nil | `cache ?= compute()` |
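`?=` makes lazy one-time initialization a one-liner; a sketch where `build_index` is a hypothetical helper:

```
# Runs build_index only if index is currently nil
index ?= build_index("genome.fa")
index ?= build_index("genome.fa")    # second call is a no-op
```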

17. Structural (not expression operators)

These symbols appear in specific syntactic positions and do not participate in general expression precedence.

| Symbol | Name | Description | Example |
|---|---|---|---|
| `=>` | Fat arrow | Separates patterns from bodies in `match` and `given` arms | `match v { "snp" => handle_snp() }` |
| `->` | Arrow | Return type annotation in function signatures | `fn gc(seq: Dna) -> Float` |
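`=>` and `->` often appear together in a dispatching function; a sketch that assumes a `_` wildcard arm and `match` as an expression:

```
fn describe(v: Variant) -> Str
    match variant_type(v) {
        "snp" => "single-nucleotide variant",
        "indel" => "insertion/deletion",
        _ => "other"
    }
end
```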

Newlines and Statement Termination

BioLang uses newlines as statement terminators – there are no semicolons. A newline ends the current expression unless it is suppressed.

Newline Suppression Rules

A newline does not terminate a statement when it appears:

  1. After a binary operator. The expression continues on the next line.

    let total = a +
        b +
        c
    
  2. After an opening delimiter ((, [, {). The expression continues until the matching closing delimiter.

    let record = {
        chrom: "chr1",
        start: 1000,
        end: 2000
    }
    
  3. After a pipe operator (|> or |>>). The pipeline continues on the next line.

    reads
      |> filter(|r| mean_phred(r.quality) > 30)
      |> map(|r| r.seq)
    
  4. After a comma inside argument lists, list literals, and record literals.

    let xs = [
        1,
        2,
        3
    ]
    
  5. After keywords that expect a continuation: then, else, do, and, or.

These rules mean that multi-line pipelines and data structures work naturally without any explicit continuation characters.
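Rule 5 in action: a trailing `and` keeps a multi-line predicate alive, as in this sketch:

```
let keep = is_snp(v) and
    v.qual > 30 and
    v.chrom != "chrM"
```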

Blank Lines

Blank lines (lines containing only whitespace) are ignored between statements. Use them freely to organize code into logical sections.


Comments

| Syntax | Name | Description |
|---|---|---|
| `#` | Line comment | Everything from `#` to end of line is ignored |
| `##` | Doc comment | Attached to the following declaration; extractable by documentation tools |

# This is a regular comment

## Compute GC content for a DNA sequence.
## Returns a float between 0.0 and 1.0.
fn gc(seq: Dna) -> Float
  gc_content(seq)
end

Doc comments (##) attach to the immediately following fn, let, or type declaration and are preserved in the AST for tooling and auto-generated documentation.