Language Overview

BioLang is a domain-specific language (DSL) for bioinformatics, featuring pipe-first syntax and designed from the ground up for genomics and computational biology. It combines the expressiveness of modern functional languages with first-class support for biological data types, tabular operations, and streaming genomic pipelines.

Design Philosophy

BioLang is guided by four core principles that shape every language decision:

  • Pipe-first composition — Data flows left to right through transformation chains using the |> operator, mirroring the way bioinformatics pipelines are conceptualized.
  • Biology-native types — DNA, RNA, and protein sequences are first-class values with literal syntax, not strings with ad-hoc parsing.
  • Tables as data — Tabular data is the lingua franca of genomics. BioLang provides dplyr-style verbs directly in the language.
  • Streaming by default — Genomic datasets are often too large to fit in memory. Lazy streams let you process terabyte-scale files with constant memory.

A Quick Taste

Here is a complete BioLang program that reads a FASTQ file, filters reads by quality, and writes the result:

# Filter high-quality reads from a FASTQ file
import { read_fastq } from "bio/io"

let reads = read_fastq("sample.fastq")
  |> filter(|r| mean_phred(r.quality) >= 30)
  |> filter(|r| r.length >= 100)

print(f"Kept {len(reads)} high-quality reads")

Key Language Features

Pipe Operator

The pipe operator |> passes the left-hand value as the first argument to the right-hand function. This creates readable, linear data flows:

# Without pipes — nested, hard to read
let result = summarize(group_by(filter(data, |r| r.score > 0.5), "gene"), mean(score))

# With pipes — linear, clear
let result = data
  |> filter(|r| r.score > 0.5)
  |> group_by("gene")
  |> summarize(mean(score))

Bio Literals

BioLang provides literal syntax for biological sequences. These are typed values, not plain strings, enabling compile-time validation and specialized operations:

let seq = dna"ATCGATCGATCG"
let rna_seq = rna"AUCGAUCGAUCG"
let prot = protein"MKTLLILAVS"

# Reverse complement is a method, not a string hack
let rc = seq |> reverse_complement()

# Translation uses the standard genetic code
let translated = seq |> transcribe() |> translate()

Pattern Matching

BioLang supports expressive pattern matching with destructuring, guards, and wildcard patterns:

let result = match classify_variant(variant) {
  "missense" => analyze_protein_impact(variant),
  "nonsense" => { flag_lof(variant); "loss_of_function" },
  "synonymous" if variant.conservation > 0.9 => "conserved_silent",
  _ => "benign"
}

Table Operations

Tables are a built-in type with familiar dplyr-style verbs for data manipulation:

let variants = read_vcf("calls.vcf")
  |> select("chrom", "pos", "ref_allele", "alt_allele", "qual")
  |> filter(|v| v.qual >= 30.0)
  |> arrange(desc(qual))
  |> group_by("chrom")
  |> summarize(
    count = n(),
    avg_qual = mean(qual),
    max_qual = max(qual)
  )

Lazy Streams

Streams enable lazy, memory-efficient processing of large datasets. No data is read until the stream is consumed:

# Process a 100 GB FASTQ file with constant memory
let high_qual = read_fastq("reads.fq.gz")
  |> to_stream()
  |> filter(|r| mean_phred(r.quality) >= 30)
  |> filter(|r| r.length >= 100)
  |> take(1_000_000)
  |> collect()

Type System

BioLang uses gradual typing with full type inference. You can add type annotations when you want extra safety, or leave them off for concise scripts:

# Inferred types — concise
let x = 42
let name = "BRCA1"
let seq = dna"ATCG"

# Explicit annotations — self-documenting
let x: Int = 42
let name: String = "BRCA1"
let seq: DNA = dna"ATCG"
let scores: List[Float] = [0.1, 0.5, 0.9]

Built-in Types

CategoryTypes
PrimitivesInt, Float, String, Bool
CollectionsList, Map, Set, Table, Matrix
BiologyDNA, RNA, Protein, Interval
ControlOption, Result, Stream

Error Handling

BioLang provides both exception-style and result-style error handling, letting you choose the right approach for each situation:

# Result type with ? operator
fn load_reference(path: String) -> Result[DNA] {
  let seqs = read_fasta(path)?
  let seq = first(seqs)?
  Ok(seq.seq)
}

# try/catch for top-level scripts
try {
  let ref_seq = load_reference("hg38.fa")?
  let aligned = align(reads, ref_seq)
} catch e {
  print(f"Pipeline failed: {e.message}")
}

Next Steps

Explore the language reference in depth:

  • Variables & Types — bindings, type annotations, all value types
  • Pipes — the |> operator and composition patterns
  • Functions — declarations, lambdas, closures
  • Tables — the built-in table type and data verbs