Language Overview
BioLang is a domain-specific language (DSL) for bioinformatics, featuring pipe-first syntax and designed from the ground up for genomics and computational biology. It combines the expressiveness of modern functional languages with first-class support for biological data types, tabular operations, and streaming genomic pipelines.
Design Philosophy
BioLang is guided by four core principles that shape every language decision:
-
Pipe-first composition — Data flows left
to right through transformation chains using the
|>operator, mirroring the way bioinformatics pipelines are conceptualized. - Biology-native types — DNA, RNA, and protein sequences are first-class values with literal syntax, not strings with ad-hoc parsing.
- Tables as data — Tabular data is the lingua franca of genomics. BioLang provides dplyr-style verbs directly in the language.
- Streaming by default — Genomic datasets are often too large to fit in memory. Lazy streams let you process terabyte-scale files with constant memory.
A Quick Taste
Here is a complete BioLang program that reads a FASTQ file, filters reads by quality, and writes the result:
# Filter high-quality reads from a FASTQ file
import { read_fastq } from "bio/io"
let reads = read_fastq("sample.fastq")
|> filter(|r| mean_phred(r.quality) >= 30)
|> filter(|r| r.length >= 100)
print(f"Kept {len(reads)} high-quality reads")
Key Language Features
Pipe Operator
The pipe operator |> passes the left-hand value as the first argument
to the right-hand function. This creates readable, linear data flows:
# Without pipes — nested, hard to read
let result = summarize(group_by(filter(data, |r| r.score > 0.5), "gene"), mean(score))
# With pipes — linear, clear
let result = data
|> filter(|r| r.score > 0.5)
|> group_by("gene")
|> summarize(mean(score))
Bio Literals
BioLang provides literal syntax for biological sequences. These are typed values, not plain strings, enabling compile-time validation and specialized operations:
let seq = dna"ATCGATCGATCG"
let rna_seq = rna"AUCGAUCGAUCG"
let prot = protein"MKTLLILAVS"
# Reverse complement is a method, not a string hack
let rc = seq |> reverse_complement()
# Translation uses the standard genetic code
let translated = seq |> transcribe() |> translate()
Pattern Matching
BioLang supports expressive pattern matching with destructuring, guards, and wildcard patterns:
let result = match classify_variant(variant) {
"missense" => analyze_protein_impact(variant),
"nonsense" => { flag_lof(variant); "loss_of_function" },
"synonymous" if variant.conservation > 0.9 => "conserved_silent",
_ => "benign"
}
Table Operations
Tables are a built-in type with familiar dplyr-style verbs for data manipulation:
let variants = read_vcf("calls.vcf")
|> select("chrom", "pos", "ref_allele", "alt_allele", "qual")
|> filter(|v| v.qual >= 30.0)
|> arrange(desc(qual))
|> group_by("chrom")
|> summarize(
count = n(),
avg_qual = mean(qual),
max_qual = max(qual)
)
Lazy Streams
Streams enable lazy, memory-efficient processing of large datasets. No data is read until the stream is consumed:
# Process a 100 GB FASTQ file with constant memory
let high_qual = read_fastq("reads.fq.gz")
|> to_stream()
|> filter(|r| mean_phred(r.quality) >= 30)
|> filter(|r| r.length >= 100)
|> take(1_000_000)
|> collect()
Type System
BioLang uses gradual typing with full type inference. You can add type annotations when you want extra safety, or leave them off for concise scripts:
# Inferred types — concise
let x = 42
let name = "BRCA1"
let seq = dna"ATCG"
# Explicit annotations — self-documenting
let x: Int = 42
let name: String = "BRCA1"
let seq: DNA = dna"ATCG"
let scores: List[Float] = [0.1, 0.5, 0.9]
Built-in Types
| Category | Types |
|---|---|
| Primitives | Int, Float, String, Bool |
| Collections | List, Map, Set, Table, Matrix |
| Biology | DNA, RNA, Protein, Interval |
| Control | Option, Result, Stream |
Error Handling
BioLang provides both exception-style and result-style error handling, letting you choose the right approach for each situation:
# Result type with ? operator
fn load_reference(path: String) -> Result[DNA] {
let seqs = read_fasta(path)?
let seq = first(seqs)?
Ok(seq.seq)
}
# try/catch for top-level scripts
try {
let ref_seq = load_reference("hg38.fa")?
let aligned = align(reads, ref_seq)
} catch e {
print(f"Pipeline failed: {e.message}")
}
Next Steps
Explore the language reference in depth:
- Variables & Types — bindings, type annotations, all value types
- Pipes — the
|>operator and composition patterns - Functions — declarations, lambdas, closures
- Tables — the built-in table type and data verbs