Bio Reference Overview
BioLang treats biological data as a first-class citizen. DNA, RNA, and protein sequences are native value types with dedicated literal syntax, not strings that happen to contain nucleotides. This design philosophy extends to genomic intervals, k-mer analysis, sequence alignment, motif scanning, and high-level computational biology operations.
Native Bio Types
BioLang provides three sequence literal types with compile-time validation. Each type carries its own set of operations tailored to the underlying biology:
# DNA literal — validates ACGT alphabet
let seq = dna"ATCGATCGATCG"
# RNA literal — validates AUCG alphabet
let rna_seq = rna"AUCGAUCGAUCG"
# Protein literal — validates amino acid one-letter codes
let prot = protein"MKTLVFGADRHWYCNQEIPS"
Each literal is validated at parse time. Invalid characters produce a clear compile-time
error pointing to the offending position. At runtime, sequences carry their type
information, enabling type-safe operations — you cannot accidentally call
translate() on a DNA sequence without first transcribing it.
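To make the validation behaviour concrete, here is a minimal Python sketch of alphabet checking that reports the first offending position. It only illustrates the idea and is not BioLang's parser: the helper names `find_invalid` and `check_literal`, the 0-based position, and the error wording are all assumptions made for this example.

```python
# Conceptual sketch only; this is not BioLang's actual parser.
DNA_ALPHABET = frozenset("ACGT")
RNA_ALPHABET = frozenset("ACGU")


def find_invalid(text, alphabet):
    """Return the 0-based index of the first character outside alphabet, or None."""
    for pos, ch in enumerate(text):
        if ch not in alphabet:
            return pos
    return None


def check_literal(text, alphabet):
    """Return text if valid; otherwise raise ValueError naming the bad position."""
    pos = find_invalid(text, alphabet)
    if pos is not None:
        raise ValueError(f"invalid character {text[pos]!r} at position {pos}")
    return text
```

Checking `"AUCG"` against the DNA alphabet fails at index 1, since `U` belongs to RNA only.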
What Each Type Offers
| Type | Literal | Key Operations |
|---|---|---|
| DNA | dna"ATCG" | gc_content, reverse_complement, transcribe, complement, seq_len, subseq, kmer_count |
| RNA | rna"AUCG" | translate, reverse_complement, seq_len, subseq |
| Protein | protein"MKTL" | seq_len, subseq, to_string |
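For readers who want to see what the DNA operations in the table compute, the following Python sketch reimplements three of them on plain strings. It is an illustration under stated assumptions, not BioLang's implementation: GC content is returned as a fraction between 0 and 1 (the page does not say whether BioLang uses a fraction or a percentage), and transcription follows the coding-strand convention of replacing T with U.

```python
# Conceptual sketch on plain strings; BioLang's native types are not modeled here.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(dna):
    """Fraction of bases that are G or C (assumed convention: 0.0 to 1.0)."""
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(_COMPLEMENT)[::-1]


def transcribe(dna):
    """Coding-strand transcription: replace T with U."""
    return dna.replace("T", "U")
```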
Format Support
BioLang has built-in readers and writers for the most common bioinformatics file formats. All format functions return lazy streams by default, enabling processing of files larger than available memory:
- FASTA / FASTQ — Sequence data with quality scores
- BED / GFF — Genomic interval annotations
- VCF — Variant calls
- SAM / BAM — Alignment data
- MAF — Multiple alignment format
# Read as a stream (lazy, one-time): the default, best for large files
let records = read_fasta("data/sequences.fasta")
# Collect into a table (reusable): best for interactive use
let table = read_fasta("data/sequences.fasta") |> to_table()
# Process with pipes
records
  |> filter(|r| len(r.sequence) > 1000)
  |> map(|r| { id: r.id, gc: gc_content(r.sequence) })
  |> to_table()
  |> print()
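The difference between a lazy, one-time stream and a reusable table can be sketched with a Python generator. This is an analogy, not BioLang's reader: `iter_fasta` and `gc_fraction` are hypothetical helpers, the in-memory FASTA text stands in for a file, and the length threshold is lowered to fit the toy data.

```python
import io


def iter_fasta(handle):
    """Yield (id, sequence) pairs one record at a time from a FASTA text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


fasta_text = ">seq1 demo\nATGC\nGGCC\n>seq2\nAT\n"
records = iter_fasta(io.StringIO(fasta_text))

# Filter and map lazily; nothing is parsed until the list is built.
long_records = (r for r in records if len(r[1]) > 4)
result = [{"id": rid, "gc": gc_fraction(seq)} for rid, seq in long_records]

# The stream is now exhausted: a second pass yields nothing.
leftover = list(records)
```

Only one record is held in memory at a time, which is what makes this pattern suitable for files larger than RAM. The cost is that the stream cannot be replayed without reopening the source.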
Genomic Intervals
Intervals are another first-class type with set-theoretic operations built in. BioLang's interval engine uses an implicit interval tree for efficient overlap queries on large datasets:
# Create intervals (positional args: chrom, start, end, and an optional strand)
let exon1 = interval("chr1", 1000, 2000, "+")
let exon2 = interval("chr1", 1500, 2500)
println(exon1)
println(exon2)
# Build an interval tree for fast overlap queries
let tree = interval_tree([exon1, exon2])
let hits = query_overlaps(tree, "chr1", 1200, 1600)
println("Overlapping:", hits)
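The overlap query above can be understood through the underlying predicate. The Python sketch below uses a plain linear scan rather than an interval tree, so it shows what an overlap query returns, not how BioLang computes it efficiently. It assumes half-open coordinates (start inclusive, end exclusive, as in BED); the page does not state BioLang's convention, and `Interval` and `query_overlaps_linear` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int                    # inclusive (assumed)
    end: int                      # exclusive (assumed)
    strand: Optional[str] = None  # optional, mirroring the example above


def overlaps(iv, chrom, start, end):
    """True if iv shares at least one base with [start, end) on chrom."""
    return iv.chrom == chrom and iv.start < end and start < iv.end


def query_overlaps_linear(intervals, chrom, start, end):
    """O(n) scan; an interval tree answers the same query faster on large sets."""
    return [iv for iv in intervals if overlaps(iv, chrom, start, end)]


exon1 = Interval("chr1", 1000, 2000, "+")
exon2 = Interval("chr1", 1500, 2500)
hits = query_overlaps_linear([exon1, exon2], "chr1", 1200, 1600)
```

Under the half-open assumption, a query starting exactly at 2000 does not overlap `exon1`, because base 2000 lies outside it.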
K-mer Analysis
K-mer counting and analysis is a fundamental primitive in sequence bioinformatics. BioLang provides optimized k-mer functions that use bit-packed encoding for k ≤ 32:
let seq = dna"ATCGATCGATCGATCG"
# Count all 4-mers
let counts = kmer_count(seq, 4)
# Extract minimizers for sketching (parameters sized to fit this 16-nt sequence)
let mins = minimizers(seq, 5, 3)
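Bit-packed encoding is easiest to see in a small example. The Python sketch below assigns two bits per base and slides a rolling window over the sequence, so each k-mer becomes a single integer. With two bits per base, 32 bases fill exactly 64 bits, which is presumably why the optimized path applies for k ≤ 32. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions; BioLang's internal layout is not documented on this page.

```python
from collections import Counter

# Assumed 2-bit code; any fixed assignment works as long as it is consistent.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def count_packed_kmers(seq, k):
    """Count every overlapping k-mer in seq, keyed by its packed integer."""
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32 to fit in 64 bits")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    counts = Counter()
    value = 0
    for i, base in enumerate(seq):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            counts[value] += 1
    return counts


counts = count_packed_kmers("ATCGATCGATCGATCG", 4)
```

The rolling update costs constant time per base, regardless of k, because the previous window's integer is reused instead of re-encoding every k-mer from scratch.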
Alignment & Motifs
BioLang supports local and global sequence alignment with configurable scoring matrices, plus position weight matrix (PWM) scanning for locating motifs:
# Pairwise alignment (native only — requires bl CLI)
let result = align(dna"ATCGATCG", dna"ATCAATCG")
println(result.score)
println(result.cigar)
# Motif finding (works in WASM playground)
let hits = motif_find(dna"ATCGATCGATCGATCG", "ATCG")
println(hits)
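As a rough guide to what these two operations compute, the Python sketch below scores a global alignment with the Needleman-Wunsch recurrence and finds exact motif occurrences. Several things here are assumptions for illustration: the scoring values (match +1, mismatch -1, gap -1), the choice of global rather than local alignment, and the 0-based start positions. The page does not specify BioLang's defaults or the shape of the values returned by `align` and `motif_find`.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score using two rows of the dynamic-programming table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def find_exact_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) exact match."""
    if not motif:
        return []
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]


score = global_alignment_score("ATCGATCG", "ATCAATCG")
positions = find_exact_motif("ATCGATCGATCGATCG", "ATCG")
```

With this scoring, the two sequences align without gaps: seven matches and one mismatch give a score of 6.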
Advanced Analytics
BioLang includes high-level computational biology functions for dimensionality reduction, clustering, and pathway enrichment — all accessible without importing external libraries:
# K-means clustering
let data = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]
let clusters = kmeans(data, 2)
println(clusters)
# Correlation
let x = [1, 2, 3, 4, 5]
let y = [2, 4, 5, 4, 5]
println("Correlation:", cor(x, y))
# Statistical test
let group_a = [23.1, 27.5, 22.8, 25.1, 24.3]
let group_b = [30.2, 28.9, 31.5, 29.7, 32.1]
let result = ttest(group_a, group_b)
println(result)
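To show what the correlation and t-test calls might be computing, the Python sketch below works through both on the same data. It assumes `cor` is a Pearson correlation and that `ttest` is a two-sample test; the page confirms neither. Only the t statistic is computed here, with no p-value. Because both groups have the same size, the Student and Welch forms of the statistic coincide, so the sketch does not depend on which variant BioLang uses.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def two_sample_t(a, b):
    """Two-sample t statistic using per-group sample variances (Welch form)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


r = pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
t = two_sample_t([23.1, 27.5, 22.8, 25.1, 24.3], [30.2, 28.9, 31.5, 29.7, 32.1])
```

For these inputs r is about 0.775 and t is about -5.77, reflecting that the mean of `group_b` is clearly higher than the mean of `group_a`.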
Explore This Section
Each topic in this reference section includes detailed function signatures, parameter descriptions, return types, and practical examples. Use the sidebar navigation to explore each area:
- Sequences — DNA, RNA, and Protein types in depth
- Formats — File format readers and writers
- Intervals — Genomic interval operations
- K-mers — K-mer counting and analysis
- Alignment — Pairwise sequence alignment
- Motifs — Motif finding and PWM scanning
- Advanced — UMAP, clustering, enrichment, sketching
Gene Type
First-class gene representation with symbol, coordinates, and metadata.
# Look up a gene by symbol
let brca1 = gene("BRCA1")
# Field access
println(brca1.symbol, "on", brca1.chrom, brca1.start, "-", brca1.end)
# Type checking
println(type(brca1)) # "Gene"
Genome Type
Reference genome with built-in assemblies and chromosome data.
# Load a built-in genome assembly
let hg38 = genome("GRCh38")
println(hg38.name, "-", hg38.species)
println("Chromosomes:", len(hg38.chromosomes))
# Available assemblies: GRCh38, GRCh37, T2T-CHM13, GRCm39
let mouse = genome("GRCm39")
# Access chromosome info
for c in hg38.chromosomes {
  println(c.name, ":", c.length, "bp")
}