Bio Reference Overview

BioLang treats biological data as a first-class citizen. DNA, RNA, and protein sequences are native value types with dedicated literal syntax, not strings that happen to contain nucleotides. This design philosophy extends to genomic intervals, k-mer analysis, sequence alignment, motif scanning, and high-level computational biology operations.

Native Bio Types

BioLang provides three sequence literal types with compile-time validation. Each type carries its own set of operations tailored to the underlying biology:

# DNA literal — validates ACGT alphabet
let seq = dna"ATCGATCGATCG"

# RNA literal — validates AUCG alphabet
let rna_seq = rna"AUCGAUCGAUCG"

# Protein literal — validates amino acid one-letter codes
let prot = protein"MKTLVFGADRHWYCNQEIPS"

Each literal is validated at parse time. Invalid characters produce a clear compile-time error pointing to the offending position. At runtime, sequences carry their type information, enabling type-safe operations — you cannot accidentally call translate() on a DNA sequence without first transcribing it.
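The position-reporting validation described above can be pictured with a small Python sketch. It is not BioLang's implementation: the `validate_literal` helper and the exact alphabets are assumptions made for illustration (for example, whether ambiguity codes such as N are accepted is not stated here).

```python
# Hypothetical alphabets for illustration; the protein set is the
# 20 standard one-letter amino acid codes.
ALPHABETS = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def validate_literal(kind: str, text: str) -> str:
    """Return text unchanged, or raise ValueError naming the first bad position."""
    allowed = ALPHABETS[kind]
    for pos, ch in enumerate(text):
        if ch not in allowed:
            raise ValueError(
                f"invalid {kind} character {ch!r} at position {pos} in {text!r}"
            )
    return text


validate_literal("dna", "ATCGATCGATCG")              # passes
validate_literal("protein", "MKTLVFGADRHWYCNQEIPS")  # passes

try:
    validate_literal("dna", "ATCGAUCG")  # U is not a DNA base
except ValueError as err:
    print(err)  # reports the character 'U' at position 5
```

Reporting the index of the first offending character is what lets an error message point at the exact spot in the literal.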

What Each Type Offers

Type     Literal         Key Operations
DNA      dna"ATCG"       gc_content, reverse_complement, transcribe, complement, seq_len, subseq, kmer_count
RNA      rna"AUCG"       translate, reverse_complement, seq_len, subseq
Protein  protein"MKTL"   seq_len, subseq, to_string
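As a rough guide to what some of the DNA operations compute, here is a Python sketch on plain strings. The function bodies are assumptions about the conventional definitions (GC content as a fraction between 0 and 1, transcription as a T to U substitution on the coding strand); BioLang's own return types and conventions may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def transcribe(seq: str) -> str:
    """Coding-strand convention: replace T with U."""
    return seq.replace("T", "U")


seq = "ATCGATCGATCG"
print(gc_fraction(seq))         # 0.5
print(reverse_complement(seq))  # CGATCGATCGAT
print(transcribe(seq))          # AUCGAUCGAUCG
```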

Format Support

BioLang has built-in readers and writers for the most common bioinformatics file formats. All format functions return lazy streams by default, enabling processing of files larger than available memory:

  • FASTA / FASTQ — Sequence data with quality scores
  • BED / GFF — Genomic interval annotations
  • VCF — Variant calls
  • SAM / BAM — Alignment data
  • MAF — Multiple alignment format

# Read as a table (reusable) — best for interactive use
let records = read_fasta("data/sequences.fasta")

# Read as a stream (lazy, one-time) — best for large files
records = read_fasta("data/sequences.fasta")

# Process with pipes
records
  |> filter(|r| len(r.sequence) > 1000)
  |> map(|r| { id: r.id, gc: gc_content(r.sequence) })
  |> to_table()
  |> print()
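To make the lazy, one-time nature of a stream concrete, the following Python sketch reads FASTA records with a generator and applies the same filter-and-map steps. It is an analogy, not BioLang's reader: `stream_fasta`, `Record`, and `gc_fraction` are hypothetical names, and the input is a toy in-memory file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Record(NamedTuple):
    id: str
    sequence: str


def stream_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield one record at a time; only the current record is held in memory."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield Record(name, "".join(chunks))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield Record(name, "".join(chunks))


def gc_fraction(seq: str) -> float:
    return sum(base in "GC" for base in seq.upper()) / len(seq) if seq else 0.0


# Toy input: one 4-base record and one 1200-base record.
fasta_text = ">short\nATGC\n>long\n" + "GGCC" * 300 + "\n"
records = stream_fasta(io.StringIO(fasta_text))

# Nothing is read until the list comprehension pulls records through.
long_records = (r for r in records if len(r.sequence) > 1000)
summary = [{"id": r.id, "gc": gc_fraction(r.sequence)} for r in long_records]
print(summary)        # [{'id': 'long', 'gc': 1.0}]

print(list(records))  # [] because the stream has already been consumed
```

The last line shows the trade-off: a stream can be traversed only once, whereas a materialized table can be reused at the cost of holding everything in memory.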

Genomic Intervals

Intervals are another first-class type with set-theoretic operations built in. BioLang's interval engine uses an implicit interval tree for efficient overlap queries on large datasets:

# Create intervals (positional args: chrom, start, end, strand)
let exon1 = interval("chr1", 1000, 2000, "+")
let exon2 = interval("chr1", 1500, 2500)
println(exon1)
println(exon2)

# Build an interval tree for fast overlap queries
let tree = interval_tree([exon1, exon2])
let hits = query_overlaps(tree, "chr1", 1200, 1600)
println("Overlapping:", hits)
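The overlap test behind such a query can be sketched in a few lines of Python. This version uses a plain linear scan rather than an interval tree, and it assumes half-open coordinates (start inclusive, end exclusive); neither the coordinate convention nor the names `Interval` and `query_linear` come from BioLang.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: Optional[str] = None


def overlaps(iv: Interval, chrom: str, start: int, end: int) -> bool:
    """Half-open overlap: same chromosome and the two ranges share a base."""
    return iv.chrom == chrom and iv.start < end and start < iv.end


def query_linear(intervals: List[Interval], chrom: str, start: int, end: int):
    """O(n) scan; an interval tree answers the same question faster."""
    return [iv for iv in intervals if overlaps(iv, chrom, start, end)]


exon1 = Interval("chr1", 1000, 2000, "+")
exon2 = Interval("chr1", 1500, 2500)

print(query_linear([exon1, exon2], "chr1", 1200, 1600))  # both exons
print(query_linear([exon1, exon2], "chr1", 2000, 2100))  # only exon2
```

A tree-based index returns the same hits; it only avoids visiting intervals that cannot overlap the query.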

K-mer Analysis

K-mer counting and analysis are fundamental primitives in sequence bioinformatics. BioLang provides optimized k-mer functions that use bit-packed encoding for k ≤ 32; at two bits per base, 32 is the largest k that fits in a single 64-bit word:

let seq = dna"ATCGATCGATCGATCG"

# Count all 4-mers
let counts = kmer_count(seq, 4)

# Extract minimizers for sketching (parameters chosen to fit the 16-nt sequence)
let mins = minimizers(seq, 5, 3)
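The sketch below shows, in Python, one way bit-packed k-mers and minimizers can work. It is an illustration under stated assumptions, not BioLang's implementation: the 2-bit codes (A=0, C=1, G=2, T=3), the ordering of k-mers by packed value, the leftmost tie-break, and the `(seq, k, w)` parameter order are all choices made here for the example.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_kmers(seq: str, k: int):
    """Yield each k-mer of seq as an integer, 2 bits per base, via a rolling update."""
    if not 1 <= k <= 32:
        raise ValueError("k must be in 1..32 to fit a 64-bit word")
    mask = (1 << (2 * k)) - 1
    value = 0
    for i, base in enumerate(seq):
        value = ((value << 2) | CODE[base]) & mask  # shift in the new base
        if i >= k - 1:
            yield value


def unpack_kmer(value: int, k: int) -> str:
    """Decode a packed k-mer back to its string form."""
    return "".join(BASES[(value >> (2 * (k - 1 - j))) & 3] for j in range(k))


def minimizers(seq: str, k: int, w: int):
    """Smallest packed k-mer (leftmost on ties) in each window of w consecutive k-mers."""
    packed = list(pack_kmers(seq, k))
    picked = set()
    for start in range(len(packed) - w + 1):
        window = packed[start:start + w]
        offset = min(range(w), key=window.__getitem__)
        picked.add((start + offset, packed[start + offset]))
    return sorted(picked)


seq = "ATCGATCGATCGATCG"
counts = Counter(pack_kmers(seq, 4))
for packed, n in sorted(counts.items()):
    print(unpack_kmer(packed, 4), n)  # ATCG 4, CGAT 3, GATC 3, TCGA 3

print([pos for pos, _ in minimizers(seq, 4, 3)])  # [0, 2, 4, 6, 8, 10, 12]
```

Because each k-mer is a single integer, counting reduces to hashing machine words, and the rolling update computes each new k-mer from the previous one in constant time.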

Alignment & Motifs

BioLang supports local and global sequence alignment with configurable scoring matrices, plus position weight matrix (PWM) scanning for motif finding:

# Pairwise alignment (native only — requires bl CLI)
let result = align(dna"ATCGATCG", dna"ATCAATCG")
println(result.score)
println(result.cigar)

# Motif finding (works in WASM playground)
let hits = motif_find(dna"ATCGATCGATCGATCG", "ATCG")
println(hits)
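For readers who want to see what a global alignment score measures, here is a minimal Needleman-Wunsch sketch in Python. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example; BioLang's default scoring and the fields of its result object are not specified here.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score with a linear gap penalty, keeping one DP row."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned against gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a aligned against gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Seven matches and one mismatch, no gaps: 7 * 1 + (-1) = 6
print(global_alignment_score("ATCGATCG", "ATCAATCG"))  # 6
```

Recovering the alignment itself (for example as a CIGAR string) additionally requires keeping the full matrix, or traceback pointers, instead of a single row.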

Advanced Analytics

BioLang includes high-level computational biology functions for dimensionality reduction, clustering, and pathway enrichment — all accessible without importing external libraries:

# K-means clustering
let data = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]
let clusters = kmeans(data, 2)
println(clusters)

# Correlation
let x = [1, 2, 3, 4, 5]
let y = [2, 4, 5, 4, 5]
println("Correlation:", cor(x, y))

# Statistical test
let group_a = [23.1, 27.5, 22.8, 25.1, 24.3]
let group_b = [30.2, 28.9, 31.5, 29.7, 32.1]
let result = ttest(group_a, group_b)
println(result)
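As a cross-check on what the correlation and t-test examples compute, the following Python sketch implements Pearson correlation and Welch's two-sample t statistic from their textbook definitions. Whether `cor` is Pearson and whether `ttest` uses the Welch or the pooled-variance form are assumptions here, and the sketch stops short of computing a p-value.

```python
import math
import statistics


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom (no p-value)."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson(x, y), 4))  # 0.7746

group_a = [23.1, 27.5, 22.8, 25.1, 24.3]
group_b = [30.2, 28.9, 31.5, 29.7, 32.1]
t, df = welch_t(group_a, group_b)
print(t, df)  # t is negative because group_a has the smaller mean
```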

Explore This Section

Each topic in this reference section includes detailed function signatures, parameter descriptions, return types, and practical examples. Use the sidebar navigation to explore each area:

  • Sequences — DNA, RNA, and Protein types in depth
  • Formats — File format readers and writers
  • Intervals — Genomic interval operations
  • K-mers — K-mer counting and analysis
  • Alignment — Pairwise sequence alignment
  • Motifs — Motif finding and PWM scanning
  • Advanced — UMAP, clustering, enrichment, sketching

Gene Type

First-class gene representation with symbol, coordinates, and metadata.

# Look up a gene by symbol
let brca1 = gene("BRCA1")

# Field access
println(brca1.symbol, "on", brca1.chrom, brca1.start, "-", brca1.end)

# Type checking
println(type(brca1))      # "Gene"

Genome Type

Reference genome with built-in assemblies and chromosome data.

# Load a built-in genome assembly
let hg38 = genome("GRCh38")
println(hg38.name, "-", hg38.species)
println("Chromosomes:", len(hg38.chromosomes))

# Available assemblies: GRCh38, GRCh37, T2T-CHM13, GRCm39
let mouse = genome("GRCm39")

# Access chromosome info
for c in hg38.chromosomes {
  println(c.name, ":", c.length, "bp")
}