Sequences
BioLang has three native sequence types: DNA, RNA, and Protein. Each is a
distinct type with its own literal syntax, validation rules, and specialized
operations. Sequences are immutable values; every transformation function
returns a new sequence.
DNA Sequences
DNA literals use the dna"..." syntax and accept the characters
A, C, G, T (case-insensitive). IUPAC
ambiguity codes (N, R, Y, etc.) are also accepted.
# Create a DNA sequence
let seq = dna"ATCGATCGATCG"
# Length
seq_len(seq) # => 12
# GC content as a fraction between 0.0 and 1.0
gc_content(seq) # => 0.5
# Reverse complement
reverse_complement(seq) # => dna"CGATCGATCGAT"
# Complement (without reversing)
complement(seq) # => dna"TAGCTAGCTAGC"
# Transcribe DNA to RNA (T → U)
transcribe(seq) # => rna"AUCGAUCGAUCG"
# Base counts
base_counts(seq) # => {A: 3, T: 3, C: 3, G: 3}
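The semantics of these operations can be pinned down with a small reference model. The following is a minimal Python sketch, not BioLang's implementation; it assumes uppercase input restricted to A, C, G, T (no IUPAC ambiguity codes), and the function names simply mirror the BioLang built-ins shown above.

```python
# Reference model of the DNA operations above (Python, illustrative only).
# Assumption: uppercase A/C/G/T input; IUPAC ambiguity codes are not handled.
from collections import Counter

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def complement(seq: str) -> str:
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(DNA_COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse it."""
    return complement(seq)[::-1]

def transcribe(seq: str) -> str:
    """Replace every T with U."""
    return seq.replace("T", "U")

def base_counts(seq: str) -> dict:
    """Count the occurrences of each base."""
    return dict(Counter(seq))

seq = "ATCGATCGATCG"
print(gc_content(seq))          # 0.5
print(reverse_complement(seq))  # CGATCGATCGAT
print(complement(seq))          # TAGCTAGCTAGC
print(transcribe(seq))          # AUCGAUCGAUCG
```

Every function returns a new string and leaves its argument untouched, which mirrors the immutability of BioLang sequences.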
Subsequence Extraction
Use subseq(seq, start, end) for zero-based, half-open substring extraction:
let seq = dna"ATCGATCGATCG"
subseq(seq, 0, 4) # => dna"ATCG"
subseq(seq, 4, 8) # => dna"ATCG"
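Zero-based, half-open indexing means the start position is included and the end position is excluded, so the result always has length end - start and adjacent ranges join without overlap. A minimal Python sketch of that convention follows; the bounds check and its error type are assumptions, since the text above does not say how BioLang treats out-of-range arguments.

```python
# Reference model of subseq() (Python, illustrative only).
def subseq(seq: str, start: int, end: int) -> str:
    """Return seq[start:end]: includes start, excludes end."""
    # Assumption: out-of-range or reversed bounds are rejected.
    if not 0 <= start <= end <= len(seq):
        raise IndexError(f"invalid range [{start}, {end}) for length {len(seq)}")
    return seq[start:end]

seq = "ATCGATCGATCG"
first = subseq(seq, 0, 4)
second = subseq(seq, 4, 8)
print(first, second)                        # ATCG ATCG
print(len(second))                          # 4
print(first + second == subseq(seq, 0, 8))  # True
```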
Concatenation
DNA sequences can be concatenated with the + operator. Both operands must
be the same type — concatenating DNA with RNA is a type error:
let left = dna"ATCG"
let right = dna"GCTA"
let combined = left + right # => dna"ATCGGCTA"
# Type error: cannot concatenate DNA and RNA
# let mixed = left + rna"AUCG" # Error!
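The same-type rule can be modelled with distinct classes whose + operator rejects a mismatched operand. This is a minimal Python sketch of the idea, not BioLang's type checker; the class names Seq, DNA, and RNA and the error message are illustrative assumptions.

```python
# Reference model of type-checked concatenation (Python, illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors the immutability of sequences
class Seq:
    bases: str

    def __add__(self, other):
        # Only operands of exactly the same sequence type may be joined.
        if type(other) is not type(self):
            raise TypeError(
                f"cannot concatenate {type(self).__name__} and {type(other).__name__}"
            )
        return type(self)(self.bases + other.bases)

class DNA(Seq):
    pass

class RNA(Seq):
    pass

combined = DNA("ATCG") + DNA("GCTA")
print(combined)  # DNA(bases='ATCGGCTA')

try:
    DNA("ATCG") + RNA("AUCG")
except TypeError as err:
    print(err)   # cannot concatenate DNA and RNA
```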
RNA Sequences
RNA literals use rna"..." syntax and accept A, U,
C, G. The key RNA-specific operation is translation to protein:
# Create an RNA sequence
let mrna = rna"AUGAAACUGUUGGUCUUU"
# Translate to protein (standard genetic code)
let prot = translate(mrna) # => protein"MKLLVF"
# RNA also supports reverse complement
reverse_complement(mrna) # => rna"AAAGACCAACAGUUUCAU"
# Length and subsequence work identically to DNA
seq_len(mrna) # => 18
subseq(mrna, 0, 3) # => rna"AUG"
Translation Details
Translation reads the sequence in triplets (codons) starting from position 0. The
first stop codon (UAA, UAG, or UGA in the standard table)
terminates translation and is not included in the result. An incomplete trailing codon is ignored:
# Stop codon terminates translation
translate(rna"AUGAAAUAAGGGUUU") # => protein"MK"
# Trailing bases after last complete codon are ignored
translate(rna"AUGAAAC") # => protein"MK" (trailing C ignored)
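The rules above (read codons from position 0, stop at the first stop codon, skip an incomplete trailing codon) can be sketched in Python as follows. This is a reference model under the standard genetic code, not BioLang's implementation; it assumes uppercase A, C, G, U input, and the constant names are illustrative.

```python
# Reference model of translate() (Python, illustrative only).
# The standard genetic code, laid out in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(rna: str) -> str:
    """Translate from position 0 until the first stop codon or the end."""
    protein = []
    # Visit complete codons only; a trailing partial codon is never read.
    for pos in range(0, len(rna) - len(rna) % 3, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: end translation, emit nothing
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGAAACUGUUGGUCUUU"))  # MKLLVF
print(translate("AUGAAAUAAGGGUUU"))     # MK
print(translate("AUGAAAC"))             # MK
```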
Protein Sequences
Protein literals use protein"..." syntax and accept the 20 standard amino
acid one-letter codes plus X (unknown) and * (stop):
# Create a protein sequence
let prot = protein"MKTLVFGADRHWYCNQEIPS"
# Length
seq_len(prot) # => 20
# Subsequence
subseq(prot, 0, 5) # => protein"MKTLV"
# Convert to string
prot |> to_string() # => "MKTLVFGADRHWYCNQEIPS"
The Central Dogma as a Pipeline
BioLang's pipe syntax makes it natural to express the central dogma of molecular biology as a data transformation pipeline:
# DNA → RNA → Protein in one pipeline
let gene = dna"ATGAAACTGTTGGTCTTT"
let protein = gene
  |> transcribe() # DNA → RNA
  |> translate() # RNA → Protein
# => protein"MKLLVF"
# GC content check
gc_content(gene) |> print() # => 0.333
Validation and Error Handling
Sequence literals are validated at parse time for the basic alphabet. Invalid characters produce a clear error pointing to the offending position:
# Parse-time validation catches invalid characters
# let seq = dna"ATCGXYZ" # Parse Error: invalid DNA character 'X' at position 4
# Runtime introspection
let seq = dna"ATCGATCG"
print(type(seq)) # => "DNA"
print(seq_len(seq)) # => 8
print(base_counts(seq)) # => {A: 2, T: 2, C: 2, G: 2}
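Alphabet validation amounts to scanning the literal and reporting the first character outside the allowed set, using a zero-based position. The Python sketch below models that check for DNA; it assumes the full set of IUPAC nucleotide codes is accepted (the text names only N, R, and Y as examples), and the SequenceError name is hypothetical.

```python
# Reference model of DNA literal validation (Python, illustrative only).
# Assumption: all IUPAC nucleotide codes are accepted, case-insensitively.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")

class SequenceError(ValueError):
    """Hypothetical error type for an invalid sequence literal."""

def validate_dna(text: str) -> str:
    """Return text unchanged, or raise at the first invalid character."""
    for position, char in enumerate(text):
        if char.upper() not in IUPAC_DNA:
            raise SequenceError(
                f"invalid DNA character '{char}' at position {position}"
            )
    return text

print(validate_dna("ATCGNRY"))  # ATCGNRY
try:
    validate_dna("ATCGXYZ")
except SequenceError as err:
    print(err)  # invalid DNA character 'X' at position 4
```

Note that Y is a valid ambiguity code (pyrimidine), so in "ATCGXYZ" only X and Z are invalid, and X is reported because it comes first.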
Working with Sequence Collections
Sequences work with BioLang's table and streaming systems: they can be stored in lists, transformed with map and filter, and collected into tables:
# Build a table of sequence properties
let sequences = [
  dna"ATCGATCGATCG",
  dna"GCGCGCGCGCGC",
  dna"ATATATATATATAT",
  dna"GGGCCCAAATTT",
]
# Keep the table in stats and print it as a separate step
let stats = sequences
  |> map(|s| {
    length: seq_len(s),
    gc: gc_content(s),
    revcomp: reverse_complement(s) |> to_string()
  })
  |> to_table()
print(stats)
# Filter sequences by GC content
let gc_rich = sequences
  |> filter(|s| gc_content(s) > 0.6)
  |> collect()
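To make the expected result of the example concrete, the same map and filter steps can be worked through in Python. This is only a model of the data flow, with plain strings standing in for DNA values and a list of dictionaries standing in for the table.

```python
# Worked model of the map/filter example (Python, illustrative only).
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

sequences = [
    "ATCGATCGATCG",
    "GCGCGCGCGCGC",
    "ATATATATATATAT",
    "GGGCCCAAATTT",
]

# One row per sequence, like the records passed to to_table().
stats = [{"length": len(s), "gc": gc_content(s)} for s in sequences]
for row in stats:
    print(row)

# Keep only sequences whose GC fraction exceeds 0.6.
gc_rich = [s for s in sequences if gc_content(s) > 0.6]
print(gc_rich)  # ['GCGCGCGCGCGC']
```

Only the second sequence passes the filter: its GC fraction is 1.0, while the others are 0.5, 0.0, and 0.5.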
Type Conversion
Explicit conversions between types are available where biologically meaningful. Implicit conversions are never performed:
# String to sequence (use the type-name function)
let seq = dna("ATCGATCG")
let rna_seq = rna("AUCGAUCG")
let prot = protein("MKTLVF")
# Sequence to string
let str = to_string(seq) # => "ATCGATCG"
# Biologically meaningful: DNA → RNA via transcribe()
let rna_from_dna = transcribe(seq)
Quality Scores
The Quality type represents Phred+33-encoded base quality scores and has its own qual"..." literal syntax:
# Quality literal (Phred+33 ASCII encoding)
let q = qual"FFFFFFFF"
# Quality analysis functions
println("Mean Phred:", mean_phred(q))
println("Min Phred:", min_phred(q))
println("Error rate:", error_rate(q))
# Trim low-quality bases
let trimmed = trim_quality(q, 20) # trim below Q20
# FASTQ records include quality scores
let reads = read_fastq("data/reads.fastq")
for read in reads {
  if mean_phred(read.quality) < 25 {
    println("Low quality read:", read.id)
  }
}
| Function | Description |
|---|---|
| mean_phred(q) | Average Phred score across all positions |
| min_phred(q) | Minimum Phred score |
| error_rate(q) | Mean per-base error probability |
| trim_quality(q, min) | Trim trailing bases below threshold |
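In Phred+33 encoding each character's ASCII code minus 33 is the quality score Q, and the corresponding error probability is 10^(-Q/10). The Python sketch below models the four functions in the table on that basis. It is not BioLang's implementation: it assumes a non-empty quality string, and it assumes trim_quality removes bases only from the trailing end, as the table's description suggests.

```python
# Reference model of the quality functions (Python, illustrative only).
PHRED_OFFSET = 33  # Phred+33: score = ASCII code - 33

def phred_scores(quality: str) -> list:
    """Decode a Phred+33 string into integer scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]

def mean_phred(quality: str) -> float:
    scores = phred_scores(quality)
    return sum(scores) / len(scores)

def min_phred(quality: str) -> int:
    return min(phred_scores(quality))

def error_rate(quality: str) -> float:
    """Mean of the per-base error probabilities 10^(-Q/10)."""
    scores = phred_scores(quality)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)

def trim_quality(quality: str, min_score: int) -> str:
    """Assumption: drop trailing positions while their score is below min_score."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - PHRED_OFFSET < min_score:
        end -= 1
    return quality[:end]

q = "FFFFFFFF"                     # 'F' is ASCII 70, so every score is 37
print(mean_phred(q))               # 37.0
print(min_phred(q))                # 37
print(error_rate(q))               # about 0.0002
print(trim_quality("FFFF##", 20))  # FFFF ('#' decodes to Q2)
```

Note that the mean error probability is not the same as the error probability of the mean score; the two agree here only because all positions share one score.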
Genomic Units
The unit helpers bp(), kb(), mb(), and gb() convert a quantity to an integer number of base pairs, so values given in different units can be added directly.
let window = kb(500) # 500000
let offset = bp(200) # 200
let total = window + offset # 500200
let genome = gb(3.1) # 3100000000
# Use with intervals
let start = mb(10)
let end = start + kb(500)
interval("chr1", start, end)
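The arithmetic behind these helpers is a multiplication by a power of 1000 followed by conversion to an integer. A minimal Python sketch, assuming decimal multiples (1 kb = 1000 bp, as the values above imply) and assuming fractional results are rounded to the nearest base pair, which the text does not specify:

```python
# Reference model of the genomic unit helpers (Python, illustrative only).
# Assumption: fractional base pairs are rounded to the nearest integer.
def bp(n) -> int:
    return round(n)

def kb(n) -> int:
    return round(n * 1_000)

def mb(n) -> int:
    return round(n * 1_000_000)

def gb(n) -> int:
    return round(n * 1_000_000_000)

window = kb(500)
offset = bp(200)
print(window + offset)  # 500200
print(gb(3.1))          # 3100000000

start = mb(10)
end = start + kb(500)
print(start, end)       # 10000000 10500000
```

Because every helper returns a plain integer, the results can be mixed freely in sums and comparisons without tracking units.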