Sequences

BioLang has three native sequence types — DNA, RNA, and Protein. Each is a distinct type with its own literal syntax, validation rules, and specialized operations. Sequences are immutable values; all transformation functions return new sequences.

DNA Sequences

DNA literals use the dna"..." syntax and accept the characters A, C, G, T (case-insensitive). IUPAC ambiguity codes (N, R, Y, etc.) are also accepted.

# Create a DNA sequence
let seq = dna"ATCGATCGATCG"

# Length
seq_len(seq)              # => 12

# GC content as a fraction between 0.0 and 1.0
gc_content(seq)           # => 0.5

# Reverse complement
reverse_complement(seq)   # => dna"CGATCGATCGAT"

# Complement (without reversing)
complement(seq)           # => dna"TAGCTAGCTAGC"

# Transcribe DNA to RNA (T → U)
transcribe(seq)           # => rna"AUCGAUCGAUCG"

# Base counts
base_counts(seq)          # => {A: 3, T: 3, C: 3, G: 3}
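
For readers who want to check the values above by hand, the following is a minimal Python sketch of the same operations on plain strings. It is an illustrative model, not BioLang's implementation: it handles only A, C, G, T (no IUPAC ambiguity codes), and returning 0.0 for an empty sequence is an assumption.

```python
# Illustrative Python model of the DNA operations above (not BioLang itself).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Swap each base for its Watson-Crick partner, keeping the order."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse it."""
    return complement(seq)[::-1]

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence: an assumption)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def transcribe(seq: str) -> str:
    """Replace every T with U."""
    return seq.upper().replace("T", "U")

seq = "ATCGATCGATCG"
print(complement(seq))          # TAGCTAGCTAGC
print(reverse_complement(seq))  # CGATCGATCGAT
print(gc_content(seq))          # 0.5
print(transcribe(seq))          # AUCGAUCGAUCG
```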

Subsequence Extraction

Use subseq(seq, start, end) for zero-based, half-open substring extraction:

let seq = dna"ATCGATCGATCG"

subseq(seq, 0, 4)     # => dna"ATCG"
subseq(seq, 4, 8)     # => dna"ATCG"
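
Half-open means that start is included and end is excluded, so the result has end - start characters and adjacent windows tile without overlap. A small Python sketch of that convention follows. The bounds check is an assumption, since the behaviour of subseq for out-of-range indices is not described here.

```python
def subseq(seq: str, start: int, end: int) -> str:
    """Zero-based, half-open slice: includes seq[start], excludes seq[end]."""
    # Rejecting out-of-range indices is an assumption made for this sketch.
    if not 0 <= start <= end <= len(seq):
        raise IndexError(f"range [{start}, {end}) is outside 0..{len(seq)}")
    return seq[start:end]

seq = "ATCGATCGATCG"
first = subseq(seq, 0, 4)
second = subseq(seq, 4, 8)

print(first, second)     # ATCG ATCG
print(len(first))        # 4, i.e. end - start
print(first + second == subseq(seq, 0, 8))  # True: adjacent windows tile exactly
```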

Concatenation

DNA sequences can be concatenated with the + operator. Both operands must be the same type — concatenating DNA with RNA is a type error:

let left = dna"ATCG"
let right = dna"GCTA"

let combined = left + right    # => dna"ATCGGCTA"

# Type error: cannot concatenate DNA and RNA
# mixed = left + rna"AUCG"   # Error!
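
The same-type rule can be modelled in Python with a small immutable value class. This is only a sketch of the idea: the class name Seq and its fields kind and residues are hypothetical and say nothing about how BioLang represents sequences internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: instances are immutable, like BioLang sequences
class Seq:
    kind: str       # "dna", "rna", or "protein" (hypothetical tag)
    residues: str

    def __add__(self, other):
        # Concatenation is only defined between sequences of the same kind.
        if not isinstance(other, Seq):
            return NotImplemented
        if other.kind != self.kind:
            raise TypeError(f"cannot concatenate {self.kind} and {other.kind}")
        return Seq(self.kind, self.residues + other.residues)

left = Seq("dna", "ATCG")
right = Seq("dna", "GCTA")
print((left + right).residues)   # ATCGGCTA

try:
    left + Seq("rna", "AUCG")
except TypeError as err:
    print(err)                   # cannot concatenate dna and rna
```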

RNA Sequences

RNA literals use rna"..." syntax and accept A, U, C, G. The key RNA-specific operation is translation to protein:

# Create an RNA sequence
let mrna = rna"AUGAAACUGUUGGUCUUU"

# Translate to protein (standard genetic code)
let prot = translate(mrna)        # => protein"MKLLVF"

# RNA also supports reverse complement
reverse_complement(mrna)       # => rna"AAAGACCAACAGUUUCAU"

# Length and subsequence work identically to DNA
seq_len(mrna)              # => 18
subseq(mrna, 0, 3)        # => rna"AUG"

Translation Details

Translation reads the sequence in triplets (codons) starting from position 0. Stop codons (UAA, UAG, UGA in the standard table) terminate translation. Incomplete trailing codons are ignored:

# Stop codon terminates translation
translate(rna"AUGAAAUAAGGGUUU")   # => protein"MK"

# Trailing bases after last complete codon are ignored
translate(rna"AUGAAAC")           # => protein"MK"  (trailing C ignored)
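
The rules above (read triplets from position 0, stop at the first stop codon, ignore an incomplete trailing codon) can be sketched in Python as follows. The codon table is the standard genetic code written in the usual U, C, A, G order. This is an illustrative model rather than BioLang's implementation, and it does not handle ambiguity codes.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order per position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(rna: str) -> str:
    """Translate RNA from position 0 until a stop codon or the last full codon."""
    rna = rna.upper()
    residues = []
    # len(rna) - 2 ensures only complete triplets are visited.
    for pos in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":      # UAA, UAG, or UGA: stop translating
            break
        residues.append(amino_acid)
    return "".join(residues)

print(translate("AUGAAACUGUUGGUCUUU"))  # MKLLVF
print(translate("AUGAAAUAAGGGUUU"))     # MK (stops at UAA)
print(translate("AUGAAAC"))             # MK (trailing C ignored)
```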

Protein Sequences

Protein literals use protein"..." syntax and accept the 20 standard amino acid one-letter codes plus X (unknown) and * (stop):

# Create a protein sequence
let prot = protein"MKTLVFGADRHWYCNQEIPS"

# Length
seq_len(prot)                 # => 20

# Subsequence
subseq(prot, 0, 5)           # => protein"MKTLV"

# Convert to string
prot |> to_string()           # => "MKTLVFGADRHWYCNQEIPS"

The Central Dogma as a Pipeline

BioLang's pipe syntax makes it natural to express the central dogma of molecular biology as a data transformation pipeline:

# DNA → RNA → Protein in one pipeline
let gene = dna"ATGAAACTGTTGGTCTTT"

let protein = gene
  |> transcribe()       # DNA → RNA
  |> translate()        # RNA → Protein
  # => protein"MKLLVF"

# GC content check
gc_content(gene) |> print()   # => 0.333

Validation and Error Handling

Sequence literals are validated at parse time for the basic alphabet. Invalid characters produce a clear error pointing to the offending position:

# Parse-time validation catches invalid characters
# let seq = dna"ATCGXYZ"   # Parse Error: invalid DNA character 'X' at position 4

# Runtime introspection
let seq = dna"ATCGATCG"
print(type(seq))              # => "DNA"
print(seq_len(seq))           # => 8
print(base_counts(seq))       # => {A: 2, T: 2, C: 2, G: 2}
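
As a rough illustration of this kind of check, the Python sketch below scans a DNA string and reports the first character outside the alphabet together with its zero-based position. The accepted set (A, C, G, T plus the eleven IUPAC nucleotide ambiguity codes) and the exact message format are assumptions modelled on the example above.

```python
# A, C, G, T plus the IUPAC nucleotide ambiguity codes.
# Assumption: this is the full alphabet a DNA literal accepts.
DNA_ALPHABET = set("ACGT") | set("RYSWKMBDHVN")

def validate_dna(text: str) -> None:
    """Raise ValueError naming the first invalid character and its zero-based position."""
    for pos, ch in enumerate(text):
        if ch.upper() not in DNA_ALPHABET:
            raise ValueError(f"invalid DNA character {ch!r} at position {pos}")

validate_dna("ATCGATCG")    # valid: returns without error
validate_dna("atcgnnry")    # lowercase and ambiguity codes are accepted

try:
    validate_dna("ATCGXYZ")
except ValueError as err:
    print(err)              # invalid DNA character 'X' at position 4
```

Note that Y is itself an IUPAC code (pyrimidine), so X is the first offending character in this input.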

Working with Sequence Collections

Sequences are ordinary values, so they can be stored in lists, passed through map and filter, and collected into tables:

# Build a table of sequence properties
let sequences = [
  dna"ATCGATCGATCG",
  dna"GCGCGCGCGCGC",
  dna"ATATATATATATAT",
  dna"GGGCCCAAATTT",
]

let stats = sequences
  |> map(|s| {
    length: seq_len(s),
    gc: gc_content(s),
    revcomp: reverse_complement(s) |> to_string()
  })
  |> to_table()

# Print separately so that stats is bound to the table itself
print(stats)

# Filter sequences by GC content
let gc_rich = sequences
  |> filter(|s| gc_content(s) > 0.6)
  |> collect()
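
To make the expected results concrete, here is the same map-and-filter logic as a Python sketch over plain strings. It is an illustrative model only: a list of dictionaries stands in for the table, and the helper functions are redefined locally rather than taken from BioLang.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

sequences = ["ATCGATCGATCG", "GCGCGCGCGCGC", "ATATATATATATAT", "GGGCCCAAATTT"]

# One row of properties per sequence (stands in for the table).
stats = [
    {"length": len(s), "gc": gc_content(s), "revcomp": reverse_complement(s)}
    for s in sequences
]
for row in stats:
    print(row)

# Keep only sequences whose GC fraction exceeds 0.6.
gc_rich = [s for s in sequences if gc_content(s) > 0.6]
print(gc_rich)   # ['GCGCGCGCGCGC']
```

The four GC fractions are 0.5, 1.0, 0.0, and 0.5, so only the second sequence passes the 0.6 threshold.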

Type Conversion

Explicit conversions between types are available where biologically meaningful. Implicit conversions are never performed:

# String to sequence (use the type-name function)
let seq = dna("ATCGATCG")
let rna_seq = rna("AUCGAUCG")
let prot = protein("MKTLVF")

# Sequence to string
let str = to_string(seq)        # => "ATCGATCG"

# Biologically meaningful: DNA → RNA via transcribe()
let rna_from_dna = transcribe(seq)

Quality Scores

The Quality type represents Phred+33 encoded base quality scores, with a dedicated literal syntax.

# Quality literal (Phred+33 ASCII encoding)
let q = qual"FFFFFFFF"

# Quality analysis functions
println("Mean Phred:", mean_phred(q))
println("Min Phred:", min_phred(q))
println("Error rate:", error_rate(q))

# Trim low-quality bases
let trimmed = trim_quality(q, 20)  # drop trailing bases below Q20

# FASTQ records include quality scores
let reads = read_fastq("data/reads.fastq")
for read in reads {
  if mean_phred(read.quality) < 25 {
    println("Low quality read:", read.id)
  }
}

Function              Description
mean_phred(q)         Average Phred score across all positions
min_phred(q)          Minimum Phred score
error_rate(q)         Mean per-base error probability
trim_quality(q, min)  Trim trailing bases below threshold
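
For reference, Phred+33 stores each score Q as the ASCII character with code Q + 33, and a score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below models the four functions under that definition. It is an assumption-laden illustration, not BioLang's implementation: in particular, error_rate is taken to be the mean of the per-base probabilities, and trim_quality is taken to remove only the trailing run of low-quality positions.

```python
def phred_scores(qual: str) -> list[int]:
    """Decode a Phred+33 string into integer scores."""
    return [ord(ch) - 33 for ch in qual]

def mean_phred(qual: str) -> float:
    scores = phred_scores(qual)
    return sum(scores) / len(scores)

def min_phred(qual: str) -> int:
    return min(phred_scores(qual))

def error_rate(qual: str) -> float:
    """Mean of the per-base error probabilities 10^(-Q/10)."""
    probs = [10 ** (-q / 10) for q in phred_scores(qual)]
    return sum(probs) / len(probs)

def trim_quality(qual: str, min_q: int) -> str:
    """Drop the trailing run of positions whose score is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return qual[:end]

q = "FFFFFFFF"                     # 'F' is ASCII 70, so every score is 37
print(mean_phred(q))               # 37.0
print(min_phred(q))                # 37
print(trim_quality("FFFF##", 20))  # FFFF ('#' decodes to Q2)
```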

Genomic Units

The unit helpers bp, kb, mb, and gb convert a quantity to an integer number of base pairs, so their results can be combined with ordinary arithmetic.

let window = kb(500)           # 500000
let offset = bp(200)           # 200
let total = window + offset    # 500200
let genome = gb(3.1)           # 3100000000

# Use with intervals
let start = mb(10)
let end = start + kb(500)
interval("chr1", start, end)
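
The arithmetic behind these helpers is simple scaling by powers of 1000. A Python sketch follows; rounding fractional inputs to the nearest integer is an assumption, since the rounding rule is not stated here.

```python
def bp(n) -> int:
    """Base pairs, as an integer."""
    return round(n)

def kb(n) -> int:
    """Kilobases to base pairs."""
    return round(n * 1_000)

def mb(n) -> int:
    """Megabases to base pairs."""
    return round(n * 1_000_000)

def gb(n) -> int:
    """Gigabases to base pairs (rounded: an assumption for fractional inputs)."""
    return round(n * 1_000_000_000)

print(kb(500) + bp(200))   # 500200
print(gb(3.1))             # 3100000000

start = mb(10)
end = start + kb(500)
print(start, end)          # 10000000 10500000
```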