Sequence Alignment

BioLang includes a built-in pairwise sequence alignment engine supporting both local (Smith-Waterman) and global (Needleman-Wunsch) alignment modes. Alignment results include the optimal score, aligned sequences, CIGAR string, and identity statistics. For protein alignments, standard substitution matrices (BLOSUM62, PAM250) are available.

align()

The align() function performs pairwise alignment between two sequences. It returns a structured result with score, aligned sequences, and CIGAR representation:

# Global alignment (Needleman-Wunsch) — default mode
let result = align(dna"ATCGATCGATCG", dna"ATCAATCG")

print(result.score)       # => alignment score (Int)
print(result.aligned1)    # => aligned version of first sequence
print(result.aligned2)    # => aligned version of second sequence
print(result.cigar)       # => CIGAR string representation
print(result.identity)    # => fraction of matching positions (Float)
print(result.gaps)        # => number of gap positions (Int)

# Local alignment (Smith-Waterman)
result = align(dna"XXXXATCGATCGXXXX", dna"ATCGATCG", {type: "local"})

print(result.score)
print(result.aligned1)
print(result.aligned2)

Scoring Parameters

Customize the scoring scheme with match, mismatch, gap open, and gap extend penalties:

# Custom scoring via options record (third argument)
let result = align(
  dna"ATCGATCGATCG",
  dna"ATCAATCG",
  {
    type: "global",
    match: 2,
    mismatch: -1,
    gap_open: -5,
    gap_extend: -1
  }
)

# Affine gap penalties are the default:
# gap_open = -5, gap_extend = -1
# This favors fewer, longer gaps over many short gaps

score_matrix

For protein alignments, use a substitution matrix. BioLang includes BLOSUM and PAM matrices. You can also define custom matrices:

# Protein alignment with BLOSUM62
let result = align(
  protein"MKTLVFGADRHWY",
  protein"MKTLAFGADRHWY",
  {type: "global"}
)

print(result.score)
print(result.aligned1)
print(result.aligned2)

# Load a substitution matrix (returns a Matrix)
# Available: "blosum62", "blosum45", "pam250"
let blosum = score_matrix("blosum62")
print(blosum)   # 20x20 matrix with amino acid row/column names

edit_distance

Compute the Levenshtein edit distance between two sequences — the minimum number of insertions, deletions, and substitutions to transform one into the other:

# Edit distance (Levenshtein)
let d = edit_distance(dna"ATCGATCG", dna"ATCAATCG")
print(d)   # => 1

# Works with any sequence type
d = edit_distance(protein"MKTLVF", protein"MKPLVF")
print(d)   # => 1

# Also works with plain strings
d = edit_distance("kitten", "sitting")
print(d)   # => 3

hamming_distance

The Hamming distance counts the number of positions where two equal-length sequences differ. Unlike edit distance, it does not consider insertions or deletions:

# Hamming distance (sequences must be same length)
let d = hamming_distance(dna"ATCGATCG", dna"ATCAATCG")
print(d)   # => 1

# Useful for comparing barcodes or short sequences
let barcode1 = dna"ACGTACGT"
let barcode2 = dna"ACGAACGT"
d = hamming_distance(barcode1, barcode2)
print(d)   # => 1

# Error: sequences must be the same length
# hamming_distance(dna"ATCG", dna"ATCGA")  # Error!

Using Alignment Results

The align() result record includes an identity field (fraction of matching positions) and a cigar string that can be used for further analysis:

# Get identity from alignment result
let result = align(dna"ATCGATCGATCG", dna"ATCAATCG")
print(result.identity)    # => fraction of matching positions
print(result.cigar)       # => CIGAR string
print(result.gaps)        # => number of gap positions

# Compare multiple sequences against the first
let seqs = read_fasta("data/sequences.fasta") |> collect()
let ref_seq = first(seqs)
seqs |> map(|s| {
  id: s.id,
  identity: align(ref_seq.seq, s.seq).identity
}) |> to_table() |> print()

Practical Example: Variant Detection

Using alignment to detect variants between a sample and reference:

# Align sample to reference and inspect results
let ref = dna"ATCGATCGATCGATCGATCGATCG"
let sample = dna"ATCGAACGATCGTTCGATCGATCG"

let result = align(ref, sample, {type: "global"})

print("Score:", result.score)
print("Identity:", result.identity)
print("Gaps:", result.gaps)
print("CIGAR:", result.cigar)
print("Aligned ref:   ", result.aligned1)
print("Aligned sample:", result.aligned2)

# Use edit_distance for a quick similarity check
let d = edit_distance(ref, sample)
print("Edit distance:", d)

# Hamming distance works when sequences are the same length
d = hamming_distance(ref, sample)
print("Hamming distance:", d)