Motif Finding & Scanning
Motifs are short, recurring sequence patterns that often correspond to functional elements such as transcription factor binding sites, splice signals, and other regulatory elements. BioLang provides tools for motif search and counting, consensus derivation, position weight matrix (PWM) construction, and genome-wide motif scanning.
motif_find
Search a sequence for an exact or degenerate motif pattern. Degenerate patterns use IUPAC ambiguity codes, so one pattern can match several concrete sequences:
# Exact motif search
let seq = dna"ATCGATCGAAATCGATCG"
let hits = motif_find(seq, "ATCG")
# Returns a list of match records, each with start and end positions
print(len(hits), "matches found")
# IUPAC degenerate motif
# R = A|G, Y = C|T, N = any, W = A|T, S = C|G
hits = motif_find(seq, "RTCG") # Matches ATCG or GTCG
# Count occurrences of a motif pattern
let n = motif_count(seq, "ATCG")
print(n, "occurrences")
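To make the matching rule concrete, here is a minimal Python sketch of IUPAC pattern matching. It is an illustration only, not BioLang's implementation: the function name `iupac_find` is hypothetical, and the 0-based, end-exclusive coordinates are an assumption, since the coordinate convention of `motif_find` is not stated above.

```python
# Illustrative sketch only; not BioLang's implementation.
# Each IUPAC code maps to the set of bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_find(seq, motif):
    """Return (start, end) pairs for every window of seq matching motif.

    Coordinates are 0-based and end-exclusive (an assumed convention).
    """
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        # A window matches when every base is allowed by the code at that offset.
        if all(seq[i + j] in IUPAC[code] for j, code in enumerate(motif)):
            hits.append((i, i + k))
    return hits


seq = "ATCGATCGAAATCGATCG"
exact = iupac_find(seq, "ATCG")
degenerate = iupac_find(seq, "RTCG")  # R allows A or G in the first position
print(len(exact), "matches found")    # prints: 4 matches found
print(exact)                          # prints: [(0, 4), (4, 8), (10, 14), (14, 18)]
```

Because the sketch tests every start position, it also reports overlapping matches. Whether `motif_find` and `motif_count` treat overlaps the same way is not specified above.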
motif_count
Count occurrences of an IUPAC motif pattern in a sequence:
let seq = dna"ATCGATCGATCGATCG"
# Count exact matches
let n = motif_count(seq, "ATCG")
print(n, "occurrences")
# IUPAC degenerate patterns
n = motif_count(seq, "RTCG") # R = A or G
print(n, "degenerate matches")
# Scan a FASTA file for a regulatory element
read_fasta("data/sequences.fasta")
|> map(|r| {
gene: r.id,
tata_boxes: motif_count(r.seq, "TATAWAW")
})
|> to_table()
|> filter(|r| r.tata_boxes > 0)
|> print()
consensus
Derive a consensus sequence from a collection of aligned, equal-length sequences. Each consensus position takes the most frequent base in that column:
# Build consensus from aligned sequences
let aligned = [
dna"ATCGATCG",
dna"ATCAATCG",
dna"ATCGATCG",
dna"ATCGATCA",
dna"ATCGATCG",
]
let cons = consensus(aligned)
print(cons) # => consensus sequence string
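The column-wise majority rule can be sketched in a few lines of Python. This is a hedged illustration, not BioLang's code: `consensus_of` is a hypothetical name, and breaking ties alphabetically is an assumption, because the tie-breaking behaviour of `consensus` is not described above.

```python
from collections import Counter


def consensus_of(seqs):
    """Most frequent base per column; ties broken alphabetically (assumed)."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*seqs):  # zip(*seqs) yields one tuple of bases per column
        counts = Counter(column)
        # sorted() puts bases in alphabetical order; max() keeps the first
        # of any equally frequent bases, so ties resolve alphabetically.
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)


aligned = ["ATCGATCG", "ATCAATCG", "ATCGATCG", "ATCGATCA", "ATCGATCG"]
print(consensus_of(aligned))  # prints: ATCGATCG
```

In the five sequences above, columns 4 and 8 each contain one `A` against four `G`s, so the majority base `G` wins in both.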
pwm (Position Weight Matrix)
Build a position weight matrix from a set of aligned sequences. The PWM captures the position-specific base preferences of a motif:
# Build a PWM from binding site sequences
let sites = [
dna"ATCGATCG",
dna"ATCAATCG",
dna"ATCGATCG",
dna"ATCGATCA",
dna"ATCGATCG",
]
let matrix = pwm(sites)
# The PWM is a list of per-position frequency records {A, C, G, T}
matrix |> map(|pos| print(pos))
# Each entry shows base frequencies at that position
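As a rough picture of what a per-position frequency matrix holds, the Python sketch below counts bases column by column and divides by the number of sites. It assumes plain relative frequencies with no pseudocounts or log-odds transform, which matches the "frequency records" description above but may differ from what `pwm` does internally; `build_frequency_matrix` is a hypothetical name.

```python
def build_frequency_matrix(sites, alphabet="ACGT"):
    """Return one {base: frequency} dict per alignment column.

    Assumes plain relative frequencies (no pseudocounts, no log-odds).
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to the same length")
    n = len(sites)
    matrix = []
    for column in zip(*sites):  # one tuple of bases per position
        matrix.append({base: column.count(base) / n for base in alphabet})
    return matrix


sites = ["ATCGATCG", "ATCAATCG", "ATCGATCG", "ATCGATCA", "ATCGATCG"]
matrix = build_frequency_matrix(sites)
for index, position in enumerate(matrix):
    print(index, position)
```

For these sites, every column is fully conserved except the fourth and eighth, where `G` appears in four of five sequences (0.8) and `A` in one (0.2).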
pwm_scan
Scan a sequence with a PWM to find high-scoring matches. It returns the positions where the PWM score exceeds a threshold; the threshold is optional and is passed as the third argument:
# Build a PWM from known binding sites (the sequences must be aligned)
let binding_sites = read_fasta("data/sequences.fasta")
|> map(|r| r.seq)
|> collect()
let matrix = pwm(binding_sites)
# Scan a promoter sequence — pwm_scan(seq, pwm, threshold?)
let promoter = dna"ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG"
# With default threshold
let hits = pwm_scan(promoter, matrix)
# With custom threshold (third argument)
hits = pwm_scan(promoter, matrix, 0.8)
hits |> map(|h| print(h.pos, h.score))
# Genome-wide scan
read_fasta("data/sequences.fasta")
|> map(|chr| {
chrom: chr.id,
sites: pwm_scan(chr.seq, matrix, 0.85)
})
|> filter(|r| len(r.sites) > 0)
|> map(|r| r.sites |> map(|s| {
chrom: r.chrom, pos: s.pos, score: s.score
}))
|> flatten()
|> to_table()
|> print()
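The sliding-window idea behind a PWM scan can be sketched in Python as follows. The scoring scheme here is an assumption: each window is scored as the mean frequency of its observed bases, which falls between 0 and 1 and so fits thresholds such as 0.8, but the score that `pwm_scan` actually computes is not specified above. The names `frequency_matrix` and `scan_with_matrix` are hypothetical, and the threshold test is assumed to be inclusive.

```python
def frequency_matrix(sites, alphabet="ACGT"):
    """Per-column base frequencies from aligned, equal-length sites."""
    n = len(sites)
    return [{b: col.count(b) / n for b in alphabet} for col in zip(*sites)]


def scan_with_matrix(seq, matrix, threshold=0.8):
    """Slide the matrix along seq and keep windows scoring >= threshold.

    Score = mean frequency of the observed base at each position
    (an assumed scheme, chosen so scores fall in [0, 1]).
    """
    k = len(matrix)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Bases missing from a column (for example N) contribute 0.
        score = sum(col.get(base, 0.0) for col, base in zip(matrix, window)) / k
        if score >= threshold:
            hits.append({"pos": i, "score": score})
    return hits


sites = ["TATAAATA", "TATAATTA", "TATAAATA", "TATAAGTA"]
matrix = frequency_matrix(sites)
target = "AAAAATATAAATAGGGGTATAAATACCCC"
for hit in scan_with_matrix(target, matrix, threshold=0.9):
    print("Hit at", hit["pos"], "score:", hit["score"])
# prints:
# Hit at 5 score: 0.9375
# Hit at 17 score: 0.9375
```

Both hits are copies of `TATAAATA`. Seven of its eight positions are fully conserved in the training sites and the sixth has frequency 0.5 for `A`, giving (7 + 0.5) / 8 = 0.9375. Real scanners commonly use log-odds scores against a background model instead, so treat these numbers as specific to this sketch.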
Combining Motif Functions
The motif functions compose naturally to build analysis workflows:
# Build a PWM and a consensus from known sites, then scan a new sequence
let sites = [
dna"TATAAATA",
dna"TATAATTA",
dna"TATAAATA",
dna"TATAAGTA",
]
let matrix = pwm(sites)
let cons = consensus(sites)
print("Consensus:", cons)
# Scan a new sequence
let target = dna"AAAAATATAAATAGGGGTATAAATACCCC"
let hits = pwm_scan(target, matrix, 0.7)
hits |> map(|h| print("Hit at", h.pos, "score:", h.score))
Practical Example: Transcription Factor Binding Analysis
# Count motif occurrences and scan for PWM hits in a set of sequences
let seqs = read_fasta("data/sequences.fasta") |> collect()
# Count occurrences of a known motif across sequences
seqs
|> map(|r| {
id: r.id,
hits: motif_count(r.seq, "TATAWAW")
})
|> to_table()
|> filter(|r| r.hits > 0)
|> print()
# Build a PWM from the sequences (they must be aligned) and scan for matches
let matrix = seqs |> map(|r| r.seq) |> pwm()
print("PWM built from", len(seqs), "sequences")
# Scan the first sequence in the file as the target
let target_seq = read_fasta("data/sequences.fasta") |> first()
let hits = pwm_scan(target_seq.seq, matrix, 0.8)
print(len(hits), "PWM hits found")