Rosalind Bioinformatics Armory
Solutions to the Rosalind Bioinformatics Armory problem set, implemented in idiomatic BioLang. These problems focus on using real bioinformatics tools and databases rather than implementing algorithms from scratch.
INI — Introduction to the Bioinformatics Armory
Given a DNA string, count the occurrences of each nucleotide (A, C, G, T).
# Count each nucleotide in a DNA string
let seq = dna"AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
let counts = base_counts(seq)
print(counts["A"], counts["C"], counts["G"], counts["T"])
GBK — GenBank Introduction
Search NCBI's Nucleotide database for entries from a given organism published between two dates. Return the count.
# requires: internet connection (NCBI API)
# Search GenBank for organism entries within a date range
let organism = "Anthoxanthum"
let date_from = "2003/07/25"
let date_to = "2005/12/27"
let query = organism + "[Organism] AND " + date_from + ":" + date_to + "[Publication Date]"
let result = ncbi_search("nucleotide", query)
print(len(result))
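To make the query string concrete, here is a minimal plain-Python sketch (not BioLang) that builds the same term and the corresponding NCBI ESearch URL without sending a request. The helper names `build_term` and `build_esearch_url` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism: str, date_from: str, date_to: str) -> str:
    """Build the same search term that the BioLang snippet concatenates."""
    return f"{organism}[Organism] AND {date_from}:{date_to}[Publication Date]"


def build_esearch_url(term: str, db: str = "nucleotide") -> str:
    """Return an ESearch URL for the term; no network request is made here."""
    params = {"db": db, "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


term = build_term("Anthoxanthum", "2003/07/25", "2005/12/27")
print(term)
print(build_esearch_url(term))
```

ESearch reports the total number of hits in a separate count field and, by default, returns only the first 20 IDs. Whether `len(result)` equals the total therefore depends on what `ncbi_search` returns, which this page does not specify.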
FRMT — Data Formats
Given a list of GenBank IDs, fetch their FASTA records and return the one with the shortest sequence.
# requires: internet connection (NCBI API)
# Find the shortest sequence among GenBank accessions
let ids = ["JX205496.1", "JX469991.1", "JX469983.1"]
let shortest = ids
|> map(|id| {
let seq = ncbi_sequence(id)
{id: id, seq: seq, length: len(seq)}
})
|> sort_by(|a, b| a.length - b.length)
|> head(1)
print(">" + shortest[0].id)
print(shortest[0].seq)
TFSQ — FASTQ Format Introduction
Convert FASTQ records to FASTA format.
# Convert FASTQ to FASTA
let reads = read_fastq("data/reads.fastq")
for read in reads {
print(">" + read.id)
print(read.seq)
}
# Or as a pipeline:
read_fastq("data/reads.fastq")
|> map(|r| ">" + r.id + "\n" + to_string(r.seq))
|> each(|line| print(line))
PHRE — Read Quality Distribution
Given a quality threshold, count how many reads have average quality below it.
# Count reads with average quality below threshold
let threshold = 28
let reads = read_fastq("data/reads.fastq")
let below = reads
|> filter(|r| mean_phred(r.quality) < threshold)
|> len
print(below)
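For readers who want to check the arithmetic outside BioLang, the following plain-Python sketch shows what an average Phred score amounts to. It assumes Phred+33 encoding (each quality character decodes to `ord(c) - 33`); the function names are illustrative and are not BioLang's implementation of `mean_phred`.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read, assuming Phred+33 ASCII encoding."""
    if not quality:
        raise ValueError("empty quality string")
    scores = [ord(c) - offset for c in quality]
    return sum(scores) / len(scores)


def count_below(qualities, threshold):
    """Count reads whose average quality is strictly below the threshold."""
    return sum(1 for q in qualities if mean_phred(q) < threshold)


# 'I' decodes to 40, '5' to 20 and '!' to 0 under Phred+33.
print(count_below(["IIII", "!!!!", "5555"], 28))
```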
PTRA — Protein Translation
Given a DNA string and a protein, find which genetic code table was used for translation.
# Translate DNA using different codon tables
let dna_seq = dna"ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA"
# Standard translation
let protein = dna_seq |> translate
print("Standard:", protein)
# The translate function uses the standard codon table by default.
# Compare the output with the expected protein to identify the table.
RVCO — Complementing a Strand of DNA
Given FASTA records, count those whose sequence equals its own reverse complement.
# Count sequences that match their own reverse complement
let records = read_fasta("data/sequences.fasta")
let count = records
|> filter(|r| to_string(r.seq) == to_string(reverse_complement(r.seq)))
|> len
print(count)
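The comparison above can be sketched in plain Python (not BioLang) as follows. It assumes uppercase sequences over A, C, G and T only; `count_self_revcomp` is a hypothetical helper name.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_self_revcomp(seqs):
    """Count sequences that equal their own reverse complement."""
    return sum(1 for s in seqs if s == reverse_complement(s))


print(reverse_complement("AAAACCCGGT"))
print(count_self_revcomp(["ATAT", "ACGT", "AAAA"]))
```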
FILT — Read Filtration by Quality
Filter FASTQ reads: keep only those where at least a given percentage of bases meet a quality threshold.
# Filter reads by per-base quality percentage
let quality_threshold = 20
let percentage = 90
let reads = read_fastq("data/reads.fastq")
let passed = reads |> filter(|r| {
let scores = r.quality
let total = len(scores)
let good = scores |> filter(|q| q >= quality_threshold) |> len
(good * 100) / total >= percentage
})
print(len(passed))
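The pass/fail rule for a single read can be written out in plain Python (not BioLang) as a cross-check. This sketch assumes Phred+33 encoding and compares by cross-multiplication, so no rounding from division can affect the result; `passes` is a hypothetical name.

```python
def passes(quality: str, threshold: int, percentage: float, offset: int = 33) -> bool:
    """True if at least `percentage` percent of bases score >= threshold."""
    if not quality:
        return False  # assumption: an empty read never passes
    good = sum(1 for c in quality if ord(c) - offset >= threshold)
    # good / total >= percentage / 100, rewritten without division
    return good * 100 >= percentage * len(quality)


# 'I' decodes to 40 and '5' to 20 under Phred+33.
print(passes("IIII5", 30, 80))  # 4 of 5 bases are good: exactly 80 percent
print(passes("II555", 30, 80))  # 2 of 5 bases are good: 40 percent
```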
BPHR — Base Quality Distribution
Given FASTQ reads and a quality threshold, count positions where the average quality falls below it.
# Count positions with average quality below threshold
let threshold = 20
let reads = read_fastq("data/reads.fastq")
# Collect quality scores per position
let num_positions = reads |> map(|r| len(r.quality)) |> max
let position_avgs = range(0, num_positions) |> map(|pos| {
let scores_at_pos = reads
|> filter(|r| len(r.quality) > pos)
|> map(|r| r.quality[pos])
mean(scores_at_pos)
})
let below = position_avgs |> filter(|avg| avg < threshold) |> len
print(below)
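As a plain-Python cross-check (not BioLang), the per-position averaging can be sketched like this. It assumes Phred+33 encoding and, like the snippet above, averages each position only over the reads long enough to cover it; `positions_below` is a hypothetical name.

```python
def positions_below(qualities, threshold, offset=33):
    """Count positions whose mean quality across reads is below the threshold."""
    longest = max((len(q) for q in qualities), default=0)
    below = 0
    for pos in range(longest):
        # Only reads that actually reach this position contribute.
        scores = [ord(q[pos]) - offset for q in qualities if len(q) > pos]
        if sum(scores) / len(scores) < threshold:
            below += 1
    return below


# Position means are 40, 20 and 0 for these two reads.
print(positions_below(["I5!", "I5"], 25))
```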
BFIL — Base Filtration by Quality
Trim each FASTQ read from both ends, removing bases below a quality threshold.
# Trim reads from both ends by quality
let threshold = 20
let reads = read_fastq("data/reads.fastq")
for read in reads {
let q = read.quality
let trimmed = trim_quality(q, threshold)
print("@" + read.id)
print(read.seq) # NOTE: untrimmed sequence; slice it to the range kept by trim_quality
print("+")
print(trimmed)
}
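Trimming has to cut the sequence and the quality string to the same range. The plain-Python sketch below (not BioLang) shows one way to do that, assuming Phred+33 encoding. It does not describe how `trim_quality` behaves internally, and `trim_both_ends` is a hypothetical name.

```python
def trim_both_ends(seq: str, quality: str, threshold: int, offset: int = 33):
    """Drop leading and trailing bases whose quality is below the threshold."""
    scores = [ord(c) - offset for c in quality]
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < threshold:
        end -= 1
    # Apply the same slice to bases and qualities so they stay in step.
    return seq[start:end], quality[start:end]


trimmed_seq, trimmed_qual = trim_both_ends("ACGTAC", "!5II5!", 30)
print(trimmed_seq, trimmed_qual)
```

Low-quality bases in the interior of a read are kept; only the two ends are trimmed.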
ORFR — Finding Genes with ORFs
Given a DNA string, find the longest open reading frame (ORF) in any reading frame on either strand.
# Find longest ORF across all reading frames
let seq = read_fasta("data/sequences.fasta")[0].seq
# Check all 6 reading frames (3 forward + 3 reverse complement)
let frames = [seq, reverse_complement(seq)]
|> flat_map(|s| [
s,
s |> slice(1),
s |> slice(2)
])
let longest_orf = frames
|> flat_map(|frame| {
let protein = frame |> translate
# Find all ORFs: sequences between M and *
to_string(protein)
|> split("*")
|> flat_map(|segment| {
let start = index_of(segment, "M")
if start >= 0 { [slice(segment, start)] } else { [] }
})
})
|> sort(|a, b| len(b) - len(a))
|> head(1)
print(longest_orf[0])
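The six-frame search can also be sketched in plain Python (not BioLang) for cross-checking. The sketch assumes the standard genetic code, uppercase DNA, and that an ORF runs from a start codon to the next in-frame stop codon, so a trailing segment with no stop is ignored. The function names are illustrative.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def longest_orf(seq: str) -> str:
    """Longest protein from M to a stop codon over all six reading frames."""
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            protein = translate(strand[frame:])
            # Every segment except the last one ended at a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start >= 0 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


print(longest_orf("ATGAAATAG"))
```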
NEED — Pairwise Global Alignment
Fetch two protein sequences from UniProt and compute their global alignment score using BLOSUM62.
# requires: internet connection (UniProt API)
# Fetch two UniProt protein entries
let entry1 = uniprot_entry("B5ZC00")
let entry2 = uniprot_entry("P07204")
print(f"Protein 1: {entry1.name} ({entry1.sequence_length} aa)")
print(f"Protein 2: {entry2.name} ({entry2.sequence_length} aa)")
# Get FASTA sequences for comparison
let fasta1 = uniprot_fasta("B5ZC00")
let fasta2 = uniprot_fasta("P07204")
print(f"Sequence 1 length: {len(fasta1)}")
print(f"Sequence 2 length: {len(fasta2)}")
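The snippet above fetches the sequences but stops short of scoring them. As a hedged illustration of the scoring step, here is a minimal Needleman-Wunsch sketch in plain Python (not BioLang). It uses placeholder match, mismatch and linear gap scores rather than BLOSUM62, so its numbers will not match a BLOSUM62-based result; swapping in a substitution matrix means replacing the match/mismatch lookup.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("AC", "AGC"))
```

Only two rows of the dynamic-programming table are kept, which is enough when the score alone is needed and no traceback is required.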
MEME — New Motif Discovery
Given a set of DNA sequences, find the shared motif using positional analysis.
# Discover shared motifs in a collection of sequences
let records = read_fasta("data/sequences.fasta")
let sequences = records |> map(|r| r.seq)
# Find shared k-mers across all sequences
let k = 10
let shared = sequences[0]
|> kmers(k)
|> filter(|kmer| {
sequences |> all(|s| contains(to_string(s), to_string(kmer)))
})
# Report longest shared motif
let motifs = shared |> sort(|a, b| len(b) - len(a))
if len(motifs) > 0 { print(motifs[0]) }
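The shared k-mer filter can be cross-checked with a short plain-Python sketch (not BioLang). Note that this is exact substring matching for a fixed k, not the statistical motif discovery that the MEME tool performs; `shared_kmers` is a hypothetical name.

```python
def shared_kmers(sequences, k):
    """Return k-mers of the first sequence that occur in every other sequence."""
    if not sequences:
        return []
    first, rest = sequences[0], sequences[1:]
    seen = set()
    shared = []
    for i in range(len(first) - k + 1):
        kmer = first[i:i + k]
        # Report each distinct k-mer once, in order of first appearance.
        if kmer not in seen and all(kmer in s for s in rest):
            shared.append(kmer)
        seen.add(kmer)
    return shared


print(shared_kmers(["GATTACA", "TTACAG", "ATTACC"], 4))
```

Because every result has length k, finding the longest shared motif means repeating the search for decreasing k until something is found.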
SUBO — Suboptimal Local Alignment
Find the number of suboptimal local alignments between two sequences using Smith-Waterman.
# Compare two sequences from a FASTA file
let records = read_fasta("data/sequences.fasta") |> collect
let seq1 = records[0].seq
let seq2 = records[1].seq
# Find shared k-mers as a proxy for local similarity
let k = 5
let kmers1 = seq1 |> kmers(k) |> set
let kmers2 = seq2 |> kmers(k) |> set
let shared = intersection(kmers1, kmers2)
print(f"Shared {k}-mers: {len(shared)} of {len(kmers1)} and {len(kmers2)}")
# Jaccard similarity = |A ∩ B| / |A ∪ B|
let union_size = len(kmers1) + len(kmers2) - len(shared)
let similarity = float(len(shared)) / float(union_size)
print(f"Jaccard similarity: {similarity |> round(4)}")
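Two related set measures are easy to confuse here. Jaccard similarity divides the shared count by the size of the union, while dividing by the smaller set gives the overlap coefficient. The plain-Python sketch below (not BioLang) shows both on k-mer sets; it is a k-mer proxy for similarity, not a Smith-Waterman alignment, and the function names are illustrative.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared elements divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a: set, b: set) -> float:
    """Shared elements divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


k1 = kmer_set("ACGT", 2)  # {AC, CG, GT}
k2 = kmer_set("CGTT", 2)  # {CG, GT, TT}
print(jaccard(k1, k2), overlap_coefficient(k1, k2))
```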
CLUS — Global Multiple Alignment
Perform multiple sequence alignment and identify the most conserved column.
# Multiple sequence alignment
let records = read_fasta("data/sequences.fasta")
let sequences = records |> map(|r| to_string(r.seq))
# Find consensus by column analysis
let min_len = sequences |> map(|s| len(s)) |> min
let consensus = range(0, min_len) |> map(|pos| {
let column = sequences |> map(|s| s[pos])
# Find most frequent character
# Rank characters by how often they occur in the column, most frequent first
let ranked = column |> sort(|a, b| {
let ca = column |> filter(|c| c == a) |> len
let cb = column |> filter(|c| c == b) |> len
cb - ca
})
ranked[0]
})
print(consensus |> join(""))
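The column analysis can be sketched in plain Python (not BioLang) as follows. It assumes the input sequences are already aligned, since no alignment is performed here, and it truncates to the shortest sequence as the snippet above does. The function names are hypothetical.

```python
from collections import Counter


def column_counts(sequences):
    """One Counter of characters per column, up to the shortest sequence."""
    width = min((len(s) for s in sequences), default=0)
    return [Counter(s[pos] for s in sequences) for pos in range(width)]


def consensus(sequences) -> str:
    """Most frequent character in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sequences))


def most_conserved_column(sequences):
    """Index of the column whose top character is most frequent, or None."""
    counts = column_counts(sequences)
    if not counts:
        return None
    # max() returns the first such column when several tie.
    return max(range(len(counts)), key=lambda pos: counts[pos].most_common(1)[0][1])


aligned = ["ACGT", "ACGA", "TCGA"]
print(consensus(aligned))
print(most_conserved_column(aligned))
```

Counting each column once with `Counter` avoids recounting characters inside a sort comparator.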
Summary
The Rosalind Bioinformatics Armory problems emphasize using real tools and databases rather than implementing algorithms from scratch. BioLang's built-in API clients (ncbi_search, uniprot_entry, etc.) and native sequence operations (translate, reverse_complement, read_fastq) map directly to these problems, which keeps the solutions concise and readable.
| Problem | Key BioLang Features |
|---|---|
| INI | dna"..." literals, base_counts |
| GBK | ncbi_search |
| FRMT | ncbi_sequence, pipes |
| TFSQ | read_fastq, format conversion |
| PHRE | mean_phred, quality scores |
| PTRA | translate |
| RVCO | reverse_complement |
| FILT | read_fastq, quality filtering |
| BPHR | Quality per-position analysis |
| BFIL | trim_quality |
| ORFR | translate, reverse_complement, reading frames |
| NEED | uniprot_entry, uniprot_fasta |
| MEME | kmers, motif discovery |
| SUBO | kmers, intersection, Jaccard similarity |
| CLUS | Multiple alignment, consensus |