Rosalind Bioinformatics Armory

Solutions to the Rosalind Bioinformatics Armory problem set, implemented in idiomatic BioLang. These problems focus on using real bioinformatics tools and databases rather than implementing algorithms from scratch.

INI — Introduction to the Bioinformatics Armory

Given a DNA string, count the occurrences of each nucleotide (A, C, G, T).

# Count each nucleotide in a DNA string
let seq = dna"AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

let counts = base_counts(seq)
print(counts["A"], counts["C"], counts["G"], counts["T"])

GBK — GenBank Introduction

Search NCBI's Nucleotide database for entries from a given organism published between two dates. Return the count.

# requires: internet connection (NCBI API)
# Search GenBank for organism entries within a date range
let organism = "Anthoxanthum"
let date_from = "2003/07/25"
let date_to = "2005/12/27"

let query = organism + "[Organism] AND " + date_from + ":" + date_to + "[Publication Date]"
let result = ncbi_search("nucleotide", query)
print(len(result))
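
The query string above follows Entrez search-term syntax. As a cross-check outside BioLang, this Python sketch builds the same term and the matching E-utilities esearch URL. The helper names `build_term` and `esearch_url` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_term(organism, date_from, date_to):
    """Entrez term: an organism restricted to a publication-date range."""
    return f"{organism}[Organism] AND {date_from}:{date_to}[Publication Date]"

def esearch_url(term, db="nucleotide"):
    """URL for an esearch request (hypothetical helper; nothing is fetched here)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

term = build_term("Anthoxanthum", "2003/07/25", "2005/12/27")
print(term)
print(esearch_url(term))
```

By default esearch returns at most 20 IDs per request (`retmax`), so the total is read from the response's Count field rather than from the length of the ID list. Whether `ncbi_search` returns every matching ID is not stated here, so it is worth checking before relying on `len(result)`.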

FRMT — Data Formats

Given a list of GenBank IDs, fetch their FASTA records and return the one with the shortest sequence.

# requires: internet connection (NCBI API)
# Find the shortest sequence among GenBank accessions
let ids = ["JX205496.1", "JX469991.1", "JX469983.1"]

let shortest = ids
  |> map(|id| {
    let seq = ncbi_sequence(id)
    {id: id, seq: seq, length: len(seq)}
  })
  |> sort_by(|a, b| a.length - b.length)
  |> head(1)

print(">" + shortest[0].id)
print(shortest[0].seq)

TFSQ — FASTQ Format Introduction

Convert FASTQ records to FASTA format.

# Convert FASTQ to FASTA
let reads = read_fastq("data/reads.fastq")

for read in reads {
  print(">" + read.id)
  print(read.seq)
}

# Or as a pipeline:
read_fastq("data/reads.fastq")
  |> map(|r| ">" + r.id + "\n" + to_string(r.seq))
  |> each(|line| print(line))
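
For comparison, the same conversion can be sketched in plain Python. It assumes the common four-line FASTQ layout (header, sequence, `+`, quality) with no wrapped sequences; `fastq_to_fasta` is a hypothetical helper name.

```python
def fastq_to_fasta(text):
    """Convert FASTQ text (assumed 4 lines per record) to FASTA text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    out = []
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        out.append(">" + header[1:])  # swap the FASTQ '@' for the FASTA '>'
        out.append(seq)               # the '+' and quality lines are dropped
    return "\n".join(out)

print(fastq_to_fasta("@SEQ_ID\nGATTTGGGG\n+\n!''*((((*\n"))
```

Records are read by position rather than by scanning for `@`, because `@` is also a valid quality character.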

PHRE — Read Quality Distribution

Given a quality threshold, count how many reads have average quality below it.

# Count reads with average quality below threshold
let threshold = 28
let reads = read_fastq("data/reads.fastq")

let below = reads
  |> filter(|r| mean_phred(r.quality) < threshold)
  |> len

print(below)
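
To make the quality arithmetic explicit, here is a small Python sketch of the same count. It assumes Phred+33 encoding (each quality character's ASCII code minus 33), which is the usual convention for current FASTQ files; the function names are illustrative, not BioLang's.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in quality]

def mean_quality(quality, offset=33):
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)

def count_below(qualities, threshold):
    """Number of reads whose mean quality is strictly below the threshold."""
    return sum(1 for q in qualities if mean_quality(q) < threshold)

print(count_below(["IIII", "!!!!", "5555"], 28))
```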

PTRA — Protein Translation

Given a DNA string and a protein, find which genetic code table was used for translation.

# Translate DNA using different codon tables
let dna_seq = dna"ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA"

# Standard translation
let protein = dna_seq |> translate
print("Standard:", protein)

# The translate function uses the standard codon table by default.
# Compare its output with the expected protein to identify the table.
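
The comparison step can be sketched in Python by translating under several genetic codes and keeping those that reproduce the protein. This is an illustration under stated assumptions: only NCBI tables 1, 2 and 4 are included, and a table counts as a match when its translation equals the protein followed by a single stop.

```python
BASES = "TCAG"
# Amino-acid strings in NCBI order (codons enumerated TTT, TTC, TTA, TTG, TCT, ...).
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def translate(dna, table_id):
    """Translate uppercase DNA codon by codon using the chosen table."""
    code = dict(zip(CODONS, TABLES[table_id]))
    return "".join(code[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def matching_tables(dna, protein):
    """Table IDs whose translation is the protein followed by one stop."""
    return [t for t in TABLES if translate(dna, t) == protein + "*"]

dna = "ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA"
print(matching_tables(dna, "MAMAPRTEINSTRING"))
```

Tables 2 and 4 fail on this input because they read the final TGA as tryptophan, and table 2 also reads AGA as a stop.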

RVCO — Complementing a Strand of DNA

Given FASTA records, count how many sequences are equal to their own reverse complement.

# Count sequences that match their own reverse complement
let records = read_fasta("data/sequences.fasta")

let count = records
  |> filter(|r| to_string(r.seq) == to_string(reverse_complement(r.seq)))
  |> len

print(count)
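
For reference, the reverse-complement test itself is short enough to write out. This Python sketch assumes uppercase sequences containing only A, C, G and T; `count_self_rc` is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def count_self_rc(seqs):
    """How many sequences equal their own reverse complement."""
    return sum(1 for s in seqs if s == reverse_complement(s))

print(count_self_rc(["ATAT", "GGGG", "GAATTC"]))
```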

FILT — Read Filtration by Quality

Filter FASTQ reads: keep only those where at least a given percentage of bases meet a quality threshold.

# Filter reads by per-base quality percentage
let quality_threshold = 20
let percentage = 90
let reads = read_fastq("data/reads.fastq")

let passed = reads |> filter(|r| {
  let scores = r.quality
  let total = len(scores)
  let good = scores |> filter(|q| q >= quality_threshold) |> len
  (good * 100) / total >= percentage
})

print(len(passed))
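
The percentage test is easy to get subtly wrong when integer division truncates. The Python sketch below sidesteps that by cross-multiplying instead of dividing. It assumes Phred+33 quality strings; `passes` is an illustrative name.

```python
def passes(quality, q_threshold, percentage, offset=33):
    """True if at least `percentage` percent of bases score >= q_threshold."""
    scores = [ord(c) - offset for c in quality]
    good = sum(1 for s in scores if s >= q_threshold)
    # good / total >= percentage / 100, rewritten without division
    return good * 100 >= percentage * len(scores)

reads = ["I" * 9 + "!", "I" * 8 + "!!"]
print(sum(1 for q in reads if passes(q, 20, 90)))
```

Whether `/` truncates on integers in BioLang is not stated here; the cross-multiplied form gives the same answer either way.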

BPHR — Base Quality Distribution

Given FASTQ reads and a quality threshold, count positions where the average quality falls below it.

# Count positions with average quality below threshold
let threshold = 20
let reads = read_fastq("data/reads.fastq")

# Collect quality scores per position
let num_positions = reads |> map(|r| len(r.quality)) |> max
let position_avgs = range(0, num_positions) |> map(|pos| {
  let scores_at_pos = reads
    |> filter(|r| len(r.quality) > pos)
    |> map(|r| r.quality[pos])
  mean(scores_at_pos)
})

let below = position_avgs |> filter(|avg| avg < threshold) |> len
print(below)
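
The per-position average can also be sketched in Python. As in the snippet above, reads shorter than a given position are left out of that position's mean. Phred+33 encoding is assumed and the function name is hypothetical.

```python
def low_quality_positions(qualities, threshold, offset=33):
    """Count positions whose mean quality across reads is below the threshold."""
    longest = max(len(q) for q in qualities)
    count = 0
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        column = [ord(q[pos]) - offset for q in qualities if len(q) > pos]
        if sum(column) / len(column) < threshold:
            count += 1
    return count

print(low_quality_positions(["II!", "I!!", "I5"], 20))
```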

BFIL — Base Filtration by Quality

Trim each FASTQ read from both ends, removing bases below a quality threshold.

# Trim reads from both ends by quality
let threshold = 20
let reads = read_fastq("data/reads.fastq")

for read in reads {
  let q = read.quality
  let trimmed = trim_quality(q, threshold)
  print("@" + read.id)
  print(read.seq)   # untrimmed here; slice to the same range as `trimmed`
  print("+")
  print(trimmed)
}
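
Trimming only makes sense if the sequence and its quality string are cut to the same range. The Python sketch below shows one way to do that. It assumes Phred+33 encoding and trims from each end until the first base at or above the threshold; `trim_both_ends` is a hypothetical helper, not BioLang's `trim_quality`.

```python
def trim_both_ends(seq, quality, threshold, offset=33):
    """Strip low-quality bases from both ends; return (seq, quality) slices."""
    scores = [ord(c) - offset for c in quality]
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    # The same slice is applied to both strings so they stay aligned.
    return seq[start:end], quality[start:end]

print(trim_both_ends("ACGTACGT", "!!IIII5!", 21))
```

Low-quality bases in the interior of a read are kept, since only the ends are trimmed.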

ORFR — Finding Genes with ORFs

Given a DNA string, find the longest open reading frame (ORF) in any reading frame on either strand.

# Find longest ORF across all reading frames
let seq = read_fasta("data/sequences.fasta")[0].seq

# Check all 6 reading frames (3 forward + 3 reverse complement)
let frames = [seq, reverse_complement(seq)]
  |> flat_map(|s| [
    s,
    s |> slice(1),
    s |> slice(2)
  ])

let longest_orf = frames
  |> flat_map(|frame| {
    let protein = frame |> translate
    # Find all ORFs: sequences between M and *
    to_string(protein)
      |> split("*")
      |> flat_map(|segment| {
        let start = index_of(segment, "M")
        if start >= 0 { [slice(segment, start)] } else { [] }
      })
  })
  |> sort(|a, b| len(b) - len(a))
  |> head(1)

print(longest_orf[0])
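
A six-frame search can be cross-checked with a short Python sketch. It assumes the standard genetic code and uppercase A, C, G, T input. Unlike the snippet above, it only accepts candidates that are terminated by a stop codon, which is a design choice of this sketch.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def longest_orf(dna):
    """Longest protein starting at M and ending at a stop, over six frames."""
    best = ""
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            # Every segment except the last one was closed by a stop codon.
            segments = translate(strand[offset:]).split("*")[:-1]
            for segment in segments:
                start = segment.find("M")
                if start >= 0 and len(segment) - start > len(best):
                    best = segment[start:]
    return best

print(longest_orf("TTAGGGCAT"))
```

The example input has its only ORF on the reverse strand (`ATGCCCTAA`).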

NEED — Pairwise Global Alignment

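Fetch two protein sequences from UniProt as input for a global alignment scored with BLOSUM62. The snippet below covers the retrieval step: it reports each entry's name and length but does not compute the alignment score.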

# requires: internet connection (UniProt API)
# Fetch two UniProt protein entries
let entry1 = uniprot_entry("B5ZC00")
let entry2 = uniprot_entry("P07204")

print(f"Protein 1: {entry1.name} ({entry1.sequence_length} aa)")
print(f"Protein 2: {entry2.name} ({entry2.sequence_length} aa)")

# Get FASTA sequences for comparison
let fasta1 = uniprot_fasta("B5ZC00")
let fasta2 = uniprot_fasta("P07204")
print(f"Sequence 1 length: {len(fasta1)}")
print(f"Sequence 2 length: {len(fasta2)}")
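
The scoring step that follows retrieval can be sketched with the Needleman-Wunsch recurrence. This Python illustration uses a linear gap penalty and a toy match/mismatch function, both of which are assumptions. A BLOSUM62 lookup would be passed in as `score` instead, and tools such as EMBOSS needle use affine gaps, so the numbers will differ from theirs.

```python
def global_alignment_score(a, b, score, gap=-5):
    """Needleman-Wunsch score with a linear gap penalty (two-row DP)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]

def simple_score(x, y):
    """Toy substitution score; stands in for a real matrix such as BLOSUM62."""
    return 1 if x == y else -1

print(global_alignment_score("GATTACA", "GATTACA", simple_score, gap=-1))
```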

MEME — New Motif Discovery

Given a set of DNA sequences, find the shared motif using positional analysis.

# Discover shared motifs in a collection of sequences
let records = read_fasta("data/sequences.fasta")
let sequences = records |> map(|r| r.seq)

# Find shared k-mers across all sequences
let k = 10
let shared = sequences[0]
  |> kmers(k)
  |> filter(|kmer| {
    sequences |> all(|s| contains(to_string(s), to_string(kmer)))
  })

# Report one shared motif (every candidate has length k, so the length sort does not rank them)
let motifs = shared |> sort(|a, b| len(b) - len(a))
if len(motifs) > 0 { print(motifs[0]) }
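
The shared k-mer filter can be sketched in Python as follows. It is a simple exact-match heuristic rather than a statistical motif finder like MEME, and the function names are illustrative.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shared_kmers(seqs, k):
    """Distinct k-mers of the first sequence that occur in every other one."""
    seen = set()
    shared = []
    for kmer in kmers(seqs[0], k):
        if kmer not in seen and all(kmer in s for s in seqs[1:]):
            shared.append(kmer)
        seen.add(kmer)
    return shared

print(shared_kmers(["ACGTACGT", "TTACGTT", "GACGTA"], 4))
```

Finding the longest shared motif would mean repeating the search for decreasing values of `k` until a hit appears.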

SUBO — Suboptimal Local Alignment

Find the number of suboptimal local alignments between two sequences using Smith-Waterman.

# Compare two sequences from a FASTA file
let records = read_fasta("data/sequences.fasta") |> collect
let seq1 = records[0].seq
let seq2 = records[1].seq

# Find shared k-mers as a proxy for local similarity
let k = 5
let kmers1 = seq1 |> kmers(k) |> set
let kmers2 = seq2 |> kmers(k) |> set
let shared = intersection(kmers1, kmers2)
print(f"Shared {k}-mers: {len(shared)} of {len(kmers1)} and {len(kmers2)}")
let similarity = float(len(shared)) / float(min(len(kmers1), len(kmers2)))
print(f"Overlap coefficient: {similarity |> round(4)}")
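
The ratio computed above divides the shared count by the size of the smaller k-mer set, which is the overlap coefficient. Jaccard similarity divides by the size of the union instead. The Python sketch below shows both side by side; the function names are illustrative.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """|A & B| / |A | B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_coefficient(a, b):
    """|A & B| / min(|A|, |B|)"""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

a, b = kmer_set("ACGTA", 2), kmer_set("CGTT", 2)
print(jaccard(a, b), overlap_coefficient(a, b))
```

Both are k-mer proxies for similarity; neither counts suboptimal Smith-Waterman alignments.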

CLUS — Global Multiple Alignment

Perform multiple sequence alignment and identify the most conserved column.

# Multiple sequence alignment
let records = read_fasta("data/sequences.fasta")
let sequences = records |> map(|r| to_string(r.seq))

# Find consensus by column analysis
let min_len = sequences |> map(|s| len(s)) |> min
let consensus = range(0, min_len) |> map(|pos| {
  let column = sequences |> map(|s| s[pos])
  # Rank characters by how often they occur in this column, most frequent first
  let ranked = column |> sort(|a, b| {
    let ca = column |> filter(|c| c == a) |> len
    let cb = column |> filter(|c| c == b) |> len
    cb - ca
  })
  ranked[0]
})

print(consensus |> join(""))
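
Column-wise consensus and the most conserved column can be sketched in Python with `collections.Counter`. The sketch assumes the input sequences are already aligned (or at least comparable position by position) and performs no alignment itself; the function names are hypothetical.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent character per column, up to the shortest sequence."""
    width = min(len(s) for s in seqs)
    return "".join(
        Counter(s[pos] for s in seqs).most_common(1)[0][0]
        for pos in range(width)
    )

def most_conserved_column(seqs):
    """Index of the first column with the highest majority count."""
    width = min(len(s) for s in seqs)

    def top_count(pos):
        return Counter(s[pos] for s in seqs).most_common(1)[0][1]

    return max(range(width), key=top_count)

seqs = ["ACGT", "ACGA", "TCGA"]
print(consensus(seqs), most_conserved_column(seqs))
```

Counting each column once avoids re-scanning the column inside a sort comparator.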

Summary

The Rosalind Bioinformatics Armory problems emphasize using real tools and databases rather than implementing algorithms from scratch. BioLang's built-in API clients (ncbi_search, uniprot_entry, etc.) and native sequence operations (translate, reverse_complement, read_fastq) map directly to these problems, which keeps the solutions concise and readable.

Problem   Key BioLang Features
INI       dna"..." literals, base_counts
GBK       ncbi_search
FRMT      ncbi_sequence, pipes
TFSQ      read_fastq, format conversion
PHRE      mean_phred, quality scores
PTRA      translate
RVCO      reverse_complement
FILT      read_fastq, quality filtering
BPHR      Per-position quality analysis
BFIL      trim_quality
ORFR      translate, reverse_complement, reading frames
NEED      uniprot_entry, uniprot_fasta
MEME      kmers, motif discovery
SUBO      kmers, set intersection
CLUS      Column-wise consensus