K-mer Analysis
K-mers are contiguous subsequences of length k extracted from a longer sequence. They are foundational to many bioinformatics algorithms including genome assembly, sequence comparison, metagenomic classification, and error correction. BioLang provides optimized k-mer functions that use 2-bit encoding for k ≤ 32, enabling memory-efficient counting of billions of k-mers.
kmer_encode / kmer_decode
Convert between k-mer strings and their integer encodings. Each base is encoded as 2 bits (A=00, C=01, G=10, T=11), so a k-mer of length k requires 2k bits, and any k-mer with k ≤ 32 fits in 64 bits:
# Encode a k-mer to its integer representation
# kmer_encode(sequence, k)
let code = kmer_encode(dna"ATCG", 4) # => 54 (binary 00 11 01 10)
code = kmer_encode(dna"AAAA", 4) # => 0
code = kmer_encode(dna"TTTT", 4) # => 255
# Decode back to a string
let seq = kmer_decode(54) # => "ATCG"
seq = kmer_decode(0) # => "AAAA"
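The bit packing described above can be sketched in a few lines of Python. This is only an illustration of the A=00, C=01, G=10, T=11 scheme, not BioLang's implementation; the function names `encode_kmer` and `decode_kmer` are hypothetical, and the explicit `k` argument to the decoder is an assumption of this sketch (leading A's encode as zero bits, so the integer alone does not reveal the length).

```python
# Illustrative 2-bit packing using the mapping stated above.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"  # index 0..3 maps back to the base


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, first base in the highest bits."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack k bases from an integer produced by encode_kmer."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 0b11])  # read the lowest two bits
        code >>= 2
    return "".join(reversed(bases))  # lowest bits held the last base


print(encode_kmer("GATTACA"))   # 9156
print(decode_kmer(9156, 7))     # GATTACA
print(encode_kmer("TTTT"))      # 255
```

Because each step shifts left by two bits, comparing two encoded k-mers as integers gives the same order as comparing the strings lexicographically, provided both have the same k.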
kmer_count
Count all k-mer occurrences in a sequence, list, table, or stream. Returns a Table
with kmer and count columns, sorted by count descending.
Accepts DNA, RNA, strings, lists of sequences, records with a seq field,
and FASTQ streams.
let seq = dna"ATCGATCGATCGATCG"
# Count all 4-mers — returns a Table sorted by count (descending)
let counts = kmer_count(seq, 4)
print(counts) # Table: kmer | count
# Stream mode — constant memory for large FASTQ files
read_fastq("data/reads.fastq")
|> kmer_count(21)
|> head(10)
|> print()
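For readers who want to see what the counting step amounts to, here is a minimal Python sketch of overlapping k-mer counting with results ordered by count, descending. It is a plain in-memory illustration under the assumption that every overlapping window is counted; the helper name `count_kmers` is hypothetical and this is not how BioLang implements `kmer_count`.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> list[tuple[str, int]]:
    """Count every overlapping k-mer; return (kmer, count) pairs, most frequent first."""
    # A sequence of length n has n - k + 1 overlapping k-mers.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common()


for kmer, n in count_kmers("ATCGATCGATCGATCG", 4):
    print(kmer, n)
# ATCG 4
# TCGA 3
# CGAT 3
# GATC 3
```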
Memory Options for Large Datasets
For whole-genome or deep-sequencing data, k-mer counting can generate millions of unique entries. BioLang provides three strategies to manage memory:
| Call | Strategy |
| --- | --- |
| kmer_count(seq, 21) | Default: counts in memory up to ~2M unique k-mers, then auto-spills to a temporary SQLite database on disk. No code changes needed. |
| kmer_count(seq, 21, 100) | Top-N mode: keeps only the N most frequent k-mers (here N = 100). Uses bounded memory with periodic pruning. Best when you only need the most common k-mers. |
| read_fastq(...) \|> kmer_count(21) | Streaming: processes reads one at a time without loading all reads into memory. Combine with head(n) to limit output. |
The result is always sorted by count descending — no need for sort_by after kmer_count.
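The page does not say how Top-N mode prunes its counters. As one well-known way to keep frequent items in bounded memory, the sketch below uses the Misra-Gries summary; this is a swapped-in textbook technique chosen for illustration, not a description of BioLang's internals, and the names `misra_gries` and `capacity` are hypothetical.

```python
def misra_gries(items, capacity: int) -> dict:
    """Track at most `capacity` counters over a stream (Misra-Gries summary).

    Reported counts are lower bounds on the true counts, so the result
    identifies candidate frequent items rather than exact totals.
    """
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# "a" occurs 5 times, "b" 3 times, "c" once; only two counters are allowed.
print(misra_gries("aaaaabbbc", capacity=2))  # {'a': 4, 'b': 2}
```

The trade-off is visible in the output: the frequent items survive, but their counts are underestimates, which is why an exact top-N report generally needs either a second pass or a different pruning rule.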
kmer_spectrum
The k-mer spectrum (frequency-of-frequencies) shows how many distinct k-mers appear exactly
1 time, 2 times, etc. This is useful for genome size estimation and error rate
analysis. The example below passes the sequence and k directly:
let seq = dna"ATCGATCGATCGATCGATCG"
# Compute k-mer frequency spectrum
let spectrum = kmer_spectrum(seq, 4)
# => Table with frequency and count columns
spectrum |> print()
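The frequency-of-frequencies idea is easy to misread, so here is a small Python sketch of it: count the k-mers, then count how often each count value occurs. The helper name `spectrum_of` is hypothetical and the code only illustrates the definition; it makes no claim about how `kmer_spectrum` is implemented.

```python
from collections import Counter


def spectrum_of(seq: str, k: int) -> list[tuple[int, int]]:
    """Return (frequency, number of distinct k-mers with that frequency) pairs."""
    kmer_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Second pass: histogram over the count values themselves.
    freq_of_freq = Counter(kmer_counts.values())
    return sorted(freq_of_freq.items())


print(spectrum_of("ATCGATCGATCGATCGATCG", 4))
# [(4, 3), (5, 1)]  -> three 4-mers occur 4 times, one occurs 5 times
```

On real sequencing data the low-frequency end of this histogram is dominated by k-mers containing sequencing errors, which is what makes the spectrum useful for error rate analysis.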
minimizers
Minimizers are a subsampling scheme that selects the lexicographically smallest k-mer within each window of size w. They are used in minimap2, Kraken, and many other tools for efficient sequence comparison and indexing:
let seq = dna"ATCGATCGATCGATCGATCGATCG"
# Extract minimizers: minimizers(seq, k, window)
let mins = minimizers(seq, 7, 11)
# Returns a list of { kmer, position } records
mins |> map(|m| print(m.position, m.kmer))
# Minimizer density (minimizers per base)
let density = float(len(mins)) / float(seq_len(seq))
print("Minimizer density:", density)
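To make the selection rule concrete, the following Python sketch picks the lexicographically smallest k-mer in each window and records it once per position. It assumes a window means w consecutive k-mers and that ties go to the leftmost occurrence; the page does not specify either detail for BioLang's `minimizers`, and the helper name `pick_minimizers` is hypothetical.

```python
def pick_minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, kmer) for the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        # Compare by (kmer, position) so ties resolve to the leftmost k-mer.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        # Adjacent windows usually share a minimizer; record each position once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


print(pick_minimizers("ATCGATCGATCGATCGATCGATCG", 7, 11))
# [(0, 'ATCGATC'), (4, 'ATCGATC'), (8, 'ATCGATC')]
```

This version rescans every window, so it costs O(n·w) comparisons; production tools typically use a sliding-window minimum to avoid the rescan, and many order k-mers by a hash rather than lexicographically.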
kmer_encode / kmer_decode
Convert between k-mer strings and their compact integer encodings, useful for efficient storage and comparison:
# Encode a k-mer to its 2-bit integer representation
let code = kmer_encode(dna"ATCG", 4)
print(code) # integer encoding
# Decode back to string
let seq = kmer_decode(code)
print(seq)
Practical Example: Metagenomic Classification
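K-mer analysis enables fast taxonomic classification without alignment. The snippets below build the k-mer profile of a read set, the starting point for that kind of analysis: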
# Top 21-mers across all reads — streaming, bounded memory
read_fastq("data/reads.fastq")
|> kmer_count(21, 50)
|> print()
# Or full k-mer profile (auto-spills to disk for large datasets)
read_fastq("data/reads.fastq")
|> kmer_count(21)
|> head(20)
|> bar_chart("Top 21-mers")