K-mer Analysis

K-mers are contiguous subsequences of length k extracted from a longer sequence. They are foundational to many bioinformatics algorithms, including genome assembly, sequence comparison, metagenomic classification, and error correction. BioLang provides optimized k-mer functions that use 2-bit encoding for k ≤ 32, so each k-mer needs at most 64 bits, which enables memory-efficient counting of billions of k-mers.

kmer_encode / kmer_decode

Convert between k-mer strings and their integer encodings. Each base is encoded as 2 bits (A=00, C=01, G=10, T=11), so a k-mer of length k requires 2k bits:

# Encode a k-mer to its integer representation
# kmer_encode(sequence, k)
let code = kmer_encode(dna"ATCG", 4)     # => 54 (binary 00 11 01 10)
code = kmer_encode(dna"AAAA", 4)     # => 0
code = kmer_encode(dna"TTTT", 4)     # => 255

# Decode back to a string
let seq = kmer_decode(54)                # => "ATCG"
seq = kmer_decode(0)                 # => "AAAA"
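
To make the arithmetic concrete, here is a minimal Python sketch of the 2-bit scheme described above. It is an illustration, not BioLang's implementation: the names encode_kmer and decode_kmer are hypothetical, and it assumes the first base occupies the most significant bits, which is consistent with AAAA = 0 and TTTT = 255. This decoder takes k explicitly, because leading A bases are zeros and cannot be recovered from the integer alone.

```python
# Illustrative sketch only; not BioLang's implementation.
# Assumption: the first base of the k-mer lands in the highest bits.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer using 2 bits per base."""
    code = 0
    for base in kmer:
        # Shift existing bases left by 2 bits and append the new base.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code, k):
    """Unpack an integer into a k-mer of length k (hypothetical helper)."""
    bases = []
    for _ in range(k):
        # The lowest 2 bits hold the last base of the k-mer.
        bases.append(BITS_TO_BASE[code & 0b11])
        code >>= 2
    return "".join(reversed(bases))


print(encode_kmer("ATCG"))   # 54, i.e. binary 00 11 01 10
print(decode_kmer(54, 4))    # ATCG
```

Since each base costs 2 bits, a 32-mer fills exactly 64 bits, which is why the 2-bit path applies to k ≤ 32.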

kmer_count

Count all k-mer occurrences in a sequence, list, table, or stream. Returns a Table with kmer and count columns, sorted by count descending. Accepts DNA, RNA, strings, lists of sequences, records with a seq field, and FASTQ streams.

let seq = dna"ATCGATCGATCGATCG"

# Count all 4-mers — returns a Table sorted by count (descending)
let counts = kmer_count(seq, 4)
print(counts)   # Table: kmer | count

# Stream mode: reads are processed one at a time, so the FASTQ file is never fully loaded
read_fastq("data/reads.fastq")
  |> kmer_count(21)
  |> head(10)
  |> print()
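
For readers who want to see what the count amounts to, the following Python sketch slides a window of length k over the sequence and tallies each substring. It is only an illustration of the idea; count_kmers is a hypothetical name, and the sketch says nothing about how BioLang implements kmer_count internally.

```python
# Illustrative sketch only; not BioLang's implementation.
from collections import Counter


def count_kmers(seq, k):
    """Return (kmer, count) pairs, most frequent first."""
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common()


for kmer, count in count_kmers("ATCGATCGATCGATCG", 4):
    print(kmer, count)
```

On the 16-base example sequence there are 13 overlapping 4-mers: ATCG occurs 4 times, and TCGA, CGAT, and GATC occur 3 times each.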

Memory Options for Large Datasets

For whole-genome or deep-sequencing data, k-mer counting can generate millions of unique entries. BioLang provides three strategies to manage memory:

kmer_count(seq, 21)	Default: in-memory up to ~2M unique k-mers, then auto-spills to a temporary SQLite database on disk. No code changes needed.
kmer_count(seq, 21, 100)	Top-N mode: only keeps the top N most frequent k-mers. Uses bounded memory with periodic pruning. Best when you only need the most common k-mers.
fastq(...) |> kmer_count(21)	Streaming: processes reads one at a time without loading all reads into memory. Combine with head(n) to limit output.

The result is always sorted by count descending — no need for sort_by after kmer_count.
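
The top-N idea can be sketched in Python as follows. This is one simple way to bound memory with periodic pruning, written under stated assumptions: the document does not describe BioLang's actual pruning rule, so the function top_kmers, the cap_factor parameter, and the choice to prune after each read are all hypothetical. The heuristic is lossy: a k-mer that is pruned and later reappears starts counting again from zero.

```python
# Illustrative sketch only; BioLang's real pruning strategy may differ.
from collections import Counter


def top_kmers(reads, k, n, cap_factor=10):
    """Approximate the n most frequent k-mers from an iterable of reads.

    The table is cut back to its n largest entries whenever it holds
    more than n * cap_factor distinct k-mers (hypothetical threshold).
    """
    counts = Counter()
    cap = n * cap_factor
    for read in reads:  # one read at a time, so reads can be a generator
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
        if len(counts) > cap:
            # Keep only the current top n; everything else is forgotten.
            counts = Counter(dict(counts.most_common(n)))
    return counts.most_common(n)


reads = (r for r in ["AAAAAA"] * 5 + ["ACGTAC"])
print(top_kmers(reads, k=3, n=1, cap_factor=2))
```

Because the reads are consumed from a generator, only the count table and the current read are held in memory at any time.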

kmer_spectrum

The k-mer spectrum (frequency of frequencies) shows how many distinct k-mers appear exactly once, twice, and so on. This is useful for genome size estimation and error rate analysis. The example below passes a sequence and k directly:

let seq = dna"ATCGATCGATCGATCGATCG"

# Compute k-mer frequency spectrum
let spectrum = kmer_spectrum(seq, 4)
# => Table with frequency and count columns

spectrum |> print()
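
The frequency-of-frequencies computation is a count of counts, which the Python sketch below spells out. It is an illustration rather than BioLang's implementation, and spectrum_of is a hypothetical name.

```python
# Illustrative sketch only; not BioLang's implementation.
from collections import Counter


def spectrum_of(seq, k):
    """Map each occurrence count to the number of distinct k-mers with it."""
    # Step 1: how often does each k-mer occur?
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Step 2: how many distinct k-mers share each occurrence count?
    spectrum = Counter(counts.values())
    return dict(sorted(spectrum.items()))


print(spectrum_of("ATCGATCGATCGATCGATCG", 4))   # {4: 3, 5: 1}
```

For the 20-base example, ATCG occurs 5 times and TCGA, CGAT, and GATC occur 4 times each, so the spectrum has three k-mers at frequency 4 and one at frequency 5.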

minimizers

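Minimizers are a subsampling scheme that selects the lexicographically smallest k-mer within each window of size w. Because neighboring windows overlap, they often share the same minimizer, so far fewer k-mers are kept than the sequence contains. Minimizers are used in minimap2, Kraken, and many other tools for efficient sequence comparison and indexing: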

let seq = dna"ATCGATCGATCGATCGATCGATCG"

# Extract minimizers: minimizers(seq, k, window)
let mins = minimizers(seq, 7, 11)
# Returns a list of { kmer, position } records

mins |> map(|m| print(m.position, m.kmer))

# Minimizer density (minimizers per base)
let density = float(len(mins)) / float(seq_len(seq))
print("Minimizer density:", density)
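
The selection rule can be sketched in Python as below. This is an illustration under assumptions, not BioLang's implementation: it treats the window as w consecutive k-mers (the document does not say whether w counts k-mers or bases), breaks ties by taking the leftmost position, and reports each selected position once. The name window_minimizers is hypothetical.

```python
# Illustrative sketch only; window semantics and tie-breaking are assumptions.
def window_minimizers(seq, k, w):
    """Return (position, kmer) for the smallest k-mer in each window.

    A window here is w consecutive k-mers. Ties go to the leftmost
    k-mer, and a position shared by adjacent windows is reported once.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = []
    for start in range(len(kmers) - w + 1):
        # Compare by k-mer text first, then by position to break ties.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        if not selected or selected[-1][0] != pos:
            selected.append((pos, kmers[pos]))
    return selected


mins = window_minimizers("ACGTACGT", 3, 2)
print(mins)                          # [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
print(len(mins) / len("ACGTACGT"))   # minimizer density: 0.5
```

Production tools often order k-mers by a hash rather than lexicographically, which avoids over-selecting low-complexity k-mers such as runs of A; the lexicographic order here follows the definition given in the text.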

kmer_encode / kmer_decode

Convert between k-mer strings and their compact integer encodings, useful for efficient storage and comparison:

# Encode a k-mer to its 2-bit integer representation
let code = kmer_encode(dna"ATCG", 4)
print(code)   # integer encoding

# Decode back to string
let seq = kmer_decode(code)
print(seq)

Practical Example: Metagenomic Classification

K-mer analysis enables fast taxonomic classification without alignment. A typical first step is to profile the most frequent k-mers in the reads:

# Top 21-mers across all reads — streaming, bounded memory
read_fastq("data/reads.fastq")
  |> kmer_count(21, 50)
  |> print()

# Or full k-mer profile (auto-spills to disk for large datasets)
read_fastq("data/reads.fastq")
  |> kmer_count(21)
  |> head(20)
  |> bar_chart("Top 21-mers")
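
To show how k-mer profiles lead to alignment-free classification, here is a toy Python sketch: reference sequences are indexed by their k-mers, and each read is assigned to the taxon that shares the most k-mers with it. Everything in it is hypothetical (the function names, the taxon labels, and the tiny reference strings), and real classifiers add refinements such as lowest-common-ancestor resolution that are omitted here.

```python
# Illustrative sketch only; names, taxa, and sequences are made up.
from collections import Counter


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon with the most shared k-mers, or None if no hits."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for taxon in index.get(read[i:i + k], ()):
            votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else None


references = {
    "taxon_a": "ATCGATCGGCTA",
    "taxon_b": "TTTTGGGGCCCCAAAA",
}
index = build_index(references, 4)
print(classify("CGATCGG", index, 4))    # taxon_a
print(classify("GGGGCCCC", index, 4))   # taxon_b
print(classify("ACACACAC", index, 4))   # None
```

Each lookup is a dictionary access on an exact k-mer, so no alignment is computed at any point.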