Benchmarks & Correctness

BioLang is benchmarked against Python (BioPython) and R (Bioconductor) on 30 bioinformatics tasks spanning sequence I/O, k-mer analysis, interval overlaps, variant processing, and multi-step pipelines. All results are reproducible from the benchmarks/ directory.

Test Environment

Linux (WSL2)

  • Intel Core i9-12900K, 16 GB RAM
  • BioLang 0.3.0, Python 3.12.3, R 4.3.3

Windows 11

  • Intel Core i9-12900K, 32 GB RAM
  • BioLang 0.3.0, Python 3.12.4

Results Summary

Where BioLang Wins

BioLang’s Rust I/O engine (noodles) and native 2-bit DNA encoding deliver the biggest gains on:

| Task | BioLang | Python | Speedup |
|------|---------|--------|---------|
| ENCODE Peak Overlap | 0.363s | 2.574s | 7.1x |
| Protein K-mers | 0.191s | 1.331s | 7.0x |
| FASTA Parse (30 KB) | 0.138s | 0.926s | 6.7x |
| FASTA gzipped | 0.141s | 0.930s | 6.6x |
| E. coli Genome | 0.176s | 1.081s | 6.1x |
| GC Content (51 MB) | 0.830s | 2.771s | 3.3x |
| K-mer Counting (21-mers) | 6.551s | 21.01s | 3.2x |
| FASTQ QC Pipeline | 2.349s | 5.059s | 2.2x |

Where Python Wins

Python’s csv and re modules and its built-in dict are highly optimized C implementations. On text-heavy parsing tasks:

| Task | BioLang | Python | Result |
|------|---------|--------|--------|
| VCF Filtering | 0.349s | 0.166s | Py 2.1x faster |
| ClinVar Variants | 0.661s | 0.265s | Py 2.5x faster |
| CSV Join + Group-by | 0.281s | 0.156s | Py 1.8x faster |
| GFF3 Ensembl chr22 | 0.453s | 0.171s | Py 2.6x faster |
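To make the comparison concrete, here is a minimal pure-Python sketch of the kind of filter the VCF tasks time; it leans entirely on CPython's C-level `str.split` and dict machinery. The records below are invented for the example, and this is not the benchmark's actual script:

```python
# Minimal pure-Python VCF filter: count QUAL >= 30 variants per chromosome.
# Illustrative of the text-heavy parsing style being benchmarked.
import io
from collections import Counter

def filter_vcf(lines, min_qual=30.0):
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom, qual = fields[0], fields[5]
        if qual != "." and float(qual) >= min_qual:
            counts[chrom] += 1
    return counts

vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "chr1\t100\t.\tA\tG\t45.0\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t12.0\tPASS\t.\n"
    "chr2\t300\t.\tG\tA\t60.0\tPASS\t.\n"
)
print(dict(filter_vcf(vcf)))  # {'chr1': 1, 'chr2': 1}
```

Nearly every step here runs in C under the hood, which is why a compiled runtime has little room to pull ahead on this workload.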

Windows Notes

Windows process creation adds ~1s overhead per invocation, compressing speedup ratios for sub-second tasks. For accurate algorithmic comparison, refer to Linux results. CPU-bound tasks that exceed this floor (k-mer counting 3.0x, ENCODE overlap 2.9x, QC pipeline 2.5x) still show clear wins.

Code Conciseness

BioLang scripts average 50-70% fewer lines of code than equivalent Python for the same analysis task. This comes from pipe-first syntax, built-in bio types, and higher-order functions on streams.
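For context on what the line counts are measured against, here is an illustrative plain-Python baseline for one of the simpler tasks (streaming FASTA records and reporting per-record GC%). This is a sketch, not one of the actual scripts in benchmarks/:

```python
# Stream FASTA records and print per-sequence GC% -- the kind of
# hand-rolled boilerplate the line-count comparison refers to.
import io

def parse_fasta(handle):
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

fasta = io.StringIO(">seq1\nACGT\nGGCC\n>seq2\nAATT\n")
for name, seq in parse_fasta(fasta):
    gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    print(f"{name}\t{gc:.1f}")
```

A pipe-first language with a built-in FASTA reader collapses most of this parsing boilerplate into a single stream expression, which is where the line-count savings come from.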

Correctness Validation

Performance without correctness is meaningless. BioLang includes two correctness validation suites — synthetic and real-world — that compare outputs against Python (BioPython) and R (Bioconductor) as independent gold standards.

Synthetic Data Validation

Uses generated test data with controlled inputs for deterministic comparison:

| Task | What it checks | Tolerance | R |
|------|----------------|-----------|---|
| gc_content | GC% per sequence from FASTA | float ±1e-6 | yes |
| kmer_count | Canonical 5-mer counts from DNA | exact integer | |
| vcf_filter | Filter VCF by QUAL>=30, count per chrom | exact integer | yes |
| reverse_complement | Reverse complement of DNA sequences | exact string | yes |
| translate | DNA→protein translation | exact string | yes |
| csv_groupby | Group-by aggregation (count, mean) | float ±1e-6 | yes |
| gff_features | Count features by type from GFF | exact integer | yes |
| sequence_stats | N50, total length, GC from FASTA | float ±1e-6 | yes |
| bed_intervals | BED parse, span, merge overlapping | exact integer | yes |
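Two of these reference computations are simple enough to sketch directly. The Python versions below are illustrative (not the suite's actual scripts, and IUPAC ambiguity codes are ignored), but they show the exact quantities being compared:

```python
# Illustrative Python references for the gc_content and
# reverse_complement tasks.

def gc_content(seq):
    """GC% of a sequence, as validated to ±1e-6."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0

_COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement, validated by exact string match."""
    return seq.translate(_COMP)[::-1]

print(round(gc_content("ACGTGC"), 6))  # 66.666667
print(reverse_complement("GATTACA"))   # TGTAATC
```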

Real-World Data Validation

Uses actual biological data from NCBI and ClinVar to test edge cases that synthetic data misses — non-standard bases, multi-allelic variants, overlapping bacterial genes, and variable naming conventions:

| Task | Real Data Source | Tolerance | R |
|------|------------------|-----------|---|
| gc_content | S. cerevisiae genome (16 chromosomes) | float ±1e-6 | yes |
| kmer_count | E. coli K-12 genome (50 KB) | exact integer | |
| vcf_filter | ClinVar VCF (5,000 variants, pathogenic filter) | exact integer | yes |
| reverse_complement | S. cerevisiae (5 chroms, 200bp each) | exact string | yes |
| translate | S. cerevisiae (3 chroms, 99bp each) | exact string | yes |
| csv_groupby | ClinVar variants CSV (group by significance) | float ±1e-6 | yes |
| gff_features | E. coli K-12 GFF3 annotation | exact integer | yes |
| sequence_stats | S. cerevisiae genome | float ±1e-6 | yes |
| bed_intervals | E. coli gene BED (derived from GFF) | exact integer | yes |

Real-world data is downloaded automatically via python download_real_data.py (~25 MB total from NCBI FTP).

How It Works

Each task has three implementations — BioLang, Python, and R — that compute the same result and output JSON to stdout. A recursive comparator checks:

  • Floats: ±1e-6 tolerance
  • Integers: exact match
  • Strings: exact match
  • Dicts/lists: recursive key-by-key comparison
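A comparator with these rules can be sketched in a few lines of Python (an illustrative reimplementation, not the suite's own code):

```python
# Recursive comparison of two parsed-JSON values: floats within a
# tolerance, everything else exact, containers compared element-wise.
def approx_equal(a, b, tol=1e-6):
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b                      # check bools before numbers: bool subclasses int
    if isinstance(a, float) or isinstance(b, float):
        return abs(a - b) <= tol           # floats: ±tol
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            approx_equal(a[k], b[k], tol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            approx_equal(x, y, tol) for x, y in zip(a, b))
    return a == b                          # ints, strings: exact match

print(approx_equal({"gc": 41.2000004, "n": 3}, {"gc": 41.2, "n": 3}))  # True
```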

Running Validation

```bash
# Synthetic data validation
cd benchmarks/correctness
./validate.sh [bl_binary] [python_binary] [rscript_binary]

# Real-world data validation
python download_real_data.py
./validate_real.sh [bl_binary] [python_binary] [rscript_binary]
```

```powershell
# Windows
.\validate.ps1 [-BL bl] [-PY python] [-RS Rscript]
.\validate_real.ps1 [-BL bl] [-PY python] [-RS Rscript]
```

R tests are skipped automatically if R/Bioconductor is not installed.

Reproducing Benchmarks

```bash
# Generate synthetic test data
python benchmarks/generate_data.py

# Run all benchmarks (Linux)
cd benchmarks && ./run_all.sh

# Run correctness validation (synthetic)
cd benchmarks/correctness && ./validate.sh

# Run correctness validation (real-world)
python download_real_data.py && ./validate_real.sh
```

Results are saved to benchmarks/results/latest/{linux,windows}/ with per-category breakdown:

  • language/ — sequence I/O, k-mers, protein, intervals, variants, file I/O, data wrangling
  • pipelines/ — QC pipeline, variant pipeline, annotation, multi-sample, RNA-seq

Methodology

  • Timing: Best of 3 wall-clock runs
  • Data: Mix of synthetic (generated) and real-world (NCBI, ClinVar, ENCODE, Ensembl)
  • K-mers: BioLang uses canonical (strand-agnostic) 21-mers; Python uses forward-only — BioLang does strictly more work
  • Fair comparison: Same input files, same output format, same machine, cold cache between runs
  • Correctness: Two independent validation suites (synthetic + real-world) ensure identical biological answers
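To make the canonical-k-mer point above concrete, here is a Python sketch of canonical k-mer extraction (each window is replaced by the lexicographic minimum of itself and its reverse complement), assuming uppercase ACGT input; this is illustrative, not BioLang's implementation:

```python
# Canonical (strand-agnostic) k-mer counting: a k-mer and its reverse
# complement on the opposite strand collapse to a single key.
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(_COMP)[::-1]

def canonical_kmers(seq, k):
    """Yield min(kmer, revcomp(kmer)) for every window of length k."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, revcomp(kmer))

print(Counter(canonical_kmers("GATTACA", 3)))
```

Canonicalization roughly halves the key space but adds a reverse-complement and a comparison per window, which is the "strictly more work" noted in the methodology.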