30 Benchmarks · 2 Platforms

BioLang Performance

30 bioinformatics tasks benchmarked against Python (BioPython) and R (Bioconductor) on synthetic and real-world data from NCBI, UniProt, ClinVar, ENCODE, and Ensembl.

Best of 3 runs · Wall-clock time · v0.3.0

Headline results:

  • 7.1x — ENCODE Overlap: 0.363s vs 2.574s in Python
  • 7.0x — Protein K-mers: 0.191s vs 1.331s in Python
  • 6.7x — FASTA Parse: 0.138s vs 0.926s in Python
  • 3.2x — K-mer Counting: 6.551s vs 21.01s in Python

Full Results

All 30 tasks across 6 categories.

Linux (WSL2) — Intel i9-12900K, 16 GB RAM — BioLang 0.3.0, Python 3.12.3, R 4.3.3

Task                         | BioLang | Python | R      | Speedup vs Py
-----------------------------|---------|--------|--------|--------------
Sequence I/O
FASTA Small (30 KB)          | 0.138s  | 0.926s | 1.243s | 6.7x
FASTA Medium (4.6 MB)        | 0.170s  | 0.991s | 1.371s | 5.8x
FASTA Statistics             | 0.995s  | 1.590s | 1.764s | 1.6x
FASTA Large (51 MB)          | 0.821s  | 1.649s | 2.103s | 2.0x
FASTA gzipped (1.3 MB)       | 0.141s  | 0.930s | 1.327s | 6.6x
FASTQ QC                     | 1.960s  | 3.551s | 4.404s | 1.8x
Real-world Genomes
E. coli Genome               | 0.176s  | 1.081s | 1.354s | 6.1x
Human Chr22                  | 1.126s  | 1.673s | 2.092s | 1.5x
GC Content (51 MB)           | 0.830s  | 2.771s | 2.358s | 3.3x
K-mer Analysis
K-mer Counting               | 6.551s  | 21.01s | n/a    | 3.2x
Chr22 21-mers                | 10.72s  | 28.73s | n/a    | 2.7x
Protein K-mers               | 0.191s  | 1.331s | 1.298s | 7.0x
Intervals & Overlaps
ENCODE Peak Overlap          | 0.363s  | 2.574s | n/a    | 7.1x
Write Filtered FASTA         | 0.821s  | 1.679s | 4.924s | 2.0x
Variant & Data
VCF Filtering                | 0.349s  | 0.166s | 6.312s | Py 2.1x faster
ClinVar Variants             | 0.661s  | 0.265s | 1.208s | Py 2.5x faster
CSV Join + Group-by          | 0.281s  | 0.156s | 0.312s | Py 1.8x faster
GFF3 Ensembl chr22           | 0.453s  | 0.171s | n/a    | Py 2.6x faster
BED Overlap (synthetic)      | 0.160s  | 0.067s | 1.279s | Py 2.4x faster
Pipelines
FASTQ QC Pipeline            | 2.349s  | 5.059s | n/a    | 2.2x
Variant Analysis Pipeline    | 0.268s  | 0.177s | n/a    | Py 1.5x faster
ClinVar Variant Pipeline     | 0.591s  | 0.261s | n/a    | Py 2.3x faster
Multi-Sample Aggregation     | 0.245s  | 0.090s | n/a    | Py 2.7x faster
RNA-seq DE Analysis          | 0.114s  | 0.054s | n/a    | Py 2.1x faster
Variant Annotation           | 0.309s  | 0.122s | n/a    | Py 2.5x faster
ClinVar + Ensembl Annotation | 0.409s  | 0.183s | n/a    | Py 2.2x faster

Where Python wins: VCF/CSV/GFF text parsing and pipeline orchestration, where Python's re and csv modules and built-in dict are highly optimized C implementations. BioLang excels at I/O-heavy and compute-heavy tasks (FASTA/FASTQ parsing, k-mer counting, interval overlaps).
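For context, the FASTA-parsing tasks reduce to splitting records on ">" headers and concatenating sequence lines. The benchmarked Python baseline uses BioPython's SeqIO; the minimal pure-Python parser below is an illustrative sketch of the workload only, not the benchmark code:

```python
def parse_fasta(text):
    """Minimal FASTA parser: yields (header, sequence) pairs.
    Illustrative only; the actual Python baseline uses BioPython's SeqIO."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

records = dict(parse_fasta(">seq1\nACGT\nACGT\n>seq2\nTTTT\n"))
```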

K-mer counting uses canonical (strand-agnostic) 21-mers in BioLang vs forward-only in Python — BioLang does strictly more work.
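The canonical-k-mer convention can be sketched in a few lines of Python: each k-mer is reduced to the lexicographically smaller of itself and its reverse complement, so both strands map to one key. This sketch is illustrative only, not BioLang's or the benchmark's actual implementation:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k=21):
    """Yield the canonical (strand-agnostic) form of each k-mer:
    the lexicographically smaller of the k-mer and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, rc)

counts = Counter(canonical_kmers("ACGTACGTAC", k=4))
```

Forward-only counting (the Python baseline's behavior) would skip the reverse-complement step, which is why the canonical variant does strictly more work per k-mer.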

Methodology

Hardware

  • CPU: Intel Core i9-12900K (16C/24T)
  • Linux: WSL2, 16 GB RAM
  • Windows: Windows 11 Pro, 32 GB RAM
  • Storage: NVMe SSD

Measurement

  • Runs: Best of 3, wall-clock time
  • BioLang: v0.3.0 (tree-walking interpreter)
  • Python: 3.12.3 (Linux) / 3.12.4 (Windows)
  • R: 4.3.3 (Linux only)
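The best-of-3 wall-clock measurement can be sketched as follows. This is an illustrative harness under stated assumptions, not the suite's actual run_all.sh logic, and note that timing a subprocess this way includes interpreter startup:

```python
import subprocess
import sys
import time

def best_of_3(cmd):
    """Run a command three times and return the fastest wall-clock time,
    mirroring the best-of-3 methodology described above."""
    times = []
    for _ in range(3):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return min(times)

# Demo with a trivial command; a real run would time a benchmark script.
fastest = best_of_3([sys.executable, "-c", "pass"])
```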

Data Sources

  • Synthetic: Generated FASTA/FASTQ/VCF/BED/CSV
  • Real-world: NCBI (E. coli, Human Chr22), UniProt, ClinVar, ENCODE, Ensembl GFF3
  • Sizes: 30 KB to 51 MB per file

Notes

  • K-mers: BioLang uses canonical (strand-agnostic) 21-mers; Python uses forward-only. BioLang does strictly more work.
  • Python wins: VCF/CSV/GFF parsing and pipeline orchestration lean on Python's highly optimized C implementations (the re and csv modules, built-in dict).
  • Windows: ~1s process creation overhead compresses speedup ratios for sub-second tasks.

Correctness Validation

Two validation suites — synthetic and real-world — cross-validate 9 tasks against Python (BioPython) and R (Bioconductor) to ensure BioLang produces identical or numerically equivalent results.

Generated test data — controlled inputs for deterministic output comparison

Task               | Validated Against | Tolerance
-------------------|-------------------|-------------
GC Content         | Python, R         | Float ±1e-6
K-mer Counting     | Python            | Exact integer
VCF Filtering      | Python, R         | Exact integer
Reverse Complement | Python, R         | Exact string
Translate          | Python, R         | Exact string
CSV Group-by       | Python, R         | Float ±1e-6
GFF Features       | Python, R         | Exact integer
Sequence Stats     | Python, R         | Float ±1e-6
BED Intervals      | Python, R         | Exact integer

How it works

Each validated task outputs JSON from all three languages. A recursive diff compares the outputs field-by-field, applying the specified tolerance for floating-point values and exact comparison for integers and strings.
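Such a recursive diff can be sketched in a few lines of Python. The function below is illustrative, not the repository's actual validation script; it compares floats within an absolute tolerance and everything else exactly:

```python
import json
import math

def diff(a, b, float_tol=1e-6):
    """Recursively compare two parsed-JSON values field by field:
    floats within a tolerance, everything else exactly."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, abs_tol=float_tol)
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(diff(a[k], b[k], float_tol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(diff(x, y, float_tol) for x, y in zip(a, b))
    return a == b

biolang_out = json.loads('{"gc": 0.5123456, "n": 42}')
python_out  = json.loads('{"gc": 0.5123457, "n": 42}')
assert diff(biolang_out, python_out)  # 1e-7 difference is within tolerance
```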

Validation scripts and real-world data download tool live in the benchmarks/correctness/ directory of the repository.

Reproduce the Benchmarks

Clone the repo and run the benchmark suite yourself.

# Clone and enter the benchmarks directory
$ git clone https://github.com/oriclabs/biolang.git
$ cd biolang/benchmarks
# Install BioLang (if not already)
$ cargo install biolang
# Generate test data
$ python generate_data.py
# Run all benchmarks (Linux)
$ bash run_all.sh
# Run correctness validation (synthetic data)
$ bash correctness/validate.sh
# Run real-world validation (E. coli, yeast, ClinVar)
$ python correctness/download_real_data.py
$ bash correctness/validate_real.sh

Raw results, scripts, and per-task analysis are available at github.com/oriclabs/biolang/tree/main/benchmarks

Want even faster?

These benchmarks use BioLang's tree-walking interpreter. The bytecode compiler and Cranelift JIT are under active development.