File Formats

BioLang provides built-in readers and writers for the major bioinformatics file formats. Most readers return reusable tables by default; streaming variants process input lazily in constant memory, so files that are gigabytes or terabytes in size never need to fit in RAM. Writers accept streams, collections, or tables and handle buffering automatically.

FASTA

FASTA files contain one or more sequences, each preceded by a header line starting with >. BioLang provides two ways to read FASTA:

# Table mode (eager, reusable) — best for small/medium files
let records = read_fasta("data/sequences.fasta")
records |> filter(|r| r.length > 100)    # works
records |> map(|r| r.id)                 # still works — it's a Table

# Stream mode (lazy, one-time) — best for large files
let stream = fasta("data/sequences.fasta")
stream |> map(|r| r.id) |> collect()     # works — consumes the stream

# Each record has: id, description, seq, length
read_fasta("data/sequences.fasta")
  |> map(|r| print(r.id, seq_len(r.seq)))

# Filter and write to a new file
read_fasta("data/sequences.fasta")
  |> filter(|r| seq_len(r.seq) > 10000)
  |> write_fasta("long_contigs.fa")

# Read gzipped FASTA (compression auto-detected by extension)
read_fasta("data/sequences.fasta.gz")
  |> map(|r| gc_content(r.seq))
  |> collect()

FASTQ

FASTQ extends FASTA with per-base quality scores. Records have id, seq, quality, and length fields. Quality scores are Phred+33 encoded:

# Table mode (eager, reusable)
let reads = read_fastq("data/reads.fastq")
reads |> filter(|r| r.length > 50)       # reusable

# Quality filtering
read_fastq("data/reads.fastq")
  |> filter(|r| mean_phred(r.quality) >= 30)
  |> write_fastq("filtered.fq.gz")
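
Because the encoding is Phred+33, each quality character's score is its ASCII value minus 33: '!' (ASCII 33) encodes Q0 and 'I' (ASCII 73) encodes Q40. A small sketch (the inline record literal here is illustrative):

# 'I' is ASCII 73, so 73 − 33 = Q40 for every base
let r = { id: "read1", seq: "ACGT", quality: "IIII" }
print(mean_phred(r.quality))    # 40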

# Compute per-read statistics
read_fastq("data/reads.fastq")
  |> map(|r| {
    id: r.id,
    gc: gc_content(r.seq),
    mean_qual: mean_phred(r.quality)
  })
  |> to_table()
  |> write_csv("read_stats.csv")

BED

BED files describe genomic regions. BioLang supports BED3 through BED12. Each record maps to a table row with named columns:

# Read a BED file into a table
let regions = read_bed("data/regions.bed")

# Columns: chrom, start, end, name, score, strand (BED6)
regions
  |> filter(|r| r.score > 500)
  |> mutate("width", |r| r.end - r.start)
  |> print()

# Convert BED records to intervals for set operations
let intervals = read_bed("data/regions.bed")
  |> map(|r| interval(r.chrom, r.start, r.end, r.strand))
  |> collect()

# Write intervals back to BED
intervals |> write_bed("output.bed")
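
BED coordinates are half-open, so end − start is the region width and summing widths gives the total bases covered. A sketch, assuming BioLang provides a sum() reducer (not shown elsewhere in this section):

# Total bases covered by all regions (sum() assumed)
read_bed("data/regions.bed")
  |> map(|r| r.end - r.start)
  |> sum()
  |> print()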

GFF / GTF

GFF3 and GTF are tab-delimited annotation formats. BioLang parses the attributes column into a structured map:

# Read a GFF3 file
let annotations = read_gff("data/annotations.gff")

# Fields: seqid, source, type, start, end, score, strand, phase, attributes
annotations
  |> filter(|r| r.type == "gene")
  |> map(|r| {
    gene_id: r.attributes["ID"],
    name: r.attributes["Name"],
    chrom: r.seqid,
    start: r.start,
    end: r.end
  })
  |> to_table()
  |> print()

# Read GTF — the same reader parses GTF attribute syntax
read_gff("data/annotations.gtf")
  |> filter(|r| r.type == "exon")
  |> map(|r| r.attributes["gene_name"])
  |> unique()
  |> len()
  |> print()

VCF

VCF files contain variant calls. BioLang parses the INFO and FORMAT fields into structured maps, and genotype data into per-sample accessors:

# Read a VCF file
let variants = read_vcf("data/variants.vcf")

# Fields: chrom, pos, id, ref, alt, qual, filter, info (plus per-sample genotypes)
variants
  |> filter(|v| v.qual > 30 && v.filter == "PASS")
  |> print()
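
The parsed info field is a map keyed by INFO tag, mirroring the attributes map on GFF records. A sketch, assuming the file defines the standard DP (read depth) tag:

# Keep variants with sufficient depth (DP assumed present in this file's INFO)
read_vcf("data/variants.vcf")
  |> filter(|v| v.info["DP"] >= 20)
  |> print()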

# Access variant fields
read_vcf("data/variants.vcf")
  |> map(|v| {
    chrom: v.chrom,
    pos: v.pos,
    ref_allele: v.ref,
    alt_allele: v.alt,
    qual: v.qual
  })
  |> to_table()
  |> write_csv("variant_summary.csv")

# Write filtered VCF
read_vcf("data/variants.vcf")
  |> filter(|v| v.chrom == "chr1")
  |> write_vcf("chr1_variants.vcf")

SAM / BAM

SAM and BAM files contain aligned reads. BAM files are the binary compressed form. BioLang reads both transparently:

# Read a BAM file (returns a table of alignments)
let alignments = read_bam("sorted.bam")

# Fields: qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
alignments
  |> filter(|a| a.mapq >= 30)
  |> map(|a| { name: a.qname, chrom: a.rname, pos: a.pos })
  |> to_table()
  |> print()

# Read SAM (text format)
let sam_records = read_sam("alignments.sam")
print(nrow(sam_records), "alignments")
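
The flag field is the standard SAM bit field: bit 4 marks an unmapped read, bit 16 a reverse-strand alignment. A sketch, assuming BioLang supports a bitwise & operator:

# Drop unmapped reads (SAM flag bit 4; bitwise & assumed)
read_bam("sorted.bam")
  |> filter(|a| (a.flag & 4) == 0)
  |> print()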

MAF (Multiple Alignment Format)

MAF files store multiple sequence alignments, commonly used in comparative genomics. Each alignment block contains sequences from different species aligned to each other:

# Read a MAF file
let records = read_maf("multiz30way.maf")

# MAF records are returned as a table
records |> print()

Format-Specific Readers

Each format has its own reader function. Use the appropriate one for your file:

# Each format has a dedicated reader
let variants = read_vcf("data/variants.vcf")
let reads = read_fastq("data/reads.fastq")
let alignments = read_bam("data/aligned.bam")
let regions = read_bed("data/regions.bed")
let annotations = read_gff("data/annotations.gff")
let sequences = read_fasta("data/sequences.fasta")

Format Mismatch Detection

All read functions check the file's content before parsing and return a clear error if the format doesn't match. This catches common mistakes like passing a FASTA file to read_fastq():

# Wrong function for the file format
read_fastq("data/sequences.fasta")
# TypeError: read_fastq(): file appears to be FASTA format
#   (starts with >), not FASTQ. Use read_fasta() instead

# Works both ways
read_fasta("data/reads.fastq")
# TypeError: read_fasta(): file appears to be FASTQ format
#   (starts with @), not FASTA. Use read_fastq() instead

# Also detects binary vs text mismatches
read_sam("aligned.bam")
# TypeError: read_sam(): file appears to be BAM (binary) format,
#   not SAM (text). Use read_bam() instead

Streams vs Tables

fasta() and fastq() return lazy streams that can only be iterated once. For reusable data, use read_fasta() / read_fastq() which return tables:

# Stream — consumed once (good for large files)
let records = fasta("data/sequences.fasta")
records |> map(|r| r.id) |> collect()   # ["seq1", "seq2"]

# Table — reusable (good for small/medium files)
let table = read_fasta("data/sequences.fasta")
table |> map(|r| r.id)       # works
table |> map(|r| r.seq)      # still works — it's a Table

# Other formats (bed, vcf, gff, sam, bam) default to tables.
# Use {stream: true} for lazy mode:
read_vcf("data/variants.vcf", {stream: true})

Writing Files

Every reader has a corresponding writer. Writers accept streams, collections, or tables and handle format-specific details like compression:

# Write functions match read functions
write_fasta(records, "output.fa")
write_fastq(records, "output.fq.gz")    # .gz = auto-compress
write_bed(intervals, "regions.bed")
write_vcf(variants, "filtered.vcf.gz")

# Pipe-friendly syntax (stream |> write)
read_fasta("data/sequences.fasta")
  |> filter(|r| seq_len(r.seq) > 500)
  |> write_fasta("filtered.fa")

Variant Type

The Variant type provides structured access to VCF-style variant data.

# Construct a variant: variant(chrom, pos, ref, alt)
let v = variant("chr17", 7674220, "G", "A")

# Access variant fields
println(v.chrom, ":", v.pos, v.ref, ">", v.alt)

# VCF reader returns table rows (not Variant values)
let variants = read_vcf("data/variants.vcf")
for v in variants {
  if v.filter == "PASS" && v.qual > 30 {
    println(v.chrom, v.pos, v.ref, ">", v.alt)
  }
}

# Type checking
println(type(v))            # "Variant"
println(is_snp(v))          # true
println(variant_type(v))    # "SNP"
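
is_snp() distinguishes single-base substitutions from other variant classes; for instance, an insertion (alt allele longer than ref) is not a SNP:

# An insertion: ref "A" → alt "AT"
let ins = variant("chr1", 100, "A", "AT")
println(is_snp(ins))        # false — alt is longer than ref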