Tables & Data Frames
Tables are a first-class data type in BioLang, designed for the tabular data that
dominates bioinformatics workflows. BioLang provides dplyr-inspired verbs
(select, filter, mutate, group_by, summarize, sort_by, and joins such as
inner_join and left_join) that compose naturally with pipes.
Creating Tables
From Records
from_records() converts a list of records into a table:
# Create a table from a list of records
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42},
  {gene: "MYC", log2fc: 1.9, pvalue: 0.008}
])
println(data)
From CSV
# Read a CSV file into a table
let data = read_csv("data/expression.csv")
println(data)
Select
Choose which columns to keep:
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001, chrom: "chr17"},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005, chrom: "chr17"},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001, chrom: "chr7"}
])
# Keep only specific columns
let subset = data |> select("gene", "pvalue")
println(subset)
Filter
Keep rows matching a predicate. The closure receives each row as a record:
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42}
])
# Filter for significant genes
let significant = data |> filter(|r| r.pvalue < 0.01)
println(significant)
# Multiple conditions
let up_and_sig = data |> filter(|r| r.log2fc > 1.0 && r.pvalue < 0.01)
println(up_and_sig)
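As a rough mental model, written in plain Python rather than BioLang, filter applies the predicate to every row record and keeps the rows for which it returns true. The list-of-dicts table and the `filter_rows` helper are assumptions made for illustration; this is not BioLang's implementation.

```python
# Hypothetical model: a table is a list of row dicts.
rows = [
    {"gene": "BRCA1", "log2fc": 2.5, "pvalue": 0.001},
    {"gene": "TP53", "log2fc": -1.8, "pvalue": 0.005},
    {"gene": "EGFR", "log2fc": 3.1, "pvalue": 0.0001},
    {"gene": "KRAS", "log2fc": 0.4, "pvalue": 0.42},
]


def filter_rows(table, predicate):
    """Keep the rows for which predicate(row) is true, preserving order."""
    return [row for row in table if predicate(row)]


significant = filter_rows(rows, lambda r: r["pvalue"] < 0.01)
up_and_sig = filter_rows(rows, lambda r: r["log2fc"] > 1.0 and r["pvalue"] < 0.01)

print([r["gene"] for r in significant])  # ['BRCA1', 'TP53', 'EGFR']
print([r["gene"] for r in up_and_sig])   # ['BRCA1', 'EGFR']
```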
Mutate
Add or transform columns. Takes the table, a column name string, and a closure that receives each row as a record:
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001}
])
# Add two new columns: a boolean significance flag and a direction label
let enriched = data
  |> mutate("significant", |r| r.pvalue < 0.01)
  |> mutate("direction", |r| if r.log2fc > 0 { "up" } else { "down" })
println(enriched)
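The same idea can be modeled in plain Python: the closure is evaluated once per row and its result is stored under the given column name. The sketch assumes that mutate returns a new table and leaves the input untouched, which the `let enriched = data |> ...` form suggests but the text does not state; the `mutate` helper below is hypothetical.

```python
# Hypothetical model: a table is a list of row dicts.
rows = [
    {"gene": "BRCA1", "log2fc": 2.5, "pvalue": 0.001},
    {"gene": "TP53", "log2fc": -1.8, "pvalue": 0.005},
    {"gene": "EGFR", "log2fc": 3.1, "pvalue": 0.0001},
]


def mutate(table, name, fn):
    """Return new rows with column `name` set to fn(row); input rows are not modified."""
    return [{**row, name: fn(row)} for row in table]


enriched = mutate(rows, "significant", lambda r: r["pvalue"] < 0.01)
enriched = mutate(enriched, "direction", lambda r: "up" if r["log2fc"] > 0 else "down")

print(enriched[1])
# {'gene': 'TP53', 'log2fc': -1.8, 'pvalue': 0.005, 'significant': True, 'direction': 'down'}
```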
Sorting
Use sort_by to sort rows with a comparison closure, or
arrange to sort by column names:
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42}
])
# sort_by with a comparison closure
let by_fc = data |> sort_by(|a, b| b.log2fc - a.log2fc)
println(by_fc)
# arrange by column name (string args)
let by_pvalue = data |> arrange("pvalue")
println(by_pvalue)
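The comparison closure is easiest to read with the usual comparator convention in mind: a negative result places a before b, a positive result places b before a. Under that assumption, which the text does not spell out for sort_by, `b.log2fc - a.log2fc` sorts in descending order. The plain-Python sketch below models both calls; it also assumes that arrange sorts ascending.

```python
from functools import cmp_to_key

# Hypothetical model: a table is a list of row dicts.
rows = [
    {"gene": "BRCA1", "log2fc": 2.5, "pvalue": 0.001},
    {"gene": "TP53", "log2fc": -1.8, "pvalue": 0.005},
    {"gene": "EGFR", "log2fc": 3.1, "pvalue": 0.0001},
    {"gene": "KRAS", "log2fc": 0.4, "pvalue": 0.42},
]

# Comparator in the style of the sort_by closure: b - a puts larger values first.
by_fc = sorted(rows, key=cmp_to_key(lambda a, b: b["log2fc"] - a["log2fc"]))

# Sorting by a column name, modeling arrange("pvalue"); ascending order is assumed.
by_pvalue = sorted(rows, key=lambda r: r["pvalue"])

print([r["gene"] for r in by_fc])      # ['EGFR', 'BRCA1', 'KRAS', 'TP53']
print([r["gene"] for r in by_pvalue])  # ['EGFR', 'BRCA1', 'TP53', 'KRAS']
```

Swapping the operands to `a - b` in the comparator would give ascending order under the same convention.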
Group By and Summarize
group_by splits a table into a map of sub-tables keyed by column value.
summarize then reduces each group to a single record:
let data = from_records([
  {gene: "BRCA1", chrom: "chr17", score: 0.95},
  {gene: "TP53", chrom: "chr17", score: 0.88},
  {gene: "EGFR", chrom: "chr7", score: 0.91},
  {gene: "BRAF", chrom: "chr7", score: 0.72}
])
let summary = data
  |> group_by("chrom")
  |> summarize(|key, rows| {
    chrom: key,
    count: len(col(rows, "gene")),
    avg_score: mean(col(rows, "score"))
  })
println(summary)
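The two-step shape described above (split into keyed groups, then reduce each group to one record) can be modeled in plain Python as follows. The `group_by` and `summarize` helpers are hypothetical stand-ins, the dict-of-lists grouping is an assumption about representation, and the rounding is only there to keep the printed output tidy.

```python
from statistics import mean

# Hypothetical model: a table is a list of row dicts.
rows = [
    {"gene": "BRCA1", "chrom": "chr17", "score": 0.95},
    {"gene": "TP53", "chrom": "chr17", "score": 0.88},
    {"gene": "EGFR", "chrom": "chr7", "score": 0.91},
    {"gene": "BRAF", "chrom": "chr7", "score": 0.72},
]


def group_by(table, column):
    """Split rows into a dict of sub-lists keyed by the value in `column`."""
    groups = {}
    for row in table:
        groups.setdefault(row[column], []).append(row)
    return groups


def summarize(groups, fn):
    """Reduce each (key, rows) group to a single record."""
    return [fn(key, group) for key, group in groups.items()]


summary = summarize(
    group_by(rows, "chrom"),
    lambda key, group: {
        "chrom": key,
        "count": len(group),
        "avg_score": round(mean(r["score"] for r in group), 3),
    },
)
print(summary)
# [{'chrom': 'chr17', 'count': 2, 'avg_score': 0.915}, {'chrom': 'chr7', 'count': 2, 'avg_score': 0.815}]
```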
Count By
count_by counts the number of rows for each distinct value
in a column:
let data = from_records([
  {gene: "BRCA1", impact: "HIGH"},
  {gene: "TP53", impact: "HIGH"},
  {gene: "EGFR", impact: "MODERATE"},
  {gene: "KRAS", impact: "HIGH"},
  {gene: "BRAF", impact: "MODERATE"}
])
let counts = data |> count_by("impact")
println(counts)
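For the sample data above, the counting itself can be checked with a short plain-Python tally. This models only the arithmetic; whether count_by returns a table or a map, and in what order, is not stated in the text and is not assumed here.

```python
from collections import Counter

# Hypothetical model: a table is a list of row dicts.
rows = [
    {"gene": "BRCA1", "impact": "HIGH"},
    {"gene": "TP53", "impact": "HIGH"},
    {"gene": "EGFR", "impact": "MODERATE"},
    {"gene": "KRAS", "impact": "HIGH"},
    {"gene": "BRAF", "impact": "MODERATE"},
]

# Tally rows per distinct value of the "impact" column.
counts = Counter(row["impact"] for row in rows)
print(dict(counts))  # {'HIGH': 3, 'MODERATE': 2}
```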
Column Access
Use col(table, "name") to extract a single column as a list:
let data = from_records([
  {gene: "BRCA1", score: 0.95},
  {gene: "TP53", score: 0.88},
  {gene: "EGFR", score: 0.91}
])
let scores = col(data, "score")
println(scores) # [0.95, 0.88, 0.91]
println(mean(scores)) # 0.9133...
Joins
Combine two tables by a shared column. inner_join keeps only
matching rows; left_join keeps all rows from the left table:
let expression = from_records([
  {gene: "BRCA1", log2fc: 2.5},
  {gene: "TP53", log2fc: -1.8},
  {gene: "EGFR", log2fc: 3.1}
])
let annotations = from_records([
  {gene: "BRCA1", pathway: "DNA repair"},
  {gene: "TP53", pathway: "Cell cycle"},
  {gene: "MYC", pathway: "Proliferation"}
])
# Inner join -- only rows with matching gene in both tables
let joined = inner_join(expression, annotations, "gene")
println(joined)
# Left join -- keep all expression rows
let enriched = left_join(expression, annotations, "gene")
println(enriched)
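The difference between the two joins shows up on the rows that lack a partner: EGFR has no annotation and MYC has no expression value. The plain-Python sketch below models both joins on the same data. The helper functions are hypothetical, and filling unmatched right-hand columns with `None` is an assumption; the text does not say how BioLang represents missing values.

```python
# Hypothetical model: a table is a list of row dicts.
expression = [
    {"gene": "BRCA1", "log2fc": 2.5},
    {"gene": "TP53", "log2fc": -1.8},
    {"gene": "EGFR", "log2fc": 3.1},
]
annotations = [
    {"gene": "BRCA1", "pathway": "DNA repair"},
    {"gene": "TP53", "pathway": "Cell cycle"},
    {"gene": "MYC", "pathway": "Proliferation"},
]


def index_by(table, key):
    """Map each key value to the list of rows carrying it."""
    index = {}
    for row in table:
        index.setdefault(row[key], []).append(row)
    return index


def inner_join(left, right, key):
    """Emit one merged row per matching (left, right) pair."""
    index = index_by(right, key)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def left_join(left, right, key, missing=None):
    """Like inner_join, but keep unmatched left rows, filling right columns with `missing`."""
    index = index_by(right, key)
    right_cols = []
    for row in right:
        for col in row:
            if col != key and col not in right_cols:
                right_cols.append(col)
    out = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{col: missing for col in right_cols}})
    return out


joined = inner_join(expression, annotations, "gene")
enriched = left_join(expression, annotations, "gene")

print([r["gene"] for r in joined])  # ['BRCA1', 'TP53']
print(enriched[2])                  # {'gene': 'EGFR', 'log2fc': 3.1, 'pathway': None}
```

In both results MYC is absent, because it appears only in the right-hand table.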
bio_join for Multi-Omics
bio_join auto-detects common biological key columns (gene, chrom,
transcript_id, etc.) so you don't need to specify the key:
let rnaseq = from_records([
  {gene: "BRCA1", fpkm: 12.5},
  {gene: "TP53", fpkm: 45.2}
])
let proteomics = from_records([
  {gene: "BRCA1", intensity: 8500.0},
  {gene: "TP53", intensity: 12300.0}
])
# Auto-detects "gene" as the join key
let multi_omics = bio_join(rnaseq, proteomics)
println(multi_omics)
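One plausible way to picture key auto-detection is to scan a list of well-known key names and pick the first one present in both tables. The plain-Python sketch below is speculative: the candidate list is taken from the names mentioned in the text, while the priority order, the error on no match, and the inner-join behavior are all assumptions rather than documented bio_join semantics.

```python
# Candidate key names from the text; the real list and its priority order are unknown.
CANDIDATE_KEYS = ["gene", "chrom", "transcript_id"]

rnaseq = [
    {"gene": "BRCA1", "fpkm": 12.5},
    {"gene": "TP53", "fpkm": 45.2},
]
proteomics = [
    {"gene": "BRCA1", "intensity": 8500.0},
    {"gene": "TP53", "intensity": 12300.0},
]


def columns(table):
    """Collect every column name that appears in any row."""
    cols = set()
    for row in table:
        cols.update(row)
    return cols


def detect_key(left, right, candidates=CANDIDATE_KEYS):
    """Return the first candidate name that is a column of both tables."""
    shared = columns(left) & columns(right)
    for name in candidates:
        if name in shared:
            return name
    raise ValueError("no shared key column found")


def bio_join_sketch(left, right):
    """Join on the detected key; assumes inner-join behavior and unique keys on the right."""
    key = detect_key(left, right)
    by_key = {row[key]: row for row in right}
    return [{**l, **by_key[l[key]]} for l in left if l[key] in by_key]


multi_omics = bio_join_sketch(rnaseq, proteomics)
print(detect_key(rnaseq, proteomics))  # gene
print(multi_omics[0])  # {'gene': 'BRCA1', 'fpkm': 12.5, 'intensity': 8500.0}
```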
Complete Pipeline Example
The real power of table verbs is in composing them into analysis pipelines:
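# Load the expression table from disk
let data = read_csv("data/expression.csv")
# Keep significant rows, sort with the comparator b.log2fc - a.log2fc, and print each row
data
  |> filter(|r| r.pvalue < 0.01)
  |> sort_by(|a, b| b.log2fc - a.log2fc)
  |> each(|r| println(f"{r.gene}: FC={r.log2fc}, p={r.pvalue}"))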
Writing Tables
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005}
])
# Write to CSV (table first, path second)
write_csv(data, "output/results.csv")
# Or with pipe syntax
data |> write_csv("output/results.csv")
# Pretty-print to console
println(data)