Tables & Data Frames

Tables are a first-class data type in BioLang, designed for the tabular data that dominates bioinformatics workflows. BioLang provides dplyr-inspired verbs (select, filter, mutate, group_by, summarize, sort_by, and the join functions inner_join and left_join) that compose naturally with pipes.

Creating Tables

From Records

from_records() converts a list of records into a table:

# Create a table from a list of records
let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42},
  {gene: "MYC", log2fc: 1.9, pvalue: 0.008}
])
println(data)

From CSV

# Read a CSV file into a table
let data = read_csv("data/expression.csv")
println(data)

Select

Choose which columns to keep:

let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001, chrom: "chr17"},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005, chrom: "chr17"},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001, chrom: "chr7"}
])

# Keep only specific columns
let subset = data |> select("gene", "pvalue")
println(subset)

Filter

Keep rows matching a predicate. The closure receives each row as a record:

let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42}
])

# Filter for significant genes
let significant = data |> filter(|r| r.pvalue < 0.01)
println(significant)

# Multiple conditions
let up_and_sig = data |> filter(|r| r.log2fc > 1.0 && r.pvalue < 0.01)
println(up_and_sig)

Mutate

Add or transform columns. mutate takes the table, a column name string, and a closure that receives each row as a record and returns the value for that column:

let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001}
])

# Add two new columns, one per mutate call
let enriched = data
  |> mutate("significant", |r| r.pvalue < 0.01)
  |> mutate("direction", |r| if r.log2fc > 0 { "up" } else { "down" })
println(enriched)

Sorting

Use sort_by to sort rows with a comparison closure, or arrange to sort by column names:

let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005},
  {gene: "EGFR", log2fc: 3.1, pvalue: 0.0001},
  {gene: "KRAS", log2fc: 0.4, pvalue: 0.42}
])

# sort_by with a comparison closure
let by_fc = data |> sort_by(|a, b| b.log2fc - a.log2fc)
println(by_fc)

# arrange by column name (string args)
let by_pvalue = data |> arrange("pvalue")
println(by_pvalue)
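
The difference between the two sorting styles can be modeled in plain Python. This is a sketch rather than BioLang's implementation: it assumes sort_by follows the common comparator convention (a negative result places a before b, so b.log2fc - a.log2fc gives descending order) and that arrange sorts ascending by default. Neither detail is stated in the text, and sort_by_model and arrange_model are hypothetical names.

```python
from functools import cmp_to_key

rows = [
    {"gene": "BRCA1", "log2fc": 2.5, "pvalue": 0.001},
    {"gene": "TP53", "log2fc": -1.8, "pvalue": 0.005},
    {"gene": "EGFR", "log2fc": 3.1, "pvalue": 0.0001},
    {"gene": "KRAS", "log2fc": 0.4, "pvalue": 0.42},
]


def sort_by_model(rows, compare):
    """Hypothetical stand-in for sort_by: order rows with a two-argument comparator."""
    def sign(a, b):
        # Reduce the comparator's numeric result to -1, 0, or 1.
        diff = compare(a, b)
        return (diff > 0) - (diff < 0)
    return sorted(rows, key=cmp_to_key(sign))


def arrange_model(rows, *columns):
    """Hypothetical stand-in for arrange: ascending sort on the named columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in columns))


# Same comparator shape as the BioLang example: b - a.
by_fc = sort_by_model(rows, lambda a, b: b["log2fc"] - a["log2fc"])
by_pvalue = arrange_model(rows, "pvalue")

print([r["gene"] for r in by_fc])
print([r["gene"] for r in by_pvalue])
```

Under these assumptions, the comparator form is the one to reach for when the order depends on a computed quantity, while the column-name form covers the plain ascending case.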

Group By and Summarize

group_by splits a table into a map of sub-tables keyed by column value. summarize then reduces each group to a single record:

let data = from_records([
  {gene: "BRCA1", chrom: "chr17", score: 0.95},
  {gene: "TP53", chrom: "chr17", score: 0.88},
  {gene: "EGFR", chrom: "chr7", score: 0.91},
  {gene: "BRAF", chrom: "chr7", score: 0.72}
])

let summary = data
  |> group_by("chrom")
  |> summarize(|key, rows| {
    chrom: key,
    count: len(col(rows, "gene")),
    avg_score: mean(col(rows, "score"))
  })
println(summary)
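
The two-step shape described above (split into keyed groups, then reduce each group to one record) can be sketched in plain Python. This only models the semantics; group_by_model and summarize_model are hypothetical names, and the assumption that groups keep first-seen order is mine, not something the text states.

```python
from statistics import mean

rows = [
    {"gene": "BRCA1", "chrom": "chr17", "score": 0.95},
    {"gene": "TP53", "chrom": "chr17", "score": 0.88},
    {"gene": "EGFR", "chrom": "chr7", "score": 0.91},
    {"gene": "BRAF", "chrom": "chr7", "score": 0.72},
]


def group_by_model(rows, column):
    """Split rows into a dict of row lists keyed by the column's value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    return groups


def summarize_model(groups, reducer):
    """Call reducer(key, rows) once per group and collect the resulting records."""
    return [reducer(key, group) for key, group in groups.items()]


groups = group_by_model(rows, "chrom")
summary = summarize_model(
    groups,
    lambda key, group: {
        "chrom": key,
        "count": len(group),
        "avg_score": mean(r["score"] for r in group),
    },
)

for record in summary:
    print(record)
```

The reducer receives the group key and the group's rows, mirroring the |key, rows| closure in the BioLang example, so every summary record is computed from exactly one group.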

Count By

count_by counts the number of rows for each distinct value in a column:

let data = from_records([
  {gene: "BRCA1", impact: "HIGH"},
  {gene: "TP53", impact: "HIGH"},
  {gene: "EGFR", impact: "MODERATE"},
  {gene: "KRAS", impact: "HIGH"},
  {gene: "BRAF", impact: "MODERATE"}
])

let counts = data |> count_by("impact")
println(counts)
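
For comparison, the same tally in plain Python is shown below. It only illustrates what is being counted; the shape BioLang returns from count_by (a table or a map) is not stated in the text, and count_by_model is a hypothetical name.

```python
from collections import Counter

rows = [
    {"gene": "BRCA1", "impact": "HIGH"},
    {"gene": "TP53", "impact": "HIGH"},
    {"gene": "EGFR", "impact": "MODERATE"},
    {"gene": "KRAS", "impact": "HIGH"},
    {"gene": "BRAF", "impact": "MODERATE"},
]


def count_by_model(rows, column):
    """Tally how many rows carry each distinct value of the column."""
    return Counter(row[column] for row in rows)


counts = count_by_model(rows, "impact")
print(dict(counts))
```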

Column Access

Use col(table, "name") to extract a single column as a list:

let data = from_records([
  {gene: "BRCA1", score: 0.95},
  {gene: "TP53", score: 0.88},
  {gene: "EGFR", score: 0.91}
])

let scores = col(data, "score")
println(scores)        # [0.95, 0.88, 0.91]
println(mean(scores))  # 0.9133...

Joins

Combine two tables by a shared column. inner_join keeps only matching rows; left_join keeps all rows from the left table:

let expression = from_records([
  {gene: "BRCA1", log2fc: 2.5},
  {gene: "TP53", log2fc: -1.8},
  {gene: "EGFR", log2fc: 3.1}
])

let annotations = from_records([
  {gene: "BRCA1", pathway: "DNA repair"},
  {gene: "TP53", pathway: "Cell cycle"},
  {gene: "MYC", pathway: "Proliferation"}
])

# Inner join -- only rows with matching gene in both tables
let joined = inner_join(expression, annotations, "gene")
println(joined)

# Left join -- keep all expression rows
let enriched = left_join(expression, annotations, "gene")
println(enriched)
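
To make the difference between the two joins concrete, here is a plain-Python model run on the same data. It is a sketch under assumptions: join_model is a hypothetical helper, and filling unmatched right-hand columns with None is my choice for illustration, since the text does not say how BioLang represents missing values.

```python
expression = [
    {"gene": "BRCA1", "log2fc": 2.5},
    {"gene": "TP53", "log2fc": -1.8},
    {"gene": "EGFR", "log2fc": 3.1},
]

annotations = [
    {"gene": "BRCA1", "pathway": "DNA repair"},
    {"gene": "TP53", "pathway": "Cell cycle"},
    {"gene": "MYC", "pathway": "Proliferation"},
]


def join_model(left, right, key, keep_unmatched_left=False):
    """Join two row lists on key; optionally keep left rows that have no match."""
    # Index the right table by key so each left row is matched in one lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    # Columns contributed by the right table, used to pad unmatched left rows.
    right_only = []
    for row in right:
        for name in row:
            if name != key and name not in right_only:
                right_only.append(name)

    joined = []
    for row in left:
        matches = index.get(row[key], [])
        for match in matches:
            joined.append({**row, **match})
        if not matches and keep_unmatched_left:
            # Assumption: missing values are shown as None.
            joined.append({**row, **{name: None for name in right_only}})
    return joined


inner_rows = join_model(expression, annotations, "gene")
left_rows = join_model(expression, annotations, "gene", keep_unmatched_left=True)

print([r["gene"] for r in inner_rows])
print(left_rows[-1])
```

In this model EGFR drops out of the inner join because it has no annotation, but survives the left join with an empty pathway. MYC appears in neither result because it exists only in the right table.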

bio_join for Multi-Omics

bio_join auto-detects common biological key columns (gene, chrom, transcript_id, etc.) so you don't need to specify the key:

let rnaseq = from_records([
  {gene: "BRCA1", fpkm: 12.5},
  {gene: "TP53", fpkm: 45.2}
])

let proteomics = from_records([
  {gene: "BRCA1", intensity: 8500.0},
  {gene: "TP53", intensity: 12300.0}
])

# Auto-detects "gene" as the join key
let multi_omics = bio_join(rnaseq, proteomics)
println(multi_omics)
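
One way key auto-detection could work is sketched below in plain Python. This is a guess at the idea, not BioLang's actual algorithm: the candidate list contains only the three names given in the text, the priority order is an assumption, and so is the use of inner-join semantics. detect_key and bio_join_model are hypothetical names.

```python
# Names taken from the text; their order here is an assumption.
CANDIDATE_KEYS = ["gene", "chrom", "transcript_id"]

rnaseq = [
    {"gene": "BRCA1", "fpkm": 12.5},
    {"gene": "TP53", "fpkm": 45.2},
]

proteomics = [
    {"gene": "BRCA1", "intensity": 8500.0},
    {"gene": "TP53", "intensity": 12300.0},
]


def detect_key(left, right):
    """Return the first candidate column present in both tables."""
    left_cols = set(left[0]) if left else set()
    right_cols = set(right[0]) if right else set()
    for name in CANDIDATE_KEYS:
        if name in left_cols and name in right_cols:
            return name
    raise ValueError("no shared biological key column found")


def bio_join_model(left, right):
    """Join on the detected key, keeping only rows present in both tables."""
    key = detect_key(left, right)
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


multi_omics = bio_join_model(rnaseq, proteomics)
print(detect_key(rnaseq, proteomics))
print(multi_omics)
```

If two tables share more than one candidate column, a scheme like this silently picks the first. In that situation, passing the key explicitly to inner_join or left_join removes the ambiguity.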

Complete Pipeline Example

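Table verbs are most useful when composed into an analysis pipeline. The example below reads a CSV file, keeps the significant genes, sorts them by fold change, and prints one line per gene: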

let data = read_csv("data/expression.csv")
data
  |> filter(|r| r.pvalue < 0.01)
  |> sort_by(|a, b| b.log2fc - a.log2fc)
  |> each(|r| println(f"{r.gene}: FC={r.log2fc}, p={r.pvalue}"))

Writing Tables

let data = from_records([
  {gene: "BRCA1", log2fc: 2.5, pvalue: 0.001},
  {gene: "TP53", log2fc: -1.8, pvalue: 0.005}
])

# Write to CSV (table first, path second)
write_csv(data, "output/results.csv")

# Or with pipe syntax
data |> write_csv("output/results.csv")

# Pretty-print to console
println(data)