Advanced Analytics

BioLang includes computational biology functions for expression normalization, dimensionality reduction (UMAP, t-SNE), graph clustering (Leiden), sequence sketching, phylogenetics, and gene set enrichment. These functions operate on matrices and tables and integrate with BioLang's pipe-first design.

Expression Normalization

Normalize count data using standard methods before downstream analysis:

# Load count data and gene lengths
let counts = read_csv("counts_matrix.csv")
let lengths = read_csv("gene_lengths.csv") |> col("length")
let total = sum(col(counts, "sample1"))

# Normalize with TPM, FPKM, or CPM
let tpm_values = tpm(counts, lengths)
let fpkm_values = fpkm(counts, lengths, total)
let cpm_values = cpm(counts)

UMAP

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that preserves both local and global structure:

# umap(matrix)
# Returns 2D coordinates for each row

let data = matrix([
  [1.0, 2.0, 3.0],
  [4.0, 5.0, 6.0],
  [7.0, 8.0, 9.0],
  [2.0, 3.0, 4.0],
])

# Reduce to 2 dimensions
let coords = umap(data)
print(coords)

t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is another non-linear dimensionality reduction method that excels at preserving local cluster structure:

# tsne(matrix)
# Returns 2D coordinates for each row

let data = matrix([
  [1.0, 2.0, 3.0],
  [4.0, 5.0, 6.0],
  [7.0, 8.0, 9.0],
  [2.0, 3.0, 4.0],
])

let coords = tsne(data)
print(coords)

Leiden Clustering

The Leiden algorithm is an improved version of Louvain community detection, widely used in single-cell analysis for identifying cell types:

# leiden(graph)
# Returns cluster assignment for each node

let adj = matrix([
  [0, 1, 1, 0, 0],
  [1, 0, 1, 0, 0],
  [1, 1, 0, 0, 0],
  [0, 0, 0, 0, 1],
  [0, 0, 0, 1, 0],
])

# Cluster the adjacency graph
let clusters = leiden(adj)
print("Cluster assignments:", clusters)

Sketching & Distance

MinHash sketching enables fast approximate distance computation between sequences without full alignment:

# Build MinHash sketches for fast comparison
let seq1 = dna"ATCGATCGATCGATCGATCG"
let seq2 = dna"ATCGATCAATCGATCGATCG"

let s1 = sketch(seq1, 21)
let s2 = sketch(seq2, 21)

# Compute sketch distance (Jaccard-based)
let dist = sketch_dist(s1, s2)
print("Sketch distance:", dist)

# Compare many sequences pairwise
let seqs = read_fasta("data/sequences.fasta") |> collect()
seqs |> map(|a| seqs |> map(|b| sketch_dist(sketch(a.seq, 21), sketch(b.seq, 21))))
  |> print()

Enrichment Analysis

BioLang provides enrichment analysis via GMT gene set files. Use read_gmt() to load gene sets, then enrich() for over-representation analysis or gsea() for gene set enrichment:

# Load gene sets from a GMT file (e.g., MSigDB)
let gene_sets = read_gmt("c2.cp.kegg.v2023.2.symbols.gmt")

# Over-representation analysis
# enrich(gene_list, gene_sets)
let sig_genes = ["BRCA1", "TP53", "EGFR", "MYC", "KRAS"]

let enrich_results = enrich(sig_genes, gene_sets)

# Results: gene_set, pvalue, overlap, size
print(enrich_results)

# GSEA — uses a ranked list and gene sets
# gsea(ranked_list, gene_sets)
let ranked = [["BRCA1", 3.2], ["TP53", 2.8], ["EGFR", -1.5], ["MYC", -2.1]]
let gsea_results = gsea(ranked, gene_sets)
print(gsea_results)

Neighbor-Joining Trees

Build phylogenetic trees from distance matrices:

# neighbor_joining(distance_matrix)
# distance_matrix: Matrix (symmetric)

let dist = matrix([
  [0.0, 0.3, 0.5, 0.7],
  [0.3, 0.0, 0.4, 0.6],
  [0.5, 0.4, 0.0, 0.8],
  [0.7, 0.6, 0.8, 0.0],
])

let tree = neighbor_joining(dist)
print(tree)

Full Pipeline Example

# Complete analysis pipeline: sequences -> k-mers -> clustering -> enrichment

# 1. Load sequences and compute k-mer profiles
let seqs = read_fasta("data/sequences.fasta") |> collect()
# kmer_count returns a Table per sequence — top 50 k-mers each
let profiles = seqs
  |> map(|r| kmer_count(r.seq, 21, 50))
  |> collect()

# 2. Build distance matrix via sketching
let sketches = seqs |> map(|r| sketch(r.seq, 21)) |> collect()
let dist = matrix(sketches |> map(|a| sketches |> map(|b| sketch_dist(a, b))))

# 3. Dimensionality reduction
let coords = umap(dist)
print("UMAP coordinates:", coords)

# 4. Clustering
let clusters = leiden(dist)
print("Cluster assignments:", clusters)

# 5. Build phylogenetic tree
let tree = neighbor_joining(dist)
print(tree)

# 6. Enrichment analysis
let gene_sets = read_gmt("pathways.gmt")
let genes = ["BRCA1", "TP53", "EGFR", "MYC"]
let enriched = enrich(genes, gene_sets)
print(enriched)