Advanced Analytics
BioLang includes computational biology functions for expression normalization, dimensionality reduction (UMAP, t-SNE), graph clustering (Leiden), sequence sketching, phylogenetics, and gene set enrichment. These functions operate on matrices and tables and integrate with BioLang's pipe-first design.
Expression Normalization
Normalize count data using standard methods before downstream analysis:
# Load count data and gene lengths
let counts = read_csv("counts_matrix.csv")
let lengths = read_csv("gene_lengths.csv") |> col("length")
let total = sum(col(counts, "sample1"))
# Normalize with TPM, FPKM, or CPM
let tpm_values = tpm(counts, lengths)
let fpkm_values = fpkm(counts, lengths, total)
let cpm_values = cpm(counts)
UMAP
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that preserves both local and global structure:
# umap(matrix)
# Returns 2D coordinates for each row
let data = matrix([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[2.0, 3.0, 4.0],
])
# Reduce to 2 dimensions
let coords = umap(data)
print(coords)
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is another non-linear dimensionality reduction method that excels at preserving local cluster structure:
# tsne(matrix)
# Returns 2D coordinates for each row
let data = matrix([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[2.0, 3.0, 4.0],
])
let coords = tsne(data)
print(coords)
Leiden Clustering
The Leiden algorithm is an improved version of Louvain community detection, widely used in single-cell analysis for identifying cell types:
# leiden(graph)
# Returns cluster assignment for each node
let adj = matrix([
[0, 1, 1, 0, 0],
[1, 0, 1, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0],
])
# Cluster the adjacency graph
let clusters = leiden(adj)
print("Cluster assignments:", clusters)
Sketching & Distance
MinHash sketching enables fast approximate distance computation between sequences without full alignment:
# Build MinHash sketches for fast comparison
let seq1 = dna"ATCGATCGATCGATCGATCG"
let seq2 = dna"ATCGATCAATCGATCGATCG"
let s1 = sketch(seq1, 21)
let s2 = sketch(seq2, 21)
# Compute sketch distance (Jaccard-based)
let dist = sketch_dist(s1, s2)
print("Sketch distance:", dist)
# Compare many sequences pairwise
let seqs = read_fasta("data/sequences.fasta") |> collect()
seqs |> map(|a| seqs |> map(|b| sketch_dist(sketch(a.seq, 21), sketch(b.seq, 21))))
|> print()
Enrichment Analysis
BioLang provides enrichment analysis via GMT gene set files. Use
read_gmt() to load gene sets, then enrich() for
over-representation analysis or gsea() for gene set enrichment:
# Load gene sets from a GMT file (e.g., MSigDB)
let gene_sets = read_gmt("c2.cp.kegg.v2023.2.symbols.gmt")
# Over-representation analysis
# enrich(gene_list, gene_sets)
let sig_genes = ["BRCA1", "TP53", "EGFR", "MYC", "KRAS"]
let enrich_results = enrich(sig_genes, gene_sets)
# Results: gene_set, pvalue, overlap, size
print(enrich_results)
# GSEA — uses a ranked list and gene sets
# gsea(ranked_list, gene_sets)
let ranked = [["BRCA1", 3.2], ["TP53", 2.8], ["EGFR", -1.5], ["MYC", -2.1]]
let gsea_results = gsea(ranked, gene_sets)
print(gsea_results)
Neighbor-Joining Trees
Build phylogenetic trees from distance matrices:
# neighbor_joining(distance_matrix)
# distance_matrix: Matrix (symmetric)
let dist = matrix([
[0.0, 0.3, 0.5, 0.7],
[0.3, 0.0, 0.4, 0.6],
[0.5, 0.4, 0.0, 0.8],
[0.7, 0.6, 0.8, 0.0],
])
let tree = neighbor_joining(dist)
print(tree)
Full Pipeline Example
# Complete analysis pipeline: sequences -> k-mers -> clustering -> enrichment
# 1. Load sequences and compute k-mer profiles
let seqs = read_fasta("data/sequences.fasta") |> collect()
# kmer_count returns a Table per sequence — top 50 k-mers each
let profiles = seqs
|> map(|r| kmer_count(r.seq, 21, 50))
|> collect()
# 2. Build distance matrix via sketching
let sketches = seqs |> map(|r| sketch(r.seq, 21)) |> collect()
let dist = matrix(sketches |> map(|a| sketches |> map(|b| sketch_dist(a, b))))
# 3. Dimensionality reduction
let coords = umap(dist)
print("UMAP coordinates:", coords)
# 4. Clustering
let clusters = leiden(dist)
print("Cluster assignments:", clusters)
# 5. Build phylogenetic tree
let tree = neighbor_joining(dist)
print(tree)
# 6. Enrichment analysis
let gene_sets = read_gmt("pathways.gmt")
let genes = ["BRCA1", "TP53", "EGFR", "MYC"]
let enriched = enrich(genes, gene_sets)
print(enriched)