PDB, PubMed & Enrichment
11 builtins for querying protein structures from the RCSB PDB, searching biomedical literature via PubMed, and running gene set enrichment analysis. All functions return structured records or tables ready for piping.
PDB Structures
pdb_entry(pdb_id)
String → Record
Fetches an entry from the RCSB Protein Data Bank. Returns a record with structural metadata including the title, experimental method, resolution, source organism, and release date.
# requires: internet connection (PDB API)
let hb = pdb_entry("4HHB")
print(hb.title) # "THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN..."
print(hb.method) # "X-RAY DIFFRACTION"
print(hb.resolution) # 1.74
print(hb.organism) # "Homo sapiens"
print(hb.release_date) # "1984-07-17"
pdb_search(query)
String → List
Searches the PDB by keyword. Returns a list of PDB IDs matching the query.
Use the results with pdb_entry() to fetch full metadata.
# requires: internet connection (PDB API)
let ids = pdb_search("insulin receptor")
print(len(ids)) # number of matching structures
print(ids[0..5]) # first 5 PDB IDs
# Fetch details for the top hits
ids[0..3]
|> map(|id| pdb_entry(id))
|> each(|e| print(e.title + " — " + str(e.resolution) + " A"))
pdb_chains(pdb_id)
String → List
Lists the polymer chains in a structure. Each entry is a record with chain ID, entity ID, and description.
# requires: internet connection (PDB API)
let chains = pdb_chains("4HHB")
chains |> each(|c| print(c.chain_id + ": " + c.description))
# A: HEMOGLOBIN (DEOXY) (ALPHA CHAIN)
# B: HEMOGLOBIN (DEOXY) (BETA CHAIN)
# C: HEMOGLOBIN (DEOXY) (ALPHA CHAIN)
# D: HEMOGLOBIN (DEOXY) (BETA CHAIN)
pdb_entity(pdb_id, entity_id)
String, Int → Record
Fetches detailed information about a specific entity within a structure. Returns a record with the entity description, type, molecular weight, chain assignments, and organism source.
# requires: internet connection (PDB API)
let alpha = pdb_entity("4HHB", 1)
print(alpha.description) # "HEMOGLOBIN ALPHA CHAIN"
print(alpha.type) # "polypeptide(L)"
print(alpha.weight) # 15126.4
print(alpha.chains) # ["A", "C"]
pdb_sequence(pdb_id, entity_id)
String, Int → String
Retrieves the amino acid sequence for a given entity. Returns a plain protein sequence string suitable for alignment or further analysis.
# requires: internet connection (PDB API)
let seq = pdb_sequence("4HHB", 1)
print(len(seq)) # 141 residues
print(seq[0..20]) # "VLSPADKTNVKAAWGKVGAH"
Example: Compare Hemoglobin Chains
Fetch both hemoglobin subunit sequences from PDB entry 4HHB and compare their lengths and composition.
# requires: internet connection (PDB API)
# Fetch chain information
let chains = pdb_chains("4HHB")
let entities = chains
|> map(|c| c.entity_id)
|> unique()
# Get sequences for each unique entity
let seqs = entities |> map(|eid| {
entity_id: eid,
info: pdb_entity("4HHB", eid),
sequence: pdb_sequence("4HHB", eid),
})
# Compare subunits
seqs |> each(|s| {
let seq = s.sequence
print(s.info.description)
print(" Length: " + str(len(seq)) + " residues")
print(" Chains: " + join(s.info.chains, ", "))
print(" His pct: " + str(round(count(seq, "H") / len(seq) * 100, 1)) + "%")
})
# Output:
# HEMOGLOBIN ALPHA CHAIN
# Length: 141 residues
# Chains: A, C
# His pct: 7.1%
# HEMOGLOBIN BETA CHAIN
# Length: 146 residues
# Chains: B, D
# His pct: 6.2%
PubMed
pubmed_search(query, max_results?)
String, Int? → Record
Searches PubMed via the NCBI E-utilities. Returns a record with
count (total matches in the database) and ids
(list of PMIDs for the returned page). The optional second argument
limits how many IDs are returned (default 20).
# requires: internet connection (NCBI PubMed API)
let results = pubmed_search("CRISPR Cas9 therapy", 5)
# pubmed_search returns a list of PMID strings directly
print(len(results)) # number of IDs returned (up to limit)
print(results) # ["39847123", "39841056", "39838901", "39835444", "39831200"]
pubmed_abstract(pmid)
String → Record
Fetches the abstract and metadata for a single PubMed article. Returns a
record with title, abstract, authors,
journal, year, and doi.
# requires: internet connection (NCBI PubMed API)
let paper = pubmed_abstract("32553272")
print(paper.title) # "Search-and-replace genome editing without..."
print(paper.journal) # "Nature"
print(paper.year) # 2020
print(paper.authors[0]) # "Anzalone AV"
Example: Literature Review Pipeline
Search for recent papers on a topic, fetch their abstracts, and build a summary table for review.
# requires: internet connection (NCBI PubMed API)
# Search for recent gene therapy papers
let hits = pubmed_search("AAV gene therapy retinal dystrophy", 10)
print("Found " + str(hits.count) + " articles")
# Fetch abstracts for each result
let papers = hits.ids |> map(|pmid| pubmed_abstract(pmid))
# Build a summary table
let review = papers |> map(|p| {
year: p.year,
first_author: p.authors[0],
journal: p.journal,
title: p.title,
doi: p.doi,
})
# Sort by year descending and display
review
|> to_table()
|> sort_by(|r| -r.year)
|> print()
# Export for reference management
review |> to_table() |> write_tsv("literature_review.tsv")
Enrichment Analysis
BioLang provides builtins for two standard gene set enrichment approaches: over-representation analysis (ORA) and gene set enrichment analysis (GSEA). Both work with gene sets in GMT format and return tables of significant terms.
read_gmt(path)
String → Record
Reads a Gene Matrix Transposed (GMT) file. Each line defines one gene set
as: set_name\tdescription\tgene1\tgene2\t.... Returns a record
mapping set names to their gene lists.
# GMT format (tab-separated):
# GO:0006915 Apoptotic process BAX BCL2 CASP3 CASP9 TP53
# GO:0007049 Cell cycle CDK1 CDK2 CCNB1 CCND1 RB1
let gene_sets = read_gmt("c5.go.bp.v2024.1.symbols.gmt")
print(len(keys(gene_sets))) # number of gene sets loaded
print(gene_sets["GO:0006915"]) # ["BAX", "BCL2", "CASP3", "CASP9", "TP53"]
enrich(genes, gene_sets, background)
List, Record, List → Table
Performs over-representation analysis (ORA) using the hypergeometric test.
Tests whether each gene set is enriched in your gene list relative to the
background. Returns a table with columns: term,
overlap, set_size, p_value,
fdr (Benjamini-Hochberg adjusted), and genes
(the overlapping gene symbols). Rows are sorted by p-value ascending.
let de_genes = ["TP53", "BRCA1", "CDK2", "CASP3", "BAX", "BCL2", "MYC"]
let gene_sets = read_gmt("c5.go.bp.v2024.1.symbols.gmt")
let background = read_lines("all_expressed_genes.txt")
let ora = enrich(de_genes, gene_sets, background)
ora
|> filter(|r| r.fdr < 0.05)
|> print()
# term | overlap | set_size | p_value | fdr | genes
# ----------------+---------+----------+----------+---------+-----------------------
# GO:0006915 | 4 | 312 | 1.2e-06 | 3.4e-04 | TP53,BAX,BCL2,CASP3
# GO:0008219 | 3 | 198 | 8.7e-05 | 0.012 | TP53,BAX,CASP3
# ...
gsea(ranked_genes, gene_sets)
Table, Record → Table
Performs Gene Set Enrichment Analysis on a pre-ranked gene list. The
ranked_genes table must have gene and
score columns (e.g., log2 fold change or signed
-log10(p)). Uses a permutation test (1000 permutations) to
compute enrichment scores and significance. Returns a table with:
term, es (enrichment score), nes
(normalized ES), p_value, fdr, and
leading_edge (the core enrichment genes).
# Prepare ranked gene list (e.g., from differential expression)
let de_results = tsv("de_results.tsv")
let ranked = de_results
|> mutate("score", |r| r.log2fc * -log10(r.pvalue))
|> select("gene", "score")
|> sort_by(|r| -r.score)
let gene_sets = read_gmt("h.all.v2024.1.symbols.gmt")
let gsea_results = gsea(ranked, gene_sets)
gsea_results
|> filter(|r| r.fdr < 0.25)
|> print()
# term | es | nes | p_value | fdr | leading_edge
# ---------------------------+-------+-------+---------+-------+-------------------
# HALLMARK_APOPTOSIS | 0.62 | 2.14 | 0.0 | 0.002 | BAX,CASP3,TP53...
# HALLMARK_P53_PATHWAY | 0.58 | 2.01 | 0.001 | 0.008 | CDKN1A,MDM2,BAX...
# HALLMARK_MYC_TARGETS_V1 | -0.45 | -1.87 | 0.002 | 0.015 | NOP56,RRP1,NPM1...
GMT Format
The Gene Matrix Transposed (GMT) format is a tab-delimited text file widely used by MSigDB, Enrichr, and other enrichment tools. Each line defines one gene set:
SET_NAME description_or_url GENE1 GENE2 GENE3 ...
HALLMARK_APOPTOSIS https://www.gsea-msigdb.org/... AIM2 ALKBH5 ANP32B ...
HALLMARK_P53_PATHWAY https://www.gsea-msigdb.org/... AEN ANKLE1 APC ...
The first column is the set name, the second is a description (often a URL),
and all remaining columns are gene symbols. BioLang’s read_gmt()
discards the description column and maps set names to gene lists. Common GMT
sources include the
Molecular Signatures Database (MSigDB),
Gene Ontology, KEGG, and Reactome exports.
Example: RNA-seq Enrichment Pipeline
A complete workflow: load differential expression results, run both ORA and GSEA, and export the significant pathways.
# Load differential expression results
let de = tsv("deseq2_results.tsv")
|> filter(|r| !is_nan(r.padj))
# --- Over-Representation Analysis ---
# Select significant DE genes (padj < 0.05, |log2FC| > 1)
let sig_genes = de
|> filter(|r| r.padj < 0.05 and abs(r.log2fc) > 1.0)
|> map(|r| r.gene)
# All tested genes as background
let background = de |> map(|r| r.gene)
# Load GO Biological Process gene sets
let go_bp = read_gmt("c5.go.bp.v2024.1.symbols.gmt")
# Run ORA
let ora_results = enrich(sig_genes, go_bp, background)
let sig_ora = ora_results |> filter(|r| r.fdr < 0.05)
print("ORA: " + str(nrow(sig_ora)) + " significant GO terms")
sig_ora |> head(10) |> print()
# --- Gene Set Enrichment Analysis ---
# Rank all genes by signed significance
let ranked = de
|> mutate("score", |r| sign(r.log2fc) * -log10(max(r.pvalue, 1e-300)))
|> select("gene", "score")
|> sort_by(|r| -r.score)
# Load Hallmark gene sets
let hallmarks = read_gmt("h.all.v2024.1.symbols.gmt")
# Run GSEA
let gsea_results = gsea(ranked, hallmarks)
let sig_gsea = gsea_results |> filter(|r| r.fdr < 0.25)
print("GSEA: " + str(nrow(sig_gsea)) + " significant hallmark pathways")
# Separate up- and down-regulated pathways
let up_paths = sig_gsea |> filter(|r| r.nes > 0) |> sort_by(|r| -r.nes)
let dn_paths = sig_gsea |> filter(|r| r.nes < 0) |> sort_by(|r| r.nes)
print("Upregulated pathways:")
up_paths |> each(|r| print(" " + r.term + " (NES=" + str(round(r.nes, 2)) + ")"))
print("Downregulated pathways:")
dn_paths |> each(|r| print(" " + r.term + " (NES=" + str(round(r.nes, 2)) + ")"))
# Export results
sig_ora |> write_tsv("ora_significant.tsv")
sig_gsea |> write_tsv("gsea_significant.tsv")
PDB Reference
| Builtin | Returns | Description |
|---|---|---|
pdb_entry(pdb_id) | Record | Fetch structure metadata (title, method, resolution, organism, release_date) |
pdb_search(query) | List | Search PDB by keyword; returns list of PDB IDs |
pdb_chains(pdb_id) | List | List polymer chains (chain_id, entity_id, description) |
pdb_entity(pdb_id, entity_id) | Record | Entity detail (description, type, weight, chains) |
pdb_sequence(pdb_id, entity_id) | String | Amino acid sequence for an entity |
PubMed Reference
| Builtin | Returns | Description |
|---|---|---|
pubmed_search(query, max?) | Record | Search PubMed; returns {count, ids} |
pubmed_abstract(pmid) | Record | Fetch article metadata and abstract text |
Enrichment Reference
| Builtin | Returns | Description |
|---|---|---|
read_gmt(path) | Record | Load GMT file as {set_name: [genes]} record |
enrich(genes, sets, bg) | Table | Over-representation analysis (hypergeometric + BH FDR) |
gsea(ranked, sets) | Table | Gene Set Enrichment Analysis (permutation test, 1000 perms) |