PDB, PubMed & Enrichment

11 builtins for querying protein structures from the RCSB PDB, searching biomedical literature via PubMed, and running gene set enrichment analysis. All functions return structured records or tables ready for piping.

PDB Structures

pdb_entry(pdb_id) String → Record

Fetches an entry from the RCSB Protein Data Bank. Returns a record with structural metadata including the title, experimental method, resolution, source organism, and release date.

# requires: internet connection (PDB API)
let hb = pdb_entry("4HHB")
print(hb.title)         # "THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN..."
print(hb.method)        # "X-RAY DIFFRACTION"
print(hb.resolution)    # 1.74
print(hb.organism)      # "Homo sapiens"
print(hb.release_date)  # "1984-07-17"

pdb_search(query) String → List

Searches the PDB by keyword. Returns a list of PDB IDs matching the query. Use the results with pdb_entry() to fetch full metadata.

# requires: internet connection (PDB API)
let ids = pdb_search("insulin receptor")
print(len(ids))   # number of matching structures
print(ids[0..5])  # first 5 PDB IDs

# Fetch details for the top hits
ids[0..3]
  |> map(|id| pdb_entry(id))
  |> each(|e| print(e.title + " — " + str(e.resolution) + " A"))

pdb_chains(pdb_id) String → List

Lists the polymer chains in a structure. Each entry is a record with chain ID, entity ID, and description.

# requires: internet connection (PDB API)
let chains = pdb_chains("4HHB")
chains |> each(|c| print(c.chain_id + ": " + c.description))
# A: HEMOGLOBIN (DEOXY) (ALPHA CHAIN)
# B: HEMOGLOBIN (DEOXY) (BETA CHAIN)
# C: HEMOGLOBIN (DEOXY) (ALPHA CHAIN)
# D: HEMOGLOBIN (DEOXY) (BETA CHAIN)

pdb_entity(pdb_id, entity_id) String, Int → Record

Fetches detailed information about a specific entity within a structure. Returns a record with the entity description, type, molecular weight, chain assignments, and organism source.

# requires: internet connection (PDB API)
let alpha = pdb_entity("4HHB", 1)
print(alpha.description)   # "HEMOGLOBIN ALPHA CHAIN"
print(alpha.type)          # "polypeptide(L)"
print(alpha.weight)        # 15126.4
print(alpha.chains)        # ["A", "C"]

pdb_sequence(pdb_id, entity_id) String, Int → String

Retrieves the amino acid sequence for a given entity. Returns a plain protein sequence string suitable for alignment or further analysis.

# requires: internet connection (PDB API)
let seq = pdb_sequence("4HHB", 1)
print(len(seq))       # 141 residues
print(seq[0..20])     # "VLSPADKTNVKAAWGKVGAH"

Example: Compare Hemoglobin Chains

Fetch both hemoglobin subunit sequences from PDB entry 4HHB and compare their lengths and composition.

# requires: internet connection (PDB API)
# Fetch chain information
let chains = pdb_chains("4HHB")
let entities = chains
  |> map(|c| c.entity_id)
  |> unique()

# Get sequences for each unique entity
let seqs = entities |> map(|eid| {
  entity_id: eid,
  info: pdb_entity("4HHB", eid),
  sequence: pdb_sequence("4HHB", eid),
})

# Compare subunits
seqs |> each(|s| {
  let seq = s.sequence
  print(s.info.description)
  print("  Length:   " + str(len(seq)) + " residues")
  print("  Chains:  " + join(s.info.chains, ", "))
  print("  His pct: " + str(round(count(seq, "H") / len(seq) * 100, 1)) + "%")
})

# Output:
#   HEMOGLOBIN ALPHA CHAIN
#     Length:   141 residues
#     Chains:  A, C
#     His pct: 7.1%
#   HEMOGLOBIN BETA CHAIN
#     Length:   146 residues
#     Chains:  B, D
#     His pct: 6.2%

PubMed

pubmed_search(query, max_results?) String, Int? → Record

Searches PubMed via the NCBI E-utilities. Returns a record with count (total matches in the database) and ids (list of PMIDs for the returned page). The optional second argument limits how many IDs are returned (default 20).

# requires: internet connection (NCBI PubMed API)
let results = pubmed_search("CRISPR Cas9 therapy", 5)
# pubmed_search returns a list of PMID strings directly
print(len(results))    # number of IDs returned (up to limit)
print(results)         # ["39847123", "39841056", "39838901", "39835444", "39831200"]

pubmed_abstract(pmid) String → Record

Fetches the abstract and metadata for a single PubMed article. Returns a record with title, abstract, authors, journal, year, and doi.

# requires: internet connection (NCBI PubMed API)
let paper = pubmed_abstract("32553272")
print(paper.title)     # "Search-and-replace genome editing without..."
print(paper.journal)   # "Nature"
print(paper.year)      # 2020
print(paper.authors[0]) # "Anzalone AV"

Example: Literature Review Pipeline

Search for recent papers on a topic, fetch their abstracts, and build a summary table for review.

# requires: internet connection (NCBI PubMed API)
# Search for recent gene therapy papers
let hits = pubmed_search("AAV gene therapy retinal dystrophy", 10)
print("Found " + str(hits.count) + " articles")

# Fetch abstracts for each result
let papers = hits.ids |> map(|pmid| pubmed_abstract(pmid))

# Build a summary table
let review = papers |> map(|p| {
  year: p.year,
  first_author: p.authors[0],
  journal: p.journal,
  title: p.title,
  doi: p.doi,
})

# Sort by year descending and display
review
  |> to_table()
  |> sort_by(|r| -r.year)
  |> print()

# Export for reference management
review |> to_table() |> write_tsv("literature_review.tsv")

Enrichment Analysis

BioLang provides builtins for two standard gene set enrichment approaches: over-representation analysis (ORA) and gene set enrichment analysis (GSEA). Both work with gene sets in GMT format and return tables of significant terms.

read_gmt(path) String → Record

Reads a Gene Matrix Transposed (GMT) file. Each line defines one gene set as: set_name\tdescription\tgene1\tgene2\t.... Returns a record mapping set names to their gene lists.

# GMT format (tab-separated):
#   GO:0006915  Apoptotic process  BAX  BCL2  CASP3  CASP9  TP53
#   GO:0007049  Cell cycle         CDK1 CDK2  CCNB1  CCND1  RB1

let gene_sets = read_gmt("c5.go.bp.v2024.1.symbols.gmt")
print(len(keys(gene_sets)))  # number of gene sets loaded
print(gene_sets["GO:0006915"])  # ["BAX", "BCL2", "CASP3", "CASP9", "TP53"]

enrich(genes, gene_sets, background) List, Record, List → Table

Performs over-representation analysis (ORA) using the hypergeometric test. Tests whether each gene set is enriched in your gene list relative to the background. Returns a table with columns: term, overlap, set_size, p_value, fdr (Benjamini-Hochberg adjusted), and genes (the overlapping gene symbols). Rows are sorted by p-value ascending.

let de_genes = ["TP53", "BRCA1", "CDK2", "CASP3", "BAX", "BCL2", "MYC"]
let gene_sets = read_gmt("c5.go.bp.v2024.1.symbols.gmt")
let background = read_lines("all_expressed_genes.txt")

let ora = enrich(de_genes, gene_sets, background)
ora
  |> filter(|r| r.fdr < 0.05)
  |> print()

#  term           | overlap | set_size | p_value  | fdr     | genes
# ----------------+---------+----------+----------+---------+-----------------------
#  GO:0006915     | 4       | 312      | 1.2e-06  | 3.4e-04 | TP53,BAX,BCL2,CASP3
#  GO:0008219     | 3       | 198      | 8.7e-05  | 0.012   | TP53,BAX,CASP3
#  ...

gsea(ranked_genes, gene_sets) Table, Record → Table

Performs Gene Set Enrichment Analysis on a pre-ranked gene list. The ranked_genes table must have gene and score columns (e.g., log2 fold change or signed -log10(p)). Uses a permutation test (1000 permutations) to compute enrichment scores and significance. Returns a table with: term, es (enrichment score), nes (normalized ES), p_value, fdr, and leading_edge (the core enrichment genes).

# Prepare ranked gene list (e.g., from differential expression)
let de_results = tsv("de_results.tsv")
let ranked = de_results
  |> mutate("score", |r| r.log2fc * -log10(r.pvalue))
  |> select("gene", "score")
  |> sort_by(|r| -r.score)

let gene_sets = read_gmt("h.all.v2024.1.symbols.gmt")

let gsea_results = gsea(ranked, gene_sets)
gsea_results
  |> filter(|r| r.fdr < 0.25)
  |> print()

#  term                      | es    | nes   | p_value | fdr   | leading_edge
# ---------------------------+-------+-------+---------+-------+-------------------
#  HALLMARK_APOPTOSIS        | 0.62  | 2.14  | 0.0     | 0.002 | BAX,CASP3,TP53...
#  HALLMARK_P53_PATHWAY      | 0.58  | 2.01  | 0.001   | 0.008 | CDKN1A,MDM2,BAX...
#  HALLMARK_MYC_TARGETS_V1   | -0.45 | -1.87 | 0.002   | 0.015 | NOP56,RRP1,NPM1...

GMT Format

The Gene Matrix Transposed (GMT) format is a tab-delimited text file widely used by MSigDB, Enrichr, and other enrichment tools. Each line defines one gene set:

SET_NAME    description_or_url    GENE1    GENE2    GENE3    ...
HALLMARK_APOPTOSIS    https://www.gsea-msigdb.org/...    AIM2    ALKBH5    ANP32B    ...
HALLMARK_P53_PATHWAY    https://www.gsea-msigdb.org/...    AEN    ANKLE1    APC    ...

The first column is the set name, the second is a description (often a URL), and all remaining columns are gene symbols. BioLang’s read_gmt() discards the description column and maps set names to gene lists. Common GMT sources include the Molecular Signatures Database (MSigDB), Gene Ontology, KEGG, and Reactome exports.

Example: RNA-seq Enrichment Pipeline

A complete workflow: load differential expression results, run both ORA and GSEA, and export the significant pathways.

# Load differential expression results
let de = tsv("deseq2_results.tsv")
  |> filter(|r| !is_nan(r.padj))

# --- Over-Representation Analysis ---

# Select significant DE genes (padj < 0.05, |log2FC| > 1)
let sig_genes = de
  |> filter(|r| r.padj < 0.05 and abs(r.log2fc) > 1.0)
  |> map(|r| r.gene)

# All tested genes as background
let background = de |> map(|r| r.gene)

# Load GO Biological Process gene sets
let go_bp = read_gmt("c5.go.bp.v2024.1.symbols.gmt")

# Run ORA
let ora_results = enrich(sig_genes, go_bp, background)
let sig_ora = ora_results |> filter(|r| r.fdr < 0.05)
print("ORA: " + str(nrow(sig_ora)) + " significant GO terms")
sig_ora |> head(10) |> print()

# --- Gene Set Enrichment Analysis ---

# Rank all genes by signed significance
let ranked = de
  |> mutate("score", |r| sign(r.log2fc) * -log10(max(r.pvalue, 1e-300)))
  |> select("gene", "score")
  |> sort_by(|r| -r.score)

# Load Hallmark gene sets
let hallmarks = read_gmt("h.all.v2024.1.symbols.gmt")

# Run GSEA
let gsea_results = gsea(ranked, hallmarks)
let sig_gsea = gsea_results |> filter(|r| r.fdr < 0.25)
print("GSEA: " + str(nrow(sig_gsea)) + " significant hallmark pathways")

# Separate up- and down-regulated pathways
let up_paths = sig_gsea |> filter(|r| r.nes > 0) |> sort_by(|r| -r.nes)
let dn_paths = sig_gsea |> filter(|r| r.nes < 0) |> sort_by(|r| r.nes)

print("Upregulated pathways:")
up_paths |> each(|r| print("  " + r.term + " (NES=" + str(round(r.nes, 2)) + ")"))

print("Downregulated pathways:")
dn_paths |> each(|r| print("  " + r.term + " (NES=" + str(round(r.nes, 2)) + ")"))

# Export results
sig_ora |> write_tsv("ora_significant.tsv")
sig_gsea |> write_tsv("gsea_significant.tsv")

PDB Reference

Builtin	Returns	Description
`pdb_entry(pdb_id)`	Record	Fetch structure metadata (title, method, resolution, organism, release_date)
`pdb_search(query)`	List	Search PDB by keyword; returns list of PDB IDs
`pdb_chains(pdb_id)`	List	List polymer chains (chain_id, entity_id, description)
`pdb_entity(pdb_id, entity_id)`	Record	Entity detail (description, type, weight, chains)
`pdb_sequence(pdb_id, entity_id)`	String	Amino acid sequence for an entity

PubMed Reference

Builtin	Returns	Description
`pubmed_search(query, max?)`	Record	Search PubMed; returns {count, ids}
`pubmed_abstract(pmid)`	Record	Fetch article metadata and abstract text

Enrichment Reference

Builtin	Returns	Description
`read_gmt(path)`	Record	Load GMT file as {set_name: [genes]} record
`enrich(genes, sets, bg)`	Table	Over-representation analysis (hypergeometric + BH FDR)
`gsea(ranked, sets)`	Table	Gene Set Enrichment Analysis (permutation test, 1000 perms)