Day 9: Biological Databases and APIs

The Problem

You found a mutation in gene BRCA1. What does this gene do? Is this mutation known? What pathway is it in? What protein does it encode? What other proteins does it interact with? What 3D structures are available?

This information exists — scattered across a dozen databases maintained by organizations around the world. NCBI in Bethesda, EBI in Cambridge, KEGG in Kyoto, RCSB in New Jersey. Manually searching each one, copying identifiers between browser tabs, cross-referencing results — it takes hours for a single gene. For a list of 50 candidate genes from a screen, it takes days.

With API calls, it takes seconds.

BioLang has built-in clients for 12+ biological databases. No packages to install. No authentication boilerplate. No JSON parsing. You call a function, you get structured data back.

The Database Landscape

Biological knowledge is distributed across specialized databases. Each one is the authoritative source for a particular kind of information:

[Figure: Biological database landscape]

No single database has the complete picture. NCBI has the sequences but not the pathways. KEGG has the pathways but not the 3D structures. PDB has the structures but not the interaction networks. The real power comes from querying multiple databases and combining the results.

Database	Maintained By	Speciality	BioLang Functions
NCBI	NIH (USA)	Sequences, genes, literature	`ncbi_gene`, `ncbi_search`, `ncbi_sequence`
Ensembl	EBI/EMBL	Gene models, variants, orthology	`ensembl_symbol`, `ensembl_sequence`, `ensembl_vep`
UniProt	EBI/SIB/PIR	Protein function, features	`uniprot_entry`, `uniprot_search`, `uniprot_features`
KEGG	Kyoto Univ	Pathways, metabolism	`kegg_get`, `kegg_find`, `kegg_link`
PDB	RCSB (USA)	3D protein structures	`pdb_entry`, `pdb_search`
STRING	EMBL	Protein-protein interactions	`string_network`, `string_enrichment`
Gene Ontology	GO Consortium	Functional annotations	`go_term`, `go_annotations`
Reactome	EBI/OICR	Biological pathways	`reactome_pathways`, `reactome_search`

NCBI — The Central Repository

The National Center for Biotechnology Information (NCBI) is one of the largest repositories of biological data. It hosts GenBank (sequences), PubMed (literature), Gene (gene records), and dozens of other databases. Most bioinformaticians use at least one NCBI resource routinely.

BioLang’s NCBI functions wrap the E-utilities API, handling the XML parsing, rate limiting, and error recovery for you.

Looking Up a Gene

The simplest operation: look up a gene by symbol.

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

let gene = ncbi_gene("BRCA1")
println(f"Symbol: {gene.symbol}")
println(f"Name: {gene.name}")
println(f"Description: {gene.description}")
println(f"Chromosome: {gene.chromosome}")
println(f"Location: {gene.location}")
println(f"Organism: {gene.organism}")

Expected output (approximate — NCBI data is updated regularly):

Symbol: BRCA1
Name: BRCA1 DNA repair associated
Description: BRCA1 DNA repair associated
Chromosome: 17
Location: 17q21.31
Organism: Homo sapiens

ncbi_gene() returns a record with fields: id, symbol, name, description, organism, chromosome, location, summary. When the search matches a single gene, you get the full record directly. When it matches multiple genes, you get a list of NCBI Gene IDs.

Searching NCBI Databases

NCBI hosts over 40 databases. You can search any of them with ncbi_search():

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

# Search PubMed for articles about BRCA1 and breast cancer
let pubmed_ids = ncbi_search("pubmed", "BRCA1 breast cancer", 5)
println(f"PubMed hits: {len(pubmed_ids)}")
for id in pubmed_ids {
    println(f"  PMID: {id}")
}

# Search the Gene database
let gene_ids = ncbi_search("gene", "TP53 homo sapiens", 5)
println(f"Gene IDs: {len(gene_ids)}")

Note the argument order: ncbi_search(database, query, max_results). The max_results parameter is optional (defaults to 20).

Fetching Sequences

Retrieve a sequence by its accession number:

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

# Fetch BRCA1 mRNA sequence (RefSeq accession)
let fasta = ncbi_sequence("NM_007294")
println(f"Sequence (first 200 chars):")
println(fasta |> take(200))

ncbi_sequence() returns the raw FASTA text. You can parse it further or write it to a file.
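To make "parse it further" concrete: FASTA text is a header line starting with `>` followed by one or more wrapped sequence lines. The sketch below is plain Python rather than BioLang, and runs offline on a hand-written sample. The function name `parse_fasta` and the sample record are made up for illustration.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hand-written sample, not a real database record.
sample = ">demo_record hypothetical example\nATGGATTTATCT\nGCTCTTCGCGTT\n"
records = parse_fasta(sample)
for header, seq in records:
    print(header, len(seq))
```

The wrapped lines are joined into one string per record, so the length printed is the full sequence length rather than the length of a single line.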

Ensembl — Gene Models and Variants

Ensembl, maintained by the European Bioinformatics Institute (EBI), provides gene annotations, comparative genomics, and variant effect prediction. Its REST API is particularly well-designed and fast.

Looking Up a Gene by Symbol

# requires: internet connection

let gene = ensembl_symbol("homo_sapiens", "BRCA1")
println(f"Ensembl ID: {gene.id}")
println(f"Symbol: {gene.symbol}")
println(f"Biotype: {gene.biotype}")
println(f"Chromosome: {gene.chromosome}")
println(f"Start: {gene.start}")
println(f"End: {gene.end}")
println(f"Strand: {gene.strand}")

Expected output (approximate):

Ensembl ID: ENSG00000012048
Symbol: BRCA1
Biotype: protein_coding
Chromosome: 17
Start: 43044295
End: 43170245
Strand: -1

Note the argument order: ensembl_symbol(species, symbol). Species uses Ensembl’s underscore-separated format: "homo_sapiens", "mus_musculus", "danio_rerio".

Getting Protein Sequences

Once you have an Ensembl gene ID, you can retrieve its sequence in different forms:

# requires: internet connection

let gene = ensembl_symbol("homo_sapiens", "BRCA1")

# Get the protein sequence
let protein = ensembl_sequence(gene.id, "protein")
println(f"Protein length: {len(protein.seq)} amino acids")
println(f"First 50 aa: {protein.seq |> take(50)}")

# Get the coding sequence (CDS)
let cds = ensembl_sequence(gene.id, "cds")
println(f"CDS length: {len(cds.seq)} bases")

ensembl_sequence() takes an Ensembl ID and an optional sequence type: "genomic" (default), "cds", "cdna", or "protein". It returns a record with id, seq, and molecule fields.

Variant Effect Prediction (VEP)

One of Ensembl’s most powerful features is VEP — the Variant Effect Predictor. Given a variant, it tells you the predicted biological consequence:

# requires: internet connection

# Predict the effect of a BRCA1 variant (HGVS notation)
let results = ensembl_vep("17:g.43091434G>A")
for r in results {
    println(f"Alleles: {r.allele_string}")
    println(f"Most severe: {r.most_severe_consequence}")
    for tc in r.transcript_consequences {
        println(f"  Transcript: {tc.transcript_id}")
        println(f"  Impact: {tc.impact}")
        println(f"  Consequences: {tc.consequences}")
    }
}

VEP accepts HGVS notation (e.g., "17:g.43091434G>A") and returns a list of result records, each containing transcript-level consequence predictions with impact severity (HIGH, MODERATE, LOW, MODIFIER).
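Because the four impact levels are ordered, a common follow-up step is to pick the most severe transcript consequence for a variant. The sketch below shows that ranking in plain Python (not BioLang) on made-up records. The dictionary keys mirror the `transcript_id` and `impact` fields used above, while the helper name `most_severe` and the transcript IDs are hypothetical.

```python
# Impact levels from most to least severe, in the order listed above.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def most_severe(transcript_consequences):
    """Return the consequence with the most severe impact (first one on ties)."""
    return min(transcript_consequences, key=lambda tc: IMPACT_RANK[tc["impact"]])


# Made-up records for illustration, not real VEP output.
consequences = [
    {"transcript_id": "TX_A", "impact": "MODIFIER"},
    {"transcript_id": "TX_B", "impact": "MODERATE"},
    {"transcript_id": "TX_C", "impact": "LOW"},
]
worst = most_severe(consequences)
print(worst["transcript_id"], worst["impact"])  # TX_B MODERATE
```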

UniProt — Protein Knowledge

UniProt is the definitive resource for protein function, domains, post-translational modifications, and literature. Every well-characterized protein has a UniProt entry curated by expert biologists.

Looking Up a Protein

# requires: internet connection

# Look up BRCA1 by its UniProt accession
let entry = uniprot_entry("P38398")
println(f"Name: {entry.name}")
println(f"Organism: {entry.organism}")
println(f"Length: {entry.sequence_length} aa")
println(f"Gene names: {entry.gene_names}")
println(f"Function: {entry.function}")

Expected output (approximate):

Name: BRCA1_HUMAN
Organism: Homo sapiens (Human)
Length: 1863 aa
Gene names: ["BRCA1", "RNF53"]
Function: E3 ubiquitin-protein ligase that...

uniprot_entry() returns a record with accession, name, organism, sequence_length, gene_names (a list), and function.

Searching UniProt

# requires: internet connection

# Search for human BRCA1 proteins
let results = uniprot_search("BRCA1 AND organism_name:human", 5)
println(f"Results: {len(results)}")
for entry in results {
    println(f"  {entry.accession}: {entry.name} ({entry.sequence_length} aa)")
}

uniprot_search() takes a query string (using UniProt’s query syntax) and an optional limit (defaults to 10). It returns a list of protein entry records.

Protein Features and Domains

# requires: internet connection

# Get structural and functional features of BRCA1
let features = uniprot_features("P38398")
println(f"Total features: {len(features)}")

# Find just the domains
let domains = features |> filter(|f| f.type == "Domain")
println(f"Domains: {len(domains)}")
for d in domains {
    println(f"  {d.description} ({d.location})")
}

# Find binding sites
let sites = features |> filter(|f| f.type == "Binding site")
println(f"Binding sites: {len(sites)}")

Each feature record has type, location, and description fields. Common types include "Domain", "Region", "Binding site", "Modified residue", "Disulfide bond", and "Chain".

Gene Ontology Terms from UniProt

# requires: internet connection

# Get GO terms associated with BRCA1
let go_terms = uniprot_go("P38398")
println(f"GO terms: {len(go_terms)}")
for t in go_terms |> take(5) {
    println(f"  {t.id}: {t.term} ({t.aspect})")
}

KEGG — Pathways and Metabolism

The Kyoto Encyclopedia of Genes and Genomes links genes to metabolic and signaling pathways. It is especially valuable for understanding how individual genes fit into larger biological systems.

Finding Genes in KEGG

# requires: internet connection

# Find BRCA1 in the KEGG database
let results = kegg_find("genes", "BRCA1")
println(f"KEGG hits: {len(results)}")
for r in results |> take(5) {
    println(f"  {r.id}: {r.description}")
}

kegg_find() takes a database name and a query string. The database can be "genes", "pathway", "compound", "disease", "drug", and more. It returns a list of records with id and description.

Getting Detailed Entries

# requires: internet connection

# Get detailed entry for human BRCA1
let entry = kegg_get("hsa:672")
println(f"KEGG entry (first 500 chars):")
println(entry |> take(500))

kegg_get() returns the raw KEGG flat-file text for any KEGG identifier. KEGG IDs use an organism prefix: hsa for Homo sapiens, mmu for Mus musculus, etc.
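Since the result is raw text, you usually need to split it into fields before using it. The sketch below is plain Python (not BioLang) and assumes the usual KEGG flat-file layout: the field name occupies the first 12 columns, continuation lines leave those columns blank, and `///` ends the entry. The function name `parse_kegg_flat` and the sample entry are hand-written placeholders, not real KEGG output.

```python
def parse_kegg_flat(text):
    """Group a KEGG flat-file entry into {field_name: [value lines]}."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-entry marker
            break
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            # A non-blank key column starts a new field.
            current = key
            fields.setdefault(current, [])
        if current is not None and value:
            fields[current].append(value)
    return fields


# Hand-written, abbreviated sample in the assumed layout.
sample = "\n".join([
    "ENTRY".ljust(12) + "672  CDS",
    "SYMBOL".ljust(12) + "BRCA1",
    "PATHWAY".ljust(12) + "demo00001  Placeholder pathway one",
    " " * 12 + "demo00002  Placeholder pathway two",
    "///",
])
fields = parse_kegg_flat(sample)
print(fields["SYMBOL"])
print(len(fields["PATHWAY"]))
```

Keeping each field as a list of lines preserves multi-line fields such as PATHWAY without guessing at their internal structure.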

Linking to Pathways

The real power of KEGG is connecting genes to pathways:

# requires: internet connection

# Find pathways that BRCA1 participates in
let links = kegg_link("pathway", "hsa:672")
println(f"Pathways involving BRCA1: {len(links)}")
for link in links {
    println(f"  {link.source} -> {link.target}")
}

kegg_link() takes two arguments: target database and source identifier. It returns a list of records with source and target fields.

PDB — 3D Protein Structures

The Protein Data Bank (PDB) contains experimentally determined 3D structures of proteins, nucleic acids, and their complexes. If you want to see what a protein actually looks like, this is where you go.

Looking Up a Structure

# requires: internet connection

# Get information about BRCA1 BRCT domain structure
let structure = pdb_entry("1JNX")
println(f"Title: {structure.title}")
println(f"Method: {structure.method}")
println(f"Resolution: {structure.resolution}")
println(f"Release date: {structure.release_date}")
println(f"Organism: {structure.organism}")

Expected output (approximate):

Title: Crystal structure of the BRCT repeat region from...
Method: X-RAY DIFFRACTION
Resolution: 2.5
Release date: 2001-07-06
Organism: Homo sapiens

pdb_entry() returns a record with id, title, method, resolution (may be nil for NMR structures), release_date, and organism.

Searching for Structures

# requires: internet connection

# Find all PDB structures related to BRCA1
let pdb_ids = pdb_search("BRCA1")
println(f"PDB structures for BRCA1: {len(pdb_ids)}")
for id in pdb_ids |> take(10) {
    println(f"  {id}")
}

pdb_search() returns a list of PDB ID strings.

Getting Entity and Sequence Information

# requires: internet connection

# Get entity details for a specific chain
let entity = pdb_entity("1JM7", 1)
println(f"Entity type: {entity.entity_type}")
println(f"Description: {entity.description}")

# Get the protein sequence from the structure
let seq = pdb_sequence("1JM7", 1)
println(f"Sequence: {seq}")
println(f"Length: {len(seq)} aa")

STRING — Protein Interactions

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) maps known and predicted protein-protein interactions. Understanding which proteins interact is crucial for interpreting experimental results.

Getting an Interaction Network

# requires: internet connection

# Get interaction partners for BRCA1
# string_network takes a list of protein identifiers and a species taxonomy ID
let network = string_network(["BRCA1"], 9606)
println(f"Interaction partners: {len(network)}")

# Show top interactors by score
let top = network
    |> sort_by(|n| n.score)
    |> reverse()
    |> take(5)

for partner in top {
    println(f"  {partner.protein_a} <-> {partner.protein_b}: score={partner.score}")
}

Note that string_network() takes a list of protein identifiers (not a single string) and a species taxonomy ID. Common taxonomy IDs: 9606 (human), 10090 (mouse), 7955 (zebrafish), 6239 (C. elegans), 7227 (D. melanogaster).

Each interaction record has protein_a, protein_b, and score fields. The score ranges from 0 to 1, where higher scores indicate more confident interactions.
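In practice the score is often used as a cutoff. STRING itself conventionally treats 0.4 as medium and 0.7 as high confidence. The sketch below shows the filtering step in plain Python (not BioLang) on made-up records that mirror the `protein_a`, `protein_b`, and `score` fields. The helper name `confident_partners` and the default threshold are illustrative choices.

```python
def confident_partners(interactions, min_score=0.7):
    """Keep interactions at or above min_score, highest score first."""
    kept = [i for i in interactions if i["score"] >= min_score]
    return sorted(kept, key=lambda i: i["score"], reverse=True)


# Made-up records for illustration, not real STRING output.
interactions = [
    {"protein_a": "GENE_X", "protein_b": "PARTNER_1", "score": 0.95},
    {"protein_a": "GENE_X", "protein_b": "PARTNER_2", "score": 0.42},
    {"protein_a": "GENE_X", "protein_b": "PARTNER_3", "score": 0.71},
]
high_confidence = confident_partners(interactions)
for hit in high_confidence:
    print(hit["protein_b"], hit["score"])
```

With the default threshold only PARTNER_1 and PARTNER_3 remain. Lowering `min_score` to 0.4 would keep all three.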

Functional Enrichment

# requires: internet connection

# Check if a set of genes is enriched for specific functions
let enrichment = string_enrichment(["BRCA1", "BRCA2", "RAD51", "TP53", "ATM"], 9606)
println(f"Enriched terms: {len(enrichment)}")
for e in enrichment |> take(5) {
    println(f"  [{e.category}] {e.description}: p={e.p_value}, FDR={e.fdr}")
}

string_enrichment() takes a list of gene symbols and a species taxonomy ID. It returns a list of enrichment records with category, term, description, gene_count, p_value, and fdr.

Gene Ontology and Reactome

Gene Ontology (GO)

The Gene Ontology provides a standardized vocabulary for describing gene function across all organisms. Every GO term belongs to one of three namespaces:

Molecular Function — what the protein does (e.g., “kinase activity”)
Biological Process — what pathway it participates in (e.g., “DNA repair”)
Cellular Component — where in the cell it acts (e.g., “nucleus”)

# requires: internet connection

# Look up a specific GO term
let term = go_term("GO:0006281")
println(f"ID: {term.id}")
println(f"Name: {term.name}")
println(f"Aspect: {term.aspect}")
println(f"Definition: {term.definition}")

Expected output:

ID: GO:0006281
Name: DNA repair
Aspect: biological_process
Definition: The process of restoring DNA after damage...

GO Annotations for a Gene

# requires: internet connection

# Get GO annotations for BRCA1 (by UniProt accession)
let annotations = go_annotations("P38398")
println(f"GO annotations: {len(annotations)}")
for a in annotations |> take(5) {
    println(f"  {a.go_id}: {a.go_name} ({a.aspect})")
    println(f"    Evidence: {a.evidence}")
}

go_annotations() takes a gene/protein identifier and an optional limit (defaults to 25). Each annotation has go_id, go_name, aspect, evidence, and gene_product_id fields.

Navigating the GO Hierarchy

GO terms form a directed acyclic graph (DAG). You can traverse it:

# requires: internet connection

# Find child terms of "DNA repair"
let children = go_children("GO:0006281")
println(f"Child terms of DNA repair: {len(children)}")
for c in children |> take(5) {
    println(f"  {c.id}: {c.name}")
}

# Find parent terms
let parents = go_parents("GO:0006281")
println(f"Parent terms: {len(parents)}")
for p in parents {
    println(f"  {p.id}: {p.name}")
}
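If the parent lookup returns only direct parents (an assumption here, since the return depth is not stated above), collecting every ancestor means walking the graph yourself. Because a term in a DAG can be reached along several paths, the walk has to remember which terms it has already visited. The sketch below shows this in plain Python (not BioLang) on a toy graph. The term names and the helper `all_ancestors` are hypothetical.

```python
from collections import deque


def all_ancestors(term, parents_of):
    """Breadth-first walk up a DAG, visiting each ancestor exactly once."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Toy graph: D has two parents (B and C), which share the parent A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
ancestors = all_ancestors("D", toy_parents)
print(sorted(ancestors))  # ['A', 'B', 'C']
```

The shared ancestor A is reported once even though it is reachable through both B and C.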

Reactome — Biological Pathways

Reactome is a curated database of biological pathways and reactions, maintained by EBI and the Ontario Institute for Cancer Research.

# requires: internet connection

# Find pathways involving BRCA1
let pathways = reactome_pathways("BRCA1")
println(f"Reactome pathways: {len(pathways)}")
for p in pathways |> take(5) {
    println(f"  {p.id}: {p.name} ({p.species})")
}

reactome_pathways() takes a gene symbol and an optional species (defaults to "Homo sapiens"). It returns a list of pathway records with id, name, and species.

You can also search Reactome by keyword:

# requires: internet connection

let results = reactome_search("DNA damage response")
println(f"Search results: {len(results)}")

Combining Multiple Databases

The real power of programmatic database access is cross-referencing. A single gene symbol unlocks information across every database from one script. What would take hours of browser-tab switching takes a single short function.

A Complete Gene Profile

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

fn gene_profile(symbol) {
    println(f"\n{'=' * 50}")
    println(f"  Gene Profile: {symbol}")
    println(f"{'=' * 50}")

    # NCBI: basic gene info
    let gene = ncbi_gene(symbol)
    println(f"\n[NCBI Gene]")
    println(f"  Description: {gene.description}")
    println(f"  Chromosome: {gene.chromosome}")
    println(f"  Location: {gene.location}")

    # Ensembl: genomic coordinates
    let ens = ensembl_symbol("homo_sapiens", symbol)
    println(f"\n[Ensembl]")
    println(f"  ID: {ens.id}")
    println(f"  Biotype: {ens.biotype}")
    println(f"  Position: chr{ens.chromosome}:{ens.start}-{ens.end}")

    # Ensembl: protein sequence
    let protein = ensembl_sequence(ens.id, "protein")
    println(f"  Protein: {len(protein.seq)} amino acids")

    # UniProt: function
    let results = uniprot_search(f"{symbol} AND organism_name:human", 1)
    if len(results) > 0 {
        let entry = results |> first()
        println(f"\n[UniProt]")
        println(f"  Accession: {entry.accession}")
        println(f"  Name: {entry.name}")
        println(f"  Function: {entry.function}")
    }

    # STRING: interactions
    let network = string_network([symbol], 9606)
    println(f"\n[STRING]")
    println(f"  Interaction partners: {len(network)}")
    let top3 = network
        |> sort_by(|n| n.score)
        |> reverse()
        |> take(3)
    for partner in top3 {
        println(f"    {partner.protein_b}: {partner.score}")
    }

    # PDB: structures
    let structures = pdb_search(symbol)
    println(f"\n[PDB]")
    println(f"  Available structures: {len(structures)}")

    # Reactome: pathways
    let pathways = reactome_pathways(symbol)
    println(f"\n[Reactome]")
    println(f"  Pathways: {len(pathways)}")
    for p in pathways |> take(3) {
        println(f"    {p.name}")
    }

    sleep(1)  # respect rate limits between genes
}

Profiling Multiple Genes

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

# Profile a set of cancer-related genes
let cancer_genes = ["BRCA1", "TP53", "EGFR"]
for gene in cancer_genes {
    gene_profile(gene)
}

This is the kind of analysis that is impractical to do manually but trivial with API calls. Three genes, six databases each, complete profiles in under a minute.

Building a Comparison Table

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

# Collect structured data for comparison
let genes = ["BRCA1", "TP53", "EGFR", "KRAS", "MYC"]
let rows = []
for symbol in genes {
    let gene = ncbi_gene(symbol)
    let ens = ensembl_symbol("homo_sapiens", symbol)
    let protein = ensembl_sequence(ens.id, "protein")
    let network = string_network([symbol], 9606)
    let pathways = reactome_pathways(symbol)

    rows = push(rows, {
        gene: symbol,
        chromosome: gene.chromosome,
        protein_length: len(protein.seq),
        interactions: len(network),
        pathways: len(pathways)
    })

    sleep(1)  # STRING allows about 1 request/second
}

let results = rows |> to_table()
println(results)

Expected output (approximate):

gene   | chromosome | protein_length | interactions | pathways
-------|------------|----------------|--------------|--------
BRCA1  | 17         | 1863           | 10           | 25
TP53   | 17         | 393            | 10           | 18
EGFR   | 7          | 1210           | 10           | 30
KRAS   | 12         | 189            | 10           | 22
MYC    | 8          | 439            | 10           | 15

Rate Limiting and Best Practices

Biological databases are shared public resources. Hammering them with thousands of requests per second will get your IP temporarily blocked — and slow down the service for everyone.

Rate Limits by Database

Database	Rate Limit	With API Key
NCBI	3 requests/second	10/second with `NCBI_API_KEY`
Ensembl	15 requests/second	—
UniProt	Reasonable use (no hard limit)	—
KEGG	10 requests/second	—
PDB	No published limit	—
STRING	1 request/second	—
Gene Ontology (QuickGO)	10 requests/second	—
Reactome	No published limit	—
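A fixed `sleep()` works, but the limits in this table can also be enforced directly by spacing calls at least 1/rate seconds apart. The sketch below shows one way to do that in plain Python (not BioLang). The `Throttle` class is a hypothetical helper, not part of any library, and the injectable `clock` and `sleep` parameters exist only to make it easy to test.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


ncbi_throttle = Throttle(max_per_second=3)  # NCBI limit without an API key
for symbol in ["BRCA1", "TP53", "EGFR"]:
    ncbi_throttle.wait()
    # A real script would issue the database request here.
    print("ready to query", symbol)
```

Unlike a fixed delay, this only sleeps for the time still owed, so slow responses do not add unnecessary waiting on top.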

Setting Up API Keys

NCBI strongly recommends registering for an API key. It is free and takes 30 seconds:

Go to ncbi.nlm.nih.gov/account/settings
Click “Create an API Key”
Set the environment variable:

export NCBI_API_KEY="your_key_here"

BioLang automatically detects and uses the NCBI_API_KEY environment variable for all NCBI calls.

Batch Queries with Rate Limiting

When querying multiple genes, add delays between requests:

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

let genes = ["BRCA1", "TP53", "EGFR", "KRAS", "MYC",
             "PIK3CA", "BRAF", "APC", "RB1", "PTEN"]

let results = []
for gene in genes {
    let info = ncbi_gene(gene)
    results = push(results, {gene: gene, chrom: info.chromosome, desc: info.description})
    sleep(0.5)  # be respectful
}

let results_table = results |> to_table()
println(results_table)

Best Practices

Cache results: if you are going to query the same gene repeatedly during development, save the result to a variable or file instead of calling the API each time.
Use sleep() in loops: add a delay of about 0.5 seconds between requests when iterating over a list of genes, and about 1 second when the loop calls STRING.
Handle errors gracefully: API calls can fail because of network issues, maintenance windows, or invalid identifiers. Use try/catch in production scripts.
Start small: test your query with 2–3 genes before running it on 500.
Set NCBI_API_KEY: it is free and raises your NCBI rate limit from 3 to 10 requests per second.

# requires: internet connection
# optional: NCBI_API_KEY for higher rate limits

# Robust batch query with error handling
let genes = ["BRCA1", "TP53", "INVALID_GENE", "EGFR"]
let results = []
let errors = []

for gene in genes {
    try {
        let info = ncbi_gene(gene)
        # push() returns a new list, so reassign to keep the record
        results = push(results, {gene: gene, chrom: info.chromosome})
    } catch e {
        errors = push(errors, {gene: gene, error: e})
    }
    sleep(0.5)
}

println(f"Successful: {len(results)}")
println(f"Failed: {len(errors)}")
for err in errors {
    println(f"  {err.gene}: {err.error}")
}
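The "cache results" advice can be as simple as a small file that maps each query to its saved answer. The sketch below illustrates the idea in plain Python (not BioLang). The helper `cached_lookup`, the cache file name, and the stand-in `fake_fetch` function are all hypothetical. A real script would pass its own database call instead.

```python
import json
import os
import tempfile


def cached_lookup(key, fetch, cache_path):
    """Return the cached value for key, calling fetch(key) only on a miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if key not in cache:
        cache[key] = fetch(key)
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache[key]


calls = []


def fake_fetch(symbol):
    # Stand-in for a real database call; records how often it is invoked.
    calls.append(symbol)
    return {"symbol": symbol}


cache_file = os.path.join(tempfile.mkdtemp(), "gene_cache.json")
first = cached_lookup("BRCA1", fake_fetch, cache_file)
second = cached_lookup("BRCA1", fake_fetch, cache_file)
print(len(calls))  # 1
```

The second lookup is served from the file, so the fetch function runs only once. Database records change over time, so delete the cache file when you need fresh data.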

Exercises

Gene Lookup: Look up your favorite gene in NCBI using ncbi_gene() and print its chromosome location, description, and summary. Try at least two different genes.
Protein Size Estimation: Use ensembl_symbol() and ensembl_sequence() to get the protein sequence of TP53. Calculate its length and estimate its molecular weight (average amino acid weight is approximately 110 daltons).
UniProt Search: Search UniProt for "insulin AND organism_name:human" and list the accession numbers and names of the results.
Interaction Network: Use string_network() to find interaction partners for MYC (species 9606). Sort by score and print the top 5.
Multi-Database Report: Write a gene_report(symbol) function that queries at least 3 databases (NCBI, Ensembl, and one other) and returns a summary record with fields like chromosome, protein_length, num_interactions, and num_pathways. Test it on EGFR and KRAS.

Key Takeaways

BioLang has built-in clients for 12+ biological databases — no packages to install, no JSON to parse.
NCBI is the central repository for sequences, genes, and literature. ncbi_gene() is often your starting point.
Ensembl provides gene models, coordinates, and the invaluable Variant Effect Predictor (ensembl_vep()).
UniProt is the authoritative source for protein function, domains, and curated annotations.
KEGG connects genes to metabolic and signaling pathways. Use kegg_link() to find pathway memberships.
PDB gives you 3D protein structures. STRING maps protein-protein interaction networks.
GO and Reactome provide functional annotations and biological pathway context.
Combining databases gives a complete picture no single source provides. A single function can profile a gene across six databases.
Respect rate limits: use sleep() in batch queries, set NCBI_API_KEY for NCBI, and cache results when possible.
All API functions require internet access. Some need API keys: NCBI (optional, recommended), COSMIC (required).

What’s Next

Tomorrow we move from fetching data to organizing it. Day 10: Tables — The Bioinformatician’s Workbench covers selecting, filtering, joining, and reshaping tabular data — the format that most bioinformatics analysis ultimately lives in.

Practical Bioinformatics in 30 Days