Intermediate ~30 min

Tutorial: Knowledge Graphs

In this tutorial you'll build a protein interaction network from scratch, query it for biological insights, and combine it with API data from STRING. You'll learn BioLang's graph builtins and how to use graphs for pathway analysis.

Prerequisites: Familiarity with basic BioLang syntax from Hello Genomics and Tables.

What you'll build

A gene interaction graph for the p53 tumor suppressor pathway. You'll create nodes, add weighted edges, find shortest paths, identify connected components, extract subgraphs, and combine the graph with STRING database scores.

Run this tutorial: Download knowledge-graphs.bl and run it with bl run examples/tutorials/knowledge-graphs.bl

Step 1: Create a graph

Start with an empty graph and add some nodes:

# Create an empty undirected graph
let g = graph()

# Add nodes with attributes
let g = add_node(g, "TP53", {role: "tumor_suppressor", chr: "17p13.1"})
let g = add_node(g, "MDM2", {role: "oncogene", chr: "12q15"})
let g = add_node(g, "CDKN1A", {role: "cdk_inhibitor", chr: "6p21.2"})
let g = add_node(g, "BAX", {role: "pro_apoptotic", chr: "19q13.33"})
let g = add_node(g, "BCL2", {role: "anti_apoptotic", chr: "18q21.33"})

print(f"Nodes: {node_count(g)}")  # 5

Step 2: Add edges

Connect genes with weighted edges representing interaction confidence:

# TP53 interactions
let g = add_edge(g, "TP53", "MDM2", {score: 0.999, type: "inhibition"})
let g = add_edge(g, "TP53", "CDKN1A", {score: 0.998, type: "activation"})
let g = add_edge(g, "TP53", "BAX", {score: 0.992, type: "activation"})

# Apoptosis regulators
let g = add_edge(g, "BAX", "BCL2", {score: 0.976, type: "inhibition"})

# MDM2 feedback
let g = add_edge(g, "MDM2", "CDKN1A", {score: 0.871, type: "inhibition"})

print(f"Edges: {edge_count(g)}")  # 5

Nodes are auto-created when you add edges, so you can skip add_node() if you don't need attributes. But adding attributes makes the graph more useful for downstream analysis.

Step 3: Query the graph

Use graph builtins to explore relationships:

# Direct neighbors
let tp53_partners = neighbors(g, "TP53")
print(f"TP53 interacts with: {tp53_partners}")
# ["BAX", "CDKN1A", "MDM2"]

# Node degree (number of connections)
print(f"TP53 degree: {degree(g, 'TP53')}")    # 3
print(f"BCL2 degree: {degree(g, 'BCL2')}")    # 1

# Check if an edge exists
print(has_edge(g, "TP53", "BAX"))     # true
print(has_edge(g, "BCL2", "MDM2"))    # false

Step 4: Shortest paths

Find the shortest path between any two genes:

# How does BCL2 connect to MDM2?
let path = shortest_path(g, "BCL2", "MDM2")
print(f"BCL2 → MDM2: {path}")
# ["BCL2", "BAX", "TP53", "MDM2"]

# No path between disconnected components returns nil
let g2 = add_node(g, "ISOLATED_GENE", {})
let path2 = shortest_path(g, "ISOLATED_GENE", "TP53")
print(path2)  # nil

Step 5: Network analysis

Analyze the global structure of the network:

# Connected components
let components = connected_components(g)
print(f"Components: {len(components)}")
components |> each(|c| print(f"  {c}"))

# Get all nodes and edges as data
let all_nodes = nodes(g)
print(f"All nodes: {all_nodes}")

let edge_table = edges(g)
print(edge_table)
# Table with columns: source, target, and any edge attributes

Step 6: Extract subgraphs

Pull out a subset of the network for focused analysis:

# Extract the apoptosis subnetwork
let apoptosis_genes = ["TP53", "BAX", "BCL2"]
let sub = subgraph(g, apoptosis_genes)

print(f"Subgraph: {node_count(sub)} nodes, {edge_count(sub)} edges")
# Subgraph: 3 nodes, 2 edges (TP53-BAX, BAX-BCL2)
# The MDM2 and CDKN1A edges are excluded

Step 7: Build from data

In practice you'll build graphs from files or API results, not by hand. Here's how to construct a graph from a CSV edge list:

# requires: interactions.csv in working directory
# Suppose interactions.csv has columns: gene_a, gene_b, score
let edges_data = read_csv("data/expression.csv")

let g = graph()
let g = edges_data |> reduce(g, |g, row| {
  add_edge(g, row.gene_a, row.gene_b, {score: into(row.score, "Float")})
})

print(f"Built graph: {node_count(g)} genes, {edge_count(g)} interactions")

Step 8: Combine with STRING API

Fetch real interaction data from the STRING database and build a graph:

# requires: internet connection
# Fetch TP53 interaction network from STRING
let network = string_network(["TP53"], 9606)

# Build a graph from the API results — each record has {protein_a, protein_b, score}
let g = graph()
let g = network |> reduce(g, |g, row| {
  add_edge(g, row.protein_a, row.protein_b, {
    score: row.score,
    nscore: row.nscore,
    escore: row.escore
  })
})

print(f"STRING network: {node_count(g)} proteins, {edge_count(g)} interactions")

# Find hub genes (highest degree)
let hub_genes = nodes(g)
  |> map(|n| {gene: n, degree: degree(g, n)})
  |> sort_by(|r| r.degree, desc: true)
  |> take(5)

print("Top 5 hub genes:")
hub_genes |> each(|r| print(f"  {r.gene}: {r.degree} connections"))

Step 9: Directed graphs

For regulatory networks where direction matters, create a directed graph:

# Directed graph: edges have a direction
let dg = graph(directed: true)

let dg = add_edge(dg, "TP53", "CDKN1A", {type: "activates"})
let dg = add_edge(dg, "TP53", "BAX", {type: "activates"})
let dg = add_edge(dg, "MDM2", "TP53", {type: "inhibits"})
let dg = add_edge(dg, "TP53", "MDM2", {type: "activates"})

# In directed graphs, neighbors returns outgoing connections only
let tp53_targets = neighbors(dg, "TP53")
print(f"TP53 activates/inhibits: {tp53_targets}")
# ["BAX", "CDKN1A", "MDM2"]

# MDM2 only has one outgoing edge
let mdm2_targets = neighbors(dg, "MDM2")
print(f"MDM2 targets: {mdm2_targets}")
# ["TP53"]

Step 10: Removing nodes and edges

Simulate knockouts by removing genes from the network:

# What happens if we knock out TP53?
let ko = remove_node(g, "TP53")
print(f"After TP53 knockout: {node_count(ko)} nodes, {edge_count(ko)} edges")

# How many components now?
let ko_components = connected_components(ko)
print(f"Components after knockout: {len(ko_components)}")
ko_components |> each(|c| print(f"  {c}"))

# Remove a specific interaction
let g_no_feedback = remove_edge(g, "MDM2", "CDKN1A")
print(f"Removed MDM2-CDKN1A: {edge_count(g_no_feedback)} edges remain")

Complete example

Here's the full analysis as a notebook-ready script:

# p53_network.bl — Build and analyze the p53 interaction network

# Build the graph
let g = graph()
let g = add_edge(g, "TP53", "MDM2", {score: 0.999, type: "inhibition"})
let g = add_edge(g, "TP53", "CDKN1A", {score: 0.998, type: "activation"})
let g = add_edge(g, "TP53", "BAX", {score: 0.992, type: "activation"})
let g = add_edge(g, "BAX", "BCL2", {score: 0.976, type: "inhibition"})
let g = add_edge(g, "MDM2", "CDKN1A", {score: 0.871, type: "inhibition"})
let g = add_edge(g, "CDKN1A", "CDK2", {score: 0.945, type: "inhibition"})
let g = add_edge(g, "CDK2", "RB1", {score: 0.932, type: "phosphorylation"})

print(f"Network: {node_count(g)} genes, {edge_count(g)} interactions")
print(f"Components: {len(connected_components(g))}")

# Hub analysis
let hubs = nodes(g)
  |> map(|n| {gene: n, deg: degree(g, n)})
  |> sort_by(|r| r.deg, desc: true)

print("\nDegree ranking:")
hubs |> each(|r| print(f"  {r.gene}: {r.deg}"))

# Path from cell cycle to apoptosis
let path = shortest_path(g, "CDK2", "BCL2")
print(f"\nCell cycle → apoptosis: {path}")

# TP53 knockout simulation
let ko = remove_node(g, "TP53")
let ko_comp = connected_components(ko)
print(f"\nTP53 knockout: {len(ko_comp)} disconnected components")
ko_comp |> each(|c| print(f"  {c}"))

Builtin reference

Function	Description
graph(directed?)	Create empty graph (undirected by default)
add_node(g, id, attrs?)	Add node with optional attributes
add_edge(g, src, tgt, attrs?)	Add edge (auto-creates nodes)
remove_node(g, id)	Remove node and its edges
remove_edge(g, src, tgt)	Remove specific edge
neighbors(g, id)	List of adjacent nodes
degree(g, id)	Number of connections
has_edge(g, src, tgt)	Check if edge exists
nodes(g)	List of all node IDs
edges(g)	Table of all edges
node_count(g)	Number of nodes
edge_count(g)	Number of edges
shortest_path(g, src, tgt)	BFS shortest path (list or nil)
connected_components(g)	List of component node lists
subgraph(g, node_list)	Extract induced subgraph

What's next

Enrichment Analysis — combine graphs with gene set enrichment
Knowledge Graphs reference — full builtin documentation
Querying Databases — fetch data from NCBI, Ensembl, and more