Day 18: Genomic Coordinates and Intervals
| Difficulty | Intermediate |
| Biology knowledge | Intermediate (genomic coordinates, exons, variants) |
| Coding knowledge | Intermediate (records, pipes, lambda functions, interval trees) |
| Time | ~3 hours |
| Prerequisites | Days 1-17 completed, BioLang installed (see Appendix A) |
| Data needed | Generated by init.bl (exons BED, variants VCF, annotations GFF) |
What You’ll Learn
- Why coordinate systems are the #1 source of bioinformatics bugs
- The difference between 0-based half-open (BED) and 1-based inclusive (VCF, GFF) coordinates
- How to create and manipulate genomic intervals
- How interval trees enable fast overlap queries on millions of regions
- How to filter variants by genomic region (exonic vs intronic)
- How to read and write BED files, and work with GFF annotations
The Problem
Your exome capture kit targets 200,000 regions. Your variant caller found 50,000 variants. Which variants fall inside targeted regions? Which exons overlap regulatory elements? Genomic interval operations answer these questions in milliseconds.
Genomic coordinates are deceptively simple — a chromosome name, a start position, and an end position. But the way those positions are counted differs between file formats, and getting it wrong means your analysis is off by one base. That one base can be the difference between “variant in exon” and “variant in intron.” At genome scale, you cannot check these by eye. You need fast, correct interval operations.
Coordinate Systems
This is the single most important concept in this chapter. If you get coordinates wrong, every downstream analysis is silently incorrect.
Position: 1 2 3 4 5 6 7 8 9 10
Sequence: A T C G A T C G A T
1-based inclusive (VCF, GFF, SAM):
"positions 3-7" = C G A T C (5 bases)
start=3, end=7, length = end - start + 1 = 5
0-based half-open (BED, BAM index, UCSC):
"positions 2-7" = C G A T C (5 bases, same region!)
start=2, end=7, length = end - start = 5
start is included, end is EXCLUDED
The key rules:
| Format | System | Start | End | Length formula |
|---|---|---|---|---|
| BED | 0-based half-open | Included | Excluded | end - start |
| VCF | 1-based inclusive | Included | Included | end - start + 1 |
| GFF/GTF | 1-based inclusive | Included | Included | end - start + 1 |
| SAM | 1-based inclusive | Included | Included | end - start + 1 |
| BAM index | 0-based half-open | Included | Excluded | end - start |
The same five bases (CGATC) are represented as:
- BED:
chr1 2 7(start at 2, end at 7, end excluded) - VCF:
chr1 3(position 3, 1-based) - GFF:
chr1 3 7(start at 3, end at 7, both included)
Why half-open intervals? They have nice mathematical properties: the length is simply end - start, adjacent intervals share an endpoint without overlapping (e.g., [0,5) and [5,10) cover positions 0-9 with no gap or overlap), and the empty interval is [n,n).
Creating Intervals
BioLang has a native Interval type for genomic coordinates. Intervals use 0-based half-open coordinates internally, matching BED format.
# BioLang intervals
let brca1 = interval("chr17", 43044295, 43125483)
let tp53 = interval("chr17", 7668402, 7687550)
println(f"BRCA1: {brca1}")
println(f" Chromosome: {brca1.chrom}")
println(f" Start: {brca1.start}")
println(f" End: {brca1.end}")
println(f" Length: {brca1.end - brca1.start} bp")
Expected output:
BRCA1: chr17:43044295-43125483
Chromosome: chr17
Start: 43044295
End: 43125483
Length: 81188 bp
You can also attach strand information:
# With strand information
let gene = interval("chr17", 43044295, 43125483, strand: "+")
println(f"Strand: {gene.strand}")
Expected output:
Strand: +
The strand indicates which DNA strand the feature is on: + for forward, - for reverse. BRCA1 is on the minus strand, but for interval arithmetic the strand does not affect overlap calculations.
Reading BED Files as Intervals
BED (Browser Extensible Data) files store genomic regions. Each line has at minimum three tab-separated columns: chromosome, start, end.
# requires: data/exons.bed in working directory
let exons = read_bed("data/exons.bed")
println(f"Exon regions: {len(exons)}")
# Convert to intervals
let intervals = exons |> map(|r| interval(r.chrom, r.start, r.end))
# Calculate total exonic bases
let total = exons |> map(|r| r.end - r.start) |> reduce(|a, b| a + b)
println(f"Total exonic bases: {total}")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with
bl run.
Expected output:
Exon regions: 20
Total exonic bases: 22750
Each record from read_bed has .chrom, .start, and .end fields, plus .name, .score, and .strand if present in the file.
Interval Trees
When you have thousands of regions and thousands of queries, checking every pair for overlap is O(n * m) — far too slow. An interval tree organizes regions into a balanced search structure that answers “what overlaps this query?” in O(log n + k) time, where k is the number of results.
How interval trees help:
Naive approach:
20,000 exons x 50,000 variants = 1,000,000,000 comparisons
Interval tree:
Build tree: O(n log n) = ~300,000 operations
Per query: O(log n + k) = ~15 operations + results
Total: ~750,000 operations
Speedup: ~1,300x faster
# Build an interval tree for fast queries
let regions = [
interval("chr17", 43044295, 43050000),
interval("chr17", 43060000, 43070000),
interval("chr17", 43080000, 43090000),
interval("chr17", 43100000, 43125483),
]
let tree = interval_tree(regions)
# Query: what overlaps this region?
let query = interval("chr17", 43065000, 43085000)
let hits = query_overlaps(tree, query)
println(f"Overlapping regions: {len(hits)}")
Expected output:
Overlapping regions: 2
The query interval [43065000, 43085000) overlaps two regions: [43060000, 43070000) (overlaps at 43065000-43070000) and [43080000, 43090000) (overlaps at 43080000-43085000).
Overlap Queries
Once you have an interval tree, BioLang provides several query operations:
# Count overlaps (without returning them)
let regions = [
interval("chr17", 43044295, 43050000),
interval("chr17", 43060000, 43070000),
interval("chr17", 43080000, 43090000),
interval("chr17", 43100000, 43125483),
]
let tree = interval_tree(regions)
let query = interval("chr17", 43065000, 43085000)
let n = count_overlaps(tree, query)
println(f"Number of overlaps: {n}")
Expected output:
Number of overlaps: 2
You can also query many intervals at once:
# Bulk overlaps --- query many intervals at once
let queries = [
interval("chr17", 43045000, 43046000),
interval("chr17", 43065000, 43066000),
interval("chr17", 43095000, 43096000),
]
let results = bulk_overlaps(tree, queries)
for i in range(0, len(queries)) {
println(f"Query {i}: {len(results[i])} overlaps")
}
Expected output:
Query 0: 1 overlaps
Query 1: 1 overlaps
Query 2: 0 overlaps
Query 0 hits the first region (43044295-43050000), Query 1 hits the second (43060000-43070000), and Query 2 falls in a gap between the third and fourth regions.
To find the closest region when there is no overlap:
# Find nearest non-overlapping interval
let lonely = interval("chr17", 43055000, 43056000)
let nearest = query_nearest(tree, lonely)
println(f"Nearest region: {nearest}")
Expected output:
Nearest region: chr17:43060000-43070000
The interval [43055000, 43056000) does not overlap any region. The closest region is [43060000, 43070000), which starts 4000 bp away.
Practical Example: Variant-in-Region Filtering
The most common interval operation in genomics: classifying variants as exonic or non-exonic. This requires converting between coordinate systems — VCF uses 1-based positions while BED uses 0-based half-open.
# Which variants fall inside exons?
# requires: data/variants.vcf, data/exons.bed in working directory
let variants = read_vcf("data/variants.vcf")
let exons = read_bed("data/exons.bed")
# Build tree from exons
let exon_intervals = exons |> map(|e| interval(e.chrom, e.start, e.end))
let tree = interval_tree(exon_intervals)
# Check each variant
let exonic_variants = variants |> filter(|v| {
let v_interval = interval(v.chrom, v.pos - 1, v.pos) # VCF 1-based -> 0-based
count_overlaps(tree, v_interval) > 0
})
println(f"Total variants: {len(variants)}")
println(f"Exonic variants: {len(exonic_variants)}")
println(f"Intronic/intergenic: {len(variants) - len(exonic_variants)}")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with
bl run.
Expected output:
Total variants: 15
Exonic variants: 10
Intronic/intergenic: 5
Notice the coordinate conversion: v.pos - 1 converts VCF’s 1-based position to a 0-based start, and v.pos becomes the exclusive end. This creates a 1-bp interval in BED coordinates that represents the variant position.
Coverage Analysis
Coverage analysis counts how many features (reads, intervals) overlap each position in a region. This is fundamental for assessing sequencing depth.
# Compute read depth across a region
# coverage() takes a list of [start, end] pairs
let reads = [
[100, 250],
[150, 300],
[200, 350],
[400, 550],
[420, 600],
]
coverage(reads, "chr1")
Expected output:
chr1:100-600
▂▄▆▆▄▂▁▁▁▁▃▃▁
max_depth=3 mean_depth=1.4 intervals=5
The coverage() function takes a list of [start, end] pairs and renders a sparkline showing depth across the region. The first three reads overlap at positions 200-250, giving a depth of 3. Positions 350-400 have zero coverage (a gap). This is the same algorithm used by bedtools genomecov.
Coordinate Conversion
Converting between coordinate systems is something you will do constantly. Write explicit conversion functions and use them everywhere — never do ad-hoc +1 or -1 adjustments scattered through your code.
# BED to VCF coordinates (and back)
fn bed_to_vcf(chrom, start, end) {
# BED: 0-based, half-open -> VCF: 1-based
{chrom: chrom, pos: start + 1}
}
fn vcf_to_bed(chrom, pos) {
# VCF: 1-based -> BED: 0-based, half-open
{chrom: chrom, start: pos - 1, end: pos}
}
# Example
let bed_region = {chrom: "chr17", start: 43044294, end: 43044295}
let vcf_pos = bed_to_vcf(bed_region.chrom, bed_region.start, bed_region.end)
println(f"BED {bed_region.start}-{bed_region.end} -> VCF pos {vcf_pos.pos}")
let vcf_variant = {chrom: "chr17", pos: 43044295}
let bed_coords = vcf_to_bed(vcf_variant.chrom, vcf_variant.pos)
println(f"VCF pos {vcf_variant.pos} -> BED {bed_coords.start}-{bed_coords.end}")
# Verify round-trip
let roundtrip = bed_to_vcf(bed_coords.chrom, bed_coords.start, bed_coords.end)
println(f"Round-trip VCF pos: {roundtrip.pos} (should be {vcf_variant.pos})")
Expected output:
BED 43044294-43044295 -> VCF pos 43044295
VCF pos 43044295 -> BED 43044294-43044295
Round-trip VCF pos: 43044295 (should be 43044295)
The round-trip test is crucial. If you convert BED to VCF and back and do not get the original coordinates, your conversion is wrong.
Working with GFF Annotations
GFF (General Feature Format) files describe genomic features like genes, exons, and regulatory elements. GFF uses 1-based inclusive coordinates.
# requires: data/annotations.gff in working directory
let features = read_gff("data/annotations.gff")
# Find all exons for a specific gene
let brca1_exons = features
|> filter(|f| f.type == "exon")
|> filter(|f| contains(str(f), "BRCA1"))
println(f"BRCA1 exons: {len(brca1_exons)}")
# Build interval tree from exons (convert GFF 1-based -> 0-based)
let exon_tree = interval_tree(
brca1_exons |> map(|e| interval(e.chrom, e.start - 1, e.end))
)
println(f"Interval tree built from {len(brca1_exons)} exons")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with
bl run.
Expected output:
BRCA1 exons: 5
Interval tree built from 5 exons
Note the coordinate conversion: GFF start is 1-based, so we subtract 1 to get a 0-based start. GFF end is 1-based inclusive, which happens to equal the 0-based exclusive end (e.g., 1-based position 7 inclusive = 0-based position 7 exclusive), so we use e.end as-is.
Writing BED Files
After filtering or computing intervals, you often need to export results as BED files for downstream tools.
# Export filtered regions
let high_coverage = [
{chrom: "chr17", start: 43044295, end: 43050000},
{chrom: "chr17", start: 43100000, end: 43125483},
]
write_bed(high_coverage, "results/high_coverage.bed")
println("Wrote high-coverage regions to BED file")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with
bl run.
Expected output:
Wrote high-coverage regions to BED file
The write_bed function writes tab-separated BED format. Each record must have .chrom, .start, and .end fields at minimum. Optional fields (.name, .score, .strand) are included if present.
Complete Example: Exome Coverage Report
This example ties together everything from the chapter: reading BED and VCF files, building interval trees, classifying variants by overlap, and summarizing results.
# Exome Coverage Analysis
# requires: data/exons.bed, data/variants.vcf in working directory
println("=== Exome Coverage Report ===\n")
# Load target regions
let targets = read_bed("data/exons.bed")
let total_target_bp = targets |> map(|r| r.end - r.start) |> reduce(|a, b| a + b)
println(f"Target regions: {len(targets)}")
println(f"Total target bases: {total_target_bp}")
# Build interval tree
let tree = interval_tree(targets |> map(|t| interval(t.chrom, t.start, t.end)))
# Classify variants
let variants = read_vcf("data/variants.vcf")
let on_target = variants |> filter(|v| {
count_overlaps(tree, interval(v.chrom, v.pos - 1, v.pos)) > 0
}) |> collect()
let off_target = len(variants) - len(on_target)
println(f"\nVariant classification:")
println(f" On-target: {len(on_target)}")
println(f" Off-target: {off_target}")
println(f" On-target rate: {round(len(on_target) / len(variants) * 100, 1)}%")
# Per-chromosome summary
let by_chrom = on_target
|> to_table()
|> group_by("chrom")
|> summarize(|chrom, rows| {chrom: chrom, n: len(rows)})
println(f"\nOn-target variants per chromosome:")
println(by_chrom)
write_csv(on_target |> to_table(), "results/on_target_variants.csv")
println("\nResults saved")
println("\n=== Report complete ===")
Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with
bl run.
Expected output:
=== Exome Coverage Report ===
Target regions: 20
Total target bases: 22750
Variant classification:
On-target: 10
Off-target: 5
On-target rate: 66.7%
On-target variants per chromosome:
chrom | n
chr17 | 10
Results saved
=== Report complete ===
Exercises
-
Gene overlap query. Create intervals for 5 genes on chr17 and build an interval tree. Query which genes overlap the region chr17:43050000-43090000.
-
Coordinate conversion. Convert these VCF positions to BED coordinates and verify each conversion round-trips correctly: chr1:100, chr2:500, chr7:1000, chrX:2500, chr17:43044295.
-
Per-chromosome region size. Read
data/exons.bedand calculate the mean exon size per chromosome usinggroup_byandsummarize. -
Promoter variant detection. Define a promoter as the 1000 bp region upstream of a gene start. Given 5 gene start positions, build an interval tree of promoter regions and find which variants from
data/variants.vcffall in promoter regions. -
Coverage depth histogram. Given a list of 10 overlapping read intervals, compute coverage using
coverage()and find the maximum depth and the total number of bases at each depth level.
Key Takeaways
- Coordinate systems (0-based vs 1-based) are the #1 source of bioinformatics bugs — always convert explicitly
- BED = 0-based half-open, VCF/GFF = 1-based inclusive — these describe the same biology differently
- Interval trees enable O(log n) overlap queries on millions of regions
interval_tree()+query_overlaps()is the core pattern for genomic region analysis- Coverage analysis shows read depth across genomic regions
- Always validate coordinate conversions with known examples and round-trip tests
- Write explicit conversion functions (
bed_to_vcf,vcf_to_bed) — never scatter ad-hoc+1/-1adjustments
What’s Next
Tomorrow we tackle biological data visualization — Manhattan plots, ideograms, genome tracks, and more. Visualization turns the numbers from today’s interval analysis into figures that tell a story.