Day 18: Genomic Coordinates and Intervals


Difficulty	Intermediate
Biology knowledge	Intermediate (genomic coordinates, exons, variants)
Coding knowledge	Intermediate (records, pipes, lambda functions, interval trees)
Time	~3 hours
Prerequisites	Days 1-17 completed, BioLang installed (see Appendix A)
Data needed	Generated by `init.bl` (exons BED, variants VCF, annotations GFF)

What You’ll Learn

Why coordinate systems are the #1 source of bioinformatics bugs
The difference between 0-based half-open (BED) and 1-based inclusive (VCF, GFF) coordinates
How to create and manipulate genomic intervals
How interval trees enable fast overlap queries on millions of regions
How to filter variants by genomic region (exonic vs intronic)
How to read and write BED files, and work with GFF annotations

Your exome capture kit targets 200,000 regions. Your variant caller found 50,000 variants. Which variants fall inside targeted regions? Which exons overlap regulatory elements? Genomic interval operations answer these questions in milliseconds.

Genomic coordinates are deceptively simple — a chromosome name, a start position, and an end position. But the way those positions are counted differs between file formats, and getting it wrong means your analysis is off by one base. That one base can be the difference between “variant in exon” and “variant in intron.” At genome scale, you cannot check these by eye. You need fast, correct interval operations.

Coordinate Systems

This is the single most important concept in this chapter. If you get coordinates wrong, every downstream analysis is silently incorrect.

Position:   1  2  3  4  5  6  7  8  9 10
Sequence:   A  T  C  G  A  T  C  G  A  T

1-based inclusive (VCF, GFF, SAM):
  "positions 3-7" = C G A T C  (5 bases)
  start=3, end=7, length = end - start + 1 = 5

0-based half-open (BED, BAM index, UCSC):
  "positions 2-7" = C G A T C  (5 bases, same region!)
  start=2, end=7, length = end - start = 5
  start is included, end is EXCLUDED

The key rules:

Format	System	Start	End	Length formula
BED	0-based half-open	Included	Excluded	end - start
VCF	1-based inclusive	Included	Included	end - start + 1
GFF/GTF	1-based inclusive	Included	Included	end - start + 1
SAM	1-based inclusive	Included	Included	end - start + 1
BAM index	0-based half-open	Included	Excluded	end - start

The same five bases (CGATC) are represented as:

BED: chr1 2 7 (start at 2, end at 7, end excluded)
VCF: chr1 3 (position 3, 1-based)
GFF: chr1 3 7 (start at 3, end at 7, both included)

Why half-open intervals? They have nice mathematical properties: the length is simply end - start, adjacent intervals share an endpoint without overlapping (e.g., [0,5) and [5,10) cover positions 0-9 with no gap or overlap), and the empty interval is [n,n).

Creating Intervals

BioLang has a native Interval type for genomic coordinates. Intervals use 0-based half-open coordinates internally, matching BED format.

# BioLang intervals
let brca1 = interval("chr17", 43044295, 43125483)
let tp53 = interval("chr17", 7668402, 7687550)

println(f"BRCA1: {brca1}")
println(f"  Chromosome: {brca1.chrom}")
println(f"  Start: {brca1.start}")
println(f"  End: {brca1.end}")
println(f"  Length: {brca1.end - brca1.start} bp")

Expected output:

BRCA1: chr17:43044295-43125483
  Chromosome: chr17
  Start: 43044295
  End: 43125483
  Length: 81188 bp

You can also attach strand information:

# With strand information
let gene = interval("chr17", 43044295, 43125483, strand: "+")
println(f"Strand: {gene.strand}")

Expected output:

Strand: +

The strand indicates which DNA strand the feature is on: + for forward, - for reverse. BRCA1 is on the minus strand, but for interval arithmetic the strand does not affect overlap calculations.

Reading BED Files as Intervals

BED (Browser Extensible Data) files store genomic regions. Each line has at minimum three tab-separated columns: chromosome, start, end.

# requires: data/exons.bed in working directory
let exons = read_bed("data/exons.bed")
println(f"Exon regions: {len(exons)}")

# Convert to intervals
let intervals = exons |> map(|r| interval(r.chrom, r.start, r.end))

# Calculate total exonic bases
let total = exons |> map(|r| r.end - r.start) |> reduce(|a, b| a + b)
println(f"Total exonic bases: {total}")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Expected output:

Exon regions: 20
Total exonic bases: 22750

Each record from read_bed has .chrom, .start, and .end fields, plus .name, .score, and .strand if present in the file.

Interval Trees

When you have thousands of regions and thousands of queries, checking every pair for overlap is O(n * m) — far too slow. An interval tree organizes regions into a balanced search structure that answers “what overlaps this query?” in O(log n + k) time, where k is the number of results.

How interval trees help:

Naive approach:
  20,000 exons x 50,000 variants = 1,000,000,000 comparisons

Interval tree:
  Build tree: O(n log n)  = ~300,000 operations
  Per query:  O(log n + k) = ~15 operations + results
  Total:      ~750,000 operations

Speedup: ~1,300x faster

# Build an interval tree for fast queries
let regions = [
    interval("chr17", 43044295, 43050000),
    interval("chr17", 43060000, 43070000),
    interval("chr17", 43080000, 43090000),
    interval("chr17", 43100000, 43125483),
]
let tree = interval_tree(regions)

# Query: what overlaps this region?
let query = interval("chr17", 43065000, 43085000)
let hits = query_overlaps(tree, query)
println(f"Overlapping regions: {len(hits)}")

Expected output:

Overlapping regions: 2

The query interval [43065000, 43085000) overlaps two regions: [43060000, 43070000) (overlaps at 43065000-43070000) and [43080000, 43090000) (overlaps at 43080000-43085000).

Overlap Queries

Once you have an interval tree, BioLang provides several query operations:

# Count overlaps (without returning them)
let regions = [
    interval("chr17", 43044295, 43050000),
    interval("chr17", 43060000, 43070000),
    interval("chr17", 43080000, 43090000),
    interval("chr17", 43100000, 43125483),
]
let tree = interval_tree(regions)

let query = interval("chr17", 43065000, 43085000)
let n = count_overlaps(tree, query)
println(f"Number of overlaps: {n}")

Expected output:

Number of overlaps: 2

You can also query many intervals at once:

# Bulk overlaps --- query many intervals at once
let queries = [
    interval("chr17", 43045000, 43046000),
    interval("chr17", 43065000, 43066000),
    interval("chr17", 43095000, 43096000),
]
let results = bulk_overlaps(tree, queries)
for i in range(0, len(queries)) {
    println(f"Query {i}: {len(results[i])} overlaps")
}

Expected output:

Query 0: 1 overlaps
Query 1: 1 overlaps
Query 2: 0 overlaps

Query 0 hits the first region (43044295-43050000), Query 1 hits the second (43060000-43070000), and Query 2 falls in a gap between the third and fourth regions.

To find the closest region when there is no overlap:

# Find nearest non-overlapping interval
let lonely = interval("chr17", 43055000, 43056000)
let nearest = query_nearest(tree, lonely)
println(f"Nearest region: {nearest}")

Expected output:

Nearest region: chr17:43060000-43070000

The interval [43055000, 43056000) does not overlap any region. The closest region is [43060000, 43070000), which starts 4000 bp away.

Practical Example: Variant-in-Region Filtering

The most common interval operation in genomics: classifying variants as exonic or non-exonic. This requires converting between coordinate systems — VCF uses 1-based positions while BED uses 0-based half-open.

# Which variants fall inside exons?
# requires: data/variants.vcf, data/exons.bed in working directory
let variants = read_vcf("data/variants.vcf")
let exons = read_bed("data/exons.bed")

# Build tree from exons
let exon_intervals = exons |> map(|e| interval(e.chrom, e.start, e.end))
let tree = interval_tree(exon_intervals)

# Check each variant
let exonic_variants = variants |> filter(|v| {
    let v_interval = interval(v.chrom, v.pos - 1, v.pos)  # VCF 1-based -> 0-based
    count_overlaps(tree, v_interval) > 0
})

println(f"Total variants: {len(variants)}")
println(f"Exonic variants: {len(exonic_variants)}")
println(f"Intronic/intergenic: {len(variants) - len(exonic_variants)}")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Expected output:

Total variants: 15
Exonic variants: 10
Intronic/intergenic: 5

Notice the coordinate conversion: v.pos - 1 converts VCF’s 1-based position to a 0-based start, and v.pos becomes the exclusive end. This creates a 1-bp interval in BED coordinates that represents the variant position.

Coverage Analysis

Coverage analysis counts how many features (reads, intervals) overlap each position in a region. This is fundamental for assessing sequencing depth.

# Compute read depth across a region
# coverage() takes a list of [start, end] pairs
let reads = [
    [100, 250],
    [150, 300],
    [200, 350],
    [400, 550],
    [420, 600],
]
coverage(reads, "chr1")

Expected output:

  chr1:100-600
  ▂▄▆▆▄▂▁▁▁▁▃▃▁
  max_depth=3  mean_depth=1.4  intervals=5

The coverage() function takes a list of [start, end] pairs and renders a sparkline showing depth across the region. The first three reads overlap at positions 200-250, giving a depth of 3. Positions 350-400 have zero coverage (a gap). This is the same algorithm used by bedtools genomecov.

Coordinate Conversion

Converting between coordinate systems is something you will do constantly. Write explicit conversion functions and use them everywhere — never do ad-hoc +1 or -1 adjustments scattered through your code.

# BED to VCF coordinates (and back)
fn bed_to_vcf(chrom, start, end) {
    # BED: 0-based, half-open -> VCF: 1-based
    {chrom: chrom, pos: start + 1}
}

fn vcf_to_bed(chrom, pos) {
    # VCF: 1-based -> BED: 0-based, half-open
    {chrom: chrom, start: pos - 1, end: pos}
}

# Example
let bed_region = {chrom: "chr17", start: 43044294, end: 43044295}
let vcf_pos = bed_to_vcf(bed_region.chrom, bed_region.start, bed_region.end)
println(f"BED {bed_region.start}-{bed_region.end} -> VCF pos {vcf_pos.pos}")

let vcf_variant = {chrom: "chr17", pos: 43044295}
let bed_coords = vcf_to_bed(vcf_variant.chrom, vcf_variant.pos)
println(f"VCF pos {vcf_variant.pos} -> BED {bed_coords.start}-{bed_coords.end}")

# Verify round-trip
let roundtrip = bed_to_vcf(bed_coords.chrom, bed_coords.start, bed_coords.end)
println(f"Round-trip VCF pos: {roundtrip.pos} (should be {vcf_variant.pos})")

Expected output:

BED 43044294-43044295 -> VCF pos 43044295
VCF pos 43044295 -> BED 43044294-43044295
Round-trip VCF pos: 43044295 (should be 43044295)

The round-trip test is crucial. If you convert BED to VCF and back and do not get the original coordinates, your conversion is wrong.

Working with GFF Annotations

GFF (General Feature Format) files describe genomic features like genes, exons, and regulatory elements. GFF uses 1-based inclusive coordinates.

# requires: data/annotations.gff in working directory
let features = read_gff("data/annotations.gff")

# Find all exons for a specific gene
let brca1_exons = features
    |> filter(|f| f.type == "exon")
    |> filter(|f| contains(str(f), "BRCA1"))

println(f"BRCA1 exons: {len(brca1_exons)}")

# Build interval tree from exons (convert GFF 1-based -> 0-based)
let exon_tree = interval_tree(
    brca1_exons |> map(|e| interval(e.chrom, e.start - 1, e.end))
)
println(f"Interval tree built from {len(brca1_exons)} exons")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Expected output:

BRCA1 exons: 5
Interval tree built from 5 exons

Note the coordinate conversion: GFF start is 1-based, so we subtract 1 to get a 0-based start. GFF end is 1-based inclusive, which happens to equal the 0-based exclusive end (e.g., 1-based position 7 inclusive = 0-based position 7 exclusive), so we use e.end as-is.

Writing BED Files

After filtering or computing intervals, you often need to export results as BED files for downstream tools.

# Export filtered regions
let high_coverage = [
    {chrom: "chr17", start: 43044295, end: 43050000},
    {chrom: "chr17", start: 43100000, end: 43125483},
]
write_bed(high_coverage, "results/high_coverage.bed")
println("Wrote high-coverage regions to BED file")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Expected output:

Wrote high-coverage regions to BED file

The write_bed function writes tab-separated BED format. Each record must have .chrom, .start, and .end fields at minimum. Optional fields (.name, .score, .strand) are included if present.

Complete Example: Exome Coverage Report

This example ties together everything from the chapter: reading BED and VCF files, building interval trees, classifying variants by overlap, and summarizing results.

# Exome Coverage Analysis
# requires: data/exons.bed, data/variants.vcf in working directory

println("=== Exome Coverage Report ===\n")

# Load target regions
let targets = read_bed("data/exons.bed")
let total_target_bp = targets |> map(|r| r.end - r.start) |> reduce(|a, b| a + b)
println(f"Target regions: {len(targets)}")
println(f"Total target bases: {total_target_bp}")

# Build interval tree
let tree = interval_tree(targets |> map(|t| interval(t.chrom, t.start, t.end)))

# Classify variants
let variants = read_vcf("data/variants.vcf")
let on_target = variants |> filter(|v| {
    count_overlaps(tree, interval(v.chrom, v.pos - 1, v.pos)) > 0
}) |> collect()

let off_target = len(variants) - len(on_target)
println(f"\nVariant classification:")
println(f"  On-target:  {len(on_target)}")
println(f"  Off-target: {off_target}")
println(f"  On-target rate: {round(len(on_target) / len(variants) * 100, 1)}%")

# Per-chromosome summary
let by_chrom = on_target
    |> to_table()
    |> group_by("chrom")
    |> summarize(|chrom, rows| {chrom: chrom, n: len(rows)})
println(f"\nOn-target variants per chromosome:")
println(by_chrom)

write_csv(on_target |> to_table(), "results/on_target_variants.csv")
println("\nResults saved")
println("\n=== Report complete ===")

Requires CLI: This example uses file I/O / network APIs not available in the browser. Run with bl run.

Expected output:

=== Exome Coverage Report ===

Target regions: 20
Total target bases: 22750

Variant classification:
  On-target:  10
  Off-target: 5
  On-target rate: 66.7%

On-target variants per chromosome:
chrom | n
chr17 | 10

Results saved

=== Report complete ===

Exercises

Gene overlap query. Create intervals for 5 genes on chr17 and build an interval tree. Query which genes overlap the region chr17:43050000-43090000.
Coordinate conversion. Convert these VCF positions to BED coordinates and verify each conversion round-trips correctly: chr1:100, chr2:500, chr7:1000, chrX:2500, chr17:43044295.
Per-chromosome region size. Read data/exons.bed and calculate the mean exon size per chromosome using group_by and summarize.
Promoter variant detection. Define a promoter as the 1000 bp region upstream of a gene start. Given 5 gene start positions, build an interval tree of promoter regions and find which variants from data/variants.vcf fall in promoter regions.
Coverage depth histogram. Given a list of 10 overlapping read intervals, compute coverage using coverage() and find the maximum depth and the total number of bases at each depth level.

Key Takeaways

Coordinate systems (0-based vs 1-based) are the #1 source of bioinformatics bugs — always convert explicitly
BED = 0-based half-open, VCF/GFF = 1-based inclusive — these describe the same biology differently
Interval trees enable O(log n) overlap queries on millions of regions
interval_tree() + query_overlaps() is the core pattern for genomic region analysis
Coverage analysis shows read depth across genomic regions
Always validate coordinate conversions with known examples and round-trip tests
Write explicit conversion functions (bed_to_vcf, vcf_to_bed) — never scatter ad-hoc +1/-1 adjustments

What’s Next

Tomorrow we tackle biological data visualization — Manhattan plots, ideograms, genome tracks, and more. Visualization turns the numbers from today’s interval analysis into figures that tell a story.

Practical Bioinformatics in 30 Days

Day 18: Genomic Coordinates and Intervals

What You’ll Learn

The Problem

Coordinate Systems

Creating Intervals

Reading BED Files as Intervals

Interval Trees

Overlap Queries

Practical Example: Variant-in-Region Filtering

Coverage Analysis

Coordinate Conversion

Working with GFF Annotations

Writing BED Files

Complete Example: Exome Coverage Report

Exercises

Key Takeaways

What’s Next

Keyboard shortcuts

Practical Bioinformatics in 30 Days