Genomic Intervals

Genomic intervals are a fundamental abstraction in bioinformatics, representing regions on a chromosome or contig. BioLang provides the Interval type as a first-class value with set-theoretic operations, an implicit interval tree for efficient overlap queries, and windowing functions for sliding-window analyses.

Creating Intervals

Use the interval() constructor to create individual intervals. Coordinates are zero-based and half-open, matching the BED convention: start is included and end is excluded.

# Basic interval: chromosome, start, end
let region = interval("chr1", 1000, 2000)

# With strand information (4th positional argument)
let exon = interval("chr1", 1000, 2000, "+")
let antisense = interval("chr1", 1000, 2000, "-")

# Access fields
print(region.chrom)    # => "chr1"
print(region.start)    # => 1000
print(region.end)      # => 2000
print(exon.strand)     # => "+"

# Width/length
region |> len()        # => 1000
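
To make the half-open convention concrete, here is a minimal Python sketch (not BioLang, and not its implementation). The Interval class and the to_one_based helper are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int                    # zero-based, included
    end: int                      # excluded
    strand: Optional[str] = None

    def width(self) -> int:
        # Half-open coordinates make the width a plain difference.
        return self.end - self.start


def to_one_based(iv: Interval) -> tuple:
    # The same region in one-based, fully closed coordinates.
    return (iv.start + 1, iv.end)


region = Interval("chr1", 1000, 2000)
print(region.width())        # 1000
print(to_one_based(region))  # (1001, 2000)
```

Because end is excluded, two intervals such as [1000, 2000) and [2000, 3000) touch without sharing a base.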

Intersect

intersect() returns the overlapping portion of two intervals, or nil if they do not overlap. Applied to two tables, it returns the pairwise intersections of their rows:

# Intersect two BED tables (tables with chrom, start, end columns)
let exons = read_bed("data/regions.bed")
let peaks = read_bed("data/regions.bed")

# Find all pairwise intersections
let shared = intersect(exons, peaks)
print(nrow(shared), "overlapping regions")
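
The pairwise rule described above can be sketched in Python as follows. This illustrates the half-open arithmetic only; intersect_pair is a hypothetical helper, and intervals are assumed to be (chrom, start, end) tuples.

```python
def intersect_pair(a, b):
    """Return the overlap of two (chrom, start, end) intervals, or None."""
    if a[0] != b[0]:
        return None                 # different chromosomes never overlap
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    if start >= end:
        return None                 # disjoint or merely book-ended
    return (a[0], start, end)


print(intersect_pair(("chr1", 1000, 2000), ("chr1", 1500, 2500)))
print(intersect_pair(("chr1", 1000, 2000), ("chr1", 2000, 3000)))
```

The first call yields the shared region from 1500 to 2000. The second yields None, because book-ended half-open intervals have no base in common.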

Merge

merge_intervals() combines overlapping or book-ended intervals into a single contiguous interval. Applied to a table, it collapses every group of overlapping rows:

# Merge overlapping intervals in a table (collapses overlapping rows)
# The table must have chrom, start, end columns
let regions = read_bed("data/regions.bed")
let merged = merge_intervals(regions)
print(nrow(merged), "merged regions")
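
A common way to implement this kind of merge is a single sweep over sorted intervals. The Python sketch below shows that approach under the assumption that intervals are (chrom, start, end) tuples. The name merge_sorted is hypothetical, and this is not necessarily how BioLang does it internally.

```python
def merge_sorted(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        # start <= last end also catches book-ended neighbours.
        if last and last[0] == chrom and start <= last[2]:
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [
    ("chr1", 150, 300), ("chr1", 100, 200), ("chr1", 300, 400),
    ("chr1", 500, 600), ("chr2", 100, 200),
]
print(merge_sorted(regions))
```

Here the first three chr1 intervals collapse into one region from 100 to 400, while the remaining two stay separate.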

Subtract

subtract() removes the portion of one interval that overlaps with another. The result may be zero, one, or two intervals:

# Subtract one table's intervals from another
let exons = read_bed("data/regions.bed")
let repeats = read_bed("data/regions.bed")

# Remove repeat-overlapping portions from exons
let clean_exons = subtract(exons, repeats)
print(nrow(clean_exons), "clean exon regions")
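
The "zero, one, or two intervals" outcome is easiest to see for a single pair. The Python sketch below assumes (chrom, start, end) tuples; subtract_pair is a hypothetical helper, not a BioLang function.

```python
def subtract_pair(a, b):
    """Remove from a the part covered by b; returns a list of 0-2 intervals."""
    chrom, a_start, a_end = a
    if b[0] != chrom or b[2] <= a_start or b[1] >= a_end:
        return [a]                              # no overlap: a is untouched
    pieces = []
    if b[1] > a_start:
        pieces.append((chrom, a_start, b[1]))   # part left of b
    if b[2] < a_end:
        pieces.append((chrom, b[2], a_end))     # part right of b
    return pieces


exon = ("chr1", 1000, 2000)
print(subtract_pair(exon, ("chr1", 1400, 1600)))  # two pieces
print(subtract_pair(exon, ("chr1", 500, 1500)))   # one piece
print(subtract_pair(exon, ("chr1", 0, 3000)))     # nothing left
```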

Closest

closest() finds the interval in a collection that is nearest to a query interval. It returns both the interval and the distance:

# Find closest intervals between two tables
# Returns a table with the closest match for each row in the first table
let queries = read_bed("data/regions.bed")
let genes = read_bed("data/regions.bed")

let result = closest(queries, genes)
print(result)
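
As an illustration of what "nearest" can mean, the Python sketch below measures distance as the number of bases in the gap between two half-open intervals, so overlapping and book-ended intervals both get distance 0. That definition is an assumption made for this sketch; tools differ on it, and the source does not state which one BioLang uses. The helper names gap and closest_pair are hypothetical.

```python
def gap(a, b):
    """Bases between two same-chromosome (chrom, start, end) intervals."""
    return max(0, a[1] - b[2], b[1] - a[2])


def closest_pair(query, candidates):
    """Return (nearest interval, distance), or None if the chromosome is empty."""
    same_chrom = [c for c in candidates if c[0] == query[0]]
    if not same_chrom:
        return None
    best = min(same_chrom, key=lambda c: gap(query, c))
    return best, gap(query, best)


genes = [("chr1", 100, 400), ("chr1", 2500, 3000), ("chr2", 1000, 2000)]
print(closest_pair(("chr1", 1000, 2000), genes))
```

This linear scan is fine for small tables; for large collections an index such as the interval tree described in the next section is the usual choice.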

Interval Tree

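For repeated overlap queries against a large collection, build an interval tree. Each query then takes O(log n + k) time, where n is the number of stored intervals and k is the number of hits, instead of the O(n) of a linear scan: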

# Build an interval tree from a BED file
let tree = read_bed("data/regions.bed") |> interval_tree()

# Query the tree for overlapping intervals
let hits = query_overlaps(tree, "chr1", 50000, 51000)
print(nrow(hits), "annotations overlap the query region")

# Find nearest interval to a region
let nearest = query_nearest(tree, "chr1", 50000, 51000)
print(nearest)
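
To show why a tree avoids scanning every interval, here is a small Python sketch of one common variant: a balanced binary search tree keyed on start and augmented with the maximum end in each subtree. It handles a single chromosome and is only an illustration; the source does not say which tree variant BioLang uses, and the names Node, build, and query are hypothetical.

```python
class Node:
    def __init__(self, intervals):
        # intervals: non-empty list of (start, end), sorted by start
        mid = len(intervals) // 2
        self.start, self.end = intervals[mid]
        self.left = Node(intervals[:mid]) if mid > 0 else None
        right = intervals[mid + 1:]
        self.right = Node(right) if right else None
        # Largest end anywhere in this subtree, used for pruning.
        self.max_end = max(
            [self.end] + [c.max_end for c in (self.left, self.right) if c]
        )


def build(intervals):
    return Node(sorted(intervals)) if intervals else None


def query(node, qstart, qend, hits=None):
    """Collect intervals overlapping the half-open range [qstart, qend)."""
    hits = [] if hits is None else hits
    if node is None or node.max_end <= qstart:
        return hits                      # whole subtree ends before the query
    query(node.left, qstart, qend, hits)
    if node.start < qend and qstart < node.end:
        hits.append((node.start, node.end))
    if node.start < qend:                # otherwise the right side starts too late
        query(node.right, qstart, qend, hits)
    return hits


tree = build([(100, 200), (150, 400), (300, 350), (500, 600), (550, 900)])
print(query(tree, 340, 520))
```

The two pruning checks are what let a query skip subtrees that cannot contain a hit. A multi-chromosome index would typically keep one such tree per chromosome.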

Sliding Window Analysis

Use kmers() to extract sliding windows of a given size for genome-wide analyses:

# Sliding window GC content using kmers
let seq = dna"ATCGATCGATCGATCGATCGATCG"
kmers(seq, 6)
  |> map(|k| gc_content(k))
  |> to_table()
  |> print()
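
For comparison, the same sliding-window idea written out in Python, with each window's half-open coordinates kept alongside its GC fraction. The function names and the step parameter are illustrative assumptions, not part of BioLang.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def sliding_gc(seq, size, step=1):
    """Return (start, end, gc) for every full window of the given size."""
    return [
        (i, i + size, gc_fraction(seq[i:i + size]))
        for i in range(0, len(seq) - size + 1, step)
    ]


windows = sliding_gc("ATCGATCGATCGATCGATCGATCG", 6)
print(len(windows))   # 19 windows for a 24-base sequence
for start, end, gc in windows[:4]:
    print(start, end, round(gc, 3))
```

A sequence of length n yields n - size + 1 windows at step 1; a larger step trades resolution for fewer windows.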

Combining Operations

Interval operations compose naturally with BioLang's pipe syntax:

# Find promoter regions that overlap with ChIP-seq peaks
# but not with known repeat elements
let genes = read_bed("data/regions.bed")
let peaks = read_bed("data/regions.bed")
let repeats = read_bed("data/regions.bed")

# Use flank() to get upstream regions (2kb)
let promoters = flank(genes, 2000)

# Intersect with peaks, then subtract repeats, then merge
let result = intersect(promoters, peaks)
  |> subtract(repeats)
  |> merge_intervals()
write_bed(result, "clean_promoter_peaks.bed")
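
The upstream step depends on strand: upstream of a "+" gene lies before its start, while upstream of a "-" gene lies after its end. The Python sketch below shows that rule in half-open coordinates. Whether BioLang's flank() is strand-aware is not stated in the source, so upstream_flank is a hypothetical helper that illustrates the concept only.

```python
def upstream_flank(chrom, start, end, strand, size):
    """Return the region of the given size immediately upstream of a gene."""
    if strand == "-":
        # Minus strand: upstream is to the right of the gene.
        # No chromosome length is known here, so the end is not clamped.
        return (chrom, end, end + size)
    # Plus or unstranded: upstream is to the left, clamped at position 0.
    return (chrom, max(0, start - size), start)


print(upstream_flank("chr1", 5000, 8000, "+", 2000))
print(upstream_flank("chr1", 5000, 8000, "-", 2000))
```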