Genomic Intervals
Genomic intervals are a fundamental abstraction in bioinformatics, representing
regions on a chromosome or contig. BioLang provides the Interval type
as a first-class value with set-theoretic operations, an implicit interval tree for
efficient overlap queries, and windowing functions for sliding-window analyses.
Creating Intervals
Use the interval() constructor to create individual intervals. Coordinates
are zero-based, half-open (matching BED convention):
# Basic interval: chromosome, start, end
let region = interval("chr1", 1000, 2000)
# With strand information (4th positional argument)
let exon = interval("chr1", 1000, 2000, "+")
let antisense = interval("chr1", 1000, 2000, "-")
# Access fields
print(region.chrom) # => "chr1"
print(region.start) # => 1000
print(region.end) # => 2000
print(exon.strand) # => "+"
# Width/length
region |> len() # => 1000
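To make the zero-based, half-open convention concrete, here is a minimal Python sketch (not BioLang code). The `Interval` class and its `width` and `contains` helpers are hypothetical names used only for illustration. The start is included and the end is excluded, so the width is simply `end - start`.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for a zero-based, half-open interval."""
    chrom: str
    start: int  # first base covered (inclusive)
    end: int    # one past the last base covered (exclusive)

    def width(self) -> int:
        # Half-open coordinates need no +1 correction.
        return self.end - self.start

    def contains(self, pos: int) -> bool:
        # start is inside the interval, end is not.
        return self.start <= pos < self.end


region = Interval("chr1", 1000, 2000)
print(region.width())         # 1000
print(region.contains(1000))  # True
print(region.contains(2000))  # False
```

One consequence of this convention is that two intervals where one ends at the position the other starts share no base.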
Intersect
intersect() returns the overlapping portion of two intervals. If they do
not overlap, it returns nil. Applied to two tables, as in the example below,
it returns the pairwise overlaps as a table:
# Intersect two BED tables (tables with chrom, start, end columns)
let exons = read_bed("data/regions.bed")
let peaks = read_bed("data/regions.bed")
# Find all pairwise intersections
let shared = intersect(exons, peaks)
print(nrow(shared), "overlapping regions")
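The overlap rule for a single pair of intervals can be sketched in Python as follows. This is an illustration of the arithmetic, not BioLang's implementation. The function name `intersect_pair` and the `(chrom, start, end)` tuple layout are assumptions made for the example, and `None` plays the role of nil.

```python
def intersect_pair(a, b):
    """Return the overlap of two half-open (chrom, start, end) tuples, or None."""
    if a[0] != b[0]:
        return None  # different chromosomes never overlap
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    if start >= end:
        return None  # disjoint or merely book-ended: no shared base
    return (a[0], start, end)


print(intersect_pair(("chr1", 1000, 2000), ("chr1", 1500, 2500)))
# ('chr1', 1500, 2000)
print(intersect_pair(("chr1", 1000, 2000), ("chr1", 2000, 3000)))
# None
```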
Merge
merge_intervals() combines overlapping or book-ended (directly adjacent)
intervals into a single contiguous interval. On a table, it collapses every
group of overlapping rows into one row:
# Merge overlapping intervals in a table (collapses overlapping rows)
# The table must have chrom, start, end columns
let regions = read_bed("data/regions.bed")
let merged = merge_intervals(regions)
print(nrow(merged), "merged regions")
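Merging is usually done with a sort followed by a single sweep. The Python sketch below shows that approach under stated assumptions: the name `merge_sorted` and the tuple layout are hypothetical, and whether BioLang uses this exact algorithm internally is not stated in this document.

```python
def merge_sorted(intervals):
    """Merge overlapping or book-ended half-open (chrom, start, end) tuples."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        # start <= last end covers both true overlap and book-ended intervals.
        if last and last[0] == chrom and start <= last[2]:
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr1", 300, 400),  # book-ended with the previous interval
    ("chr1", 500, 600),
    ("chr2", 100, 200),
]
print(merge_sorted(regions))
# [('chr1', 100, 400), ('chr1', 500, 600), ('chr2', 100, 200)]
```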
Subtract
subtract() removes the portion of one interval that overlaps with another.
The result may be zero, one, or two intervals:
# Subtract one table's intervals from another
let exons = read_bed("data/regions.bed")
let repeats = read_bed("data/regions.bed")
# Remove repeat-overlapping portions from exons
let clean_exons = subtract(exons, repeats)
print(nrow(clean_exons), "clean exon regions")
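The "zero, one, or two intervals" outcome is easiest to see for a single pair. The following Python sketch illustrates it. The name `subtract_pair` and the tuple layout are hypothetical, and it models only the coordinate arithmetic, not BioLang's table-level behavior.

```python
def subtract_pair(a, b):
    """Remove b from a; both are half-open (chrom, start, end) tuples.

    Returns a list with zero, one, or two remaining pieces of a.
    """
    chrom, a_start, a_end = a
    if b[0] != chrom or b[1] >= a_end or b[2] <= a_start:
        return [a]  # no overlap: a is untouched
    pieces = []
    if b[1] > a_start:
        pieces.append((chrom, a_start, b[1]))  # part of a left of b
    if b[2] < a_end:
        pieces.append((chrom, b[2], a_end))    # part of a right of b
    return pieces


exon = ("chr1", 1000, 2000)
print(subtract_pair(exon, ("chr1", 1400, 1600)))  # two pieces
print(subtract_pair(exon, ("chr1", 500, 1500)))   # one piece
print(subtract_pair(exon, ("chr1", 0, 3000)))     # zero pieces
```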
Closest
closest() finds the nearest interval in a collection to a query interval.
Returns both the interval and the distance:
# Find closest intervals between two tables
# Returns a table with the closest match for each row in the first table
let queries = read_bed("data/regions.bed")
let genes = read_bed("data/regions.bed")
let result = closest(queries, genes)
print(result)
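As a rough illustration of what "nearest" means, the Python sketch below computes the gap between half-open intervals and picks the candidate with the smallest gap. The distance convention is an assumption (0 for overlapping or book-ended intervals, otherwise the number of bases between them). Tools differ on this detail, and this document does not specify BioLang's choice. The names `distance` and `closest_to` are hypothetical.

```python
def distance(a, b):
    """Gap in bases between two half-open intervals on the same chromosome.

    Assumed convention: overlapping or book-ended intervals have distance 0.
    """
    return max(0, b[1] - a[2], a[1] - b[2])


def closest_to(query, candidates):
    """Return (nearest_interval, distance), or None if the chromosome has no candidates."""
    same_chrom = [c for c in candidates if c[0] == query[0]]
    if not same_chrom:
        return None
    best = min(same_chrom, key=lambda c: distance(query, c))
    return best, distance(query, best)


genes = [("chr1", 100, 400), ("chr1", 2500, 3000), ("chr2", 1000, 2000)]
print(closest_to(("chr1", 1000, 2000), genes))
# (('chr1', 2500, 3000), 500)
```

This linear scan is O(n) per query. It is meant to show the semantics only.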
Interval Tree
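For repeated overlap queries against a large collection, build an interval tree once and reuse it. Each query then takes O(log n + k) time instead of the O(n) of a linear scan, where n is the number of stored intervals and k is the number of overlaps reported: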
# Build an interval tree from a BED file
let tree = read_bed("data/regions.bed") |> interval_tree()
# Query the tree for overlapping intervals
let hits = query_overlaps(tree, "chr1", 50000, 51000)
print(nrow(hits), "annotations overlap the query region")
# Find nearest interval to a region
let nearest = query_nearest(tree, "chr1", 50000, 51000)
print(nearest)
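To show where the O(log n + k) bound comes from, here is a small Python sketch of one common interval-tree design: a balanced binary tree keyed on start, where each node also stores the largest end in its subtree. This is a generic textbook structure for a single chromosome, not a description of BioLang's internals. `Node`, `build`, and `query` are hypothetical names.

```python
class Node:
    def __init__(self, interval, left, right):
        self.interval = interval  # (start, end), half-open
        self.left = left
        self.right = right
        # Largest end coordinate anywhere in this subtree.
        self.max_end = max(
            interval[1],
            left.max_end if left else interval[1],
            right.max_end if right else interval[1],
        )


def build(sorted_intervals):
    """Build a balanced tree from intervals already sorted by start."""
    if not sorted_intervals:
        return None
    mid = len(sorted_intervals) // 2
    return Node(
        sorted_intervals[mid],
        build(sorted_intervals[:mid]),
        build(sorted_intervals[mid + 1:]),
    )


def query(node, start, end, hits):
    """Append every stored interval overlapping [start, end) to hits."""
    if node is None or node.max_end <= start:
        return hits  # nothing in this subtree reaches the query
    query(node.left, start, end, hits)
    s, e = node.interval
    if s < end and start < e:
        hits.append(node.interval)
    if s < end:  # intervals to the right start at or after s
        query(node.right, start, end, hits)
    return hits


intervals = sorted([(50500, 50800), (10, 20), (49000, 50001),
                    (51000, 52000), (60000, 70000)])
tree = build(intervals)
print(query(tree, 50000, 51000, []))
# [(49000, 50001), (50500, 50800)]
```

The `max_end` check is what lets a query skip whole subtrees. A genome-wide index would typically keep one such tree per chromosome.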
Sliding Window Analysis
Use kmers() to extract sliding windows of a given size for
genome-wide analyses:
# Sliding window GC content using kmers
let seq = dna"ATCGATCGATCGATCGATCGATCG"
kmers(seq, 6)
|> map(|k| gc_content(k))
|> to_table()
|> print()
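For readers who want to see what the pipeline computes, the same calculation can be sketched in plain Python. This assumes windows advance one base at a time and that GC content is the fraction of G and C bases in the window. The helper names mirror the BioLang functions but are re-implemented here only for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, stepping one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


seq = "ATCGATCGATCGATCGATCGATCG"
windows = kmers(seq, 6)
gc = [gc_content(w) for w in windows]

print(len(windows))      # 19 windows for a 24-base sequence
print(windows[0])        # ATCGAT
print(round(gc[0], 3))   # 0.333
```

A sequence of length n yields n - k + 1 windows of size k, which is why 24 bases give 19 six-base windows.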
Combining Operations
Interval operations compose naturally with BioLang's pipe syntax:
# Find promoter regions that overlap with ChIP-seq peaks
# but not with known repeat elements
let genes = read_bed("data/regions.bed")
let peaks = read_bed("data/regions.bed")
let repeats = read_bed("data/regions.bed")
# Use flank() to get upstream regions (2kb)
let promoters = flank(genes, 2000)
# Intersect with peaks, then subtract repeats, then merge
let result = intersect(promoters, peaks)
|> subtract(repeats)
|> merge_intervals()
write_bed(result, "clean_promoter_peaks.bed")