Day 25: Error Handling in Production
| | |
| --- | --- |
| Difficulty | Intermediate–Advanced |
| Biology knowledge | Intermediate (FASTQ quality scores, FASTA format, sequence data) |
| Coding knowledge | Intermediate (functions, records, pipes, tables, file I/O) |
| Time | ~3 hours |
| Prerequisites | Days 1–24 completed, BioLang installed (see Appendix A) |
| Data needed | Generated locally via init.bl (includes intentionally corrupted files) |
What You’ll Learn
- Why production pipelines need deliberate error handling strategies
- How to use try/catch to recover from failures without crashing
- How to validate inputs before processing begins
- How to implement retry logic for transient failures
- How to handle partial failures in batch processing
- How to log errors for post-mortem debugging
- How to build resilient pipelines that degrade gracefully
- How to test error paths systematically
The Problem
“My pipeline crashed at 3 AM on sample 187 of 200 — now what?”
You have built an overnight pipeline that processes 200 FASTQ files, filters them for quality, extracts sequence statistics, and writes a summary report. It ran perfectly on your test set of 10 files. You submitted it at midnight and went to sleep. At 7 AM, you check the results and find: the pipeline crashed on sample 187. Samples 1–186 were processed, but samples 188–200 were never touched. The error message says “unexpected character at line 4” — a corrupted FASTQ record.
Now you face a cascade of bad options. You could restart the entire pipeline from scratch, wasting 6 hours of compute on samples you already processed. You could manually edit sample 187 out of the input list and run only 188–200, but that requires you to understand exactly where the pipeline state was left. You could fix the corrupted file, but you need to find which of 200 files is sample 187, and you do not know if there are more corrupted files downstream.
All of these problems share a root cause: the pipeline assumed every input would be well-formed. It had no plan for failure.
Production bioinformatics pipelines encounter every category of error: missing files, permission problems, transient network timeouts, corrupted data, and exhausted resources such as a full disk.
This chapter teaches you to handle all of them. By the end, you will have a pipeline that processes every valid sample, skips corrupted ones, retries transient failures, logs everything, and produces a report telling you exactly what happened.
Section 1: try/catch Basics
The try/catch construct is BioLang’s mechanism for recovering from errors. When code inside try throws an error, execution jumps to the catch block instead of crashing the entire program.
Your First try/catch
The simplest pattern catches an error and substitutes a default value:
let result = try { int("not_a_number") } catch err { -1 }
The variable err in the catch block contains the error message as a string. You can inspect it, log it, or ignore it:
let value = try {
    read_csv("missing_file.csv")
} catch err {
    println(f"Warning: {err}")
    []
}
This is fundamentally different from letting the error crash your program. Without try/catch, a missing file terminates everything. With it, you decide what happens next.
try/catch Is an Expression
In BioLang, try/catch returns a value. This means you can use it anywhere you would use an expression — in variable assignments, function arguments, or pipe chains:
let samples = try { read_csv("data/sample_sheet.csv") } catch err { [] }
let count = len(try { read_lines("data.txt") } catch err { [] })
let safe_mean = try { mean(values) } catch err { 0.0 }
This is more concise than languages where try/catch is a statement that cannot return a value.
Nested try/catch
You can nest try/catch blocks when different operations need different fallback strategies:
let result = try {
    let data = try { read_csv("primary.csv") } catch err { read_csv("backup.csv") }
    data |> filter(|row| row.quality > 20)
} catch err {
    println(f"Both data sources failed: {err}")
    []
}
The inner try/catch tries a primary file and falls back to a backup. The outer try/catch handles the case where both files are missing or the filter operation fails.
Throwing Errors
Use error() to throw your own errors. This is how you enforce preconditions and signal problems to callers:
let validate_quality = |threshold| {
    if threshold < 0 {
        error("Quality threshold cannot be negative")
    }
    if threshold > 41 {
        error("Quality threshold exceeds Phred+33 maximum")
    }
    threshold
}
let q = try { validate_quality(-5) } catch err { println(err) }
Custom errors make debugging vastly easier than cryptic runtime errors. When your pipeline fails at 3 AM, "Quality threshold cannot be negative" tells you exactly what went wrong and where.
Section 2: Error Types and Messages
Not all errors deserve the same response. A corrupted file is permanent — retrying will not fix it. A network timeout is transient — retrying might succeed. Your error handling strategy should distinguish between these.
Classifying Errors
A practical approach is to examine the error message string:
let classify_error = |err_msg| {
    if contains(err_msg, "not found") { "missing" }
    else if contains(err_msg, "permission") { "access" }
    else if contains(err_msg, "timeout") { "transient" }
    else if contains(err_msg, "parse") { "data_corrupt" }
    else if contains(err_msg, "disk") { "resource" }
    else { "unknown" }
}
This classification drives different recovery strategies:
let handle_error = |err_msg, context| {
    let category = classify_error(err_msg)
    if category == "transient" {
        { action: "retry", message: err_msg, context: context }
    } else if category == "missing" {
        { action: "skip", message: err_msg, context: context }
    } else if category == "data_corrupt" {
        { action: "skip", message: err_msg, context: context }
    } else if category == "resource" {
        { action: "abort", message: err_msg, context: context }
    } else {
        { action: "log_and_skip", message: err_msg, context: context }
    }
}
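A quick check of the dispatch (the error message and file name here are illustrative):

```
let decision = handle_error("connection timeout after 30s", "sample_042.fastq")
println(decision.action)
```

Because the message contains "timeout", classify_error returns "transient" and the resulting action is "retry".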
Structured Error Records
Instead of returning bare values or nil on failure, return structured records that carry context:
let safe_read_fastq = |path| {
    try {
        let records = read_fastq(path)
        { ok: true, data: records, path: path, error: nil }
    } catch err {
        { ok: false, data: [], path: path, error: err }
    }
}
The caller can then inspect the ok field:
let result = safe_read_fastq("data/reads.fastq")
if result.ok {
    let stats = process(result.data)
} else {
    println(f"Skipping {result.path}: {result.error}")
}
This pattern — often called a “result record” — keeps errors in the data flow rather than in the control flow. You never lose track of which file failed or why.
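The pattern scales naturally to many files: map the safe reader over a list of paths, then partition on the ok field. A minimal sketch (the two file paths are placeholders):

```
let results = ["data/a.fastq", "data/b.fastq"] |> map(|p| safe_read_fastq(p))
let good = results |> filter(|r| r.ok)
let bad = results |> filter(|r| r.ok == false)
println(f"{len(good)} readable, {len(bad)} failed")
bad |> each(|r| println(f"  {r.path}: {r.error}"))
```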
Section 3: Retry Logic
Transient errors — network timeouts, rate limits, temporary server unavailability — often resolve on their own. Retry logic gives your pipeline resilience against these hiccups.
Simple Retry
The simplest retry pattern loops a fixed number of times:
let retry = |f, max_attempts| {
    let last_error = ""
    let result = nil
    let succeeded = false
    range(0, max_attempts) |> each(|i| {
        if succeeded == false {
            try {
                result = f()
                succeeded = true
            } catch err {
                last_error = err
                if i < max_attempts - 1 { sleep(1000) }
            }
        }
    })
    if succeeded { result }
    else { error(f"Failed after {max_attempts} attempts: {last_error}") }
}
Usage:
let data = retry(|| { read_csv("network_share/data.csv") }, 3)
Retry with Exponential Backoff
Fixed-interval retries can overwhelm a struggling server. Exponential backoff increases the wait time between attempts, giving the server time to recover:
let retry_backoff = |f, max_attempts, base_delay_ms| {
    let last_error = ""
    let result = nil
    let succeeded = false
    range(0, max_attempts) |> each(|i| {
        if succeeded == false {
            try {
                result = f()
                succeeded = true
            } catch err {
                last_error = err
                if i < max_attempts - 1 {
                    let delay = base_delay_ms
                    range(0, i) |> each(|_| { delay = delay * 2 })
                    if delay > 30000 { delay = 30000 }
                    sleep(delay)
                }
            }
        }
    })
    if succeeded { result }
    else { error(f"Failed after {max_attempts} attempts: {last_error}") }
}
The cap at 30 seconds prevents absurdly long waits. In practice, if a service is not responding after 30 seconds of backoff, it is probably down for maintenance — not experiencing a brief hiccup.
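Usage mirrors the simple retry. With a base delay of 500 ms and five attempts, the waits between tries grow roughly 500, 1000, 2000, 4000 ms (the network path is a placeholder):

```
let data = retry_backoff(|| { read_csv("network_share/data.csv") }, 5, 500)
```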
Retry Only Transient Errors
Not every error deserves a retry. Retrying a “file not found” error is pointless. Combine error classification with retry logic:
let retry_if_transient = |f, max_attempts| {
    let last_error = ""
    let result = nil
    let succeeded = false
    range(0, max_attempts) |> each(|i| {
        if succeeded == false {
            try {
                result = f()
                succeeded = true
            } catch err {
                last_error = err
                let category = classify_error(err)
                if category != "transient" {
                    error(err)
                }
                if i < max_attempts - 1 { sleep(1000) }
            }
        }
    })
    if succeeded { result }
    else { error(f"Failed after {max_attempts} attempts: {last_error}") }
}
Section 4: Input Validation
The cheapest error to handle is the one you prevent. Validating inputs before processing begins catches problems early, when the error message can be specific and actionable.
File Existence and Format
let validate_input_file = |path, expected_ext| {
    if file_exists(path) == false {
        error(f"Input file not found: {path}")
    }
    if ends_with(path, expected_ext) == false {
        error(f"Expected {expected_ext} file, got: {path}")
    }
    let lines = read_lines(path)
    if len(lines) == 0 {
        error(f"Input file is empty: {path}")
    }
    true
}
FASTQ Record Validation
FASTQ files have a strict four-line structure. A corrupted file might have truncated records, missing quality lines, or mismatched sequence/quality lengths:
let validate_fastq_record = |record| {
    if typeof(record) != "Record" {
        error("Invalid record type")
    }
    let seq = record.sequence
    let qual = record.quality
    if len(seq) == 0 {
        error(f"Empty sequence in record: {record.id}")
    }
    if len(seq) != len(qual) {
        error(f"Sequence/quality length mismatch in {record.id}: seq={len(seq)} qual={len(qual)}")
    }
    true
}
Batch Input Validation
Before processing 200 files, check them all first. This takes seconds and saves hours:
let validate_batch = |file_paths| {
    let errors = []
    file_paths |> each(|path| {
        try {
            validate_input_file(path, ".fastq")
        } catch err {
            errors = errors + [{ path: path, error: err }]
        }
    })
    if len(errors) > 0 {
        errors |> each(|e| {
            println(f"INVALID: {e.path} --- {e.error}")
        })
        error(f"Validation failed: {len(errors)} of {len(file_paths)} files have problems")
    }
    true
}
The decision to abort or continue depends on how many files fail validation: a few bad files out of hundreds are usually worth skipping, while widespread failures point to a systematic problem (wrong directory, wrong format) that should abort the run.
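One way to encode that decision is a threshold: abort when more than some fraction of inputs fail validation, otherwise log the bad ones and continue with the rest. A sketch, where the helper name validate_or_skip and the percentage cutoff are illustrative choices:

```
let validate_or_skip = |file_paths, max_bad_pct| {
    let bad = []
    file_paths |> each(|path| {
        try { validate_input_file(path, ".fastq") } catch err {
            bad = bad + [{ path: path, error: err }]
        }
    })
    let bad_pct = if len(file_paths) > 0 { (len(bad) * 100) / len(file_paths) } else { 0 }
    if bad_pct > max_bad_pct {
        error(f"Aborting: {len(bad)} of {len(file_paths)} inputs invalid")
    }
    bad |> each(|e| println(f"Skipping {e.path}: {e.error}"))
    file_paths |> filter(|p| len(bad |> filter(|e| e.path == p)) == 0)
}
```

Calling validate_or_skip(paths, 10) then returns only the files worth processing, or aborts if more than 10% are invalid.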
Section 5: Defensive File I/O
File operations are a leading source of pipeline failures. Files can be missing, empty, corrupted, in the wrong format, or on a filesystem that runs out of space mid-write.
Safe Reading
Wrap every file read in a function that validates the result:
let safe_read_csv = |path| {
    if file_exists(path) == false {
        error(f"File not found: {path}")
    }
    let data = try {
        read_csv(path)
    } catch err {
        error(f"Failed to parse CSV {path}: {err}")
    }
    if len(data) == 0 {
        error(f"CSV file is empty: {path}")
    }
    data
}
Safe Writing with Verification
Writing is trickier than reading. A write can appear to succeed but produce a truncated file if the disk fills up mid-write. Write to a temporary file first, then verify:
let safe_write_csv = |data, path| {
    let tmp_path = path + ".tmp"
    try {
        write_csv(data, tmp_path)
    } catch err {
        error(f"Failed to write {path}: {err}")
    }
    if file_exists(tmp_path) == false {
        error(f"Write appeared to succeed but temp file not found: {tmp_path}")
    }
    let verify = try { read_csv(tmp_path) } catch err {
        error(f"Written file is not valid CSV: {err}")
    }
    if len(verify) != len(data) {
        error(f"Row count mismatch: wrote {len(data)} but read back {len(verify)}")
    }
    try {
        write_csv(data, path)
    } catch err {
        error(f"Failed to write final output to {path}: {err}")
    }
    true
}
Directory Safety
let ensure_dir = |path| {
    try {
        mkdir(path)
    } catch err {
        if contains(str(err), "exists") == false {
            error(f"Cannot create directory {path}: {err}")
        }
    }
}
Section 6: Partial Failure and Recovery
In batch processing, the question is not if a sample will fail but when. The key design decision is: should a single failure stop everything, or should the pipeline continue with the remaining samples?
The Accumulator Pattern
Process each item independently and collect successes and failures separately:
let process_batch = |items, process_fn| {
    let successes = []
    let failures = []
    items |> each(|item| {
        try {
            let result = process_fn(item)
            successes = successes + [result]
        } catch err {
            failures = failures + [{ item: item, error: err }]
        }
    })
    { successes: successes, failures: failures }
}
This pattern guarantees that one bad sample never prevents the other 199 from being processed.
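For example, counting reads per file while tolerating corrupt ones (fastq_paths is a placeholder for your input list):

```
let outcome = process_batch(fastq_paths, |path| {
    let records = read_fastq(path)
    { file: path, n_reads: len(records) }
})
println(f"{len(outcome.successes)} succeeded, {len(outcome.failures)} failed")
outcome.failures |> each(|fail| println(f"  {fail.item}: {fail.error}"))
```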
Checkpointing
For long-running pipelines, save progress periodically so you can resume after a crash:
let process_with_checkpoint = |items, process_fn, checkpoint_path| {
    let completed = if file_exists(checkpoint_path) {
        try { json_decode(read_lines(checkpoint_path) |> join("\n")) } catch err { [] }
    } else {
        []
    }
    let remaining = items |> filter(|item| {
        let done = completed |> filter(|c| c == item)
        len(done) == 0
    })
    remaining |> each(|item| {
        try {
            process_fn(item)
            completed = completed + [item]
            write_lines([json_encode(completed)], checkpoint_path)
        } catch err {
            println(f"Failed: {item} --- {err}")
        }
    })
    completed
}
If the pipeline crashes at sample 187, you restart it and it picks up at sample 188 — no wasted work.
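A minimal run might look like this, where fastq_paths and the trivial per-item work are placeholders; run the script twice and the second run skips everything it already finished:

```
let done = process_with_checkpoint(
    fastq_paths,
    |path| { write_lines(["done"], path + ".processed") },
    "output/checkpoint.json"
)
println(f"{len(done)} samples completed")
```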
Error Propagation Flow
Understanding how errors flow through a pipeline helps you place try/catch blocks at the right level.
The rule of thumb: catch data errors at the per-sample level (skip and continue), but let resource errors (disk full, out of memory) propagate up and abort the pipeline. There is no point processing 200 samples if you cannot write the results.
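That rule translates directly into code: catch at the per-sample level, but rethrow anything classified as a resource error so it aborts the whole run. A sketch, where process_sample stands in for your per-file work:

```
files |> each(|file| {
    try {
        process_sample(file)
    } catch err {
        if classify_error(err) == "resource" {
            error(err)
        }
        println(f"Skipping {file}: {err}")
    }
})
```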
Section 7: Logging Errors
When a pipeline runs overnight, print() output disappears into a terminal that nobody is watching. Write errors to a structured log file that you can analyze after the fact.
Error Log as a Table
let create_error_log = || {
    []
}
let log_error = |log, timestamp, source, severity, message| {
    log + [{
        timestamp: timestamp,
        source: source,
        severity: severity,
        message: message
    }]
}
let save_error_log = |log, path| {
    if len(log) > 0 {
        let table = log |> to_table()
        write_csv(table, path)
    } else {
        write_lines(["timestamp,source,severity,message"], path)
    }
}
Usage in a pipeline:
let errors = create_error_log()
let timestamp = format_date(now(), "%Y-%m-%d %H:%M:%S")
errors = log_error(errors, timestamp, "sample_187.fastq", "ERROR",
    "Truncated record at line 4")
errors = log_error(errors, timestamp, "sample_192.fastq", "WARN",
    "Low quality: 80% filtered")
save_error_log(errors, "output/error_log.csv")
After the pipeline finishes (or crashes), the error log tells you exactly what happened:
timestamp,source,severity,message
2025-01-15 03:14:22,sample_187.fastq,ERROR,Truncated record at line 4
2025-01-15 03:28:45,sample_192.fastq,WARN,Low quality: 80% filtered
Summary Statistics
At the end of a pipeline run, produce a summary that answers the key question: Did it work?
let summarize_run = |total, successes, failures, errors| {
    let success_rate = if total > 0 { (successes * 100) / total } else { 0 }
    {
        total_samples: total,
        succeeded: successes,
        failed: failures,
        success_rate_pct: success_rate,
        error_count: len(errors),
        status: if failures == 0 { "COMPLETE" }
            else if success_rate > 90 { "PARTIAL_SUCCESS" }
            else { "FAILED" }
    }
}
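For instance, a 200-sample run with 5 failures (and an empty error log, here) has a success rate above 90%, so it is marked PARTIAL_SUCCESS rather than FAILED:

```
let summary = summarize_run(200, 195, 5, [])
println(f"{summary.status}: {summary.succeeded}/{summary.total_samples} samples")
```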
Section 8: Building a Resilient Pipeline
Let us put all the pieces together. This section builds a production-grade FASTQ processing pipeline that handles every error category from the taxonomy at the start of this chapter.
Pipeline Architecture
INPUT FILES VALIDATION PROCESSING OUTPUT
────────── ────────── ────────── ──────
sample_001.fastq ──┐
sample_002.fastq ──┤ ┌────────────────┐ ┌──────────────┐ ┌──────────┐
sample_003.fastq ──┼────▶│ Check exists │──▶│ Read FASTQ │──▶│ Stats │
... ──┤ │ Check format │ │ Filter qual │ │ Table │
sample_200.fastq ──┘ │ Check non-empty│ │ Compute GC │ │ │
└───────┬────────┘ └──────┬───────┘ └────┬─────┘
│ │ │
skip invalid skip corrupt write results
log reason log reason + error log
│ │ │
▼ ▼ ▼
error_log.csv error_log.csv summary.json
The Complete Pipeline
let run_pipeline = |input_dir, output_dir| {
    ensure_dir(output_dir)
    let errors = create_error_log()
    let results = []
    let timestamp = format_date(now(), "%Y-%m-%d %H:%M:%S")
    let files = try {
        list_dir(input_dir) |> filter(|f| ends_with(f, ".fastq"))
    } catch err {
        errors = log_error(errors, timestamp, input_dir, "FATAL",
            f"Cannot list directory: {err}")
        save_error_log(errors, output_dir + "/error_log.csv")
        error(f"Cannot access input directory: {err}")
    }
    if len(files) == 0 {
        error(f"No FASTQ files found in {input_dir}")
    }
    files |> each(|file| {
        let path = input_dir + "/" + file
        let ts = format_date(now(), "%Y-%m-%d %H:%M:%S")
        try {
            let records = read_fastq(path)
            if len(records) == 0 {
                errors = log_error(errors, ts, file, "WARN",
                    "Empty file, skipping")
            } else {
                let valid = records |> filter(|r| {
                    let ok = try {
                        len(r.sequence) == len(r.quality)
                    } catch err { false }
                    ok
                })
                let filtered = valid |> quality_filter(20)
                let stats = {
                    file: file,
                    total_records: len(records),
                    valid_records: len(valid),
                    passed_qc: len(filtered),
                    pct_passed: if len(valid) > 0 {
                        (len(filtered) * 100) / len(valid)
                    } else { 0 },
                    mean_gc: if len(filtered) > 0 {
                        filtered
                        |> map(|r| gc_content(r.sequence))
                        |> mean()
                    } else { 0.0 }
                }
                results = results + [stats]
                if len(valid) < len(records) {
                    let dropped = len(records) - len(valid)
                    errors = log_error(errors, ts, file, "WARN",
                        f"{dropped} records had seq/qual length mismatch")
                }
            }
        } catch err {
            errors = log_error(errors, ts, file, "ERROR",
                f"Processing failed: {err}")
        }
    })
    let summary = summarize_run(len(files), len(results),
        len(files) - len(results), errors)
    if len(results) > 0 {
        let table = results |> to_table()
        write_csv(table, output_dir + "/qc_results.csv")
    }
    save_error_log(errors, output_dir + "/error_log.csv")
    write_lines([json_encode(summary)], output_dir + "/summary.json")
    summary
}
Call it:
Requires CLI: This example uses file I/O not available in the browser. Run with bl run.
let result = run_pipeline("data/fastq", "data/output")
println(f"Pipeline {result.status}: {result.succeeded}/{result.total_samples} samples processed")
Section 9: Testing Error Paths
Most pipelines are tested only with good inputs. Production bugs hide in the error paths — the code that runs when things go wrong. Test your error handling as deliberately as you test your analysis.
Testing with Intentionally Bad Data
The init.bl script for this chapter generates files specifically designed to trigger errors:
- good_001.fastq through good_005.fastq — well-formed, passes all checks
- truncated.fastq — FASTQ file cut off mid-record
- empty.fastq — zero bytes
- bad_quality.fastq — valid format but all low-quality bases
- mismatched.fastq — sequence and quality lines have different lengths
A robust pipeline should handle all five error cases without crashing, processing the good samples and logging the bad ones.
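A simple smoke test is to feed every file to read_fastq inside a try/catch and confirm the loop reaches the end without crashing (the data/ prefix assumes the init.bl layout):

```
let test_files = ["good_001.fastq", "truncated.fastq", "empty.fastq",
                  "bad_quality.fastq", "mismatched.fastq"]
test_files |> each(|name| {
    let outcome = try {
        let records = read_fastq("data/" + name)
        f"OK, {len(records)} records"
    } catch err {
        f"caught: {err}"
    }
    println(f"{name}: {outcome}")
})
```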
Testing Error Classification
let test_classify = || {
    let cases = [
        { input: "file not found: x.fastq", expected: "missing" },
        { input: "permission denied", expected: "access" },
        { input: "connection timeout after 30s", expected: "transient" },
        { input: "parse error at line 4", expected: "data_corrupt" },
        { input: "disk quota exceeded", expected: "resource" },
        { input: "something unexpected", expected: "unknown" }
    ]
    cases |> each(|c| {
        let result = classify_error(c.input)
        if result != c.expected {
            error(f"classify_error failed: got {result}, expected {c.expected}")
        }
    })
    true
}
Testing Retry Logic
let test_retry = || {
    let call_count = 0
    let flaky_fn = || {
        call_count = call_count + 1
        if call_count < 3 { error("transient failure") }
        "success"
    }
    let result = retry(flaky_fn, 5)
    if result != "success" { error("Retry did not return success") }
    if call_count != 3 { error(f"Expected 3 calls, got {call_count}") }
    true
}
Exercises
Exercise 1: Validate a Sample Sheet
Write a function validate_sample_sheet(path) that reads a CSV sample sheet and checks:
- File exists and is non-empty
- Required columns sample_id, fastq_r1, and fastq_r2 are present
- No duplicate sample_id values
- All referenced FASTQ files exist
Return a record with { valid: bool, errors: [...] }.
Exercise 2: Retry with Jitter
Modify the retry_backoff function to add random jitter to the delay. When multiple pipelines retry against the same server simultaneously, they can synchronize their retries and create “thundering herd” problems. Adding a random component (e.g., 0–50% of the delay) desynchronizes them.
Hint: BioLang does not have a random number builtin, but you can derive jitter from now() — the millisecond component changes rapidly enough to serve as a simple source of variation.
Exercise 3: Circuit Breaker
Implement a “circuit breaker” pattern: after N consecutive failures to the same service, stop trying for a cooldown period. This prevents a dead service from slowing down your entire pipeline with timeouts.
Write a function that returns a record with { call: fn, reset: fn, state: fn } fields. The call field wraps a function with circuit breaker logic: if the breaker is “open” (too many failures), it returns an error immediately without calling the wrapped function.
Exercise 4: Full Recovery Pipeline
Using the corrupted test data from init.bl, build a pipeline that:
- Validates all input files before processing
- Processes valid files with per-file error handling
- Writes a checkpoint after each successful file
- Produces both a results table and an error log
- Can be run twice — on the second run, it skips already-processed files
Key Takeaways
- try/catch is an expression — use it inline to provide default values, not just for control flow.
- Classify errors before handling them. Transient errors deserve retries. Data errors deserve skipping. Resource errors deserve aborting.
- Validate inputs early. Checking 200 files takes seconds. Processing 186 files before discovering a problem takes hours.
- Accumulate, do not abort. The accumulator pattern (collect successes and failures separately) ensures one bad sample never blocks the other 199.
- Checkpoint long pipelines. Saving progress to disk means you never redo work after a crash.
- Log structured errors. A CSV error log is searchable, sortable, and scriptable. print() output is none of these.
- Test error paths. Generate intentionally bad data and verify your pipeline handles it. The code that runs when things go wrong is the code that matters most at 3 AM.
Next: Day 26 — AI-Assisted Analysis, where you will use large language models to interpret results, generate hypotheses, and accelerate your biological discoveries.