FASTQ QC Pipeline
Build a complete quality-control pipeline for FASTQ files. You will learn how to read sequencing data, compute quality statistics, filter reads, count k-mers, and generate a summary report.
What you will learn
- Reading FASTQ files with fastq()
- Inspecting read records: sequence, quality scores, headers
- Computing per-read and per-base quality statistics
- Filtering reads by quality threshold
- Counting k-mers for contamination screening
- Using the pipe operator to build multi-step pipelines
- Writing filtered output to a new FASTQ file
Run the finished example:
bl run examples/tutorials/fastq-pipeline.bl
Prerequisites
Complete the Hello Genomics tutorial first. You will also need a sample FASTQ file. BioLang ships with sample data:
# Sample data is included in the repository
ls examples/sample-data/reads.fq
# Or run the quickstart to verify everything works
bl run examples/quickstart.bl
The sample FASTQ file at examples/sample-data/reads.fq contains 8 reads
with varying quality scores. We will use it throughout this tutorial, but you can
substitute your own FASTQ file.
Step 1 — Reading a FASTQ File
Use read_fastq() to load a FASTQ file into a reusable table, or
fastq() for a lazy stream (better for large files).
Each record contains an id, sequence, and quality scores.
# requires: examples/sample-data/reads.fq in working directory
# qc.bio — FASTQ quality control pipeline
# read_fastq returns a Table (reusable)
let reads = read_fastq("examples/sample-data/reads.fq")
# Check how many reads we have
print("Total reads:", len(reads))
# Look at the first read
let first = reads[0]
print(f"ID: {first.id}")
print(f"Sequence: {first.seq}")
print(f"Quality: {first.quality}")
print(f"Length: {first.length}")
bl run qc.bio
# Total reads: 8
# ID: SRR001/1
# Sequence: ATCGATCGATCG...
# Quality: IIIIIIHHGGFF...
# Length: 150
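If you have not worked with FASTQ before, the four-line record layout that read_fastq() parses can be sketched in plain Python (illustrative only; this is not BioLang's implementation):

```python
# Each FASTQ record is four lines: @id, sequence, '+', quality string.
def parse_fastq(lines):
    records = []
    for i in range(0, len(lines), 4):
        records.append({
            "id": lines[i].lstrip("@"),      # header line, '@' stripped
            "seq": lines[i + 1],             # the bases
            "quality": lines[i + 3],         # one quality char per base
            "length": len(lines[i + 1]),
        })
    return records

raw = ["@read1", "ATCGATCG", "+", "IIIIHHGG",
       "@read2", "GGGGCCCC", "+", "FFFFEEEE"]
records = parse_fastq(raw)
print(len(records))           # 2
print(records[0]["seq"])      # ATCGATCG
```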
Step 2 — Computing Read Statistics
Use read_stats() to compute summary statistics across all reads, including
quality scores, GC content, and length distribution.
# Compute read-level statistics using read_stats()
# read_stats returns a record with count, mean_length,
# mean_quality, gc_content, q20_pct, q30_pct, etc.
let stats = read_stats(reads)
print(f"Total reads: {stats.count}")
print(f"Mean quality: {round(stats.mean_quality, 2)}")
print(f"Mean length: {round(stats.mean_length, 1)}")
print(f"Mean GC: {round(stats.gc_content, 4)}")
# You can also look at individual reads
let first = reads[0]
print(f"Read 1 GC: {gc_content(first.seq)}")
print(f"Read 1 length: {first.length}")
Step 3 — Exploring Per-Read Metrics
In addition to the headline numbers, read_stats() reports Q20 and Q30 percentages.
Beyond aggregate statistics, you can also compute per-read metrics like GC content
using map() and standard stats functions such as min() and max().
# read_stats() computes comprehensive statistics in one call
let stats = read_stats(reads)
print("=== Read Statistics ===")
print(f"Total reads: {stats.count}")
print(f"Mean length: {round(stats.mean_length, 1)}")
print(f"Mean quality: {round(stats.mean_quality, 2)}")
print(f"Mean GC: {round(stats.gc_content, 4)}")
print(f"Q20 %%: {round(stats.q20_pct, 1)}")
print(f"Q30 %%: {round(stats.q30_pct, 1)}")
# You can also compute GC for individual sequences
let gc_values = map(reads, |r| gc_content(r.seq))
print(f"GC range: {min(gc_values)} - {max(gc_values)}")
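The quality numbers come from Phred scores: in standard FASTQ (Phred+33 encoding), each quality character maps to Q = ord(ch) - 33. A Python sketch of how mean quality and the Q20/Q30 percentages are typically derived (illustrative, assuming Phred+33; not BioLang's implementation):

```python
# Decode a Phred+33 quality string into numeric scores.
def phred_scores(quality):
    return [ord(ch) - 33 for ch in quality]

# Mean quality and the fraction of bases at or above Q20 / Q30.
def quality_summary(quality):
    scores = phred_scores(quality)
    return {
        "mean_quality": sum(scores) / len(scores),
        "q20_pct": 100 * sum(q >= 20 for q in scores) / len(scores),
        "q30_pct": 100 * sum(q >= 30 for q in scores) / len(scores),
    }

s = quality_summary("IIII!!!!")   # 'I' = Q40, '!' = Q0
print(s["mean_quality"])          # 20.0
print(s["q20_pct"])               # 50.0
```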
Step 4 — Filtering Reads by Quality
Low-quality reads introduce errors in downstream analysis. Let us filter them out.
# filter_reads filters a stream by "quality" or "length" criterion
# filter_reads(stream, criterion, threshold)
# Filter reads with mean quality >= 20
let passed = filter_reads(reads, "quality", 20)
let passed_list = collect(passed)
print(f"Passed quality filter: {len(passed_list)} reads")
# Filter by minimum length
let long_reads = filter_reads(reads, "length", 50)
let long_list = collect(long_reads)
print(f"Passed length filter: {len(long_list)} reads")
# Chain filters using the pipe operator — |> into binds the result
reads
|> filter_reads("quality", 20)
|> filter_reads("length", 50)
|> collect()
|> into clean
print(f"After both filters: {len(clean)} reads")
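Conceptually, the "quality" criterion compares each read's mean Phred score against the threshold. A Python sketch of that check (assuming Phred+33 encoding; filter_reads itself is a BioLang built-in):

```python
# Keep a read only if its mean Phred+33 quality meets the threshold.
def passes_quality(quality, threshold):
    scores = [ord(ch) - 33 for ch in quality]
    return sum(scores) / len(scores) >= threshold

reads = [("ACGT", "IIII"),    # mean Q40 -> kept
         ("ACGT", "!!!!")]    # mean Q0  -> dropped
kept = [r for r in reads if passes_quality(r[1], 20)]
print(len(kept))   # 1
```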
Step 5 — Trimming Low-Quality Ends
Instead of discarding entire reads, we can trim low-quality bases from the 3' end.
# trim_quality(stream, threshold, min_len) trims low-quality
# bases from read ends and discards reads shorter than min_len
let trimmed = trim_quality(reads, 20, 30)
let trimmed_list = collect(trimmed)
print(f"After trimming: {len(trimmed_list)} reads")
# Compare stats before and after
let raw_stats = read_stats(reads)
let trim_stats = read_stats(trimmed_list)
print(f"Mean length before: {round(raw_stats.mean_length, 1)}")
print(f"Mean length after: {round(trim_stats.mean_length, 1)}")
# You can also trim a fixed number of bases from read ends
# trim_reads(stream, n) trims n bases from the 3' end
# trim_reads(stream, n, "both") trims from both ends
let hard_trimmed = trim_reads(reads, 10)
let hard_list = collect(hard_trimmed)
print(f"After hard trim: {len(hard_list)} reads")
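One common way to implement 3'-end quality trimming is to walk back from the end of the read, dropping bases whose score falls below the threshold, then discarding reads that end up too short. A Python sketch of that idea (trim_quality's exact algorithm may differ, e.g. a sliding window):

```python
def trim_3prime(seq, quality, threshold, min_len):
    # Walk back from the 3' end while the Phred+33 score is below threshold.
    end = len(seq)
    while end > 0 and ord(quality[end - 1]) - 33 < threshold:
        end -= 1
    if end < min_len:
        return None   # read too short after trimming: discard it
    return seq[:end], quality[:end]

print(trim_3prime("ACGTACGT", "IIII!!!!", 20, 4))   # ('ACGT', 'IIII')
print(trim_3prime("ACGTACGT", "!!!!!!!!", 20, 4))   # None
```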
Step 6 — K-mer Counting
K-mer profiles help detect contamination, adapter sequences, and biased composition.
kmer_count accepts lists, tables, and streams directly — no need to extract sequences manually.
# Count k-mers across all trimmed reads
# kmer_count returns a Table sorted by count (descending)
let k = 5
let kmer_counts = kmer_count(trimmed_list, k)
print(f"\n=== Top {k}-mers ===")
kmer_counts |> head(10) |> print()
# You can also get k-mers for a single sequence
let first_kmers = kmers(trimmed_list[0].seq, k)
print(f"K-mers in first read: {len(first_kmers)}")
# Check for adapter contamination
let detected = detect_adapters(reads)
print("Detected adapters:")
print(detected)
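Under the hood, k-mer counting slides a window of length k along each sequence and tallies every substring it sees. A Python sketch of that idea, returning (kmer, count) pairs sorted by count descending like kmer_count's output table (illustrative, not BioLang's implementation):

```python
from collections import Counter

def count_kmers(seqs, k):
    counts = Counter()
    for seq in seqs:
        # Slide a window of length k across the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common()   # sorted by count, descending

top = count_kmers(["ATATAT", "ATATA"], 2)
print(top[:2])   # [('AT', 5), ('TA', 4)]
```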
K-mer Memory Options
For large FASTQ files, kmer_count automatically manages memory:
- Default: in-memory up to ~2M unique k-mers, then auto-spills to disk (SQLite temp DB)
- Top-N: kmer_count(reads, 21, 100) — bounded memory, keeps only the top 100
- Streaming: fastq("file.fq.gz") |> kmer_count(21) — reads are never loaded into memory
Results are always sorted by count descending — no sort_by needed after kmer_count.
Step 7 — Building the Complete Pipeline
Now let us combine everything into a single pipeline using the pipe operator.
# requires: examples/sample-data/reads.fq in working directory
# fastq_qc.bio — complete FASTQ QC pipeline
fn run_qc(input_path, output_path) {
print(f"Reading {input_path}...")
let raw = fastq(input_path)
# Pipeline: trim quality -> filter by quality -> filter by length
let clean = raw
|> trim_quality(20, 30)
|> filter_reads("quality", 25)
|> filter_reads("length", 50)
|> collect()
# Compute stats on both raw and clean
let raw_stats = read_stats(read_fastq(input_path))
let clean_stats = read_stats(clean)
# Print comparison
print("\n=== QC Report ===")
print(" Raw Clean")
print(f"Reads: {raw_stats.count} {clean_stats.count}")
print(f"Mean length: {round(raw_stats.mean_length, 1)} {round(clean_stats.mean_length, 1)}")
print(f"Mean quality: {round(raw_stats.mean_quality, 2)} {round(clean_stats.mean_quality, 2)}")
print(f"Mean GC: {round(raw_stats.gc_content, 4)} {round(clean_stats.gc_content, 4)}")
print(f"Q30 %%: {round(raw_stats.q30_pct, 1)} {round(clean_stats.q30_pct, 1)}")
# Write clean reads
write_fastq(clean, output_path)
print(f"\nWrote {len(clean)} clean reads to {output_path}")
}
# Run the pipeline
run_qc("examples/sample-data/reads.fq", "examples/sample_clean.fastq")
Step 8 — Generating a Summary Report
Build a comparison table of raw vs clean statistics and export it as CSV.
# requires: examples/sample_clean.fastq (written in Step 7) in working directory
# Generate a summary table comparing raw vs clean reads
let raw = read_fastq("examples/sample-data/reads.fq")
let clean = read_fastq("examples/sample_clean.fastq")
let raw_stats = read_stats(raw)
let clean_stats = read_stats(clean)
# Build a comparison table
let labels = ["count", "mean_length", "mean_quality", "gc_content", "q20_pct", "q30_pct"]
let raw_vals = [raw_stats.count, round(raw_stats.mean_length, 1),
round(raw_stats.mean_quality, 2), round(raw_stats.gc_content, 4),
round(raw_stats.q20_pct, 1), round(raw_stats.q30_pct, 1)]
let clean_vals = [clean_stats.count, round(clean_stats.mean_length, 1),
round(clean_stats.mean_quality, 2), round(clean_stats.gc_content, 4),
round(clean_stats.q20_pct, 1), round(clean_stats.q30_pct, 1)]
let summary = to_table(map(
[0, 1, 2, 3, 4, 5],
|i| {metric: labels[i], raw: raw_vals[i], clean: clean_vals[i]}
))
print(summary)
write_csv(summary, "examples/qc_summary.csv")
print("Summary saved to examples/qc_summary.csv")
bl run fastq_qc.bio
# Reading examples/sample-data/reads.fq...
# === QC Report ===
# ...
# Summary saved to examples/qc_summary.csv
Tips
- For large files, use fastq() (stream) instead of read_fastq() (table) to process reads in constant memory. See the Streaming tutorial.
- The pipe operator |> lets you build readable pipelines. Each step receives the output of the previous step.
- Use bl run --verbose qc.bio to see timing information for each step.
Next Steps
Learn how to work with structured tabular data in the Working with Tables tutorial.