Beginner ~25 minutes

FASTQ QC Pipeline

Build a complete quality-control pipeline for FASTQ files. You will learn how to read sequencing data, compute quality statistics, filter reads, count k-mers, and generate a summary report.

What you will learn

  • Reading FASTQ files with fastq()
  • Inspecting read records: sequence, quality scores, headers
  • Computing per-read and per-base quality statistics
  • Filtering reads by quality threshold
  • Counting k-mers for contamination screening
  • Using the pipe operator to build multi-step pipelines
  • Writing filtered output to a new FASTQ file

Run this tutorial: Download fastq-pipeline.bl and run it with bl run examples/tutorials/fastq-pipeline.bl

Prerequisites

Complete the Hello Genomics tutorial first. You will also need a sample FASTQ file. BioLang ships with sample data:

# Sample data is included in the repository
ls examples/sample-data/reads.fq

# Or run the quickstart to verify everything works
bl run examples/quickstart.bl

The sample FASTQ file at examples/sample-data/reads.fq contains 8 reads with varying quality scores. We will use it throughout this tutorial, though you can substitute your own FASTQ file.

Step 1 — Reading a FASTQ File

Use read_fastq() to load a FASTQ file into a reusable table, or fastq() for a lazy stream (better for large files). Each record contains an id, sequence, and quality scores.

# requires: examples/sample-data/reads.fq in working directory
# qc.bio — FASTQ quality control pipeline

# read_fastq returns a Table (reusable)
let reads = read_fastq("examples/sample-data/reads.fq")

# Check how many reads we have
print("Total reads:", len(reads))

# Look at the first read
let first = reads[0]
print(f"ID:       {first.id}")
print(f"Sequence: {first.seq}")
print(f"Quality:  {first.quality}")
print(f"Length:   {first.length}")

bl run qc.bio
# Total reads: 8
# ID:       SRR001/1
# Sequence: ATCGATCGATCG...
# Quality:  IIIIIIHHGGFF...
# Length:   150

Step 2 — Computing Read Statistics

Use read_stats() to compute summary statistics across all reads, including quality scores, GC content, and length distribution.

# Compute read-level statistics using read_stats()
# read_stats returns a record with count, mean_length,
# mean_quality, gc_content, q20_pct, q30_pct, etc.
let stats = read_stats(reads)

print(f"Total reads:  {stats.count}")
print(f"Mean quality: {round(stats.mean_quality, 2)}")
print(f"Mean length:  {round(stats.mean_length, 1)}")
print(f"Mean GC:      {round(stats.gc_content, 4)}")
print(f"Q20 %%:        {round(stats.q20_pct, 1)}")
print(f"Q30 %%:        {round(stats.q30_pct, 1)}")

# You can also look at individual reads
let first = reads[0]
print(f"Read 1 GC: {gc_content(first.seq)}")
print(f"Read 1 length: {first.length}")

Step 3 — Exploring Per-Read Metrics

Beyond aggregate statistics, you can compute per-read metrics like GC content using map() and standard stats functions.

# Compute per-read GC content with map()
let gc_values = map(reads, |r| gc_content(r.seq))
print(f"GC range: {min(gc_values)} - {max(gc_values)}")

# Per-read lengths work the same way
let lengths = map(reads, |r| r.length)
print(f"Length range: {min(lengths)} - {max(lengths)}")

Step 4 — Filtering Reads by Quality

Low-quality reads introduce errors in downstream analysis. Let us filter them out.

# filter_reads filters a stream by "quality" or "length" criterion
# filter_reads(stream, criterion, threshold)

# Filter reads with mean quality >= 20
let passed = filter_reads(reads, "quality", 20)
let passed_list = collect(passed)
print(f"Passed quality filter: {len(passed_list)} reads")

# Filter by minimum length
let long_reads = filter_reads(reads, "length", 50)
let long_list = collect(long_reads)
print(f"Passed length filter: {len(long_list)} reads")

# Chain filters using the pipe operator — |> into binds the result
reads
  |> filter_reads("quality", 20)
  |> filter_reads("length", 50)
  |> collect()
  |> into clean

print(f"After both filters: {len(clean)} reads")

Step 5 — Trimming Low-Quality Ends

Instead of discarding entire reads, we can trim low-quality bases from the read ends.

# trim_quality(stream, threshold, min_len) trims low-quality
# bases from read ends and discards reads shorter than min_len
let trimmed = trim_quality(reads, 20, 30)
let trimmed_list = collect(trimmed)
print(f"After trimming: {len(trimmed_list)} reads")

# Compare stats before and after
let raw_stats = read_stats(reads)
let trim_stats = read_stats(trimmed_list)
print(f"Mean length before: {round(raw_stats.mean_length, 1)}")
print(f"Mean length after:  {round(trim_stats.mean_length, 1)}")

# You can also trim a fixed number of bases from read ends
# trim_reads(stream, n) trims n bases from the 3' end
# trim_reads(stream, n, "both") trims from both ends
let hard_trimmed = trim_reads(reads, 10)
let hard_list = collect(hard_trimmed)
print(f"After hard trim: {len(hard_list)} reads")

Step 6 — K-mer Counting

K-mer profiles help detect contamination, adapter sequences, and biased composition. kmer_count accepts lists, tables, and streams directly — no need to extract sequences manually.

# Count k-mers across all trimmed reads
# kmer_count returns a Table sorted by count (descending)
let k = 5
let kmer_counts = kmer_count(trimmed_list, k)
print(f"\n=== Top {k}-mers ===")
kmer_counts |> head(10) |> print()

# You can also get k-mers for a single sequence
let first_kmers = kmers(trimmed_list[0].seq, k)
print(f"K-mers in first read: {len(first_kmers)}")

# Check for adapter contamination
let detected = detect_adapters(reads)
print("Detected adapters:")
print(detected)

K-mer Memory Options

For large FASTQ files, kmer_count automatically manages memory:

  • Default: in-memory up to ~2M unique k-mers, then auto-spills to disk (SQLite temp DB)
  • Top-N: kmer_count(reads, 21, 100) — bounded memory, keeps only top 100
  • Streaming: fastq("file.fq.gz") |> kmer_count(21) — reads never loaded into memory

Results are always sorted by count descending — no sort_by needed after kmer_count.
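Put together, the three memory modes look like this (k = 21 and the file path are illustrative; use whatever suits your data):

# Default: in-memory, auto-spilling to disk past ~2M unique k-mers
let all_kmers = kmer_count(reads, 21)

# Top-N: bounded memory, keeps only the 100 most frequent 21-mers
let top100 = kmer_count(reads, 21, 100)

# Streaming: count k-mers without loading reads into memory
fastq("examples/sample-data/reads.fq")
  |> kmer_count(21)
  |> head(10)
  |> print()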

Step 7 — Building the Complete Pipeline

Now let us combine everything into a single pipeline using the pipe operator.

# requires: examples/sample.fastq in working directory
# fastq_qc.bio — complete FASTQ QC pipeline

fn run_qc(input_path, output_path) {
  print(f"Reading {input_path}...")
  let raw = fastq(input_path)

  # Pipeline: trim quality -> filter by quality -> filter by length
  let clean = raw
    |> trim_quality(20, 30)
    |> filter_reads("quality", 25)
    |> filter_reads("length", 50)
    |> collect()

  # Compute stats on both raw and clean
  let raw_stats = read_stats(read_fastq(input_path))
  let clean_stats = read_stats(clean)

  # Print comparison
  print("\n=== QC Report ===")
  print("                  Raw        Clean")
  print(f"Reads:        {raw_stats.count}     {clean_stats.count}")
  print(f"Mean length:  {round(raw_stats.mean_length, 1)}  {round(clean_stats.mean_length, 1)}")
  print(f"Mean quality: {round(raw_stats.mean_quality, 2)}     {round(clean_stats.mean_quality, 2)}")
  print(f"Mean GC:      {round(raw_stats.gc_content, 4)}   {round(clean_stats.gc_content, 4)}")
  print(f"Q30 %%:        {round(raw_stats.q30_pct, 1)}      {round(clean_stats.q30_pct, 1)}")

  # Write clean reads
  write_fastq(clean, output_path)
  print(f"\nWrote {len(clean)} clean reads to {output_path}")
}

# Run the pipeline
run_qc("examples/sample.fastq", "examples/sample_clean.fastq")

Step 8 — Generating a Summary Report

Build a comparison table of raw vs clean statistics and export it as CSV.

# requires: examples/sample.fastq and examples/sample_clean.fastq in working directory
# Generate a summary table comparing raw vs clean reads
let raw = read_fastq("examples/sample.fastq")
let clean = read_fastq("examples/sample_clean.fastq")

let raw_stats = read_stats(raw)
let clean_stats = read_stats(clean)

# Build a comparison table
let labels = ["count", "mean_length", "mean_quality", "gc_content", "q20_pct", "q30_pct"]
let raw_vals = [raw_stats.count, round(raw_stats.mean_length, 1),
                round(raw_stats.mean_quality, 2), round(raw_stats.gc_content, 4),
                round(raw_stats.q20_pct, 1), round(raw_stats.q30_pct, 1)]
let clean_vals = [clean_stats.count, round(clean_stats.mean_length, 1),
                  round(clean_stats.mean_quality, 2), round(clean_stats.gc_content, 4),
                  round(clean_stats.q20_pct, 1), round(clean_stats.q30_pct, 1)]

let summary = to_table(map(
  [0, 1, 2, 3, 4, 5],
  |i| {metric: labels[i], raw: raw_vals[i], clean: clean_vals[i]}
))

print(summary)
write_csv(summary, "examples/qc_summary.csv")
print("Summary saved to examples/qc_summary.csv")

bl run fastq_qc.bio
# Reading examples/sample.fastq...
# === QC Report ===
# ...
# Summary saved to examples/qc_summary.csv

Tips

  • For large files, use fastq() (stream) instead of read_fastq() (table) to process reads in constant memory. See the Streaming tutorial.
  • The pipe operator |> lets you build readable pipelines. Each step receives the output of the previous step.
  • Use bl run --verbose qc.bio to see timing information for each step.
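
As a sketch of the first tip, the filters from Step 4 can run over a lazy stream, so reads are processed one at a time and only those that pass are ever held in memory:

# Stream reads lazily; each filter runs per-read in constant memory
fastq("examples/sample-data/reads.fq")
  |> trim_quality(20, 30)
  |> filter_reads("quality", 25)
  |> collect()
  |> into clean

print(f"Clean reads: {len(clean)}")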

Next Steps

Learn how to work with structured tabular data in the Working with Tables tutorial.