
Day 10: Tables — The Bioinformatician’s Workbench

Difficulty: Intermediate
Biology knowledge: Basic (gene names, chromosomes, expression data)
Coding knowledge: Intermediate (pipes, closures, records)
Time: ~3 hours
Prerequisites: Days 1-9 completed, BioLang installed (see Appendix A)
Data needed: Generated by init.bl (CSV files)
Requirements: None (offline)

What You’ll Learn

  • How to create tables from CSV files, records, and column vectors
  • How to select, drop, and rename columns
  • How to filter rows with predicates
  • How to add and transform columns with mutate
  • How to sort, slice, and deduplicate rows
  • How to group rows and compute summaries (split-apply-combine)
  • How to join tables by key columns (inner, left, right, outer, anti, semi)
  • How to reshape between wide and long formats (pivot)
  • How to use window functions for running totals and ranks
  • How to chain all of these into a complete analysis pipeline

The Problem

Every analysis produces tabular data — gene expression matrices, variant call results, sample metadata, statistical summaries. A differential expression tool gives you thousands of rows with gene names, fold changes, and p-values. A variant caller gives you chromosomes, positions, and quality scores. A clinical database gives you patient IDs, phenotypes, and treatment groups.

Knowing how to slice, dice, join, reshape, and summarize tables is the single most valuable data skill in bioinformatics. It is the skill that turns raw output into biological insight.

In R, this is dplyr and tidyr. In Python, this is pandas. In BioLang, tables are built in — no imports, no package managers, no configuration. You load a CSV and start working.


Creating Tables

There are three ways to get data into a table.

From CSV/TSV Files

The most common case: you have a file from another tool.

Requires CLI: This example uses file I/O not available in the browser. Run with bl run.

let expr = csv("data/expression.csv")
println(f"Rows: {nrow(expr)}, Cols: {ncol(expr)}")
println(f"Columns: {colnames(expr)}")
println(expr |> head(3))

Expected output:

Rows: 20, Cols: 6
Columns: [gene, log2fc, pval, padj, chr, biotype]
gene   | log2fc | pval     | padj     | chr | biotype
EGFR   | 3.8    | 0.000001 | 0.00001  | 7   | protein_coding
BRCA1  | 2.4    | 0.001    | 0.005    | 17  | protein_coding
VEGFA  | 2.1    | 0.002    | 0.008    | 6   | protein_coding

csv() reads comma-separated files. For tab-separated files, use tsv(). Both auto-detect headers and infer column types (integers, floats, strings).
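For example, a tab-separated file loads with the same one-liner (a sketch; data/variants.tsv and its contents are hypothetical):

let variants = tsv("data/variants.tsv")
println(f"Loaded {nrow(variants)} variants")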

From a List of Records

When you construct data programmatically, build a list of records and convert it.

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001, chr: "17"},
    {gene: "TP53", log2fc: -1.1, pval: 0.23, chr: "17"},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001, chr: "7"},
    {gene: "MYC", log2fc: 1.9, pval: 0.04, chr: "8"},
    {gene: "KRAS", log2fc: -0.3, pval: 0.67, chr: "12"},
] |> to_table()

println(data)

Expected output:

gene  | log2fc | pval     | chr
BRCA1 | 2.4    | 0.001    | 17
TP53  | -1.1   | 0.23     | 17
EGFR  | 3.8    | 0.000001 | 7
MYC   | 1.9    | 0.04     | 8
KRAS  | -0.3   | 0.67     | 12

From Column Vectors

When you already have parallel arrays, pass a record of lists.

let t = table({
    gene: ["BRCA1", "TP53", "EGFR"],
    value: [1.0, 2.0, 3.0]
})
println(t)

Expected output:

gene  | value
BRCA1 | 1.0
TP53  | 2.0
EGFR  | 3.0

This is the Polars/R column-oriented style. Each key is a column name, each value is a list of that column’s data. All lists must have the same length.


Selecting Columns

Tables often have more columns than you need. select() keeps only the ones you name. drop_cols() removes the ones you don’t want.

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001, chr: "17"},
    {gene: "TP53", log2fc: -1.1, pval: 0.23, chr: "17"},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001, chr: "7"},
] |> to_table()

# Keep specific columns
let slim = data |> select("gene", "pval")
println(slim)

Expected output:

gene  | pval
BRCA1 | 0.001
TP53  | 0.23
EGFR  | 0.000001

# Drop a column
let no_chr = data |> drop_cols("chr")
println(no_chr)

Expected output:

gene  | log2fc | pval
BRCA1 | 2.4    | 0.001
TP53  | -1.1   | 0.23
EGFR  | 3.8    | 0.000001

# Rename a column
let renamed = data |> rename("log2fc", "fold_change")
println(renamed)

Expected output:

gene  | fold_change | pval     | chr
BRCA1 | 2.4         | 0.001    | 17
TP53  | -1.1        | 0.23     | 17
EGFR  | 3.8         | 0.000001 | 7

select() takes the table as the first argument (piped) and column names as the remaining arguments. rename() takes the old name and the new name.


Filtering Rows

filter() keeps only the rows where a predicate returns true. The predicate receives each row as a record.

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001, chr: "17"},
    {gene: "TP53", log2fc: -1.1, pval: 0.23, chr: "17"},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001, chr: "7"},
    {gene: "MYC", log2fc: 1.9, pval: 0.04, chr: "8"},
    {gene: "KRAS", log2fc: -0.3, pval: 0.67, chr: "12"},
] |> to_table()

# Single condition: significant genes
let sig = data |> filter(|r| r.pval < 0.05)
println(sig)

Expected output:

gene  | log2fc | pval     | chr
BRCA1 | 2.4    | 0.001    | 17
EGFR  | 3.8    | 0.000001 | 7
MYC   | 1.9    | 0.04     | 8

# Multiple conditions: significant AND upregulated
let sig_up = data |> filter(|r| r.pval < 0.05 and r.log2fc > 1.0)
println(sig_up)

Expected output:

gene  | log2fc | pval     | chr
BRCA1 | 2.4    | 0.001    | 17
EGFR  | 3.8    | 0.000001 | 7
MYC   | 1.9    | 0.04     | 8

# Filter by category
let chr17 = data |> filter(|r| r.chr == "17")
println(chr17)

Expected output:

gene  | log2fc | pval  | chr
BRCA1 | 2.4    | 0.001 | 17
TP53  | -1.1   | 0.23  | 17

You can combine conditions with and and or. Parentheses clarify precedence when mixing them:

# Chromosome 17 OR very significant
let subset = data |> filter(|r| r.chr == "17" or r.pval < 0.001)
println(subset)

Expected output:

gene  | log2fc | pval     | chr
BRCA1 | 2.4    | 0.001    | 17
TP53  | -1.1   | 0.23     | 17
EGFR  | 3.8    | 0.000001 | 7
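A sketch of a mixed condition, using the same data table as above. The parentheses make the grouping explicit, so the or is evaluated before the and:

# Significant AND (strongly up OR on chromosome 17)
let mixed = data |> filter(|r| r.pval < 0.05 and (r.log2fc > 1.0 or r.chr == "17"))
println(mixed)

Expected output:

gene  | log2fc | pval     | chr
BRCA1 | 2.4    | 0.001    | 17
EGFR  | 3.8    | 0.000001 | 7
MYC   | 1.9    | 0.04     | 8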

Mutating: Adding and Transforming Columns

mutate() adds a new column (or replaces an existing one) by applying a function to each row. It takes three arguments: the table, the new column name, and a closure that receives each row as a record.

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001},
    {gene: "TP53", log2fc: -1.1, pval: 0.23},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001},
    {gene: "MYC", log2fc: 1.9, pval: 0.04},
    {gene: "KRAS", log2fc: -0.3, pval: 0.67},
] |> to_table()

# Add a significance flag
let with_sig = data |> mutate("significant", |r| r.pval < 0.05)
println(with_sig)

Expected output:

gene  | log2fc | pval     | significant
BRCA1 | 2.4    | 0.001    | true
TP53  | -1.1   | 0.23     | false
EGFR  | 3.8    | 0.000001 | true
MYC   | 1.9    | 0.04     | true
KRAS  | -0.3   | 0.67     | false

# Add a direction column
let with_dir = data |> mutate("direction", |r| if r.log2fc > 0 { "up" } else { "down" })
println(with_dir)

Expected output:

gene  | log2fc | pval     | direction
BRCA1 | 2.4    | 0.001    | up
TP53  | -1.1   | 0.23     | down
EGFR  | 3.8    | 0.000001 | up
MYC   | 1.9    | 0.04     | up
KRAS  | -0.3   | 0.67     | down

# Add a negative log10 p-value (common for volcano plots)
let with_nlp = data |> mutate("neg_log_p", |r| -1.0 * log10(r.pval))
println(with_nlp)

Expected output:

gene  | log2fc | pval     | neg_log_p
BRCA1 | 2.4    | 0.001    | 3.0
TP53  | -1.1   | 0.23     | 0.638...
EGFR  | 3.8    | 0.000001 | 6.0
MYC   | 1.9    | 0.04     | 1.397...
KRAS  | -0.3   | 0.67     | 0.173...

To add multiple columns, chain mutate() calls:

let enriched = data
    |> mutate("significant", |r| r.pval < 0.05)
    |> mutate("direction", |r| if r.log2fc > 0 { "up" } else { "down" })
    |> mutate("neg_log_p", |r| -1.0 * log10(r.pval))
println(enriched)

Each mutate() adds one column. The pipe chains them together so the result flows naturally.


Sorting

arrange() sorts a table by a column in ascending order. For descending order, pipe through reverse().

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001},
    {gene: "TP53", log2fc: -1.1, pval: 0.23},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001},
    {gene: "MYC", log2fc: 1.9, pval: 0.04},
] |> to_table()

# Sort by p-value (ascending — most significant first)
let by_pval = data |> arrange("pval")
println(by_pval)

Expected output:

gene  | log2fc | pval
EGFR  | 3.8    | 0.000001
BRCA1 | 2.4    | 0.001
MYC   | 1.9    | 0.04
TP53  | -1.1   | 0.23

# Sort by fold change descending (largest first)
let by_fc_desc = data |> arrange("log2fc") |> reverse()
println(by_fc_desc)

Expected output:

gene  | log2fc | pval
EGFR  | 3.8    | 0.000001
BRCA1 | 2.4    | 0.001
MYC   | 1.9    | 0.04
TP53  | -1.1   | 0.23

Combine with head() to get top-N results:

# Top 2 most significant genes
let top2 = data |> arrange("pval") |> head(2)
println(top2)

Expected output:

gene  | log2fc | pval
EGFR  | 3.8    | 0.000001
BRCA1 | 2.4    | 0.001
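The same pattern works from the other end. A sketch using tail() (listed in the cheat sheet at the end of this chapter; assumed here to return the last n rows):

# Bottom 2: least significant genes
let bottom2 = data |> arrange("pval") |> tail(2)
println(bottom2)

Expected output:

gene  | log2fc | pval
MYC   | 1.9    | 0.04
TP53  | -1.1   | 0.23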

Grouping and Summarizing

The most powerful table operation is split-apply-combine: split the data into groups, apply an aggregation to each group, and combine the results into a new table.

The Pattern

Group-by / summarize flow

group_by() splits a table into a map of subtables, keyed by the distinct values in the grouping column. summarize() then takes that map and a function that receives each key and subtable, and must return a record. The records are assembled into a new table.

let data = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001, chr: "17"},
    {gene: "TP53", log2fc: -1.1, pval: 0.23, chr: "17"},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001, chr: "7"},
    {gene: "MYC", log2fc: 1.9, pval: 0.04, chr: "8"},
    {gene: "KRAS", log2fc: -0.3, pval: 0.67, chr: "12"},
] |> to_table()

# Count genes per chromosome
let chr_counts = data
    |> group_by("chr")
    |> summarize(|key, subtable| {
        chr: key,
        gene_count: nrow(subtable)
    })
println(chr_counts)

Expected output:

chr | gene_count
7   | 1
8   | 1
12  | 1
17  | 2

# Mean fold change per chromosome
let chr_means = data
    |> group_by("chr")
    |> summarize(|key, subtable| {
        chr: key,
        mean_fc: col_mean(subtable, "log2fc"),
        n_genes: nrow(subtable)
    })
println(chr_means)

Expected output:

chr | mean_fc | n_genes
7   | 3.8     | 1
8   | 1.9     | 1
12  | -0.3    | 1
17  | 0.65    | 2

The summarize function can compute any aggregation you want. Use col_mean(), col_sum(), col_min(), col_max(), col_stdev() for numeric columns, and nrow() for counts.
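A sketch using the min/max helpers in the same split-apply-combine pattern (row order assumed to match the earlier grouped output):

# Fold-change range per chromosome
let chr_range = data
    |> group_by("chr")
    |> summarize(|key, subtable| {
        chr: key,
        min_fc: col_min(subtable, "log2fc"),
        max_fc: col_max(subtable, "log2fc")
    })
println(chr_range)

Expected output:

chr | min_fc | max_fc
7   | 3.8    | 3.8
8   | 1.9    | 1.9
12  | -0.3   | -0.3
17  | -1.1   | 2.4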

Quick Counts with count_by

For the common case of just counting groups, count_by() is a shortcut:

let chr_counts = data |> count_by("chr")
println(chr_counts)

Expected output:

chr | count
7   | 1
8   | 1
12  | 1
17  | 2
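count_by() composes with the sorting tools from earlier, for example to put the busiest chromosomes first (a sketch; the relative order of tied counts is not guaranteed):

# Chromosomes with the most genes first
let busiest = data |> count_by("chr") |> arrange("count") |> reverse()
println(busiest)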

Joining Tables

Joins connect two tables by matching rows on a shared key column. This is how you annotate results with metadata, link identifiers across databases, or combine measurements from different experiments.

Setting Up Two Tables

let results = [
    {gene: "BRCA1", log2fc: 2.4, pval: 0.001},
    {gene: "TP53", log2fc: -1.1, pval: 0.23},
    {gene: "EGFR", log2fc: 3.8, pval: 0.000001},
    {gene: "MYC", log2fc: 1.9, pval: 0.04},
    {gene: "KRAS", log2fc: -0.3, pval: 0.67},
] |> to_table()

let annotations = [
    {gene: "BRCA1", full_name: "BRCA1 DNA repair", pathway: "DNA repair"},
    {gene: "TP53", full_name: "Tumor protein p53", pathway: "Apoptosis"},
    {gene: "EGFR", full_name: "EGF receptor", pathway: "Signaling"},
    {gene: "MYC", full_name: "MYC proto-oncogene", pathway: "Cell cycle"},
    {gene: "PTEN", full_name: "Phosphatase tensin homolog", pathway: "Signaling"},
] |> to_table()

Note that KRAS is in results but not annotations, and PTEN is in annotations but not results.

Inner Join

Keeps only rows present in both tables.

let annotated = inner_join(results, annotations, "gene")
println(annotated)
println(f"Inner join: {nrow(annotated)} rows")

Expected output:

gene  | log2fc | pval     | full_name          | pathway
BRCA1 | 2.4    | 0.001    | BRCA1 DNA repair   | DNA repair
TP53  | -1.1   | 0.23     | Tumor protein p53  | Apoptosis
EGFR  | 3.8    | 0.000001 | EGF receptor       | Signaling
MYC   | 1.9    | 0.04     | MYC proto-oncogene | Cell cycle
Inner join: 4 rows

KRAS is dropped (no annotation). PTEN is dropped (no result).

Left Join

Keeps all rows from the left table. Where the right table has no match, those columns are nil.

let full = left_join(results, annotations, "gene")
println(full)
println(f"Left join: {nrow(full)} rows")

Expected output:

gene  | log2fc | pval     | full_name          | pathway
BRCA1 | 2.4    | 0.001    | BRCA1 DNA repair   | DNA repair
TP53  | -1.1   | 0.23     | Tumor protein p53  | Apoptosis
EGFR  | 3.8    | 0.000001 | EGF receptor       | Signaling
MYC   | 1.9    | 0.04     | MYC proto-oncogene | Cell cycle
KRAS  | -0.3   | 0.67     | nil                | nil
Left join: 5 rows

KRAS is kept with nil annotations. PTEN is dropped (not in results).

Anti Join

Returns rows from the left table that have no match in the right table. This is the “what’s missing?” join.

let missing = anti_join(results, annotations, "gene")
println(missing)
println(f"Missing annotations: {nrow(missing)} genes")

Expected output:

gene  | log2fc | pval
KRAS  | -0.3   | 0.67
Missing annotations: 1 genes

Semi Join

Returns rows from the left table that do have a match in the right table, but without adding columns from the right table. It is a filter, not a column merger.

let has_annotation = semi_join(results, annotations, "gene")
println(has_annotation)

Expected output:

gene  | log2fc | pval
BRCA1 | 2.4    | 0.001
TP53  | -1.1   | 0.23
EGFR  | 3.8    | 0.000001
MYC   | 1.9    | 0.04

All Join Types at a Glance

inner_join(A, B, key):  A ∩ B     — only matching rows
left_join(A, B, key):   all A     — all of A, matching from B (nil where missing)
right_join(A, B, key):  all B     — all of B, matching from A (nil where missing)
outer_join(A, B, key):  A ∪ B     — all rows from both (nil where missing)
anti_join(A, B, key):   A - B     — rows in A with no match in B
semi_join(A, B, key):   A ∩∃ B    — rows in A that have a match in B (no extra columns)

When to use which:

Situation                                    Join
Annotate results with gene info              left_join (keep all results)
Find shared genes between two experiments    inner_join
Find genes unique to one experiment          anti_join
Merge all data from both sources             outer_join
Filter results to genes in a known set       semi_join
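As a quick size check, a sketch of outer_join on the same results/annotations tables (row order in the merged table may vary):

let everything = outer_join(results, annotations, "gene")
println(f"Outer join: {nrow(everything)} rows")

Expected output:

Outer join: 6 rows

KRAS appears with nil annotation columns, and PTEN with nil result columns.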

Reshaping: Pivot Wider and Longer

Biological data comes in two shapes. Wide format has one row per entity (e.g., one row per gene, one column per sample). Long format has one row per measurement (e.g., one row per gene-sample combination).

Long to Wide: pivot_wider

You have expression measurements in long (tidy) format:

let long = [
    {gene: "BRCA1", sample: "S1", expression: 5.2},
    {gene: "BRCA1", sample: "S2", expression: 8.1},
    {gene: "TP53", sample: "S1", expression: 3.4},
    {gene: "TP53", sample: "S2", expression: 7.6},
] |> to_table()

println("Long format:")
println(long)

Expected output:

Long format:
gene  | sample | expression
BRCA1 | S1     | 5.2
BRCA1 | S2     | 8.1
TP53  | S1     | 3.4
TP53  | S2     | 7.6

pivot_wider() spreads the sample names into columns:

let wide = long |> pivot_wider("sample", "expression")
println("Wide format:")
println(wide)

Expected output:

Wide format:
gene  | S1  | S2
BRCA1 | 5.2 | 8.1
TP53  | 3.4 | 7.6

The first argument (piped) is the table. The second argument is the column whose values become new column names. The third argument is the column whose values fill those new columns. All other columns (here, gene) become the row identifiers.

Wide to Long: pivot_longer

Going the other direction, pivot_longer() gathers columns back into rows:

let back_to_long = wide |> pivot_longer(["S1", "S2"], "sample", "expression")
println("Back to long:")
println(back_to_long)

Expected output:

Back to long:
gene  | sample | expression
BRCA1 | S1     | 5.2
BRCA1 | S2     | 8.1
TP53  | S1     | 3.4
TP53  | S2     | 7.6

The first argument (piped) is the table. The second argument is a list of column names to gather. The third argument is the name for the new “names” column. The fourth argument is the name for the new “values” column.

The Visual Transformation

PIVOT WIDER                         PIVOT LONGER

gene  | sample | expr               gene  | S1  | S2
------+--------+-----    ====>      ------+-----+-----
BRCA1 | S1     | 5.2                BRCA1 | 5.2 | 8.1
BRCA1 | S2     | 8.1                TP53  | 3.4 | 7.6
TP53  | S1     | 3.4
TP53  | S2     | 7.6     <====

  3 columns, 4 rows            1+N columns, 2 rows
  (one row per measurement)    (one row per gene)

When to use which:

  • Pivot wider when you need a matrix for computation (e.g., gene-by-sample expression matrix for heatmaps, PCA, clustering)
  • Pivot longer when you need tidy data for filtering, grouping, and plotting (e.g., faceted plots, group_by + summarize)

Window Functions

Window functions compute a value for each row based on its position or neighbors, without collapsing the table.

Row Numbers and Ranks

let data = [
    {gene: "BRCA1", pval: 0.001},
    {gene: "EGFR", pval: 0.000001},
    {gene: "MYC", pval: 0.04},
    {gene: "TP53", pval: 0.23},
] |> to_table()

# Add row numbers
let numbered = data |> row_number()
println(numbered)

Expected output:

gene  | pval     | row_number
BRCA1 | 0.001    | 1
EGFR  | 0.000001 | 2
MYC   | 0.04     | 3
TP53  | 0.23     | 4

# Rank by p-value
let ranked = data |> rank("pval")
println(ranked)

Expected output:

gene  | pval     | rank
BRCA1 | 0.001    | 2
EGFR  | 0.000001 | 1
MYC   | 0.04     | 3
TP53  | 0.23     | 4

Cumulative Functions

let data = [
    {gene: "A", count: 10},
    {gene: "B", count: 25},
    {gene: "C", count: 15},
    {gene: "D", count: 30},
] |> to_table()

# Cumulative sum
let with_cumsum = data |> cumsum("count")
println(with_cumsum)

Expected output:

gene | count | cumsum
A    | 10    | 10
B    | 25    | 35
C    | 15    | 50
D    | 30    | 80
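cummax() and cummin() (listed in the cheat sheet) follow the same shape. A sketch, assuming the new column is named after the function, as with cumsum:

# Running maximum of counts
let with_cummax = data |> cummax("count")
println(with_cummax)

Expected output:

gene | count | cummax
A    | 10    | 10
B    | 25    | 25
C    | 15    | 25
D    | 30    | 30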

Rolling Mean

Smooths noisy data by averaging over a sliding window.

let timeseries = [
    {day: 1, value: 10.0},
    {day: 2, value: 12.0},
    {day: 3, value: 8.0},
    {day: 4, value: 15.0},
    {day: 5, value: 11.0},
    {day: 6, value: 14.0},
] |> to_table()

let smoothed = timeseries |> rolling_mean("value", 3)
println(smoothed)

Expected output:

day | value | rolling_mean
1   | 10.0  | 10.0
2   | 12.0  | 11.0
3   | 8.0   | 10.0
4   | 15.0  | 11.666...
5   | 11.0  | 11.333...
6   | 14.0  | 13.333...

The third argument is the window size. The first few rows use a smaller window (whatever data is available).

Lag and Lead

Access values from previous or next rows — useful for computing changes between consecutive measurements.

let data = [
    {day: 1, expression: 2.0},
    {day: 2, expression: 4.5},
    {day: 3, expression: 3.8},
    {day: 4, expression: 6.1},
] |> to_table()

# Previous day's value
let with_lag = data |> lag("expression")
println(with_lag)

Expected output:

day | expression | lag
1   | 2.0        | nil
2   | 4.5        | 2.0
3   | 3.8        | 4.5
4   | 6.1        | 3.8
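Lag pairs naturally with mutate() for computing day-over-day changes. A sketch, assuming lag() names its new column lag as shown above and that nil rows can be excluded with !=:

# Change from the previous day (drop the first row, whose lag is nil)
let deltas = data
    |> lag("expression")
    |> filter(|r| r.lag != nil)
    |> mutate("delta", |r| r.expression - r.lag)
println(deltas)

Expected output:

day | expression | lag | delta
2   | 4.5        | 2.0 | 2.5
3   | 3.8        | 4.5 | -0.7
4   | 6.1        | 3.8 | 2.3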

Complete Example: Expression Analysis Pipeline

This is the kind of analysis you will do repeatedly in practice: load data, annotate it, filter it, summarize it, and export the results.

Requires CLI: This example uses file I/O not available in the browser. Run with bl run.

# Complete table analysis pipeline
# Run init.bl first to generate CSV files in data/

# Step 1: Read expression results and gene annotations
let expr = csv("data/expression.csv")
let gene_info = csv("data/gene_info.csv")

println(f"Expression data: {nrow(expr)} genes x {ncol(expr)} columns")
println(f"Gene info: {nrow(gene_info)} annotations")

# Step 2: Add derived columns
let analyzed = expr
    |> mutate("significant", |r| r.padj < 0.05)
    |> mutate("direction", |r| if r.log2fc > 0 { "up" } else { "down" })
    |> mutate("neg_log_p", |r| -1.0 * log10(r.pval))

# Step 3: Count by direction among significant genes
let direction_counts = analyzed
    |> filter(|r| r.significant)
    |> count_by("direction")
println("Significant genes by direction:")
println(direction_counts)

# Step 4: Annotate with gene info
let annotated = left_join(analyzed, gene_info, "gene")

# Step 5: Top 10 most significant genes
let top10 = annotated
    |> filter(|r| r.significant)
    |> arrange("padj")
    |> head(10)
    |> select("gene", "log2fc", "padj", "pathway")
println("Top 10 significant genes:")
println(top10)

# Step 6: Summary statistics per pathway
let pathway_summary = annotated
    |> filter(|r| r.significant)
    |> group_by("pathway")
    |> summarize(|key, subtable| {
        pathway: key,
        n_genes: nrow(subtable),
        mean_fc: col_mean(subtable, "log2fc")
    })
println("Pathway summary:")
println(pathway_summary)

# Step 7: Export
write_csv(annotated, "results/annotated_results.csv")
println("Results saved to results/annotated_results.csv")

This pipeline reads data, enriches it, filters it, summarizes it, and exports it — all in a single readable chain of piped operations. Each step is self-documenting.


Table Operations Cheat Sheet

Structure

Operation   Syntax               Description
nrow        data |> nrow()       Number of rows
ncol        data |> ncol()       Number of columns
colnames    data |> colnames()   List of column names
describe    data |> describe()   Summary statistics for all columns

Column Operations

Operation   Syntax                            Description
select      data |> select("a", "b")          Keep only named columns
drop_cols   data |> drop_cols("x")            Remove named columns
rename      data |> rename("old", "new")      Rename a column
mutate      data |> mutate("col", |r| expr)   Add or replace a column

Row Operations

Operation   Syntax                      Description
filter      data |> filter(|r| cond)    Keep rows where condition is true
arrange     data |> arrange("col")      Sort by column (ascending)
reverse     data |> reverse()           Reverse row order
head        data |> head(n)             First n rows
tail        data |> tail(n)             Last n rows
slice       data |> slice(start, end)   Rows from start to end
sample      data |> sample(n)           Random n rows
distinct    data |> distinct()          Remove duplicate rows

Aggregation

Operation   Syntax                            Description
group_by    data |> group_by("col")           Split into map of subtables
summarize   groups |> summarize(|k, t| rec)   Aggregate each group into a record
count_by    data |> count_by("col")           Count rows per group (shortcut)
col_mean    col_mean(t, "col")                Mean of a numeric column
col_sum     col_sum(t, "col")                 Sum of a numeric column
col_min     col_min(t, "col")                 Minimum of a column
col_max     col_max(t, "col")                 Maximum of a column
col_stdev   col_stdev(t, "col")               Standard deviation of a column

Joins

Operation    Syntax                    Description
inner_join   inner_join(a, b, "key")   Rows in both tables
left_join    left_join(a, b, "key")    All rows from left, matching from right
right_join   right_join(a, b, "key")   All rows from right, matching from left
outer_join   outer_join(a, b, "key")   All rows from both tables
anti_join    anti_join(a, b, "key")    Left rows with no right match
semi_join    semi_join(a, b, "key")    Left rows that have a right match

Reshaping

Operation      Syntax                                               Description
pivot_wider    data |> pivot_wider("names_col", "values_col")       Long to wide
pivot_longer   data |> pivot_longer(["c1","c2"], "name", "value")   Wide to long

Window Functions

Operation      Syntax                           Description
row_number     data |> row_number()             Add sequential row numbers
rank           data |> rank("col")              Rank by column value
cumsum         data |> cumsum("col")            Cumulative sum
cummax         data |> cummax("col")            Cumulative maximum
cummin         data |> cummin("col")            Cumulative minimum
lag            data |> lag("col")               Previous row’s value
lead           data |> lead("col")              Next row’s value
rolling_mean   data |> rolling_mean("col", n)   Rolling average over n rows
rolling_sum    data |> rolling_sum("col", n)    Rolling sum over n rows

I/O

Operation   Syntax                    Description
csv         csv("file.csv")           Read CSV file into table
tsv         tsv("file.tsv")           Read TSV file into table
write_csv   write_csv(t, "out.csv")   Write table to CSV
write_tsv   write_tsv(t, "out.tsv")   Write table to TSV

Exercises

  1. Fold change calculator. Create a table of 10 genes with columns: gene, expression_control, expression_treated. Add a fold_change column (treated / control), then filter to keep only genes where fold change is greater than 2.0.

  2. Annotation join. Create a results table (gene, pval) and an annotations table (gene, pathway, description). Use left_join to annotate the results, then filter to keep only genes in the “Apoptosis” pathway.

  3. Wide to long and back. Create a wide expression matrix with columns gene, sample_A, sample_B, sample_C. Pivot it to long format. Then compute the mean expression per gene using group_by and summarize.

  4. Variant counting. Create a table of variants with columns chr, pos, ref_allele, alt_allele, quality. Use count_by("chr") to count variants per chromosome, then sort by count descending.

  5. Full pipeline. Build a complete pipeline that: reads the expression CSV (from init.bl), adds a significance column (padj < 0.05), joins with gene info, filters to significant genes only, groups by pathway, counts genes per pathway, sorts by count descending, and writes the result to a new CSV.


Key Takeaways

  • Tables are the central data structure for analysis results. Most bioinformatics output is tabular.
  • select/filter/mutate/arrange cover 80% of table operations. Master these four first.
  • group_by + summarize is the split-apply-combine pattern. It is how you compute summary statistics per category.
  • Joins connect related datasets. Learn inner_join and left_join first — they handle most annotation and linking tasks.
  • Pivot wider/longer reshapes between wide format (for computation) and long format (for grouping and plotting).
  • Chain operations with pipes for readable analysis code. Each pipe step does one thing, and the data flows top to bottom.
  • Window functions (row_number, rank, cumsum, rolling_mean, lag, lead) add context-aware columns without collapsing the table.

What’s Next

Tomorrow we compare sequences — GC content, k-mers, dotplots, motif searching, and multi-species lookups. Day 11 takes the sequence skills from Days 3-4 and scales them up to comparative analysis.