
Journaling, CoW, and snapshots


Why consistency is hard

  • Filesystems must update many blocks atomically (metadata + data) despite crashes and reordering.
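  • Storage layers (volatile device caches, command queuing such as NCQ) may reorder writes; power loss discards anything still in a volatile cache unless a flush or FUA write forced it to stable media.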
  • Goals: metadata integrity, optional data integrity, bounded recovery time.
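
To make the problem concrete, the sketch below enumerates what can survive a crash while appending one block to a file. It assumes a deliberately simplified model in which the append needs exactly three writes (inode, allocation bitmap, data block) that may reach disk in any combination. The names `WRITES` and `classify` and the verdict labels are made up for illustration.

```javascript
// Appending one block to a file touches three on-disk structures.
const WRITES = ["inode", "bitmap", "data"];

// Classify the on-disk state given which writes survived the crash.
function classify(persisted) {
  const has = (w) => persisted.includes(w);
  // The inode should reference the block iff the bitmap marks it allocated.
  if (has("inode") !== has("bitmap")) return "metadata inconsistent";
  // Metadata agrees, but the referenced block never received its contents.
  if (has("inode") && !has("data")) return "stale data exposed";
  return "consistent";
}

// Enumerate every subset of the three writes that could reach stable media.
const outcomes = {};
for (let mask = 0; mask < 8; mask++) {
  const persisted = WRITES.filter((_, i) => (mask & (1 << i)) !== 0);
  const verdict = classify(persisted);
  outcomes[verdict] = (outcomes[verdict] || 0) + 1;
}
console.log(outcomes);
```

Under this model only 3 of the 8 possible crash states are consistent, 4 leave the metadata contradicting itself, and 1 has consistent metadata pointing at a block whose contents were never written.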

Journaling (writeback, ordered, full)

  • Intent log: write redo/undo records to a journal before touching home locations.
  • Modes:
    • Writeback: metadata journaled; data can reach disk before/after metadata — highest throughput, risk of stale data exposure.
    • Ordered: metadata journaled; data must be on disk before metadata commit (ext4 default) — prevents stale data exposure.
    • Full data journaling: both data and metadata journaled — safest, slowest.
  • Replay: on mount, scan journal, redo committed transactions, discard incomplete ones.
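// Simplified journal commit sequence (conceptual redo log)
function commitTxn(txn) {
  // 1) write intent records to journal
  // 2) flush journal to stable media (barrier)
  // 3) write the commit record and flush again; the transaction is now durable
  // 4) checkpoint: apply updates to home blocks
  // 5) once the home blocks are stable, trim the transaction from the journal
}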
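
The replay rule from the list above (redo committed transactions, discard incomplete ones) can be sketched with a toy in-memory model. This is not any real filesystem's journal format: the record fields (`type`, `txn`, `block`, `data`) and the `replay` function are assumptions made for illustration.

```javascript
// Toy redo-log replay. Record layout and names are hypothetical.
function replay(journal, disk) {
  // Pass 1: find the transactions that have a commit record.
  const committed = new Set();
  for (const rec of journal) {
    if (rec.type === "commit") committed.add(rec.txn);
  }
  // Pass 2: redo the writes of committed transactions in log order;
  // writes that belong to uncommitted transactions are skipped.
  for (const rec of journal) {
    if (rec.type === "write" && committed.has(rec.txn)) {
      disk.set(rec.block, rec.data);
    }
  }
  return disk;
}

// Journal as it might look after a crash: txn 1 committed, txn 2 did not.
const journal = [
  { type: "write", txn: 1, block: 10, data: "inode-v2" },
  { type: "write", txn: 1, block: 11, data: "bitmap-v2" },
  { type: "commit", txn: 1 },
  { type: "write", txn: 2, block: 10, data: "inode-v3" },
];
const disk = new Map([[10, "inode-v1"], [11, "bitmap-v1"]]);
replay(journal, disk);
console.log([...disk.entries()]);
```

Because replay only rewrites blocks with the logged contents, running it a second time (for example after a crash during recovery) leaves the same result.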

Write ordering, barriers, and fsync

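  • Barriers/FUA: a cache flush makes all earlier writes stable before later ones are issued; a FUA (Force Unit Access) write bypasses the volatile cache for that single request.
  • fsync()/fdatasync(): make a file's data durable; fdatasync() may skip metadata that is not needed to read the data back (e.g., timestamps). A newly created or renamed file may also need an fsync() on its parent directory. Expensive, but required for databases.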
  • Group commit: batch many small transactions to amortize flush costs.
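
Group commit is easiest to see by counting flushes. The sketch below assumes a toy log device whose only cost is the number of `flush()` calls. `LogDevice`, `commitEach`, and `commitGrouped` are hypothetical names, and a real implementation would batch by time or queue depth rather than a fixed size.

```javascript
// Toy log device that counts flushes. All names are hypothetical.
class LogDevice {
  constructor() {
    this.pending = []; // written, but still in the volatile cache
    this.stable = [];  // survived a flush
    this.flushes = 0;
  }
  write(record) { this.pending.push(record); }
  flush() {
    this.stable.push(...this.pending);
    this.pending = [];
    this.flushes += 1;
  }
}

// One flush per transaction: durable, but pays the flush cost every time.
function commitEach(dev, txns) {
  for (const t of txns) { dev.write(t); dev.flush(); }
}

// Group commit: write up to batchSize transactions, then flush once.
// A caller would be acknowledged only after the flush that covers it.
function commitGrouped(dev, txns, batchSize) {
  for (let i = 0; i < txns.length; i += batchSize) {
    for (const t of txns.slice(i, i + batchSize)) dev.write(t);
    dev.flush();
  }
}

const txns = Array.from({ length: 100 }, (_, i) => `txn-${i}`);
const perTxn = new LogDevice();
const grouped = new LogDevice();
commitEach(perTxn, txns);
commitGrouped(grouped, txns, 8);
console.log(perTxn.flushes, grouped.flushes); // 100 13
```

Both devices end with all 100 transactions stable; the grouped one trades a little latency per transaction for roughly an eighth of the flushes.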

Copy-on-Write trees

  • Modify by writing new blocks for modified nodes; update parent pointers bottom-up; finally, atomically switch the root.
  • Benefits: snapshots, clones, checksums, crash consistency by design (no in-place overwrites).
  • Costs: write amplification, fragmentation; mitigated with allocators and background cleaners.
// CoW publish step (conceptual)
function publishNewRoot(oldRoot, changedLeaf){
  // walk toward root, copying and updating pointers
  // at the end, atomically flip superblock root pointer to the new root
}
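
A runnable version of the same idea, under simplifying assumptions: a binary tree over four leaves held in memory, with frozen objects standing in for immutable on-disk blocks. The helpers `leaf`, `node`, `cowUpdate`, and `read`, and the `superblock` object, are illustrative names rather than any filesystem's structures.

```javascript
// Toy CoW binary tree. Nodes are frozen, so nothing is modified in place.
function leaf(data) { return Object.freeze({ data }); }
function node(left, right) { return Object.freeze({ left, right }); }

// Return a new root with leaf `index` replaced; `depth` is the tree height.
// Only the nodes on the path from the root to that leaf are copied.
function cowUpdate(root, depth, index, data) {
  if (depth === 0) return leaf(data);
  const bit = (index >> (depth - 1)) & 1; // 0 = go left, 1 = go right
  return bit === 0
    ? node(cowUpdate(root.left, depth - 1, index, data), root.right)
    : node(root.left, cowUpdate(root.right, depth - 1, index, data));
}

function read(root, depth, index) {
  let n = root;
  for (let d = depth; d > 0; d--) {
    n = ((index >> (d - 1)) & 1) ? n.right : n.left;
  }
  return n.data;
}

const DEPTH = 2;
const v1 = node(node(leaf("a"), leaf("b")), node(leaf("c"), leaf("d")));
const superblock = { root: v1 };

// Build the new version off to the side, then publish it with one pointer flip.
const v2 = cowUpdate(superblock.root, DEPTH, 2, "C");
superblock.root = v2;

console.log(read(v2, DEPTH, 2), read(v1, DEPTH, 2)); // C c
console.log(v2.left === v1.left); // true: the untouched subtree is shared
```

A crash before the pointer flip leaves `v1` fully intact; a crash after it leaves `v2` fully visible. There is no intermediate state to repair.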

Snapshots and clones

  • Snapshot: point-in-time, read-only view built by preserving old blocks via CoW.
  • Clone: writable copy that initially shares all blocks with its source and diverges via CoW on write (e.g., ZFS clones of snapshots, APFS file clones).
  • Use cases: backup/restore, testing, incremental replication, VM images.
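
The difference between a snapshot and a clone can be sketched with a toy volume whose root is a frozen array of block references. This is an assumption-heavy simplification: real systems use trees so that a write copies only a path, not the whole pointer array, and the `Volume` class here is hypothetical.

```javascript
// Toy volume: the "root" is a frozen array of block references.
// Writes build a new root; old roots stay valid while something points at them.
class Volume {
  constructor(root) { this.root = root; }
  read(i) { return this.root[i].data; }
  write(i, data) {
    const next = this.root.slice();     // copy pointers, not block contents
    next[i] = Object.freeze({ data });  // new block; the old one is untouched
    this.root = Object.freeze(next);
  }
  // Snapshot: a read-only view obtained by keeping the current root.
  snapshot() { return this.root; }
  // Clone: a writable volume that starts out sharing every block.
  static cloneFrom(snap) { return new Volume(snap); }
}

const blocks = ["a", "b", "c"].map((d) => Object.freeze({ data: d }));
const vol = new Volume(Object.freeze(blocks));

const snap = vol.snapshot();
vol.write(1, "B");   // the live volume diverges from the snapshot

const clone = Volume.cloneFrom(snap);
clone.write(2, "C"); // the clone diverges independently

console.log(vol.read(1), snap[1].data, clone.read(1)); // B b b
console.log(clone.root[0] === vol.root[0]);            // true: block 0 is still shared
```

Creating the snapshot or the clone copies no data at all; space is consumed only as the live volume or the clone overwrites blocks.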

Crash recovery and integrity

  • Journaling: fast replay; integrity depends on mode and barrier correctness.
  • CoW: atomic root switch; pair with checksums to detect corruption (ZFS checksums data and metadata, APFS metadata only) and self-heal from redundant copies (ZFS).
  • fsck: last resort for non-journaled or damaged journals; can be slow on large volumes.
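
Checksum-based detection and self-healing can be illustrated with a two-way mirror. The sketch assumes the expected checksum is stored alongside the pointer to the block rather than inside the block, and uses an FNV-1a-style hash as a stand-in for a real block checksum. `checksum` and `readWithRepair` are illustrative names, not ZFS interfaces.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units, standing in for a real checksum.
function checksum(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Read a block from mirrored copies, verifying each against the expected checksum.
function readWithRepair(mirrors, index, expected) {
  for (const m of mirrors) {
    if (checksum(m[index]) === expected) {
      // Good copy found: rewrite any copy that fails verification.
      for (const other of mirrors) {
        if (checksum(other[index]) !== expected) other[index] = m[index];
      }
      return m[index];
    }
  }
  throw new Error(`block ${index}: all copies failed checksum`);
}

const good = "file contents";
const expected = checksum(good);             // kept with the pointer, not the block
const mirrors = [["file c0ntents"], [good]]; // first copy silently corrupted

console.log(readWithRepair(mirrors, 0, expected)); // file contents
console.log(mirrors[0][0] === good);               // true: the bad copy was rewritten
```

Without redundancy the same check can still detect the corruption, but the read can only fail instead of repairing.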

Performance trade-offs

  • Journaling overhead: extra writes + flushes; mitigated via batching and writeback policies.
  • CoW overhead: write amplification; mitigated with large extents, log-structured layouts, compression.
  • Snapshots: cheap to create, but heavy churn can increase fragmentation and metadata pressure.
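
A back-of-the-envelope model of CoW write amplification: every dirty leaf forces a rewrite of each ancestor up to the root, but updates that share ancestors share that cost. The sketch assumes a complete tree with a fixed fanout and ignores the superblock write; `blocksWritten` is a hypothetical helper, not a measurement of any real filesystem.

```javascript
// Count the blocks a CoW tree must write for a batch of leaf updates.
// Model: complete tree, `fanout` children per node, `height` interior levels.
function blocksWritten(leafIndexes, fanout, height) {
  let level = new Set(leafIndexes); // distinct dirty nodes at the current level
  let total = level.size;           // the leaves themselves
  for (let h = 0; h < height; h++) {
    // Each dirty node dirties its parent; duplicates collapse in the Set.
    level = new Set([...level].map((i) => Math.floor(i / fanout)));
    total += level.size;
  }
  return total;
}

const fanout = 16;
const height = 3; // 16^3 = 4096 leaves
console.log(blocksWritten([0], fanout, height));                   // 4
console.log(blocksWritten([0, 300, 1500, 4000], fanout, height));  // 13
console.log(blocksWritten([0, 1, 2, 3], fanout, height));          // 7
```

In this model four scattered updates write 13 blocks (3.25 per logical write), while four adjacent ones write 7 (1.75 per logical write), which is one reason large extents and batching help.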

Exercises

  1. Create a write-ahead log for a toy key-value store; add group commit and crash recovery.
  2. Measure fsync throughput with and without batching on your storage device.
  3. Use APFS/ZFS snapshots to implement periodic backups; test restore under workload.

Journaling makes multi-block updates recoverable by logging them before they are applied; CoW makes updates atomic by design, since nothing is overwritten in place. Snapshots fall out naturally from CoW.