Contents · I/O Stack, DMA, Interrupts


Overview of the I/O stack

  • User space issues syscalls (read/write/send/recv/ioctl); kernel mediates via VFS (files) or socket layer (net).
  • Device drivers program devices, manage queues/rings, and service interrupts; DMA moves data without CPU copies.
  • Memory barriers and cache coherency are critical when the device and CPU communicate via shared memory (descriptor rings, doorbells): each side must see the other's writes in the intended order.

Interrupts: IRQs, MSI/MSI-X

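  • Legacy line-based (INTx) interrupts share IRQ lines, so every handler on a shared line must check whether its own device raised the interrupt; this adds latency and contention.
  • MSI/MSI-X: the device signals an interrupt with an in-band memory write that targets a CPU's local APIC, so no dedicated line is needed; MSI-X supports many vectors (up to 2048 per function), enough to give each queue its own vector.
  • Interrupt moderation/coalescing: device delays interrupts to batch completions, trading latency for throughput.
  • Softirqs/tasklets/NAPI: deferred (bottom-half) processing moves heavy work out of hard-IRQ context so the handler can return quickly.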
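
The latency-for-throughput trade-off of interrupt coalescing can be illustrated with a toy model. This is a minimal sketch, assuming a hypothetical device that raises an interrupt once `maxFrames` completions are pending or `maxDelayUs` microseconds have passed since the first pending completion, whichever comes first. The function and parameter names are illustrative and do not correspond to any real driver API.

```javascript
// Toy model of interrupt coalescing (hypothetical names, not a driver API).
// arrivals: sorted completion timestamps in microseconds.
// An interrupt fires when maxFrames completions are pending, or when
// maxDelayUs has passed since the first pending completion.
function coalesce(arrivals, maxFrames, maxDelayUs) {
  let interrupts = 0;
  let totalWaitUs = 0;
  let pending = [];

  // Raise one interrupt at time atUs and account for how long each
  // pending completion waited before being signaled.
  const fire = (atUs) => {
    interrupts++;
    for (const t of pending) totalWaitUs += atUs - t;
    pending = [];
  };

  for (const t of arrivals) {
    // If the delay timer of the current batch expired before this
    // arrival, the interrupt fired at the expiry time.
    if (pending.length > 0 && t > pending[0] + maxDelayUs) {
      fire(pending[0] + maxDelayUs);
    }
    pending.push(t);
    if (pending.length >= maxFrames) fire(t);
  }
  // Whatever is left is signaled when its timer expires.
  if (pending.length > 0) fire(pending[0] + maxDelayUs);

  return { interrupts, meanWaitUs: totalWaitUs / arrivals.length };
}

// 100 completions, one every 10 microseconds.
const arrivals = Array.from({ length: 100 }, (_, i) => i * 10);
const perPacket = coalesce(arrivals, 1, 0);  // interrupt per completion
const batched = coalesce(arrivals, 8, 50);   // coalesced
console.log(perPacket, batched);
```

Comparing the two results shows the trade: the coalesced run raises far fewer interrupts, but each completion waits longer on average before the CPU is told about it.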

Direct Memory Access (DMA)

  • Devices read/write system memory via bus mastering; CPU programs DMA descriptors/rings.
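  • IOMMU: translates device-visible (I/O virtual) addresses to physical addresses; isolates devices from memory they were not granted, and can present physically scattered pages to the device as one contiguous range.
  • Cache coherency: on non-coherent systems, drivers must explicitly clean (write back) caches before the device reads a buffer and invalidate them before the CPU reads what the device wrote; on coherent systems, memory barriers are still needed, e.g., to order descriptor writes before the doorbell write.
  • Bounce buffers: used when the device cannot DMA to the memory directly (e.g., highmem, addresses beyond the device's addressing limit, or alignment constraints); data is copied through a buffer the device can reach, at the cost of an extra copy.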
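// DMA ring concept (simplified model; real rings live in DMA-visible memory)
class DMARing {
  constructor(n){ this.desc = Array(n).fill(null); this.head = 0; this.tail = 0; this.count = 0; }
  post(buffer){ // driver side: returns false when the ring is full
    if (this.count === this.desc.length) return false;
    this.desc[this.tail] = buffer; this.tail = (this.tail+1)%this.desc.length; this.count++; return true;
  }
  complete(){ // completion side: returns null when the ring is empty
    if (this.count === 0) return null;
    const b = this.desc[this.head]; this.desc[this.head]=null; this.head=(this.head+1)%this.desc.length; this.count--; return b;
  }
}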
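
Scatter-gather can be made concrete with a small model. The sketch below assumes a hypothetical `pageMap` (virtual page number to physical page number) and a 4 KiB page size, and builds the list of physical segments a driver would hand to a device for one virtually contiguous buffer. `buildSgList` is an illustrative name, not a kernel API.

```javascript
// Build a scatter-gather list for a buffer (illustrative model, not a kernel API).
// pageMap: hypothetical Map from virtual page number to physical page number.
const PAGE_SIZE = 4096;

function buildSgList(virtAddr, length, pageMap) {
  const segments = [];
  let addr = virtAddr;
  let remaining = length;

  while (remaining > 0) {
    const vpn = Math.floor(addr / PAGE_SIZE);
    const offset = addr % PAGE_SIZE;
    if (!pageMap.has(vpn)) throw new Error(`unmapped page ${vpn}`);
    const phys = pageMap.get(vpn) * PAGE_SIZE + offset;
    // Never cross a page boundary within one step.
    const chunk = Math.min(PAGE_SIZE - offset, remaining);

    const last = segments[segments.length - 1];
    if (last && last.addr + last.len === phys) {
      last.len += chunk; // physically contiguous: extend the previous segment
    } else {
      segments.push({ addr: phys, len: chunk });
    }
    addr += chunk;
    remaining -= chunk;
  }
  return segments;
}

// Virtual pages 0..2 map to physical pages 7, 8, and 3.
const pageMap = new Map([[0, 7], [1, 8], [2, 3]]);
const sg = buildSgList(100, 10000, pageMap);
console.log(sg);
```

Here the first two pages happen to be physically adjacent and collapse into one segment, while the third needs its own. Without an IOMMU the device consumes one descriptor entry per segment; with one, the same pages could be mapped to a single contiguous I/O virtual range.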

I/O path: syscalls → VFS → drivers → devices

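  • Files: buffered read()/write() go through the page cache, with readahead prefetching sequential reads and writeback flushing dirty pages asynchronously; direct I/O (O_DIRECT) bypasses the cache, which suits large sequential transfers and applications that manage their own caching.
  • aio/io_uring: submission and completion queues reduce syscalls and context switches and enable batching; io_uring shares its rings between user space and the kernel.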
  • Block layer: merges requests, schedules across queues, handles timeouts and retries.
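
Request merging in the block layer can be sketched as follows. This is a simplified model, assuming non-overlapping requests described by a hypothetical `{ sector, count, op }` shape and a hypothetical `maxSectors` limit per merged request; real I/O schedulers apply more conditions than shown here.

```javascript
// Merge block requests that touch adjacent sectors (simplified model).
// Each request: { sector, count, op } where op is "read" or "write".
// Assumes requests do not overlap, so sorting by sector is safe.
function mergeRequests(requests, maxSectors) {
  const sorted = [...requests].sort((a, b) => a.sector - b.sector);
  const merged = [];
  for (const req of sorted) {
    const last = merged[merged.length - 1];
    const canMerge =
      last !== undefined &&
      last.op === req.op &&                       // same direction
      last.sector + last.count === req.sector &&  // back-to-back on disk
      last.count + req.count <= maxSectors;       // respect the size limit
    if (canMerge) {
      last.count += req.count;
    } else {
      merged.push({ ...req }); // copy so the caller's objects are untouched
    }
  }
  return merged;
}

const merged = mergeRequests(
  [
    { sector: 8, count: 8, op: "read" },
    { sector: 0, count: 8, op: "read" },
    { sector: 16, count: 8, op: "write" },
    { sector: 100, count: 4, op: "read" },
  ],
  256
);
console.log(merged);
```

The two adjacent reads become a single 16-sector request, while the write and the distant read stay separate. Fewer, larger requests mean fewer descriptors and completions for the device to handle.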

Networking datapath

  • NIC RX: DMA into buffers; IRQ/NAPI polls RX rings; kernel builds sk_buffs; up the stack (L2→L3→L4) to socket.
  • NIC TX: application → socket → kernel queues → DMA descriptors; completion via IRQ/poll.
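  • Techniques: RSS (receive-side scaling) and IRQ affinity for multi-queue parallelism; GRO/GSO to process packets in larger aggregates; XDP/AF_XDP for fast-path processing and zero-copy delivery to user space.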
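
RSS-style flow steering can be sketched as a hash of the flow tuple followed by a lookup in an indirection table. The code below is an illustration under stated assumptions: real NICs typically use a Toeplitz hash with a secret key, whereas this sketch substitutes FNV-1a for simplicity, and the table size of 128 and the helper names are illustrative choices.

```javascript
// RSS-style flow steering (sketch). Real NICs typically use a Toeplitz hash
// with a secret key; FNV-1a is used here only as a simple stand-in.
function fnv1a(str) {
  let h = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i) & 0xff;
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
  }
  return h;
}

// Indirection table: hash bucket -> RX queue index, filled round-robin.
function makeIndirectionTable(size, numQueues) {
  return Array.from({ length: size }, (_, i) => i % numQueues);
}

// Every packet of a flow hashes to the same bucket, hence the same queue.
function selectQueue(flow, table) {
  const key = `${flow.srcIp}:${flow.srcPort}>${flow.dstIp}:${flow.dstPort}`;
  return table[fnv1a(key) % table.length];
}

const table = makeIndirectionTable(128, 4);
const flow = { srcIp: "10.0.0.1", srcPort: 40000, dstIp: "10.0.0.2", dstPort: 443 };
console.log(selectQueue(flow, table));
```

Because the queue depends only on the flow tuple, packets of one connection stay in order on one queue (and, with IRQ affinity, one core), while different flows spread across queues. Rewriting the indirection table rebalances load without changing the hash.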

Storage datapath

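  • NVMe: paired submission/completion queues, typically one pair per core, each with its own MSI-X vector; deep queues sustain high throughput.
  • SATA/SAS: a single command queue (NCQ on SATA, up to 32 outstanding commands; TCQ on SAS); higher latency than NVMe.
  • Writeback caching, barriers, FUA (force unit access), and flush semantics determine when data is actually on stable media, which affects database durability and performance.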
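
The difference between a cached write, a flush, and an FUA write can be shown with a toy device model. This is a minimal sketch, assuming a hypothetical `Disk` class with a volatile write cache that is lost on power failure; it models the semantics only and is not how any real driver or device is implemented.

```javascript
// Toy disk with a volatile write cache (hypothetical model).
class Disk {
  constructor() {
    this.cache = new Map(); // volatile: lost on crash
    this.media = new Map(); // stable storage
  }
  // A normal write may sit in the cache; an FUA write reaches media
  // before it completes, without flushing other cached data.
  write(lba, data, { fua = false } = {}) {
    if (fua) {
      this.media.set(lba, data);
      this.cache.delete(lba); // drop any stale cached copy
    } else {
      this.cache.set(lba, data);
    }
  }
  // Flush pushes everything in the cache to stable media.
  flush() {
    for (const [lba, data] of this.cache) this.media.set(lba, data);
    this.cache.clear();
  }
  read(lba) {
    return this.cache.has(lba) ? this.cache.get(lba) : this.media.get(lba);
  }
  // Power loss: the volatile cache disappears.
  crash() {
    this.cache.clear();
  }
}

// Write-ahead-log style commit: log record, flush, then commit record with FUA.
const disk = new Disk();
disk.write(1, "log record");
disk.flush();
disk.write(2, "commit record", { fua: true });
disk.write(3, "data page"); // not yet flushed
disk.crash();
console.log(disk.read(1), disk.read(2), disk.read(3));
```

After the simulated crash, the flushed log record and the FUA commit record survive, while the unflushed data page is gone. A database relies on exactly this ordering to recover the page from the log.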

Performance: batching, polling, zero-copy

  • Batching: amortize per-IO costs with rings and group commits.
  • Polling: busy-poll for ultra-low latency (e.g., io_uring SQPOLL, NAPI busy-poll) at higher CPU cost.
  • Zero-copy: mmap/sendfile/splice, or userspace drivers (DPDK, SPDK) to avoid kernel copies.
  • NUMA: keep data, threads, and interrupts on the same NUMA node (CPU socket); pin threads and set IRQ affinities accordingly.
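
The benefit of batching is simple arithmetic: a fixed per-submission overhead is divided across the batch. The sketch below uses assumed, illustrative costs (2 microseconds of fixed overhead per submission, 0.5 microseconds of work per operation); the numbers are placeholders, not measurements.

```javascript
// Amortized per-operation cost when a fixed overhead is shared by a batch.
// fixedUs: per-submission overhead (e.g., a syscall or doorbell write); assumed value.
// perOpUs: work that scales with each operation; assumed value.
function perOpCostUs(fixedUs, perOpUs, batchSize) {
  return fixedUs / batchSize + perOpUs;
}

const costs = [1, 8, 64].map((n) => ({
  batchSize: n,
  costUs: perOpCostUs(2, 0.5, n),
}));
console.log(costs);
```

As the batch grows, the per-operation cost approaches `perOpUs`, so returns diminish quickly. Beyond a modest batch size, larger batches mostly add queuing latency.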

Exercises

  1. Measure throughput/latency of io_uring vs. read/write for sequential and random I/O.
  2. Configure RSS and IRQ affinities on a multi-queue NIC; compare packet processing scalability.
  3. Test busy-polling vs interrupt-driven RX for 64B packets and 1500B frames under different loads.

Modern I/O performance hinges on DMA rings, interrupts with coalescing, batching, and careful CPU/NUMA placement.