Interconnects (QPI, Infinity Fabric)


Why interconnects matter

  • Multi-socket CPUs and chiplet designs rely on high-speed links for cache coherence and memory access.
  • Interconnect bandwidth/latency directly impacts cross-core, cross-die communication and NUMA performance.
  • Modern fabrics unify coherence, memory, and I/O connectivity.

Topologies and link layers

  • Topologies: point-to-point, ring, mesh, torus; trade-offs in bisection bandwidth and latency.
  • Physical layer: high-speed serial lanes; logical layer: packets/flits with flow control.
  • Routing: static vs adaptive; deadlock avoidance via virtual channels.

Bisection bandwidth rises with mesh-style topologies; latency rises with hop count.
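
The ring-versus-mesh trade-off can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a bidirectional ring with an even node count and a square k×k mesh with even k, unit-cost hops, and shortest-path routing. The function names are made up for this example.

```python
from itertools import product


def avg_hops_ring(n):
    """Mean shortest-path hop count between distinct nodes on a bidirectional ring."""
    total = sum(
        min(abs(a - b), n - abs(a - b))
        for a in range(n)
        for b in range(n)
        if a != b
    )
    return total / (n * (n - 1))


def avg_hops_mesh(k):
    """Mean Manhattan-distance hop count between distinct nodes on a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    # Same-node pairs contribute 0 to the sum, so only the divisor excludes them.
    total = sum(
        abs(x1 - x2) + abs(y1 - y2) for (x1, y1) in nodes for (x2, y2) in nodes
    )
    return total / (len(nodes) * (len(nodes) - 1))


def bisection_links_ring(n):
    """Links cut when a bidirectional ring (even n) is split into two halves."""
    return 2


def bisection_links_mesh(k):
    """Links cut when a k x k mesh (even k) is split between its middle columns."""
    return k


if __name__ == "__main__":
    print(f"16-node ring: {avg_hops_ring(16):.2f} avg hops, "
          f"{bisection_links_ring(16)} bisection links")
    print(f"4x4 mesh:     {avg_hops_mesh(4):.2f} avg hops, "
          f"{bisection_links_mesh(4)} bisection links")
```

With 16 nodes in each case, the mesh has both a lower average hop count and twice the bisection links, at the cost of more links per node.
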
Virtual channels break cycles in the channel (resource) dependency graph.
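
One way to see why virtual channels help is to build the channel dependency graph for a small network and check it for cycles. The sketch below assumes a unidirectional ring with minimal routing and a simple "dateline" scheme (packets move to VC1 on the wrap-around link and stay there). It is a toy model, not a description of any specific fabric; all names are hypothetical.

```python
def ring_channel_dependencies(n, use_dateline):
    """Channel dependency edges for a unidirectional ring of n nodes.

    A channel is (link, vc); link i carries traffic from node i to node (i + 1) % n.
    An edge (a, b) means a packet can hold channel a while waiting for channel b.
    """
    edges = set()
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            node, vc, prev = src, 0, None
            while node != dst:
                # Dateline: the wrap-around link (n-1 -> 0) and all later hops
                # use VC1. Without the dateline, all traffic stays on VC0.
                if use_dateline and node == n - 1:
                    vc = 1
                chan = (node, vc)
                if prev is not None:
                    edges.add((prev, chan))
                prev = chan
                node = (node + 1) % n
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY:
                return True  # back edge: cycle found
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)


if __name__ == "__main__":
    print(has_cycle(ring_channel_dependencies(4, use_dateline=False)))  # True
    print(has_cycle(ring_channel_dependencies(4, use_dateline=True)))   # False
```

A cyclic dependency graph means deadlock is possible; the dateline scheme makes the graph acyclic at the cost of a second buffer per link.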

Intel QPI/UPI

  • QPI (QuickPath Interconnect) and its successor UPI (Ultra Path Interconnect): point-to-point, packetized, source-synchronous links.
  • Supports cache-coherent reads/writes, snoop filters, and directory optimizations.
  • Performance depends on link width, speed (GT/s), snoop mode (home snoop, early snoop).

AMD Infinity Fabric

  • Scalable fabric connecting chiplets (CCDs), I/O dies, and memory controllers.
  • The IF clock (FCLK) is often tied to the memory clock; cross-CCX/CCD latency depends on topology.
  • Link variants: IFOP (on-package) connects dies within a socket; IFIS (inter-socket) carries coherent traffic between sockets.

Coherent vs non-coherent links

  • Coherent fabrics propagate cache line state across sockets (home agents, directories).
  • Non-coherent I/O (classic PCIe) requires explicit DMA ownership and cache management.
  • Emerging standards (CXL, CCIX) enable coherent accelerators (e.g. CXL.cache) and memory expansion or pooling (e.g. CXL.mem).
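
To illustrate what "propagating cache line state" costs, here is a deliberately simplified directory sketch. It assumes an MSI-like protocol with a single home directory, and it counts only the extra messages sent to other nodes (forwards and invalidations). Real home agents and snoop filters are far more involved; the class and field names are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LineEntry:
    owner: Optional[int] = None                     # node holding the line Modified
    sharers: Set[int] = field(default_factory=set)  # nodes holding it Shared


class Directory:
    """Toy home directory that tracks which nodes cache each line."""

    def __init__(self):
        self.lines = {}
        self.messages = 0  # forwards/invalidations sent to other nodes

    def _entry(self, addr):
        return self.lines.setdefault(addr, LineEntry())

    def read(self, node, addr):
        e = self._entry(addr)
        if e.owner == node:
            return  # already held Modified locally: no fabric traffic
        if e.owner is not None:
            # Another node holds the line Modified: forward data, downgrade to Shared.
            self.messages += 1
            e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(node)

    def write(self, node, addr):
        e = self._entry(addr)
        if e.owner == node:
            return
        if e.owner is not None:
            self.messages += 1  # fetch and invalidate the Modified copy
        self.messages += len(e.sharers - {node})  # invalidate other sharers
        e.sharers = set()
        e.owner = node


if __name__ == "__main__":
    d = Directory()
    for i in range(4):       # two sockets alternately writing the same line
        d.write(i % 2, 0x40)
    print(d.messages)        # 3: every write after the first steals the line
```

The ping-pong pattern in the usage lines is the reason cross-socket write sharing is expensive: each ownership change costs a round trip over the fabric.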

NUMA domains and hop counts

  • Remote memory access incurs additional hops, increasing latency and reducing bandwidth.
  • OS exposes NUMA nodes; schedulers and allocators try to keep threads/data local.
  • Process and memory placement is critical for multi-socket performance.
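
# Inspect NUMA and topology
lscpu        # sockets, cores, and the CPU list of each NUMA node
numactl -H   # per-node memory sizes and the node distance matrix
lstopo       # hwloc view of packages, caches, and NUMA nodes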
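
The same distance information can be read programmatically. The sketch below assumes a Linux system that exposes `/sys/devices/system/node/nodeN/distance` (one row of relative distances per node, with the local node usually reported as 10). The function names and the "penalty" ratio are illustrative, and the distances are firmware-reported hints, not measured latencies.

```python
import re
from pathlib import Path


def read_numa_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to each node]} from sysfs; empty if unavailable."""
    root = Path(root)
    distances = {}
    if not root.is_dir():
        return distances
    for node_dir in root.iterdir():
        match = re.fullmatch(r"node(\d+)", node_dir.name)
        dist_file = node_dir / "distance"
        if match and dist_file.is_file():
            row = [int(x) for x in dist_file.read_text().split()]
            distances[int(match.group(1))] = row
    return dict(sorted(distances.items()))


def remote_penalty(distances):
    """Ratio of the farthest to the nearest (local) distance for each node."""
    return {n: max(row) / min(row) for n, row in distances.items() if row}


if __name__ == "__main__":
    dist = read_numa_distances()
    for node, row in dist.items():
        print(f"node {node}: {row}")
    print("remote/local ratio:", remote_penalty(dist))
```

On a single-socket machine the ratio is 1.0; larger ratios indicate more hops to the farthest node.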

PCIe, CCIX, CXL overview

  • PCIe: ubiquitous non-coherent I/O, lane-scalable (x1..x16), versions (Gen3..Gen5+).
  • CCIX: cache-coherent accelerator interface over the PCIe PHY; adopted in some SoCs.
  • CXL: coherent memory semantics over the PCIe 5.0 physical layer (PCIe 6.0 from CXL 3.0); sub-protocols CXL.io, CXL.cache, CXL.mem for coherent devices and memory expansion.

Bandwidth, latency, congestion

  • Raw throughput is active lanes × per-lane transfer rate; line encoding and protocol overhead reduce payload bandwidth.
  • Latency grows with hops; congestion and head-of-line blocking reduce effective BW.
  • QoS and traffic classes mitigate contention in shared fabrics.
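
As a worked instance of "lanes × transfer rate minus overhead", the sketch below computes per-direction PCIe bandwidth after line encoding (8b/10b for Gen1 and Gen2, 128b/130b for Gen3 to Gen5). It deliberately ignores packet headers, flow-control traffic, and congestion, so real payload throughput is lower. The table and function name are written for this example.

```python
# generation -> (GT/s per lane, payload bits, encoded bits)
PCIE_GEN = {
    1: (2.5, 8, 10),
    2: (5.0, 8, 10),
    3: (8.0, 128, 130),
    4: (16.0, 128, 130),
    5: (32.0, 128, 130),
}


def pcie_bandwidth_gbytes(gen, lanes):
    """Per-direction bandwidth in GB/s after line encoding, before packet overhead."""
    rate_gt, payload_bits, encoded_bits = PCIE_GEN[gen]
    usable_gbits = rate_gt * lanes * payload_bits / encoded_bits
    return usable_gbits / 8  # bits -> bytes


if __name__ == "__main__":
    for gen in sorted(PCIE_GEN):
        print(f"Gen{gen} x16: {pcie_bandwidth_gbytes(gen, 16):.2f} GB/s per direction")
```

For example, Gen3 x16 comes to about 15.75 GB/s per direction, and each later generation doubles that figure.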

Measuring and tuning

  • Use microbenchmarks to measure local vs remote bandwidth/latency.
  • Profile snoop traffic and inter-socket bandwidth; reduce cross-socket sharing.
  • Tune BIOS/firmware options (snoop modes, IF clock) cautiously; validate with workload.
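
Local-versus-remote measurements are only meaningful when the benchmark thread stays on a known node. A minimal pinning sketch follows, assuming Linux (`os.sched_setaffinity` and the sysfs `cpulist` files are Linux-specific). The helper names are hypothetical, and this pins CPUs only: memory placement then follows the kernel's allocation policy, which is typically local first-touch by default, so use `numactl` or libnuma for explicit memory binding.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node, root="/sys/devices/system/node"):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    with open(f"{root}/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


if __name__ == "__main__":
    print(sorted(parse_cpulist("0-3,8-11")))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Calling `pin_to_node(0)` before a bandwidth loop, then repeating the run with the buffer allocated while pinned to a different node, gives a rough local-versus-remote comparison.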

Exercises

  1. Measure NUMA effects using a bandwidth benchmark with and without process/memory pinning.
  2. Map your system's inter-socket topology; estimate hop counts and expected latencies.
  3. Compare PCIe vs CXL memory expansion performance on a synthetic workload.

Fabric performance shapes multi-socket and chiplet systems; locality-aware software unlocks the hardware.