Contents · Interconnects (QPI, Infinity Fabric)
Why interconnects matter
System scaling | 6-minute read
- Multi-socket CPUs and chiplet designs rely on high-speed links for cache coherence and memory access.
- Interconnect bandwidth and latency directly affect cross-core and cross-die communication, and therefore NUMA performance.
- Modern fabrics unify coherence, memory, and I/O connectivity.
Topologies and link layers
Fabric design | 8-minute read
- Topologies: point-to-point, ring, mesh, torus; trade-offs in bisection bandwidth and latency.
- Physical layer: high-speed serial lanes; logical layer: packets/flits with flow control.
- Routing: static vs adaptive; deadlock avoidance via virtual channels.
Bisection bandwidth is higher for a mesh than for a ring of the same size; latency grows with hop count
Virtual channels break cycles in the channel dependency graph
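The ring-versus-mesh trade-off can be made concrete with a small calculation. The sketch below is illustrative only: it builds idealized ring, mesh, and torus graphs (every link identical, one hop per link), computes the average shortest-path hop count by breadth-first search, and counts the links crossing a chosen cut. The function names are hypothetical, and real fabrics add router, arbitration, and serialization delays that this model ignores.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency for an n-node bidirectional ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def mesh(rows, cols, wrap=False):
    """Adjacency for a rows x cols 2D mesh; wrap=True gives a torus."""
    adj = {}
    for r, c in product(range(rows), range(cols)):
        nbrs = set()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if wrap:
                nr, nc = nr % rows, nc % cols
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) != (r, c):
                nbrs.add((nr, nc))
        adj[(r, c)] = nbrs
    return adj


def average_hops(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def cut_links(adj, left):
    """Count links with exactly one endpoint in the node set `left`."""
    return sum(1 for u in left for v in adj[u] if v not in left)


if __name__ == "__main__":
    r16, m16, t16 = ring(16), mesh(4, 4), mesh(4, 4, wrap=True)
    left_ring = set(range(8))
    left_grid = {(r, c) for r in range(4) for c in range(2)}
    print("ring ", average_hops(r16), cut_links(r16, left_ring))
    print("mesh ", average_hops(m16), cut_links(m16, left_grid))
    print("torus", average_hops(t16), cut_links(t16, left_grid))
```

For 16 nodes the ring averages about 4.27 hops with 2 links across the middle cut, the 4x4 mesh about 2.67 hops with 4 links, and the 4x4 torus about 2.13 hops with 8 links. The cut used here is the natural split down the middle, which is only one candidate for the true minimum bisection.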
Intel QPI/UPI
Intel | 8-minute read
- QPI (QuickPath Interconnect) and its successor UPI (Ultra Path Interconnect): point-to-point, packetized, source-synchronous links between sockets.
- Supports cache-coherent reads/writes, snoop filters, and directory optimizations.
- Performance depends on the number of links, link width, link speed (GT/s), and the snoop mode (e.g. home snoop, early snoop).
AMD Infinity Fabric
AMD | 8-minute read
- Scalable fabric connecting chiplets (CCDs), I/O dies, and memory controllers.
- IF clock often tied to memory speed; cross-CCX/CCD latency depends on topology.
- Variants support inter-socket coherence: IFOP (on-package) links connect dies within a socket, and IFIS (inter-socket) links connect sockets.
Coherent vs non-coherent links
Coherence | 7-minute read
- Coherent fabrics propagate cache line state across sockets (home agents, directories).
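- Non-coherent I/O (classic PCIe): devices cannot coherently cache host memory, so software must manage buffer ownership around DMA and, on platforms without I/O coherence, perform explicit cache maintenance.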
- Emerging standards (CXL.cache, CCIX) enable coherent accelerators and memory pooling.
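To illustrate what "propagating cache line state" through a directory means, here is a toy sketch of a home-agent directory with a simplified MSI-style state model. It is an assumption-laden teaching model, not how any shipping fabric is implemented: real protocols (for example MESIF or MOESI variants) have more states, forwarding, and transient conditions. The class, method, and message names are hypothetical.

```python
class Directory:
    """Toy home-agent directory tracking which nodes hold each cache line.

    A line is either held Modified by a single owner or Shared by a set
    of readers. Each request returns the messages the home would send.
    """

    def __init__(self):
        self.sharers = {}  # line address -> set of node ids with a Shared copy
        self.owner = {}    # line address -> node id holding the line Modified

    def read(self, node, line):
        """Node asks for a readable copy of the line."""
        owner = self.owner.get(line)
        if owner == node:
            return []  # requester already holds the line Modified
        msgs = []
        if owner is not None:
            # The dirty copy is written back and downgraded to Shared.
            msgs.append(("downgrade", owner))
            del self.owner[line]
            self.sharers.setdefault(line, set()).add(owner)
        self.sharers.setdefault(line, set()).add(node)
        return msgs

    def write(self, node, line):
        """Node asks for exclusive ownership of the line."""
        owner = self.owner.get(line)
        if owner == node:
            return []  # requester already owns the line
        msgs = []
        if owner is not None:
            msgs.append(("invalidate", owner))
        # Every other Shared copy must be invalidated before the write.
        for sharer in sorted(self.sharers.pop(line, set()) - {node}):
            msgs.append(("invalidate", sharer))
        self.owner[line] = node
        return msgs


if __name__ == "__main__":
    d = Directory()
    d.read(0, 0x40)
    d.read(1, 0x40)
    print(d.write(0, 0x40))  # node 1 loses its Shared copy
    print(d.read(1, 0x40))   # node 0 is downgraded from Modified
```

The point of the directory is that the home only messages nodes known to hold the line, instead of broadcasting snoops to every socket.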
NUMA domains and hop counts
Topology effects | 7-minute read
- Remote memory access incurs additional hops, increasing latency and reducing bandwidth.
- OS exposes NUMA nodes; schedulers and allocators try to keep threads/data local.
- Process and memory placement is critical for multi-socket performance.
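# Inspect NUMA and topology
lscpu        # sockets, cores, and the CPU list of each NUMA node
numactl -H   # per-node memory sizes and the node distance matrix
lstopo       # hwloc view of packages, caches, and attached PCIe devices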
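The distance matrix printed by numactl -H can be turned into a rough relative cost per node pair. The sketch below parses text in that layout and divides each remote distance by the local one. The sample text is made-up illustrative data, not output from a real machine, and the function names are hypothetical. The distances are firmware-reported relative values (local is conventionally 10), so treat the ratios as hints rather than measured latencies.

```python
def parse_distances(text):
    """Extract the node distance matrix from `numactl -H` style output."""
    lines = text.splitlines()
    start = next(i for i, ln in enumerate(lines)
                 if ln.strip() == "node distances:")
    nodes = [int(n) for n in lines[start + 1].split()[1:]]  # skip "node"
    matrix = {}
    for ln in lines[start + 2 : start + 2 + len(nodes)]:
        label, *vals = ln.split()
        matrix[int(label.rstrip(":"))] = dict(zip(nodes, map(int, vals)))
    return matrix


def remote_penalty(matrix):
    """Ratio of remote to local distance for every ordered pair of nodes."""
    return {
        (a, b): dist / row[a]
        for a, row in matrix.items()
        for b, dist in row.items()
        if a != b
    }


# Illustrative two-socket example; the values are assumptions.
SAMPLE = """\
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""

if __name__ == "__main__":
    print(remote_penalty(parse_distances(SAMPLE)))
```

On a live system the text could come from subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout, assuming numactl is installed.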
PCIe, CCIX, CXL overview
I/O links | 9-minute read
- PCIe: ubiquitous non-coherent I/O, lane-scalable (x1..x16), versions (Gen3..Gen5+).
- CCIX: cache-coherent accelerator interface layered over the PCIe PHY; adopted in some SoCs.
- CXL: cache-coherent and memory semantics over the PCIe PHY (PCIe 5.0 for CXL 1.x/2.0); sub-protocols CXL.io, CXL.cache, CXL.mem for coherent devices and memory expansion.
Bandwidth, latency, congestion
Performance | 8-minute read
- Raw throughput is active lanes × per-lane transfer rate; line encoding (e.g. 128b/130b) and protocol overhead (packet headers, flow-control traffic) reduce payload BW.
- Latency grows with hops; congestion and head-of-line blocking reduce effective BW.
- QoS and traffic classes mitigate contention in shared fabrics.
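As a worked example of the lanes × transfer rate rule, the sketch below computes one-direction PCIe link bandwidth after 128b/130b line encoding, then applies a per-packet efficiency factor. The per-generation rates and the encoding are standard PCIe figures; the 24-byte per-packet overhead is an illustrative assumption, since actual overhead depends on header format and link configuration. The function names are hypothetical.

```python
# generation -> (per-lane rate in GT/s, payload bits, total bits per code block)
PCIE_GENS = {
    3: (8.0, 128, 130),
    4: (16.0, 128, 130),
    5: (32.0, 128, 130),
}


def pcie_link_gbytes_per_s(gen, lanes):
    """One-direction bandwidth in GB/s after line encoding, before packet overhead."""
    rate, payload_bits, total_bits = PCIE_GENS[gen]
    return rate * lanes * payload_bits / total_bits / 8


def payload_efficiency(payload_bytes, overhead_bytes):
    """Fraction of each packet that is useful payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    raw = pcie_link_gbytes_per_s(3, 16)
    # Assumption: 256-byte payloads with 24 bytes of header/framing each.
    eff = payload_efficiency(256, 24)
    print(f"Gen3 x16 after encoding: {raw:.2f} GB/s")
    print(f"with assumed packet overhead: {raw * eff:.2f} GB/s")
```

A Gen3 x16 link works out to about 15.75 GB/s per direction after encoding, and each later generation in the table doubles that. Smaller payloads lower the efficiency factor, which is one reason small transfers see far less than the headline bandwidth.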
Measuring and tuning
Observability | 7-minute read
- Use microbenchmarks to measure local vs remote bandwidth/latency.
- Profile snoop traffic and inter-socket bandwidth; reduce cross-socket sharing.
- Tune BIOS/firmware options (snoop modes, IF clock) cautiously; validate with workload.
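A local-versus-remote comparison can start from something as simple as a timed buffer copy run under different placements. The sketch below is a rough stand-in for a proper microbenchmark: the function name and the script filename bench.py are hypothetical, and purpose-built tools such as STREAM or Intel MLC give more rigorous numbers. The buffer should be larger than the last-level cache so the copy actually reaches memory; the default size here is an assumption that may be too small on parts with very large caches.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N copy bandwidth in GB/s for a buffer of size_bytes."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst[:] = src  # one read stream plus one write stream
        best = min(best, time.perf_counter() - t0)
    # Count both the bytes read and the bytes written.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"{copy_bandwidth_gbs():.2f} GB/s")
```

Running it as numactl --cpunodebind=0 --membind=0 python3 bench.py and again with --membind=1 gives a local and a remote figure to compare. Taking the best of several repeats discards the first pass, where page faults dominate.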
Exercises
Hands-on | 7-minute read
- Measure NUMA effects using a bandwidth benchmark with and without process/memory pinning.
- Map your system's inter-socket topology; estimate hop counts and expected latencies.
- Compare PCIe vs CXL memory expansion performance on a synthetic workload.
Fabric performance shapes multi-socket and chiplet systems; locality-aware software unlocks the hardware.