
Caches and coherence (MESI/MOESI)


Cache fundamentals and locality

  • Exploit temporal and spatial locality to reduce average memory latency.
  • Hierarchy: L1 (small, fast), L2, L3 (shared), DRAM; inclusive/exclusive designs.
  • Miss types: compulsory, capacity, conflict; AMAT = hit_time + miss_rate × miss_penalty.
Index = (address / line_size) mod #sets
Tag   = address / (line_size × #sets)
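
A minimal sketch of the index/tag formulas above. The struct and function names are illustrative, and the geometry (64 sets of 64-byte lines) is an assumption chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cache geometry; the values used with it are illustrative.
struct CacheGeometry {
    std::uint64_t line_size;  // bytes per line
    std::uint64_t num_sets;   // number of sets
};

struct AddressParts {
    std::uint64_t offset;  // byte within the line
    std::uint64_t index;   // set selected by the address
    std::uint64_t tag;     // remaining high-order part of the address
};

// Splits an address using the formulas from the text:
//   index = (address / line_size) mod num_sets
//   tag   = address / (line_size * num_sets)
inline AddressParts decompose(std::uint64_t address, const CacheGeometry& g) {
    assert(g.line_size > 0 && g.num_sets > 0);
    return AddressParts{
        address % g.line_size,
        (address / g.line_size) % g.num_sets,
        address / (g.line_size * g.num_sets),
    };
}
```

With 64 sets of 64-byte lines, address 0x12345 splits into offset 5, index 13, tag 18.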

Parameters: size, associativity, line size

  • Higher associativity reduces conflict misses but increases hit latency and energy.
  • Larger lines exploit more spatial locality but can waste bandwidth and capacity on unused bytes (pollution) and worsen false sharing.
  • Victim caches mitigate conflict misses; way prediction reduces the hit latency and energy of associative lookups.
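
To illustrate how associativity removes conflict misses, here is a toy simulator sketch. It tracks tags only and uses true LRU; the class name ToyCache and the helper ping_pong_misses are hypothetical, and the model ignores latency and energy entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Toy set-associative cache with true LRU replacement; stores tags only.
class ToyCache {
public:
    ToyCache(std::size_t num_sets, std::size_t ways, std::size_t line_size)
        : sets_(num_sets), ways_(ways), line_size_(line_size) {
        assert(num_sets > 0 && ways > 0 && line_size > 0);
    }

    // Returns true on a hit. On a miss, fills the line and evicts the
    // least recently used way if the set is full.
    bool access(std::uint64_t address) {
        const std::uint64_t line = address / line_size_;
        auto& set = sets_[line % sets_.size()];
        const std::uint64_t tag = line / sets_.size();
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == tag) {
                set.splice(set.begin(), set, it);  // move to MRU position
                return true;
            }
        }
        if (set.size() == ways_) set.pop_back();  // evict the LRU tag
        set.push_front(tag);
        return false;
    }

private:
    std::vector<std::list<std::uint64_t>> sets_;  // front = MRU, back = LRU
    std::size_t ways_;
    std::size_t line_size_;
};

// Counts misses when alternating between two addresses.
inline int ping_pong_misses(ToyCache cache, std::uint64_t a, std::uint64_t b,
                            int rounds) {
    int misses = 0;
    for (int i = 0; i < rounds; ++i) {
        if (!cache.access(a)) ++misses;
        if (!cache.access(b)) ++misses;
    }
    return misses;
}
```

With equal capacity, a direct-mapped ToyCache(64, 1, 64) misses on every access when alternating between addresses 0 and 4096 (20 misses in 10 rounds), while a 2-way ToyCache(32, 2, 64) takes only the 2 compulsory misses.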

Write policies: write-back vs write-through

  • Write-back with a dirty bit reduces memory bandwidth, but dirty lines must be written back on eviction and supplied or invalidated on coherence requests.
  • Write-through simplifies coherence at cost of bandwidth.
  • Allocation: write-allocate vs no-write-allocate affects store misses.
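
The bandwidth difference between the two policies can be sketched with a toy model that holds a single line and only counts writes reaching memory. OneLineCache and writes_for_repeated_stores are hypothetical names, and write-allocate is assumed for both policies.

```cpp
#include <cassert>
#include <cstdint>

// Toy single-line cache that only counts writes reaching memory.
struct OneLineCache {
    bool write_back;         // true: write-back, false: write-through
    bool valid = false;
    bool dirty = false;
    std::uint64_t line = 0;  // line number currently cached
    int memory_writes = 0;

    explicit OneLineCache(bool wb) : write_back(wb) {}

    // Store to `line_number` under a write-allocate policy.
    void store(std::uint64_t line_number) {
        if (!valid || line != line_number) {
            evict();
            valid = true;
            line = line_number;
        }
        if (write_back) {
            dirty = true;     // defer the memory update until eviction
        } else {
            ++memory_writes;  // write-through: every store goes to memory
        }
    }

    // Evicts the cached line, writing it back only if it is dirty.
    void evict() {
        if (valid && dirty) ++memory_writes;
        valid = false;
        dirty = false;
    }
};

// Memory writes caused by `n` stores to one line followed by an eviction.
inline int writes_for_repeated_stores(bool write_back, int n) {
    assert(n >= 0);
    OneLineCache cache(write_back);
    for (int i = 0; i < n; ++i) cache.store(7);
    cache.evict();
    return cache.memory_writes;
}
```

In this model, 100 stores to the same line cost 1 memory write under write-back and 100 under write-through.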

Replacement: LRU, PLRU, RRIP

  • True LRU expensive beyond 4-way; PLRU approximates with binary tree state.
  • RRIP predicts re-reference interval; variants (SRRIP/DRRIP) adapt to phases.
  • Bypass and insertion policies (e.g., BIP, DIP) control pollution.
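
A sketch of tree-PLRU state for one 4-way set, using three bits instead of a full recency ordering. The bit convention (each bit points toward the next victim) is one common choice and is assumed here; TreePlru4 is an illustrative name.

```cpp
#include <cassert>

// Tree-PLRU state for one 4-way set: three bits arranged as a binary tree.
// Each bit points toward the subtree to victimize next
// (false = left, true = right).
struct TreePlru4 {
    bool root = false;   // false: victim among ways 0-1, true: among ways 2-3
    bool left = false;   // false: way 0, true: way 1
    bool right = false;  // false: way 2, true: way 3

    // On a hit or fill, set the bits on the path to point away from `way`.
    void touch(int way) {
        assert(way >= 0 && way < 4);
        if (way < 2) {
            root = true;         // next victim comes from the right half
            left = (way == 0);   // point at the sibling of `way`
        } else {
            root = false;        // next victim comes from the left half
            right = (way == 2);  // point at the sibling of `way`
        }
    }

    // Follows the bits from the root to choose the way to evict.
    int victim() const {
        if (!root) return left ? 1 : 0;
        return right ? 3 : 2;
    }
};
```

After touching ways 0, 1, 2, 3, 0 in that order, victim() returns way 2, whereas true LRU would evict way 1. That gap is the approximation PLRU accepts in exchange for less state.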

Multicore cache coherence

  • Ensures all cores observe consistent values of shared cache lines.
  • Bus snooping (broadcast) vs directory protocols (scalable, point-to-point).
  • Coherence transactions: read miss, write miss (read-for-ownership), upgrade (write hit on a shared line), invalidation, writeback.
Directory fields per line: state, owner, sharer vector
On write miss: send invalidates to sharers; writer becomes owner/exclusive
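
The directory fields and the write-miss rule above can be sketched as follows. This is a simplified model assuming 8 cores and three directory states; the type and function names are hypothetical, and data movement and message ordering are not modeled.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCores = 8;  // assumed core count

enum class DirState { Uncached, Shared, Exclusive };

// Per-line directory entry: state, owner, sharer vector.
struct DirectoryEntry {
    DirState state = DirState::Uncached;
    int owner = -1;               // meaningful only in Exclusive
    std::bitset<kCores> sharers;  // one bit per core
};

// Read miss: the requester joins the sharer set. Returns true if a previous
// exclusive owner had to be downgraded (and would supply the data).
inline bool on_read_miss(DirectoryEntry& e, int requester) {
    assert(requester >= 0 && static_cast<std::size_t>(requester) < kCores);
    const bool had_owner = (e.state == DirState::Exclusive);
    e.sharers.set(static_cast<std::size_t>(requester));
    e.state = DirState::Shared;  // simplification: never grants E on a read
    e.owner = -1;
    return had_owner;
}

// Write miss: invalidate every other sharer; the writer becomes the
// exclusive owner. Returns the number of invalidations sent.
inline std::size_t on_write_miss(DirectoryEntry& e, int requester) {
    assert(requester >= 0 && static_cast<std::size_t>(requester) < kCores);
    std::bitset<kCores> targets = e.sharers;
    targets.reset(static_cast<std::size_t>(requester));
    const std::size_t invalidations = targets.count();
    e.sharers.reset();
    e.sharers.set(static_cast<std::size_t>(requester));
    e.state = DirState::Exclusive;
    e.owner = requester;
    return invalidations;
}
```

For example, if cores 0, 1, and 2 have read a line and core 1 then writes it, the directory sends 2 invalidations and records core 1 as the exclusive owner.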

MESI/MOESI states and transitions

  • MESI: Modified, Exclusive, Shared, Invalid; MOESI adds Owned (dirty-shared).
  • Read miss → E (no other copies) or S; write to S → upgrade that invalidates other copies, then M; write miss → M with ownership; in MOESI, a remote read of an M line moves it to O.
  • Invalidate vs update: most systems use invalidate for bandwidth efficiency.
Owned (O): line is dirty but shared with readers holding S copies; the owner supplies data on a read miss and writes the line back on eviction
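
The transitions can be sketched as a per-cache next-state function. This is a simplified MOESI model for two caches, not any specific processor's protocol; next_state, compatible, and step_pair are hypothetical names, and bus messages and data transfers are omitted.

```cpp
#include <cassert>

enum class State { M, O, E, S, I };
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Next state of one cache's copy of a line. `others_have_copy` matters only
// for a local read from Invalid (it selects S versus E).
inline State next_state(State s, Event ev, bool others_have_copy = false) {
    switch (ev) {
    case Event::LocalRead:
        if (s == State::I) return others_have_copy ? State::S : State::E;
        return s;  // read hit: no change
    case Event::LocalWrite:
        return State::M;  // from S, O, or I this implies invalidating others
    case Event::RemoteRead:
        if (s == State::M || s == State::O) return State::O;  // keep ownership
        if (s == State::E) return State::S;
        return s;  // S and I are unchanged
    case Event::RemoteWrite:
        return State::I;
    }
    return s;  // not reached for valid events
}

// Coherence invariant for two copies of the same line.
inline bool compatible(State a, State b) {
    if (a == State::M || a == State::E) return b == State::I;
    if (b == State::M || b == State::E) return a == State::I;
    return !(a == State::O && b == State::O);  // at most one owner
}

// Applies a local access on cache `a` and the matching snoop on cache `b`.
inline void step_pair(State& a, State& b, bool write) {
    const bool b_has_copy = (b != State::I);
    a = next_state(a, write ? Event::LocalWrite : Event::LocalRead, b_has_copy);
    b = next_state(b, write ? Event::RemoteWrite : Event::RemoteRead);
    assert(compatible(a, b));
}
```

Starting from I/I, a read by cache A gives E; a write by A gives M; a read by cache B then leaves A in O and B in S; a write by B ends with B in M and A in I.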

Consistency vs coherence, fences

  • Coherence: single memory location agreement; consistency: ordering across locations.
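  • TSO (x86) relaxes only store→load ordering; weaker models (ARM/POWER) allow more reorderings. Both need fences or ordered atomics where cross-core ordering matters.
  • Atomics provide ordering (acquire/release, seq_cst); atomic read-modify-write operations also need the line in an exclusive state, so they generate coherence traffic.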
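#include <atomic>
std::atomic<int> x{0}, y{0};
// Thread A: write x, then raise the flag y.
x.store(1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst);  // orders the store to x before the store to y
y.store(1, std::memory_order_relaxed);
// Thread B: if y == 1 is observed, the load of x is guaranteed to return 1.
if (y.load(std::memory_order_acquire)) {
  int r = x.load(std::memory_order_acquire);
}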
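
As a runnable variant of the fragment above, the sketch below shows the common message-passing pattern: a plain payload published through a release store and consumed after an acquire load. The function name publish_and_consume is hypothetical, and no fence is needed here because the release/acquire pair already provides the ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publishes `payload` through a release store and reads it after an acquire
// load. Returns the value the consumer observed.
inline int publish_and_consume(int payload) {
    int data = 0;                    // plain, non-atomic payload
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread producer([&] {
        data = payload;                                // 1. write the payload
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // 3. wait for flag
            std::this_thread::yield();
        }
        seen = data;  // 4. ordered after step 1 by the release/acquire pair
    });
    producer.join();
    consumer.join();
    assert(seen == payload);
    return seen;
}
```

If both memory orders were weakened to relaxed, the read of data would be a data race, and on weakly ordered hardware the consumer could observe the flag before the payload.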

False sharing and padding

  • Independent variables on the same line cause coherence ping-pong under writes.
  • Pad or align per-thread data to cache line size to avoid false sharing.
  • Use tools to detect hotspots and contention: perf (including perf c2c for contended lines), VTune, and cachegrind for miss hotspots.
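#include <stdalign.h>  /* alignas in C11 */
typedef struct {
  alignas(64) volatile int value;  /* starts on its own 64-byte line */
  char pad[64 - sizeof(int)];      /* fills the rest of the line */
} counter_t;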
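
A C++ counterpart of the padding idea, sketched with per-thread counters that each occupy their own line. The 64-byte line size is an assumption (it varies by platform), the names are illustrative, and C++17 is assumed so that std::vector honors the over-aligned type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kLineSize = 64;  // assumed cache line size

// alignas on the struct rounds its size up to a full line, so adjacent
// array elements never share a cache line.
struct alignas(kLineSize) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kLineSize, "line-aligned counter");

// Each thread increments only its own counter; returns the sum of all.
inline long count_in_parallel(unsigned threads, long increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

Removing alignas(kLineSize) keeps the result correct but packs several counters into one line, which is the layout that triggers coherence ping-pong.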

Prefetching and coherence interactions

  • Hardware prefetchers track streams/strides; can amplify false sharing if write streams overlap.
  • Software prefetch hints (e.g., __builtin_prefetch) must balance pollution and timeliness.
  • NUMA placement interacts with coherence traffic; prefer local reads and reduce cross-socket writes.
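
A sketch of a software prefetch hint in an indirect (gather) loop, where hardware stride prefetchers usually cannot predict the next address. It assumes GCC or Clang for __builtin_prefetch; the function name gather_sum and the prefetch distance are illustrative, and a suitable distance has to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums data[idx[i]] while prefetching the element needed `distance`
// iterations ahead. Precondition: every idx[i] is less than data.size().
inline long gather_sum(const std::vector<int>& data,
                       const std::vector<std::size_t>& idx,
                       std::size_t distance) {
    long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + distance < idx.size()) {
            assert(idx[i + distance] < data.size());
            // Arguments: address, rw = 0 (read), locality = 1 (low reuse).
            __builtin_prefetch(&data[idx[i + distance]], 0, 1);
        }
        assert(idx[i] < data.size());
        sum += data[idx[i]];
    }
    return sum;
}
```

A distance that is too small fetches the line too late to hide latency; one that is too large risks evicting the line (or other useful lines) before it is used.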

Performance modeling and tuning

  • AMAT and bandwidth models; miss curves vs working set; stack distance analysis.
  • Tune: blocking/tiling, SoA vs AoS, streaming stores, large pages to reduce TLB misses.
  • Measure: cycles/op, misses (L1/L2/L3), snoop traffic, remote accesses on NUMA.
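
A worked sketch of the AMAT model, extended to two cache levels by treating the L1 miss penalty as the AMAT of the L2. The function names and the latencies in the example are assumptions for illustration, not measurements.

```cpp
#include <cassert>

// AMAT = hit_time + miss_rate * miss_penalty, with miss_rate in [0, 1].
inline double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

// Two-level model: the L1 miss penalty is itself the AMAT of the L2.
// `l2_local_miss_rate` is measured relative to accesses that reach the L2.
inline double amat_two_level(double l1_hit, double l1_miss_rate,
                             double l2_hit, double l2_local_miss_rate,
                             double memory_latency) {
    const double l1_penalty = amat(l2_hit, l2_local_miss_rate, memory_latency);
    return amat(l1_hit, l1_miss_rate, l1_penalty);
}
```

With a 4-cycle L1 hit, 10% L1 miss rate, 12-cycle L2 hit, 50% local L2 miss rate, and 200-cycle memory latency, the model gives 4 + 0.1 × (12 + 0.5 × 200) = 15.2 cycles.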

Exercises

  1. Implement a tiled matrix multiply and analyze L1/L2 miss rates with and without blocking.
  2. Create a microbenchmark that demonstrates false sharing; fix it via padding/alignas.
  3. Sketch a directory-based protocol for 8 cores; define states and messages.
Coherence ensures a single value per line across cores; performance depends on locality, contention, and policy choices.