
Branch prediction (static, dynamic, TAGE)


Motivation and penalties

  • Branches break sequential fetch; a misprediction flushes the pipeline, so the penalty grows with pipeline depth.
  • High-IPC cores rely on accurate prediction to keep front-end and back-end fed.
  • Workloads differ: loops (biased), if-else (data-dependent), indirect calls/jumps.

Static prediction

  • Always-not-taken, or the backward-taken/forward-not-taken (BTFNT) heuristic: backward branches are usually loop branches, so predict them taken.
  • Compiler annotations (likely/unlikely) and layout (fallthrough) reduce penalties.
  • No runtime adaptation; low hardware cost, limited accuracy.
// Example hint
if (__builtin_expect(cond, 1)) { /* likely */ }

Bimodal and 2-bit counters

  • Table indexed by PC bits; entries are 2-bit saturating counters predicting taken/not-taken.
  • Hysteresis prevents a single anomalous outcome from flipping the prediction; still weak on correlated patterns.
  • Aliasing occurs when many branches map to same entries.
// 2-bit saturating counter update (conceptual)
// states: 0 = strongly not-taken, 1 = weakly not-taken,
//         2 = weakly taken,      3 = strongly taken; predict taken when state >= 2
function update2bit(state, taken) {
  return taken ? Math.min(state + 1, 3) : Math.max(state - 1, 0);
}

Two-level adaptive (gshare/gselect)

  • History register captures recent outcomes; combined with PC to index pattern tables.
  • gshare: XOR(global history, PC); gselect: concatenate bits. Reduces destructive aliasing.
  • Local vs global history variations; tournament predictors choose between them.
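The two index functions can be sketched side by side; the 10-bit history width and the 5/5 bit split for gselect are illustrative assumptions:

```c
/* gshare vs gselect indexing into a pattern history table (PHT). */
#include <stdint.h>

#define HIST_BITS 10
#define PHT_ENTRIES (1u << HIST_BITS)

static uint32_t ghr;  /* global history register, newest outcome in bit 0 */

/* gshare: XOR history with PC bits; spreads branches across the PHT. */
uint32_t gshare_index(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & (PHT_ENTRIES - 1);
}

/* gselect: concatenate PC bits with history bits (5 + 5 here, an assumption). */
uint32_t gselect_index(uint32_t pc) {
    return (((pc >> 2) & 0x1F) << 5) | (ghr & 0x1F);
}

void shift_history(int taken) {  /* record a resolved branch outcome */
    ghr = ((ghr << 1) | (taken & 1)) & (PHT_ENTRIES - 1);
}
```

With the same PHT size, XORing tends to alias less destructively than concatenation because it mixes all PC and history bits rather than truncating each.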

TAGE and geometric history lengths

  • TAGE uses multiple tagged tables with histories of increasing geometric lengths.
  • The longest-history table with a tag match (the provider) supplies the prediction; the alternate prediction is used when the provider's confidence is low, e.g. a newly allocated entry.
  • Uses folded histories and tags to fit budget; achieves high accuracy at modest sizes.
Tables: T0 (base), T1..Tk (tagged)
Index:  f(PC, folded(H[L]))  with L in {l1, l2, ..., lk}
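The folding step compresses a history of length L down to the table's index width by XORing fixed-width chunks. This is a simplified illustration of the idea; real TAGE maintains folded histories incrementally rather than rescanning, and all names here are assumptions:

```c
/* Fold a long branch history into a short index by XORing width-sized chunks. */
#include <stdint.h>

uint32_t fold_history(const uint8_t *hist, int len, int width) {
    uint32_t folded = 0, chunk = 0;
    int bits = 0;
    for (int i = 0; i < len; i++) {
        chunk = (chunk << 1) | (hist[i] & 1);      /* accumulate one outcome bit */
        if (++bits == width) {                     /* chunk full: fold it in */
            folded ^= chunk;
            chunk = 0;
            bits = 0;
        }
    }
    return (folded ^ chunk) & ((1u << width) - 1); /* fold any leftover bits */
}

/* Matches the f(PC, folded(H[L])) shape above. */
uint32_t tage_index(uint32_t pc, uint32_t folded, int width) {
    return ((pc >> 2) ^ folded) & ((1u << width) - 1);
}
```

Each tagged table Ti uses its own history length L from the geometric series, so longer tables capture longer correlations while short ones stay quick to train.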

BTB, RAS, and indirect prediction

  • BTB supplies taken branch targets early; separate predictor supplies direction.
  • RAS tracks return addresses for call/ret, crucial for deep call chains.
  • Indirect branches need specialized predictors (e.g., target caches, perceptron-like).
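The return address stack in the second bullet is essentially a small circular stack: push the fall-through address on a predicted call, pop on a predicted return. The depth and wraparound-on-overflow policy below are illustrative choices:

```c
/* Return address stack (RAS) sketch for call/return target prediction. */
#include <stdint.h>

#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static int ras_top;                       /* count of live entries */

void ras_push(uint32_t return_addr) {     /* on a predicted call */
    ras[ras_top % RAS_DEPTH] = return_addr;
    ras_top++;
}

uint32_t ras_pop(void) {                  /* on a predicted return */
    if (ras_top > 0) ras_top--;
    return ras[ras_top % RAS_DEPTH];      /* overflowed deep chains wrap around */
}
```

Call chains deeper than RAS_DEPTH overwrite the oldest entries, so the outermost returns mispredict; this is why the depth matters for recursive workloads.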

Training, update policy, and warmup

  • Predictors update on resolve at execute/commit; some use partial speculative updates.
  • Allocation/aging policies prevent pollution; usefulness counters gate replacement.
  • Warmup effects matter in short-running phases; static hints can help.
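One way the usefulness-gated replacement in the second bullet can work is to refuse eviction while the victim's usefulness counter is nonzero, aging it instead. The entry layout and initial counter value below are assumptions for illustration:

```c
/* Usefulness-gated allocation sketch for a tagged predictor table entry. */
#include <stdint.h>

typedef struct {
    uint16_t tag;   /* partial tag from PC/history */
    uint8_t  ctr;   /* prediction counter */
    uint8_t  u;     /* usefulness counter */
} entry_t;

/* On a misprediction, try to allocate over a victim entry.
   Returns 1 if allocated, 0 if the victim was still useful (aged instead). */
int try_allocate(entry_t *victim, uint16_t new_tag) {
    if (victim->u > 0) {
        victim->u--;          /* useful entry: decay instead of evicting */
        return 0;
    }
    victim->tag = new_tag;
    victim->ctr = 2;          /* start weakly taken (a common choice) */
    victim->u = 0;            /* not yet proven useful */
    return 1;
}
```

The decay step means a hot entry survives a burst of allocation attempts, which is what keeps one noisy branch from polluting the whole table.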

Accuracy, MPKI, and pipeline integration

  • Metrics: accuracy, MPKI, fetch bubbles, and realized IPC uplift.
  • Critical latencies: predictor access, BTB lookup, and target delivery to fetch.
  • Recovery: checkpointing rename state; penalty minimization via early resolve.
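The MPKI metric converts directly into a CPI penalty: each misprediction per thousand instructions costs one flush. A back-of-envelope helper, with the example numbers purely illustrative:

```c
/* CPI added by branch mispredictions: (MPKI / 1000) * flush penalty in cycles. */
double mispredict_cpi(double mpki, double penalty_cycles) {
    return (mpki / 1000.0) * penalty_cycles;
}
/* e.g. 5 MPKI with a 15-cycle flush adds 0.075 cycles per instruction */
```

On a baseline CPI near 0.25 (IPC 4), that 0.075 is a ~30% slowdown, which is why deep, wide cores invest heavily in prediction accuracy.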

Exercises

  1. Compare accuracy/MPKI of 2-bit bimodal vs gshare for a loop-heavy trace.
  2. Sketch a 3-table TAGE with 4KB total storage; define history lengths and index functions.
  3. Design a recovery scheme for mispredictions with rename checkpoints and discuss overheads.

Modern cores rely on hybrid predictors (e.g., TAGE + BTB + RAS) to sustain high fetch throughput.