Contents · CPU pipelines, hazards, forwarding


Pipeline basics and stages

  • Classic 5-stage RISC: IF → ID → EX → MEM → WB.
  • Goal: overlap instruction execution to increase throughput; a scalar pipeline approaches CPI ≈ 1 in steady state (CPI below 1 requires superscalar issue).
  • Deeper pipelines increase frequency but raise penalty for hazards and mispredictions.
Cycle     IF     ID     EX     MEM    WB
1         I1
2         I2     I1
3         I3     I2     I1
4         I4     I3     I2     I1
5         I5     I4     I3     I2     I1
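
The diagram above follows a simple rule: with no stalls, instruction i occupies stage s (counting from 0) in cycle i + s, so n instructions finish in k + n - 1 cycles on a k-stage pipeline. A minimal sketch of that rule follows; the names STAGES, idealCycles, and occupancy are illustrative and not from the original notes, and the model assumes no hazards at all.

```javascript
// Ideal (hazard-free) timing for the classic 5-stage pipeline.
// All names here are hypothetical helpers for illustration.
const STAGES = ['IF', 'ID', 'EX', 'MEM', 'WB'];

// Total cycles to run n instructions on a k-stage pipeline: k + (n - 1).
function idealCycles(numInstructions, numStages = STAGES.length) {
  if (numInstructions <= 0) return 0;
  return numStages + numInstructions - 1;
}

// Which instruction (1-based) sits in each stage during a given cycle (1-based)?
// A stage with no instruction in it is reported as null.
function occupancy(cycle, numInstructions) {
  const row = {};
  STAGES.forEach((stage, s) => {
    const instr = cycle - s; // instruction i is in stage s during cycle i + s
    row[stage] = instr >= 1 && instr <= numInstructions ? instr : null;
  });
  return row;
}

// Five instructions need 5 + 5 - 1 = 9 cycles; cycle 5 is the first full row.
console.log(idealCycles(5));
console.log(occupancy(5, 5));
```

As n grows, idealCycles(n) / n tends toward 1, which is the steady-state CPI of the ideal scalar pipeline.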

Hazards: structural, data, control

  • Structural: resource conflict (e.g., unified MEM for instr+data) → solve by replication/partitioning.
  • Data: RAW (true dependency), WAR/WAW (name deps; appear in OoO). RAW handled by forwarding or stalls.
  • Control: a branch or exception changes the PC; mitigated by branch prediction or branch delay slots.
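
The three data-hazard classes can be told apart by comparing the destination and source registers of two instructions in program order. The sketch below assumes a hypothetical instruction shape, { dest, srcs }, with dest set to null when nothing is written; it only classifies the dependency and says nothing about whether a particular pipeline would stall on it.

```javascript
// Classify data dependencies between two instructions in program order.
// Assumed (hypothetical) shape: { dest: 'r1' or null, srcs: ['r2', 'r3'] }.
function classifyDataHazards(first, second) {
  const hazards = [];
  // RAW: the second instruction reads what the first writes (true dependency).
  if (first.dest != null && second.srcs.includes(first.dest)) hazards.push('RAW');
  // WAR: the second instruction overwrites a register the first still reads.
  if (second.dest != null && first.srcs.includes(second.dest)) hazards.push('WAR');
  // WAW: both instructions write the same register.
  if (first.dest != null && first.dest === second.dest) hazards.push('WAW');
  return hazards;
}

const add = { dest: 'r1', srcs: ['r2', 'r3'] }; // add r1, r2, r3
const sub = { dest: 'r4', srcs: ['r1', 'r5'] }; // sub r4, r1, r5
console.log(classifyDataHazards(add, sub));     // sub reads r1 written by add
```

In an in-order 5-stage pipeline only the RAW case matters, since writes retire in program order; WAR and WAW become relevant once instructions can complete out of order.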

Forwarding/bypassing and interlocks

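  • Forward results from the EX/MEM and MEM/WB pipeline registers to the ALU inputs of a dependent instruction in EX, satisfying RAW dependencies without waiting for WB.
  • A load-use hazard still needs a one-cycle stall in the classic 5-stage pipeline, because load data is not available until the end of MEM.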
  • Hardware interlocks detect hazards and insert bubbles automatically.
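// Detect a load-use hazard between consecutive instructions (conceptual):
// curr must stall one cycle if it reads the register that prev, a LOAD, writes.
// Assumes dest is null/undefined when an instruction writes no register.
function needsStall(prev, curr) {
  return prev.op === 'LOAD' && prev.dest != null &&
    (prev.dest === curr.src1 || prev.dest === curr.src2);
}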
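
The forwarding decision itself can be sketched as a per-operand mux select, in the style of the textbook forwarding unit. The function name forwardSelect and the field names regWrite and rd are illustrative assumptions, and the sketch assumes register 0 is hardwired to zero (as in MIPS or RISC-V), so it is never forwarded.

```javascript
// Choose the source for one ALU operand of the instruction currently in EX.
// exMem / memWb describe the instructions one and two slots ahead of it.
// Assumed (hypothetical) shape: { regWrite: boolean, rd: number }.
function forwardSelect(srcReg, exMem, memWb) {
  // The newest value wins: check the EX/MEM register before MEM/WB.
  if (exMem.regWrite && exMem.rd !== 0 && exMem.rd === srcReg) return 'EX/MEM';
  if (memWb.regWrite && memWb.rd !== 0 && memWb.rd === srcReg) return 'MEM/WB';
  return 'REGFILE'; // no in-flight producer; use the value read in ID
}

// add r1, r2, r3 followed immediately by sub r4, r1, r5:
// when sub is in EX, add sits in EX/MEM, so r1 comes from there.
console.log(forwardSelect(1, { regWrite: true, rd: 1 }, { regWrite: false, rd: 0 }));
```

Checking EX/MEM first matters when both in-flight instructions write the same register: the later write is the one the consumer must see. The sketch does not cover a load in EX/MEM, whose data is not ready yet; that case is left to the interlock.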

Stalls, bubbles, and scoreboard

  • Stalls insert bubbles to maintain correctness; scoreboard tracks FU and register availability.
  • Static scheduling (compilers) can reorder to reduce stalls; software pipelining for loops.
  • OoO execution with Tomasulo's algorithm goes beyond scoreboarding by adding reservation stations and register renaming, which removes WAR/WAW stalls.
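
To illustrate how static scheduling removes bubbles, the sketch below counts load-use stalls in a straight-line sequence and compares two orderings of the same four instructions. It assumes a 5-stage pipeline with full forwarding, where the only remaining stall is one cycle when an instruction reads the result of the load directly before it; the instruction shape and the function name countLoadUseStalls are hypothetical.

```javascript
// Count one-cycle bubbles caused by load-use pairs in program order.
// Assumed (hypothetical) shape: { op: 'LOAD' | 'ALU', dest: 'r1', srcs: ['r2'] }.
function countLoadUseStalls(program) {
  let stalls = 0;
  for (let i = 1; i < program.length; i++) {
    const prev = program[i - 1];
    const curr = program[i];
    // Only a LOAD immediately followed by a consumer costs a bubble here.
    if (prev.op === 'LOAD' && prev.dest != null && curr.srcs.includes(prev.dest)) {
      stalls += 1;
    }
  }
  return stalls;
}

const lw1 = { op: 'LOAD', dest: 'r1', srcs: ['r2'] };       // lw  r1, 0(r2)
const add = { op: 'ALU',  dest: 'r3', srcs: ['r1', 'r4'] }; // add r3, r1, r4
const lw2 = { op: 'LOAD', dest: 'r5', srcs: ['r2'] };       // lw  r5, 4(r2)
const sub = { op: 'ALU',  dest: 'r6', srcs: ['r5', 'r3'] }; // sub r6, r5, r3

console.log(countLoadUseStalls([lw1, add, lw2, sub])); // each load feeds the next instruction
console.log(countLoadUseStalls([lw1, lw2, add, sub])); // loads hoisted; consumers are one slot later
```

The original order has two load-use pairs and so two bubbles; hoisting the second load above the add leaves none, because every remaining dependency can be covered by forwarding. Total cycles are then 5 + n - 1 plus the stall count.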

Branches and prediction

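  • Prediction lets fetch continue past a branch before it resolves; a flush is needed only on a misprediction. Static schemes: always-not-taken, or backward-taken/forward-not-taken (BTFNT).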
  • Dynamic: 1-bit/2-bit saturating counters, gshare, TAGE; BTB and RAS reduce fetch bubbles.
  • Misprediction penalty scales with depth; early resolve and predictor accuracy are critical.
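
A 2-bit saturating counter is the simplest of the dynamic schemes listed above. The sketch below is a minimal table of such counters indexed by low PC bits; the class name, table size, indexing, and initial state are all illustrative assumptions rather than a description of any real predictor.

```javascript
// Table of 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken.
class TwoBitPredictor {
  constructor(entries = 1024) {
    // Assumption: entries is a power of two so a bit mask can index the table.
    this.mask = entries - 1;
    this.table = new Uint8Array(entries).fill(1); // start weakly not-taken
  }
  index(pc) {
    return (pc >>> 2) & this.mask; // drop byte offset of 4-byte instructions
  }
  predict(pc) {
    return this.table[this.index(pc)] >= 2;
  }
  update(pc, taken) {
    const i = this.index(pc);
    this.table[i] = taken
      ? Math.min(3, this.table[i] + 1)  // saturate at strongly taken
      : Math.max(0, this.table[i] - 1); // saturate at strongly not-taken
  }
}

// Replay a sequence of branch outcomes and count wrong predictions.
function countMispredictions(predictor, pc, outcomes) {
  let wrong = 0;
  for (const taken of outcomes) {
    if (predictor.predict(pc) !== taken) wrong += 1;
    predictor.update(pc, taken);
  }
  return wrong;
}

// A loop branch taken 9 times, then falling through, executed twice.
const onePass = [...Array(9).fill(true), false];
console.log(countMispredictions(new TwoBitPredictor(), 0x400, [...onePass, ...onePass]));
```

On this pattern the counter mispredicts twice in the first pass (warm-up and loop exit) and once in the second (loop exit only), since a single not-taken outcome moves it from strongly to weakly taken without flipping the prediction. A 1-bit scheme would also mispredict on re-entering the loop.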

Pipeline depth, IPC, and ILP

  • IPC bounded by hazards, FU count, memory latency, and control speculation quality.
  • Deeper pipelines clock higher but suffer larger penalties; superscalar/OoO extract more ILP.
  • Front-end throughput (fetch/decode) and memory hierarchy often limit realized IPC.
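
A common first-order way to combine these effects is to add the average stall cycles per instruction to the base CPI. The sketch below does this for load-use stalls and branch mispredictions only; the function name, parameter names, and example numbers are made up for illustration, and the model assumes the stall sources are independent and ignores cache misses.

```javascript
// First-order CPI model: base CPI plus average stall cycles per instruction.
// All parameter names and default values are illustrative assumptions.
function effectiveCpi({
  baseCpi = 1,          // ideal scalar pipeline
  loadFraction,         // fraction of instructions that are loads
  loadUseFraction,      // fraction of loads followed directly by a consumer
  loadUsePenalty = 1,   // bubbles per load-use hazard
  branchFraction,       // fraction of instructions that are branches
  mispredictRate,       // fraction of branches mispredicted
  mispredictPenalty,    // cycles flushed per misprediction
}) {
  const loadStalls = loadFraction * loadUseFraction * loadUsePenalty;
  const branchStalls = branchFraction * mispredictRate * mispredictPenalty;
  return baseCpi + loadStalls + branchStalls;
}

// Hypothetical workload: 25% loads (40% with an immediate use),
// 20% branches, 10% mispredicted, 3-cycle penalty.
const cpi = effectiveCpi({
  loadFraction: 0.25, loadUseFraction: 0.4,
  branchFraction: 0.2, mispredictRate: 0.1, mispredictPenalty: 3,
});
console.log(cpi.toFixed(2), (1 / cpi).toFixed(2)); // CPI and the matching IPC
```

With these made-up numbers the load term contributes 0.10 and the branch term 0.06, giving a CPI of about 1.16. Raising mispredictPenalty, as a deeper pipeline would, increases only the branch term.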

Exercises

  1. Schedule a basic block to minimize stalls on a 5-stage pipeline with a 1-cycle load-use delay.
  2. Compute CPI for a workload given hazard frequencies and misprediction rate/penalty.
  3. Design a simple forwarding network for ALU results and analyze critical path.
Forwarding and prediction recover performance; correctness is preserved with interlocks and precise control.