Reliability (ECC, parity)


Error sources and models

  • Soft errors: cosmic rays, alpha particles, voltage droops cause transient bit flips.
  • Hard errors: stuck-at faults, wear-out (NAND), manufacturing defects.
  • Fault models: single-bit, burst, random independent, spatially correlated.
FIT = failures per 10^9 device-hours; SER = soft error rate

Parity, checksums, CRC

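  • Parity: a single parity bit detects any odd number of bit flips, misses even counts, and cannot locate or correct the error.
  • Checksums (ones' complement) catch many errors cheaply; CRCs use polynomial division over GF(2) and detect every burst error no longer than the CRC width.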
  • End-to-end protection often layers CRC (link) and checksums (application).
// Example: CRC32 (interface sketch)
#include <stddef.h>  // size_t
#include <stdint.h>  // uint8_t, uint32_t
uint32_t crc32(const uint8_t* data, size_t len);
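The three detection schemes above can be sketched in a few lines of C. This is a minimal illustration, not a tuned implementation: the function names `parity8`, `ones_complement_sum16`, and `crc32_bitwise` are hypothetical, the checksum follows the RFC 1071 style of 16-bit big-endian words, and the CRC assumes the common reflected CRC-32 parameters (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte has an odd number of set bits, else 0. */
uint8_t parity8(uint8_t b) {
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1u;
}

/* 16-bit ones' complement checksum over big-endian 16-bit words. */
uint16_t ones_complement_sum16(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;        /* pad an odd trailing byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);  /* end-around carry */
    return (uint16_t)~sum;
}

/* Bit-at-a-time CRC-32; table-driven versions trade memory for speed. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

With these parameters, the ASCII string "123456789" gives the standard check value 0xCBF43926, and the bytes 00 01 F2 03 F4 F5 F6 F7 give a ones' complement checksum of 0x220D.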

ECC basics: Hamming, SECDED

  • Hamming codes add parity bits to locate single-bit errors (SEC); add overall parity for DED.
  • SECDED: Single-Error Correct, Double-Error Detect; common in server DIMMs.
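  • Overhead: 8 check bits per 64 data bits (a 72-bit codeword), i.e. 12.5% extra storage, is typical.
Syndrome s = H × r^T. With the columns of H numbered by bit position, a non-zero s is the index of the flipped bit.
s ≠ 0 and overall parity mismatch ⇒ single error, correct it; s ≠ 0 and overall parity OK ⇒ double error, detect only; s = 0 and overall parity mismatch ⇒ the overall parity bit itself flipped.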
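The syndrome rules are easier to follow on a small code. The sketch below uses an extended Hamming(8,4) code rather than the (72,64) code found on DIMMs, but the decode decision has the same structure. The bit layout (overall parity at bit 0, Hamming check bits at positions 1, 2, 4) and the names `secded_encode` and `secded_decode` are assumptions made for illustration.

```c
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_DOUBLE_ERROR };

/* 1 if v has an odd number of set bits. */
unsigned parity_of(uint8_t v) {
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Encode 4 data bits into an 8-bit codeword with even overall parity. */
uint8_t secded_encode(uint8_t nibble) {
    unsigned d0 = nibble & 1u, d1 = (nibble >> 1) & 1u;
    unsigned d2 = (nibble >> 2) & 1u, d3 = (nibble >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;  /* covers positions 1,3,5,7 */
    unsigned p2 = d0 ^ d2 ^ d3;  /* covers positions 2,3,6,7 */
    unsigned p4 = d1 ^ d2 ^ d3;  /* covers positions 4,5,6,7 */
    uint8_t cw = (uint8_t)(p1 << 1 | p2 << 2 | d0 << 3 | p4 << 4 |
                           d1 << 5 | d2 << 6 | d3 << 7);
    return (uint8_t)(cw | parity_of(cw));  /* bit 0 = overall parity */
}

/* Decode; writes the (possibly corrected) data unless a double error is seen. */
enum secded_status secded_decode(uint8_t cw, uint8_t *nibble) {
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos < 8; pos++)
        if (cw & (1u << pos))
            syndrome ^= pos;               /* XOR of the indices of set bits */
    enum secded_status st = SECDED_OK;
    if (parity_of(cw)) {
        /* Odd overall parity: one bit flipped. The syndrome names it;
           syndrome 0 means the overall parity bit (position 0) itself. */
        cw ^= (uint8_t)(1u << syndrome);
        st = SECDED_CORRECTED;
    } else if (syndrome != 0) {
        return SECDED_DOUBLE_ERROR;        /* even parity but non-zero syndrome */
    }
    *nibble = (uint8_t)(((cw >> 3) & 1u) | ((cw >> 5) & 1u) << 1 |
                        ((cw >> 6) & 1u) << 2 | ((cw >> 7) & 1u) << 3);
    return st;
}
```

Three or more flips can be miscorrected or go unnoticed by any SECDED code, which is one reason stronger codes and scrubbing are layered on top.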

Advanced ECC: Chipkill, Reed–Solomon, LDPC

  • Chipkill arranges each codeword so that any one DRAM chip contributes at most one correctable symbol, letting the code tolerate a whole-chip failure.
  • Reed–Solomon codes correct burst/multi-bit errors; used in storage and memory modules.
  • LDPC widely used in NAND flash controllers for high raw BER.
Interleaving + symbol-based codes → tolerate the loss of a whole multi-bit symbol (e.g. everything one x4 or x8 device supplies) or a full device failure
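A block interleaver shows why spreading symbols helps: a burst that hits adjacent symbols on the wire (or all symbols from one device) lands in different codewords after deinterleaving, so each codeword sees only one bad symbol. The sketch below is a generic row/column interleaver with hypothetical function names, not the layout of any particular Chipkill implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* rows = number of codewords (interleave depth), cols = symbols per codeword.
   Codewords are stored row by row in `in` and transmitted column by column. */
void interleave(const uint8_t *in, uint8_t *out, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* Inverse of interleave(): restores row-by-row codeword order. */
void deinterleave(const uint8_t *in, uint8_t *out, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[r * cols + c] = in[c * rows + r];
}

/* Number of symbols of codeword `row` that differ between two buffers. */
size_t symbols_damaged(const uint8_t *a, const uint8_t *b, size_t row, size_t cols) {
    size_t n = 0;
    for (size_t c = 0; c < cols; c++)
        if (a[row * cols + c] != b[row * cols + c])
            n++;
    return n;
}
```

With a depth of 4, a burst of 4 consecutive damaged symbols leaves exactly one damaged symbol in each of the 4 codewords, which a single-symbol-correcting code can repair.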

Memory reliability and rowhammer

  • Rowhammer: repeated activation of aggressor rows flips bits in victims.
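  • Mitigations: TRR (Target Row Refresh), ECC (raises the bar, though multi-bit flips can exceed SECDED), refresh rate increases, memory coloring.
  • Scrubbing: background reads that correct single-bit errors and write the fixed word back before a second error accumulates in the same word.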
RAS in memory controllers: patrol scrub, demand scrub, ECC logging, MCA

Storage/data-path integrity

  • Checksums from application to media (ZFS end-to-end, Btrfs, databases' page checksums).
  • SSD FTL ECC + controller CRCs + interface CRCs protect data at each hop.
  • Silent data corruption prevention via verifying reads, replication, scrubbing.
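The verify-on-read and repair-from-replica idea can be sketched as follows. Everything here is illustrative: the `struct block` layout, the 16-byte `BLOCK_SIZE`, and the function names are hypothetical, and `fletcher_style` is a simple two-sum checksum standing in for whatever checksum a real filesystem or database uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16  /* tiny block, for illustration only */

struct block {
    uint8_t  data[BLOCK_SIZE];
    uint32_t checksum;           /* computed at write time, stored with the block */
};

/* Two running sums modulo 65535, packed into 32 bits (stand-in checksum). */
uint32_t fletcher_style(const uint8_t *data, size_t len) {
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65535u;
        b = (b + a) % 65535u;
    }
    return (b << 16) | a;
}

void block_write(struct block *blk, const uint8_t *payload) {
    memcpy(blk->data, payload, BLOCK_SIZE);
    blk->checksum = fletcher_style(blk->data, BLOCK_SIZE);
}

int block_ok(const struct block *blk) {
    return fletcher_style(blk->data, BLOCK_SIZE) == blk->checksum;
}

/* Returns 0 for a clean read, 1 if the primary was repaired from the mirror,
   and -1 if neither copy verifies (the corruption is reported, not hidden). */
int read_verified(struct block *primary, const struct block *mirror, uint8_t *out) {
    int repaired = 0;
    if (!block_ok(primary)) {
        if (!block_ok(mirror))
            return -1;
        *primary = *mirror;      /* self-heal the bad copy */
        repaired = 1;
    }
    memcpy(out, primary->data, BLOCK_SIZE);
    return repaired;
}
```

The key property is that the checksum is produced above the storage stack and checked on the way back up, so corruption introduced at any hop in between is caught rather than returned silently.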

RAS features: detect, correct, recover

  • Machine Check Architecture (MCA) on x86, WHEA (Windows Hardware Error Architecture) logging; firmware-first handling on some systems.
  • Retry on transient errors; page offlining on hard errors; predictive failure analysis.
  • Redundancy (RAID, mirroring) and replication (software) add resilience.
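One way to turn correctable-error counts into a page-offlining decision is a per-page counter over a time window. The sketch below is a hypothetical policy, not the behavior of any specific OS or firmware: the threshold, the window length, and all names are assumptions chosen for illustration.

```c
#include <stdint.h>

#define CE_THRESHOLD   3u      /* assumed: offline after more than 3 CEs per window */
#define WINDOW_SECONDS 86400u  /* assumed: 24-hour counting window */

enum page_action { PAGE_KEEP, PAGE_OFFLINE };

struct page_health {
    uint32_t ce_count;      /* correctable errors seen in the current window */
    uint64_t window_start;  /* timestamp (seconds) when the window opened */
    int      offline;       /* set once the page has been retired */
};

/* Record one correctable error at time `now` and decide what to do. */
enum page_action record_correctable(struct page_health *p, uint64_t now) {
    if (now - p->window_start >= WINDOW_SECONDS) {
        p->window_start = now;  /* old errors age out: likely transient */
        p->ce_count = 0;
    }
    p->ce_count++;
    if (p->ce_count > CE_THRESHOLD) {
        p->offline = 1;         /* repeated errors suggest a hard fault */
        return PAGE_OFFLINE;
    }
    return PAGE_KEEP;
}
```

Isolated errors spread over days never trip the threshold, while a cluster of errors on one page within a window does, which matches the split between retrying transient errors and retiring pages with hard faults.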

Metrics: FIT, MTBF, UBER

  • FIT: failures per billion device-hours; MTBF: mean time between failures (for repairable systems), equal to 10^9/FIT hours under a constant failure rate.
  • UBER (uncorrectable bit error rate) for storage: probability of an uncorrectable error per bit read.
  • SLAs target tail behavior; design for detection and graceful degradation.
UBER 10^-15 ⇒ ~1 uncorrectable error per 10^15 bits read, about 125 TB (before RAID/replication)
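The metric conversions are simple arithmetic, sketched below. The function names are hypothetical, and the formulas assume a constant failure rate and independent errors, which real fleets only approximate.

```c
/* MTBF in hours for a component with the given FIT rate. */
double mtbf_hours_from_fit(double fit) {
    return 1e9 / fit;
}

/* Expected number of failures across `devices` units over `hours`. */
double expected_failures(double fit, double devices, double hours) {
    return fit * devices * hours / 1e9;
}

/* Expected uncorrectable errors when reading `bytes_read` bytes. */
double expected_uncorrectable(double uber, double bytes_read) {
    return uber * bytes_read * 8.0;
}
```

For example, 10,000 devices at 100 FIT each are expected to produce about 8.76 failures over a year (8,760 hours), and reading a full 12 TB drive once at UBER 10^-15 gives an expected 0.096 uncorrectable errors.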

Testing and monitoring

  • Memtest-like stress, ECC error counters, SMART stats, patrol scrubbing schedules.
  • Inject faults (where supported) to validate detection and recovery paths.
  • Alerting pipelines for correctable/uncorrectable events; capacity planning for replacements.

Exercises

  1. Implement a Hamming(72,64) SECDED encoder/decoder and measure detection/correction coverage.
  2. Simulate rowhammer with a memory access pattern; evaluate effect of increased refresh.
  3. Design an end-to-end integrity scheme for a storage service (app checksum + filesystem + device).
Reliability is layered: detect early, correct when possible, contain failures, and monitor continuously.