,

Contents ยท SIMD/Vector extensions (SSE/AVX/NEON)


Why SIMD? Data-level parallelism

  • SIMD executes same op on multiple data lanes (SIMD width varies by ISA).
  • Great for dense arrays, image/audio processing, ML kernels, crypto.
  • Challenges: alignment, tails, branching/divergence.

ISAs overview: SSE, AVX(2/512), NEON/SVE

  • x86: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit, masks, more ops).
  • ARM: NEON (128-bit), SVE (scalable vector length, predication-first).
  • Feature gating: CPUID (x86) or HWCAPS (ARM) for runtime dispatch.

Execution model: lanes, masks, widening/narrowing

  • Vector length determines lane count per element size (e.g., 8x32-bit in 256-bit).
  • Widening/narrowing converts element sizes; saturation avoids overflow.
  • Permute/shuffle reorder elements for alignment and algorithm structure.

Loads/stores: alignment, gather/scatter

  • Aligned loads/stores are faster; unaligned supported with possible penalties.
  • Gather/scatter access sparse patterns; latency sensitive to cache/TLB behavior.
  • Streaming stores reduce write-allocations for bandwidth-bound workloads.

Core ops: arithmetic, shuffle, permute, horizontal

  • Add/mul/fma, min/max, compare; horizontal sum/min/max reduce across lanes.
  • Shuffles and blends compose complex permutations.
  • Bitwise ops and shifts enable masks and packing/unpacking.
// AVX2 horizontal sum of 8 floats (example)
#include 
float hsum8_avx2(__m256 v){
  __m128 lo = _mm256_castps256_ps128(v);
  __m128 hi = _mm256_extractf128_ps(v, 1);
  __m128 sum = _mm_add_ps(lo, hi);
  sum = _mm_hadd_ps(sum, sum);
  sum = _mm_hadd_ps(sum, sum);
  float out; _mm_store_ss(&out, sum); return out;
}

Masking and predication

  • AVX-512 uses k-mask registers; SVE predicates every instruction.
  • Masks handle tails and conditional ops without branches.
  • Beware of masked loads/stores side effects and alignment.

Throughput, latency, and memory bandwidth

  • Roofline: performance bounded by min(compute throughput, memory bandwidth).
  • Vector width helps only if memory and dependencies are not limiting.
  • Pay attention to port pressure and AGU/LD/ST unit counts.

Intrinsics and auto-vectorization

  • Compilers can auto-vectorize simple loops with -O3 and appropriate flags.
  • Intrinsics provide fine control; ensure correctness across feature sets.
  • Libraries: BLAS, Eigen, Halide, ISPC use SIMD under the hood.
// Simple vectorized loop (auto-vectorization friendly)
for (int i = 0; i < n; i++) {
  a[i] = b[i] * c + d;
}

Portability, feature detection

  • Runtime dispatch via CPUID (x86) or getauxval(HWCAP) on Linux/ARM.
  • Fat binaries or multi-versioned functions dispatch to best available ISA.
  • Fallback scalar paths ensure correctness on older CPUs.
// CPUID check (simplified) for AVX2
#include 
#include 
int has_avx2(){
  int a,b,c,d;
  __cpuid_count(7, 0, a,b,c,d);
  return (b & (1<<5)) != 0; // AVX2 bit
}

Exercises

  1. Implement a vectorized SAXPY with SSE/AVX and measure speedup vs scalar.
  2. Rewrite a convolution using shuffles and FMA; analyze memory bandwidth.
  3. Add runtime dispatch for NEON vs SSE/AVX on an image processing routine.
SIMD boosts throughput when memory and control allow; design for alignment, tails, and portability.