,

Contents · Hash tables (separate chaining, open addressing, cuckoo)


Overview

Hash tables are one of the most important data structures in computer science, providing average-case O(1) time complexity for insertion, deletion, and lookup operations. They map keys to values using a hash function that computes an index into an array of buckets.

Core Concept:

A hash table uses a hash function h(k) to map a key k to an index in an array. When multiple keys hash to the same index (a collision), different strategies handle the conflict.

Key Operations:

  • Insert(key, value): Add a key-value pair to the table
  • Search(key): Find the value associated with a key
  • Delete(key): Remove a key-value pair from the table

Collision Resolution: The primary challenge in hash table design is handling collisions when multiple keys hash to the same index. Three major approaches exist:

  • Separate Chaining: Store colliding elements in linked lists
  • Open Addressing: Find alternative locations within the array
  • Cuckoo Hashing: Use multiple hash functions with guaranteed O(1) lookup

Hash Functions

A good hash function is crucial for hash table performance. It should distribute keys uniformly across the table to minimize collisions.

Properties of Good Hash Functions:

  • Deterministic: Same key always produces same hash value
  • Uniform Distribution: Keys spread evenly across table
  • Fast Computation: Hash calculation should be efficient
  • Avalanche Effect: Small changes in key produce large changes in hash

Common Hash Functions:

Division Method:

h(k) = k mod m, where m is table size (preferably prime)

  • Simple and fast
  • Choose m as prime not close to power of 2
  • Example: h(123) = 123 mod 97 = 26

Multiplication Method:

h(k) = ⌊m × (k × A mod 1)⌋, where 0 < A < 1

  • A ≈ (√5 - 1)/2 ≈ 0.618 (golden ratio) works well
  • m can be any value (power of 2 is convenient)
  • Less sensitive to choice of m

Universal Hashing:

h(k) = ((a × k + b) mod p) mod m

  • Choose random a, b from {0, 1, ..., p-1} where p is prime > universe size
  • Guarantees expected O(1) performance regardless of input
  • Protects against adversarial inputs

String Hashing:

For strings, polynomial rolling hash is common:

h(s) = (s[0] × p^(n-1) + s[1] × p^(n-2) + ... + s[n-1]) mod m

  • p is a prime (31 or 37 commonly used)
  • Treats string as base-p number
  • Can be computed incrementally

Modern Hash Functions:

  • MurmurHash: Fast, good distribution, widely used
  • CityHash: Google's fast hash for strings
  • xxHash: Extremely fast, excellent quality
  • SipHash: Cryptographically strong, prevents hash flooding attacks

Separate Chaining

Separate chaining handles collisions by maintaining a linked list (or other data structure) at each array index. All keys that hash to the same index are stored in that index's list.

Structure:

  • Array of size m (table size)
  • Each array element points to a linked list
  • Colliding elements added to the list

Operations:

Insert(key, value):

  1. Compute index: i = h(key)
  2. Search list at table[i] for key
  3. If found, update value; otherwise append to list
  4. Time: O(1) average, O(n) worst case

Search(key):

  1. Compute index: i = h(key)
  2. Traverse list at table[i] to find key
  3. Return value if found, null otherwise
  4. Time: O(1) average, O(n) worst case

Delete(key):

  1. Compute index: i = h(key)
  2. Search and remove from list at table[i]
  3. Time: O(1) average, O(n) worst case

Load Factor:

α = n/m (n = number of elements, m = table size)

  • Average chain length is α
  • Expected search time: Θ(1 + α)
  • Keep α < 1 for good performance
  • Resize when α exceeds threshold (typically 0.75)

Advantages:

  • Simple to implement
  • Never fills up (can always add more elements)
  • Deletion is straightforward
  • Performance degrades gracefully with high load
  • Works well with poor hash functions

Disadvantages:

  • Extra memory for pointers
  • Poor cache locality (linked list traversal)
  • Worst case O(n) if all keys hash to same index
  • Memory allocation overhead for nodes

Optimizations:

  • Self-balancing trees: Use AVL/Red-Black trees instead of lists for O(log n) worst case
  • Move-to-front: Move accessed elements to front of list
  • Sorted lists: Keep lists sorted for faster search
  • Dynamic arrays: Use arrays instead of linked lists for better cache performance

Applications:

  • Symbol tables in compilers
  • Database indexing
  • Caching systems
  • Dictionaries in Python, JavaScript objects

Open Addressing

Open addressing stores all elements directly in the hash table array. When a collision occurs, it probes for the next available slot using a deterministic sequence.

Core Idea:

Instead of storing colliding elements externally, find another location within the table itself. The probe sequence determines which slots to check.

Probe Sequences:

1. Linear Probing:

h(k, i) = (h(k) + i) mod m

  • Check slots sequentially: h(k), h(k)+1, h(k)+2, ...
  • Advantages: Simple, good cache locality
  • Disadvantages: Primary clustering (consecutive occupied slots)
  • Clusters grow quickly, degrading performance

2. Quadratic Probing:

h(k, i) = (h(k) + c₁×i + c₂×i²) mod m

  • Common: c₁ = c₂ = 1/2, so h(k, i) = (h(k) + i²) mod m
  • Advantages: Reduces primary clustering
  • Disadvantages: Secondary clustering (keys with same h(k) follow same probe sequence)
  • May not probe all slots (requires m to be prime or power of 2)

3. Double Hashing:

h(k, i) = (h₁(k) + i × h₂(k)) mod m

  • Uses two hash functions
  • h₂(k) must be relatively prime to m (ensure h₂(k) ≠ 0)
  • Advantages: Eliminates clustering, uniform probing
  • Disadvantages: Two hash computations, poor cache locality
  • Best open addressing method for uniform distribution

Operations:

Insert(key, value):

  1. Probe sequence: h(k,0), h(k,1), h(k,2), ...
  2. Insert at first empty or deleted slot
  3. Fail if table is full

Search(key):

  1. Follow same probe sequence
  2. Return value if key found
  3. Stop at empty slot (key not present)
  4. Continue past deleted slots

Delete(key):

  1. Find key using search
  2. Mark slot as "deleted" (not empty)
  3. Cannot mark as empty (would break search chains)
  4. Deleted slots reusable for insertion

Load Factor Constraints:

  • Must keep α < 1 (cannot exceed table size)
  • Performance degrades rapidly as α approaches 1
  • Typical threshold: α ≤ 0.5 to 0.7
  • Resize when threshold exceeded

Advantages:

  • No extra memory for pointers
  • Better cache locality (especially linear probing)
  • Simpler memory management
  • Faster for small keys/values

Disadvantages:

  • Clustering issues (primary and secondary)
  • Deletion is complex (tombstone problem)
  • Performance degrades sharply at high load factors
  • Must resize more aggressively
  • Cannot exceed table size

Practical Considerations:

  • Resizing: Typically double size when α > 0.7, rehash all elements
  • Tombstones: Rebuild table periodically to remove deleted markers
  • Table size: Use prime numbers or powers of 2 depending on probe method
  • Robin Hood hashing: Optimization that reduces variance in probe lengths

Cuckoo Hashing

Cuckoo hashing is a scheme that guarantees O(1) worst-case lookup time by using multiple hash functions and allowing elements to "kick out" other elements during insertion.

Core Concept:

Each key can be stored in one of two (or more) possible locations, determined by different hash functions. If both locations are occupied, evict one element and reinsert it in its alternate location.

Structure:

  • Two hash tables T₁ and T₂ (or one table with two hash functions)
  • Two hash functions h₁(k) and h₂(k)
  • Key k can be in T₁[h₁(k)] or T₂[h₂(k)]
  • Each location stores at most one key

Operations:

Search(key):

  1. Check T₁[h₁(key)]
  2. If not found, check T₂[h₂(key)]
  3. Return value if found in either location
  4. Time: O(1) worst case - only two locations to check!

Insert(key, value):

  1. If T₁[h₁(key)] is empty, insert there
  2. If T₂[h₂(key)] is empty, insert there
  3. Otherwise, evict element from T₁[h₁(key)], insert new key
  4. Reinsert evicted key in its alternate location (T₂)
  5. Continue eviction chain until finding empty slot
  6. If cycle detected or max iterations reached, rehash entire table
  7. Time: O(1) amortized

Delete(key):

  1. Check both locations
  2. Remove if found
  3. Time: O(1) worst case

Insertion Example:

Inserting key x when both h₁(x) and h₂(x) are occupied:

  1. Place x at h₁(x), evict y
  2. Place y at h₂(y), evict z
  3. Place z at h₁(z) (empty)

This creates a "cuckoo" effect where elements kick each other out like cuckoo birds.

Cycle Detection:

  • Insertion may create infinite loop if cycle exists
  • Set maximum number of evictions (e.g., 500)
  • If exceeded, rebuild table with new hash functions
  • With good hash functions, cycles are rare

Load Factor:

  • Theoretical maximum: α < 0.5 for two hash functions
  • Practical limit: α ≈ 0.49 for good performance
  • With d hash functions: α < 1 - 1/e^d
  • More hash functions allow higher load factors

Variants:

d-ary Cuckoo Hashing:

  • Use d > 2 hash functions
  • d = 3: α ≈ 0.91 achievable
  • d = 4: α ≈ 0.97 achievable
  • Trade-off: more hash computations for higher load factor

Buckets:

  • Store multiple items per location (bucket size b)
  • Reduces eviction chains
  • With b = 4, α ≈ 0.95 achievable

Advantages:

  • Guaranteed O(1) lookup: Only check d locations
  • Guaranteed O(1) deletion: Simple removal
  • Excellent cache performance (few memory accesses)
  • No chaining overhead
  • Predictable performance

Disadvantages:

  • Insertion can be slow (eviction chains)
  • Occasional expensive rehashing
  • Lower maximum load factor than chaining
  • Complex implementation
  • Requires good hash functions

Applications:

  • Network routers (fast lookup critical)
  • Hardware implementations (bounded lookup time)
  • Real-time systems (predictable performance)
  • Concurrent hash tables (easier to lock individual locations)
  • Bloom filter alternatives

Performance Comparison

Complexity Table:

Method Search (Avg) Search (Worst) Insert (Avg) Delete
Separate Chaining O(1 + α) O(n) O(1) O(1 + α)
Linear Probing O(1/(1-α)) O(n) O(1/(1-α)) O(1/(1-α))
Double Hashing O(1/(1-α)) O(n) O(1/(1-α)) O(1/(1-α))
Cuckoo Hashing O(1) O(1) O(1) amort O(1)

Space Comparison:

  • Separate Chaining: Extra space for pointers (typically 8-16 bytes per entry)
  • Open Addressing: No extra space, but requires lower load factor
  • Cuckoo Hashing: No extra space, but requires lowest load factor

Cache Performance:

  • Linear Probing: Best cache locality (sequential access)
  • Separate Chaining: Poor cache locality (pointer chasing)
  • Cuckoo Hashing: Good cache locality (bounded accesses)
  • Double Hashing: Moderate cache locality (scattered access)

Practical Guidelines:

Use Separate Chaining when:

  • Memory is not a primary concern
  • Load factor may exceed 0.7
  • Deletion is frequent
  • Hash function quality is uncertain
  • Simplicity is important

Use Open Addressing (Linear Probing) when:

  • Memory efficiency is critical
  • Keys/values are small
  • Cache performance matters
  • Load factor stays below 0.7
  • Deletion is rare

Use Cuckoo Hashing when:

  • Worst-case lookup time is critical
  • Reads far outnumber writes
  • Predictable performance needed
  • Concurrent access (easier locking)
  • Hardware implementation

Real-World Usage:

  • Python dict: Open addressing with random probing
  • Java HashMap: Separate chaining (trees for long chains)
  • C++ unordered_map: Separate chaining
  • Redis: Separate chaining with incremental rehashing
  • Network switches: Cuckoo hashing for routing tables

Exercises

  1. Implement Separate Chaining: Build a hash table using separate chaining with linked lists. Support insert, search, delete, and resize operations.
  2. Linear Probing Implementation: Implement open addressing with linear probing. Handle tombstones correctly for deletion.
  3. Analyze Clustering: Generate random keys and visualize primary clustering in linear probing vs. uniform distribution in double hashing.
  4. Cuckoo Hashing: Implement basic cuckoo hashing with two hash functions. Detect cycles and trigger rehashing.
  5. Load Factor Experiments: Measure average probe length for different load factors (0.5, 0.7, 0.9) in linear probing.
  6. Hash Function Quality: Compare different hash functions (division, multiplication, universal) on real-world data.
  7. Two-Sum Problem: Solve the two-sum problem using a hash table. Analyze time and space complexity.
  8. LRU Cache: Implement an LRU cache using a hash table and doubly linked list for O(1) operations.
  9. Anagram Groups: Group anagrams using hash tables. Use sorted string or character count as hash key.
  10. Concurrent Hash Table: Design a thread-safe hash table using fine-grained locking or lock-free techniques.