The Magic Trick — O(1) Lookup
Arrays give you a superpower: hand me an index, I'll hand you the value in one step. nums[5] is instant — the CPU just adds 5 * sizeof(int) to the array's base address and reads memory. No searching, no looping.
Hash maps generalize that superpower to any key. Want the value for "banana"? One step. For (42, "user")? One step. For an arbitrary object? One step. That's why every language has a built-in hash map: Python's dict, JavaScript's Object and Map, Java's HashMap, Go's map, C++'s unordered_map.
The rest of this article is one question: how do we pull this off? How does the computer turn "banana" into an array index fast enough that the whole thing still feels like O(1)?
What Is a Hash Function?
A hash function takes a key — a string, a number, a tuple, anything — and crunches it down into an integer. Then we take that integer modulo the number of buckets, and that's our index.
The simplest possible hash for strings: sum the character codes, then mod by the bucket count.
Three different keys, three different bucket indices. We just turned arbitrary text into an array address.
Real-world hash functions (like Python's hash() or MurmurHash) are far more sophisticated — they spread keys evenly across the whole integer range. But the shape is identical: key in, index out.
The Collision Problem
Here's the catch. We have infinitely many possible keys but only finitely many buckets. By the pigeonhole principle, two different keys will eventually hash to the same bucket. That's a collision.
So we need a collision resolution strategy. There are two famous ones, and every real hash map uses one or the other.
Strategy 1 — Chaining
The first idea is brutally simple: each bucket holds a list. When two keys land in the same bucket, you append both to the list. Lookups search the list for a matching key.
Implementation is short. Each bucket is a Python list of (key, value) pairs:
If your hash function spreads keys well, every chain stays short (1–2 entries), and lookups stay effectively O(1). If it spreads poorly, one bucket gets a chain of length N and you're back to O(N) — that's the worst case.
Strategy 2 — Linear Probing (Open Addressing)
The other approach refuses to use lists. Every entry lives directly inside the bucket array. When you hit a collision, you just walk forward to the next empty slot.
The load factor trap. As the table fills past ~70%, probe chains get long fast — and once they cluster, every neighbour's lookup slows down too. Real implementations resize (double the table, rehash everything) when the load factor crosses a threshold. Skip that step and your O(1) silently rots into O(N).
Chaining vs. Linear Probing
| Aspect | Chaining | Linear Probing |
|---|---|---|
| Memory layout | Pointer chase to linked entries | Contiguous array of slots |
| Cache behavior | Poor — chains scatter across heap | Excellent — neighbouring slots are already in cache |
| Worst-case lookup | O(chain length) | O(load factor) — degrades sharply past 70% full |
| Delete | Easy — just remove from the list | Tombstone required so probe chains don't break |
| When to prefer | Variable-size values, lots of churn, simple to implement | Small fixed-size values, read-heavy, want cache speed |
Python's dict, Java's HashMap (pre-Java 8), and C++'s unordered_map all use chaining. Python's set and Rust's HashMap use open addressing. Neither is universally better.
Complexity
| Operation | Average | Worst Case | Space |
|---|---|---|---|
get(key) | O(1) | O(N) | — |
put(key, value) | O(1) | O(N) | O(N) total |
delete(key) | O(1) | O(N) | — |
A good hash function plus disciplined load-factor management keeps every operation in the average column. The worst case only shows up with a pathological hash function or an adversarial input — both rare in normal workloads.
Where You Use Them
- Two Sum and anything "have I seen this before?" — store seen values in a hash map for O(1) recall.
- Frequency counters — count characters for anagram detection, count words for ranking.
- De-duplication — drop a stream of items into a set, get unique values out.
- Database indexes — most secondary indexes are hash tables under the hood.
- Caches — LRU and LFU both wrap hash maps for O(1) lookup.
- Language built-ins — Python
dict, JavaScriptObject/Map, JavaHashMap, Gomap, C++unordered_map.
Whenever a problem starts smelling like "I need to remember something by name and get it back fast" — that's a hash map. Almost every interview problem you'll see leans on one.