The Problem: Variable-Length Codes
Fixed-width encodings (ASCII, UTF-8 for the common Latin range) use the same number of bits for every character. That is simple but wasteful: if the letter 'e' appears far more often than 'z', why give both the same bit budget?
Huffman coding assigns shorter bit strings to frequent symbols and longer ones to rare symbols — and it does so optimally. "Optimally" here has a precise meaning: among all prefix-free binary codes (no codeword is the prefix of another, so decoding is unambiguous), Huffman's code minimizes the expected number of bits per symbol, weighted by frequency.
Building the Huffman Tree
The algorithm uses a min-heap (priority queue keyed on frequency):
- Create one leaf node for every symbol, keyed by its frequency.
- While more than one node remains in the heap:
- Pop the two nodes with the lowest frequencies.
- Create a new internal node whose frequency is their sum, with the two popped nodes as children.
- Push the new internal node back into the heap.
- The single remaining node is the root of the Huffman tree.
To read off a symbol's code: trace the path from the root to that symbol's leaf, recording 0 for each left edge and 1 for each right edge (or vice versa — the convention is arbitrary). Because no leaf is an ancestor of another leaf, the resulting codes are automatically prefix-free.
Watch the tree grow as the heap merges symbols in order of increasing frequency:
Why the Greedy Step Is Correct
The greedy choice is: always merge the two cheapest nodes first. The proof uses an exchange argument similar to activity selection:
Suppose an optimal tree does not place the two least-frequent symbols as siblings at the deepest level. We can swap whatever symbols are there with the two least-frequent ones without increasing total cost (because the least-frequent symbols now have longer codes, but they are the cheapest to "charge" for extra length). Repeating this argument shows that always merging the two minimum-frequency nodes produces a minimum-cost tree.
This is a classic greedy application: the greedy-choice property holds because the cheapest symbols can always be safely committed to the deepest position without harming the solution.
Code Implementation
Each entry in the heap is [frequency, id, left_child, right_child]. The id field breaks frequency ties (leaf nodes use their original index; merged nodes use -1). Traversing the returned root recursively and recording the left/right path gives each symbol's codeword.
Complexity Analysis
| Step | Cost | Why |
|---|---|---|
| Initial heapify | O(n) | bottom-up heap construction |
| n − 1 merge iterations | O(n log n) | each pop/push pair costs O(log n) |
| Code extraction (tree walk) | O(n) | one pass over the tree |
| Total | O(n log n) | dominated by the merge loop |
Space is O(n) for the heap and O(n) for the tree itself, giving O(n) overall.
Real-World Use
Huffman coding appears everywhere:
- DEFLATE (used in ZIP, PNG, gzip) combines Huffman coding with LZ77 back-references.
- JPEG / MPEG use Huffman coding in their entropy coding stage.
- HTTP/2 HPACK compresses headers with a static Huffman table derived from empirical web-request frequencies.
In all these cases the code table is transmitted alongside (or agreed upon in advance for static tables), and the decoder reconstructs symbols by walking the tree bit by bit until it reaches a leaf.
Key Takeaways
- A min-heap makes the greedy merge efficient. Without it, each round of finding two minimums would cost O(n), making the total O(n²).
- Prefix-free is essential for decoding. Because no codeword is a prefix of another, a decoder can scan bits left to right without ambiguity, stopping exactly when a leaf is reached.
- Huffman is optimal for symbol-by-symbol codes. It cannot be beaten by any other per-symbol prefix code given the same frequency table. Beating it requires coding across multiple symbols (arithmetic coding does exactly that).
- Greedy works here because of the exchange argument. The two least-frequent nodes can always be safely placed deepest — any other placement is at least as expensive.