Load Balancers & CDNs — Scaling Web Infrastructure
A single server has finite CPU, memory, and network capacity. Past a certain traffic volume, it cannot respond fast enough — or at all. Load balancers and content delivery networks (CDNs) are the two main tools for scaling beyond a single machine, and they operate at different layers of the stack.
Why You Need a Load Balancer
Suppose your application runs on one server. The problems are:
- Single point of failure — one hardware fault or deployment crash takes down the whole product.
- Vertical scaling ceiling — you can only buy so much CPU/RAM for one machine, and it gets expensive fast.
- No zero-downtime deploys — you can't restart the server without dropping connections.
A load balancer sits in front of a pool of backend servers and distributes incoming connections across them.
Now any single backend can be removed (for maintenance, upgrade, or failure) without users noticing — the load balancer stops sending traffic to it.
Layer 4 vs Layer 7 Load Balancing
Layer 4 — Transport Layer
The load balancer routes based on IP addresses and TCP/UDP ports. It never inspects the application payload.
Characteristics:
- Very fast — minimal processing per packet.
- Cannot make routing decisions based on URL, HTTP headers, or cookies.
- Works for any TCP or UDP protocol (HTTPS, DNS, SMTP, game servers).
- Maintains TCP session state — a connection to one backend stays on that backend.
Use case: High-throughput, low-latency routing where you don't need to inspect content. AWS NLB is a managed L4 load balancer.
Layer 7 — Application Layer
The load balancer terminates the client connection, reads the HTTP request, then opens a separate connection to a backend.
What L7 can do that L4 cannot:
- Route /api/* to one set of servers and /static/* to another.
- Add or strip HTTP headers (e.g., inject X-Real-IP).
- Route based on Host header to different virtual hosts.
- Sticky sessions via cookie inspection.
- Rewrite URLs, redirect HTTP to HTTPS.
- Terminate TLS and perform SSL offloading.
- WAF (Web Application Firewall) inspection.
Cost: More CPU per request than L4. AWS ALB is a managed L7 load balancer; NGINX is a software one you run yourself.
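The path-based routing and header injection listed above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer is implemented; the pool names (`api-pool`, `static-pool`, `web-pool`) and the helper functions are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: first matching pattern wins.
ROUTES = [
    ("/api/*", "api-pool"),
    ("/static/*", "static-pool"),
]
DEFAULT_POOL = "web-pool"


def choose_pool(path):
    """Pick a backend pool from the request path, which only an L7 balancer can see."""
    for pattern, pool in ROUTES:
        if fnmatchcase(path, pattern):
            return pool
    return DEFAULT_POOL


def forwarded_headers(headers, client_ip):
    """Return a copy of the request headers with X-Real-IP injected for the backend."""
    out = dict(headers)
    out["X-Real-IP"] = client_ip
    return out


print(choose_pool("/api/users"), choose_pool("/static/app.js"), choose_pool("/"))
print(forwarded_headers({"Host": "example.com"}, "203.0.113.7"))
```

An L4 balancer could not run `choose_pool` at all, because it never parses the request line.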
L4 vs L7 Comparison
| Feature | L4 | L7 |
|---|---|---|
| Latency | Very low | Low (slightly higher) |
| Content-based routing | No | Yes |
| TLS termination | No (pass-through) or Yes | Yes |
| Protocol support | Any TCP/UDP | HTTP, HTTPS, WebSocket, gRPC |
| Session persistence | By IP | By cookie or URL |
| Visibility | IP, port, TCP flags | Full HTTP request/response |
Load Balancing Algorithms
Round Robin
Requests are distributed to backends in turn: 1, 2, 3, 1, 2, 3, ...
Simple and fair when all requests have similar cost and all backends are equal. Breaks down when requests have very different processing times or backends have different capacity.
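As a rough sketch, round robin is just a rotating pointer over the backend list. The class name and backend names below are made up for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(list(backends))

    def pick(self):
        # Each call advances the rotation by one position.
        return next(self._rotation)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [balancer.pick() for _ in range(6)]
print(picks)
```

Note that the picker knows nothing about how long each request takes, which is exactly why it struggles with uneven workloads.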
Weighted Round Robin
Each backend gets a weight. A backend with weight 3 receives 3x as many requests as one with weight 1.
Use this when backends have different CPU/RAM (e.g., during a rolling upgrade with mixed instance sizes).
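One simple way to get this behaviour is to repeat each backend in the rotation as many times as its weight. The sketch below assumes hypothetical backends named `large` and `small`; real balancers typically interleave the picks more evenly instead of sending short bursts to the heavier backend.

```python
from collections import Counter
from itertools import cycle


def weighted_rotation(weights):
    """Return an endless iterator in which each backend appears `weight` times per cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    if not expanded:
        raise ValueError("at least one backend with a positive weight is required")
    return cycle(expanded)


rotation = weighted_rotation({"large": 3, "small": 1})
counts = Counter(next(rotation) for _ in range(400))
print(counts)
```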
Least Connections
New requests go to the backend with the fewest active connections.
Best when requests have highly variable processing time (e.g., a mix of fast API calls and slow file uploads).
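A minimal sketch of the idea, assuming the balancer is told when each connection starts and ends (the `acquire`/`release` names are illustrative):

```python
class LeastConnectionsBalancer:
    """Sends each new connection to the backend with the fewest active ones."""

    def __init__(self, backends):
        self.active = {name: 0 for name in backends}

    def acquire(self):
        # min() returns the first backend with the lowest count,
        # so ties are broken by the order the backends were listed in.
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        # Called when the connection to `name` closes.
        self.active[name] -= 1


lb = LeastConnectionsBalancer(["app-1", "app-2"])
first = lb.acquire()    # both idle, tie goes to app-1
second = lb.acquire()   # app-1 is busy, so app-2
lb.release(second)      # the request on app-2 finishes quickly
third = lb.acquire()    # app-1 is still busy with a slow request, so app-2 again
print(first, second, third)
```

The slow request keeps `app-1` out of the running until it completes, which is the behaviour round robin lacks.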
IP Hash (Session Affinity / Sticky Sessions)
The client's IP address is hashed to always select the same backend.
Ensures the same client always reaches the same server. Useful for applications with server-side session state (though you should prefer stateless designs or shared session stores).
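The mapping can be sketched as a hash of the client IP reduced modulo the number of backends. The choice of SHA-256 and the backend names are assumptions for the example; any stable hash works.

```python
import hashlib


def pick_backend(client_ip, backends):
    """Map a client IP to a backend; the same IP and pool always give the same answer."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    # Use the first 8 bytes as an integer and reduce it to a list index.
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


pool = ["app-1", "app-2", "app-3"]
for ip in ("198.51.100.10", "198.51.100.11", "198.51.100.10"):
    print(ip, "->", pick_backend(ip, pool))
```

The first and third lines of output name the same backend because the IP is the same; no per-client state is stored anywhere.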
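Problem: IP hash is usually implemented as hash(ip) mod n, so changing the number of backends changes the result for most clients, not only those on the affected backend. Going from n to n+1 backends remaps roughly n/(n+1) of all clients (about 75% when growing from 3 to 4), and removing one is similarly disruptive.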
Consistent Hashing
Used in distributed systems (Cassandra, Redis Cluster, CDN edge nodes).
Backends are placed on a virtual ring. A request's hash is mapped to the ring and routed to the nearest backend clockwise.
When a backend is added or removed, only the requests that were routed to it need to be remapped. On average that is 1/n of total requests — much less disruption than modulo hashing.
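The difference is easy to measure. The sketch below builds a small hash ring and compares it with plain modulo hashing when a fourth backend is added. Placing each backend at 100 virtual points on the ring is an assumption of this example (a common way to even out the arcs), and the key and backend names are made up.

```python
import bisect
import hashlib


def _hash(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend is placed at `vnodes` points on the ring to even out the arcs.
        self._points = sorted(
            (_hash(f"{name}#{i}"), name) for name in backends for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def pick(self, key):
        # First ring point clockwise from the key's hash, wrapping past the end.
        index = bisect.bisect_right(self._keys, _hash(key)) % len(self._keys)
        return self._points[index][1]


def modulo_pick(key, backends):
    return backends[_hash(key) % len(backends)]


keys = [f"client-{i}" for i in range(10_000)]
three = ["app-1", "app-2", "app-3"]
four = three + ["app-4"]

ring_before, ring_after = HashRing(three), HashRing(four)
ring_moved = sum(ring_before.pick(k) != ring_after.pick(k) for k in keys) / len(keys)
modulo_moved = sum(modulo_pick(k, three) != modulo_pick(k, four) for k in keys) / len(keys)
print(f"ring: {ring_moved:.0%} of keys moved, modulo: {modulo_moved:.0%} of keys moved")
```

On the ring, every key that moves lands on the new backend; keys belonging to the other three are left alone. With modulo hashing, keys shuffle between all backends.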
Health Checks
A load balancer continuously checks that backends are alive and able to serve traffic.
Active Health Checks
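The load balancer proactively sends probe requests to each backend on a fixed interval (for example, an HTTP GET to a health endpoint) and takes a backend out of rotation when the probes fail.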
The health endpoint should check the application's dependencies (database connection, cache connection) rather than just returning 200 blindly.
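A minimal sketch of the probing logic follows. The thresholds (`fall=3` failed probes to mark a backend down, `rise=2` successful probes to bring it back), the `/health` path, and the address are assumptions for illustration; real load balancers expose similar knobs under their own names. The probe is passed in as a function so the state machine can be exercised without a network.

```python
import http.client
import urllib.request


def http_probe(url, timeout=2.0):
    """One possible probe: True if GET `url` answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx/3xx replies (HTTPError).
        return False


class ActiveHealthCheck:
    def __init__(self, probe, fall=3, rise=2):
        self.probe = probe
        self.fall = fall      # consecutive failures before marking unhealthy
        self.rise = rise      # consecutive successes before marking healthy again
        self.healthy = True
        self._streak = 0      # consecutive results that disagree with the current state

    def run_once(self, url):
        ok = self.probe(url)
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= (self.rise if ok else self.fall):
                self.healthy = ok
                self._streak = 0
        return self.healthy


# Simulated probe results: three failures, then two successes.
results = iter([False, False, False, True, True])
check = ActiveHealthCheck(probe=lambda url: next(results))
states = [check.run_once("http://10.0.0.5:8080/health") for _ in range(5)]
print(states)
```

In a real deployment `run_once` would be called on a timer with `http_probe` as the probe. Requiring several consecutive results before flipping state keeps one dropped packet from ejecting a healthy backend.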
Passive Health Checks
The load balancer observes real responses as they flow through:
- 3 consecutive 5xx responses → mark backend as unhealthy.
- Response time > 30 seconds → mark as slow/unhealthy.
- TCP connection refused → immediately mark as down.
Passive checks detect failures faster for real traffic but cannot detect a failed backend before a real user's request is affected.
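The three rules above can be sketched as a small tracker that is fed the outcome of every proxied request. The class and parameter names are hypothetical, and the sketch deliberately leaves out recovery (how a backend is marked healthy again), which the rules above do not specify.

```python
class PassiveHealthCheck:
    def __init__(self, max_consecutive_5xx=3, slow_after_seconds=30.0):
        self.max_consecutive_5xx = max_consecutive_5xx
        self.slow_after_seconds = slow_after_seconds
        self.healthy = True
        self._consecutive_5xx = 0

    def observe(self, status=None, elapsed=0.0, connect_refused=False):
        """Record one real request's outcome and return the backend's health."""
        if connect_refused:
            # Nothing is listening: mark the backend down immediately.
            self.healthy = False
        elif elapsed > self.slow_after_seconds:
            self.healthy = False
        elif status is not None and 500 <= status <= 599:
            self._consecutive_5xx += 1
            if self._consecutive_5xx >= self.max_consecutive_5xx:
                self.healthy = False
        else:
            # Any non-5xx response breaks the streak.
            self._consecutive_5xx = 0
        return self.healthy


check = PassiveHealthCheck()
outcomes = [check.observe(status=s, elapsed=0.1) for s in (500, 200, 502, 503, 500)]
print(outcomes)
```

The single 500 at the start does not count toward the limit because the 200 that follows resets the streak.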
Session Persistence (Sticky Sessions)
Some legacy applications store session data locally in memory (not in Redis or a database). If a user's second request goes to a different backend, their session is lost.
Sticky sessions bind a user to one backend for the duration of their session.
Cookie-based stickiness (L7):
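A minimal sketch of the mechanism, assuming a hypothetical cookie name `lb_backend` that carries the backend's name in plain text. Production load balancers usually store an opaque or encrypted identifier instead, so that clients cannot choose their own backend.

```python
from http.cookies import SimpleCookie
from itertools import cycle

COOKIE_NAME = "lb_backend"  # hypothetical cookie name


class StickyRouter:
    def __init__(self, backends):
        self.backends = set(backends)
        self._rotation = cycle(sorted(self.backends))

    def route(self, cookie_header):
        """Return (backend, Set-Cookie value or None) for one request."""
        cookie = SimpleCookie(cookie_header or "")
        pinned = cookie[COOKIE_NAME].value if COOKIE_NAME in cookie else None
        if pinned in self.backends:
            # Returning client whose backend is still in the pool.
            return pinned, None
        # New client, or the pinned backend is gone: pick one and pin it.
        backend = next(self._rotation)
        return backend, f"{COOKIE_NAME}={backend}; Path=/; HttpOnly"


router = StickyRouter(["app-1", "app-2"])
print(router.route(None))                             # first visit: cookie is issued
print(router.route("lb_backend=app-1; theme=dark"))   # later visit: same backend
```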
Trade-offs:
| | Sticky Sessions | Shared Session Store |
|---|---|---|
| Implementation | LB config only | Requires Redis/DB change |
| Scaling | Uneven load distribution possible | Even distribution |
| Fault tolerance | User loses session if their backend fails | Backend failure is transparent |
| Complexity | Low | Higher (session store is another dependency) |
Prefer stateless backends with a shared session store (Redis) over sticky sessions for any new application.
SSL Termination
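Encrypting and decrypting TLS is CPU-intensive. Load balancers can handle it centrally: clients connect to the load balancer over HTTPS, and the load balancer forwards plain HTTP to the backends.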
Benefits:
- Backends don't need TLS certificates or SSL libraries.
- Certificate rotation happens in one place.
- The load balancer can inspect HTTP headers for routing.
The connection between LB and backends is HTTP, but it's usually on a private network. If your compliance requirements demand encryption everywhere, use TLS passthrough (L4) or TLS re-encryption (L7 terminates client TLS, then establishes a new TLS connection to the backend).
CDN — Content Delivery Network
A CDN is a globally distributed network of edge servers (PoPs — Points of Presence) that cache content close to users.
Cache Hit vs Miss
- Cache hit: the edge has the content cached locally. Responds immediately. No origin server involved.
- Cache miss: the edge doesn't have the content. Fetches from the origin, caches it with the specified TTL, then serves it.
Cache hit ratio is the primary CDN performance metric. A 95% hit ratio means 95% of requests never reach your origin server.
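The hit/miss flow and the hit-ratio metric can be sketched with a small in-memory cache. This is a toy model of one edge node, not how any CDN is implemented; `EdgeCache`, the injected `clock`, and the fake origin are all assumptions made so the example runs without a network.

```python
import time


class EdgeCache:
    def __init__(self, fetch_from_origin, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # path -> (expires_at, body)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            self.hits += 1          # cache hit: origin is not contacted
            return entry[1]
        self.misses += 1            # cache miss or expired entry: go to origin
        body = self._fetch(path)
        self._store[path] = (now + self._ttl, body)
        return body

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


now = [0.0]          # fake clock so expiry can be demonstrated instantly
origin_calls = []


def origin(path):
    origin_calls.append(path)
    return f"body of {path}"


cache = EdgeCache(origin, ttl_seconds=60, clock=lambda: now[0])
for _ in range(4):
    cache.get("/app.js")     # 1 miss, then 3 hits
now[0] = 61.0                # TTL has passed
cache.get("/app.js")         # expired, so the edge re-fetches
print(cache.hits, cache.misses, cache.hit_ratio, len(origin_calls))
```

Five requests reached the edge but only two reached the origin, giving a hit ratio of 0.6.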
Pull vs Push CDNs
Pull CDN (most common):
- Edge fetches content from origin on the first request (cache miss).
- Subsequent requests are served from cache.
- Content expires after TTL and is re-fetched on next request.
- Examples: Cloudflare, CloudFront with default config.
Push CDN:
- You explicitly upload content to CDN edge nodes.
- Good for large files that you want pre-cached everywhere (game installers, software releases, video).
- You control exactly what is on the edge — no cold-start latency.
- Example: Akamai NetStorage, some CloudFront configurations.
What CDNs Cache
| Content Type | TTL Strategy | Notes |
|---|---|---|
| Static assets (JS, CSS, images) | Long (days/weeks) | Use cache-busting filenames (app.abc123.js) |
| HTML pages | Short (minutes) or no-cache | Dynamic content changes frequently |
| Video (HLS/DASH segments) | Medium (hours) | Each segment is a small fixed file |
| API responses | Short (seconds/minutes) | Only for cacheable, non-user-specific responses |
| Private/personalised content | Never cache on shared CDN | Use Cache-Control: private |
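
One way an origin could express the table above is a lookup from content class to Cache-Control header. The class names and the exact max-age values here are illustrative assumptions, not recommendations; only the directives themselves (public, private, no-cache, max-age, immutable) are standard.

```python
# Illustrative values only; tune max-age to how often each kind of content changes.
CACHE_POLICIES = {
    "static_asset": "public, max-age=604800, immutable",  # one week; filename changes on deploy
    "html": "no-cache",                                   # may be stored, but revalidated each time
    "video_segment": "public, max-age=3600",              # one hour
    "public_api": "public, max-age=30",                   # seconds-level freshness
    "personalised": "private",                            # shared caches must not store it
}


def cache_control_for(content_class):
    # Unknown classes fall back to a conservative choice for a shared cache.
    return CACHE_POLICIES.get(content_class, "private")


for content_class in ("static_asset", "html", "personalised", "something_else"):
    print(content_class, "->", cache_control_for(content_class))
```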
Video Streaming: HLS and DASH
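Large video files are not served as a single blob. They are split into small segments (2–10 seconds each) and served adaptively. A manifest file (.m3u8 for HLS, .mpd for DASH) lists the available quality levels and where to find their segments.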
The player requests the manifest, picks an appropriate quality level based on available bandwidth, and fetches segments. If the network slows down, it switches to a lower-quality stream mid-playback.
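The quality-switching decision can be sketched as picking the highest bitrate that fits within the measured bandwidth. The bitrate ladder and the 0.8 safety factor below are hypothetical; real players use more elaborate heuristics that also consider buffer level.

```python
# Hypothetical bitrate ladder, as a manifest might advertise it: (label, kbit/s).
RENDITIONS = [
    ("360p", 800),
    ("720p", 2800),
    ("1080p", 5000),
]


def pick_rendition(measured_kbps, safety_factor=0.8):
    """Choose the best rendition that fits the bandwidth budget, else the lowest one."""
    budget = measured_kbps * safety_factor
    affordable = [r for r in RENDITIONS if r[1] <= budget]
    if affordable:
        return max(affordable, key=lambda r: r[1])
    return min(RENDITIONS, key=lambda r: r[1])


# Bandwidth drops during playback; the choice is re-evaluated before each segment.
choices = [pick_rendition(kbps)[0] for kbps in (8000, 4000, 1000)]
print(choices)
```

Because segments are short, the player can change rendition every few seconds without interrupting playback.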
Anycast
Anycast is a routing technique where the same IP address is announced from multiple physical locations. BGP routes packets to the topologically nearest (lowest cost) location, which is usually, but not always, the geographically closest one.
This is completely transparent to the user: they just see a single address such as 1.1.1.1 (Cloudflare's public DNS resolver).
Why Anycast Matters for DDoS Mitigation
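Distributed Denial of Service attacks generate massive traffic volumes aimed at one IP. With anycast, that IP is announced from every location, so each attacking machine is routed to the PoP nearest to it rather than to a single data centre.
The attack is absorbed and distributed across the global network. Cloudflare, for example, advertises more than 100 Tbps of network capacity, which makes most volumetric DDoS attacks impractical to sustain.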
Popular Tools and Services
Software Load Balancers
NGINX — HTTP server, reverse proxy, and L7 load balancer.
HAProxy — dedicated TCP and HTTP load balancer, known for high performance and detailed statistics.
Managed Cloud Load Balancers
| Service | Layer | Notes |
|---|---|---|
| AWS ALB (Application LB) | L7 | HTTP/HTTPS/gRPC, host/path routing, WAF integration |
| AWS NLB (Network LB) | L4 | TCP/UDP, extreme throughput, static IP, TLS passthrough |
| AWS CLB (Classic LB) | L4/L7 | Legacy, prefer ALB/NLB |
| GCP Cloud Load Balancing | L4 + L7 | Global anycast-based |
| Azure Load Balancer | L4 | Basic TCP/UDP |
| Azure Application Gateway | L7 | HTTP/HTTPS + WAF |
CDN Providers
| Provider | Strength |
|---|---|
| Cloudflare | Large global anycast network, DDoS mitigation, DNS, Zero Trust |
| AWS CloudFront | Deep AWS integration, Lambda@Edge for compute at edge |
| Akamai | One of the oldest CDNs, strong for enterprise and media streaming |
| Fastly | Varnish-based, instant cache purge, edge compute |
| Bunny.net | Cost-effective, good for video and static assets |
Putting It Together
A typical production architecture for a web application:
Static assets (JS bundles, images, fonts) are served entirely from the CDN. Dynamic API requests pass through ALB to compute instances. The database and cache are never exposed to the load balancer directly.
This architecture handles traffic spikes (auto-scaling), single-server failures (LB health checks), and global users (CDN edge caching) — the three core scaling problems.