Virtualisation & Containers — VMs, Hypervisors & Docker
A single physical server running at 5% CPU utilisation is waste. Run ten virtual machines on it and that waste becomes useful work. Virtualisation is the technology that makes this possible, and it underpins the major public clouds, many CI pipelines, and much of the infrastructure that containers run on today.
What Virtualisation Is
Virtualisation creates an abstraction layer between software and hardware. Instead of running directly on physical resources, an operating system runs inside a virtual machine (VM) — a software emulation of a complete computer. The VM believes it has its own CPU, RAM, disk, and network card. The physical hardware may be shared among dozens of such VMs.
Why it matters:
- Server consolidation — run many workloads on one machine instead of many lightly loaded machines
- Isolation — a compromised VM cannot normally affect other VMs on the same host; breaking out requires a flaw in the hypervisor itself, which is rare but not impossible
- Portability — capture an entire OS + app as an image; move it anywhere
- Cloud computing — AWS EC2, Google Compute Engine, Azure VMs are all VMs on bare-metal servers
- Testing — spin up a fresh OS in seconds, destroy it when done, leave no trace
The Hypervisor
The software that creates and manages VMs is the hypervisor (also called a Virtual Machine Monitor, VMM). It sits between the hardware and the guest operating systems, multiplexing physical resources among VMs.
Type 1 — Bare-Metal Hypervisors
Run directly on the physical hardware, replacing the host OS entirely.
Examples:
- VMware ESXi — dominant in enterprise data centres
- Microsoft Hyper-V — ships with Windows Server
- KVM (Kernel-based Virtual Machine) — built into the Linux kernel; usually classed as Type 1 because the Linux kernel itself acts as the hypervisor, although it blurs the line since a full general-purpose OS is also present
Type 1 hypervisors have lower overhead (no host OS in the path) and are used in production cloud environments.
Type 2 — Hosted Hypervisors
Run as an application on top of a conventional host OS.
Examples:
- VirtualBox — open source, cross-platform
- VMware Workstation / Fusion
- QEMU (without KVM acceleration)
Type 2 is easier to install and use but has more overhead. Used for development, testing, and running a different OS on your laptop.
How a Hypervisor Virtualises CPU, Memory, and I/O
CPU Virtualisation
The challenge: guest OS code runs in ring 0 (kernel mode) inside the VM, but ring 0 inside the VM is not really ring 0 on the physical CPU — that's reserved for the hypervisor.
Full virtualisation (binary translation): The hypervisor scans guest kernel instructions and rewrites privileged ones on the fly so that they trap into the hypervisor. This was VMware's original approach. It works with unmodified guest OSes, but the translation adds noticeable overhead.
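Hardware-assisted virtualisation (VT-x / AMD-V): Intel and AMD added a separate CPU mode for the hypervisor (VMX root operation in Intel's terminology), informally called "ring -1" although it is a new mode rather than a new ring. The hypervisor runs there. The guest OS runs in its own ring 0 inside a hardware-isolated guest context (VMX non-root operation). Sensitive instructions and events that the hypervisor has chosen to intercept cause an automatic exit to the hypervisor, with no software translation. This is the dominant approach today and gives near-native CPU performance.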
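Whether a Linux host exposes these extensions can be checked from the CPU flags. The following is a minimal sketch, assuming an x86 Linux host where `/proc/cpuinfo` has a `flags` line; the function names are illustrative, not part of any library.

```python
def virtualisation_extensions(cpuinfo_text: str) -> set[str]:
    """Return the hardware virtualisation extensions named in cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:   # Intel VT-x
                found.add("VT-x")
            if "svm" in flags:   # AMD-V
                found.add("AMD-V")
    return found


def host_extensions(path: str = "/proc/cpuinfo") -> set[str]:
    """Read the given file (x86 Linux assumed) and report the extensions found."""
    with open(path, encoding="utf-8") as f:
        return virtualisation_extensions(f.read())
```

An empty result can also mean the extension is disabled in firmware or that the code is running inside a VM without nested virtualisation.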
Memory Virtualisation
Guest OS uses virtual addresses → guest physical addresses → host physical addresses. Two levels of translation.
Shadow page tables: Hypervisor maintains page tables that map guest virtual addresses directly to host physical addresses. Must be kept in sync with guest page tables — expensive.
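Hardware-assisted (EPT / NPT): Intel Extended Page Tables / AMD Nested Page Tables handle both levels in hardware. The MMU walks the guest page tables and the nested (host) page tables itself, so the hypervisor no longer has to intercept guest page-table updates. This substantially reduces memory virtualisation overhead; a drop from roughly 30% to roughly 5% is a rough indication only, since the actual gain depends heavily on the workload.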
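The two translation levels, and the way a shadow page table collapses them into one, can be sketched with plain dictionaries. This is a toy model under simplifying assumptions: single-level tables, 4 KiB pages, and made-up page and frame numbers.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def translate(page_table: dict[int, int], address: int) -> int:
    """Translate an address through one table mapping page number -> frame number."""
    page, offset = divmod(address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


def nested_translate(guest_pt: dict[int, int], host_pt: dict[int, int], guest_virtual: int) -> int:
    """Two-level walk: guest virtual -> guest physical -> host physical."""
    guest_physical = translate(guest_pt, guest_virtual)
    return translate(host_pt, guest_physical)


def build_shadow_table(guest_pt: dict[int, int], host_pt: dict[int, int]) -> dict[int, int]:
    """Compose both tables so a guest virtual page maps straight to a host frame."""
    return {gv_page: host_pt[gp_frame] for gv_page, gp_frame in guest_pt.items()}


guest_pt = {0: 7, 1: 3}    # guest virtual page -> guest physical frame
host_pt = {7: 42, 3: 19}   # guest physical frame -> host physical frame
address = 1 * PAGE_SIZE + 0x10

shadow = build_shadow_table(guest_pt, host_pt)
# Both paths reach the same host physical address.
assert translate(shadow, address) == nested_translate(guest_pt, host_pt, address)
```

The cost of shadow paging shows up in `build_shadow_table`: it has to be redone whenever the guest changes `guest_pt`, whereas the nested walk always reads the current tables.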
I/O Virtualisation
I/O is the hardest to virtualise. Options:
- Full emulation: Hypervisor emulates a standard device (e.g., an Intel e1000 NIC) in software. Guest uses unmodified drivers. Very slow for disk/network.
- Para-virtualised drivers (virtio): The guest OS uses hypervisor-aware drivers that exchange requests with the host through shared-memory queues instead of driving an emulated physical device. Much faster. VirtIO is the standard para-virtualised device interface on Linux/KVM.
- SR-IOV (Single Root I/O Virtualisation): Physical device (NIC, GPU) exposes multiple virtual functions directly to VMs. Near-native performance.
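The shared-queue idea behind para-virtualised drivers can be illustrated with a toy model. This is only a sketch: real virtio uses descriptor tables and available/used rings in guest memory plus notifications, none of which are modelled here, and the class and method names are hypothetical.

```python
from collections import deque


class ToyVirtqueue:
    """Toy queue: the guest driver posts buffers, the host device returns completions."""

    def __init__(self, size: int):
        self.size = size
        self.available = deque()  # buffers posted by the guest, not yet processed
        self.used = deque()       # results written back by the host

    def guest_submit(self, buffer: bytes) -> bool:
        """Post a request; refuse it if the queue is full."""
        if len(self.available) + len(self.used) >= self.size:
            return False
        self.available.append(buffer)
        return True

    def host_process(self, handler) -> int:
        """Host side: handle every pending request in one batch."""
        count = 0
        while self.available:
            self.used.append(handler(self.available.popleft()))
            count += 1
        return count

    def guest_collect(self) -> list:
        """Guest side: pick up all completed results."""
        results = list(self.used)
        self.used.clear()
        return results
```

The point of the design is batching: many requests cross the guest/host boundary per notification, instead of one trap per emulated device register access.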
Full Virtualisation vs Para-Virtualisation
| Aspect | Full Virtualisation | Para-Virtualisation |
|---|---|---|
| Guest OS modification | None (runs unmodified) | Required (virtio drivers) |
| Performance | Near-native with HW assist | Faster for I/O-heavy workloads |
| Compatibility | Any OS | Only OSes with virtio support |
| Examples | VMware, Hyper-V (for Windows) | KVM with virtio, Xen PV |
In practice: CPU/memory are hardware-virtualised (near-native), while I/O uses para-virtual drivers (virtio) for performance.
Containers vs Virtual Machines
Containers take a fundamentally different approach. Instead of virtualising hardware and running separate OS instances, containers share the host OS kernel and isolate only the user-space environment.
| Property | VM | Container |
|---|---|---|
| Kernel | Separate per VM | Shared (host kernel) |
| Isolation | Strong (hardware) | Process-level (namespaces) |
| Overhead | High (GBs of RAM per VM) | Low (MBs) |
| Start time | Tens of seconds | Typically well under a second |
| Security boundary | Very strong | Weaker (kernel exploit → all containers) |
| Use case | Different OSes, strong isolation | Microservices, CI/CD, scale-out |
Linux Namespaces — The Foundation of Containers
Containers are not a new kernel feature — they are built from composing existing Linux primitives. Namespaces isolate specific aspects of the system view for a set of processes.
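Linux defines eight namespace types. The seven listed here are the ones container runtimes commonly use; the eighth, the time namespace, was added in Linux 5.6: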
| Namespace | Isolates | Each container sees... |
|---|---|---|
| PID | Process IDs | Its own PID 1; cannot see host PIDs |
| NET | Network interfaces, routing, sockets | Its own eth0, IP address, routing table |
| MNT | Mount points, filesystem tree | Its own / (different from host's /) |
| IPC | System V IPC, POSIX message queues | Separate IPC objects |
| UTS | Hostname and NIS domain name | Its own hostname |
| USER | User and group IDs | Can be root (UID 0) inside but unprivileged outside |
| CGROUP | cgroup root directory | Its own cgroup hierarchy view |
When Docker creates a container, it calls clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...) to create a new process in fresh namespaces. The container's PID 1 is just a regular process on the host — but inside the container it looks like the init process of a standalone machine.
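The flags passed to clone() are plain bit masks that are OR-ed together. The sketch below uses the constant values from the Linux `sched.h` header; the dictionary keys follow the table's short names, and the helper functions are illustrative rather than part of any runtime.

```python
# Flag values as defined in the Linux kernel headers (linux/sched.h).
CLONE_FLAGS = {
    "MNT": 0x00020000,     # CLONE_NEWNS
    "CGROUP": 0x02000000,  # CLONE_NEWCGROUP
    "UTS": 0x04000000,     # CLONE_NEWUTS
    "IPC": 0x08000000,     # CLONE_NEWIPC
    "USER": 0x10000000,    # CLONE_NEWUSER
    "PID": 0x20000000,     # CLONE_NEWPID
    "NET": 0x40000000,     # CLONE_NEWNET
}


def clone_flags(namespaces: list[str]) -> int:
    """OR together the flags for the requested namespace types."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


def namespaces_in(flags: int) -> list[str]:
    """Decode a flag word back into the namespace names it requests."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if flags & bit)


print(hex(clone_flags(["PID", "NET", "MNT"])))
```

Which namespaces a given runtime actually requests depends on its configuration; this only shows how the request is encoded.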
cgroups — Resource Limits Per Container
Namespaces provide isolation (what you can see). Control Groups (cgroups) provide resource control (how much you can use).
cgroups allow the kernel to:
- Limit CPU usage (e.g., max 2 cores out of 32)
- Limit memory usage (e.g., max 512 MB; OOM-kill the container if exceeded)
- Limit disk I/O bandwidth (e.g., max 100 MB/s writes)
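- Limit network bandwidth (indirectly: cgroups can classify a group's packets, while the actual rate limiting is done by the kernel's traffic control, tc)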
- Account for resource usage per container
Docker translates --cpus 0.5 --memory 256m into exactly these cgroup writes.
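As a rough illustration of that translation, the sketch below computes the values that would land in the cgroup v2 files `cpu.max` and `memory.max`. It assumes cgroup v2, the default 100 ms CPU period, and binary units for the memory suffix; the helper names are hypothetical, and Docker itself goes through its runtime rather than code like this.

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Format a CPU limit as '<quota> <period>' in microseconds (cgroup v2 cpu.max)."""
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def memory_max(limit: str) -> int:
    """Convert a size such as '256m' into bytes (cgroup v2 memory.max)."""
    limit = limit.strip().lower()
    if limit[-1] in _UNITS:
        return int(limit[:-1]) * _UNITS[limit[-1]]
    return int(limit)


def cgroup_writes(cpus: float, memory: str) -> dict[str, str]:
    """Map the two limits to the cgroup v2 files they would be written to."""
    return {"cpu.max": cpu_max(cpus), "memory.max": str(memory_max(memory))}


print(cgroup_writes(0.5, "256m"))
```

On a cgroup v1 host the same limits go to `cpu.cfs_quota_us`, `cpu.cfs_period_us`, and `memory.limit_in_bytes` instead.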
Docker Concepts
Docker packages applications into containers using a layered filesystem.
Key Concepts
Image — a read-only template. Contains the filesystem layers (base OS, dependencies, app code). Stored in a registry (Docker Hub, ECR, GCR).
Container — a running instance of an image. Each container adds a thin read-write layer on top of the image layers. Containers are ephemeral — stop and remove a container and the write layer is gone.
Layer — each Dockerfile instruction that modifies the filesystem creates a new layer. Layers are content-addressed and shared: if two images use the same Ubuntu base, they share those layers on disk.
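Content addressing is what makes the sharing automatic: a layer's identifier is a hash of its bytes, so identical layers collapse to a single stored copy. A minimal sketch, assuming layers are opaque byte strings (real layers are tar archives, and the `LayerStore` class is hypothetical):

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: each distinct layer is kept exactly once."""

    def __init__(self):
        self.blobs = {}  # digest -> layer bytes

    def add(self, content: bytes) -> str:
        """Store a layer under its SHA-256 digest and return that digest."""
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        return digest


def build_image(store: LayerStore, layers: list[bytes]) -> list[str]:
    """An image here is just the ordered list of its layer digests."""
    return [store.add(layer) for layer in layers]


store = LayerStore()
image_a = build_image(store, [b"ubuntu base", b"app one"])
image_b = build_image(store, [b"ubuntu base", b"app two"])
# Two images, four layer references, but only three blobs on disk.
print(len(store.blobs))
```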
A Minimal Dockerfile
What each instruction in a typical Dockerfile for a Python app does:
- FROM — sets the base image; every image inherits from something (ultimately from scratch)
- WORKDIR — all subsequent commands run relative to this path inside the image
- COPY requirements.txt . — copy only requirements first to exploit layer caching
- RUN pip install — executes a shell command; the result is baked into a new layer
- COPY . . — copy app code; separate from requirements so code changes don't bust the pip cache
- EXPOSE — documents which port the app uses; you still need -p 8080:8080 to publish it
- CMD — default command when a container starts; can be overridden at docker run time
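The Dockerfile these explanations refer to is not reproduced here. The sketch below holds one plausible version as a Python string and lists its instructions in order. The base image tag, the `/app` directory, the pip flags, and the `app.py` entry point are all assumptions made for illustration; only the instruction sequence and port 8080 come from the explanations above.

```python
# Hypothetical Dockerfile consistent with the instruction list; details are assumed.
DOCKERFILE = """\
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "app.py"]
"""


def instructions(dockerfile_text: str) -> list[str]:
    """Return the instruction keyword of each non-blank, non-comment line."""
    result = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            result.append(line.split(maxsplit=1)[0].upper())
    return result


print(instructions(DOCKERFILE))
```

The parser is deliberately naive and does not handle line continuations; it exists only to make the ordering explicit, with the dependency install placed before the application code is copied.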
The Union Filesystem (OverlayFS)
Linux's OverlayFS makes layers efficient. Multiple read-only layers are stacked, with a single read-write layer on top. Reading a file checks the top layer first, then lower layers in order. The first write to a file that lives in a lower layer copies it up into the top layer, and the change is applied to that copy (copy-on-write); the lower layers are never modified. This is a large part of why containers start so quickly: the image layers are already on disk, and only a new empty top layer has to be created.
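The lookup order and the write behaviour can be modelled with `collections.ChainMap`, which searches its mappings in order and sends all writes to the first one. This is a toy analogy, assuming each layer is a dictionary from path to file contents; deletions (whiteouts) and directories are not modelled, and `mount_overlay` is a hypothetical helper.

```python
from collections import ChainMap


def mount_overlay(*image_layers: dict) -> ChainMap:
    """Stack read-only layers (listed top to bottom) under a fresh writable layer."""
    upper = {}  # the container's own read-write layer, initially empty
    return ChainMap(upper, *image_layers)


base = {"/etc/os-release": "ubuntu", "/bin/sh": "dash"}
app = {"/app/main.py": "print('v1')"}

container = mount_overlay(app, base)
container["/etc/os-release"] = "patched"  # the write lands in the top layer only

other = mount_overlay(app, base)          # a second container on the same image
print(container["/etc/os-release"], other["/etc/os-release"])
```

Creating a container costs one empty dictionary here, which mirrors why adding a new top layer over existing image layers is cheap.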
When to Use VMs vs Containers
| Situation | Recommendation |
|---|---|
| Running a different OS (Windows app on Linux host) | VM |
| Strong security isolation required (multi-tenant, untrusted code) | VM |
| Microservice deployments on same OS | Container |
| CI/CD pipelines, ephemeral build environments | Container |
| Legacy app that requires its own kernel version | VM |
| High-density deployments (100s of isolated services) | Container |
| Desktop virtualisation (developer running macOS + Linux) | VM |
| Kubernetes / container orchestration | Container |
In modern infrastructure, VMs and containers are used together: cloud providers offer VMs (EC2 instances), and inside those VMs Kubernetes runs containers. VMs provide the hardware security boundary between customers; containers provide density and fast iteration inside each customer's VM.
Key Takeaways
- Type 1 hypervisors run on bare metal (KVM, ESXi); Type 2 run on a host OS (VirtualBox)
- Hardware extensions (Intel VT-x, AMD-V) make CPU virtualisation near-native
- Para-virtualised I/O drivers (virtio) are faster than emulating physical devices
- Containers share the host kernel; isolation comes from Linux namespaces and cgroups
- Namespaces control visibility (PID, network, filesystem); cgroups control resource usage
- Docker layers (OverlayFS) + content addressing make container startup almost instant
- Use VMs for strong isolation or different OSes; use containers for density and speed