I/O Systems — Device Drivers, Buffering & DMA
Reading a file seems instantaneous from Python. Behind that single call is an intricate stack of hardware controllers, interrupt handlers, kernel buffers, and device drivers working together. Understanding this stack explains why some I/O is fast, some is slow, and what you can do about it.
I/O Hardware Overview
A computer's I/O hardware sits on buses — shared electrical pathways that connect the CPU, RAM, and peripheral devices.
Each device has a controller — a small chip that understands the device's native protocol and exposes a standardised interface (registers, buffers) to the CPU. The CPU talks to the controller; the controller talks to the device.
There are two ways for the CPU to address controller registers:
- Port-mapped I/O: separate address space, accessed with special
in/outCPU instructions (legacy x86) - Memory-mapped I/O (MMIO): controller registers appear at fixed physical addresses; the CPU uses normal load/store instructions
Polling vs Interrupt-Driven I/O
How does the CPU know when a device has finished its work?
Polling (Busy Wait)
The CPU repeatedly checks a status register on the controller until the device signals "done."
- Simple to implement
- Very low latency (no interrupt overhead) for fast devices
- Wasteful for slow devices — the CPU cannot do other work while spinning
- Used in specific cases: network drivers in high-speed NICs (DPDK), short spinlocks
Interrupt-Driven I/O
The CPU issues an I/O request, then goes off and does something else. When the device finishes, it asserts an interrupt line. The CPU stops what it's doing, saves its state, runs the Interrupt Service Routine (ISR), and resumes.
The interrupt controller (APIC on x86) arbitrates multiple interrupt sources and delivers them to the CPU with a priority level and a vector number. The vector indexes into the Interrupt Descriptor Table (IDT), which maps vector numbers to ISR addresses.
Interrupt-driven I/O is the default for most devices. The overhead per interrupt is real (hundreds of nanoseconds to save state, run the ISR, return), so high-speed NICs receiving millions of packets per second may switch to polling to avoid ISR overhead — a technique called NAPI in the Linux kernel.
DMA — Direct Memory Access
Even with interrupts, there is still a problem: if the CPU has to copy each byte from the device controller's buffer to RAM, it wastes cycles on data movement.
DMA lets the device controller transfer data directly to/from RAM without involving the CPU for each byte.
Modern DMA controllers are sophisticated: they support scatter-gather (writing to non-contiguous memory regions), chaining multiple transfers, and handling both read and write directions.
I/O Software Layers
The OS organises I/O software in horizontal layers, each talking only to the layer above and below it:
Device drivers are the key abstraction. They present a uniform interface (read, write, ioctl, mmap) to the kernel I/O subsystem while handling all device-specific details. A driver for an NVMe SSD and a driver for a USB stick both expose a block device interface; the filesystem layer above doesn't care which hardware it's talking to.
Buffering
Transferring data in large chunks is much more efficient than one byte at a time. The OS uses several buffering strategies.
Why Buffer?
- Device speeds don't match application consumption rates (a disk delivers 200 MB/s but the app reads in 100-byte chunks)
- Amortises system call overhead — one
read(64KB)beats 65536read(1B)calls - Allows the OS to reorder writes (write-back cache)
Single Buffer
The kernel gives the process a buffer to fill. When the buffer is full (or explicitly flushed), the kernel writes to the device. The process can keep writing into the next buffer while the previous one drains.
Double Buffer
Two buffers alternate. While the kernel writes buffer A to the device, the application fills buffer B. Then they swap. This hides device latency entirely when throughput is balanced.
Circular Buffer (Ring Buffer)
A fixed-size buffer treated as a circle with head (read) and tail (write) pointers. Common in network drivers and audio I/O.
tail == head: buffer empty(tail + 1) % size == head: buffer full- Lock-free variants (single producer, single consumer) avoid mutex overhead
The Page Cache
Linux does not read from disk on every read() call. It maintains a page cache (also called the buffer cache in older literature) — a region of RAM that holds recently accessed file data.
The page cache is why re-reading a large file is much faster than the first read. free -m on Linux shows cache usage in the "buff/cache" column — that memory is in use but is immediately reclaimed if a process needs RAM.
You can observe the cache effect:
Blocking vs Non-Blocking vs Asynchronous I/O
Three distinct I/O models:
Blocking I/O (default)
The calling thread sleeps until the I/O completes.
Simple, correct, but one blocked I/O blocks one thread. For a server handling 10,000 connections, you'd need 10,000 threads.
Non-Blocking I/O
The call returns immediately with EAGAIN if no data is available. The caller polls repeatedly or uses a readiness notification mechanism (select, poll, epoll).
select monitors multiple file descriptors and returns which ones are ready. epoll (Linux) scales to millions of fds; select is limited to 1024.
Using selectors (higher-level)
Asynchronous I/O (AIO)
The application submits an I/O request and a completion callback. The kernel runs the I/O and calls the callback when done — the application never blocks. Python's asyncio module builds on this model (using the event loop to multiplex many async operations on a single thread).
| Model | Thread blocks? | Notification | Python |
|---|---|---|---|
| Blocking | Yes | Return from call | default open().read() |
| Non-blocking | No | Poll / readiness event | socket.setblocking(False) + select |
| Async | No | Completion callback | asyncio, aiofiles |
Disk Scheduling Algorithms
A hard disk drive's read/write head takes time to move between tracks (seek time) and to wait for the right sector to rotate under it (rotational latency). If 20 processes all want disk reads, the order matters enormously.
SSDs have no mechanical head, so disk scheduling is less critical for them — but many systems still run HDDs in NAS/backup scenarios.
FCFS — First Come, First Served
Service requests in arrival order.
Simple but causes wild head thrashing if requests are scattered.
SSTF — Shortest Seek Time First
Always service the request closest to the current head position.
Better throughput, but can starve requests far from the current head position.
SCAN (Elevator Algorithm)
The head sweeps from one end to the other, servicing requests in its current direction. At the end it reverses.
Good throughput; requests near the ends get longer waits.
C-SCAN (Circular SCAN)
Like SCAN, but on the return sweep the head jumps back to the beginning without servicing requests. Provides more uniform wait times.
C-SCAN is the most commonly used algorithm for HDDs because it provides bounded worst-case wait times.
Algorithm Comparison
| Algorithm | Seek distance | Starvation risk | Uniformity |
|---|---|---|---|
| FCFS | High | None | Poor |
| SSTF | Low–medium | Yes | Poor |
| SCAN | Low | None | Medium |
| C-SCAN | Low | None | Good |
Key Takeaways
- The CPU talks to device controllers via MMIO; controllers handle the actual device protocol
- Polling is simple but wastes CPU; interrupt-driven I/O is the standard for most devices
- DMA lets devices transfer data directly to RAM, freeing the CPU for other work
- Device drivers abstract hardware; the kernel I/O subsystem sits above them
- The page cache serves most file reads from RAM — disk access is the exception, not the rule
- Blocking I/O is simple; non-blocking +
epollis necessary for high-concurrency servers - For HDDs, C-SCAN or SCAN gives the best balance of throughput and fairness