I/O Systems — Device Drivers, Buffering & DMA

Reading a file seems instantaneous from Python. Behind that single call is an intricate stack of hardware controllers, interrupt handlers, kernel buffers, and device drivers working together. Understanding this stack explains why some I/O is fast, some is slow, and what you can do about it.

I/O Hardware Overview

A computer's I/O hardware sits on buses — shared electrical pathways that connect the CPU, RAM, and peripheral devices.

CPUMemory BusRAMPCIe BusNVMe SSDGPUNICUSB ControllerKeyboard, Mouse, USB drivesSATA ControllerHard disk drive

Each device has a controller — a small chip that understands the device's native protocol and exposes a standardised interface (registers, buffers) to the CPU. The CPU talks to the controller; the controller talks to the device.

There are two ways for the CPU to address controller registers:

  • Port-mapped I/O: separate address space, accessed with special in/out CPU instructions (legacy x86)
  • Memory-mapped I/O (MMIO): controller registers appear at fixed physical addresses; the CPU uses normal load/store instructions

Polling vs Interrupt-Driven I/O

How does the CPU know when a device has finished its work?

Polling (Busy Wait)

The CPU repeatedly checks a status register on the controller until the device signals "done."

python
Loading...
text
Loading...
  • Simple to implement
  • Very low latency (no interrupt overhead) for fast devices
  • Wasteful for slow devices — the CPU cannot do other work while spinning
  • Used in specific cases: network drivers in high-speed NICs (DPDK), short spinlocks

Interrupt-Driven I/O

The CPU issues an I/O request, then goes off and does something else. When the device finishes, it asserts an interrupt line. The CPU stops what it's doing, saves its state, runs the Interrupt Service Routine (ISR), and resumes.

text
Loading...

The interrupt controller (APIC on x86) arbitrates multiple interrupt sources and delivers them to the CPU with a priority level and a vector number. The vector indexes into the Interrupt Descriptor Table (IDT), which maps vector numbers to ISR addresses.

Interrupt-driven I/O is the default for most devices. The overhead per interrupt is real (hundreds of nanoseconds to save state, run the ISR, return), so high-speed NICs receiving millions of packets per second may switch to polling to avoid ISR overhead — a technique called NAPI in the Linux kernel.

DMA — Direct Memory Access

Even with interrupts, there is still a problem: if the CPU has to copy each byte from the device controller's buffer to RAM, it wastes cycles on data movement.

DMA lets the device controller transfer data directly to/from RAM without involving the CPU for each byte.

text
Loading...
text
Loading...

Modern DMA controllers are sophisticated: they support scatter-gather (writing to non-contiguous memory regions), chaining multiple transfers, and handling both read and write directions.

I/O Software Layers

The OS organises I/O software in horizontal layers, each talking only to the layer above and below it:

Device drivers are the key abstraction. They present a uniform interface (read, write, ioctl, mmap) to the kernel I/O subsystem while handling all device-specific details. A driver for an NVMe SSD and a driver for a USB stick both expose a block device interface; the filesystem layer above doesn't care which hardware it's talking to.

Buffering

Transferring data in large chunks is much more efficient than one byte at a time. The OS uses several buffering strategies.

Why Buffer?

  • Device speeds don't match application consumption rates (a disk delivers 200 MB/s but the app reads in 100-byte chunks)
  • Amortises system call overhead — one read(64KB) beats 65536 read(1B) calls
  • Allows the OS to reorder writes (write-back cache)

Single Buffer

text
Loading...

The kernel gives the process a buffer to fill. When the buffer is full (or explicitly flushed), the kernel writes to the device. The process can keep writing into the next buffer while the previous one drains.

Double Buffer

Two buffers alternate. While the kernel writes buffer A to the device, the application fills buffer B. Then they swap. This hides device latency entirely when throughput is balanced.

text
Loading...

Circular Buffer (Ring Buffer)

A fixed-size buffer treated as a circle with head (read) and tail (write) pointers. Common in network drivers and audio I/O.

text
Loading...
  • tail == head: buffer empty
  • (tail + 1) % size == head: buffer full
  • Lock-free variants (single producer, single consumer) avoid mutex overhead

The Page Cache

Linux does not read from disk on every read() call. It maintains a page cache (also called the buffer cache in older literature) — a region of RAM that holds recently accessed file data.

text
Loading...

The page cache is why re-reading a large file is much faster than the first read. free -m on Linux shows cache usage in the "buff/cache" column — that memory is in use but is immediately reclaimed if a process needs RAM.

You can observe the cache effect:

bash
Loading...

Blocking vs Non-Blocking vs Asynchronous I/O

Three distinct I/O models:

Blocking I/O (default)

The calling thread sleeps until the I/O completes.

text
Loading...

Simple, correct, but one blocked I/O blocks one thread. For a server handling 10,000 connections, you'd need 10,000 threads.

Non-Blocking I/O

The call returns immediately with EAGAIN if no data is available. The caller polls repeatedly or uses a readiness notification mechanism (select, poll, epoll).

python
Loading...

select monitors multiple file descriptors and returns which ones are ready. epoll (Linux) scales to millions of fds; select is limited to 1024.

Using selectors (higher-level)

python
Loading...

Asynchronous I/O (AIO)

The application submits an I/O request and a completion callback. The kernel runs the I/O and calls the callback when done — the application never blocks. Python's asyncio module builds on this model (using the event loop to multiplex many async operations on a single thread).

ModelThread blocks?NotificationPython
BlockingYesReturn from calldefault open().read()
Non-blockingNoPoll / readiness eventsocket.setblocking(False) + select
AsyncNoCompletion callbackasyncio, aiofiles

Disk Scheduling Algorithms

A hard disk drive's read/write head takes time to move between tracks (seek time) and to wait for the right sector to rotate under it (rotational latency). If 20 processes all want disk reads, the order matters enormously.

SSDs have no mechanical head, so disk scheduling is less critical for them — but many systems still run HDDs in NAS/backup scenarios.

FCFS — First Come, First Served

Service requests in arrival order.

text
Loading...

Simple but causes wild head thrashing if requests are scattered.

SSTF — Shortest Seek Time First

Always service the request closest to the current head position.

text
Loading...

Better throughput, but can starve requests far from the current head position.

SCAN (Elevator Algorithm)

The head sweeps from one end to the other, servicing requests in its current direction. At the end it reverses.

text
Loading...

Good throughput; requests near the ends get longer waits.

C-SCAN (Circular SCAN)

Like SCAN, but on the return sweep the head jumps back to the beginning without servicing requests. Provides more uniform wait times.

text
Loading...

C-SCAN is the most commonly used algorithm for HDDs because it provides bounded worst-case wait times.

Algorithm Comparison

AlgorithmSeek distanceStarvation riskUniformity
FCFSHighNonePoor
SSTFLow–mediumYesPoor
SCANLowNoneMedium
C-SCANLowNoneGood

Key Takeaways

  • The CPU talks to device controllers via MMIO; controllers handle the actual device protocol
  • Polling is simple but wastes CPU; interrupt-driven I/O is the standard for most devices
  • DMA lets devices transfer data directly to RAM, freeing the CPU for other work
  • Device drivers abstract hardware; the kernel I/O subsystem sits above them
  • The page cache serves most file reads from RAM — disk access is the exception, not the rule
  • Blocking I/O is simple; non-blocking + epoll is necessary for high-concurrency servers
  • For HDDs, C-SCAN or SCAN gives the best balance of throughput and fairness