I/O Systems — Device Drivers, Buffering & DMA

Reading a file seems instantaneous from Python. Behind that single call is an intricate stack of hardware controllers, interrupt handlers, kernel buffers, and device drivers working together. Understanding this stack explains why some I/O is fast, some is slow, and what you can do about it.

I/O Hardware Overview#

A computer's I/O hardware sits on buses — shared electrical pathways that connect the CPU, RAM, and peripheral devices.

Each device has a controller — a small chip that understands the device's native protocol and exposes a standardised interface (registers, buffers) to the CPU. The CPU talks to the controller; the controller talks to the device.

There are two ways for the CPU to address controller registers:

Port-mapped I/O: separate address space, accessed with special in/out CPU instructions (legacy x86)
Memory-mapped I/O (MMIO): controller registers appear at fixed physical addresses; the CPU uses normal load/store instructions

Polling vs Interrupt-Driven I/O#

How does the CPU know when a device has finished its work?

Polling (Busy Wait)#

The CPU repeatedly checks a status register on the controller until the device signals "done."

python

Loading editor…

text

Loading editor…

Simple to implement
Very low latency (no interrupt overhead) for fast devices
Wasteful for slow devices — the CPU cannot do other work while spinning
Used in specific cases: network drivers in high-speed NICs (DPDK), short spinlocks

Interrupt-Driven I/O#

The CPU issues an I/O request, then goes off and does something else. When the device finishes, it asserts an interrupt line. The CPU stops what it's doing, saves its state, runs the Interrupt Service Routine (ISR), and resumes.

text

Loading editor…

The interrupt controller (APIC on x86) arbitrates multiple interrupt sources and delivers them to the CPU with a priority level and a vector number. The vector indexes into the Interrupt Descriptor Table (IDT), which maps vector numbers to ISR addresses.

Interrupt-driven I/O is the default for most devices. The overhead per interrupt is real (hundreds of nanoseconds to save state, run the ISR, return), so high-speed NICs receiving millions of packets per second may switch to polling to avoid ISR overhead — a technique called NAPI in the Linux kernel.

DMA — Direct Memory Access#

Even with interrupts, there is still a problem: if the CPU has to copy each byte from the device controller's buffer to RAM, it wastes cycles on data movement.

DMA lets the device controller transfer data directly to/from RAM without involving the CPU for each byte.

text

Loading editor…

text

Loading editor…

Modern DMA controllers are sophisticated: they support scatter-gather (writing to non-contiguous memory regions), chaining multiple transfers, and handling both read and write directions.

I/O Software Layers#

The OS organises I/O software in horizontal layers, each talking only to the layer above and below it:

Device drivers are the key abstraction. They present a uniform interface (read, write, ioctl, mmap) to the kernel I/O subsystem while handling all device-specific details. A driver for an NVMe SSD and a driver for a USB stick both expose a block device interface; the filesystem layer above doesn't care which hardware it's talking to.

Buffering#

Transferring data in large chunks is much more efficient than one byte at a time. The OS uses several buffering strategies.

Why Buffer?#

Device speeds don't match application consumption rates (a disk delivers 200 MB/s but the app reads in 100-byte chunks)
Amortises system call overhead — one read(64KB) beats 65536 read(1B) calls
Allows the OS to reorder writes (write-back cache)

Single Buffer#

text

Loading editor…

The kernel gives the process a buffer to fill. When the buffer is full (or explicitly flushed), the kernel writes to the device. The process can keep writing into the next buffer while the previous one drains.

Double Buffer#

Two buffers alternate. While the kernel writes buffer A to the device, the application fills buffer B. Then they swap. This hides device latency entirely when throughput is balanced.

text

Loading editor…

Circular Buffer (Ring Buffer)#

A fixed-size buffer treated as a circle with head (read) and tail (write) pointers. Common in network drivers and audio I/O.

text

Loading editor…

tail == head: buffer empty
(tail + 1) % size == head: buffer full
Lock-free variants (single producer, single consumer) avoid mutex overhead

The Page Cache#

Linux does not read from disk on every read() call. It maintains a page cache (also called the buffer cache in older literature) — a region of RAM that holds recently accessed file data.

text

Loading editor…

The page cache is why re-reading a large file is much faster than the first read. free -m on Linux shows cache usage in the "buff/cache" column — that memory is in use but is immediately reclaimed if a process needs RAM.

You can observe the cache effect:

bash

Loading editor…

Blocking vs Non-Blocking vs Asynchronous I/O#

Three distinct I/O models:

Blocking I/O (default)#

The calling thread sleeps until the I/O completes.

text

Loading editor…

Simple, correct, but one blocked I/O blocks one thread. For a server handling 10,000 connections, you'd need 10,000 threads.

Non-Blocking I/O#

The call returns immediately with EAGAIN if no data is available. The caller polls repeatedly or uses a readiness notification mechanism (select, poll, epoll).

python

Loading editor…

select monitors multiple file descriptors and returns which ones are ready. epoll (Linux) scales to millions of fds; select is limited to 1024.

Using selectors (higher-level)#

python

Loading editor…

Asynchronous I/O (AIO)#

The application submits an I/O request and a completion callback. The kernel runs the I/O and calls the callback when done — the application never blocks. Python's asyncio module builds on this model (using the event loop to multiplex many async operations on a single thread).

Model	Thread blocks?	Notification	Python
Blocking	Yes	Return from call	default `open().read()`
Non-blocking	No	Poll / readiness event	`socket.setblocking(False)` + `select`
Async	No	Completion callback	`asyncio`, `aiofiles`

Disk Scheduling Algorithms#

A hard disk drive's read/write head takes time to move between tracks (seek time) and to wait for the right sector to rotate under it (rotational latency). If 20 processes all want disk reads, the order matters enormously.

SSDs have no mechanical head, so disk scheduling is less critical for them — but many systems still run HDDs in NAS/backup scenarios.

FCFS — First Come, First Served#

Service requests in arrival order.

text

Loading editor…

Simple but causes wild head thrashing if requests are scattered.

SSTF — Shortest Seek Time First#

Always service the request closest to the current head position.

text

Loading editor…

Better throughput, but can starve requests far from the current head position.

SCAN (Elevator Algorithm)#

The head sweeps from one end to the other, servicing requests in its current direction. At the end it reverses.

text

Loading editor…

Good throughput; requests near the ends get longer waits.

C-SCAN (Circular SCAN)#

Like SCAN, but on the return sweep the head jumps back to the beginning without servicing requests. Provides more uniform wait times.

text

Loading editor…

C-SCAN is the most commonly used algorithm for HDDs because it provides bounded worst-case wait times.

Algorithm Comparison#

Algorithm	Seek distance	Starvation risk	Uniformity
FCFS	High	None	Poor
SSTF	Low–medium	Yes	Poor
SCAN	Low	None	Medium
C-SCAN	Low	None	Good

Key Takeaways#

The CPU talks to device controllers via MMIO; controllers handle the actual device protocol
Polling is simple but wastes CPU; interrupt-driven I/O is the standard for most devices
DMA lets devices transfer data directly to RAM, freeing the CPU for other work
Device drivers abstract hardware; the kernel I/O subsystem sits above them
The page cache serves most file reads from RAM — disk access is the exception, not the rule
Blocking I/O is simple; non-blocking + epoll is necessary for high-concurrency servers
For HDDs, C-SCAN or SCAN gives the best balance of throughput and fairness

I/O Systems — Device Drivers, Buffering & DMA

I/O Hardware Overview#

A computer's I/O hardware sits on buses — shared electrical pathways that connect the CPU, RAM, and peripheral devices.

There are two ways for the CPU to address controller registers:

Port-mapped I/O: separate address space, accessed with special in/out CPU instructions (legacy x86)
Memory-mapped I/O (MMIO): controller registers appear at fixed physical addresses; the CPU uses normal load/store instructions

Polling vs Interrupt-Driven I/O#

How does the CPU know when a device has finished its work?

Polling (Busy Wait)#

The CPU repeatedly checks a status register on the controller until the device signals "done."

python

Loading editor…

text

Loading editor…

Simple to implement
Very low latency (no interrupt overhead) for fast devices
Wasteful for slow devices — the CPU cannot do other work while spinning
Used in specific cases: network drivers in high-speed NICs (DPDK), short spinlocks

Interrupt-Driven I/O#

text

Loading editor…

DMA — Direct Memory Access#

Even with interrupts, there is still a problem: if the CPU has to copy each byte from the device controller's buffer to RAM, it wastes cycles on data movement.

DMA lets the device controller transfer data directly to/from RAM without involving the CPU for each byte.

text

Loading editor…

text

Loading editor…

Modern DMA controllers are sophisticated: they support scatter-gather (writing to non-contiguous memory regions), chaining multiple transfers, and handling both read and write directions.

I/O Software Layers#

The OS organises I/O software in horizontal layers, each talking only to the layer above and below it:

Buffering#

Transferring data in large chunks is much more efficient than one byte at a time. The OS uses several buffering strategies.

Why Buffer?#

Device speeds don't match application consumption rates (a disk delivers 200 MB/s but the app reads in 100-byte chunks)
Amortises system call overhead — one read(64KB) beats 65536 read(1B) calls
Allows the OS to reorder writes (write-back cache)

Single Buffer#

text

Loading editor…

Double Buffer#

Two buffers alternate. While the kernel writes buffer A to the device, the application fills buffer B. Then they swap. This hides device latency entirely when throughput is balanced.

text

Loading editor…

Circular Buffer (Ring Buffer)#

A fixed-size buffer treated as a circle with head (read) and tail (write) pointers. Common in network drivers and audio I/O.

text

Loading editor…

tail == head: buffer empty
(tail + 1) % size == head: buffer full
Lock-free variants (single producer, single consumer) avoid mutex overhead

The Page Cache#

Linux does not read from disk on every read() call. It maintains a page cache (also called the buffer cache in older literature) — a region of RAM that holds recently accessed file data.

text

Loading editor…

You can observe the cache effect:

bash

Loading editor…

Blocking vs Non-Blocking vs Asynchronous I/O#

Three distinct I/O models:

Blocking I/O (default)#

The calling thread sleeps until the I/O completes.

text

Loading editor…

Simple, correct, but one blocked I/O blocks one thread. For a server handling 10,000 connections, you'd need 10,000 threads.

Non-Blocking I/O#

The call returns immediately with EAGAIN if no data is available. The caller polls repeatedly or uses a readiness notification mechanism (select, poll, epoll).

python

Loading editor…

select monitors multiple file descriptors and returns which ones are ready. epoll (Linux) scales to millions of fds; select is limited to 1024.

Using selectors (higher-level)#

python

Loading editor…

Asynchronous I/O (AIO)#

Model	Thread blocks?	Notification	Python
Blocking	Yes	Return from call	default `open().read()`
Non-blocking	No	Poll / readiness event	`socket.setblocking(False)` + `select`
Async	No	Completion callback	`asyncio`, `aiofiles`

Disk Scheduling Algorithms#

SSDs have no mechanical head, so disk scheduling is less critical for them — but many systems still run HDDs in NAS/backup scenarios.

FCFS — First Come, First Served#

Service requests in arrival order.

text

Loading editor…

Simple but causes wild head thrashing if requests are scattered.

SSTF — Shortest Seek Time First#

Always service the request closest to the current head position.

text

Loading editor…

Better throughput, but can starve requests far from the current head position.

SCAN (Elevator Algorithm)#

The head sweeps from one end to the other, servicing requests in its current direction. At the end it reverses.

text

Loading editor…

Good throughput; requests near the ends get longer waits.

C-SCAN (Circular SCAN)#

Like SCAN, but on the return sweep the head jumps back to the beginning without servicing requests. Provides more uniform wait times.

text

Loading editor…

C-SCAN is the most commonly used algorithm for HDDs because it provides bounded worst-case wait times.

Algorithm Comparison#

Algorithm	Seek distance	Starvation risk	Uniformity
FCFS	High	None	Poor
SSTF	Low–medium	Yes	Poor
SCAN	Low	None	Medium
C-SCAN	Low	None	Good

Key Takeaways#

The CPU talks to device controllers via MMIO; controllers handle the actual device protocol
Polling is simple but wastes CPU; interrupt-driven I/O is the standard for most devices
DMA lets devices transfer data directly to RAM, freeing the CPU for other work
Device drivers abstract hardware; the kernel I/O subsystem sits above them
The page cache serves most file reads from RAM — disk access is the exception, not the rule
Blocking I/O is simple; non-blocking + epoll is necessary for high-concurrency servers
For HDDs, C-SCAN or SCAN gives the best balance of throughput and fairness

I/O Systems — Buffering, DMA & Storage Scheduling

I/O Systems — Device Drivers, Buffering & DMA

I/O Hardware Overview#

Polling vs Interrupt-Driven I/O#

Polling (Busy Wait)#

Interrupt-Driven I/O#

DMA — Direct Memory Access#

I/O Software Layers#

Buffering#

Why Buffer?#

Single Buffer#

Double Buffer#

Circular Buffer (Ring Buffer)#

The Page Cache#

Blocking vs Non-Blocking vs Asynchronous I/O#

Blocking I/O (default)#

Non-Blocking I/O#

Using selectors (higher-level)#

Asynchronous I/O (AIO)#

Disk Scheduling Algorithms#

FCFS — First Come, First Served#

SSTF — Shortest Seek Time First#

SCAN (Elevator Algorithm)#

C-SCAN (Circular SCAN)#

Algorithm Comparison#

Key Takeaways#

I/O Systems — Buffering, DMA & Storage Scheduling

I/O Systems — Device Drivers, Buffering & DMA

I/O Hardware Overview#

Polling vs Interrupt-Driven I/O#

Polling (Busy Wait)#

Interrupt-Driven I/O#

DMA — Direct Memory Access#

I/O Software Layers#

Buffering#

Why Buffer?#

Single Buffer#

Double Buffer#

Circular Buffer (Ring Buffer)#

The Page Cache#

Blocking vs Non-Blocking vs Asynchronous I/O#

Blocking I/O (default)#

Non-Blocking I/O#

Using selectors (higher-level)#

Asynchronous I/O (AIO)#

Disk Scheduling Algorithms#

FCFS — First Come, First Served#

SSTF — Shortest Seek Time First#

SCAN (Elevator Algorithm)#

C-SCAN (Circular SCAN)#

Algorithm Comparison#

Key Takeaways#