System Calls — The Bridge Between User Space and the Kernel
Your program calls open("data.txt"). A few microseconds later you have a file descriptor. What actually happened in between is one of the most fundamental mechanisms in operating systems — a controlled handoff from your code to the kernel and back.
Kernel Space vs User Space
Modern CPUs operate at different privilege levels. x86 CPUs define four rings (0–3), but mainstream operating systems use only two of them: ring 0 for the kernel and ring 3 for user programs.
This separation is deliberate. A misbehaving user program cannot corrupt kernel data structures or access another process's memory. Any attempt to do so triggers a CPU fault rather than silent corruption.
The separation also means user programs cannot directly do anything interesting — they cannot read files, allocate memory, or send bytes over the network. They need a controlled door into the kernel. That door is the system call interface.
What a System Call Is
A system call (syscall) is a request from user space asking the kernel to perform a privileged operation on your behalf. The kernel validates the request, performs it with its elevated privileges, and returns the result to user space.
Every high-level I/O or process operation you use is built on top of syscalls:
| Python / C | Underlying syscall |
|---|---|
| open(path) | open(2) |
| file.read(n) | read(2) |
| os.fork() | fork(2) |
| socket.connect() | connect(2) |
| malloc() | brk(2) or mmap(2) |
The number in parentheses is the traditional UNIX manual section — open(2) means the open syscall, documented in man section 2.
How a Syscall Works — Step by Step
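The mechanism varies slightly by architecture. On x86-64 Linux the syscall instruction is used. When your program calls read(), the libc wrapper loads the syscall number into rax and the arguments into registers, then executes syscall. The CPU switches from ring 3 to ring 0 and jumps to the kernel's entry point. The kernel saves the user registers, uses the number in rax to look up the handler, runs it, places the result in rax, and returns to ring 3, where the wrapper hands the result back to your code.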
The key point: the CPU never just jumps into arbitrary kernel code. The syscall instruction always transfers control to a fixed entry point, and the kernel dispatches based on the syscall number. This prevents user code from calling internal kernel functions at arbitrary offsets.
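To make the number-based dispatch concrete, here is a minimal sketch that issues read by its syscall number through libc's generic syscall() wrapper, using ctypes. It assumes x86-64 Linux, where read is number 0. The names `raw_read`, `demo`, and `is_x86_64_linux` are invented for this example, and the code is guarded so it does nothing on other platforms.

```python
import ctypes
import os
import platform

SYS_READ = 0  # read's number on x86-64 Linux; other architectures differ

def raw_read(fd, count):
    """Issue read(fd, buf, count) by number via libc's syscall() wrapper."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    buf = ctypes.create_string_buffer(count)
    # The wrapper places SYS_READ in rax and the arguments in rdi, rsi, rdx,
    # then executes the syscall instruction.
    n = libc.syscall(ctypes.c_long(SYS_READ), ctypes.c_int(fd),
                     buf, ctypes.c_size_t(count))
    if n < 0:
        # libc has already turned the kernel's negative result into -1 + errno.
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:n]

def demo():
    """Push five bytes through a pipe and read them back with the raw syscall."""
    r, w = os.pipe()
    try:
        os.write(w, b"hello")
        return raw_read(r, 5)
    finally:
        os.close(r)
        os.close(w)

def is_x86_64_linux():
    return platform.system() == "Linux" and platform.machine() == "x86_64"

if __name__ == "__main__" and is_x86_64_linux():
    print(demo())
```

Note that user code only ever supplies a number and arguments; it never names a kernel address, which is the point made above.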
Common Syscall Categories
File I/O
These form the UNIX "everything is a file" model. A file descriptor is just an integer index into the kernel's per-process file table.
| Syscall | Signature | Description |
|---|---|---|
| open | open(path, flags, mode) → fd | Open or create a file |
| read | read(fd, buf, count) → n | Read up to count bytes; returns the number actually read |
| write | write(fd, buf, count) → n | Write up to count bytes; returns the number actually written |
| close | close(fd) → 0 | Release the file descriptor |
| lseek | lseek(fd, offset, whence) → pos | Move read/write position |
| stat | stat(path, statbuf) → 0 | Get file metadata |
| fstat | fstat(fd, statbuf) → 0 | Same, via fd |
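
The calls in this table can be exercised one by one from Python's os module, whose functions are thin wrappers over them. The following is a small sketch, not a recommended way to do file I/O in Python. The function name `fd_roundtrip` and the file name are placeholders chosen for the example.

```python
import os
import tempfile

def fd_roundtrip():
    """Walk one file descriptor through open, write, lseek, read, fstat, close."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)  # open(2) -> small integer
        try:
            written = os.write(fd, b"hello, kernel")        # write(2) -> bytes written
            os.lseek(fd, 7, os.SEEK_SET)                     # lseek(2): jump to offset 7
            tail = os.read(fd, 100)                          # read(2): up to 100 bytes
            size = os.fstat(fd).st_size                      # fstat(2): metadata via fd
        finally:
            os.close(fd)                                     # close(2)
    return fd, written, tail, size

if __name__ == "__main__":
    print(fd_roundtrip())
```

The read asks for 100 bytes but gets back only what remains after the seek, which is why the return value of read always has to be checked.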
Process Management
| Syscall | Description |
|---|---|
| fork() | Create a copy of the current process. Returns 0 to child, child PID to parent. |
| execve(path, argv, envp) | Replace process image with a new program. Does not return on success. |
| wait(status) | Parent blocks until a child exits; collects its exit status. |
| exit(code) | Terminate the calling process. |
| getpid() | Return the process's PID. |
| kill(pid, sig) | Send a signal to a process. |
fork + exec together implement "run a program": fork creates a child, exec replaces the child's image with the target binary. Shells do this for every command you type.
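The fork-then-exec pattern can be sketched in a few lines with the os module. This is an illustration of the mechanism under the assumption of a POSIX system, not how a real shell is written. `run_program` is a made-up helper name, and exit code 127 for a failed exec simply mirrors the usual shell convention.

```python
import os
import sys

def run_program(argv):
    """Run argv the way a shell would; return the child's exit code."""
    pid = os.fork()                       # fork: returns 0 in the child
    if pid == 0:
        try:
            os.execvp(argv[0], argv)      # exec: replaces the child's image
        except OSError:
            os._exit(127)                 # only reached if exec itself failed
    _, status = os.waitpid(pid, 0)        # wait: parent blocks until child exits
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)           # killed by a signal

if __name__ == "__main__":
    code = run_program([sys.executable, "-c", "print('hello from the child')"])
    print("child exit code:", code)
```

Nothing after execvp runs in the child on success, which matches the "does not return" note in the table.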
Memory Management
| Syscall | Description |
|---|---|
| brk(addr) | Move the top of the heap. Used by malloc for small allocations. |
| mmap(addr, len, prot, flags, fd, off) | Map memory or a file into the virtual address space. |
| munmap(addr, len) | Remove a mapping. |
| mprotect(addr, len, prot) | Change protection bits (read/write/exec) on a memory region. |
mmap is enormously versatile: load a shared library, allocate large memory blocks, memory-map a file, or set up shared memory between processes — all through the same syscall.
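Two of those uses, anonymous memory and a memory-mapped file, can be tried with Python's standard mmap module. This is a minimal sketch; the function name `mmap_demo` and the sample bytes are arbitrary.

```python
import mmap
import tempfile

def mmap_demo():
    """Show an anonymous mapping and a file-backed mapping side by side."""
    # Anonymous mapping: plain memory, not backed by any file.
    anon = mmap.mmap(-1, mmap.PAGESIZE)
    anon[:5] = b"hello"
    anon_bytes = bytes(anon[:5])
    anon.close()                               # unmaps the region

    # File-backed mapping: stores into the mapping change the file itself.
    with tempfile.TemporaryFile() as f:
        f.write(b"abcdef")
        f.flush()
        with mmap.mmap(f.fileno(), 0) as m:    # length 0 maps the whole file
            m[0:3] = b"XYZ"
            m.flush()                          # push dirty pages to the file
        f.seek(0)
        file_bytes = f.read()
    return anon_bytes, file_bytes

if __name__ == "__main__":
    print(mmap_demo())
```

The file is modified without a single write call on it after the mapping is set up; ordinary memory stores do the work.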
Network I/O
| Syscall | Description |
|---|---|
| socket(domain, type, proto) | Create a socket (returns fd). |
| bind(fd, addr, addrlen) | Assign a local address to the socket. |
| listen(fd, backlog) | Mark socket as passive (server-side). |
| accept(fd, addr, addrlen) | Accept an incoming connection; returns a new fd. |
| connect(fd, addr, addrlen) | Initiate a connection (client-side). |
| send / recv | Send/receive data on a socket. |
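
The usual ordering of these calls is easiest to see with both ends of a connection in one process. The sketch below assumes the loopback interface is available. `loopback_roundtrip` is a name invented for the example, and a single recv is relied on only because the payload is a few bytes.

```python
import socket

def loopback_roundtrip(payload=b"ping"):
    """Server and client in one process, in the order the syscalls are used."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind(("127.0.0.1", 0))        # bind; port 0 lets the kernel pick one
        srv.listen(1)                     # listen: socket becomes passive
        cli.connect(srv.getsockname())    # connect: kernel queues the connection
        conn, _ = srv.accept()            # accept: returns a NEW socket/fd
        try:
            cli.sendall(payload)          # send
            data = conn.recv(len(payload))  # recv (tiny payload, one call suffices)
            return data, conn.fileno() != srv.fileno()
        finally:
            conn.close()
    finally:
        cli.close()
        srv.close()

if __name__ == "__main__":
    print(loopback_roundtrip())
```

The listening socket never carries data itself; each accepted connection gets its own descriptor.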
Using Syscalls from Python
The os module exposes the POSIX syscall layer directly. The standard file operations (open, read, write) are wrappers around the syscalls of the same name.
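A short sketch of that layering follows: the file is written through os.open and os.write, read back through the built-in open, and a failing call shows how the kernel's error code reaches Python. The function and file names are placeholders for this example.

```python
import errno
import os
import tempfile

def os_layer_demo():
    """Write via the os layer, read via built-in open, and provoke ENOENT."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "note.txt")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, b"via os.write\n")
        finally:
            os.close(fd)

        # The built-in open() is a buffered object sitting on top of an fd.
        with open(path, "rb") as f:
            text = f.read()
            has_fd = f.fileno() >= 0

        # A failed syscall surfaces as OSError carrying the kernel's errno.
        try:
            os.open(os.path.join(tmp, "missing.txt"), os.O_RDONLY)
            err = None
        except OSError as exc:
            err = exc.errno
    return text, has_fd, err

if __name__ == "__main__":
    text, has_fd, err = os_layer_demo()
    print(text, has_fd, errno.errorcode.get(err))
```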
Observing Syscalls with strace
strace traces every syscall a process makes. Running strace python3 -c "print('hi')" prints one line per syscall to stderr, from the interpreter's startup through the final write.
Reading strace output:
- Each line is one syscall: name(args) = return_value
- Negative return values are errors (-1 ENOENT means file not found)
- You see every file opened, every memory mapping, the final write that puts "hi" on screen
strace is invaluable for debugging: "why is my program slow?" → look for unexpected read calls or stat calls in a loop.
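When a trace is long, the name(args) = return_value shape is regular enough to pick apart with a small script. The parser below is a rough sketch, and the two sample lines in it are hand-written in strace's usual format, not captured from a real run. It ignores details such as unfinished or resumed calls that real traces can contain.

```python
import re

# name(args) = return_value [ERRNO (message)]
LINE_RE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s*=\s*"
    r"(?P<ret>0x[0-9a-fA-F]+|-?\d+|\?)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+))?"
)

def parse_strace_line(line):
    """Split one strace line into (name, args, return value, errno name)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ret = None if m["ret"] == "?" else int(m["ret"], 0)  # base 0 accepts 0x... too
    return m["name"], m["args"], ret, m["errno"]

# Illustrative, hand-written lines (assumed typical, not real output).
SAMPLE_LINES = [
    r'openat(AT_FDCWD, "missing.txt", O_RDONLY) = -1 ENOENT (No such file or directory)',
    r'write(1, "hi\n", 3) = 3',
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_strace_line(line))
```

Counting the parsed names (for example with collections.Counter) is a quick way to spot a stat or read call repeated in a loop.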
Performance — Why Syscalls Are Expensive
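A syscall is not free. Beyond the work the kernel actually does, each call pays for the mode switch itself (ring 3 → ring 0 → ring 3), which costs roughly 100–300 cycles on modern hardware, plus the cache and TLB disturbance that the kernel's code and data leave behind. This is why: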
- Buffered I/O (fread / Python's file.read) batches many small reads into a few large read syscalls
- writev (scatter-gather I/O) lets you write several buffers in one syscall instead of multiple write calls
- io_uring (Linux 5.1+) submits I/O requests asynchronously through shared ring buffers, so many operations can be issued without one syscall round-trip each
- Databases use O_DIRECT (bypass the page cache) and batch multiple operations to avoid syscall overhead per row
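The writev point can be tried directly, since Python exposes it as os.writev on Unix. The sketch below writes the same 100 small chunks twice, once with one write per chunk and once with a single writev, and returns how many syscalls each approach issued. The helper names are invented for the example, and it assumes the whole batch is written in one call, which is typical for small writes to a regular file but not guaranteed in general.

```python
import os
import tempfile

def write_one_by_one(fd, chunks):
    """One write syscall per chunk; returns the number of syscalls issued."""
    for chunk in chunks:
        os.write(fd, chunk)
    return len(chunks)

def write_batched(fd, chunks):
    """A single writev syscall for all chunks (must stay under IOV_MAX buffers)."""
    os.writev(fd, chunks)
    return 1

def compare(n=100):
    """Write the same chunks both ways; return [(syscalls, file contents), ...]."""
    chunks = [b"row %d\n" % i for i in range(n)]
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, writer in (("loop", write_one_by_one), ("batched", write_batched)):
            path = os.path.join(tmp, name)
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            try:
                calls = writer(fd, chunks)
            finally:
                os.close(fd)
            with open(path, "rb") as f:
                results.append((calls, f.read()))
    return results

if __name__ == "__main__" and hasattr(os, "writev"):
    for calls, data in compare():
        print(calls, "syscall(s),", len(data), "bytes")
```

Both files end up identical; only the number of kernel entries differs, which is where the savings come from.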
Meltdown / Spectre and KPTI
The 2018 Meltdown vulnerability exploited CPU speculation to read kernel memory from user space. The fix, Kernel Page Table Isolation (KPTI), removes almost all kernel mappings from the page tables used while running in user space, leaving only the small stub needed to enter the kernel. Every syscall now also switches page tables on entry and exit, raising syscall cost to 500–1000 cycles on older Intel hardware.
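On Linux, the kernel reports its Meltdown status through a sysfs file, which gives a quick way to see whether KPTI is active on a given machine. The sketch below just reads that file and returns None where it does not exist (non-Linux systems or very old kernels). The function name is made up for the example; typical contents are strings such as "Mitigation: PTI" or "Not affected".

```python
from pathlib import Path

# sysfs file where Linux reports the Meltdown mitigation state.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")

def meltdown_status(path=MELTDOWN_STATUS):
    """Return the kernel's Meltdown status line, or None if unavailable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    print(meltdown_status())
```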
Syscall Table — Numbers on Linux x86-64
The kernel dispatches based on the number in rax. A few notable ones:
| Number | Name | What it does |
|---|---|---|
| 0 | read | Read from fd |
| 1 | write | Write to fd |
| 2 | open | Open file (legacy; newer code uses openat) |
| 3 | close | Close fd |
| 9 | mmap | Map memory |
| 12 | brk | Adjust heap end |
| 39 | getpid | Get PID |
| 57 | fork | Create child process |
| 59 | execve | Execute program |
| 60 | exit | Terminate process |
| 102 | getuid | Get user ID |
| 231 | exit_group | Exit all threads |
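
These numbers can be checked from user space by invoking a syscall by number and comparing with the normal wrapper. The sketch below does that for the two argument-free entries in the table, again through libc's generic syscall() function via ctypes. It assumes x86-64 Linux (the numbers are different elsewhere), and the helper names are hypothetical.

```python
import ctypes
import os
import platform

# Numbers copied from the table above; valid for x86-64 Linux only.
SYSCALL_NUMBERS = {
    "read": 0, "write": 1, "open": 2, "close": 3, "mmap": 9, "brk": 12,
    "getpid": 39, "fork": 57, "execve": 59, "exit": 60,
    "getuid": 102, "exit_group": 231,
}

# Only these take no arguments, so they are safe to call blindly by number.
NO_ARG_CALLS = ("getpid", "getuid")

def call_by_number(name):
    """Invoke an argument-free syscall using its number from the table."""
    if name not in NO_ARG_CALLS:
        raise ValueError(f"{name} needs arguments; not supported by this sketch")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(SYSCALL_NUMBERS[name]))

def on_x86_64_linux():
    return platform.system() == "Linux" and platform.machine() == "x86_64"

if __name__ == "__main__" and on_x86_64_linux():
    print(call_by_number("getpid") == os.getpid())
    print(call_by_number("getuid") == os.getuid())
```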
Summary
| Concept | One-line description |
|---|---|
| User space | Unprivileged mode; can't do I/O directly |
| Kernel space | Privileged mode; manages hardware |
| syscall instruction | Trap that switches from ring 3 to ring 0 |
| Syscall number | Integer in rax that identifies the operation |
| Return value | In rax after return; negative means error |
| strace | Tool that shows every syscall a process makes |
| Syscall cost | ~100–300 cycles for the mode switch alone |
System calls are the only legitimate path from user programs into the kernel. Every byte you read from disk, every socket you open, every process you spawn goes through this mechanism. Understanding it tells you where your program's time actually goes and why batching small operations into fewer, larger syscalls is almost always worth it.