Namespaces & cgroups
What makes a container a container — and what it isn't.
A container is not a small VM. There is no guest kernel, no boot, no hypervisor. A container is a tree of processes on the host that the kernel has placed into Linux namespaces and cgroups. That's it. Strip those two mechanisms away and docker run is just fork + exec.
The two mechanisms answer different questions:
- Namespaces — what can this process see?
- cgroups — how much can this process use?
Analogy
Think of a shared office building. A namespace is a soundproofed, one-way-mirror office: the person inside thinks they are alone in the building (their own room number, their own kettle, their own notice board) and has no idea the other offices even exist. A cgroup is the building manager's clipboard: this tenant gets one parking space, 20 amps of power, and ten gigs of hot water a day, regardless of how many times they try to run the shower. Same building, same walls; the office and the clipboard together shape the illusion.
The namespaces
Each namespace isolates one aspect of the kernel's global view.
| Namespace | What it isolates |
|---|---|
| pid | Process IDs. Inside, the container's entrypoint is PID 1. |
| net | Interfaces, routes, iptables, sockets, port numbers. |
| mnt | Mount points. Lets the container have its own root filesystem. |
| ipc | System V IPC, POSIX message queues. |
| uts | Hostname and NIS domain name (from "Unix Timesharing System"). |
| user | UID/GID mappings. Root inside → unprivileged UID outside. |
| cgroup | View of the cgroup hierarchy. |
| time | Offsets for the boot-time and monotonic clocks (Linux 5.6 and later). |
You can see them on any Linux box:
ls -l /proc/self/ns/
# each entry is a symlink to an inode that identifies the namespace
Two processes are in the same namespace iff those symlinks resolve to the same inode.
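A minimal sketch of that check, assuming a Linux /proc and a readlink command on the PATH. The helper name same_ns is made up for illustration, and it compares the link text (for example pid:[4026531836]) instead of calling stat, which is a simplification.

```shell
# same_ns LINK_A LINK_B
# Hypothetical helper: succeeds when both namespace symlinks carry the same
# "type:[inode]" text, i.e. both processes are in the same namespace.
same_ns() {
  a=$(readlink "$1") || return 2   # 2 = link could not be read
  b=$(readlink "$2") || return 2
  [ "$a" = "$b" ]
}

# Compare this shell with PID 1. Reading /proc/1/ns usually needs root,
# so an unreadable link is reported together with "different".
for ns in pid net mnt; do
  if same_ns "/proc/$$/ns/$ns" "/proc/1/ns/$ns" 2>/dev/null; then
    echo "$ns: shared with PID 1"
  else
    echo "$ns: different from PID 1 (or not readable)"
  fi
done
```

On the host, an ordinary root shell should report all three as shared; a shell inside a container is compared against the container's own PID 1.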
cgroups
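Namespaces isolate visibility; cgroups cap consumption. They're exposed as a filesystem under /sys/fs/cgroup. With cgroup v2, the default on current distributions, that is one unified hierarchy in which each cgroup directory carries the control files of its enabled controllers; the older cgroup v1 layout mounted one sub-hierarchy per resource controller.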
| Controller | What it caps |
|---|---|
| cpu | CPU time (shares, quotas, periods). |
| memory | Bytes of anon + page cache. Exceeds → OOM kill inside the container. |
| io (blkio) | Disk bandwidth and IOPS per block device. |
| pids | Number of processes; prevents fork bombs. |
| devices | Which /dev nodes the group may open. |
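# cgroup v2, from inside a container; the host's root cgroup has neither file
cat /sys/fs/cgroup/memory.max # bytes this cgroup may use, or "max" for no limit
cat /sys/fs/cgroup/cpu.max # "<quota> <period>" in µs; a quota of "max" means no limit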
A container runtime (runc, crun) applies a container's limits by writing to these files.
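As a sketch of how the cpu.max format can be read back, the helper below turns the quota and period into a CPU count. The function name cpu_limit is hypothetical, and the default path assumes cgroup v2 as seen from inside a container.

```shell
# cpu_limit [FILE]
# Hypothetical helper: prints how many CPUs a cgroup may use, based on the
# "<quota> <period>" pair in its cpu.max file, or "unlimited" if quota is "max".
cpu_limit() {
  file=${1:-/sys/fs/cgroup/cpu.max}
  read -r quota period < "$file" || return 1
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    # quota / period = number of CPUs' worth of time per period
    awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
  fi
}

# Only query the live file when it exists (it does not on the host's root cgroup).
if [ -r /sys/fs/cgroup/cpu.max ]; then
  echo "CPU limit: $(cpu_limit)"
fi
```

A file containing 200000 100000 therefore reads as 2.00 CPUs.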
What a container runtime actually does
Roughly, when you run docker run --cpus=2 --memory=512m nginx, the flow is:
- Pull the image (layers into the content store).
- Prepare a root filesystem (overlayfs combining the read-only layers and a writable layer).
- clone() a new process with CLONE_NEW* flags to create fresh namespaces.
- Mount /proc, /sys, /dev inside the new mount namespace.
- pivot_root into the prepared root.
- Write the cgroup limits to /sys/fs/cgroup.
- execve the entrypoint.
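The cgroup step above can be made concrete for the example flags. This sketch assumes cgroup v2 and a 100000 µs CPU period (the default Docker documents for --cpu-period); cpu_max_for and mem_to_bytes are hypothetical helper names, not part of any runtime, and only the k, m and g suffixes are handled.

```shell
PERIOD_US=100000   # assumed CPU period in microseconds

# cpu_max_for CPUS: prints the "<quota> <period>" line for cpu.max.
cpu_max_for() {
  awk -v cpus="$1" -v period="$PERIOD_US" \
    'BEGIN { printf "%.0f %d\n", cpus * period, period }'
}

# mem_to_bytes SIZE: converts 512m, 1g, 64k or a plain byte count to bytes.
mem_to_bytes() {
  num=${1%[kKmMgG]}   # strip a trailing unit letter, if any
  case $1 in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# Values a runtime would end up writing for --cpus=2 --memory=512m
echo "cpu.max    <- $(cpu_max_for 2)"
echo "memory.max <- $(mem_to_bytes 512m)"
```

Under those assumptions, --cpus=2 becomes 200000 100000 in cpu.max and --memory=512m becomes 536870912 in memory.max.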
No kernel was booted. No VM was started. Just a very carefully constructed process.
Why this matters in practice
- A bug in the host kernel is a bug in your container. Containers share kernels; VMs don't.
- A missing user namespace means root inside the container is root on the host if it escapes. Prefer rootless runtimes (Podman, rootless Docker) or explicit user-namespace remapping.
- CPU and memory limits are a single echo into cgroup files away; you don't need a whole orchestrator to use them. Systemd uses cgroups for every service it starts.
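A sketch of that single echo, wrapped in a small function. It assumes cgroup v2, root privileges, and a parent cgroup whose cgroup.subtree_control enables the memory controller; limit_memory and the cgroup name demo are hypothetical.

```shell
# limit_memory CGROUP_DIR BYTES [PID]
# Hypothetical helper: creates the cgroup directory, caps its memory, and
# optionally moves a process into it.
limit_memory() {
  mkdir -p "$1" || return 1
  echo "$2" > "$1/memory.max" || return 1      # the "single echo"
  if [ -n "${3:-}" ]; then
    echo "$3" > "$1/cgroup.procs" || return 1  # move PID into the cgroup
  fi
}

# Example, as root on a cgroup v2 host: cap the current shell at 512 MiB.
#   limit_memory /sys/fs/cgroup/demo 536870912 $$
```

Everything started from that shell afterwards inherits the cgroup and therefore the cap.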
Containers ≠ VMs — one more time
| | VM | Container |
|---|---|---|
| Kernel | Its own guest kernel | Shares host kernel |
| Boot time | Seconds–minutes | Milliseconds |
| Memory overhead | 100s of MB per VM | Process overhead only |
| Isolation boundary | Hardware virtualization | Kernel namespaces + cgroups |
| Compromised root | Contained inside the VM | Contained only as long as the shared host kernel has no exploitable bug |
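VMs give stronger isolation. Containers are lighter. The two can be combined: sandboxed runtimes add a second boundary around each container. Kata Containers runs containers inside a lightweight VM, while gVisor places a user-space kernel between the container and the host kernel.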