containers · level 2

Namespaces & cgroups

What makes a container a container — and what it isn't.

150 XP

Namespaces & cgroups

A container is not a small VM. There is no guest kernel, no boot, no hypervisor. A container is a tree of processes on the host that the kernel has placed into Linux namespaces and cgroups. That's it. Strip those two mechanisms away and docker run is just fork + exec.

The two mechanisms answer different questions:

  • Namespaces — what can this process see?
  • cgroups — how much can this process use?

Analogy

Think of a shared office building. A namespace is a soundproofed, one-way-mirror office: the person inside thinks they are alone in the building — their own room number, their own kettle, their own notice board — and has no idea the other offices even exist. A cgroup is the building manager's clipboard: this tenant gets one parking space, 20 amps of power, and ten gigs of hot water a day, regardless of how many times they try to run the shower. Same building, same walls; the two clipboards shape the illusion.

The namespaces

Each namespace isolates one aspect of the kernel's global view.

Namespace  What it isolates
pid        Process IDs. Inside, the container's entrypoint is PID 1.
net        Interfaces, routes, iptables rules, sockets, port numbers.
mnt        Mount points. Lets the container have its own root filesystem.
ipc        System V IPC, POSIX message queues.
uts        Hostname and NIS domain name (from "Unix Time-sharing System").
user       UID/GID mappings. Root inside maps to an unprivileged UID outside.
cgroup     The process's view of the cgroup hierarchy.
time       Boot-time and monotonic clock offsets (Linux 5.6 and later).

You can see them on any Linux box:

ls -l /proc/self/ns/
# each entry is a symlink to an inode that identifies the namespace

Two processes are in the same namespace iff those symlinks resolve to the same inode.
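
This is easy to check from a shell. On a plain host the current process and PID 1 share every namespace; inside a container with its own pid namespace the inodes differ:

```shell
# Each /proc/<pid>/ns entry reads as "<type>:[<inode>]", e.g. "pid:[4026531836]".
# Equal strings mean the two processes share that namespace.
self_ns=$(readlink /proc/self/ns/pid)
init_ns=$(readlink /proc/1/ns/pid)

if [ "$self_ns" = "$init_ns" ]; then
    echo "same pid namespace as PID 1"
else
    echo "different pid namespace (likely inside a container)"
fi
```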

cgroups

Namespaces isolate visibility; cgroups cap consumption. They're organized as a filesystem under /sys/fs/cgroup with one sub-hierarchy per resource controller.

Controller  What it caps
cpu         CPU time (shares, quotas, periods).
memory      Bytes of anonymous memory + page cache. Exceeding the limit triggers an OOM kill inside the container.
io (blkio)  Disk bandwidth and IOPS per block device.
pids        Number of processes — prevents fork bombs.
devices     Which /dev nodes the group may open.

cat /sys/fs/cgroup/memory.max    # bytes this cgroup may use ("max" = unlimited)
cat /sys/fs/cgroup/cpu.max       # two fields: "<quota> <period>", both in µs

A container runtime (runc, crun) writes to these files; that's 90% of what it does.
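
You can do most of that by hand. A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and root privileges (demo is a throwaway cgroup name invented for this example):

```shell
# Create a child cgroup and cap it; no container runtime involved.
mkdir /sys/fs/cgroup/demo
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.max  # 256 MiB ceiling
echo 64 > /sys/fs/cgroup/demo/pids.max                        # fork-bomb ceiling
echo $$ > /sys/fs/cgroup/demo/cgroup.procs                    # move this shell in

# Every process this shell spawns from now on inherits both limits.
cat /sys/fs/cgroup/demo/memory.max
```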

What a container runtime actually does

Roughly, when you run docker run --cpus=2 --memory=512m nginx, the flow is:

  1. Pull the image (layers into the content store).
  2. Prepare a root filesystem (overlayfs combining the read-only layers and a writable layer).
  3. clone() a new process with CLONE_NEW* flags to create fresh namespaces.
  4. Mount /proc, /sys, /dev inside the new mount namespace.
  5. pivot_root into the prepared root.
  6. Write the cgroup limits to /sys/fs/cgroup.
  7. execve the entrypoint.
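
Steps 3–5 can be roughly reproduced with util-linux's unshare; a sketch assuming root privileges and a hypothetical $ROOTFS directory holding an extracted image (e.g. from docker export):

```shell
ROOTFS=/tmp/rootfs   # assumption: an extracted container filesystem lives here

# New pid, mount, uts, ipc and net namespaces; --fork so the child
# becomes PID 1 of the new pid namespace. Then swap roots and mount /proc.
unshare --pid --fork --mount --uts --ipc --net \
    chroot "$ROOTFS" /bin/sh -c 'mount -t proc proc /proc && exec /bin/sh'
```

Real runtimes use pivot_root rather than chroot, because a bare chroot can be escaped by a privileged process, but the shape of the flow is the same.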

No kernel was booted. No VM was started. Just a very carefully constructed process.

Why this matters in practice

  • A bug in the host kernel is a bug in your container. Containers share kernels; VMs don't.
  • A missing user namespace means root inside the container is root on the host if it escapes. Prefer rootless runtimes (Podman, rootless Docker) or explicit user-namespace remapping.
  • CPU and memory limits are a single echo into cgroup files away — you don't need a whole orchestrator to use them. Systemd uses cgroups for every service it starts.
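
As an illustration of that last point, systemd-run can put a one-off command under cgroup limits with no orchestrator at all; a sketch assuming a systemd host (MemoryMax and CPUQuota are standard systemd resource-control properties; a system-level scope needs root):

```shell
# Run a command in a transient scope capped at 512 MiB of memory and
# two cores' worth of CPU (CPUQuota=200% becomes the cpu.max quota).
systemd-run --scope -p MemoryMax=512M -p CPUQuota=200% sleep 30
```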

Containers ≠ VMs — one more time

                    VM                       Container
Kernel              Own guest kernel         Shares the host kernel
Boot time           Seconds to minutes       Milliseconds
Memory overhead     Hundreds of MB per VM    Process overhead only
Isolation boundary  Hardware virtualization  Kernel namespaces + cgroups
Compromised root    Contained by the VM      Contained only by the kernel's attack surface

VMs are stronger isolation. Containers are lighter. You can nest them: modern sandboxed runtimes (gVisor, Kata) run each container inside a thin VM for the best of both.