containers · level 2

Namespaces & cgroups

What makes a container a container — and what it isn't.

150 XP

Namespaces & cgroups

A container is not a small VM. There is no guest kernel, no boot, no hypervisor. A container is a tree of processes on the host that the kernel has placed into Linux namespaces and cgroups. That's it. Strip those two mechanisms away and docker run is just fork + exec.

The two mechanisms answer different questions:

  • Namespaces — what can this process see?
  • cgroups — how much can this process use?

Analogy

Think of a shared office building. A namespace is a soundproofed, one-way-mirror office: the person inside thinks they are alone in the building — their own room number, their own kettle, their own notice board — and has no idea the other offices even exist. A cgroup is the building manager's clipboard: this tenant gets one parking space, 20 amps of power, and ten gigs of hot water a day, regardless of how many times they try to run the shower. Same building, same walls; the two clipboards shape the illusion.

The namespaces

Each namespace isolates one aspect of the kernel's global view.

Namespace  What it isolates
pid        Process IDs. Inside, the container's entrypoint is PID 1.
net        Interfaces, routes, iptables rules, sockets, port numbers.
mnt        Mount points. Lets the container have its own root filesystem.
ipc        System V IPC, POSIX message queues.
uts        Hostname and NIS domain name (from "Unix Time-sharing System").
user       UID/GID mappings. Root inside maps to an unprivileged UID outside.
cgroup     The process's view of the cgroup hierarchy.
time       Boot-time and monotonic clock offsets (Linux 5.6 and later).

You can see them on any Linux box:

ls -l /proc/self/ns/
# each entry is a symlink to an inode that identifies the namespace

Two processes are in the same namespace iff those symlinks resolve to the same inode.
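
This is easy to check from a shell. On a plain host the current process and PID 1 share every namespace; inside a container with its own pid namespace the inodes differ:

```shell
# Each /proc/<pid>/ns entry reads as "<type>:[<inode>]", e.g. "pid:[4026531836]".
# Equal strings mean the two processes share that namespace.
self_ns=$(readlink /proc/self/ns/pid)
init_ns=$(readlink /proc/1/ns/pid)

if [ "$self_ns" = "$init_ns" ]; then
    echo "same pid namespace as PID 1"
else
    echo "different pid namespace (likely inside a container)"
fi
```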

cgroups

Namespaces isolate visibility; cgroups cap consumption. They're organized as a filesystem under /sys/fs/cgroup with one sub-hierarchy per resource controller.

Controller  What it caps
cpu         CPU time (shares, quotas, periods).
memory      Bytes of anonymous memory + page cache. Exceeding the limit triggers an OOM kill inside the container.
io (blkio)  Disk bandwidth and IOPS per block device.
pids        Number of processes — prevents fork bombs.
devices     Which /dev nodes the group may open.

cat /sys/fs/cgroup/memory.max    # bytes this cgroup may use ("max" = unlimited)
cat /sys/fs/cgroup/cpu.max       # two fields: "<quota> <period>", both in µs

A container runtime (runc, crun) writes to these files; that's 90% of what it does.
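
You can do most of that by hand. A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and root privileges (demo is a throwaway cgroup name invented for this example):

```shell
# Create a child cgroup and cap it; no container runtime involved.
mkdir /sys/fs/cgroup/demo
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.max  # 256 MiB ceiling
echo 64 > /sys/fs/cgroup/demo/pids.max                        # fork-bomb ceiling
echo $$ > /sys/fs/cgroup/demo/cgroup.procs                    # move this shell in

# Every process this shell spawns from now on inherits both limits.
cat /sys/fs/cgroup/demo/memory.max
```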

What a container runtime actually does

Roughly, when you run docker run --cpus=2 --memory=512m nginx, the flow is:

  1. Pull the image (layers into the content store).
  2. Prepare a root filesystem (overlayfs combining the read-only layers and a writable layer).
  3. clone() a new process with CLONE_NEW* flags to create fresh namespaces.
  4. Mount /proc, /sys, /dev inside the new mount namespace.
  5. pivot_root into the prepared root.
  6. Write the cgroup limits to /sys/fs/cgroup.
  7. execve the entrypoint.
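
Steps 3–5 can be roughly reproduced with util-linux's unshare; a sketch assuming root privileges and a hypothetical $ROOTFS directory holding an extracted image (e.g. from docker export):

```shell
ROOTFS=/tmp/rootfs   # assumption: an extracted container filesystem lives here

# New pid, mount, uts, ipc and net namespaces; --fork so the child
# becomes PID 1 of the new pid namespace. Then swap roots and mount /proc.
unshare --pid --fork --mount --uts --ipc --net \
    chroot "$ROOTFS" /bin/sh -c 'mount -t proc proc /proc && exec /bin/sh'
```

Real runtimes use pivot_root rather than chroot, because a bare chroot can be escaped by a privileged process, but the shape of the flow is the same.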

No kernel was booted. No VM was started. Just a very carefully constructed process.

Why this matters in practice

  • A bug in the host kernel is a bug in your container. Containers share kernels; VMs don't.
  • A missing user namespace means root inside the container is root on the host if it escapes. Prefer rootless runtimes (Podman, rootless Docker) or explicit user-namespace remapping.
  • CPU and memory limits are a single echo into cgroup files away — you don't need a whole orchestrator to use them. Systemd uses cgroups for every service it starts.
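
As an illustration of that last point, systemd-run can put a one-off command under cgroup limits with no orchestrator at all; a sketch assuming a systemd host (MemoryMax and CPUQuota are standard systemd resource-control properties; a system-level scope needs root):

```shell
# Run a command in a transient scope capped at 512 MiB of memory and
# two cores' worth of CPU (CPUQuota=200% becomes the cpu.max quota).
systemd-run --scope -p MemoryMax=512M -p CPUQuota=200% sleep 30
```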

Containers ≠ VMs — one more time

                    VM                       Container
Kernel              Own guest kernel         Shares the host kernel
Boot time           Seconds to minutes       Milliseconds
Memory overhead     Hundreds of MB per VM    Process overhead only
Isolation boundary  Hardware virtualization  Kernel namespaces + cgroups
Compromised root    Contained by the VM      Contained only by the kernel's attack surface

VMs are stronger isolation. Containers are lighter. You can nest them: modern sandboxed runtimes (gVisor, Kata) run each container inside a thin VM for the best of both.