Storage & Volumes

PVCs, CSI drivers, StatefulSets — the data lives outside the pod.

The fundamental rule: a container's filesystem dies with the container. Anything you write inside a container is gone the moment the kubelet restarts it. State has to live somewhere else, on a storage object that outlives the pod.

Analogy

A pod is a hotel room. You can sleep there, work there, eat there, but the housekeeping crew comes between guests and resets it. If you want anything to persist, you store it in the safe in the wall — that's the volume. The safe stays bolted to the building even when the guest changes; the room itself gets re-keyed every time someone new checks in. Everyone learns this the first time they kubectl delete pod and watch their database disappear.

The hierarchy: Volume → PV → PVC

Three layers, often confused:

  • Volume — the abstract handle inside a Pod spec. "Here's a directory with some bytes."
  • PersistentVolume (PV) — a cluster-level storage resource: "this 100 GB EBS volume exists at this AWS ID."
  • PersistentVolumeClaim (PVC) — a user's request for storage: "I need 50 GB, RWO, fast SSD." Kubernetes binds it to an available PV.

Pods reference PVCs, not PVs directly. PVCs decouple your manifest from the underlying storage system: your manifest says "I need 50 GB", and the provisioner named by the cluster's StorageClass creates the actual EBS volume / Ceph RBD / Azure Disk.
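
As a minimal sketch of that indirection, the manifest below declares a claim and a pod that mounts it. The names app-data and app, the busybox image, and the gp3 StorageClass are illustrative assumptions; the example presumes a StorageClass of that name already exists in the cluster.

```yaml
# Hypothetical names: app-data, app. Assumes a StorageClass called gp3 exists.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.36
      # Append a timestamp on every start so persistence is easy to observe.
      command: ["sh", "-c", "date >> /data/started.log && sleep 3600"]
      volumeMounts:
        - name: data            # must match an entry under spec.volumes
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data     # the pod names the PVC, never the PV
```

Deleting and recreating the pod should leave /data/started.log with one line per start, because the claim and its volume outlive the pod.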

Access modes

The four access modes, named with the awkward Kubernetes shortcuts:

Mode  Meaning                                                  Typical backing
RWO   ReadWriteOnce: one node mounts at a time                 EBS, Azure Disk, GCE PD
ROX   ReadOnlyMany: many nodes mount, read-only                EFS in read-only mode
RWX   ReadWriteMany: many nodes mount, all read+write          EFS, NFS, CephFS
RWOP  ReadWriteOncePod: exactly one pod (alpha 1.22, GA 1.29)  Same as RWO but stricter

Block storage (EBS, Azure Disk, GCP PD) is fast but RWO. Network file systems (EFS, NFS, CephFS) are shareable but slower. Pick by access pattern: a Postgres replica needs RWO; a shared upload directory between five web pods needs RWX.
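
The shared-upload case might look like the sketch below: one RWX claim mounted by all five web replicas. The names uploads and web, the nginx image, and the efs-sc StorageClass are assumptions for illustration; any StorageClass backed by a shared file system would play the same role.

```yaml
# Hypothetical names: uploads, web. efs-sc is an assumed RWX-capable StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: nginx:1.27
          volumeMounts:
            - name: uploads
              mountPath: /var/www/uploads
      volumes:
        - name: uploads
          persistentVolumeClaim:
            claimName: uploads    # every replica mounts the same claim
```

With an RWO claim in place of this one, replicas scheduled onto a second node would be unable to attach the volume.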

StatefulSet vs Deployment

The mistake everyone makes once: running a database as a Deployment.

  • Deployment: stateless replicas. Any pod can serve any request. Names carry random suffixes (web-7d4b6f-x9k2p). Rollouts replace pods in batches with no ordering guarantee.
  • StatefulSet: stateful replicas with identity. Pod names are predictable (postgres-0, postgres-1). Each pod has its own PVC. Roll-out is serial. Termination is reverse-serial.

If a workload needs:

  • Stable hostname (postgres-0.postgres.default.svc.cluster.local), or
  • Per-pod persistent storage (its own EBS volume), or
  • Ordered startup (must come up after primary), or
  • Ordered shutdown,

it's a StatefulSet. Otherwise it's a Deployment.

Per-pod storage with volumeClaimTemplates

The killer StatefulSet feature: every pod automatically gets its own PVC.

volumeClaimTemplates:
- metadata: { name: data }
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: gp3
    resources:
      requests: { storage: 100Gi }

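Three replicas → three PVCs (data-postgres-0, data-postgres-1, data-postgres-2). Pod restart? Same name, same PVC, same data. Pod rescheduled to another node? The storage follows, as long as the new node can reach the volume (for zonal disks such as EBS, that means a node in the same AZ). Scaling down or deleting the StatefulSet leaves these PVCs in place by default.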
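
For context, here is a sketch of where that template sits in a full StatefulSet, together with the headless Service that gives each pod its stable DNS name. The image tag, the postgres-auth Secret, and the PGDATA subdirectory are assumptions made for the example. Note that this starts three independent Postgres instances; replication between them is not configured here.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None               # headless: one DNS record per pod
  selector: { app: postgres }
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres         # must name the headless Service above
  replicas: 3
  selector:
    matchLabels: { app: postgres }
  template:
    metadata:
      labels: { app: postgres }
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef: { name: postgres-auth, key: password }  # hypothetical Secret
            - name: PGDATA      # keep data in a subdirectory of the mount
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data        # matches the volumeClaimTemplates entry
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests: { storage: 100Gi }
```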

CSI — the plugin interface

Originally, every storage backend was an in-tree plugin compiled into Kubernetes itself (the kubelet and the controller manager). Then CSI (Container Storage Interface) standardised it: any vendor implements a CSI driver, and Kubernetes consumes it without modification.

Notable CSI drivers:

  • aws-ebs-csi-driver (provisioner ebs.csi.aws.com): EBS-backed PVCs.
  • aws-efs-csi-driver (provisioner efs.csi.aws.com): EFS-backed PVCs (RWX!).
  • azuredisk.csi.azure.com, azurefile.csi.azure.com: Azure equivalents.
  • pd.csi.storage.gke.io: GCP Persistent Disks.
  • rook-ceph: self-managed distributed storage.
  • OpenEBS (e.g. zfs.csi.openebs.io, local.csi.openebs.io): niche but used in production.

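A typical driver ships as a controller Deployment plus a node DaemonSet (one node-plugin pod per node). You, or the driver's install chart, create a StorageClass whose provisioner field names the driver. Your PVC references the StorageClass. The driver handles the rest.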

Storage classes

A StorageClass is "this is how to provision storage of this kind". On AWS:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: gp3 }
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete

Two flags worth understanding:

  • volumeBindingMode: WaitForFirstConsumer delays binding and provisioning until a pod that uses the PVC is scheduled, so the EBS volume is created in the AZ of the node the scheduler picked. With the default, Immediate, the volume can land in an AZ where the pod cannot run, and because EBS is zonal the pod then fails to schedule.
  • reclaimPolicy: Retain vs Delete decides what happens to the underlying volume when the PVC is deleted. Retain keeps it around (manual cleanup); Delete destroys the EBS volume. Production data → Retain.
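
A Retain variant for production data could be sketched as a second class next to the default one. The name gp3-retain is hypothetical; the provisioner and the type parameter are taken from the example above.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain              # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain           # deleting the PVC leaves the EBS volume in place
```

The class's reclaimPolicy is copied onto each PV when it is provisioned, so switching classes does not change volumes that already exist; those carry their own persistentVolumeReclaimPolicy field on the PV object.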

Ephemeral volumes worth knowing

Not all volumes need to persist:

  • emptyDir: temporary scratch space for the pod's lifetime. Survives container restarts, gone on pod deletion. Node-local, so there is no attach step and no network hop; safe for caches.
  • configMap / secret: read-only mounts of cluster-stored config / secrets.
  • projected: mount multiple sources (configMap + secret + downwardAPI) into one directory.
  • hostPath: mount a path from the node. Use sparingly; ties pod to node.
  • csi (inline ephemeral): driver-managed, declared in the pod spec and tied to pod lifetime.

The one you will reach for most often is emptyDir:

volumes:
- name: scratch
  emptyDir:
    sizeLimit: 5Gi
    medium: Memory   # tmpfs: RAM-backed and fast, but usage counts against the container's memory limit
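
The fragment above only declares the volume. A sketch of a complete pod that mounts it follows; the pod name, image, and sizes are assumptions picked for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo            # hypothetical name
spec:
  containers:
    - name: worker
      image: busybox:1.36
      # Write 64 MiB into the tmpfs-backed directory, then idle.
      command: ["sh", "-c", "dd if=/dev/zero of=/scratch/blob bs=1M count=64 && sleep 3600"]
      resources:
        limits:
          memory: 256Mi         # tmpfs writes count toward this limit
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory
        sizeLimit: 128Mi        # keep this below the memory limit
```

Dropping the medium line switches the volume to node-local disk, which is usually the safer default for anything larger than a small cache.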

EBS-backed PVCs vs EFS / NFS

Two big classes, with different tradeoffs:

Property     EBS / GCP PD / Azure Disk                       EFS / NFS / CephFS
Access mode  RWO (typically)                                 RWX
Latency      Sub-ms to low single-digit ms                   Single-digit ms
Throughput   Up to 4 GB/s (io2 Block Express; gp3 is lower)  Lower per-mount; scales horizontally
Cost         Cheap per-GB                                    More expensive per-GB
Multi-AZ     No (zonal)                                      Yes
Use cases    Databases, single-writer apps                   Shared uploads, ML datasets

If you have to pick one default for a generic Kubernetes cluster, EBS-class block storage with RWO + StatefulSet is the right starting point. Add EFS only when you genuinely have a shared-write requirement.

Worth knowing about

  • VolumeSnapshots — point-in-time snapshots of a PVC, exposed as a Kubernetes object. Driver-dependent.
  • Volume expansion — many CSI drivers can grow a PV in place (allowVolumeExpansion: true). Edit the PVC's resources.requests.storage and the driver resizes.
  • Velero — cluster backup/restore. PVs included. The right answer for "I need DR for my Kubernetes state."
  • Local Persistent Volumes — for high-IOPS workloads that can tolerate node-stickiness (e.g. Cassandra, Kafka). Trade portability for performance.
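
To make the snapshot item concrete, the sketch below snapshots one of the StatefulSet claims and restores it into a new PVC. It assumes the snapshot CRDs and snapshot controller are installed and that the CSI driver supports snapshots; ebs-snapclass is a hypothetical VolumeSnapshotClass, and the object names are illustrative.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-postgres-0-snap
spec:
  volumeSnapshotClassName: ebs-snapclass    # hypothetical VolumeSnapshotClass
  source:
    persistentVolumeClaimName: data-postgres-0
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0-restored
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  dataSource:                   # pre-populate the new volume from the snapshot
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: data-postgres-0-snap
  resources:
    requests:
      storage: 100Gi            # at least the size of the snapshotted volume
```

A snapshot of a running database is only crash-consistent unless the application is quiesced first, so treat this as a building block rather than a complete backup strategy.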

A diagnostic loop

When a pod can't mount its volume:

  1. kubectl get pvc: is the PVC Bound? If it is Pending, either provisioning has failed or, with WaitForFirstConsumer, no pod has consumed the claim yet.
  2. kubectl describe pvc <name>: the events explain why provisioning failed.
  3. kubectl describe pod <name>: at the bottom, FailedAttachVolume and Multi-Attach error are common; the volume is still attached to a previous node.
  4. kubectl get pv, then kubectl describe pv <name>: is the PV in the right zone? The node affinity shows it. EBS volumes are zonal; if the only schedulable nodes are in us-east-1b and the PV is in us-east-1a, the pod cannot use it and the scheduler reports a volume node affinity conflict.
  5. CSI driver logs: kubectl logs -n kube-system -l app=ebs-csi-controller shows the driver-level error.

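The most common production bug is "Multi-Attach error": a pod gets rescheduled while the previous one is still in Terminating, so the volume is still attached to the old node. Wait for the previous pod to terminate fully so the volume can detach. Note that terminationGracePeriodSeconds already defaults to 30, so setting it to 30 changes nothing; a lower value frees the volume sooner at the cost of a less graceful shutdown. For a single-replica Deployment with an RWO volume, the Recreate strategy keeps old and new pods from overlapping.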
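
For a single-replica Deployment that mounts an RWO claim, one way to sidestep the race is the Recreate strategy, sketched below. The names reports and reports-data and the busybox image are hypothetical; the claim is assumed to exist already.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reports                 # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate              # terminate the old pod before creating the new one
  selector:
    matchLabels: { app: reports }
  template:
    metadata:
      labels: { app: reports }
    spec:
      containers:
        - name: reports
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: reports-data   # hypothetical RWO claim
```

The trade-off is a short outage on every rollout, since there is a window with no running pod. The default RollingUpdate strategy avoids that window but is what lets the new pod start on another node while the old one still holds the volume.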

Tools in the wild

  • EBS-backed PVCs for Kubernetes: RWO block storage. (library)
  • EFS-backed PVCs: RWX, multi-AZ NFS shares. (library)
  • Rook + Ceph (free tier): self-managed distributed storage, RWO + RWX inside the cluster. (library)
  • Velero (free tier): backup + restore for Kubernetes resources and PV snapshots. (service)
  • OpenEBS (free tier): Local-PV and replicated CAS for self-hosted Kubernetes. (library)