Resources & Scheduling
requests vs limits, QoS, taints, and node selection.
Kubernetes places pods onto nodes. The scheduler watches the API for unscheduled pods, filters the nodes to those that satisfy all hard constraints, scores the survivors, and binds the pod to the highest-scoring node. Resources, selectors, taints, and tolerations are how you tell it what "feasible" means.
Analogy
Think of seating guests at a wedding. The planner has a floor plan full of tables, each with a seat count and a few hard rules (no peanut allergies near the dessert bar, the grandparents need the quiet corner). First they filter — cross off every table that breaks a rule for this guest — then they score the remaining tables on soft preferences ("put them near people they'll enjoy"), and finally they pencil in the best seat. Taints are "this table is reserved for the bridal party", and tolerations are the wristband that lets a specific guest sit there anyway.
requests vs limits
Every container can declare two numbers per resource (CPU, memory):
```yaml
resources:
  requests: { cpu: "500m", memory: "256Mi" }
  limits: { cpu: "1", memory: "512Mi" }
```
- Requests = the reservation. The scheduler subtracts requests from node capacity to decide if a pod fits. If requests exceed what's available on every node, the pod sits Pending.
- Limits = the hard cap. The kubelet configures cgroups so the container cannot exceed them at runtime. CPU over-limit is throttled; memory over-limit is OOM-killed inside the container.
500m = 500 millicores = half of one logical CPU. 256Mi = 256 mebibytes (power of 2). Use Mi/Gi, not M/G, unless you enjoy off-by-4.8%.
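A minimal Pod manifest wiring these fields together (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web               # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.27   # any image works; nginx is just an example
      resources:
        requests:
          cpu: "500m"     # scheduler reserves half a core
          memory: "256Mi" # counted against node allocatable
        limits:
          cpu: "1"        # throttled above one core
          memory: "512Mi" # OOM-killed above 512Mi
```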
QoS classes
Kubernetes assigns each pod to one of three classes based on requests and limits:
| Class | Rule |
|---|---|
| Guaranteed | Every container has requests == limits for both CPU and memory. |
| BestEffort | No requests or limits set on any container. |
| Burstable | Anything else. |
Under node memory pressure, the kubelet evicts in order: BestEffort → Burstable → Guaranteed. Guaranteed pods also get tighter cgroup isolation. For anything that matters, set requests == limits on memory at least; note this alone keeps the pod Burstable (Guaranteed requires every container to match on both CPU and memory), but it removes the memory overcommit that most often causes pain.
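For comparison, a container spec that lands in Guaranteed (names are illustrative); you can confirm the assigned class with `kubectl get pod cache -o jsonpath='{.status.qosClass}'`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache        # illustrative name
spec:
  containers:
    - name: redis
      image: redis:7 # example image
      resources:
        requests: { cpu: "250m", memory: "512Mi" }
        limits:   { cpu: "250m", memory: "512Mi" }  # equal on both resources => Guaranteed
```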
nodeSelector
The simplest way to pin a pod to a class of node is nodeSelector — a strict equality match on node labels.
```yaml
spec:
  nodeSelector:
    tier: gpu
    zone: us-east
```
Nodes are labelled by the cloud provider (topology.kubernetes.io/zone, kubernetes.io/arch) and by you (kubectl label nodes node-a tier=gpu). nodeAffinity is the richer, expression-based cousin: it supports operators like In and NotIn, plus soft preferences.
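The nodeSelector above can be rewritten as nodeAffinity with one hard rule and one soft preference; a sketch (label values and the zone name are illustrative):

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard rule: filtered out entirely if no term matches.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tier
                operator: In
                values: ["gpu"]
      # Soft preference: adds to the node's score but never disqualifies it.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]   # illustrative zone
```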
Taints and tolerations
Taints are the inverse of selectors. A taint on a node repels pods unless they explicitly tolerate it. Common use: reserve a node pool for a class of work.
```
kubectl taint nodes node-a nvidia.com/gpu=true:NoSchedule
```
Any pod that does not tolerate nvidia.com/gpu will not be scheduled on node-a. A GPU pod opts in:
```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists   # tolerate the taint regardless of its value
    effect: NoSchedule
```
Three effects:
- NoSchedule — don't place new pods unless tolerated.
- PreferNoSchedule — soft hint; the scheduler tries to avoid the node but won't refuse.
- NoExecute — evict existing pods that don't tolerate it.
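A NoExecute toleration can also bound how long a pod rides out a taint via tolerationSeconds — this is the mechanism behind the default grace period Kubernetes gives pods on nodes that go not-ready. A sketch:

```yaml
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # evicted 5 minutes after the taint appears
```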
Putting it together — the scheduling decision
For each node, the scheduler asks, in order:
- Does the node have enough allocatable CPU and memory for this pod's requests?
- Do the node's labels match the pod's nodeSelector/nodeAffinity?
- Does the pod tolerate every NoSchedule taint on the node?
- Are any other plugins (pod anti-affinity, topology spread) saying no?
If every node fails, the pod stays Pending with a reason on its events. If multiple nodes pass, the score phase picks the best (spreading pods, preferring less-loaded nodes).
```
kubectl get pod trainer -o wide
kubectl describe pod trainer   # events: FailedScheduling or Scheduled
kubectl describe node node-a   # allocatable, taints, labels, used
```
Practical rules
- Always set memory requests. Don't set memory limits > requests unless you've thought about it — OOM kills under bursty load are a common surprise.
- Treat CPU requests as a floor, CPU limits as a ceiling. Some sites skip CPU limits entirely because throttling is worse than over-subscription; others require them.
- Use taints for node pools (GPU, spot, dedicated team) rather than baking them into every pod spec as selectors.
- PriorityClasses are the lever above QoS — critical system pods (kube-dns, metrics-server) run at high priority so they get scheduled and evict lower-priority pods if the node fills up.
- Cluster autoscaler watches for unschedulable pods and adds nodes. It only works if your pods actually declare requests — otherwise it can't tell whether a new node would help.
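A custom PriorityClass and a pod referencing it, as a sketch (names and the value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low            # illustrative name
value: 1000                  # higher value = higher priority
globalDefault: false
description: "Preemptible batch work."
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  priorityClassName: batch-low   # pods without one get priority 0 (or the global default)
  containers:
    - name: main
      image: python:3.12         # example image
      resources:
        requests: { cpu: "2", memory: "4Gi" }
```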