Serverless & Cold Starts
Why P99 looks nothing like P50 on Lambda.
A "serverless" function still runs on a server — the platform manages scaling it to zero and back. The price of scale-to-zero is the cold start: the one-time latency of spinning up a fresh execution environment before it can serve its first request. Understanding the shape of that latency is the difference between a pleasant P99 and a pager.
Analogy
Think of a motion-sensor porch light versus one left on all night. The porch light that sleeps is free to run — no electricity burning while you're asleep — but the first person up the drive waits a beat while the bulb warms up and the sensor blinks. If a second visitor arrives ten seconds later, the light is already on and responds instantly ("warm"). Leave the bulb off for an hour and the next arrival pays the warm-up again. Provisioned concurrency is just leaving the light permanently on for the front door everyone uses.
What actually happens on a cold start
1. Container provision: the platform allocates a sandboxed environment (a Firecracker microVM for Lambda, a gVisor sandbox for Cloud Run). Milliseconds, but real.
2. Runtime init: the language runtime boots (V8 for Node, CPython imports, JVM classloading, CLR JIT).
3. Code init: your module-scope code runs (SDK clients constructed, configs parsed, DB connections opened).
4. Handler invoked: the first request executes.
Only step 4 repeats on subsequent "warm" invocations. Steps 1–3 run once per container and that's what you see in the P99 tail.
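A minimal sketch of how to observe the split yourself (Node.js; the variable names are illustrative): anything at module scope runs once per container, so a module-level flag tells you whether a given invocation paid the init cost.

```js
// Module scope: runs once per container (steps 1-3 above).
const initializedAt = Date.now();
let isColdStart = true;

export const handler = async () => {
  // Handler scope: runs on every invocation (step 4).
  const coldStart = isColdStart;
  isColdStart = false;

  console.log({ coldStart, containerAgeMs: Date.now() - initializedAt });
  return { statusCode: 200 };
};
```

Lambda also reports this directly: cold invocations include an Init Duration field on the REPORT log line in CloudWatch.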
Runtime choice matters a lot
Typical Lambda cold starts on x86, 512 MB, outside a VPC:
| Runtime | Cold start |
|---|---|
| Node.js 20 | 150–400 ms |
| Python 3.12 | 150–500 ms |
| Go (provided.al2023) | 80–250 ms |
| .NET 8 | 600–1200 ms |
| Java 17 (Corretto) | 800–2500 ms |
The JVM cold-start tax is real. Lambda SnapStart (Java, and now .NET and Python) and GraalVM native-image trim it significantly, but both add deploy complexity.
Memory is CPU
Lambda allocates CPU in proportion to memory: at 1769 MB you get one full vCPU; at 128 MB you get a small fraction of one. Raising memory often makes functions both faster and cheaper, because total GB-seconds drop even though the per-second price goes up. Always benchmark before committing to 128 MB.
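A worked example with hypothetical numbers: a handler that runs 400 ms at 512 MB consumes 0.4 s × 0.5 GB = 0.2 GB-seconds per invocation. If doubling memory to 1024 MB cuts the duration to 180 ms (plausible when the work is CPU-bound), the same invocation consumes 0.18 s × 1 GB = 0.18 GB-seconds: slightly cheaper, and more than twice as fast.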
The VPC trap
Attaching a Lambda to a VPC (to reach RDS, ElastiCache, or a private API) used to add seconds of cold-start latency because each container attached its own Elastic Network Interface. AWS shipped Hyperplane ENIs, which pool pre-created interfaces and share them across containers. That dropped the VPC overhead to roughly 100 ms for most accounts, but it still exists, and benchmarks in older blog posts still quote the pre-Hyperplane numbers.
Eliminating cold starts
Provisioned Concurrency (Lambda), minimum instances (Cloud Run), and always-ready instances (Azure Functions Premium): the platform keeps N execution environments warm and bills you for keeping them alive regardless of traffic. This eliminates cold starts up to that concurrency ceiling; above it, traffic spills over to on-demand (cold) environments.
The math: provisioned concurrency is cheaper than on-demand per-invocation once utilization crosses ~60%. Below that, on-demand wins. Track actual utilization before committing.
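Where the ~60% comes from, using us-east-1 list prices at the time of writing (roughly $0.0000166667 per GB-second on-demand, versus about $0.0000041667 per GB-second to keep an environment provisioned plus about $0.0000097222 per GB-second of actual duration): at utilization u, provisioned wins when 0.0000041667 + 0.0000097222·u < 0.0000166667·u, which solves to u > ~0.60. Re-run the arithmetic with your region's current prices.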
SnapStart (Lambda) for Java 11/17/21, .NET, and Python: the platform snapshots the initialized execution environment when you publish a version and restores new containers from that snapshot, so most of the runtime and code init cost moves to deploy time.
Writing init code well
- Construct SDK clients outside the handler function.
- Preload caches and warm up DB connections at module scope.
- Lazy-load anything optional so cold starts stay small for the common path (see the second sketch below).
- For Java, use the AWS CRT HTTP client (`aws-crt-java`) over the default HTTP client: it is significantly faster to initialize.
```js
import { S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({}); // runs once per container

export const handler = async () => {
  // handler runs on every invocation; s3 is reused
};
```
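And a minimal sketch of the lazy-loading point, assuming a hypothetical report-generation path that only a minority of invocations hit (the module name and event field are made up): defer the heavy import so the common path never pays for it.

```js
import { S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({}); // cheap and needed on every path: init eagerly

export const handler = async (event) => {
  if (event.generateReport) {
    // Heavy, optional dependency: loaded only on the invocations that need it,
    // so it never shows up in the common path's cold start.
    const { buildReport } = await import("./report.js"); // hypothetical module
    return buildReport(s3, event);
  }
  return { statusCode: 200 };
};
```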
Tail latency math
If 1% of your invocations are cold starts at 500 ms and warm invocations are 40 ms:
- P50 = 40 ms (all warm).
- P99 = ~500 ms (the edge of the cold-start tail).
- P99.9 = whatever your worst cold start is, plus a bit.
You cannot fix P99 by making warm invocations faster. You fix it by reducing the number of cold starts (Provisioned Concurrency), making them cheaper (lighter runtime, smaller init, SnapStart), or hiding them (async queues, retries).
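A small illustration of that claim, using the numbers above (a sketch, not a benchmark): halving warm latency leaves P99 where it was, while halving the cold-start rate pulls P99 back down to the warm path.

```js
// Percentile of a synthetic latency mix: warm invocations at warmMs,
// a coldRate fraction of cold starts at coldMs. All numbers are made up.
const percentile = (sorted, p) => sorted[Math.floor(p * sorted.length)];

const mix = (warmMs, coldMs, coldRate, n = 100_000) => {
  const coldCount = Math.round(n * coldRate);
  const samples = [
    ...Array(n - coldCount).fill(warmMs),
    ...Array(coldCount).fill(coldMs),
  ].sort((a, b) => a - b);
  return { p50: percentile(samples, 0.5), p99: percentile(samples, 0.99) };
};

console.log(mix(40, 500, 0.01));  // { p50: 40, p99: 500 }  baseline: P99 sits on the cold tail
console.log(mix(20, 500, 0.01));  // { p50: 20, p99: 500 }  faster warm path: P99 unchanged
console.log(mix(40, 500, 0.005)); // { p50: 40, p99: 40 }   fewer cold starts: P99 fixed
```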
When serverless isn't the right shape
Serverless is wrong when:
- Traffic is steady and high — a right-sized container/VM is cheaper per request.
- You need sustained long-running compute — per-invocation billing dominates.
- Startup cost per request is unavoidable (huge ML model load).
- You need fine-grained kernel features, raw UDP, or unusual filesystem semantics.
Serverless wins when traffic is spiky, workloads are event-driven, or you want to ship something without owning a cluster. Pick it for the shape of the workload, not because "serverless" sounds modern.