Service Mesh

Sidecars, mTLS, traffic shaping — and when you don't need a mesh.

A service mesh adds a uniform layer between every pair of services — for security, observability, and traffic control — without each service knowing anything about it. It's powerful, expensive, and easy to over-adopt.

Analogy

Imagine every desk in an office has a small concierge sitting next to it. Anything you send to another desk goes through your concierge first; anything coming to you arrives via theirs. The concierges all know each other (encrypt every handover, log every interaction, refuse a delivery if their boss says so), and they take instructions from a central manager. The desk worker barely changes their behaviour — the concierge handles the security and bureaucracy — but you've added one body per desk and a manager office. For a small team it's overhead. For a building with 200 employees from 12 departments it's the difference between order and chaos.

What a mesh actually does

Three jobs, in roughly this order of value:

  1. Service-to-service security: mutual TLS between every pair of pods, with auto-rotated short-lived certs. Your apps speak HTTP; the sidecars terminate and initiate TLS.
  2. Distributed tracing: each sidecar emits a span for every hop, and the spans are stitched into end-to-end traces. No tracing SDK is required, but each app must still copy the incoming trace headers onto its outgoing requests, or the trace breaks at that service.
  3. Traffic shaping: canary releases, percentage-based shifts, retries, timeouts, and circuit breakers, all configured at the mesh layer rather than in each service.
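
The header propagation that tracing depends on can be sketched in a few lines. This is a minimal illustration, not any mesh's official helper: the function name propagate_trace_headers is hypothetical, and the header list is an assumption covering the common W3C Trace Context and B3 conventions. Check which headers your tracing backend actually expects.

```python
# Headers commonly forwarded so that sidecar-emitted spans join into one trace.
# Assumption: the exact set depends on the tracing backend in use.
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "b3",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(inbound_headers):
    """Return the trace headers to copy from an inbound request onto outbound calls.

    Header names are matched case-insensitively and returned in lower case.
    """
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "Host": "orders.internal",  # hypothetical service name
        "X-Request-Id": "abc-123",
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    }
    # Merge the result into the headers of every outgoing request made
    # while handling this inbound request.
    print(propagate_trace_headers(inbound))
```

An app that skips this step still gets per-hop spans from its sidecar, but they show up as separate, disconnected traces.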

Bonus features: per-pod authorisation policies, fault injection (test "what if Service X is slow?" in production), multi-cluster meshing, ingress + east-west unification.

The sidecar pattern

The classic mesh model: every application pod gets a second container — a small proxy (Envoy for Istio, linkerd-proxy for Linkerd) — sharing its network namespace.

┌──────────── Pod ────────────┐
│ ┌────────┐    ┌──────────┐  │
│ │  app   │───▶│  proxy   │──┼──▶ outbound (mTLS)
│ │ :8080  │◀───│ :15001   │◀─┼─── inbound  (mTLS)
│ └────────┘    └──────────┘  │
└─────────────────────────────┘

Your app talks plain HTTP to localhost; the proxy handles encryption, retries, traffic shaping, and metrics. The kernel's iptables rules redirect all of the pod's traffic through the proxy automatically.

Cost: every pod gains one extra container, 30–100 MB of memory, and a small amount of CPU per request. Across 1000 pods, at roughly 50 millicores each, that's tens of cores and 30–100 GB of memory spent on sidecars alone. This cost is what drove the sidecar-less push.
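
To make the arithmetic concrete, here is a small back-of-the-envelope calculator. It is a sketch: the function name sidecar_overhead is hypothetical, and the 50-millicore CPU figure is the baseline quoted in the pitfalls section rather than a measurement. Real overhead depends on request volume and mesh configuration.

```python
def sidecar_overhead(pods, cpu_millicores_per_pod, memory_mb_per_pod):
    """Return (total CPU cores, total memory in decimal GB) used by sidecars."""
    total_cores = pods * cpu_millicores_per_pod / 1000
    total_gb = pods * memory_mb_per_pod / 1000
    return total_cores, total_gb


if __name__ == "__main__":
    # Low and high ends of the 30-100 MB per-pod memory range, 1000 pods.
    for memory_mb in (30, 100):
        cores, gb = sidecar_overhead(1000, 50, memory_mb)
        print(f"{memory_mb} MB/pod: {cores:.0f} cores, {gb:.0f} GB")
```

Plugging in your own pod count and measured per-pod figures is usually the quickest way to decide whether injection scope needs trimming.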

Sidecar-less / ambient meshes

Rather than running a proxy per pod, ambient mesh moves the L4 enforcement into a node-level proxy (or eBPF program) and only deploys L7 proxies for workloads that need them.

  • Istio Ambient Mode — splits the mesh into ztunnel (per-node, secure transport) and waypoint proxies (per-namespace, optional, L7).
  • Cilium Service Mesh — eBPF-native L4 + L7 enforcement at the node level.
  • Linkerd 2.x — sticks with sidecars but with a tiny Rust micro-proxy.

Trade-offs: lower overhead, simpler upgrades, but at the cost of sidecar-only features (rich per-workload config, fine-grained traffic-splitting). The ambient direction is where the industry is going.

The contenders

Mesh           | Data plane           | Control plane         | Heaviness | Sweet spot
Istio          | Envoy                | istiod                | heavy     | Enterprise, many teams, complex policy
Linkerd        | linkerd-proxy (Rust) | linkerd-control-plane | light     | Mid-size org, "give me mTLS + observability"
Consul Connect | Envoy                | Consul servers        | medium    | Multi-runtime (VMs + K8s), HashiCorp shop
Cilium SM      | Cilium agent (eBPF)  | k8s + cilium          | light     | When you already run Cilium for CNI
AWS App Mesh   | Envoy                | App Mesh API          | medium    | AWS-only, ECS + EKS

The decision usually comes down to "how much config power do you actually need?" Istio is the maximalist choice; Linkerd is the "I want it to work without a dedicated platform team" choice.

When you need a mesh

You probably do, when:

  • You have many services from many teams and "mTLS everywhere" is a hard requirement.
  • You need canary deployments with fine traffic control without bespoke tooling.
  • You want end-to-end distributed tracing without adding a tracing SDK to every service (services still need to forward trace headers).
  • You enforce zero-trust networking — explicit allow-rules between services.

You probably don't, when:

  • You have fewer than ~10 services and one team. A mesh adds operational burden you'll resent.
  • Your services are internal-only behind an internal LB and TLS isn't really a concern day-to-day.
  • You haven't yet stabilised your deployment, observability, and on-call story. A mesh adds a layer; if your foundations are shaky, you'll spend the year debugging the mesh, not the apps.
  • You're a startup still finding product-market fit. There are higher-leverage things to do.

A useful heuristic: a mesh starts paying back somewhere around 30–50 services, multi-team. Below that, simpler tools (Network Policies, Ingress with TLS, OpenTelemetry SDK in each service) cover most of the same ground for less.

What you actually configure

For a typical Istio deployment:

  • Gateway — the ingress, replaces or augments your nginx/ALB.
  • VirtualService — request routing rules: "send /api/v2/* to the canary".
  • DestinationRule — what subsets of a service exist (v1/v2/v3) and how to balance.
  • PeerAuthentication — mTLS modes (STRICT, PERMISSIVE).
  • AuthorizationPolicy — explicit allow rules between workloads.
  • Telemetry — what to emit, where to send it.
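
As an illustration of how VirtualService and DestinationRule fit together, the sketch below builds a weight-based canary as Python dictionaries and prints them as JSON, which kubectl apply -f accepts alongside YAML. The service name reviews, the v1/v2 subset names, the version pod label, and the helper function names are all assumptions for the example. The apiVersion shown is one that recent Istio releases serve; check what your installed version supports.

```python
import json

ISTIO_NETWORKING_API = "networking.istio.io/v1beta1"  # assumption: adjust per Istio version


def destination_rule(host, subsets):
    """Declare which subsets of a service exist, keyed on a 'version' pod label."""
    return {
        "apiVersion": ISTIO_NETWORKING_API,
        "kind": "DestinationRule",
        "metadata": {"name": host},
        "spec": {
            "host": host,
            "subsets": [{"name": s, "labels": {"version": s}} for s in subsets],
        },
    }


def canary_virtual_service(host, stable, canary, canary_percent):
    """Route canary_percent of requests to the canary subset, the rest to stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": ISTIO_NETWORKING_API,
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": [
                        {
                            "destination": {"host": host, "subset": stable},
                            "weight": 100 - canary_percent,
                        },
                        {
                            "destination": {"host": host, "subset": canary},
                            "weight": canary_percent,
                        },
                    ]
                }
            ],
        },
    }


if __name__ == "__main__":
    manifests = [
        destination_rule("reviews", ["v1", "v2"]),
        canary_virtual_service("reviews", "v1", "v2", canary_percent=10),
    ]
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

Shifting more traffic to the canary is then a matter of regenerating the VirtualService with a higher percentage; the DestinationRule stays the same.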

For Linkerd:

  • A namespace annotation enables injection.
  • ServiceProfile for retries / timeouts / per-route metrics.
  • TrafficSplit (SMI) for canaries.
  • Server / ServerAuthorization / HTTPRoute for policy.
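
For comparison, opting a namespace into Linkerd can be as small as the manifest below, again built as a Python dictionary and printed as JSON. The namespace name shop and the function name are hypothetical; linkerd.io/inject: enabled is the annotation that turns on proxy injection.

```python
import json


def linkerd_injected_namespace(name):
    """Build a Namespace manifest whose pods get the Linkerd proxy injected."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "annotations": {"linkerd.io/inject": "enabled"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(linkerd_injected_namespace("shop"), indent=2))
```

Injection happens when a pod is created, so workloads already running in the namespace need a restart before they pick up the proxy.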

Notice: Istio has more knobs. Linkerd has fewer. Which one you want depends on your pain.

Common pitfalls

The mesh ate my CPU. Sidecars cost roughly 50 millicores of CPU and 50 MB of memory per pod as a baseline. Across 1000 pods that's about 50 cores and 50 GB before any traffic flows. Audit injection scope; not every namespace needs to be in the mesh.

TLS double-encryption. Your app speaks TLS to the sidecar, which speaks mTLS to the next hop. Check origination — usually you want apps to speak HTTP and let the sidecar handle TLS.

Headless services + mesh. Headless Services (typical for StatefulSets) bypass the sidecar's load balancing because they expose pod IPs directly instead of a single cluster IP. Confirm what the proxy actually sees with istioctl proxy-config endpoints.

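kubectl exec and probes. Liveness/readiness probes go through the sidecar. If the sidecar isn't ready, or strict mTLS rejects the kubelet's plain-HTTP probe, probes fail and the container restarts in a loop. Most meshes ship mitigations: Istio can rewrite HTTP probes so they are routed via the sidecar agent, and its holdApplicationUntilProxyStarts setting addresses the related startup race by delaying the app container until the proxy is ready.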

Upgrading the control plane. Sidecar versions and control-plane versions must be compatible. Plan upgrades; never have a quarter-old data plane talking to a fresh control plane.

A safe adoption path

If you decide you need a mesh, the lowest-risk rollout:

  1. Pick one mesh and commit. Don't run two side by side.
  2. Start with one namespace — usually a non-critical service. Get tracing and mTLS working there.
  3. Add observability first, traffic policy second. Don't enable circuit breakers until you have a baseline.
  4. Mirror traffic to test policies before enforcing them.
  5. Roll out namespace by namespace. Not the whole cluster at once.
  6. Document escape hatches — how to remove the sidecar from one pod when something goes wrong at 3am.
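
The namespace-by-namespace idea can be sketched for Istio's mTLS setting: apply PERMISSIVE first, verify, then move to STRICT, one namespace at a time. This is an illustrative plan generator, not an official tool. The function names and namespace names are hypothetical, only the two modes named earlier are accepted, and the verification that should happen between steps is not modelled. A PeerAuthentication without a selector applies to its whole namespace; naming it default is a common convention assumed here.

```python
import json

ALLOWED_MODES = ("PERMISSIVE", "STRICT")  # the two modes discussed in this lesson


def peer_authentication(namespace, mode):
    """Build a namespace-wide PeerAuthentication manifest with the given mTLS mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}")
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


def rollout_plan(namespaces):
    """Return manifests in apply order: PERMISSIVE then STRICT, namespace by namespace."""
    plan = []
    for namespace in namespaces:
        plan.append(peer_authentication(namespace, "PERMISSIVE"))
        plan.append(peer_authentication(namespace, "STRICT"))
    return plan


if __name__ == "__main__":
    # Start with the least critical namespace, as the list above suggests.
    for step, manifest in enumerate(rollout_plan(["staging", "payments"]), start=1):
        meta, mode = manifest["metadata"], manifest["spec"]["mtls"]["mode"]
        print(f"step {step}: {meta['namespace']} -> {mode}")
        print(json.dumps(manifest))
```

Keeping each step as a separate manifest also doubles as an escape hatch: re-applying the PERMISSIVE version for one namespace backs out enforcement there without touching the rest.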

Done that way, a mesh feels like a force multiplier. Done by toggling a flag for the whole cluster on a Friday afternoon, it feels like a multi-month outage.

Tools in the wild

  • Istio (free tier, library): Heavyweight CNCF mesh on Envoy. Rich CRDs, ambient mode for sidecar-less.
  • Linkerd (free tier, library): Lightweight Rust-based mesh. Simpler model, fewer knobs, lower overhead.
  • Consul (library): HashiCorp's mesh, multi-cluster + multi-runtime (VMs + K8s).
  • Cilium Service Mesh (library): eBPF-native sidecar-less mesh. Reuses Envoy for L7.
  • SPIFFE (spec): Workload-identity standard powering most mesh mTLS.
  • Kiali (free tier, service): Istio dashboard: service graph, traffic flow, configuration validation.