Service Mesh
Sidecars, mTLS, traffic shaping — and when you don't need a mesh.
A service mesh adds a uniform layer between every pair of services — for security, observability, and traffic control — without each service knowing anything about it. It's powerful, expensive, and easy to over-adopt.
Analogy
Imagine every desk in an office has a small concierge sitting next to it. Anything you send to another desk goes through your concierge first; anything coming to you arrives via theirs. The concierges all know each other (encrypt every handover, log every interaction, refuse a delivery if their boss says so), and they take instructions from a central manager. The desk worker barely changes their behaviour — the concierge handles the security and bureaucracy — but you've added one body per desk and a manager office. For a small team it's overhead. For a building with 200 employees from 12 departments it's the difference between order and chaos.
What a mesh actually does
Three jobs, in roughly this order of value:
- Service-to-service security — mutual TLS between every pair of pods, with auto-rotated short-lived certs. Your apps speak HTTP; the sidecars terminate and initiate TLS.
- Distributed tracing: every request gets a trace ID propagated through every hop, and each sidecar emits a span. The spans are stitched into end-to-end traces without a tracing SDK in each service, provided the app copies the trace headers from each inbound request onto the outbound calls it makes.
- Traffic shaping — canary releases, percentage-based shifts, retries, timeouts, circuit breakers — all configured at the mesh layer rather than in each service.
Bonus features: per-pod authorisation policies, fault injection (test "what if Service X is slow?" in production), multi-cluster meshing, ingress + east-west unification.
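The header-propagation caveat in the tracing bullet is the one piece the sidecar cannot do for you, because it cannot tell which outbound call belongs to which inbound request. A minimal sketch of that forwarding step, assuming the mesh uses B3 or W3C Trace Context headers (the function name and the sample values are illustrative, not part of any mesh API):

```python
# Header names commonly used for trace propagation (B3 and W3C Trace Context).
# Which set matters in practice depends on how the mesh's tracing is configured.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(inbound_headers):
    """Return the trace headers from an inbound request, ready to attach to outbound calls."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical inbound request headers.
inbound = {
    "X-Request-Id": "abc-123",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "Authorization": "Bearer not-forwarded",
}
outbound = propagate_trace_headers(inbound)
print(outbound)
```

Only the trace headers are copied; anything else, such as the Authorization header here, is left for the app to handle deliberately.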
The sidecar pattern
The classic mesh model: every application pod gets a second container — a small proxy (Envoy for Istio, linkerd-proxy for Linkerd) — sharing its network namespace.
┌──────────── Pod ────────────┐
│ ┌────────┐    ┌──────────┐  │
│ │  app   │───▶│  proxy   │──┼──▶ outbound (mTLS)
│ │ :8080  │◀───│  :15001  │◀─┼─── inbound (mTLS)
│ └────────┘    └──────────┘  │
└─────────────────────────────┘
Your app talks plain HTTP to localhost; the proxy handles encryption, retries, traffic shaping, and metrics. The kernel's iptables rules redirect all of the pod's traffic through the proxy automatically.
Cost: every pod has +1 container, +30–100 MB memory, +small CPU per request. Across 1000 pods that's tens of cores wasted just on sidecars. This cost is what drove the sidecar-less push.
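To make the cost concrete, here is a back-of-the-envelope calculation. The per-pod defaults (50 millicores, 50 MB) are assumed baseline figures rather than measurements; substitute numbers from your own cluster.

```python
def sidecar_overhead(pods, cpu_millicores_per_pod=50, memory_mb_per_pod=50):
    """Estimate the total CPU cores and memory (GB) consumed by sidecars alone."""
    cores = pods * cpu_millicores_per_pod / 1000   # 1000 millicores = 1 core
    memory_gb = pods * memory_mb_per_pod / 1000    # decimal GB, for a rough estimate
    return cores, memory_gb


cores, memory_gb = sidecar_overhead(1000)
print(f"1000 pods: {cores:.0f} cores, {memory_gb:.0f} GB")

# The low and high ends of the 30-100 MB memory range from the text.
_, low_gb = sidecar_overhead(1000, memory_mb_per_pod=30)
_, high_gb = sidecar_overhead(1000, memory_mb_per_pod=100)
print(f"memory range: {low_gb:.0f}-{high_gb:.0f} GB")
```

The estimate ignores the per-request CPU cost, which grows with traffic, so treat it as a floor.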
Sidecar-less / ambient meshes
Rather than running a proxy per pod, ambient mesh moves the L4 enforcement into a node-level proxy (or eBPF program) and only deploys L7 proxies for workloads that need them.
- Istio Ambient Mode — splits the mesh into ztunnel (per-node, secure transport) and waypoint proxies (per-namespace, optional, L7).
- Cilium Service Mesh — eBPF-native L4 + L7 enforcement at the node level.
- Linkerd 2.x — sticks with sidecars but with a tiny Rust micro-proxy.
Trade-offs: you get lower overhead and simpler upgrades, but give up some features that depend on a per-pod proxy (rich per-workload config, fine-grained traffic splitting). The ambient direction is where much of the industry appears to be heading.
The contenders
| Mesh | Data plane | Control plane | Heaviness | Sweet spot |
|---|---|---|---|---|
| Istio | Envoy | istiod | heavy | Enterprise, many teams, complex policy |
| Linkerd | linkerd-proxy (Rust) | linkerd-control-plane | light | Mid-size org, "give me mTLS + observability" |
| Consul Connect | Envoy | Consul servers | medium | Multi-runtime (VMs + K8s), HashiCorp shop |
| Cilium SM | Cilium agent (eBPF) | k8s + cilium | light | When you already run Cilium for CNI |
| AWS App Mesh | Envoy | App Mesh API | medium | AWS-only, ECS + EKS |
The decision usually comes down to "how much config power do you actually need?" Istio is the maximalist choice; Linkerd is the "I want it to work without a dedicated platform team" choice.
When you need a mesh
You probably do, when:
- You have many services from many teams and "mTLS everywhere" is a hard requirement.
- You need canary deployments with fine traffic control without bespoke tooling.
- You want end-to-end distributed tracing without instrumenting every service.
- You enforce zero-trust networking — explicit allow-rules between services.
You probably don't, when:
- You have fewer than ~10 services and one team. A mesh adds operational burden you'll resent.
- Your services are internal-only behind an internal LB and TLS isn't really a concern day-to-day.
- You haven't yet stabilised your deployment, observability, and on-call story. A mesh adds a layer; if your foundations are shaky, you'll spend the year debugging the mesh, not the apps.
- You're a startup still finding product-market fit. There are higher-leverage things to do.
A useful heuristic: a mesh starts paying back somewhere around 30–50 services, multi-team. Below that, simpler tools (Network Policies, Ingress with TLS, OpenTelemetry SDK in each service) cover most of the same ground for less.
What you actually configure
For a typical Istio deployment:
- Gateway — the ingress, replaces or augments your nginx/ALB.
- VirtualService — request routing rules: "send /api/v2/* to the canary".
- DestinationRule — what subsets of a service exist (v1/v2/v3) and how to balance.
- PeerAuthentication — mTLS modes (STRICT, PERMISSIVE).
- AuthorizationPolicy — explicit allow rules between workloads.
- Telemetry — what to emit, where to send it.
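As an illustration of what a VirtualService looks like in practice, the sketch below builds a weighted canary route as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. The service name `reviews` and the subset names `v1` and `v2` are hypothetical, and the subsets are assumed to be defined in a matching DestinationRule.

```python
import json


def canary_virtual_service(name, host, canary_weight):
    """Build a VirtualService that sends canary_weight percent of traffic to subset v2."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": [
                        # Stable version keeps the remainder of the traffic.
                        {
                            "destination": {"host": host, "subset": "v1"},
                            "weight": 100 - canary_weight,
                        },
                        # Canary version receives the requested percentage.
                        {
                            "destination": {"host": host, "subset": "v2"},
                            "weight": canary_weight,
                        },
                    ]
                }
            ],
        },
    }


manifest = canary_virtual_service("reviews", "reviews", 10)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from a function keeps the two weights summing to 100, which is an easy thing to get wrong when editing them by hand during a rollout.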
For Linkerd:
- A namespace annotation enables injection.
- ServiceProfile for retries / timeouts / per-route metrics.
- TrafficSplit (SMI) for canaries.
- Server / ServerAuthorization / HTTPRoute for policy.
Notice: Istio has more knobs. Linkerd has fewer. Which one you want depends on your pain.
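For comparison, opting a namespace into Linkerd injection is a single annotation. A minimal sketch that builds the Namespace manifest as a Python dict (the namespace name `shop` is hypothetical):

```python
import json


def meshed_namespace(name):
    """Build a Namespace manifest annotated for Linkerd proxy injection."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            # Linkerd's injector adds the proxy to pods created in an annotated namespace.
            "annotations": {"linkerd.io/inject": "enabled"},
        },
    }


namespace = meshed_namespace("shop")
print(json.dumps(namespace, indent=2))
```

Pods that already exist are not modified by the annotation alone; they pick up the proxy when they are next recreated.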
Common pitfalls
The mesh ate my CPU. Sidecars cost roughly 50 millicores of CPU and 50 MB of memory per pod as a baseline. With 1000 pods that's about 50 cores and 50 GB spent on proxies before any traffic flows. Audit injection scope; not every namespace needs to be in the mesh.
TLS double-encryption. Your app speaks TLS to the sidecar, which speaks mTLS to the next hop. Check origination — usually you want apps to speak HTTP and let the sidecar handle TLS.
Headless services + mesh. Headless Services (StatefulSets) bypass the sidecar's load balancing because they expose pod IPs directly. Confirm with istioctl proxy-config endpoints.
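Probes and sidecar startup. Liveness/readiness probes can end up going through the sidecar, and an app that starts before its proxy is ready has no working network. Either way probes fail, and the pod restarts in a loop. Most meshes ship options for this: Istio can rewrite HTTP probes so the kubelet reaches them without mTLS (the sidecar.istio.io/rewriteAppHTTPProbers annotation) and can delay app start until the proxy is ready (holdApplicationUntilProxyStarts).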
Upgrading the control plane. Sidecar versions and control-plane versions must be compatible. Plan upgrades; never have a quarter-old data plane talking to a fresh control plane.
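A pre-upgrade check for version skew can be as simple as comparing minor versions. The sketch below assumes a policy where the data plane may lag the control plane by at most one minor version and may never lead it; that limit is an assumption for illustration, so check your mesh's documented support policy before relying on it.

```python
def major_minor(version):
    """Extract (major, minor) from a version string such as '1.22.3' or 'v1.22.3'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def skew_ok(control_plane, data_plane, max_minor_skew=1):
    """Return True if the data plane is within max_minor_skew minors behind the control plane."""
    cp_major, cp_minor = major_minor(control_plane)
    dp_major, dp_minor = major_minor(data_plane)
    if cp_major != dp_major:
        return False
    # The data plane may lag the control plane, but must not be newer than it.
    return 0 <= cp_minor - dp_minor <= max_minor_skew


print(skew_ok("1.22.0", "1.21.4"))
print(skew_ok("1.22.0", "1.19.0"))
```

Running a check like this across all namespaces before an upgrade shows which workloads need a restart to pick up a newer sidecar first.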
A safe adoption path
If you decide you need a mesh, the lowest-risk rollout:
- Pick one mesh and commit. Don't run two side by side.
- Start with one namespace — usually a non-critical service. Get tracing and mTLS working there.
- Add observability first, traffic policy second. Don't enable circuit breakers until you have a baseline.
- Mirror traffic to test policies before enforcing them.
- Roll out namespace by namespace. Not the whole cluster at once.
- Document escape hatches — how to remove the sidecar from one pod when something goes wrong at 3am.
Done that way, a mesh feels like a force multiplier. Done by toggling a flag for the whole cluster on a Friday afternoon, it feels like a multi-month outage.
Tools in the wild
- Istio (library, free tier): Heavyweight CNCF mesh on Envoy. Rich CRDs, ambient mode for sidecar-less.
- Linkerd (library, free tier): Lightweight Rust-based mesh. Simpler model, fewer knobs, lower overhead.
- Consul Connect (library, free tier): HashiCorp's mesh, multi-cluster + multi-runtime (VMs + K8s).
- Cilium Service Mesh (library, free tier): eBPF-native sidecar-less mesh. Reuses Envoy for L7.
- SPIFFE / SPIRE (spec, free tier): Workload-identity standard powering most mesh mTLS.
- Kiali (service, free tier): Istio dashboard with service graph, traffic flow, and configuration validation.