Cost Anti-Patterns

The leaks every cloud bill has.

Every cloud bill has the same half-dozen line items leaking money, and engineers find them by staring at Cost Explorer for an hour. This lesson is that hour, compressed.

Analogy

Think of a big old house at the end of a long month when the utility bills are all higher than expected. The same small problems show up in every house: a tap dripping in the upstairs bathroom, an unused fridge humming in the garage, a window left ajar under the thermostat, lights on in the basement nobody visits. None of them is dramatic on its own, but together they are half the bill. A walk-through with a torch and a checklist finds them in an hour — and it's the same leaks every time.

The top five leaks

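  1. NAT Gateway data processing. $0.045/GB through the NAT, on top of egress. That adds 50% to the $0.09/GB outbound internet rate, and it also applies to inbound traffic that would otherwise be free, so a chatty private-subnet workload can inflate your bandwidth bill without you noticing.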
  2. Cross-AZ data transfer. $0.01/GB each way between AZs in the same region. Microservices in different AZs yell at each other 24/7; the bill scales linearly.
  3. Egress to the internet. First 100 GB/month free, then $0.09/GB (dropping at volume). At TB scale, egress is often the single biggest line item.
  4. Unbounded log ingestion. CloudWatch Logs is $0.50/GB ingested + storage + queries. A DEBUG-verbose microservice at scale can hit five figures in a month.
  5. Idle resources. Unattached EBS volumes, idle NAT Gateways, unused ALBs, oversized RDS in non-prod. Pure waste that compounds.

NAT Gateway is the loudest leak

A NAT Gateway costs three ways:

  • Hourly per gateway per AZ: ~$0.045/hr = ~$32/month just for existing.
  • Per-GB data processing: $0.045/GB of traffic through the gateway.
  • Plus regular internet egress on top.

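A private-subnet app pushing 1 TB/month of logs to a third-party SaaS = $45 NAT processing + ~$90 internet egress = $135/month, before the gateway's hourly charge. For traffic to AWS services, a Gateway VPC endpoint (S3 and DynamoDB, free) or an Interface VPC endpoint / PrivateLink (most other AWS services, cheaper than NAT at scale) takes the NAT processing charge away. For a third-party SaaS the same trick only works if the vendor offers a PrivateLink endpoint; otherwise the lever is sending fewer bytes.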
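
The three charges above can be added up in a few lines. This is a minimal sketch that uses the rates quoted in this lesson as constants; they are assumptions for illustration (rates vary by Region and change over time), and the function name `nat_monthly_cost` is hypothetical. It ignores the 100 GB/month free egress allowance.

```python
# Rates quoted in this lesson; treat them as assumptions and check current pricing.
NAT_HOURLY_USD = 0.045          # per gateway, per hour
NAT_PROCESSING_USD_GB = 0.045   # per GB passing through the gateway
EGRESS_USD_GB = 0.09            # internet egress, first paid tier
HOURS_PER_MONTH = 730


def nat_monthly_cost(gb_out: float, gateways: int = 1) -> dict:
    """Split a month of NAT-routed internet traffic into its three charges."""
    hourly = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH
    processing = gb_out * NAT_PROCESSING_USD_GB
    egress = gb_out * EGRESS_USD_GB
    return {
        "hourly": round(hourly, 2),
        "processing": round(processing, 2),
        "egress": round(egress, 2),
        "total": round(hourly + processing + egress, 2),
    }


# 1 TB (1,000 GB) per month through a single gateway.
cost = nat_monthly_cost(1000)
for name, usd in cost.items():
    print(f"{name:>10}: ${usd:,.2f}")
```

Running it with your own monthly volume shows quickly whether the hourly charge or the per-GB charges dominate, which tells you whether consolidating gateways or adding endpoints is the better first move.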

Cross-AZ is the sneaky leak

It's $0.01/GB each way. Tiny per-request, but microservices-at-scale chat looks like:

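  • 500 req/s × 2 KB payload = ~1 MB/s = ~86 GB/day = ~2.6 TB/month. Billed at $0.01/GB on both the sending and the receiving side, that is ~$52/mo per service-pair cross-AZ link, counting request payloads only.

You can have 30 of those pairs. Roughly $1,500/month, gone. Fixes: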

  • Deploy services zonally (one deployment per AZ) and prefer same-AZ traffic.
  • Use VPC Lattice or service mesh zone-aware load balancing so in-zone endpoints win the routing decision.
  • When multi-AZ HA matters, accept the cost intentionally — but measure it.
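
To measure a cross-AZ link before deciding whether to accept it, the arithmetic can be sketched as below. The function name `cross_az_monthly` and its parameters are hypothetical. It counts the payload in one direction of the conversation and assumes the $0.01/GB charge is applied on both the sending and the receiving side; set `billed_sides=1` to see the single-sided figure.

```python
SECONDS_PER_DAY = 86_400
CROSS_AZ_USD_GB_PER_SIDE = 0.01  # rate quoted in this lesson; an assumption here


def cross_az_monthly(req_per_s: float, payload_kb: float,
                     days: int = 30, billed_sides: int = 2):
    """Return (GB per day, GB per month, USD per month) for one service pair."""
    # KB -> GB using decimal units (1 GB = 1,000,000 KB).
    gb_per_day = req_per_s * payload_kb * SECONDS_PER_DAY / 1_000_000
    gb_per_month = gb_per_day * days
    usd = gb_per_month * CROSS_AZ_USD_GB_PER_SIDE * billed_sides
    return gb_per_day, gb_per_month, usd


gb_day, gb_month, usd = cross_az_monthly(req_per_s=500, payload_kb=2)
print(f"{gb_day:.1f} GB/day, {gb_month / 1000:.2f} TB/month, ${usd:.2f}/month per pair")
```

Multiply the result by the number of chatty service pairs to decide whether zonal routing is worth the engineering effort.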

Egress to the internet

Traffic that stays inside AWS (within an AZ, or over private links between VPCs and accounts) is free or cheap; bytes that leave for the internet are billed at the highest rate. Reduce them by:

  • Serving through CloudFront. Transfer from an AWS origin to CloudFront is not charged, and a cache hit never touches the origin at all.
  • Using VPC peering / Transit Gateway instead of going out and back in through public IPs.
  • Compressing responses (gzip/brotli) — easy 3–5× savings on JSON APIs.
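
The compression claim is easy to check against your own payloads. The sketch below measures the gzip ratio for a made-up JSON response; the record fields are hypothetical, and a synthetic, highly repetitive body like this one will compress better than most real traffic, so treat the number it prints as an upper bound and not as a forecast.

```python
import gzip
import json


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Raw size divided by gzip-compressed size."""
    return len(payload) / len(gzip.compress(payload, compresslevel=level))


# Hypothetical API response: a list of similar records, as JSON APIs often return.
records = [
    {"id": i, "status": "active", "region": "us-east-1", "owner": f"team-{i % 7}"}
    for i in range(500)
]
body = json.dumps(records).encode("utf-8")

ratio = compression_ratio(body)
print(f"{len(body)} bytes raw, {ratio:.1f}x smaller after gzip")
```

Swap `body` for a captured production response to get a figure you can multiply against your egress line item.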

Log ingestion: the silent killer

CloudWatch Logs is very expensive at scale ($0.50/GB ingested). Tactics:

  • Sample INFO and DEBUG in production.
  • Ship high-volume logs to S3 (directly or via Firehose) and query them with Athena for cheap retention with queryable access.
  • Keep CloudWatch for operational logs only; long-tail archive elsewhere.
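
Sampling can be done in the application before a log line is ever shipped. Below is a minimal sketch using Python's standard `logging` module; the class name `SampleLowSeverity` and the 10% rate are assumptions chosen for illustration. It keeps every WARNING-or-higher record and passes only a random fraction of INFO and DEBUG records.

```python
import logging
import random
from typing import Optional


class SampleLowSeverity(logging.Filter):
    """Pass all WARNING+ records; pass only a fraction of INFO/DEBUG records."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.rate = rate                    # 0.0 drops all, 1.0 keeps all
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


logger = logging.getLogger("sampled-demo")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(SampleLowSeverity(rate=0.1))  # keep roughly 10% of INFO/DEBUG
logger.addHandler(handler)

logger.error("always emitted")
logger.info("emitted about one time in ten")
```

Attaching the filter to the handler, not the logger, means a second handler (for example one writing to a local file) can still receive the full stream.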

Idle and oversized resources

Run these checks weekly:

  • Unattached EBS volumes. Snapshot, then delete.
  • Idle NAT Gateways in AZs with no workloads.
  • Load balancers with zero targets.
  • Elastic IPs not attached to a running instance. Since February 2024 AWS bills every public IPv4 address per hour, attached or not, so an idle one is pure waste.
  • Non-prod databases running 24/7 at production size.
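  • Previous-generation storage and instances: gp3 volumes are cheaper per GB than gp2 and include a 3,000 IOPS baseline at any size; newer instance families such as m7i generally beat m5 on price-performance.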
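
The first check on the list is straightforward to script. The sketch below filters volume records shaped like the EC2 `describe_volumes` response and estimates what the unattached ones cost per month. The helper names, the sample data, and the $0.08/GB-month price are assumptions for illustration; `fetch_volumes` needs `boto3` and AWS credentials, so the demo at the bottom runs on hand-written records instead.

```python
ASSUMED_USD_PER_GB_MONTH = 0.08  # illustrative price; varies by volume type and Region


def unattached(volumes: list) -> list:
    """Volumes in the 'available' state with no attachments."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def monthly_waste(volumes: list, usd_per_gb: float = ASSUMED_USD_PER_GB_MONTH) -> float:
    """Estimated monthly storage charge for the given volumes (Size is in GiB)."""
    return sum(v["Size"] for v in volumes) * usd_per_gb


def fetch_volumes(region: str) -> list:
    """Fetch all volumes in a Region. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without AWS access
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_volumes").paginate()
    return [v for page in pages for v in page["Volumes"]]


# Hand-written sample records, not real account data.
sample = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 100, "Attachments": []},
    {"VolumeId": "vol-bbb", "State": "in-use", "Size": 50,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 500, "Attachments": []},
]

orphans = unattached(sample)
print([v["VolumeId"] for v in orphans], f"~${monthly_waste(orphans):.2f}/month")
```

In a real run, replace `sample` with `fetch_volumes("us-east-1")` (or loop over your Regions) and snapshot each volume before deleting it.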

The one chart to watch

In Cost Explorer, group by Usage Type, not Service. That surfaces the granular line items (DataTransfer-Regional-Bytes, NatGateway-Bytes, DataProcessing-Bytes, LogsIngestion) that Service grouping hides.
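
The same grouping is available programmatically through the Cost Explorer `GetCostAndUsage` API. The sketch below builds the request and ranks usage types by cost. The helper names are hypothetical, the sample response is hand-written to mirror the API's shape, and the sketch ignores `NextPageToken` pagination; `fetch` needs `boto3` and credentials with Cost Explorer access.

```python
def usage_type_request(start: str, end: str) -> dict:
    """Request arguments for GetCostAndUsage grouped by usage type (dates as YYYY-MM-DD)."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def top_usage_types(response: dict, n: int = 10) -> list:
    """Sum cost per usage type across all periods and return the n largest."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            key = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[key] = totals.get(key, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def fetch(start: str, end: str) -> dict:
    """Call Cost Explorer. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work offline
    return boto3.client("ce").get_cost_and_usage(**usage_type_request(start, end))


# Hand-written response fragment, not real billing data.
sample_response = {"ResultsByTime": [{"Groups": [
    {"Keys": ["NatGateway-Bytes"],
     "Metrics": {"UnblendedCost": {"Amount": "45.00", "Unit": "USD"}}},
    {"Keys": ["DataTransfer-Out-Bytes"],
     "Metrics": {"UnblendedCost": {"Amount": "90.00", "Unit": "USD"}}},
    {"Keys": ["NatGateway-Hours"],
     "Metrics": {"UnblendedCost": {"Amount": "32.85", "Unit": "USD"}}},
]}]}

ranked = top_usage_types(sample_response, n=2)
for usage_type, usd in ranked:
    print(f"{usage_type:<28} ${usd:,.2f}")
```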

The right order of work

  1. Delete waste. Idle resources. Pure profit, zero risk.
  2. Right-size. Compute Optimizer + Trusted Advisor have concrete recommendations.
  3. Fix architecture leaks. NAT processing, cross-AZ, egress — the structural stuff.
  4. Commit to discounts. Only after steps 1–3. Reserved Instances and Savings Plans lock you in — you want to be sure what you're committing to is right-shaped first.

Tag everything with an owner. Untagged resources are where waste hides because nobody feels responsible for them.
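
An owner-tag audit can start as a small filter over whatever inventory you already export. This is a minimal sketch: the function name `missing_owner`, the `Id` field, and the sample records are hypothetical, and the `Tags` layout follows the `[{"Key": ..., "Value": ...}]` list that EC2 describe calls return.

```python
def missing_owner(resources: list, tag_key: str = "owner") -> list:
    """Return the Ids of resources with no (or an empty) owner tag, case-insensitively."""
    flagged = []
    for resource in resources:
        tags = {
            t["Key"].lower(): t.get("Value", "")
            for t in resource.get("Tags", [])
        }
        if not tags.get(tag_key.lower(), "").strip():
            flagged.append(resource["Id"])
    return flagged


# Hand-written inventory records, not real account data.
inventory = [
    {"Id": "i-001", "Tags": [{"Key": "Owner", "Value": "payments"}]},
    {"Id": "vol-002", "Tags": [{"Key": "env", "Value": "dev"}]},
    {"Id": "nat-003"},
    {"Id": "i-004", "Tags": [{"Key": "owner", "Value": ""}]},
]

print(missing_owner(inventory))
```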