cloud · level 4

VPC Networking

Subnets, routes, security groups, NACLs — why A can't reach B.

A VPC is a software-defined L3 network. Every packet travels through the same decision chain, and if a flow is broken, the packet was dropped at one specific gate in that chain. Knowing the chain end-to-end is the difference between "check everything at random" and "look at the one thing that matters".

Analogy

Think of a large gated residential estate. Every car that enters has to pass the same sequence of gates in order: the main estate gate, then the subdivision's side gate, then a road sign saying which streets exist, then the guard at the cul-de-sac, then the homeowner's front door with its peephole. If the pizza driver never makes it to the kitchen, one of those five gates turned them away — and the fastest way to find out which is to walk the path, not to randomly re-knock on the front door ten times. Route tables are the road signs; security groups are the peepholes.

The building blocks

Thing	What it is
VPC	A private address space (usually a /16) scoped to one region.
Subnet	A /24-ish slice of the VPC pinned to one availability zone.
Route table	Per-subnet: where does traffic for this destination go?
Internet Gateway (IGW)	Grants a subnet a public route to the internet.
NAT Gateway	Lets private-subnet instances initiate outbound connections to the internet; inbound is limited to return traffic for those connections.
Security Group (SG)	Per-instance stateful allow-list.
NACL	Per-subnet stateless allow/deny list, evaluated in order.

Public subnet = subnet whose route table has 0.0.0.0/0 → igw-xxx. Private subnet = everything else (usually 0.0.0.0/0 → nat-xxx, or no default route at all).

The rule: an instance is only public if both (a) its subnet routes to an IGW and (b) it has a public IP address attached.
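The two-part rule can be sketched as a small predicate. This is a minimal illustration, not an AWS API call: the `Instance` class and its fields are hypothetical stand-ins for what you would read from the instance and its subnet's route table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    # Hypothetical model: field names are assumptions for illustration only.
    name: str
    public_ip: Optional[str]             # None if no public or Elastic IP is attached
    subnet_default_route: Optional[str]  # target of 0.0.0.0/0, e.g. "igw-xxx" or "nat-xxx"


def is_public(instance: Instance) -> bool:
    """An instance is public only if BOTH conditions hold."""
    routes_to_igw = (instance.subnet_default_route or "").startswith("igw-")
    has_public_ip = instance.public_ip is not None
    return routes_to_igw and has_public_ip


web = Instance("web", "203.0.113.10", "igw-xxx")  # public subnet + public IP
app = Instance("app", None, "igw-xxx")            # public subnet, no public IP
db = Instance("db", "203.0.113.11", "nat-xxx")    # public IP, but private subnet

for inst in (web, app, db):
    print(inst.name, is_public(inst))
```

Only `web` satisfies both conditions; `app` and `db` each fail one of them, which is the usual cause of "it has a public IP but I can't reach it".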

The reachability chain

To ask "can instance A reach instance B on TCP 443?", walk this list. The first no kills the packet:

A's security group egress allows tcp:443 to B's CIDR/SG?
A's subnet route table has a route that covers B's IP? (Within-VPC routes are automatic; cross-VPC requires peering, Transit Gateway, or endpoints.)
A's subnet NACL egress allows tcp:443? (NACLs are stateless: the response also needs an ingress rule covering A's ephemeral port.)
B's subnet NACL ingress allows tcp:443 from A's IP?
B's security group ingress allows tcp:443 from A's SG?

If the destination is the internet, step 2 requires an IGW (public subnet, with public IP) or a NAT Gateway (private subnet).
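The "first no kills the packet" walk can be sketched as an ordered list of checks. This is an illustrative model only: the gate names mirror the list above, and the hard-coded verdicts are hypothetical placeholders for what you would actually read from the security groups, route table, and NACLs.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def first_blocking_gate(gates: List[Gate]) -> Optional[str]:
    """Return the name of the first gate that rejects, or None if all pass."""
    for name, allows in gates:
        if not allows():
            return name  # later gates are never evaluated
    return None


# Hypothetical verdicts for "A -> B on tcp:443", in chain order.
verdicts = {
    "A SG egress": True,
    "A route table": True,
    "A NACL egress": True,
    "B NACL ingress": False,
    "B SG ingress": False,
}

gates = [(name, lambda ok=ok: ok) for name, ok in verdicts.items()]
blocked_at = first_blocking_gate(gates)
print(blocked_at or "reachable")
```

Note that B's security group is also misconfigured here, but the walk reports only the NACL: fix the first blocking gate, then walk the chain again.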

Security groups vs NACLs

They look similar, they are not:

Aspect	Security group	NACL
Scope	Per-instance	Per-subnet
State	Stateful (return traffic is auto-allowed)	Stateless (you must allow both directions)
Rules	Allow only	Allow AND deny
Evaluation	All rules combined (OR)	First match in numbered order
Default	No inbound allowed until you add a rule; a newly created SG allows all outbound	Default NACL = allow all; custom = deny all

Security groups are where 95% of your access control lives. Reach for NACLs only when you need a coarse-grained subnet-level hammer (blocking a bad IP range, compliance segmentation).

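The stateless NACL trap: you allow TCP 443 ingress but forget that the response leaves toward the client's ephemeral port, so it needs its own egress rule (32768–65535 covers most Linux clients; 1024–65535 covers all client types). Without that rule the response is dropped at the subnet boundary and the client's connection times out.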
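A minimal sketch of first-match NACL evaluation makes the trap concrete. The `NaclRule` class and `nacl_allows` function are hypothetical names for illustration; real NACL rules also match on protocol and CIDR, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NaclRule:
    # Simplified, hypothetical rule: ports only, no protocol or CIDR matching.
    number: int
    action: str     # "allow" or "deny"
    port_from: int
    port_to: int


def nacl_allows(rules: List[NaclRule], dest_port: int) -> bool:
    """Evaluate rules in ascending rule number; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= dest_port <= rule.port_to:
            return rule.action == "allow"
    return False  # the implicit final rule denies everything unmatched


client_port = 50000  # assumed ephemeral source port picked by the client

ingress = [NaclRule(100, "allow", 443, 443)]
egress_broken = [NaclRule(100, "allow", 443, 443)]
egress_fixed = egress_broken + [NaclRule(110, "allow", 1024, 65535)]

request_ok = nacl_allows(ingress, 443)                     # request reaches the server
response_broken = nacl_allows(egress_broken, client_port)  # response is dropped
response_fixed = nacl_allows(egress_fixed, client_port)    # response gets out

print(request_ok, response_broken, response_fixed)
```

The response's destination port is the client's ephemeral port, not 443, which is why an egress rule for 443 alone does not help.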

Route tables

Each subnet is associated with exactly one route table (the VPC's main table if you don't set one explicitly). The most specific matching prefix wins (longest-prefix match):

10.0.0.0/16   → local               # in-VPC traffic
172.16.0.0/12 → tgw-abc             # corp network via Transit Gateway
0.0.0.0/0     → igw-xxx or nat-yyy  # default route
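Longest-prefix matching over that table can be sketched with Python's standard `ipaddress` module. The `ROUTES` dict and `next_hop` function are illustrative names, and the targets are the placeholder IDs from the table above.

```python
import ipaddress
from typing import Dict, Optional

# Example table from the text; targets are placeholder IDs.
ROUTES: Dict[str, str] = {
    "10.0.0.0/16": "local",
    "172.16.0.0/12": "tgw-abc",
    "0.0.0.0/0": "igw-xxx",
}


def next_hop(dest_ip: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the target of the most specific route covering dest_ip."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no covering route: the packet is dropped
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop("10.0.5.9"))    # in-VPC address
print(next_hop("172.20.1.1"))  # inside 172.16.0.0/12
print(next_hop("8.8.8.8"))     # only the default route covers it
```

Every address matches 0.0.0.0/0, so the default route wins only when nothing more specific covers the destination.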

Gotchas:

A subnet with no default route can only reach other subnets in the same VPC, plus whatever destinations its more specific routes cover.
Peering routes must be added on both sides. Peering is not transitive — A↔B + B↔C does not give A↔C.
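Gateway endpoints (S3, DynamoDB) add route table entries that send traffic for those services over the private network; interface endpoints do the same job through a private IP in your subnet instead of a route. Both bypass NAT and save egress cost.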

NAT Gateway — the tax nobody warns you about

A NAT Gateway lets private instances reach the internet. It costs:

Hourly: ~$0.045/hr per AZ = ~$32/month just for existing.
Data processing: $0.045/GB passing through — on top of any egress cost.

That second one is the one that blows up bills. A chatty log forwarder in a private subnet pushing 1 TB/month to a SaaS vendor costs $45 in NAT processing plus $90 in egress: $135/month for traffic that would have cost $90 if it had gone straight out through an IGW.
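The arithmetic behind those numbers, as a small sketch. The NAT rates are the ones quoted above; the $0.09/GB egress rate is an assumption inferred from the $90-per-TB figure, and a 30-day month with 1 TB = 1000 GB is assumed. Actual prices vary by region.

```python
NAT_HOURLY_USD = 0.045     # per gateway, from the text
NAT_PER_GB_USD = 0.045     # data processing, from the text
EGRESS_PER_GB_USD = 0.09   # assumption: implied by $90 for 1 TB
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def monthly_nat_cost(gb_through_nat: float, gateways: int = 1) -> dict:
    """Break a monthly bill into standing, processing, and egress charges."""
    standing = NAT_HOURLY_USD * HOURS_PER_MONTH * gateways
    processing = NAT_PER_GB_USD * gb_through_nat
    egress = EGRESS_PER_GB_USD * gb_through_nat
    return {
        "standing": round(standing, 2),
        "processing": round(processing, 2),
        "egress": round(egress, 2),
        "traffic_total": round(processing + egress, 2),
    }


bill = monthly_nat_cost(1000)  # 1 TB/month through one NAT Gateway
print(bill)
```

The processing charge scales with traffic while the standing charge scales with the number of gateways, so a chatty workload and a multi-AZ layout raise the bill along different axes.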

Fixes:

VPC endpoints for AWS services (S3 gateway endpoint is free; interface endpoints cost but bypass NAT).
Gateway Load Balancer endpoints for traffic inspection appliances.
PrivateLink for third-party SaaS services when the vendor offers it.

The debugging order

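VPC flow logs first. A REJECT record tells you a security group or NACL dropped the packet at that network interface, though not which of the two. An ACCEPT on the request paired with a REJECT on the response points to a NACL, because a security group would have allowed the return traffic automatically. No record at all means the packet never reached that interface, or flow logs aren't enabled there.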
Reachability Analyzer (AWS) / Connectivity Tests (GCP) simulate the path and name the blocking gate.
Use ssm:send-command to run a ping or curl from the source to the destination and confirm whether the failure is at L7 or at L3/L4.
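As a starting point for the flow-log step, here is a minimal sketch that parses records in the default (version 2) flow log format and filters for rejected flows to a given port. The sample records, account ID, and interface ID are made up for illustration; custom log formats use a different field order.

```python
from typing import Dict, List

# Field order of the default (version 2) VPC flow log format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> Dict[str, str]:
    """Split one space-separated record into named fields (all strings)."""
    return dict(zip(FIELDS, line.split()))


def rejected_flows(lines: List[str], dstport: int) -> List[Dict[str, str]]:
    """Return REJECT records whose destination port matches."""
    records = (parse_record(line) for line in lines)
    return [
        r for r in records
        if r.get("action") == "REJECT" and r.get("dstport") == str(dstport)
    ]


# Illustrative records only; not taken from a real account.
sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49152 443 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49153 22 6 5 300 1700000000 1700000060 ACCEPT OK",
]

hits = rejected_flows(sample, 443)
for r in hits:
    print(r["interface_id"], r["srcaddr"], "->", r["dstaddr"], r["dstport"], r["action"])
```

The interface ID in a hit tells you where the drop happened, which narrows the search to that interface's security groups and its subnet's NACL.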

Don't open 0.0.0.0/0 on the destination SG "just to see if it works". You'll forget to close it.

Multi-AZ discipline

Subnets are pinned to one AZ. HA in AWS means: at least one subnet per AZ you care about, each with its own NAT Gateway (or an architecture that tolerates NAT-GW failure in one AZ). With a single NAT Gateway shared across AZs, an outage in the gateway's AZ also breaks egress for every private subnet elsewhere in the VPC.
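A quick audit for that shared-gateway pattern can be sketched as follows. The three mappings are hypothetical inputs you would assemble from your own subnet, route table, and NAT Gateway inventory; the IDs are placeholders.

```python
from typing import Dict, List


def cross_az_nat_dependencies(
    subnet_az: Dict[str, str],
    subnet_nat: Dict[str, str],
    nat_az: Dict[str, str],
) -> List[str]:
    """Return private subnets whose default route points at a NAT in another AZ."""
    return sorted(
        subnet
        for subnet, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[subnet]
    )


# Hypothetical inventory: two private subnets sharing one NAT Gateway.
subnet_az = {"subnet-a": "az-1", "subnet-b": "az-2"}
subnet_nat = {"subnet-a": "nat-1", "subnet-b": "nat-1"}
nat_az = {"nat-1": "az-1"}

at_risk = cross_az_nat_dependencies(subnet_az, subnet_nat, nat_az)
print(at_risk)
```

Any subnet in the result loses internet egress if the NAT Gateway's AZ goes down, even though the subnet's own AZ is healthy.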