VPC Networking
Subnets, routes, security groups, NACLs — why A can't reach B.
A VPC is a software-defined L3 network. Every packet travels through the same decision chain, and a broken flow is dropped at the first gate that says no. Knowing the chain end-to-end is the difference between "check everything at random" and "look at the one thing that matters".
Analogy
Think of a large gated residential estate. Every car that enters has to pass the same sequence of gates in order: the main estate gate, then the subdivision's side gate, then a road sign saying which streets exist, then the guard at the cul-de-sac, then the homeowner's front door with its peephole. If the pizza driver never makes it to the kitchen, one of those five gates turned them away — and the fastest way to find out which is to walk the path, not to randomly re-knock on the front door ten times. Route tables are the road signs; security groups are the peepholes.
The building blocks
| Thing | What it is |
|---|---|
| VPC | A private address space (usually a /16) scoped to one region. |
| Subnet | A /24-ish slice of the VPC pinned to one availability zone. |
| Route table | Per-subnet: where does traffic for this destination go? |
| Internet Gateway (IGW) | Attached to the VPC; a subnet becomes public when its route table sends 0.0.0.0/0 to it. |
| NAT Gateway | Lets private-subnet instances open outbound connections to the internet; inbound is limited to return traffic for those connections. |
| Security Group (SG) | Per-instance stateful allow-list. |
| NACL | Per-subnet stateless allow/deny list, evaluated in order. |
Public subnet = subnet whose route table has 0.0.0.0/0 → igw-xxx.
Private subnet = everything else (usually 0.0.0.0/0 → nat-gw-xxx).
The rule: an instance is only public if both (a) its subnet routes to an IGW and (b) it has a public IP address attached.
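The two-part rule can be written down as a small check. This is a minimal sketch, not an AWS API call: the `Route` type and both function names are hypothetical, and it assumes IGW targets are recognizable by an `igw-` prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    destination: str  # CIDR block, e.g. "0.0.0.0/0"
    target: str       # e.g. "local", "igw-xxx", "nat-yyy"


def subnet_is_public(routes: list[Route]) -> bool:
    """Condition (a): the subnet's default route points at an IGW."""
    return any(
        r.destination == "0.0.0.0/0" and r.target.startswith("igw-")
        for r in routes
    )


def instance_is_public(routes: list[Route], public_ip: Optional[str]) -> bool:
    """Both conditions must hold: IGW default route AND a public IP."""
    return subnet_is_public(routes) and public_ip is not None


public_routes = [Route("10.0.0.0/16", "local"), Route("0.0.0.0/0", "igw-xxx")]
private_routes = [Route("10.0.0.0/16", "local"), Route("0.0.0.0/0", "nat-yyy")]

print(instance_is_public(public_routes, "203.0.113.10"))   # True
print(instance_is_public(public_routes, None))             # False: no public IP
print(instance_is_public(private_routes, "203.0.113.10"))  # False: no IGW route
```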
The reachability chain
To ask "can instance A reach instance B on TCP 443?", walk this list. The first no kills the packet:
- A's security group egress allows tcp:443 to B's CIDR/SG?
- A's subnet route table has a route that covers B's IP? (Within-VPC routes are automatic; cross-VPC requires peering, Transit Gateway, or endpoints.)
- A's subnet NACL egress allows tcp:443? (Stateless: allow both request and response ports.)
- B's subnet NACL ingress allows tcp:443 from A's IP?
- B's security group ingress allows tcp:443 from A's SG?
If the destination is the internet, step 2 requires an IGW (public subnet, with public IP) or a NAT Gateway (private subnet).
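The "first no kills the packet" idea can be sketched as an ordered list of gates, each a yes/no check. Everything here is illustrative: `first_blocking_gate` and the gate names are hypothetical, and the lambdas stand in for real lookups against your security groups, route tables, and NACLs.

```python
from typing import Callable, Optional

# A gate is a label plus a zero-argument check that returns True if it allows the flow.
Gate = tuple[str, Callable[[], bool]]


def first_blocking_gate(gates: list[Gate]) -> Optional[str]:
    """Walk the gates in order; return the first one that says no, or None."""
    for name, allows in gates:
        if not allows():
            return name
    return None


# Hypothetical A -> B flow on tcp:443 where B's subnet NACL was never opened.
chain: list[Gate] = [
    ("A security group egress", lambda: True),
    ("A subnet route table", lambda: True),
    ("A subnet NACL egress", lambda: True),
    ("B subnet NACL ingress", lambda: False),
    ("B security group ingress", lambda: False),
]

print(first_blocking_gate(chain))  # B subnet NACL ingress
```

Note that the fifth gate is also closed in this example, but it is never reached. That is why fixing one gate sometimes just moves the failure to the next one.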
Security groups vs NACLs
They look similar, they are not:
| Aspect | Security group | NACL |
|---|---|---|
| Scope | Per-instance | Per-subnet |
| State | Stateful (return traffic is auto-allowed) | Stateless (you must allow both directions) |
| Rules | Allow only | Allow AND deny |
| Evaluation | All rules combined (OR) | First match in numbered order |
| Default | No inbound allowed until you add a rule; a new group allows all outbound | Default NACL = allow all; custom = deny all |
Security groups are where 95% of your access control lives. Reach for NACLs only when you need a coarse-grained subnet-level hammer (blocking a bad IP range, compliance segmentation).
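The stateless NACL trap: you allow TCP 443 ingress but forget that the response goes out to the client's ephemeral port (32768–65535 for many Linux clients; 1024–65535 covers other client types). The NACL drops the response on egress and the client's connection times out.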
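The evaluation difference, and the ephemeral-port trap, can be shown with a toy model. This is a deliberately simplified sketch: it matches on port only (real rules also match protocol and CIDR), and `NaclRule`, `nacl_allows`, and `sg_allows` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class NaclRule:
    number: int     # lower numbers are evaluated first
    action: str     # "allow" or "deny"
    port_from: int
    port_to: int


def nacl_allows(rules: list[NaclRule], port: int) -> bool:
    """First matching rule in numbered order wins; no match means deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action == "allow"
    return False


def sg_allows(allowed_ranges: list[tuple[int, int]], port: int) -> bool:
    """Allow-only rules combined with OR: any matching range is enough."""
    return any(lo <= port <= hi for lo, hi in allowed_ranges)


# The trap: egress only mentions 443, but the response targets the
# client's ephemeral port (50000 here, chosen for illustration).
egress_broken = [NaclRule(100, "allow", 443, 443)]
egress_fixed = egress_broken + [NaclRule(110, "allow", 32768, 65535)]

print(nacl_allows(egress_broken, 50000))  # False: response is dropped
print(nacl_allows(egress_fixed, 50000))   # True

# Ordering matters for NACLs: an early deny beats a later, broader allow.
ordered = [NaclRule(100, "deny", 443, 443), NaclRule(200, "allow", 0, 65535)]
print(nacl_allows(ordered, 443))  # False
print(nacl_allows(ordered, 80))   # True
```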
Route tables
Each subnet has exactly one effective route table. Routes match most-specific prefix first:
10.0.0.0/16 → local # in-VPC traffic
172.16.0.0/12 → tgw-abc # to corp network via Transit Gateway
0.0.0.0/0 → igw-xxx or nat-yyy # default route
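Most-specific-prefix matching can be sketched with the standard `ipaddress` module. The `lookup` helper and the `ROUTES` dict are hypothetical; the entries mirror the example table, with `igw-xxx` picked as the default target.

```python
import ipaddress
from typing import Optional

ROUTES = {
    "10.0.0.0/16": "local",
    "172.16.0.0/12": "tgw-abc",
    "0.0.0.0/0": "igw-xxx",
}


def lookup(routes: dict[str, str], dest_ip: str) -> Optional[str]:
    """Return the target of the longest prefix that contains dest_ip."""
    ip = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route covers the destination: packet is dropped
    return max(matches)[1]  # highest prefix length = most specific route


print(lookup(ROUTES, "10.0.3.7"))    # local
print(lookup(ROUTES, "172.20.1.1"))  # tgw-abc
print(lookup(ROUTES, "8.8.8.8"))     # igw-xxx
print(lookup({"10.0.0.0/16": "local"}, "8.8.8.8"))  # None
```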
Gotchas:
- A subnet with no default route can reach only the destinations its other routes cover; with just the local route, that means the same VPC and nothing else.
- Peering routes must be added on both sides. Peering is not transitive — A↔B + B↔C does not give A↔C.
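- Gateway VPC endpoints (S3, DynamoDB) add route table entries that send traffic for those services over the private network. Interface endpoints do the same job through a private IP in your subnet plus DNS, with no route entry. Either way the traffic bypasses NAT and its processing charge.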
NAT Gateway — the tax nobody warns you about
A NAT Gateway lets private instances reach the internet. It costs:
- Hourly: ~$0.045/hr per AZ = ~$32/month just for existing.
- Data processing: $0.045/GB passing through — on top of any egress cost.
That second one is the one that blows up bills. A chatty log forwarder in a private subnet pushing 1 TB/month to a SaaS vendor costs $45 in NAT processing plus $90 in egress: $135/month for traffic that would have cost $90 sent directly from a public subnet.
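The arithmetic is easy to rerun for your own volumes. In this sketch the NAT prices come from the figures above, the $0.09/GB egress rate is inferred from the "$90 for 1 TB" example, 1 TB is taken as 1000 GB, and a month is assumed to be 30 days. Actual prices vary by region, so treat the constants as placeholders; `monthly_cost` is a hypothetical helper.

```python
NAT_HOURLY_USD = 0.045      # per NAT Gateway, from the text
NAT_PER_GB_USD = 0.045      # data processing, from the text
EGRESS_PER_GB_USD = 0.09    # assumption: implied by $90 for 1 TB
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month


def monthly_cost(gb_out: float, via_nat: bool, nat_gateways: int = 1) -> dict:
    """Break a month's outbound traffic into egress and NAT charges."""
    egress = gb_out * EGRESS_PER_GB_USD
    processing = gb_out * NAT_PER_GB_USD if via_nat else 0.0
    hourly = nat_gateways * NAT_HOURLY_USD * HOURS_PER_MONTH if via_nat else 0.0
    return {
        "egress": round(egress, 2),
        "nat_processing": round(processing, 2),
        "nat_hourly": round(hourly, 2),
        "total": round(egress + processing + hourly, 2),
    }


with_nat = monthly_cost(1000, via_nat=True)
direct = monthly_cost(1000, via_nat=False)
print(with_nat)
print(direct)
```

The $135 in the example is egress plus NAT processing. The gateway's hourly charge comes on top of that, which is why the `total` for the NAT path is higher than $135.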
Fixes:
- VPC endpoints for AWS services (S3 gateway endpoint is free; interface endpoints cost but bypass NAT).
- Gateway Load Balancer endpoints for traffic inspection appliances.
- PrivateLink for third-party SaaS services when the vendor offers it.
The debugging order
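- VPC flow logs first. A REJECT record on an interface means a security group or NACL dropped the packet there, though the record does not say which of the two. No record at the destination means the packet never arrived, which points back at routing or the source.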
- Reachability Analyzer (AWS) / Connectivity Tests (GCP) simulate the path and name the blocking gate.
- Use SSM Run Command (aws ssm send-command) to run a ping or curl from the source to the destination and confirm whether the failure is at L7 or at L3/L4.
Don't open 0.0.0.0/0 on the destination SG "just to see if it works". You'll forget to close it.
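For the flow-log step, a few lines of parsing are often enough to find the REJECTs for one port. This sketch assumes the default (version 2) space-separated record format. The sample records are made up for illustration, and `parse_record` and `rejected_flows` are hypothetical helpers.

```python
from typing import Iterable, Optional

# Field order of the default flow log format (version 2).
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_record(line: str) -> Optional[dict]:
    """Split one record into named fields; return None if it is malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def rejected_flows(lines: Iterable[str], dstport: int) -> list[dict]:
    """Keep only REJECT records with real data for the given destination port."""
    hits = []
    for line in lines:
        rec = parse_record(line)
        if rec is None or rec["log_status"] != "OK":
            continue  # NODATA / SKIPDATA records carry "-" placeholders
        if rec["action"] == "REJECT" and rec["dstport"] == str(dstport):
            hits.append(rec)
    return hits


SAMPLE = [
    "2 123456789012 eni-0abc 10.0.1.15 10.0.2.20 49152 443 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc 10.0.1.15 10.0.2.20 49153 22 6 3 180 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA",
]

for rec in rejected_flows(SAMPLE, 443):
    print(rec["interface_id"], rec["srcaddr"], "->", rec["dstaddr"], rec["dstport"])
```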
Multi-AZ discipline
Subnets are pinned to one AZ. HA in AWS means: at least one subnet per AZ you care about, each with its own NAT Gateway (or an architecture that tolerates NAT-GW failure in one AZ). With one NAT Gateway shared across AZs, an outage in the gateway's AZ also breaks egress for every private subnet in the other AZs that routes through it.
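A quick audit for this is to flag every private subnet whose default route points at a NAT Gateway in a different AZ. The sketch below works on plain dicts that you would fill from your own inventory; the function name, subnet names, and AZ labels are all hypothetical.

```python
def cross_az_nat_dependencies(
    subnet_az: dict[str, str],
    subnet_nat: dict[str, str],
    nat_az: dict[str, str],
) -> list[str]:
    """Return subnets whose NAT Gateway lives in a different AZ than the subnet."""
    return sorted(
        subnet
        for subnet, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[subnet]
    )


# Hypothetical layout: two private subnets sharing one NAT Gateway in az-a.
subnet_az = {"private-a": "az-a", "private-b": "az-b"}
subnet_nat = {"private-a": "nat-1", "private-b": "nat-1"}
nat_az = {"nat-1": "az-a"}

print(cross_az_nat_dependencies(subnet_az, subnet_nat, nat_az))  # ['private-b']
```

An empty result means every private subnet egresses through a gateway in its own AZ, so losing one AZ does not take out egress in the others.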