Network troubleshooting
Walk the OSI stack — ping, dig, curl, tcpdump — and pin the broken layer.
When a user files a ticket that says "the internet is broken," the worst thing you can do is start guessing. The best thing you can do is walk the OSI stack from the bottom up, eliminating layers methodically until exactly one is left holding the bag. This level builds that habit.
The five-question walk
The whole discipline collapses to five sequential yes/no questions, each pinning the next layer:
- Can you ping? — link + network are up. (`ping 8.8.8.8`, then `ping host`)
- Can you resolve? — DNS is healthy. (`dig host`, `nslookup host`)
- Can you route? — the path to the destination is intact. (`traceroute host`, `mtr host`)
- Can you TLS? — the certificate chain validates and the handshake completes. (`curl -v https://host`, `openssl s_client -connect host:443`)
- Can the app parse? — the request is well-formed and the server is healthy. (`curl -i https://host/path`, app logs)
The first "no" is the layer to fix. The four levels above it are wasted noise; the four levels below it are already proven good.
The five-layer cheat sheet
| Layer | Carries | Ask | Probe with | Common failure |
|---|---|---|---|---|
| Link / L2 | ARP, Ethernet, Wi-Fi | "is my cable / radio working?" | `ip link`, `tcpdump arp` | switch port amber, ARP storm, NIC dead |
| Network / L3 | IP, ICMP, routes | "is the route to the host alive?" | `ping`, `traceroute`, `ip route` | wrong default gateway, blackhole route |
| Transport / L4 | TCP, UDP, ports | "did the 3-way handshake complete?" | `nc -zv`, `tcpdump tcp` | RST flood, port not listening, firewall |
| TLS / L5–6 | handshake, certificates | "did the cert chain validate?" | `openssl s_client`, `curl -v` | expired cert, wrong SAN, CA missing |
| Application / L7 | HTTP, DNS, app logic | "did the server answer correctly?" | `dig`, `curl -i`, app traces | 5xx, NXDOMAIN, wrong response shape |
Reading the canonical commands
ping
A round-trip ICMP echo. 0% packet loss with low-millisecond round-trip times means link + network are fine — keep walking up. 100% packet loss means link or network is down; check cables, ARP, the default gateway. `Destination Host Unreachable` is the gateway saying it has no route — that's a routing problem, layer 3.
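The three outcomes look like this on the wire (outputs abridged from typical Linux `ping`; the second and third target addresses are placeholders):

```bash
ping -c 3 8.8.8.8          # healthy: keep walking up the stack
#   3 packets transmitted, 3 received, 0% packet loss
#   rtt min/avg/max/mdev = 7.9/8.2/8.6/0.3 ms

ping -c 3 198.51.100.5     # dead link or blackholed route
#   3 packets transmitted, 0 received, 100% packet loss

ping -c 3 10.9.9.9         # gateway has no route: an L3 problem
#   From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
```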
traceroute
Sends UDP/ICMP probes with successively higher TTLs to map the path. Each hop responds in turn until you hit the destination → routing is good. Route stops at hop N with `* * *` → the failure is at hop N (or, if N is your own gateway, that's the box to debug). Whole route is asterisks → either the destination drops ICMP entirely (common) or the link is down before hop 1.
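The "dies at hop N" signature, with illustrative hops and addresses:

```bash
traceroute -w 2 api.example.com
#  1  192.168.1.1    1.1 ms  0.9 ms  0.8 ms     <- your gateway
#  2  203.0.113.1    9.4 ms  9.1 ms  9.0 ms     <- ISP edge
#  3  * * *
#  4  * * *
# Probes die after hop 2, so hop 3 is the box to chase -- unless the
# far side simply drops ICMP, which a TCP probe (nc -zv) will disprove.
```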
dig
The clean DNS query tool. `status: NOERROR` with an ANSWER section → DNS works. `NXDOMAIN` → the name does not exist (typo? wrong TLD? split-horizon DNS biting you?). `SERVFAIL` → the authoritative server errored — usually a misconfigured zone. `connection timed out; no servers could be reached` → your resolver is unreachable; that's a network failure, not a DNS one. Check `/etc/resolv.conf`.
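The handful of invocations that cover those cases (host name is a placeholder):

```bash
dig api.example.com            # full answer: check "status: NOERROR" + ANSWER
dig +short api.example.com     # records only; empty output means no answer
dig api.example.com @1.1.1.1   # ask a specific resolver to rule out a stale cache
cat /etc/resolv.conf           # confirm which resolver you are actually using
```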
curl -i and curl -v
`curl -I` fetches the response headers alone (a HEAD request); `curl -i` prints them above the body; `curl -v` adds verbose connection tracing — TLS handshake details, the request line, the response status. `curl: (60) SSL certificate problem` → TLS layer (expired cert, missing CA, wrong CN/SAN). `curl: (56) Connection reset by peer` → transport (the server killed the TCP connection — firewall? half-open connection? connection rate limit?). `HTTP/2 500` → the wire is fine; the app is bleeding. `HTTP/2 200` with the wrong body → still application — but now you're in payload territory.
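The verbose trace makes the layer boundaries visible. An abridged, annotated sketch with a placeholder host:

```bash
curl -v https://api.example.com/health
# *   Trying 198.51.100.5:443...       <- L4: TCP connect attempt
# * SSL connection using TLSv1.3       <- L5/6: handshake done, chain valid
# > GET /health HTTP/2                 <- L7: request on the wire
# < HTTP/2 503                         <- L7: server answered, badly
```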
tcpdump
The last resort when nothing higher-level explains the symptom. ARP storms → link layer (a rogue device or a misconfigured switch). Floods of TCP `[R.]` flags from the server → transport (the server is rejecting the connection, often a firewall or LB rule). ICMP host unreachable from a router → routing (network). Save the captures; correlate with switch logs.
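A few starting filters for those symptoms (interface name is a placeholder; most need root):

```bash
tcpdump -ni eth0 arp                             # L2: who-has storms, duplicate replies
tcpdump -ni eth0 icmp                            # L3: unreachables from routers
tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0'  # L4: who is sending the RSTs?
tcpdump -ni eth0 -w capture.pcap port 443        # save for later correlation
```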
A worked example: "the API is broken"
A user reports api.example.com is returning errors. Walk the stack:
- Ping — `ping api.example.com` → 0% packet loss, 8 ms RTT. Link + network are fine.
- dig — `dig api.example.com` → `NOERROR`, ANSWER section returns `198.51.100.5`. DNS is fine.
- curl -v — `curl -v https://api.example.com/health` → TLS handshake completes (`* SSL connection using TLSv1.3`), then `< HTTP/2 503`. TLS is fine, transport is fine, the app is bleeding.
- Stop walking. The fault is at L7 — application. Tell the on-call engineer; the wire is innocent.
Total time: ~30 seconds. Total guessing: zero. The same scenario without the discipline could easily turn into an hour of "let me restart the laptop, then the router, then the Wi-Fi card."
Patterns that catch every junior
- Half-open TCP — the client thinks the connection is up, the server forgot. Symptoms: long-lived connections that hang on the next write. Cause: NAT or firewall idle-timeouts dropping state. Fix: TCP keepalives or shorter app-level timeouts.
- MTU / fragmentation — small packets work, large ones vanish. Cause: a tunnel (VPN, GRE, IPsec) lowers the path MTU but doesn't honor PMTUD. Fix: `ip link set eth0 mtu 1400`, or enable PMTUD properly. (A quick probe for this case is sketched after the list.)
- DNS cache poisoning / split-horizon DNS — different machines resolve the same name to different IPs. Cause: corporate split-horizon (internal vs external resolvers) or a stale cache. Fix: `dig +short host @resolver1` vs `@resolver2`; flush the cache.
- Captive portals — the laptop has Wi-Fi, ping works to the gateway, but every HTTPS request returns a 302 to the hotel's login page. Fix: open a browser, accept the portal terms.
- NAT weirdness — outbound works, inbound doesn't. Or two clients behind the same NAT can't both connect. Cause: source-NAT port collisions or asymmetric routing. Fix: `traceroute --tcp` to confirm the path; check the gateway's connection-tracking table.
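The MTU case has a ten-second diagnostic worth memorizing: send Don't-Fragment pings and bisect the payload size (1472 bytes of payload plus 28 bytes of ICMP/IP headers is a full 1500-byte frame; the host is a placeholder).

```bash
ping -c 2 -M do -s 1472 api.example.com
# Silence, or a "Frag needed ... mtu=1400" error, means a hop can't carry 1500.
ping -c 2 -M do -s 1372 api.example.com
# If 1372 gets through, the path MTU is 1400 -- clamp it:
#   ip link set eth0 mtu 1400
```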
Gotchas
- "It must be the network." It's almost never the network. The base rate is: app, app, app, DNS, app, then maybe the network. Default to eliminate L7 first and only escalate when L7 is provably innocent.
- IPv6 hidden behind IPv4. Modern stacks try IPv6 first (Happy Eyeballs). If
dig AAAA hostreturns junk and your network blackholes IPv6, every connection will incur a few hundred ms of failover before falling back to IPv4. Fix: setnet.ipv6.conf.all.disable_ipv6 = 1or fix the upstream IPv6. - Symmetric assumptions on asymmetric paths.
tracerouteshows the forward path; the return path can be entirely different. A failure in the reply direction is invisible to a single-ended traceroute. pingblocked at the firewall. ICMP is often dropped on hardened hosts.pingfailing is necessary but not sufficient to declare network down — confirm with TCP probes (nc -zv host 443) before reaching for the cable.
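Two of these gotchas have one-line confirmations (the host name is a placeholder):

```bash
dig +short AAAA api.example.com   # a bogus AAAA here + blackholed v6 = slow connects
nc -zv api.example.com 443        # TCP connect succeeds even where ICMP is dropped,
                                  # so a "succeeded!" here acquits L3/L4
```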
Playground
Pick one of five canned scenarios in the playground (`dns-fails`, `tcp-reset`, `tls-handshake-fails`, `app-500`, `link-down`). Each command's output is the canned real-world response — `ping` returns the same string an actual `/usr/bin/ping` would. As you run commands, the candidate-layers indicator shrinks; your job is to walk efficiently and name the broken layer.
Visualizer
The OSI stack panel is a five-rectangle reference card — application at the top, link at the bottom, with each rectangle annotated by what it carries (HTTP, certs, ports, IP, ARP) and which command pokes at it. Click a layer to toggle its candidate state if you're using it as a scratchpad while reading.
Challenge
The seed picks one of the canned scenarios; you submit the broken layer's name. The validator accepts informal answers — `app`, `tcp`, `L4`, `arp` all parse — but if you can't name exactly which layer is broken, the answer is not yet ripe. Walk one more command first.