PKI from the IT Side
Certs expire. Watch the fleet and renew before users notice.
The cryptography side of PKI is covered elsewhere in the arcade — RSA, TLS, the math of signatures. This level is about the operational side: the part an IT team owns once the theory is shipped. Certs are files with expiry dates, and the IT team's job is to keep the fleet of them healthy before a missed renewal takes a service down at 7am on a Monday.
Analogy
Think of a cert like a driver's licence for a server. Issued by an authority (the CA), valid for a window (notBefore → notAfter), identifies one specific holder by name (CN / SANs), and revocable if it falls into the wrong hands. A licence with the right name, within its dates, issued by an authority the other party recognises, gets you through the gate. A licence that's expired — even by a day — doesn't. A licence issued in a name that doesn't match yours doesn't. A licence issued by a club the gate doesn't recognise (untrusted root) doesn't. IT runs the DMV: track who holds what, renew before the card expires, cancel when someone leaves.
The lifecycle, in four moves
Every cert in the fleet sits somewhere in this loop:
- Issue — a CA signs a cert binding a public key to an identity (a CN + SANs). The CA writes notBefore / notAfter; you either accept that window or go ask for a longer one.
- Deploy — the cert and its private key land on the host that needs them. Private key permissions matter: readable only by the service account that uses it. If the key leaks, you revoke.
- Monitor — track every cert's notAfter. Any monitoring stack eventually grows an openssl x509 -enddate scraper or the managed equivalent.
- Renew or revoke — before expiry, issue a replacement and swap it in without downtime. If the cert's compromised or the host is retired, revoke so the CA's CRL / OCSP responder tells the world the cert is dead.
Skip any step and you get an outage. The playground lets you walk a fleet through the loop; the challenge makes you spot the cert that will break first.
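The Monitor step can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: the fleet dictionary, the host names, and the 30-day amber threshold are all assumptions made up for the example. It takes notAfter strings in the text form OpenSSL prints and relies on the standard library's ssl.cert_time_to_seconds to parse them.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional

AMBER_DAYS = 30  # hypothetical warning threshold; tune to your renewal lead time


def days_left(not_after: str, now: datetime) -> float:
    """Days until expiry (negative once the cert has expired).

    `not_after` is the text OpenSSL prints, e.g. "Jun  1 12:00:00 2025 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now.timestamp()) / 86400


def band(days: float) -> str:
    """Map remaining days onto red / amber / green."""
    if days < 0:
        return "red"
    return "amber" if days <= AMBER_DAYS else "green"


def next_to_break(fleet: dict[str, str], now: datetime) -> Optional[str]:
    """Name of the still-valid cert with the least time left, or None."""
    remaining = {name: days_left(na, now) for name, na in fleet.items()}
    live = {name: d for name, d in remaining.items() if d >= 0}
    return min(live, key=live.get, default=None)


# Illustrative fleet: host name -> notAfter string (made-up values).
now = datetime(2025, 5, 1, tzinfo=timezone.utc)
fleet = {
    "www.example.com": "Jun  1 12:00:00 2025 GMT",
    "api.example.com": "May 10 00:00:00 2025 GMT",
    "old.example.com": "Apr  1 00:00:00 2025 GMT",
}
for name, not_after in fleet.items():
    print(name, band(days_left(not_after, now)))
print("next to break:", next_to_break(fleet, now))
```

Already-expired certs are excluded from next_to_break on purpose: they are an incident, not a calendar entry.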
Common outages
Almost every cert-related page is one of these:
| Outage | What the user sees | Root cause |
|---|---|---|
| Expired leaf | NET::ERR_CERT_DATE_INVALID | The renewal job didn't run, or nobody replaced the file on the host. |
| Untrusted root | NET::ERR_CERT_AUTHORITY_INVALID | The device doesn't have the signing CA in its trust store — classic on newly-provisioned Windows images missing the internal root CA. |
| Wrong CN / SAN | NET::ERR_CERT_COMMON_NAME_INVALID | Someone used a cert issued for api.example.com on a host serving www.example.com. Browsers match SANs, not CNs, these days. |
| Missing chain | Mobile clients break, desktop browsers often recover via AIA fetch | The server is sending the leaf but not the intermediate. Fix the bundle, not the cert. |
| Expired intermediate | Everything breaks, despite a perfectly valid leaf | The middle of the chain aged out. Rotating intermediates is a CA responsibility; you notice because openssl s_client -showcerts shows the dates. |
| Revoked leaf | Mixed — OCSP-stapling browsers see it, older clients may not | A cert the CA has marked revoked is still being served. Pull the file. |
Notice the pattern: users don't know or care that an intermediate cert is expired — they just see ERR_CERT_* and open a ticket. IT's job is to translate the symptom into the right cert, then into the right fix.
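One way to make that translation concrete is a small triage function. The sketch below is a deliberately simplified model, not a real X.509 validator: the Cert dataclass, the diagnose function, and the order of the checks are all assumptions made for illustration, and the wildcard rule only covers the common single-label case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Cert:
    sans: tuple[str, ...]   # DNS names the cert covers (empty for CA certs)
    not_after: datetime
    revoked: bool = False


def san_matches(host: str, san: str) -> bool:
    """Exact match, or a wildcard covering exactly one left-most label."""
    host, san = host.lower(), san.lower()
    if san.startswith("*."):
        _, _, parent = host.partition(".")
        return parent != "" and parent == san[2:]
    return host == san


def diagnose(host: str, leaf: Cert, intermediates: list[Cert],
             root_name: str, trust_store: set[str], now: datetime) -> str:
    """Return the outage class from the table, or "ok"."""
    if leaf.revoked:
        return "revoked leaf"
    if leaf.not_after < now:
        return "expired leaf"
    if not any(san_matches(host, s) for s in leaf.sans):
        return "wrong CN / SAN"
    if not intermediates:                      # server sent only the leaf
        return "missing chain"
    if any(c.not_after < now for c in intermediates):
        return "expired intermediate"
    if root_name not in trust_store:
        return "untrusted root"
    return "ok"


now = datetime(2025, 5, 1, tzinfo=timezone.utc)
leaf = Cert(("api.example.com",), now + timedelta(days=200))
inter = Cert((), now + timedelta(days=900))
# A cert issued for api.example.com, served on www.example.com:
print(diagnose("www.example.com", leaf, [inter], "Corp Root", {"Corp Root"}, now))
```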
Key escrow and recovery
A private key that only one person holds is, paradoxically, a single point of failure. If that person leaves or loses the key, any data encrypted under it is unrecoverable and any cert that depends on it has to be re-issued. Mature IT teams keep an escrow: a copy of critical private keys stored in a protected vault (HashiCorp Vault, AWS KMS, on-prem HSM) with strict access logging. The rule of thumb: escrow data-protection keys (things that encrypt stored data), but do not escrow identity keys (signing keys for code, for devices). If you escrow an identity key, you've just handed the vault admin the ability to impersonate anyone in the fleet.
Recovery drills matter. The first time you need to unseal a Vault or restore a cert from a backup should not be during an outage.
ACME and Let's Encrypt — web-facing automation
For anything behind a public DNS name with a browser audience, ACME-based automation (Let's Encrypt is the dominant free CA; ZeroSSL and BuyPass are alternatives) has effectively eliminated the "forgot to renew" class of outage. The flow:
- Client proves control of the DNS name (HTTP-01 serves a token from the host, DNS-01 publishes a TXT record).
- CA issues a short-lived leaf — 90 days typical.
- Client schedules a renewal at ~60 days; if that fails, retries continue until expiry.
- On every success, the deploy hook reloads the serving process so the new cert goes live.
Tools that implement the client side include certbot, acme.sh, and lego; the reverse proxy Caddy has ACME built in, and Kubernetes clusters typically use cert-manager. The operational principle is that 90-day certs are safer than 3-year certs: if your automation doesn't work, you find out in two months instead of two years.
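The "two months instead of two years" arithmetic falls out of the renewal schedule. The sketch below assumes a client that first tries to renew two-thirds of the way through the cert's lifetime and then retries daily until expiry. Real ACME clients choose their own timing, so treat the two-thirds rule, the daily interval, and the function names as illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator


def first_renewal(not_before: datetime, not_after: datetime) -> datetime:
    """First renewal attempt: two-thirds of the way through the lifetime."""
    # Integer arithmetic on the timedelta keeps the result exact.
    return not_before + (not_after - not_before) * 2 / 3


def retry_schedule(not_before: datetime, not_after: datetime,
                   every: timedelta = timedelta(days=1)) -> Iterator[datetime]:
    """First attempt plus retries, stopping once the cert has expired."""
    attempt = first_renewal(not_before, not_after)
    while attempt < not_after:
        yield attempt
        attempt += every


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
for lifetime_days in (90, 3 * 365):
    expires = issued + timedelta(days=lifetime_days)
    first = first_renewal(issued, expires)
    print(f"{lifetime_days}-day cert: first renewal attempt on day "
          f"{(first - issued).days}, {(expires - first).days} days of retries left")
```

A broken renewal job on a 90-day cert surfaces at the first attempt, about two months after issuance; on a 3-year cert the same bug stays hidden for two years.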
Internal CAs — device certs and code signing
Public CAs won't issue a cert for laptop-ceo.corp or Corp Code Signing. That's where an internal CA comes in. The common options:
- Microsoft Active Directory Certificate Services (AD CS) — the classic domain-joined option. Auto-enrols Windows machines, signs their 802.1X / VPN client certs, and integrates with group policy for trust-store distribution.
- HashiCorp Vault PKI — a secrets-engine that issues short-lived certs via API. Excellent for service mesh, internal APIs, CI workers; the certs live for hours, not years, and the fleet churns automatically.
- Smallstep — modern ACME-style internal CA, designed to make AD CS-shaped deployments easier and to support SSH cert issuance alongside X.509.
- DigiCert / Venafi — enterprise-scale management platforms layered over public or internal CAs, focused on cert inventory, expiry reporting, and policy enforcement across thousands of services.
Device certs replace usernames/passwords for network access (WPA2-Enterprise, VPN client auth); code-signing certs bind a cryptographic identity to software builds (Apple Developer ID, Windows Authenticode). Lose the code-signing key and an attacker can ship malware that your users' machines will happily install.
Gotchas the playbook has to cover
- An expired cert higher in the chain breaks even a valid leaf. Most famously, the AddTrust External CA Root expired in May 2020 and took out Roku, Stripe, Spotify. Moral: monitor your chain, not just your leaf.
- Cert pinning is brittle. If a mobile app pins a specific intermediate and the CA rotates that intermediate, the app stops trusting valid certs until the next app release ships. Pin to the SPKI hash of your own key where possible, not the CA's.
- Root CA distribution is its own project. An internal root has to reach every endpoint — Windows GPO / Intune for domain machines, Jamf / Apple Configurator for Macs, MDM profiles for iOS, and manual install (or automation) for everything else. Miss a platform and you'll see "untrusted root" errors that the user can't self-resolve.
- Clock skew masquerades as cert failure. A host whose time is 10 minutes off can reject a perfectly valid cert because notBefore is "in the future". NTP is part of the PKI story.
- Renewal must swap the file and reload the service. nginx -s reload, systemctl reload postfix, Kubernetes restarts the pod — whatever it is, the new cert isn't live until the serving process picks it up. Ship a post-renew hook or the monitoring continues to alert on the old file.
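The clock-skew gotcha is easy to reproduce on paper. The sketch below assumes a cert whose notBefore equals the moment of issuance (some CAs backdate notBefore to soften exactly this problem; that is not modelled here) and a host whose clock runs 10 minutes slow. The function name and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def validity(not_before: datetime, not_after: datetime,
             local_now: datetime) -> str:
    """Classify a cert against the clock the validating host believes in."""
    if local_now < not_before:
        return "not yet valid"
    if local_now > not_after:
        return "expired"
    return "valid"


issued = datetime(2025, 5, 1, 12, 0, tzinfo=timezone.utc)
not_after = issued + timedelta(days=90)

true_now = issued + timedelta(minutes=2)         # cert deployed 2 minutes after issue
slow_clock = true_now - timedelta(minutes=10)    # host is 10 minutes behind

print("correct clock:", validity(issued, not_after, true_now))
print("skewed clock: ", validity(issued, not_after, slow_clock))
```

The cert is identical in both calls; only the host's idea of "now" differs, which is why a time-sync check belongs in the triage checklist for freshly issued certs.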
Playground
The playground drops a synthetic fleet in front of you — a mix of healthy, about-to-expire, already-expired, and revoked certs. Pick an action (renew, revoke, issue-new), click a row to apply it, drag the time slider to fast-forward the simulated clock, and watch which certs slide into the amber band before they fall into red. The report pane at the bottom names the single next cert that will break — which, for any competent IT team, is the one to put on the calendar.
Visualizer
The Cert Chain panel shows the three-tier chain of trust — Root CA → Intermediate → Leaf — and a timeline of the selected cert's validity window. Click any node to switch the timeline onto its cert. The lesson of the panel is the one in the gotcha list: a leaf with 200 days left doesn't help anyone if its intermediate expires next week.
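In code, that lesson amounts to taking the minimum notAfter across the whole chain rather than reading the leaf alone. A minimal sketch, with a hypothetical three-tier chain and made-up dates:

```python
from datetime import datetime, timedelta, timezone


def weakest_link(chain: dict[str, datetime]) -> tuple[str, datetime]:
    """Return the chain member that expires first and its notAfter."""
    name = min(chain, key=chain.__getitem__)
    return name, chain[name]


now = datetime(2025, 5, 1, tzinfo=timezone.utc)
chain = {
    "leaf": now + timedelta(days=200),
    "intermediate": now + timedelta(days=7),
    "root": now + timedelta(days=3000),
}
name, expires = weakest_link(chain)
print(f"chain is good for {(expires - now).days} more days; {name} goes first")
```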