
Key Management

Storage, rotation, HSMs, KMSes, agents — the systems that wrap the keys.

Generating a key is the easy part. Where it lives, who can use it, and how often it rotates are the operations problems that decide whether the system is actually secure. A perfect cipher around a key sitting in .env.production on a developer's laptop is no security at all.

This lesson is about the systems that wrap the keys.

The "where do you keep the key for the key" problem

Suppose your application encrypts customer PII with AES-256. The AES key has to live somewhere the app can read it. Wherever that is (a file, an environment variable, Vault, KMS), access to that storage has to be protected by another secret. And that secret needs to live somewhere too. Recursion.

Real-world systems break the recursion in two ways:

  1. Hardware root-of-trust. A purpose-built chip (TPM, HSM, secure enclave, smartcard) holds a master key in tamper-resistant silicon. Operations are performed on-chip; the key never leaves. This is the bottom of the trust stack.
  2. Operational trust. You trust a provider (AWS, GCP, Azure, your own ops team) to operate the hardware correctly. AWS KMS is "an HSM you don't have to operate."

You always end up trusting something. The question is what's at the bottom of your stack.

Where keys can live

Location | Tamper-resistance | Operational complexity | Typical use
Plaintext file on disk | None | Trivial | Dev / CI runtime keys (NOT production)
~/.ssh/id_ed25519 (encrypted with passphrase) | Mild | Trivial | Personal SSH keys
ssh-agent / gpg-agent (RAM after passphrase typed) | Process isolation only | Trivial | Daily SSH/GPG use
OS keychain (Keychain, libsecret, DPAPI) | OS-level | Trivial | App credentials, browser keys
TPM (host-level secure chip) | Strong | Medium | Disk encryption, attestation
YubiKey / smartcard (PIV / PGP / FIDO2) | Strong | Medium | Engineering SSH/GPG, code signing
Cloud KMS (AWS/GCP/Azure) | FIPS 140-2 Level 3 HSM-backed | Low | App data encryption, signing
HashiCorp Vault transit | Software (or HSM-backed) | Medium-high | Self-hosted KMS-equivalent
Standalone HSM (Thales, AWS CloudHSM) | FIPS 140-2 Level 3-4 | High | CA roots, payment card systems

The pattern: as you move down the table, the keys get less extractable but the system gets more operationally expensive.

Cloud KMS — what you actually get

Pretty much every cloud provider has the same KMS shape:

  • You create a key. The key material is generated inside an HSM and never returns to your code.
  • You ask the API to perform operations: Encrypt, Decrypt, Sign, Verify, GenerateDataKey.
  • Each call is authorized via IAM (or whatever the provider's auth is).
  • Each call is audit-logged (CloudTrail / GCP Audit Logs / Azure Monitor).
  • You pay per call (a tiny fraction of a cent per op; AWS KMS, for example, lists roughly $0.03 per 10,000 requests) plus a flat rate per key per month.

KMS is the right default for any serious application. Even small startups should use KMS for production secrets — the operational overhead is essentially zero compared to running your own HSM.

The big trade-off: every operation is a network call. If you encrypt every row in a database directly with KMS, you'll bottleneck on KMS throughput and pay $$$. The standard pattern is envelope encryption:

  1. Generate a per-record AES key locally.
  2. Encrypt the record with that AES key.
  3. Encrypt the AES key (just 32 bytes) with KMS.
  4. Store the ciphertext + the encrypted AES key together.

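Now KMS only ever handles 32 bytes per record instead of the record itself, so each call is small and cheap. If you relax step 1 and reuse a cached data key across many records for a short window (say, one minute), you call KMS once per window rather than once per record. Decrypt is the same in reverse: ask KMS to decrypt the wrapped key, then decrypt the record locally.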
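
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: it assumes the third-party cryptography package for AES-GCM, and LocalStandInKms, Envelope, encrypt_record and decrypt_record are hypothetical names invented for this example. The stand-in class plays the role of the KMS in memory; a real KMS would be a network call and would keep the master key inside its HSM.

```python
import os
from dataclasses import dataclass

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


class LocalStandInKms:
    """Hypothetical in-memory stand-in for a cloud KMS (assumption, not a real API)."""

    def __init__(self) -> None:
        # In a real KMS this master key lives in an HSM and never reaches your code.
        self._master = AESGCM(AESGCM.generate_key(bit_length=256))
        self.calls = 0  # counts what would be network round trips

    def encrypt(self, plaintext: bytes) -> bytes:
        self.calls += 1
        nonce = os.urandom(12)
        return nonce + self._master.encrypt(nonce, plaintext, None)

    def decrypt(self, blob: bytes) -> bytes:
        self.calls += 1
        return self._master.decrypt(blob[:12], blob[12:], None)


@dataclass
class Envelope:
    wrapped_key: bytes  # the data key, encrypted by the KMS
    nonce: bytes        # AES-GCM nonce used for the record
    ciphertext: bytes   # the record, encrypted locally


def encrypt_record(kms: LocalStandInKms, record: bytes) -> Envelope:
    data_key = AESGCM.generate_key(bit_length=256)               # 1. per-record key
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, record, None)   # 2. encrypt locally
    wrapped_key = kms.encrypt(data_key)                          # 3. wrap 32 bytes via KMS
    return Envelope(wrapped_key, nonce, ciphertext)              # 4. store together


def decrypt_record(kms: LocalStandInKms, env: Envelope) -> bytes:
    data_key = kms.decrypt(env.wrapped_key)  # one KMS call to unwrap the data key
    return AESGCM(data_key).decrypt(env.nonce, env.ciphertext, None)
```

With this shape, encrypting and then decrypting one record costs two KMS calls regardless of how large the record is; the bulk encryption never leaves the process.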

ssh-agent / gpg-agent — the local key cache

Personal asymmetric keys (your SSH key, your GPG key) usually live encrypted with a passphrase on disk. Typing the passphrase for every SSH connection is unworkable, so the agent pattern was invented:

ssh-add ~/.ssh/id_ed25519     ← prompts once for passphrase
                                The agent holds the decrypted key in RAM.
                                Subsequent ssh / git push → agent signs the challenge.

The agent:

  • Listens on a Unix socket ($SSH_AUTH_SOCK).
  • Holds decrypted private keys in its own process memory.
  • Performs signing operations on request — the key never leaves the agent's address space.
  • Supports forwarding so a remote process can sign with a key that physically lives on your laptop. (Use carefully — agent forwarding is a foot-gun if you SSH into a compromised host.)

Modern enhancements:

  • YubiKey instead of file-on-disk: the agent forwards signing requests to the chip, which performs the sign without ever revealing the key.
  • 1Password / Bitwarden ssh-agent integration: the password manager's vault stores the encrypted key; the agent unlocks via biometric/passphrase.

Key rotation

Three reasons to rotate:

  1. Compromise risk over time. The longer a key lives, the more places it has been accessed from, the more snapshots and backups it appears in. Rotating limits this surface.
  2. Cryptographic margin. A 2048-bit RSA key that is acceptable today might be on the edge in 10 years. Routine rotation gets you off the old key without panic.
  3. Compliance. Frameworks such as PCI-DSS, SOC 2, and FedRAMP expect a defined rotation schedule, and auditors will ask for evidence that you follow it. Annual is typical for signing keys; quarterly for high-value data-encryption keys.

The mechanics:

  • Generate new key under a new identifier.
  • Cross-sign / dual-publish. During the transition window, both old and new are valid. Verifiers accept either; signers prefer the new.
  • Cut over. Stop using the old key for new operations.
  • Revoke / decommission. Once the old key is no longer used, mark it revoked (CRL, OCSP, equivalent).
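
The rotation steps above can be sketched as a small key ring that tags every signature with a key identifier. This is an illustrative sketch only: KeyRing and its methods are hypothetical names, the key IDs are made up, and HMAC-SHA256 stands in for an asymmetric signature so the example needs nothing beyond the standard library.

```python
import hashlib
import hmac
import os
from typing import Dict, Optional, Set, Tuple


class KeyRing:
    """Hypothetical key ring: several keys can verify, exactly one signs."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}
        self._revoked: Set[str] = set()
        self.active_id: Optional[str] = None

    def add_key(self, key_id: str) -> None:
        # Generate a new key under a new identifier; verifiers accept it immediately.
        self._keys[key_id] = os.urandom(32)
        if self.active_id is None:
            self.active_id = key_id

    def cut_over(self, key_id: str) -> None:
        # From now on, new signatures use this key.
        if key_id not in self._keys or key_id in self._revoked:
            raise ValueError("unknown or revoked key id")
        self.active_id = key_id

    def revoke(self, key_id: str) -> None:
        # Decommission: signatures made with this key stop verifying.
        if key_id == self.active_id:
            raise ValueError("cut over to a new key before revoking the active one")
        self._revoked.add(key_id)

    def sign(self, message: bytes) -> Tuple[str, bytes]:
        if self.active_id is None:
            raise ValueError("no active key")
        tag = hmac.new(self._keys[self.active_id], message, hashlib.sha256).digest()
        return self.active_id, tag

    def verify(self, key_id: str, message: bytes, tag: bytes) -> bool:
        key = self._keys.get(key_id)
        if key is None or key_id in self._revoked:
            return False
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```

Carrying the key ID alongside each signature is what makes the transition window workable: a verifier can tell which key to check against without guessing, and old signatures keep verifying until the old ID is explicitly revoked.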

The painful part is identifying everywhere a key is used. Production audits regularly find that "the database key" is hardcoded in 4 services, 2 deployment scripts, and a .env checked into a forgotten Git branch.

Revocation

Even with rotation, sometimes you need to declare a key dead now. Two main mechanisms:

  • Certificate Revocation List (CRL). A signed list of revoked certificates, published by the CA. Verifiers download it and check.
  • OCSP (Online Certificate Status Protocol). Verifiers query the CA for a single cert's status in real-time.

Both have ergonomic problems (CRLs grow large, OCSP requires the CA to be online), which is why modern systems prefer short-lived certificates (days or hours) instead of long-lived certs with revocation. If a certificate is only valid for 24 hours, you rarely need to revoke it; you stop issuing new ones and let the last one expire.
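
The difference between the two approaches shows up in what the verifier has to check. The sketch below is a simplified model, not X.509: Cert, is_acceptable and issue_short_lived are hypothetical names, and the revoked-serials set stands in for a downloaded CRL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet


@dataclass(frozen=True)
class Cert:
    """Hypothetical, heavily simplified certificate: just a serial and a validity window."""
    serial: int
    not_before: datetime
    not_after: datetime


def is_acceptable(cert: Cert, now: datetime,
                  revoked_serials: FrozenSet[int] = frozenset()) -> bool:
    # Long-lived certs depend on this lookup being fresh (CRL or OCSP).
    if cert.serial in revoked_serials:
        return False
    # Short-lived certs rely mostly on this check: expiry does the revoking.
    return cert.not_before <= now <= cert.not_after


def issue_short_lived(serial: int, now: datetime,
                      lifetime: timedelta = timedelta(hours=24)) -> Cert:
    return Cert(serial, now, now + lifetime)
```

For example, a cert issued with issue_short_lived at some time t0 passes is_acceptable one hour later and fails 25 hours later with no revocation data at all, whereas a one-year cert keeps passing until its serial shows up in revoked_serials.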

A practical decision framework

Start by classifying your key by impact:

Tier | Impact | Examples | Storage | Rotation
1 | Catastrophic | CA root, code-signing root | HSM, air-gapped | Years (carefully)
2 | High | KMS master, JWT signing, API key | Cloud KMS | Quarterly to yearly
3 | Medium | Per-environment app keys | Vault / KMS-managed | Monthly to quarterly
4 | Low | Per-session keys | RAM only | Per-session

The mistake people make is treating tier-1 keys like tier-3 keys (storing the CA root in .env) or tier-3 keys like tier-1 (running an HSM for an environment-specific app key, paying for complexity that doesn't help).
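
One way to make the tiers actionable is to encode the rotation column as data and flag keys that have outlived it. The sketch below is only an illustration: MAX_AGE_BY_TIER and rotation_overdue are hypothetical names, the day counts are assumptions that take the upper bound of each range in the table (yearly as 365 days, quarterly as 90), and tiers 1 and 4 are left without an automatic deadline because the table gives no fixed interval for them.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# Assumed upper bounds derived from the table's "Rotation" column.
MAX_AGE_BY_TIER: Dict[int, Optional[timedelta]] = {
    1: None,                 # "Years (carefully)": planned by hand, no automatic deadline
    2: timedelta(days=365),  # "Quarterly to yearly"
    3: timedelta(days=90),   # "Monthly to quarterly"
    4: None,                 # per-session keys disappear with the session
}


def rotation_overdue(tier: int, created: date, today: date) -> Optional[bool]:
    """Return True/False for tiers with a deadline, None where no deadline is modeled."""
    max_age = MAX_AGE_BY_TIER[tier]
    if max_age is None:
        return None
    return today - created > max_age
```

A check like this could run against a key inventory on a schedule, which also forces the inventory to exist in the first place.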

What this lesson asks of you

The playground asks you to pick the right storage location for five concrete keys (a CA root, a per-app encryption key, a personal SSH key, a JWT signing key, a payment-system key). The visualizer shows the topology of a typical KMS-backed application — which calls go to KMS, which stay local, where envelope encryption fits.
