Webhooks

Outbound HTTP, signing, retries, and dead-letter handling.

A webhook is a regular HTTP request, except your service is the one making it, and the destination is a URL the customer gave you: inversion of control. Five things to get right: signing, replay protection, retries, dead-letter handling, and receiver-side idempotency.

The flow

Your service                       Customer's endpoint
──────────────                     ───────────────────
[event happens]
      │
      │   POST https://customer.example.com/webhooks
      │   Content-Type: application/json
      │   Webhook-Id: evt_8a6f3c2b9d4e1f50
      │   Webhook-Timestamp: 1714323840
      │   Webhook-Signature: v1,7a3...
      │
      │   { "type": "payment.succeeded", ... }
      ▶───────────────────────────────────────────────▶
                                                 [200 OK]
      ◀───────────────────────────────────────────────◀
                                                       │
[mark delivered]                                       │

Three headers carry the security envelope: an event ID, a timestamp, and a signature. The body is the event itself.

Signing — HMAC over (timestamp.body)

Webhooks must be signed. Without a signature, anyone who learns the URL can post fake events.

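The common pattern is HMAC-SHA256 over a string that includes the timestamp. Stripe signs timestamp.body this way; Standard Webhooks signs id.timestamp.body and base64-encodes the result; GitHub signs the body alone, with no timestamp. The timestamped form: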

signed_payload = timestamp + "." + raw_body
signature      = hex( HMAC-SHA256( secret, signed_payload ) )
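
On the sending side, the two lines above might look like the following in Node. This is a minimal sketch, not a reference implementation: the helper name signWebhook, the example secret, and the choice to return the three headers from the flow diagram (including the "v1," prefix) are assumptions made for illustration.

```javascript
import { createHmac } from "node:crypto";

// Hypothetical helper: builds the three security headers for one delivery.
// Follows the formula above: hex(HMAC-SHA256(secret, timestamp + "." + rawBody)).
function signWebhook(secret, eventId, rawBody, nowMs = Date.now()) {
  const timestamp = Math.floor(nowMs / 1000).toString(); // Unix seconds
  const signature = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");
  return {
    "Webhook-Id": eventId,
    "Webhook-Timestamp": timestamp,
    "Webhook-Signature": `v1,${signature}`,
  };
}

// Serialize once, sign those exact bytes, and send those exact bytes.
const rawBody = JSON.stringify({ type: "payment.succeeded" });
const headers = signWebhook(
  "example_secret", // made-up secret for illustration
  "evt_8a6f3c2b9d4e1f50",
  rawBody,
  1714323840000
);
console.log(headers);
```

The important design point is that the body is serialized exactly once: the string that was signed is the string that goes on the wire.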

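The receiver recomputes the same value and compares with crypto.timingSafeEqual, which runs in constant time so the comparison does not leak information through timing:

import { createHmac, timingSafeEqual } from "node:crypto";

// ts and sig come from the request headers (sig is the hex digest, without
// any version prefix); rawBody is the unparsed request body.
const expected = createHmac("sha256", secret)
  .update(`${ts}.${rawBody}`)
  .digest(); // raw bytes

const received = Buffer.from(sig, "hex");

// timingSafeEqual throws if the lengths differ, so check the length first.
if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
  return res.status(400).json({ error: "bad_signature" });
}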

Two non-obvious gotchas:

  1. Verify the RAW body, not parsed JSON. Once you JSON.parse, key order and whitespace are lost. A re-serialized body will not match. Most frameworks let you grab the raw body before parsing.
  2. Timing-safe compare. A plain == or === returns at the first mismatching byte, so response times can reveal how much of a forged signature is correct. Use crypto.timingSafeEqual (Node) or hmac.compare_digest (Python).
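
To see why the first gotcha matters, the sketch below signs a body, round-trips it through JSON.parse and JSON.stringify, and signs it again. The secret, timestamp, payload, and helper names are made-up values for illustration.

```javascript
import { createHmac, timingSafeEqual } from "node:crypto";

// Same formula as above; `sign` and `sameSignature` are hypothetical helpers.
const sign = (secret, ts, body) =>
  createHmac("sha256", secret).update(`${ts}.${body}`).digest();
const sameSignature = (a, b) => a.length === b.length && timingSafeEqual(a, b);

const secret = "example_secret"; // made-up value
const ts = "1714323840";

// Bytes exactly as the sender put them on the wire (note the spaces).
const rawBody = '{"type": "payment.succeeded", "amount": 1200}';

// What you get back if you parse first and re-serialize later.
const reserialized = JSON.stringify(JSON.parse(rawBody));

const sentSignature = sign(secret, ts, rawBody);
const rawMatches = sameSignature(sign(secret, ts, rawBody), sentSignature);
const reserializedMatches = sameSignature(sign(secret, ts, reserialized), sentSignature);

console.log({ rawMatches, reserializedMatches });
// { rawMatches: true, reserializedMatches: false }
```

The parsed object is identical in both cases; only the whitespace differs, and that is enough to change the HMAC. In Express, for example, express.raw({ type: "application/json" }) hands the handler the body as a Buffer, so it can be verified before it is parsed.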

Replay protection — timestamps with tolerance

A captured signed payload is replayable forever unless you bound it in time. Senders include a Unix timestamp; receivers refuse anything outside (say) 5 minutes:

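const tolerance = 300; // 5 minutes
const tsSeconds = Number(ts); // header values arrive as strings
// NaN fails every comparison, so reject non-numeric timestamps explicitly.
if (!Number.isFinite(tsSeconds) || Math.abs(Date.now() / 1000 - tsSeconds) > tolerance) {
  return res.status(400).json({ error: "timestamp_out_of_tolerance" });
}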

The tolerance is a trade-off: too low and clock skew or slow networks break legitimate deliveries; too high and replay windows are wide. 5 minutes is the convention.
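
Putting the timestamp check and the signature check together, a receiver-side verifier might look like the sketch below. It is one possible arrangement under stated assumptions: the function name verifyWebhook, the lower-cased header keys, the result shape, and the "v1,<hex>" signature format (borrowed from the flow diagram) are not prescribed by anything above.

```javascript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 300; // 5 minutes, as above

// Hypothetical helper. Returns { ok: true } or { ok: false, error }.
// `headers` is assumed to use lower-cased names, as Node's http module provides.
function verifyWebhook(secret, headers, rawBody, nowMs = Date.now()) {
  const tsHeader = headers["webhook-timestamp"];
  const ts = Number(tsHeader);
  if (!Number.isFinite(ts) || Math.abs(nowMs / 1000 - ts) > TOLERANCE_SECONDS) {
    return { ok: false, error: "timestamp_out_of_tolerance" };
  }

  // Assumes the "v1,<hex>" format; anything else fails the checks below.
  const [version, sigHex = ""] = String(headers["webhook-signature"] ?? "").split(",");
  const expected = createHmac("sha256", secret)
    .update(`${tsHeader}.${rawBody}`) // sign the header string exactly as received
    .digest();
  const received = Buffer.from(sigHex, "hex");

  // timingSafeEqual throws on unequal lengths, so check the length first.
  if (
    version !== "v1" ||
    received.length !== expected.length ||
    !timingSafeEqual(received, expected)
  ) {
    return { ok: false, error: "bad_signature" };
  }
  return { ok: true };
}
```

Because the timestamp is part of the signed string, an attacker cannot refresh an old capture by editing the timestamp header: the signature would no longer match.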

Retries with backoff

The receiver can be flaky: networks, deploys, transient bugs. The sender retries, but not bluntly. Retries that hammer a struggling endpoint at fixed intervals make outages worse.

The pattern is exponential backoff with jitter:

attempt 1: immediate
attempt 2: ~10s + jitter
attempt 3: ~1m + jitter
attempt 4: ~10m + jitter
attempt 5: ~1h + jitter
attempt 6: ~6h + jitter
attempt 7: ~24h + jitter

Stripe retries for up to 3 days before giving up. After that, the event lands in a dead-letter location.
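
The schedule above can be expressed as a small lookup with jitter added on top. This is a sketch under assumptions: the base delays are copied from the list, but the jitter range (up to +20% of the base), the function name, and the null return for an exhausted budget are illustrative choices. The list is not a pure power-of-n curve, so a table is used rather than a formula.

```javascript
// Base delays in seconds, copied from the schedule above (attempt 1 is immediate).
const BASE_DELAYS_SECONDS = [0, 10, 60, 600, 3600, 21600, 86400];
const JITTER_FRACTION = 0.2; // assumption: up to +20% random jitter

// Returns seconds to wait before `attempt` (1-based), or null when the
// retry budget is spent and the event should be dead-lettered.
// `random` is injectable so the function can be tested deterministically.
function delayBeforeAttempt(attempt, random = Math.random) {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  if (attempt > BASE_DELAYS_SECONDS.length) return null;
  const base = BASE_DELAYS_SECONDS[attempt - 1];
  return base + base * JITTER_FRACTION * random();
}

for (let attempt = 1; attempt <= 8; attempt++) {
  console.log(attempt, delayBeforeAttempt(attempt));
}
```

Jitter matters when many deliveries fail at the same moment: without it, every retry for a recovering endpoint lands in the same second.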

What status codes trigger retry?

Code              Retry?                  Reason
2xx               No                      Success: drop the schedule.
3xx               No                      Redirect: follow up to a limit, then treat as 2xx or fail.
4xx               No (except 408, 429)    Caller mistake: retry won't help.
408               Yes                     Request timeout.
429               Yes                     Rate-limited: honour Retry-After if present.
5xx               Yes                     Server error: almost always transient.
timeout/network   Yes                     Connection failed before any response.

The 4xx-no-retry rule keeps you from spamming a misconfigured receiver forever.
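
The table translates into a short decision function. The sketch below is illustrative: both function names are hypothetical, and null is used here as a stand-in for "no response at all". Retry-After can be either a number of seconds or an HTTP-date, so the second helper handles both.

```javascript
// Hypothetical helper: `status` is the final HTTP status code (after any
// redirects were followed), or null when the request timed out or failed
// before any response arrived.
function shouldRetry(status) {
  if (status === null) return true; // timeout / network error
  if (status === 408 || status === 429) return true;
  if (status >= 500 && status <= 599) return true;
  return false; // 2xx, 3xx, and every other 4xx
}

// Retry-After is either delta-seconds or an HTTP-date (RFC 9110).
// Returns a delay in seconds, or null if the header is absent or unparseable.
function retryAfterSeconds(headerValue, nowMs = Date.now()) {
  if (typeof headerValue !== "string") return null;
  const value = headerValue.trim();
  if (/^\d+$/.test(value)) return Number(value);
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}

console.log(shouldRetry(503), shouldRetry(404), retryAfterSeconds("120"));
```

A sender would typically wait for the larger of its own backoff delay and the value the receiver asked for.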

Dead-letter handling

Eventually retries run out. Where does the event go?

The right answer: a dead-letter store the customer can inspect and replay from. SaaS webhook providers (Svix, Hookdeck) build a UI for this. Self-hosted: a Postgres table or an SQS DLQ works fine. Fields:

event_id           pk
event_type
payload            jsonb
last_attempt_at    timestamptz
last_error_status  int
attempt_count      int
deadlettered_at    timestamptz

The customer needs three things: a list view, a per-event detail page (request, response, error), and a "retry" button that re-enqueues with a fresh schedule.

The "retry" button is the difference between an integration that works and one that loses data.
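
As a rough illustration of the dead-letter record and the replay action, here is an in-memory sketch. The Map and array stand in for the table and the delivery queue, and every function name is hypothetical; a real system would back this with Postgres or a queue as described above.

```javascript
// In-memory stand-ins; the record fields mirror the column list above.
const deadLetters = new Map(); // event_id -> record
const deliveryQueue = []; // events waiting for a delivery attempt

// Called when the retry budget is exhausted.
function deadLetter(event, lastErrorStatus, attemptCount, now = new Date()) {
  deadLetters.set(event.id, {
    event_id: event.id,
    event_type: event.type,
    payload: event,
    last_attempt_at: now,
    last_error_status: lastErrorStatus,
    attempt_count: attemptCount,
    deadlettered_at: now,
  });
}

// The "retry" button: move the event back to the queue with a fresh schedule.
// Returns false if the event is not in the dead-letter store.
function replay(eventId) {
  const record = deadLetters.get(eventId);
  if (!record) return false;
  deadLetters.delete(eventId);
  deliveryQueue.push({ event: record.payload, attempt: 1 });
  return true;
}

deadLetter({ id: "evt_8a6f3c2b9d4e1f50", type: "payment.succeeded" }, 503, 7);
console.log(replay("evt_8a6f3c2b9d4e1f50"), deliveryQueue.length);
```

The replayed delivery keeps its original event ID, which is what lets the receiver recognise it as a duplicate if the earlier attempt had in fact been processed.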

The receiver's side — idempotency

Your receiver will see duplicates. Two reasons:

  1. The sender's retry won the race against the receiver's ACK.
  2. A "retry" button on the dashboard re-fired an already-processed event.

Solve once, at the front door:

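CREATE TABLE webhook_events_seen (
  event_id    text PRIMARY KEY,
  received_at timestamptz NOT NULL DEFAULT now()
);

// isUniqueViolation is a helper you supply; with Postgres it would check
// for error code "23505" (unique_violation).
async function handle(event) {
  try {
    await db.query(
      "INSERT INTO webhook_events_seen (event_id) VALUES ($1)",
      [event.id]
    );
  } catch (e) {
    if (isUniqueViolation(e)) return; // duplicate delivery; ack and ignore
    throw e;
  }
  try {
    await processEvent(event);
  } catch (e) {
    // Release the marker so the sender's retry is not swallowed as a duplicate.
    await db.query("DELETE FROM webhook_events_seen WHERE event_id = $1", [event.id]);
    throw e;
  }
}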

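The unique constraint on event_id means a duplicate INSERT throws; you swallow the duplicate and ACK, so a successfully processed event is never processed twice. If processing fails, the marker has to be released (or written in the same transaction as the work), otherwise the sender's retry would be swallowed as a duplicate and the event lost.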

Webhook receiver fatigue

Customers integrating with your webhooks suffer from "webhook fatigue":

  • 14 different signature schemes if you've integrated 14 vendors.
  • Surprise event types that don't appear in the docs.
  • Surprise schema migrations: a field becomes optional, code that destructures it crashes.
  • Replays that re-fire ancient events when a customer enables a new integration.

Mitigations:

  1. Use Standard Webhooks — a community spec for the headers, signing scheme, and retry policy. One verifier, many vendors.
  2. Document every event type up front — what triggers it, what fields it contains, examples.
  3. Version events: payment.succeeded.v2 is OK; silently changing payment.succeeded is not.
  4. Make replays explicit — never silently re-fire historical events on integration setup.

Summary

Webhooks are five problems wearing a trench coat:

  1. Sign with HMAC over (timestamp.body) — verify the raw body in constant time.
  2. Bound replay windows — reject timestamps outside ±5 minutes.
  3. Retry 5xx/timeout (plus 408 and 429) with exponential backoff + jitter; don't retry other 4xx.
  4. Dead-letter after the budget — give the customer a list + replay button.
  5. Receiver dedups by event ID — process exactly once even with duplicates.

Build all five from day one; retrofitting any one of them is painful.

Tools in the wild

  • An OSS spec + reference SDKs that consolidate the Stripe/GitHub/Slack webhook conventions. (spec)
  • Webhooks-as-a-service, with signing, retries, DLQ, and replay UI built in. (service)
  • Webhook event gateway with queueing, retries, and a verification dashboard. (service)
  • ngrok (free tier): tunnel a local dev port to the public web so a webhook can reach localhost. (cli)