Rollback and Feature Flags

Progressive delivery, auto-rollback, and decoupling deploy from release.

Bad releases happen. The question isn't "how do we prevent every one" — that's an unwinnable war. The question is "how fast can we recover, and at what cost". Two complementary tools: progressive delivery with auto-rollback for deploy-time risk, and feature flags for runtime risk.

Progressive delivery — the canary recipe

The naive deploy: replace v1 with v2 on every instance simultaneously. If v2 is broken, 100% of users see the breakage from the first second.

Progressive delivery routes a small percentage of traffic to v2 first, watches metrics, and only rolls out further if everything looks good:

t=0:    100% → v1     0% → v2
t=5m:    95% → v1     5% → v2     ← canary starts
              [observe success rate, latency p99, error rate]
t=15m:   75% → v1    25% → v2     ← if SLOs hold
t=25m:   50% → v1    50% → v2
t=35m:    0% → v1   100% → v2     ← v2 fully promoted

If at ANY step the SLO breaches (error rate > 1%, p99 > 500ms, whatever you defined), the rollout halts and automatically rolls back — traffic returns to v1 within seconds.
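The stepping-and-gating logic above can be sketched as a small simulation. This is a simplified model, not the implementation of any specific tool: the type names, the `runCanary` function, and the `observe` callback are all hypothetical, and a real controller would shift traffic and query a metrics backend instead of calling a function.

```typescript
// Hypothetical types: a simplified model of an SLO-gated canary rollout.
interface Observation {
  errorRate: number; // fraction of failed requests, e.g. 0.004 = 0.4%
  p99Ms: number;     // p99 latency in milliseconds
}

interface SloGate {
  maxErrorRate: number;
  maxP99Ms: number;
}

interface RolloutResult {
  outcome: "promoted" | "rolled-back";
  finalCanaryWeight: number; // percent of traffic left on v2
  stepsCompleted: number;
}

// Percent of traffic sent to v2 at each step, matching the timeline above.
const CANARY_STEPS: readonly number[] = [5, 25, 50, 100];

function sloHolds(obs: Observation, gate: SloGate): boolean {
  return obs.errorRate <= gate.maxErrorRate && obs.p99Ms <= gate.maxP99Ms;
}

// Walks through the steps. `observe` stands in for "shift traffic to this
// weight, wait, and read the metrics"; it is an assumption of this sketch.
function runCanary(
  observe: (canaryWeight: number) => Observation,
  gate: SloGate,
  steps: readonly number[] = CANARY_STEPS,
): RolloutResult {
  let stepsCompleted = 0;
  for (const weight of steps) {
    const obs = observe(weight);
    if (!sloHolds(obs, gate)) {
      // SLO breach: halt and send all traffic back to v1.
      return { outcome: "rolled-back", finalCanaryWeight: 0, stepsCompleted };
    }
    stepsCompleted += 1;
  }
  return { outcome: "promoted", finalCanaryWeight: 100, stepsCompleted };
}
```

With a gate of 1% errors and 500ms p99, a version that starts failing at the 25% step is rolled back after completing only the 5% step, so at most a quarter of traffic ever reached it.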

Tools that automate this:

  • Argo Rollouts (Kubernetes) — defines canary steps + analysis templates that query Prometheus.
  • Flagger (Kubernetes, FluxCD-aligned) — same idea, integrates with Datadog/Prometheus/CloudWatch.
  • AWS CodeDeploy — canary at the load-balancer level, automated rollback on alarm trigger.
  • GCP Traffic Director (now part of Cloud Service Mesh): weighted traffic splitting for Envoy-based meshes; the SLO gate has to come from separate automation.

The auto-rollback is the magic. Without it, "canary" just means "fewer users see the breakage for a few minutes" — still bad. WITH it, a broken canary self-heals before most users notice.

SLO gates — what to actually measure

The canary's success criteria should be the same SLO you alert on in production:

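  • Success rate: (2xx + 3xx) / total ≥ 99%
  • Latency: p99 < 500ms
  • Error budget burn rate: observed error rate / allowed error rate (1 − SLO target); above 1, the budget is being spent faster than the SLO permits
  • Resource saturation: CPU < 80%, memory < 90%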

If your canary gets 5% of traffic, the canary needs to maintain the SAME ratios as v1 — not absolute counts. 5% of traffic with 1% errors is fine; the same 5% with 5% errors is a problem.
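To make the ratio-versus-count point concrete, here is a minimal sketch. The `TrafficSample` type, the `canaryHealthy` function, and both default thresholds are assumptions chosen for illustration, not values from any particular tool.

```typescript
// Hypothetical sample of traffic seen by one version during a window.
interface TrafficSample {
  requests: number;
  errors: number;
}

function errorRatio(sample: TrafficSample): number {
  return sample.requests === 0 ? 0 : sample.errors / sample.requests;
}

// The canary passes when its error RATIO is within the SLO threshold and
// no more than `tolerance` above the baseline's ratio. Absolute error
// counts are never compared, because the canary sees far less traffic.
function canaryHealthy(
  canary: TrafficSample,
  baseline: TrafficSample,
  maxErrorRatio = 0.01, // assumed SLO: at most 1% errors
  tolerance = 0.005,    // assumed slack over the baseline's ratio
): boolean {
  const canaryRatio = errorRatio(canary);
  return (
    canaryRatio <= maxErrorRatio &&
    canaryRatio <= errorRatio(baseline) + tolerance
  );
}
```

For example, a baseline with 95,000 requests and 475 errors (0.5%) next to a canary with 5,000 requests and 250 errors: the canary has fewer errors in absolute terms, yet its 5% error ratio fails the gate.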

Argo Rollouts' analysis template makes this explicit:

metrics:
  - name: success-rate
    interval: 1m
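    successCondition: result[0] >= 0.99      # 99% success
    # The run fails once failed measurements EXCEED failureLimit, so 2 means
    # the 3rd failure (not necessarily consecutive) aborts and rolls back.
    failureLimit: 2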
    provider:
      prometheus:
        address: http://prometheus.svc:9090
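        # In a real setup, scope the selector to the canary pods (the label
        # depends on your environment); otherwise v1 traffic dilutes the signal.
        query: |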
          sum(rate(http_requests_total{job="my-svc",status=~"2..|3.."}[1m])) /
          sum(rate(http_requests_total{job="my-svc"}[1m]))

Three failed one-minute measurements (success rate below 99%) → automatic revert.
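How a failure limit turns individual measurements into a verdict can be sketched as follows. This is a simplified model under one stated assumption: the run fails once the number of failed measurements exceeds the limit, which is how Argo Rollouts' `failureLimit` is generally documented. Check the docs for your version. The function name and return shape are hypothetical.

```typescript
interface AnalysisVerdict {
  verdict: "passing" | "failed";
  failedAtMeasurement: number | null; // 1-based index of the fatal measurement
}

// Simplified model: each entry in `successRates` is one interval's measurement.
// Assumption: the run fails when failures EXCEED `failureLimit`, and the
// failures do not need to be consecutive.
function evaluateAnalysis(
  successRates: readonly number[],
  threshold: number,
  failureLimit: number,
): AnalysisVerdict {
  let failures = 0;
  let measurement = 0;
  for (const rate of successRates) {
    measurement += 1;
    if (rate < threshold) {
      failures += 1;
      if (failures > failureLimit) {
        return { verdict: "failed", failedAtMeasurement: measurement };
      }
    }
  }
  return { verdict: "passing", failedAtMeasurement: null };
}
```

Under this model, a limit of 2 tolerates two bad minutes and fails on the third, even if healthy measurements fall in between.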

Feature flags — decoupling deploy from release

Sometimes the risk isn't in shipping the code; it's in turning the behavior on. Feature flags handle that case.

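// Both paths ship in the same binary; the flag picks one at request time.
// `flags` is the flag client; the render functions are defined elsewhere.
async function renderCheckout(userId) {
  if (await flags.isEnabled("new-checkout-flow", { userId })) {
    return renderNewCheckout();
  }
  return renderOldCheckout();
}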

The feature can be "in production but inactive" for any duration. Then:

  • Enable for 1% of users via the flag dashboard. Watch for issues.
  • Bump to 10%. Watch.
  • Bump to 50%. Watch.
  • Bump to 100%.

If something breaks at any step, flip the flag back to 0%. Recovery is seconds, not "kick off a deploy and pray". The full code path is still in the binary; the flag dashboard just stops sending traffic to it.
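One common way a percentage rollout can work under the hood is deterministic bucketing: hash the flag key plus the user ID into a fixed range and compare against the rollout percentage. The sketch below assumes that approach. The hash is an FNV-1a-style function over UTF-16 code units, chosen only because it needs no dependencies, and `bucketFor` and `isInRollout` are hypothetical names, not any vendor's API.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (dependency-free, deterministic).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a (flag, user) pair to a stable bucket in the range 0..9999.
// Including the flag key keeps buckets independent across flags.
function bucketFor(flagKey: string, userId: string): number {
  return fnv1a32(`${flagKey}:${userId}`) % 10000;
}

// `percent` is 0..100. A user whose bucket falls below the cutoff is in.
function isInRollout(flagKey: string, userId: string, percent: number): boolean {
  return bucketFor(flagKey, userId) < percent * 100;
}
```

Because each user's bucket is fixed, raising the percentage only adds users: everyone included at 1% is still included at 10%, and setting it to 0 excludes everyone without a deploy.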

Flag systems also support targeting beyond percentages:

flags.isEnabled("new-checkout-flow", {
  userId,
  custom: {
    plan: user.plan,         // free / pro / enterprise
    region: user.region,     // us / eu / apac
    isInternal: user.isStaff,
  },
});

Now you can roll out to internal staff first, then to enterprise customers, then to everyone — all without redeploying.
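A staged rollout like this is often expressed as an ordered rule list where the first matching rule wins. The sketch below assumes that model. `FlagContext`, `TargetingRule`, `evaluateFlag`, and the rule set are hypothetical, and real flag platforms have their own rule formats.

```typescript
interface FlagContext {
  userId: string;
  plan: "free" | "pro" | "enterprise";
  region: "us" | "eu" | "apac";
  isInternal: boolean;
}

interface TargetingRule {
  description: string;
  matches: (ctx: FlagContext) => boolean;
  enabled: boolean;
}

// First matching rule decides; if nothing matches, fall back to the default.
function evaluateFlag(
  rules: readonly TargetingRule[],
  ctx: FlagContext,
  defaultValue = false,
): boolean {
  for (const rule of rules) {
    if (rule.matches(ctx)) {
      return rule.enabled;
    }
  }
  return defaultValue;
}

// Stage 2 of the rollout described above: staff and enterprise are on,
// everyone else stays on the old path until a further rule is added.
const stagedCheckoutRules: readonly TargetingRule[] = [
  { description: "internal staff", matches: (c) => c.isInternal, enabled: true },
  { description: "enterprise customers", matches: (c) => c.plan === "enterprise", enabled: true },
];
```

Widening the audience means editing the rule list in the flag system, not the binary, which is what makes each stage reversible without a deploy.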

Roll-forward vs roll-back

Two cultures, both legitimate:

  • Roll-forward: ship a fix, deploy that. Argues that reverting introduces its own risk (the database migration ran; rolling back the code without rolling back the migration leaves a broken state). Also argues that if your deploy pipeline is fast, "fix and re-deploy" is faster than "investigate the rollback".
  • Roll-back: revert to the last known-good immediately, investigate after. Argues that recovery time matters more than blame analysis. Most outages get worse with time, not better.

Pick the one your team can execute under stress. Practice it.

A pragmatic blend:

  • For pure code changes: roll back. Cheap and instant.
  • For changes that touch DB schema, queues, or persistent state: roll FORWARD with a fix. Backwards-incompatible changes can't be reverted without leaving the persistent state inconsistent.
  • For feature flag failures: there is no "rollback" — flip the flag, you're done.

Backwards-compatible migrations

If you can never roll back state-touching changes, the discipline is to make every change backwards-compatible:

Step 1: add NEW columns alongside OLD (additive migration); write code that reads OLD-or-NEW schema (deploy)
Step 2: write code that writes NEW schema (deploy)
Step 3: backfill OLD records to NEW format (script)
Step 4: write code that reads ONLY NEW schema (deploy)
Step 5: drop OLD columns (migration)

Five steps. Each individually rollable. Slow but safe. The shortcut — drop OLD and add NEW in one migration — looks faster, but you can't roll back without manual data recovery.
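Steps 1 and 3 above can be sketched with a toy example. The schema is invented for illustration: the OLD shape stores a single `fullName`, the NEW shape stores `firstName` and `lastName`, and the naive whitespace split stands in for whatever conversion your data actually needs.

```typescript
// Hypothetical row during the migration window: OLD and NEW columns coexist.
interface UserRow {
  id: number;
  fullName?: string;  // OLD column
  firstName?: string; // NEW columns
  lastName?: string;
}

interface DisplayName {
  firstName: string;
  lastName: string;
}

// Step 1: a reader that tolerates both shapes and prefers NEW when present.
function readName(row: UserRow): DisplayName {
  if (row.firstName !== undefined && row.lastName !== undefined) {
    return { firstName: row.firstName, lastName: row.lastName };
  }
  // Fallback for rows not yet backfilled: naive split on whitespace.
  const [first = "", ...rest] = (row.fullName ?? "").trim().split(/\s+/);
  return { firstName: first, lastName: rest.join(" ") };
}

// Step 3: the backfill fills NEW columns but leaves OLD untouched, so code
// from earlier steps keeps working if a deploy has to be rolled back.
function backfillRow(row: UserRow): UserRow {
  if (row.firstName !== undefined && row.lastName !== undefined) {
    return row; // already migrated
  }
  return { ...row, ...readName(row) };
}
```

The key property is that a backfilled row still carries `fullName`; only the final step removes it, after no deployed code reads it anymore.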

Combine: canary + flags

The full modern stack:

1. Code path is feature-flagged off.
2. Deploy via canary (progressive-delivery + SLO gates).
3. Deploy succeeds — code is now in prod, flag is off.
4. Toggle the flag for 1% of users.
5. Watch metrics. Bump to 10%, 25%, 100%.
6. If metrics regress at any %: flip the flag back, fix forward, retry.

The deploy is decoupled from the release, and each has a fast rollback path for its own failure mode: automatic traffic reversion for a bad deploy, a flag flip for a bad release. Large tech companies have worked this way for years, and off-the-shelf tools now make it achievable for everyone.

Summary

  • Use canary + SLO-gated auto-rollback for deploy-time risk.
  • Use feature flags to decouple deploy from release; flipping a flag rolls back runtime behavior in seconds, with no deploy.
  • For state-touching changes, make every step backwards-compatible.
  • Pick a roll-forward or roll-back culture and practice it.
  • Combine canary + flags for the full modern progressive-delivery stack.

The goal is "the cost of a bad release is small enough that we can absorb it without a war room". Both tools push that cost down.

Tools in the wild

  • Hosted feature-flag platform with %-rollout, targeting, and experimentation built in (service).
  • Statsig (service, free tier): Feature flags + experimentation with statistical guardrails on each rollout step.
  • Unleash (library, free tier): Open-source feature-flag platform; self-host for full control.
  • Argo Rollouts (library, free tier): Kubernetes-native canary + blue/green controller with SLO-driven auto-promote.
  • Flagger (library, free tier): FluxCD's progressive-delivery operator with Prometheus + Datadog SLO gates.