🟡 Intermediate

SSO Break-Glass & Secret Rotation — Operational patterns

Single sign-on is one of the highest-leverage simplifications in modern operations — but it is also a single point of failure. If the identity provider is down, everything behind it is down. This article covers two related operational patterns: break-glass access that lets an operator bypass SSO during an outage, and secret rotation that keeps the credentials feeding into SSO and into adjacent systems fresh without paging the whole team.

Both topics are deeply operational. Get them wrong and you discover the problem at the worst possible moment — usually at 3am while the identity provider is the thing on fire.


Why a single SSO point matters operationally

When all of your internal tools (CI dashboards, deploy consoles, monitoring) live behind one forward-auth gate, an outage of that gate locks you out at the moment you are most likely to need access. The classic mitigations are:

  1. A redundant identity provider (expensive, complex, often more dangerous than the disease).
  2. A documented break-glass path that a single on-call human can execute without depending on the down system.

Most teams should pick option 2 first. Option 1 only makes sense once break-glass has been drilled and the failure modes are understood.


Break-glass: what it is, what it is not

A break-glass mechanism is an emergency-only alternate path into a system. It must be:

  • Independent of the failed component (no shared infrastructure with the thing that's down).
  • Auditable (every use logged loudly, usually with a Telegram or Slack alert and a journald entry).
  • Restrictive (limited to one or two operators, with a credential they can carry offline).
  • Rare (drilled quarterly, used only when standard paths fail).

It is not a daily-driver alternate login. The moment break-glass becomes routine, it is no longer break-glass — it is just another auth path, and it inherits all the problems of the system you were trying to bypass.

The loopback + SSH bypass pattern

A common shape on small fleets:

  1. The application sits behind a forward-auth reverse proxy (Caddy + login service + JWT cookie).
  2. The application itself will accept a request that comes from 127.0.0.1 with an X-Auth-User header set — this is the loopback bypass.
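  3. Host firewall rules and the port binding (not the reverse proxy itself) ensure that the loopback port is only reachable from the host's docker bridge (or from 127.0.0.1 directly), never from the internet. The reverse proxy must also drop any client-supplied X-Auth-User header so the bypass cannot be spoofed through the normal path.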
  4. An operator who can SSH into the host can therefore curl the app directly with the bypass header, skipping the (down) SSO entirely.

# Example: replace admin@example.com, port 7676, and the path with your own values.
# This only works because the operator is already on the host via SSH.
curl -fsS -H "X-Auth-User: admin@example.com" \
     http://127.0.0.1:7676/admin/restart-queue

The SSH credentials become the new root-of-trust. Protect them, rotate them, and never put SSH keys behind the SSO that they are meant to bypass.
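The app-side half of the pattern (step 2 above) is easy to get subtly wrong, so here is a minimal sketch of the check, assuming the application can see the peer address of each request. The names `TRUSTED_NETWORKS` and `bypass_user` are hypothetical, and the header lookup assumes your framework has already normalized header names.

```python
import ipaddress
from typing import Mapping, Optional

# Hypothetical allow-list: loopback only. Add a docker bridge CIDR here only
# if ordinary proxied traffic does not arrive from that same network.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
]


def bypass_user(remote_addr: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the user asserted by X-Auth-User if the peer is trusted, else None."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable peer address: never grant the bypass.
        return None
    if not any(addr in net for net in TRUSTED_NETWORKS):
        return None
    user = headers.get("X-Auth-User", "").strip()
    return user or None
```

The decision is based on the socket peer address, not on a forwarded-for header, because anything a client can set is not evidence of where the request came from.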

Drilling break-glass

A break-glass path that is not drilled is a path that does not work. Drill it on a regular cadence (quarterly is a reasonable starting point) and time the result. A common acceptance threshold is:

From the moment an operator decides SSO is down, they must be able to perform a routine admin action via break-glass within 2 minutes.

If your drill blows the budget, the failure is usually one of:

  • Documentation is stale (the command changed when the app changed).
  • The bypass port has been firewalled away during an unrelated cleanup.
  • The on-call operator doesn't have the SSH key on the device they're holding.

Drill, fix, re-drill. Treat the drill log as the canonical "is break-glass alive" indicator — not the existence of the script.
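To make the drill log useful as an indicator, each entry needs a pass/fail result against the budget. The following is a small sketch under the assumptions used in this article (a 2-minute budget and a 90-day freshness window); the function names are hypothetical.

```python
from datetime import datetime, timedelta

DRILL_BUDGET = timedelta(minutes=2)      # acceptance threshold from the text
DRILL_MAX_AGE = timedelta(days=90)       # "drilled in the last 90 days"


def drill_result(decided_at: datetime, action_done_at: datetime,
                 budget: timedelta = DRILL_BUDGET) -> dict:
    """Score one drill: time from 'SSO is down' to a completed admin action."""
    elapsed = action_done_at - decided_at
    return {
        "elapsed_seconds": elapsed.total_seconds(),
        "passed": timedelta(0) <= elapsed <= budget,
    }


def drill_is_fresh(last_passing_drill: datetime, now: datetime,
                   max_age: timedelta = DRILL_MAX_AGE) -> bool:
    """True if the most recent passing drill is recent enough to trust."""
    return now - last_passing_drill <= max_age
```

Appending each `drill_result` to a log, and alerting when `drill_is_fresh` turns false, gives you the "is break-glass alive" signal without relying on anyone's memory.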


Secret rotation: the same problem in slow motion

Secrets (HMAC keys shared with webhooks, database root passwords, API tokens for external services) have a natural lifetime. Past that lifetime they accumulate risk: more places where they have leaked into logs, more former employees who saw them, more services that hold a stale copy.

The operational pattern that works:

  1. Inventory — know what secrets exist, where they live, who owns them, and when they were last rotated.
  2. Automate the rotation flow for every secret that can be rotated without human judgment.
  3. Gate dangerous rotations behind dry-run + maintenance-window confirmation.
  4. Track rotation events in an authoritative ledger — never trust file modification times.
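Steps 1 to 3 can be sketched as one inventory record plus a gate. This is illustrative only: the `SecretEntry` fields and the `may_rotate` helper are hypothetical names, and what counts as a "dangerous" rotation is a judgment your team has to make per secret.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SecretEntry:
    secret_id: str
    location: str           # where the secret lives (env file, vault path, ...)
    owner: str               # human or team accountable for it
    last_rotated: date       # taken from the ledger, not from file mtimes
    needs_human_gate: bool   # True for rotations that can cause an outage


def may_rotate(entry: SecretEntry, *, dry_run_ok: bool = False,
               window_confirmed: bool = False) -> bool:
    """Low-risk secrets rotate freely; gated ones need dry-run + window."""
    if not entry.needs_human_gate:
        return True
    return dry_run_ok and window_confirmed
```

Keeping the gate as data on the inventory entry, rather than as a special case inside each rotation script, means a newly added secret has to declare its risk up front.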

Why mtimes lie

A common temptation is to use stat -c %Y on the env file as a "when was this last rotated" signal. It is almost always wrong:

  • A service restart that re-writes the env file (Dokku does this on every config:set) bumps the mtime.
  • A file copy as part of a host migration resets the mtime to the migration date, unless the copy explicitly preserves timestamps.
  • An editor that re-saves the file without changing the secret bumps the mtime.

You need an authoritative ledger — a single TSV (or DB table, or audit log) that records (timestamp, secret_id, operator, extras) for every successful rotation. The rotation script appends to it on success; a staleness watchdog reads it to alert on overdue secrets. The file mtime is descriptive at best, evidentiary never.
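A minimal sketch of such a ledger and its watchdog, assuming a TSV file with the four columns named above and ISO 8601 UTC timestamps. The function names and the idea of passing in the list of known secret IDs from the inventory are assumptions made for illustration.

```python
import csv
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Iterable, List, Optional


def record_rotation(ledger: Path, secret_id: str, operator: str,
                    extras: str = "", now: Optional[datetime] = None) -> None:
    """Append one row; call this only after the rotation has succeeded."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    with ledger.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow([ts, secret_id, operator, extras])


def overdue(ledger: Path, known_ids: Iterable[str], max_age: timedelta,
            now: Optional[datetime] = None) -> List[str]:
    """Return secrets never rotated or last rotated longer ago than max_age."""
    now = now or datetime.now(timezone.utc)
    latest = {}
    if ledger.exists():
        with ledger.open(newline="", encoding="utf-8") as fh:
            for ts, secret_id, _operator, _extras in csv.reader(fh, delimiter="\t"):
                when = datetime.fromisoformat(ts)
                if secret_id not in latest or when > latest[secret_id]:
                    latest[secret_id] = when
    return sorted(
        s for s in known_ids
        if s not in latest or now - latest[s] > max_age
    )
```

Secrets that appear in the inventory but never in the ledger are reported as overdue, which is the case the mtime approach tends to hide.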

The overlap-then-cutover pattern

For secrets shared with multiple consumers (e.g., a webhook HMAC shared with N GitHub repos), single-secret rotation creates a guaranteed downtime window: the moment you flip the validator to the new secret, every consumer still signing with the old one breaks.

The fix is to teach the validator to accept either secret during a rotation window:

WEBHOOK_SECRETS=old_secret,new_secret

Then the flow becomes:

  1. Phase 1 — set WEBHOOK_SECRETS=old,new on the validator. Both accepted.
  2. Phase 2 — walk every consumer (every repo, every webhook config), PATCH the secret to new.
  3. Phase 3 — set WEBHOOK_SECRETS=new. Old secret is now dead.

The window between phases 1 and 3 is the overlap. It must outlast the time it takes to update your slowest consumer, plus any in-flight deliveries still signed with the old secret (for GitHub repos this is minutes, not hours).

Without overlap support, you have downtime by design. Every push during the gap fails 403 until the consumer catches up.
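The dual-secret validator that makes phase 1 possible might look like the sketch below. It assumes a GitHub-style signature header of the form `sha256=<hex digest>` computed with HMAC-SHA256 over the raw request body; `load_secrets` and `signature_ok` are hypothetical names.

```python
import hashlib
import hmac
from typing import List


def load_secrets(raw: str) -> List[bytes]:
    """Parse a WEBHOOK_SECRETS value such as 'old_secret,new_secret'."""
    return [part.strip().encode() for part in raw.split(",") if part.strip()]


def signature_ok(body: bytes, signature_header: str, secrets: List[bytes]) -> bool:
    """Accept the request if any currently configured secret produced the signature."""
    ok = False
    for secret in secrets:
        expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
        # compare_digest is a constant-time comparison; every secret is checked
        # so the work done does not depend on which one matched.
        if hmac.compare_digest(expected.encode(), signature_header.encode()):
            ok = True
    return ok
```

In a running service the call would be something like `signature_ok(raw_body, header_value, load_secrets(os.environ["WEBHOOK_SECRETS"]))`. An empty secret list rejects everything, which is the safe failure mode if the variable is unset.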

Lesson from a real incident: a rotation script that updated the host service env but missed the actual validator container left every webhook returning 403 for 40 minutes until the discrepancy was noticed. The container, not the host, was the validator. Verify your mental model of "where does the check actually run" before writing the rotation script; even the simplest deployment topology can hide surprises.
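One way to guard against that class of mistake is to probe the real validator before touching any consumer. The sketch below wires the three phases together with injected callables, so nothing here assumes a particular API; `set_validator_secrets`, `validator_accepts`, and `update_consumer` are hypothetical hooks you would implement for your own deployment.

```python
from typing import Callable, Iterable, List


def rotate_with_overlap(
    old: str,
    new: str,
    consumers: Iterable[str],
    set_validator_secrets: Callable[[List[str]], None],
    validator_accepts: Callable[[str], bool],
    update_consumer: Callable[[str, str], bool],
) -> List[str]:
    """Run the three phases; return the consumers that failed to update.

    Phase 3 is skipped while any consumer may still be signing with `old`.
    """
    # Phase 1: the validator accepts both secrets.
    set_validator_secrets([old, new])

    # Probe the component that actually performs the check, not the host
    # config, so a missed container is caught before consumers are changed.
    if not validator_accepts(new):
        raise RuntimeError("validator rejects the new secret; aborting before phase 2")

    # Phase 2: walk every consumer and switch it to the new secret.
    failed = [c for c in consumers if not update_consumer(c, new)]

    # Phase 3: retire the old secret only when every consumer has moved.
    if not failed:
        set_validator_secrets([new])
    return failed
```

If the function returns a non-empty list, the validator is still in overlap mode and the rotation can be resumed after fixing the listed consumers.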


Putting them together: a quarterly drill

A reasonable quarterly cadence for a small fleet:

| When | Action | Why |
|---|---|---|
| Q-start | Run break-glass drill against staging IdP. Time it. | Confirm path still works. |
| Q-mid | Trigger one automated rotation (HMAC is a good first candidate: low blast). | Validate scripts before high-blast secrets. |
| Q-end | Read the rotation ledger. Anything older than 90d? | Catch secrets that fell off the auto-rotate list. |

Schedule this as a calendar item with a human owner, not a cron job that nobody reads. Cron is good for the mechanical execution; the human owner is good for noticing that the cron stopped running three months ago because the script started exiting non-zero on an enumerated repo that no longer exists.


Checklist — start of next quarter

  • Break-glass path is documented in a single file, not in tribal memory.
  • Break-glass uses credentials that do not depend on the SSO it bypasses.
  • Break-glass has been drilled in the last 90 days; the drill log is reviewable.
  • Every secret in production has an entry in the inventory.
  • Every rotation event is in an authoritative ledger (not just file mtimes).
  • Validators that hold a shared secret support an overlap window (e.g. WEBHOOK_SECRETS=old,new).
  • A staleness watchdog runs daily and alerts on overdue rotations.

If any item is missing, your next 3am incident will tell you about it. Better to find out during business hours.


Further reading