# Secrets Rotation Runbooks for Hybrid Infrastructure
Excerpt: A practical, sanitized walkthrough for designing and executing repeatable secrets rotation runbooks across on-prem and cloud systems, with example inventories, maintenance windows, validation steps, and rollback patterns that security teams can adapt to their own environments.
Secrets rotation becomes painful in hybrid infrastructure for a simple reason: credentials rarely live in one place, and systems rarely fail in neat, isolated ways. A database password may be stored in a vault, injected into a container, mirrored into a configuration file for a legacy service, and referenced by a scheduled job on a bastion host. Rotating the secret is the easy part. Rotating it without breaking dependent systems is the real operation.
This article presents a sanitized and fictionalized runbook for rotating secrets in a mixed environment that includes on-prem systems, cloud workloads, containers, and scheduled automation. The goal is not to prescribe one toolchain, but to show the shape of a process that is safe, auditable, and repeatable.
All identifiers below are examples only. Example systems use hostnames such as bastion01, web01, proxy01, db01, and siem01, with example addresses from 10.0.1.x and 192.168.1.x. Example domains use example-corp.com.
## Why ad hoc rotation fails
Most failed rotations are not caused by bad cryptography. They are caused by incomplete dependency mapping and weak execution discipline. Security teams often know where a secret was created, but not all the places where it is consumed.
Typical failure modes include:
- Rotating an API key in the vault but forgetting a long-running worker that caches it in memory.
- Changing a database credential for an application tier but missing a reporting job on bastion01.
- Updating a certificate on proxy01 while an internal trust store still contains the old chain.
- Replacing a service account token without coordinating scheduled tasks or deployment pipelines.
- Cutting over all consumers at once without a rollback path.
A runbook solves this by forcing preparation, ordered execution, verification, and documented rollback.
## Build a rotation inventory first
Before touching any secret, create an inventory record. This inventory should exist in version-controlled documentation or your security operations repository, not in someone’s memory.
A minimal inventory schema might look like this:
```yaml
secret_id: prod/db/app-main
secret_type: password
owner_team: platform-security
system_owner: app-platform
criticality: high
rotation_frequency_days: 90
source_of_truth: vault-kv
created_by: automation
consumers:
  - host: web01
    path: /etc/example-app/app.env
    reload_method: systemctl restart example-app
  - host: bastion01
    path: /opt/jobs/reporting/.env
    reload_method: systemctl restart reporting-job
  - host: db01
    path: native account in database engine
    reload_method: immediate
validation:
  - nc -z db01.example-corp.com 5432
  - app synthetic login check
  - reporting job dry-run
rollback:
  method: restore previous version in vault and redeploy consumers
  max_rollback_window_minutes: 30
maintenance_window: Tue 20:00-21:00 CET
```
This is not bureaucracy. It is dependency mapping. If you cannot list consumers, you are not ready to rotate the secret safely.
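To make an inventory record actionable, it helps to gate every rotation behind a sanity check of the record itself. The sketch below assumes the example record is saved as `inventory.yaml` and uses deliberately naive grep-based parsing to avoid extra tooling; the file contents and field names are the example schema from above, not a prescribed format.

```shell
#!/usr/bin/env bash
# Sketch: refuse to start a rotation unless the inventory record has the
# required fields and at least one listed consumer. Parsing is naive (grep).
set -euo pipefail

cat > inventory.yaml <<'EOF'
secret_id: prod/db/app-main
secret_type: password
rotation_frequency_days: 90
consumers:
  - host: web01
  - host: bastion01
  - host: db01
EOF

# Required top-level fields must be present.
for field in secret_id secret_type rotation_frequency_days; do
  grep -q "^${field}:" inventory.yaml || { echo "missing field: ${field}"; exit 1; }
done

# A rotation with zero documented consumers is a rotation you do not understand.
consumer_count=$(grep -c 'host:' inventory.yaml)
[ "$consumer_count" -ge 1 ] || { echo "no consumers listed"; exit 1; }
echo "inventory ok: ${consumer_count} consumer(s)"
```

A real implementation would use a proper YAML parser, but the gate itself, no documented consumers means no rotation, is the point.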
## Classify secrets by rotation pattern
Not every secret should be rotated the same way. A good runbook library classifies them into patterns.
### 1. Dual-secret or overlapping-validity pattern
Use this when the target system supports multiple active credentials. Examples include API keys, certificate pairs during trust overlap, and some service accounts.
This is the safest pattern because you can create the new secret, distribute it, validate consumers, and then revoke the old one later.
### 2. Single-secret coordinated cutover pattern
Use this when only one valid credential can exist at a time, such as some database passwords or legacy application accounts. This requires a maintenance window and carefully ordered updates.
### 3. Derived-secret fan-out pattern
Use this when a root secret or key material generates multiple downstream artifacts, such as TLS assets or signed tokens. Rotating the root means updating trust relationships and caches across multiple systems.
### 4. Human-access credential pattern
Use this for break-glass accounts, administrator passwords, and emergency access tokens. These need extra controls: witness procedures, logging, and post-rotation access validation.
When teams mix these patterns, mistakes follow. Build separate runbooks for each class.
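The ordering that makes the dual-secret pattern (1) safe can be simulated locally. In the sketch below, files in a directory stand in for active API keys; a real system would call the provider's key-management API instead, but the sequence, create new, validate, only then revoke old, is the pattern itself.

```shell
# Sketch of the overlapping-validity pattern, simulated with local files
# standing in for active API keys. File names and values are illustrative.
set -euo pipefail
mkdir -p active-keys

# Step 1: the old key is live.
echo 'old-key-value' > active-keys/key-v1

# Step 2: create the new key alongside it; both are now valid.
echo 'new-key-value' > active-keys/key-v2

# Step 3: validate consumers against the new key before revoking anything.
grep -q 'new-key-value' active-keys/key-v2

# Step 4: only after validation succeeds, revoke the old key.
rm active-keys/key-v1

ls active-keys
```

If validation fails at step 3, nothing has been revoked and there is nothing to roll back, which is why this pattern should be preferred wherever the target system supports it.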
## Example runbook: rotating an application database password
Assume web01 hosts an internal application that talks to db01, and a scheduled reporting job on bastion01 also uses the same database account. The environment is hybrid: the password is stored in a centralized vault, but consumers read it from local environment files during service startup.
### Phase 1: Pre-change checks
The first phase proves you understand the current state.
```shell
# Verify application and database reachability
nc -z db01.example-corp.com 5432
curl -fsS https://web01.example-corp.com/healthz

# Check current service status on known consumers
ssh user@web01.example-corp.com 'systemctl is-active example-app'
ssh user@bastion01.example-corp.com 'systemctl is-active reporting-job'

# Confirm last successful reporting run
ssh user@bastion01.example-corp.com 'journalctl -u reporting-job -n 20 --no-pager'
```
If a dependent service is already degraded, stop. Rotation during instability creates ambiguous failures.
Next, export or document the current secret version metadata from the vault without exposing the value in logs.
```shell
vault kv metadata get secret/prod/db/app-main
```
Finally, verify rollback access. If your rollback depends on access to the previous version, ensure the old version is recoverable before you rotate.
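The pre-change checks are easiest to enforce when collected into one script that aborts on the first failure. In this sketch the real reachability and service checks are replaced with `true` placeholders so the structure is runnable anywhere; the commented commands show where the checks from above would go.

```shell
# Sketch: aggregate prechecks so a single failure aborts the rotation.
# Placeholder commands (`true`) stand in for the real checks.
set -euo pipefail

run_check() {
  local name="$1"; shift
  if "$@"; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name} - aborting rotation" >&2
    exit 1
  fi
}

run_check "db-reachable"  true  # e.g. nc -z db01.example-corp.com 5432
run_check "app-healthy"   true  # e.g. curl -fsS https://web01.example-corp.com/healthz
run_check "reporting-job" true  # e.g. ssh ... 'systemctl is-active reporting-job'
echo "prechecks complete"
```

Because the script exits nonzero on the first failed check, it can serve as a hard gate in an orchestration pipeline: no green prechecks, no rotation.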
### Phase 2: Generate and stage the new secret
Generate the new credential using a standard policy. Keep this deterministic from a process perspective, not from a value perspective.
```shell
openssl rand -base64 32 | tr -d '\n'
```
Store it in the secret manager as the next staged value, tagged for rotation but not yet announced as complete.
```shell
vault kv put secret/prod/db/app-main value='<redacted>' status='staged' rotated_at='2026-03-31T20:00:00Z'
```
For systems that support testing before cutover, create a temporary validation path. For example, if the application supports a secondary connection profile, deploy the new value there first. If not, be ready for a fast coordinated update.
### Phase 3: Update the source system
Change the secret at the authoritative target first. For a database password, that means updating the credential on db01.
Example SQL, sanitized for demonstration:
```sql
ALTER ROLE app_main WITH PASSWORD '<new-secret>';
```
Immediately record the execution timestamp and operator in the change log. If you delay here, incident responders later will not know when the old credential stopped being valid.
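One way to make that record-keeping automatic is to append a structured line to a change log the moment the source credential changes. The log path, field names, and operator value below are illustrative, not a prescribed format.

```shell
# Sketch: append a structured change-log entry when the source credential
# is rotated. Path, fields, and operator name are examples only.
set -euo pipefail

log_rotation_event() {
  local secret_id="$1" operator="$2" logfile="$3"
  printf '%s secret=%s operator=%s action=source-rotated\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$secret_id" "$operator" >> "$logfile"
}

log_rotation_event 'prod/db/app-main' 'jdoe' rotation.log
cat rotation.log
```

The timestamp marks the exact moment the old credential stopped being valid, which is the single most useful fact for anyone triaging authentication failures later.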
### Phase 4: Distribute to consumers in order
A common mistake is updating all consumers simultaneously with no visibility. Instead, update in a deliberate sequence: least risky consumer first, highest-user-impact consumer last, unless architecture forces the reverse.
For this scenario, update bastion01 first, because it affects scheduled jobs, not interactive traffic.
```shell
ssh user@bastion01.example-corp.com \
  "install -m 0600 /dev/null /opt/jobs/reporting/.env.new &&
   printf 'DB_USER=app_main\nDB_PASS=%s\n' '<new-secret>' > /opt/jobs/reporting/.env.new &&
   mv /opt/jobs/reporting/.env.new /opt/jobs/reporting/.env &&
   systemctl restart reporting-job"
```
Validate it before moving on.
```shell
ssh user@bastion01.example-corp.com 'systemctl is-active reporting-job'
ssh user@bastion01.example-corp.com '/opt/jobs/reporting/run-now --dry-run'
```
Then update web01.
```shell
ssh user@web01.example-corp.com \
  "install -m 0600 /dev/null /etc/example-app/app.env.new &&
   printf 'DB_USER=app_main\nDB_PASS=%s\n' '<new-secret>' > /etc/example-app/app.env.new &&
   mv /etc/example-app/app.env.new /etc/example-app/app.env &&
   systemctl restart example-app"
```
Validate with both a health check and a synthetic transaction.
```shell
curl -fsS https://web01.example-corp.com/healthz
curl -fsS -X POST https://web01.example-corp.com/api/v1/synthetic-login
```
If web01 fails but bastion01 succeeds, you already know the issue is application-specific rather than database-wide.
## Add logging and detection around the rotation
Rotations should generate expected telemetry. If your detection pipeline on siem01 suddenly sees repeated authentication failures from 10.0.1.20 after the cutover, that is operational evidence that a dependency was missed.
Useful queries during and after rotation include:
```sql
SELECT src_host, username, count(*) AS failures
FROM auth_events
WHERE timestamp >= now() - interval '30 minutes'
  AND destination_host = 'db01'
  AND outcome = 'failure'
GROUP BY src_host, username
ORDER BY failures DESC;
```
And if you collect service restart telemetry:
```shell
journalctl --since '30 minutes ago' -u example-app -u reporting-job --no-pager
```
Runbooks should explicitly call out the detections that distinguish a healthy cutover from a hidden dependency break.
## Use automation, but keep human checkpoints
The right balance is automated execution with manual approval gates. Full manual rotation is slow and error-prone. Fully blind automation is fast, but it can spread failure quickly.
A practical model is:
- Automation generates and stores the staged secret.
- A change ticket or approval step authorizes execution.
- Automation updates each consumer and runs health checks.
- A human operator confirms telemetry looks normal.
- Automation marks the rotation complete and revokes the old secret if applicable.
A simplified orchestration definition might look like this:
```yaml
steps:
  - name: prechecks
    run: ./runbooks/db-rotation/prechecks.sh
  - name: rotate-source
    run: ./runbooks/db-rotation/rotate-db-role.sh
  - name: update-bastion01
    run: ./runbooks/db-rotation/update-consumer.sh bastion01
  - name: validate-bastion01
    run: ./runbooks/db-rotation/validate-reporting.sh
  - name: update-web01
    run: ./runbooks/db-rotation/update-consumer.sh web01
  - name: validate-web01
    run: ./runbooks/db-rotation/validate-app.sh
  - name: observe-siem01
    run: ./runbooks/db-rotation/check-auth-failures.sh
  - name: complete
    run: ./runbooks/db-rotation/finalize.sh
```
Notice that the validation is not a single “did the service start” check. It includes functional verification.
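Functional checks right after a restart are prone to transient failures, so validation scripts benefit from a retry wrapper rather than a single attempt. The helper below is a generic sketch; it is demonstrated with a local `true` command, and the commented line shows how the synthetic-login check from the example runbook might be wrapped.

```shell
# Sketch: retry a validation command so a transient post-restart failure
# does not trigger a premature rollback. Demonstrated with a local command.
set -euo pipefail

retry_until() {
  local attempts="$1" delay="$2"; shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# A real check might be, for example:
#   retry_until 10 30 curl -fsS -X POST https://web01.example-corp.com/api/v1/synthetic-login
retry_until 3 1 true && echo "validation passed"
```

The attempt count and delay effectively define the rollback trigger: if the check has not passed by the last attempt, the runbook's rollback condition has been met.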
## Design rollback before the change window starts
Rollback is not failure. It is part of the plan. The runbook should define exactly what gets restored, in what order, and what evidence triggers rollback.
For a single-secret cutover, rollback conditions might include:
- Synthetic login fails for more than five minutes after deployment.
- Authentication failures on db01 exceed a threshold from an unknown source.
- A critical downstream batch process on bastion01 cannot complete a dry run.
Rollback steps should be concrete:
1. Restore the previous secret version in the vault.
2. Reset the account credential on db01 to the previous value.
3. Redeploy prior environment files to bastion01 and web01.
4. Restart services in reverse order of update.
5. Re-run validation checks.
6. Annotate the incident/change record with the exact failure point.
Do not invent rollback during the outage. Write it in advance.
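A rollback you can rehearse is one you can trust. The sketch below simulates the restore step locally: files in a `live/` directory stand in for consumer environment files, and a pre-window snapshot in `backup/` is what gets restored. Paths and values are illustrative; on real hosts the copy and restart would run over ssh, in reverse order of update.

```shell
# Sketch: rehearsable rollback. Local directories stand in for remote hosts;
# the snapshot is taken BEFORE the change window, not during the incident.
set -euo pipefail
mkdir -p backup live

# Before the change window: snapshot the current consumer file.
echo 'DB_PASS=old-secret' > live/app.env
cp live/app.env backup/app.env

# Simulated failed rotation: new value deployed, validation then failed.
echo 'DB_PASS=new-secret' > live/app.env

# Rollback: restore the snapshot; on real hosts, then restart services
# in reverse order of update (web01 first, then bastion01).
cp backup/app.env live/app.env
grep 'DB_PASS' live/app.env
```

Running exactly this kind of rehearsal in a tabletop exercise is how teams discover that the snapshot step was never actually in the runbook.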
## Operational considerations and next steps
A mature secrets rotation program is less about password generation and more about dependency management, observability, and repeatable choreography. If your team still rotates secrets through chat messages and shell history, the first improvement is not a new vault. It is a documented inventory and a tested runbook library.
As next steps, security teams should identify their top ten high-impact secrets, classify each by rotation pattern, document all consumers, add validation checks that prove application behavior rather than mere process uptime, and schedule tabletop exercises for rollback. Then move the highest-risk rotations into guarded automation with clear approval gates. The best rotation runbook is the one you have already rehearsed before an incident forces you to use it.
