Security Infrastructure Disaster Recovery: Vault, PKI, and SIEM Failover

Problem

When your security infrastructure fails, you are flying blind. If Vault is down, applications cannot retrieve secrets and new deployments stall. If your PKI is unavailable, certificate renewal fails and mTLS connections start breaking. If your SIEM or log pipeline is down, an attacker operating during the outage has unlimited dwell time with zero detection.

The “who watches the watchers” problem is real. Most teams monitor their applications with Prometheus, Grafana, and centralised logging. But what monitors Prometheus? What alerts you when Grafana is down? What detects an attacker when your detection system is offline?

Security infrastructure is the foundation that every other security control depends on. A Vault outage is not just a secret management problem. It cascades: services cannot authenticate, certificates cannot rotate, new pods cannot start, and incident response teams cannot access credentials needed to investigate the outage itself.

Target systems: HashiCorp Vault (secrets management). Internal PKI (cert-manager, SPIRE, or custom CA). SIEM/log pipeline (Prometheus, Loki, Elasticsearch, or managed equivalents). Alerting systems (Alertmanager, PagerDuty, Opsgenie).

Threat Model

  • Adversary: An attacker who triggers or exploits a security infrastructure outage. This could be intentional (DDoS against your monitoring stack, corrupting Vault storage) or opportunistic (attacking during an unrelated outage when detection is impaired).
  • Objective: Operate with impunity during the monitoring blind spot. Exfiltrate data, establish persistence, or destroy evidence while detection systems are offline.
  • Blast radius: A monitoring outage does not cause a breach directly. It makes every other breach undetectable for the duration of the outage. A 4-hour SIEM outage means 4 hours of unmonitored activity across every system that depends on centralised logging.

Configuration

Vault Disaster Recovery

# vault-config.hcl
# Vault with integrated storage (Raft) and auto-unseal
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-0"

  retry_join {
    leader_api_addr = "https://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-2.vault-internal:8200"
  }
}

# Auto-unseal prevents manual intervention after restart
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abc123"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/vault/tls/tls.crt"
  tls_key_file  = "/vault/tls/tls.key"
}
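Once the three nodes have joined, cluster membership and seal state can be confirmed from any node. A quick check, assuming VAULT_ADDR points at the cluster and the token can read sys/ endpoints:

```shell
# List Raft peers: expect three nodes, one leader, all voters
vault operator raft list-peers

# Confirm seal status, HA mode, and which node is active
vault status
```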
#!/bin/bash
# Vault snapshot automation (run daily via cron)
set -euo pipefail

SNAPSHOT_DIR="/vault/snapshots"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30

# Take Raft snapshot
vault operator raft snapshot save "${SNAPSHOT_DIR}/vault-${DATE}.snap"

# Verify snapshot is valid
vault operator raft snapshot inspect "${SNAPSHOT_DIR}/vault-${DATE}.snap"

# Upload to remote storage (separate from Vault's own storage)
aws s3 cp "${SNAPSHOT_DIR}/vault-${DATE}.snap" \
  "s3://vault-dr-backups/vault-${DATE}.snap" \
  --sse aws:kms \
  --sse-kms-key-id "arn:aws:kms:us-east-1:123456789012:key/backup-key"

# Clean up old local snapshots
find "${SNAPSHOT_DIR}" -name "vault-*.snap" -mtime +${RETENTION_DAYS} -delete

echo "Vault snapshot completed: vault-${DATE}.snap"
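A cron fragment along these lines schedules the script daily. The install path is an assumption, and the script additionally needs credentials (a VAULT_TOKEN or AppRole login) with permission to take Raft snapshots:

```shell
# /etc/cron.d/vault-snapshot — daily at 02:00, as root
# Assumes the snapshot script above is installed at /usr/local/bin/vault-snapshot.sh
0 2 * * * root VAULT_ADDR=https://127.0.0.1:8200 /usr/local/bin/vault-snapshot.sh >> /var/log/vault-snapshot.log 2>&1
```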

Vault Recovery Procedure

# Vault recovery from Raft snapshot
# Run on a fresh Vault cluster

# 1. Start Vault server (it will be uninitialized)
vault server -config=/vault/config/vault.hcl

# 2. Initialize the new cluster. Auto-unseal via KMS unseals it immediately.
#    The DR cluster must be configured with the same KMS key: the snapshot's
#    data is encrypted against it.
vault operator init

# 3. Restore from snapshot (-force is required because the snapshot
#    came from a different cluster)
vault operator raft snapshot restore \
  -force \
  /vault/snapshots/vault-20260422-020000.snap

# 4. Verify secrets are accessible
vault kv get secret/test

# 5. Verify dynamic secret engines are functional
vault read database/creds/readonly
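Before declaring recovery complete, a small polling loop can confirm the restored cluster reports healthy. A sketch, with the address and timeout as assumptions:

```shell
# Poll sys/health until the cluster reports initialized and unsealed (HTTP 200).
# standbyok=true treats a healthy standby node as success.
for i in $(seq 1 30); do
  code=$(curl -sk -o /dev/null -w '%{http_code}' \
    "https://127.0.0.1:8200/v1/sys/health?standbyok=true")
  if [ "${code}" = "200" ]; then
    echo "Vault healthy after restore"
    break
  fi
  sleep 10
done
```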

Monitoring the Monitors

# prometheus-self-monitoring.yaml
# Prometheus monitors itself and a secondary Prometheus monitors the primary
groups:
  - name: monitoring-health
    rules:
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance down"
          description: "Primary Prometheus is unreachable. Security alerting is impaired."

      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager down. Alerts will not be delivered."

      - alert: VaultDown
        expr: up{job="vault"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault is unreachable. Secret retrieval and certificate rotation will fail."

      - alert: VaultSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault is sealed. All secret operations are blocked."

      - alert: LogPipelineDown
        expr: >
          rate(fluentbit_output_proc_records_total[5m]) == 0
          and rate(fluentbit_input_records_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Log pipeline receiving data but not shipping. Logs are being dropped."

      - alert: CertificateRenewalFailing
        expr: >
          certmanager_certificate_ready_status{condition="False"} == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} renewal failing for 30 minutes"

Dual-Ship Log Pipeline

Send logs to two independent backends so that a single backend failure does not create a monitoring blind spot.

# fluent-bit-dual-ship.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     10MB

    # Primary backend
    [OUTPUT]
        Name              loki
        Match             *
        Host              loki.monitoring.svc.cluster.local
        Port              3100
        Labels            job=fluent-bit
        Auto_Kubernetes_Labels on

    # Secondary backend (independent infrastructure)
    [OUTPUT]
        Name              http
        Match             *
        Host              logs.backup-region.example.com
        Port              443
        URI               /api/v1/push
        Format            json
        Header            Authorization Bearer ${BACKUP_LOG_TOKEN}
        TLS               On

    # Local filesystem buffer (survives both backend outages)
    [OUTPUT]
        Name              file
        Match             *
        Path              /var/log/fluent-bit-buffer/
        Format            plain

Degraded-Mode Detection

When centralised monitoring is unavailable, these host-level tools continue operating.

#!/bin/bash
# degraded-mode-detection.sh
# Run on every host. No dependency on centralised systems.
# Deploy as a systemd timer that runs every 5 minutes.

set -euo pipefail

LOG_FILE="/var/log/security/degraded-mode.log"
ALERT_FILE="/var/log/security/alerts.log"

timestamp() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }

# Check for unexpected processes
UNEXPECTED_PROCS=$(ps aux | grep -E '(xmrig|cryptonight|minerd|stratum)' | grep -v grep || true)
if [ -n "${UNEXPECTED_PROCS}" ]; then
  echo "$(timestamp) CRITICAL: Suspected crypto miner detected: ${UNEXPECTED_PROCS}" >> "${ALERT_FILE}"
fi

# Check for unexpected network listeners
UNEXPECTED_LISTENERS=$(ss -tlnp | grep -vE ':(22|80|443|8080|8443|9090|9100|6443) ' | grep LISTEN || true)
if [ -n "${UNEXPECTED_LISTENERS}" ]; then
  echo "$(timestamp) WARNING: Unexpected network listeners: ${UNEXPECTED_LISTENERS}" >> "${ALERT_FILE}"
fi

# Check for unexpected cron jobs
CRON_CHANGES=$(find /etc/cron* /var/spool/cron -newer /var/log/security/.cron-baseline -type f 2>/dev/null || true)
if [ -n "${CRON_CHANGES}" ]; then
  echo "$(timestamp) WARNING: Cron files modified since last baseline: ${CRON_CHANGES}" >> "${ALERT_FILE}"
fi

# Check Vault accessibility (standbyok=true avoids false alarms from
# healthy standby nodes, which return 429 by default)
if ! curl -sf -o /dev/null "https://vault.example.com:8200/v1/sys/health?standbyok=true"; then
  echo "$(timestamp) CRITICAL: Vault health check failed" >> "${ALERT_FILE}"
fi

# Ship alerts via alternative channel if centralised logging is down
if [ -s "${ALERT_FILE}" ] && ! curl -sf -o /dev/null "http://loki.monitoring.svc.cluster.local:3100/ready"; then
  # Logging is down; send alerts directly via webhook.
  # Guard against an unset EMERGENCY_WEBHOOK (set -u would abort the script).
  if [ -n "${EMERGENCY_WEBHOOK:-}" ]; then
    curl -sf -X POST "${EMERGENCY_WEBHOOK}" \
      -H "Content-Type: application/json" \
      -d "{\"text\": \"DEGRADED MODE ALERT from $(hostname): $(cat "${ALERT_FILE}")\"}" || true
  fi
fi
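The cron check above compares against a baseline file that must exist before the first run. A one-time initialisation sketch (the helper name is ours; run it at install time and again after every legitimate cron change):

```shell
#!/bin/bash
# Create or reset the baseline marker that the `find -newer` cron check
# compares against.
set -euo pipefail

reset_cron_baseline() {
  local baseline_dir="${1:-/var/log/security}"
  mkdir -p "${baseline_dir}"
  touch "${baseline_dir}/.cron-baseline"
  echo "cron baseline reset: ${baseline_dir}/.cron-baseline"
}

# Demo against a throwaway directory; in production, call with no argument.
reset_cron_baseline "$(mktemp -d)"
```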
# /etc/systemd/system/degraded-detection.timer
[Unit]
Description=Degraded mode security detection

[Timer]
OnBootSec=60
OnUnitActiveSec=300
RandomizedDelaySec=30

[Install]
WantedBy=timers.target
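The timer references a service unit that is not shown above; a minimal matching unit, with the script path as an assumption:

```ini
# /etc/systemd/system/degraded-detection.service
[Unit]
Description=Degraded mode security detection

[Service]
Type=oneshot
ExecStart=/usr/local/bin/degraded-mode-detection.sh
```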

DR Drill Procedure

# quarterly-dr-drill.sh
# Test security infrastructure recovery quarterly

echo "=== Security Infrastructure DR Drill ==="
echo "Date: $(date)"
echo ""

echo "1. Vault Recovery Test"
echo "   - Restoring from latest snapshot to DR cluster..."
# vault operator raft snapshot restore -force /path/to/latest.snap
echo "   - Verifying secret access..."
# vault kv get secret/dr-test
echo "   - Expected: secrets readable within 5 minutes of restore start"
echo ""

echo "2. Log Pipeline Failover Test"
echo "   - Stopping primary Loki..."
# kubectl scale statefulset loki --replicas=0 -n monitoring
echo "   - Verifying logs appear in secondary backend within 30 seconds..."
echo "   - Restarting primary Loki..."
# kubectl scale statefulset loki --replicas=3 -n monitoring
echo ""

echo "3. Certificate Renewal Under CA Failure"
echo "   - Simulating cert-manager CA unavailability..."
echo "   - Verifying existing certificates remain valid..."
echo "   - Verifying renewal recovers when CA returns..."
echo ""

echo "4. Alerting Pipeline Test"
echo "   - Firing test alert..."
# amtool alert add --alertmanager.url=http://alertmanager:9093 \
#   alertname="DR_DRILL_TEST" severity="critical"
echo "   - Verifying alert delivered to on-call..."
echo ""

echo "Record results and update DR runbook with any findings."

Expected Behaviour

  • Vault runs in a 3-node Raft cluster with auto-unseal; daily snapshots are stored in a separate region.
  • Vault recovery from snapshot completes within 15 minutes.
  • The log pipeline ships to two independent backends; if the primary fails, the secondary continues receiving.
  • Host-level degraded-mode detection runs independently of centralised monitoring.
  • Monitoring-of-monitoring alerts fire within 2 minutes of any component failure.
  • DR drills run quarterly with documented results.

Trade-offs

  • Dual-ship logs — Impact: doubles log storage cost. Risk: backends may diverge (different retention, different query capabilities). Mitigation: treat the secondary as a DR backend with shorter retention (7 days vs 30); use the primary for day-to-day operations.
  • Auto-unseal via KMS — Impact: Vault recovers without manual intervention. Risk: a KMS outage prevents Vault from starting. Mitigation: use a KMS in a different region than Vault; on total KMS failure, fall back to Shamir unseal keys stored offline.
  • Host-level degraded detection — Impact: works when everything else is down. Risk: limited detection capability compared to a centralised SIEM. Mitigation: treat it as the last line of defence, not the first; it catches obvious indicators only.
  • Quarterly DR drills — Impact: validates that recovery procedures work. Risk: drills require 2-4 hours of engineering time per quarter. Mitigation: schedule during low-traffic periods and automate as much as possible; the cost of not testing is a failed recovery during a real incident.

Failure Modes

  • Vault snapshot corrupted — Symptom: restore fails with an integrity error. Detection: snapshot inspect during backup verifies integrity. Recovery: retry with the previous day's snapshot; investigate storage corruption.
  • Both log backends fail simultaneously — Symptom: complete monitoring blind spot. Detection: the host-level degraded-detection script fires; no logs arrive at either backend. Recovery: logs buffer on the Fluent Bit local filesystem; investigate the root cause (a shared dependency such as DNS or network).
  • Auto-unseal KMS unavailable — Symptom: Vault cannot start after a restart. Detection: Vault logs show a KMS connection failure. Recovery: use Shamir unseal keys as a fallback; a KMS in a separate region reduces shared failure risk.
  • DR drill reveals an outdated runbook — Symptom: recovery takes longer than expected or fails. Detection: drill results document the gap. Recovery: update the runbook immediately; schedule a follow-up drill to verify the fix.
  • Certificate chain breaks during CA recovery — Symptom: services reject mTLS connections. Detection: TLS handshake errors across services; cert-manager events show issuance failures. Recovery: verify the CA certificate in the cert-manager ClusterIssuer; if the CA was rebuilt, distribute the new root CA to all trust stores.

When to Consider a Managed Alternative

Managed observability providers handle their own HA: Grafana Cloud (#108), Axiom (#112), Better Stack (#113). This is the strongest argument for managed observability: when your self-hosted Prometheus goes down you have no metrics, but when Grafana Cloud goes down their SRE team responds. Incident.io (#175) and FireHydrant (#176) cover incident management during DR scenarios, providing structured incident response when your internal tools are compromised.

Premium content pack: Security DR runbook templates. Vault backup and restore scripts, Fluent Bit dual-ship configuration, degraded-mode detection scripts, quarterly DR drill checklist, and monitoring-of-monitoring Prometheus rules.