Incident Response Hardening Playbook: From Detection to Post-Mortem

Problem

During an active security incident, hardening is reactive: isolate the compromised system, contain the blast radius, preserve evidence, and stop the bleeding. After the incident, hardening is preventive: translate findings into permanent controls that ensure the same attack path never works again. Most teams lack a structured approach for either.

The typical incident response is chaotic. An alert fires. Someone investigates. Credentials get rotated, maybe. The compromised pod gets deleted (destroying evidence). A week later, life continues and no permanent hardening happens. The same attack path remains open. The post-mortem (if one happens) produces action items that sit in a ticket queue for months.

This playbook provides step-by-step procedures for containment during an incident and a structured process for converting incident findings into permanent hardening controls afterward.

Target systems: Kubernetes clusters. Linux servers. Any infrastructure where security incidents require both immediate response and long-term hardening.

Threat Model

  • Adversary: Active attacker with some level of access. The specific attack type is unknown at the start of incident response. The playbook must work regardless of whether the incident is a container escape, a compromised credential, a data exfiltration, or a supply chain compromise.
  • Objective (attacker): Maintain access, exfiltrate data, escalate privileges, or cause destruction.
  • Objective (defender): Contain the attacker’s access, preserve forensic evidence, restore service, and close the attack path permanently.
  • Blast radius: Depends on time to containment. Every minute between detection and isolation is a minute the attacker can escalate. The playbook’s goal is to reduce this window from hours to minutes.

Configuration

Phase 1: Network Isolation (First 5 Minutes)

Isolate the compromised workload immediately. Do not delete it. Isolation preserves evidence while stopping lateral movement.

# quarantine-network-policy.yaml
# Apply to isolate a compromised pod
# This blocks ALL ingress and egress while keeping the pod running
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-pod
  namespace: production  # Target namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"  # Label the compromised pod
  policyTypes:
    - Ingress
    - Egress
  ingress: []   # Deny all ingress
  egress: []    # Deny all egress

#!/bin/bash
# quarantine.sh - Isolate a compromised pod
# Usage: ./quarantine.sh <namespace> <pod-name>
set -euo pipefail

NAMESPACE=$1
POD=$2
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)

echo "[${TIMESTAMP}] INCIDENT: Quarantining pod ${POD} in ${NAMESPACE}"

# Step 1: Label the pod for quarantine network policy
kubectl label pod "${POD}" -n "${NAMESPACE}" quarantine=true --overwrite

# Step 2: Apply quarantine network policy
kubectl apply -f quarantine-network-policy.yaml -n "${NAMESPACE}"

# Step 3: Verify isolation
echo "Verifying network isolation..."
kubectl exec "${POD}" -n "${NAMESPACE}" -- \
  timeout 5 curl -s http://kubernetes.default.svc >/dev/null 2>&1 \
  && echo "WARNING: pod can still reach the API server; check the policy controller" \
  || echo "Network isolated: cannot reach API server"

# Step 4: Record the quarantine event
kubectl annotate pod "${POD}" -n "${NAMESPACE}" \
  "incident.quarantined-at=${TIMESTAMP}" \
  "incident.quarantined-by=${USER}" \
  --overwrite

echo "[${TIMESTAMP}] Pod ${POD} quarantined. Do NOT delete this pod. Evidence preservation required."
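If the NetworkPolicy controller is itself down or suspect, isolation can fall back to node-level firewall rules. This sketch only prints the iptables commands to run as root on the pod's node; the pod IP and rule positions are illustrative assumptions, not part of the playbook's scripts.

```shell
# Prints (does not execute) node-level iptables rules that drop all traffic
# to and from the pod. Run the output as root on the node hosting the pod.
emit_node_isolation_rules() {
  local pod_ip=$1
  echo "iptables -I FORWARD 1 -s ${pod_ip} -j DROP"
  echo "iptables -I FORWARD 1 -d ${pod_ip} -j DROP"
}

# Look up the real IP with:
#   kubectl get pod <pod> -n <ns> -o jsonpath='{.status.podIP}'
emit_node_isolation_rules "10.42.3.17"   # illustrative pod IP
```

Printing rather than executing keeps the fallback safe to rehearse; during a real incident, pipe the output to a shell on the node.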

Phase 2: Credential Rotation (First 15 Minutes)

Assume all credentials accessible to the compromised workload are compromised.

#!/bin/bash
# rotate-credentials.sh
# Usage: ./rotate-credentials.sh <namespace> <service-account>
set -euo pipefail

NAMESPACE=$1
SA=$2
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)

echo "[${TIMESTAMP}] Rotating credentials for ${SA} in ${NAMESPACE}"

# 1. Delete and recreate service account tokens
# Note: on Kubernetes 1.24+, pods receive short-lived projected tokens; this
# step removes any long-lived Secret-based tokens that may still exist.
echo "Revoking Kubernetes service account tokens..."
kubectl get secrets -n "${NAMESPACE}" -o json | \
  jq -r ".items[] | select(.metadata.annotations[\"kubernetes.io/service-account.name\"]==\"${SA}\") | .metadata.name" | \
  xargs -r -I{} kubectl delete secret {} -n "${NAMESPACE}"

# 2. Rotate Vault tokens (if applicable)
echo "Revoking Vault leases for ${SA}..."
vault lease revoke -prefix "auth/kubernetes/login" || true

# 3. Rotate cloud provider credentials (IRSA/Workload Identity)
# These are short-lived by default (1h for IRSA), but rotate the role binding
echo "Verify IRSA/Workload Identity role has not been modified..."
# Check IAM role policy for unexpected changes

# 4. Rotate database credentials
echo "Rotating database credentials..."
vault write -f database/rotate-root/production

# 5. Invalidate active sessions
echo "Invalidating active sessions..."
# Application-specific: clear session store, invalidate JWTs

# 6. Rotate API keys
echo "Rotating external API keys..."
# Provider-specific rotation (see Article #82)

echo "[${TIMESTAMP}] Credential rotation complete. Verify application functionality."
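The jq filter in step 1 is easy to get wrong under pressure, so it helps to exercise it offline first. A sketch against a hypothetical Secret list (the object names are made up; only the annotation key is real Kubernetes convention):

```shell
# Hypothetical Secret list: one token owned by the target SA, one not.
SA="deploy-bot"
SAMPLE=$(mktemp)
cat > "${SAMPLE}" <<'EOF'
{"items":[
 {"metadata":{"name":"deploy-bot-token-abc12",
  "annotations":{"kubernetes.io/service-account.name":"deploy-bot"}}},
 {"metadata":{"name":"other-token-xyz34",
  "annotations":{"kubernetes.io/service-account.name":"other-sa"}}}
]}
EOF

# Same filter as the rotation script: select only secrets annotated as
# belonging to ${SA}, and print their names.
jq -r ".items[] | select(.metadata.annotations[\"kubernetes.io/service-account.name\"]==\"${SA}\") | .metadata.name" "${SAMPLE}"
```

Only `deploy-bot-token-abc12` should be printed; if the other token appears, the filter would delete secrets belonging to unrelated service accounts.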

Phase 3: Evidence Preservation (First 30 Minutes)

Capture forensic evidence before anything is cleaned up.

#!/bin/bash
# preserve-evidence.sh
# Usage: ./preserve-evidence.sh <namespace> <pod-name>
set -euo pipefail

NAMESPACE=$1
POD=$2
EVIDENCE_DIR="/evidence/incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "${EVIDENCE_DIR}"

echo "Preserving evidence for ${POD} in ${NAMESPACE}"

# 1. Pod description and status
kubectl get pod "${POD}" -n "${NAMESPACE}" -o yaml > "${EVIDENCE_DIR}/pod.yaml"
kubectl describe pod "${POD}" -n "${NAMESPACE}" > "${EVIDENCE_DIR}/pod-describe.txt"

# 2. Container logs (all containers, including init containers)
for CONTAINER in $(kubectl get pod "${POD}" -n "${NAMESPACE}" \
    -o jsonpath='{.spec.containers[*].name} {.spec.initContainers[*].name}'); do
  kubectl logs "${POD}" -n "${NAMESPACE}" -c "${CONTAINER}" > "${EVIDENCE_DIR}/logs-${CONTAINER}.txt" 2>&1 || true
  kubectl logs "${POD}" -n "${NAMESPACE}" -c "${CONTAINER}" --previous > "${EVIDENCE_DIR}/logs-${CONTAINER}-previous.txt" 2>&1 || true
done

# 3. Process list inside the container
kubectl exec "${POD}" -n "${NAMESPACE}" -- ps auxww > "${EVIDENCE_DIR}/processes.txt" 2>&1 || true

# 4. Network connections
kubectl exec "${POD}" -n "${NAMESPACE}" -- ss -tlnp > "${EVIDENCE_DIR}/network-listeners.txt" 2>&1 || true
kubectl exec "${POD}" -n "${NAMESPACE}" -- ss -tnp > "${EVIDENCE_DIR}/network-connections.txt" 2>&1 || true

# 5. Filesystem modifications (compare against image)
NODE=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.nodeName}')
CONTAINER_ID=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|containerd://||')
echo "Container ID: ${CONTAINER_ID} on node ${NODE}" > "${EVIDENCE_DIR}/container-info.txt"

# 6. Events from the namespace
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' > "${EVIDENCE_DIR}/events.txt"

# 7. Network policy state
kubectl get networkpolicy -n "${NAMESPACE}" -o yaml > "${EVIDENCE_DIR}/network-policies.yaml"

# 8. Copy to immutable storage
tar czf "${EVIDENCE_DIR}.tar.gz" -C "$(dirname "${EVIDENCE_DIR}")" "$(basename "${EVIDENCE_DIR}")"
aws s3 cp "${EVIDENCE_DIR}.tar.gz" "s3://incident-evidence/$(basename "${EVIDENCE_DIR}").tar.gz" \
  --sse aws:kms

echo "Evidence preserved at ${EVIDENCE_DIR} and uploaded to S3."
echo "DO NOT delete the quarantined pod until investigation is complete."
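Before uploading, hashing the evidence bundle gives a simple chain-of-custody check so later tampering is detectable. A self-contained sketch using throwaway paths from mktemp; in the real flow this would run on ${EVIDENCE_DIR} just before the S3 upload:

```shell
# Throwaway stand-in for ${EVIDENCE_DIR}; real runs use /evidence/incident-...
EVD=$(mktemp -d)
echo "sample log line" > "${EVD}/logs-app.txt"

# Archive with -C so the tarball contains relative paths, then record a hash
# alongside it.
tar czf "${EVD}.tar.gz" -C "$(dirname "${EVD}")" "$(basename "${EVD}")"
sha256sum "${EVD}.tar.gz" > "${EVD}.tar.gz.sha256"

# Anyone handling the evidence later can re-verify the bundle:
sha256sum -c "${EVD}.tar.gz.sha256" && echo "evidence bundle intact"
```

Store the `.sha256` file separately from the bundle (e.g. in the incident ticket) so an attacker who reaches the evidence bucket cannot silently rewrite both.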

Phase 4: Close the Attack Path (Post-Incident)

After the immediate incident is resolved, implement permanent fixes.

# Example: if the incident was caused by overly permissive RBAC
# Fix: replace cluster-admin with least-privilege role

# Before (incident root cause)
# apiVersion: rbac.authorization.k8s.io/v1
# kind: ClusterRoleBinding
# metadata:
#   name: deploy-bot
# roleRef:
#   kind: ClusterRole
#   name: cluster-admin     # <-- This was the problem
# subjects:
#   - kind: ServiceAccount
#     name: deploy-bot
#     namespace: ci-cd

# After (permanent fix)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy-bot-role
  namespace: production  # Scoped to single namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list"]
    # No create, no delete, no access to other namespaces
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-bot-binding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deploy-bot-role
subjects:
  - kind: ServiceAccount
    name: deploy-bot
    namespace: ci-cd
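The fix can be verified with `kubectl auth can-i` while impersonating the service account. The sketch below only constructs and prints the commands so it runs anywhere; the expected answers assume the Role above is the service account's only remaining grant:

```shell
# Builds the impersonation username kubectl expects for a ServiceAccount:
#   system:serviceaccount:<namespace>:<name>
sa_impersonation_user() { echo "system:serviceaccount:$1:$2"; }

AS_USER=$(sa_impersonation_user ci-cd deploy-bot)
echo "kubectl auth can-i create pods/exec -n production --as=${AS_USER}   # expect: no"
echo "kubectl auth can-i patch deployments -n production --as=${AS_USER}  # expect: yes"
```

Running both checks after every RBAC change (e.g. in CI) turns the incident's verification step into a standing regression test.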

Phase 5: Strengthen Detection

Write detection rules for the specific attack pattern observed during the incident.

# falco-rule-from-incident.yaml
# Example: incident revealed that the attacker used kubectl exec
# to establish a reverse shell from a production pod

- rule: Reverse Shell in Production
  desc: >
    Outbound connection from a production container to a non-standard port,
    combined with shell execution. Derived from incident INC-2026-042.
  condition: >
    outbound
    and container
    and k8s.ns.name = "production"
    and proc.name in (bash, sh, dash)
    and not fd.snet in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
  output: >
    Reverse shell detected in production
    (pod=%k8s.pod.name ns=%k8s.ns.name process=%proc.name
     dest=%fd.sip:%fd.sport user=%user.name)
  priority: CRITICAL
  tags: [incident-derived, reverse-shell, INC-2026-042]

- rule: Kubectl Exec to Production Pod
  desc: >
    kubectl exec into a production pod outside of approved maintenance windows.
    Derived from incident INC-2026-042.
  condition: >
    kevt
    and ka.verb = "create"
    and ka.target.subresource = "exec"
    and ka.target.namespace = "production"
  output: >
    kubectl exec to production pod
    (pod=%ka.target.name ns=%ka.target.namespace user=%ka.user.name)
  priority: WARNING
  tags: [incident-derived, exec, INC-2026-042]

Phase 6: Incident-to-Hardening Conversion Template

# incident-hardening-template.yaml
# Fill this out for every security incident to convert findings into permanent controls
incident:
  id: "INC-2026-042"
  date: "2026-04-22"
  severity: "high"
  summary: "Attacker used compromised CI/CD service account to exec into production pod and exfiltrate database credentials"

findings:
  - finding: "CI/CD service account had cluster-admin privileges"
    root_cause: "Default RBAC from initial cluster setup was never tightened"
    control: "Least-privilege RBAC for all service accounts"
    article_reference: "Article #27 - RBAC Hardening"
    implementation:
      status: "completed"
      pr: "https://github.com/org/infra/pull/847"
      deployed: "2026-04-23"

  - finding: "No detection for kubectl exec in production"
    root_cause: "Falco rules did not cover exec subresource"
    control: "Falco rule for production exec events"
    article_reference: "Article #29 - Falco Rules"
    implementation:
      status: "completed"
      pr: "https://github.com/org/infra/pull/848"
      deployed: "2026-04-23"

  - finding: "Database credentials stored as Kubernetes Secret (not Vault)"
    root_cause: "Migration to Vault was incomplete for this service"
    control: "All database credentials managed by Vault with dynamic secrets"
    article_reference: "Article #52 - Secret Management"
    implementation:
      status: "in-progress"
      ticket: "SEC-2026-089"
      target_date: "2026-05-01"

verification:
  - test: "Attempt kubectl exec to production pod with CI/CD service account"
    expected: "Denied by RBAC"
    result: "Verified 2026-04-24"
  - test: "Falco alert fires on kubectl exec to production"
    expected: "CRITICAL alert within 30 seconds"
    result: "Verified 2026-04-24"

Expected Behaviour

  • Compromised pod is network-isolated within 5 minutes of detection
  • All credentials accessible to the compromised workload are rotated within 15 minutes
  • Forensic evidence is captured and stored in immutable storage within 30 minutes
  • Attack path is closed with a permanent fix within 48 hours
  • New detection rules for the observed attack pattern are deployed within 48 hours
  • Incident-to-hardening template is completed within 1 week
  • All hardening actions from the incident have a tracking ticket and a target date

Trade-offs

  • Quarantine (not delete)
    Impact: Preserves evidence but leaves the compromised pod running (isolated).
    Risk: Resource consumption; team anxiety about “leaving the bad pod”.
    Mitigation: Network isolation removes the communication risk; the pod cannot talk to anything. Delete only after evidence is preserved.
  • Immediate credential rotation
    Impact: Stops the attacker from using stolen credentials.
    Risk: May disrupt legitimate services using the same credentials.
    Mitigation: Verify service health after rotation. Accept brief disruption in exchange for containment.
  • Post-incident hardening required
    Impact: Every incident produces permanent security improvement.
    Risk: Additional work after the incident is already resolved; team fatigue.
    Mitigation: Build the conversion template into the post-mortem process. No post-mortem is complete without hardening actions.
  • Incident-derived Falco rules
    Impact: Detection rules based on real attacks, not theoretical ones.
    Risk: Rules may be too specific (catching only an exact replay of this attack).
    Mitigation: Write rules at the technique level (reverse shell, unauthorized exec), not the exact payload level.

Failure Modes

  • Quarantine network policy not applied
    Symptom: Compromised pod continues lateral movement.
    Detection: Network monitoring shows continued connections from the quarantined pod.
    Recovery: Verify the network policy controller (Cilium/Calico) is running. Apply node-level firewall rules as a fallback.
  • Evidence destroyed (pod deleted prematurely)
    Symptom: No forensic data available for investigation.
    Detection: Evidence directory empty or missing.
    Recovery: Recover what is possible from centralised logs and node-level audit logs. Update the playbook to emphasise “do not delete”.
  • Credential rotation misses a credential
    Symptom: Attacker retains access through a credential that was not rotated.
    Detection: Continued attacker activity after rotation.
    Recovery: Audit all credentials the service account had access to. Check Vault audit logs, cloud IAM access advisor, and API key usage logs.
  • Post-incident hardening stalls
    Symptom: Tickets created but never completed.
    Detection: Hardening tickets older than 30 days without progress.
    Recovery: Assign hardening actions to specific owners with deadlines. Review in the weekly security standup. Escalate after 30 days.
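The 30-day escalation threshold for stalled hardening tickets is easy to automate. A sketch of the age calculation only, with illustrative dates and GNU date assumed; wiring it to a real ticket system is left out:

```shell
# days_between computes whole days between two UTC dates (GNU date assumed).
days_between() {
  local a b
  a=$(date -u -d "$1" +%s)
  b=$(date -u -d "$2" +%s)
  echo $(( (b - a) / 86400 ))
}

# Hypothetical ticket created 2026-04-23, reviewed on 2026-05-30:
AGE=$(days_between "2026-04-23" "2026-05-30")
if [ "${AGE}" -gt 30 ]; then
  echo "ESCALATE: hardening ticket is ${AGE} days old"
fi
```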

When to Consider a Managed Alternative

Incident.io (#175), FireHydrant (#176), and Rootly (#177) for structured incident management with automated workflows, Slack/Teams integration, and post-mortem templates. Sysdig (#122) for runtime detection that feeds directly into incident response with container forensics. Grafana Cloud (#108) for log analysis during incidents with fast query across all infrastructure. Vanta (#169) and Drata (#170) for post-incident compliance documentation showing that findings were remediated.

Premium content pack: Incident response hardening templates. Quarantine scripts, credential rotation runbook, evidence preservation procedures, Falco rule templates for common incident types, and the incident-to-hardening conversion template.