Security-Relevant Prometheus Metrics: What to Collect, How to Alert, When to Page
Problem
Prometheus is deployed in most Kubernetes environments for infrastructure monitoring (CPU, memory, disk, request latency), but security teams rarely use it for detection. Authentication failures, RBAC denials, certificate expiry, network policy drops, and syscall violations all produce Prometheus metrics, yet nobody writes alert rules for them. The gap between “infrastructure observability” and “security monitoring” is not a tooling gap; it is an alert-rules gap.
Threat Model
- Adversary: Any attacker. Security metrics detect brute force (auth failure spikes), privilege escalation (RBAC deny spikes), lateral movement (network policy drops), resource exhaustion (OOM kills from crypto miners), and misconfiguration (certificate expiry).
- Without security metrics, attacks are detected by their EFFECTS (outage, data breach, cost spike), often days or weeks later. With security metrics, attacks are detected by their CAUSES (auth failure spike, unusual RBAC denials), within minutes.
Configuration
Authentication Failure Monitoring
# PrometheusRule for authentication failures
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-auth-alerts
  namespace: monitoring
spec:
  groups:
    - name: authentication
      interval: 30s
      rules:
        # Recording rule: auth failure rate per source
        - record: security:auth_failures:rate5m
          expr: sum by (source_ip, service) (rate(auth_failures_total{result="failure"}[5m]))
        # Alert: brute force detection
        - alert: BruteForceDetected
          expr: security:auth_failures:rate5m > 0.5  # >30 failures per minute
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Possible brute force against {{ $labels.service }} from {{ $labels.source_ip }}"
            runbook_url: "https://systemshardening.com/runbooks/brute-force"
            description: "{{ $value | humanize }} auth failures/sec from {{ $labels.source_ip }}"
        # Alert: credential stuffing (many IPs, same pattern)
        - alert: CredentialStuffing
          expr: count by (service) (security:auth_failures:rate5m > 0.1) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Possible credential stuffing against {{ $labels.service }}, {{ $value }} source IPs"
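The threshold arithmetic matters: 0.5 failures/sec over a 5-minute window is 30 failures/minute, so a handful of mistyped passwords never trips the alert. The sliding-window semantics of the recording rule plus threshold can be sketched in pure Python (an illustration of the semantics only, not how Prometheus computes rate() internally; class and method names are invented for this sketch):

```python
from collections import deque


class FailureRateTracker:
    """Sliding-window failure rate per (source_ip, service), mirroring the
    semantics of the security:auth_failures:rate5m rule and its 0.5/sec cut."""

    def __init__(self, window_s: float = 300.0, threshold_per_s: float = 0.5):
        self.window_s = window_s
        self.threshold_per_s = threshold_per_s  # 0.5/sec == 30/min
        self._events: dict[tuple[str, str], deque] = {}

    def record_failure(self, source_ip: str, service: str, now: float) -> None:
        self._events.setdefault((source_ip, service), deque()).append(now)

    def rate(self, source_ip: str, service: str, now: float) -> float:
        q = self._events.get((source_ip, service), deque())
        # Drop events that have fallen out of the 5-minute window
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q) / self.window_s

    def brute_force_suspected(self, source_ip: str, service: str, now: float) -> bool:
        return self.rate(source_ip, service, now) > self.threshold_per_s


tracker = FailureRateTracker()
# 200 failures in 5 minutes is ~0.67/sec: above the 0.5/sec threshold
for i in range(200):
    tracker.record_failure("203.0.113.5", "login-api", now=float(i))
print(tracker.brute_force_suspected("203.0.113.5", "login-api", now=300.0))  # True
```

A user fat-fingering a password a few times produces a rate of well under 0.1/sec and stays invisible, which is exactly why the trade-off table below recommends per-minute thresholds rather than alerting on individual failures.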
Kubernetes RBAC Denial Monitoring
# RBAC denials from the API server
- alert: RBACDenialSpike
  expr: >
    rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])
      > 3 * avg_over_time(rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])[7d:5m])
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "RBAC denial rate is 3x above 7-day average"
    runbook_url: "https://systemshardening.com/runbooks/rbac-denial"
    description: |
      Current rate: {{ $value | humanize }}/sec.
      Investigate: is a service account misconfigured, or is someone probing for permissions?
# Mutating API calls outside the control plane. API server metrics carry no
# per-user label, so this approximates privileged activity rather than
# identifying cluster-admin usage directly.
- alert: ClusterAdminUsage
  expr: >
    increase(apiserver_request_total{
      verb=~"create|update|patch|delete",
      userAgent!~".*kube-controller-manager.*|.*kube-scheduler.*"
    }[5m]) > 0
    and on() (apiserver_authorization_decisions_total{decision="allow"} > 0)
  labels:
    severity: info
  annotations:
    summary: "Mutation API call detected, review for unauthorized changes"
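The RBACDenialSpike expression evaluates a 7-day subquery over raw samples on every rule cycle, which is one of the more expensive queries Prometheus can run. A cheaper option (a sketch; the rule names follow the naming convention used above but are otherwise my own) is to pre-aggregate the baseline with recording rules:

```yaml
# Record the 5m denial rate once; averaging the recorded series over 7 days
# is far cheaper than re-evaluating a 7d subquery against raw data.
- record: security:rbac_denials:rate5m
  expr: rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])
- record: security:rbac_denials:avg7d
  expr: avg_over_time(security:rbac_denials:rate5m[7d])
# The alert expression then simplifies to:
#   security:rbac_denials:rate5m > 3 * security:rbac_denials:avg7d
```

The recorded baseline also makes the alert's behaviour inspectable: you can graph the two series side by side when tuning the 3x multiplier.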
Certificate Expiry Monitoring
- name: certificates
  interval: 1m
  rules:
    # cert-manager certificate expiry
    - alert: CertificateExpiringSoon
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value | humanizeDuration }}"
    - alert: CertificateExpiryCritical
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 24 * 3600
      labels:
        severity: critical
      annotations:
        summary: "Certificate {{ $labels.name }} expires in {{ $value | humanizeDuration }}. IMMEDIATE ACTION REQUIRED"
    # cert-manager renewal failures
    - alert: CertificateRenewalFailed
      expr: certmanager_certificate_ready_status{condition="False"} == 1
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate {{ $labels.name }} renewal has failed for over 1 hour"
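cert-manager metrics only cover certificates cert-manager itself manages. Certificates issued out-of-band (load balancers, external CAs, legacy services) need a different source; the blackbox exporter's probe_ssl_earliest_cert_expiry metric observes whatever cert is actually served on the wire. A sketch, assuming a blackbox exporter scrape job named "blackbox" probing your HTTPS endpoints:

```yaml
# Expiry of TLS certs as served, regardless of who issued or manages them
- alert: ProbedCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 7 * 24 * 3600
  labels:
    severity: warning
  annotations:
    summary: "TLS cert served by {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
```

Probing the served cert also catches the failure mode where cert-manager renewed a certificate but the workload never reloaded it.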
Network Policy Drop Monitoring
- name: network-security
  interval: 30s
  rules:
    # Cilium network policy drops
    - alert: NetworkPolicyDrop
      expr: rate(cilium_drop_count_total{reason="POLICY_DENIED"}[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Network policy dropping traffic in {{ $labels.namespace }}"
        description: "{{ $value | humanize }} packets/sec dropped. Check if a new service needs a policy update or if this is suspicious traffic."
    # New destination detection (lateral movement indicator).
    # The 7-day baseline is offset by 1h so it excludes the window under test;
    # without the offset the baseline always contains the last hour and the
    # alert can never fire. Grouping includes the destination so a known
    # workload talking to a NEW peer is caught, not just new workloads.
    - alert: NewNetworkDestination
      expr: >
        count by (source_workload, destination_workload) (
          rate(hubble_flows_processed_total{verdict="FORWARDED"}[1h]) > 0
        )
        unless
        count by (source_workload, destination_workload) (
          rate(hubble_flows_processed_total{verdict="FORWARDED"}[7d] offset 1h) > 0
        )
      labels:
        severity: info
      annotations:
        summary: "{{ $labels.source_workload }} connected to a destination not seen in the past 7 days"
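Workload labels on Hubble flow metrics are not emitted by default; the flow metric must be configured with source and destination context options. A sketch of the corresponding Cilium Helm values (the exact option syntax and the resulting label names vary by Cilium version, so check the Hubble metrics documentation for your release):

```yaml
hubble:
  metrics:
    enabled:
      # Label flows and drops with workload identity on both ends so
      # per-workload queries like NewNetworkDestination are possible
      - flow:sourceContext=workload-name;destinationContext=workload-name
      - drop:sourceContext=workload-name
```

Be deliberate here: per-workload context labels multiply series count, which is exactly the cardinality-explosion failure mode described below.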
Resource Exhaustion (Security-Relevant)
- name: resource-security
  interval: 30s
  rules:
    # OOM kills - could indicate crypto mining or a resource exhaustion attack.
    # Pairing the termination reason with a restart in the window makes the
    # alert fire once per kill, instead of relying on increase() over a gauge
    # that simply sits at 1.
    - alert: OOMKillDetected
      expr: >
        (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
        and on (namespace, pod, container)
        (increase(kube_pod_container_status_restarts_total[5m]) > 0)
      labels:
        severity: warning
      annotations:
        summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
    # Unexpected high CPU - crypto mining indicator. container!="" drops the
    # cAdvisor pod-level aggregate series; summing and matching on container
    # avoids many-to-many errors for multi-container pods.
    - alert: UnexpectedHighCPU
      expr: >
        (sum by (namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{container!=""}[5m])
        )
        / on (namespace, pod, container)
        kube_pod_container_resource_limits{resource="cpu"})
        > 0.95
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Container {{ $labels.container }} at >95% CPU limit for 15 minutes"
        description: "Sustained high CPU could indicate crypto mining. Investigate the process."
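UnexpectedHighCPU has a blind spot: the division only produces results for containers that declare a CPU limit, so a miner in a limitless container never appears. A companion query (a sketch using the same cAdvisor and kube-state-metrics series; the alert name and 0.5-core threshold are my own) surfaces those containers:

```yaml
# Containers burning CPU with no limit declared - invisible to UnexpectedHighCPU
- alert: CPUUsageWithoutLimit
  expr: >
    sum by (namespace, pod, container) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    ) > 0.5
    unless on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="cpu"}
  for: 15m
  labels:
    severity: info
  annotations:
    summary: "Container {{ $labels.container }} uses >0.5 cores with no CPU limit set"
```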
Complete PrometheusRule Deployment
# Save all rules to a single file and apply:
kubectl apply -f security-prometheus-rules.yaml
# Verify rules are loaded:
kubectl get prometheusrules -n monitoring
# Expected: security-auth-alerts listed
# Check rules in Prometheus UI:
# Navigate to http://prometheus:9090/rules
# Look for the 'authentication', 'certificates', 'network-security', 'resource-security' groups
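Rules can also be unit-tested offline with promtool before they reach the cluster. A sketch for BruteForceDetected (the test file name is an assumption; the input series pushes 20 failures per 30s, roughly 0.67/sec, which is above the 0.5/sec threshold):

```yaml
# security-rules-test.yaml - run with: promtool test rules security-rules-test.yaml
rule_files:
  - security-prometheus-rules.yaml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # 20 new failures every 30s from a single source
      - series: 'auth_failures_total{result="failure", source_ip="203.0.113.5", service="login-api"}'
        values: '0+20x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: BruteForceDetected
        exp_alerts:
          - exp_labels:
              severity: warning
              source_ip: "203.0.113.5"
              service: "login-api"
```

Running this in CI catches broken expressions and threshold regressions before a bad rule silently stops firing in production.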
Alert Routing
# Alertmanager configuration for security alerts.
# Route security alerts to the security team channel, not the general on-call.
route:
  receiver: default  # the 'default' receiver is assumed to be defined in the full config
  routes:
    - match:
        severity: critical
      receiver: security-pager
      continue: true
    - match_re:
        alertname: "BruteForce.*|CredentialStuffing|RBACDenial.*|CertificateExpiry.*"
      receiver: security-slack
      group_wait: 30s
      group_interval: 5m
receivers:
  - name: security-slack
    slack_configs:
      - channel: '#security-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: security-pager
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"  # Alertmanager does not expand env vars; substitute at deploy time
Expected Behaviour
- Security alerts fire within 1-2 minutes of threshold breach
- Brute force detection triggers on >30 auth failures per minute from a single source
- Certificate expiry alerts fire at 7 days (warning) and 1 day (critical) before expiry
- Network policy drops generate informational alerts (not pages, too noisy for paging)
- RBAC denial spikes alert when rate exceeds 3x the 7-day average
- All alerts include runbook URLs and actionable context (which service, which source, what to investigate)
- False positive rate below 2 per day after 2-week tuning period
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Auth failure alerting | Detects brute force within minutes | Legitimate login failures (wrong password) generate noise | Set threshold high enough to exclude individual failures (>30/min, not >1/min). |
| RBAC denial alerting | Catches permission escalation attempts | CI/CD pipelines with wrong permissions trigger false positives | Exclude known CI service accounts from the alert. |
| Network policy drop alerting | Detects lateral movement attempts | New deployments generate drops until policies are updated | Use info severity (not warning or critical). Suppress during deployment windows. |
| 30-second scrape interval | Near-real-time detection | Slightly higher Prometheus resource usage | Default 30s is fine. Only reduce to 15s if detection latency is critical. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Prometheus scrape target down | Missing metrics for a service; security gap | up == 0 alert fires for the missing target | Fix ServiceMonitor/PodMonitor. Check network policy allows Prometheus to scrape the target. |
| Alert threshold too sensitive | 10+ false positive pages per day | On-call fatigue; team starts ignoring security alerts | Increase thresholds. Add deployment-window suppression. Move high-noise alerts to info severity (Slack, not pager). |
| Metric cardinality explosion | Prometheus OOM or slow queries | Prometheus memory spike; query latency increase | Drop high-cardinality labels (per-IP auth metrics → aggregate to per-service). Use recording rules for pre-aggregation. |
| Alertmanager routing misconfigured | Security alerts go to the wrong channel or nobody | Test alerts not received; real incidents missed | Test alert routing monthly (send a test alert, verify delivery). Include alerting in security drill. |
When to Consider a Managed Alternative
Consider a managed alternative when self-managed Prometheus storage exceeds 500GB within 6 months for a 20-node cluster, when HA Prometheus (Thanos/Cortex) adds more operational complexity than the team can absorb, or when cross-cluster metric aggregation would require building federation or remote-write infrastructure.
- Grafana Cloud (#108): Prometheus-compatible remote write, managed storage, unified alerting across metrics and logs. Start free (10K metrics). The most natural migration path from self-hosted Prometheus.
- Chronosphere (#116): Handles high-cardinality metrics (per-IP, per-user) without cost explosion. Built on M3. For teams where cardinality is the primary scaling challenge.
- VictoriaMetrics (#111): Self-hosted but with much lower resource usage than Prometheus. Extends the self-hosted runway before a managed service becomes necessary.
Premium content pack: Complete PrometheusRule YAML for all security metrics (auth, RBAC, certs, network, resource), Alertmanager routing configuration, and Grafana dashboard JSON for security monitoring. See Article #74 for the dashboard design guide.