Security-Relevant Prometheus Metrics: What to Collect, How to Alert, When to Page
Problem
Prometheus is deployed in most Kubernetes environments for infrastructure monitoring (CPU, memory, disk, request latency), but security teams rarely use it for detection. Authentication failures, RBAC denials, certificate expiry, network policy drops, and syscall violations all produce Prometheus metrics, yet nobody writes alert rules for them. The gap between “infrastructure observability” and “security monitoring” is not a tooling gap; it is an alert-rules gap.
Threat Model
- Adversary: Any attacker. Security metrics detect brute force (auth failure spikes), privilege escalation (RBAC deny spikes), lateral movement (network policy drops), resource exhaustion (OOM kills from crypto miners), and misconfiguration (certificate expiry).
- Without security metrics, attacks are detected by their EFFECTS (outage, data breach, cost spike), often days or weeks later. With security metrics, attacks are detected by their CAUSES (auth failure spike, unusual RBAC denials), within minutes.
Configuration
Authentication Failure Monitoring
# PrometheusRule for authentication failures
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-auth-alerts
  namespace: monitoring
spec:
  groups:
    - name: authentication
      interval: 30s
      rules:
        # Recording rule: auth failure rate per source
        - record: security:auth_failures:rate5m
          expr: sum by (source_ip, service) (rate(auth_failures_total{result="failure"}[5m]))
        # Alert: brute force detection
        - alert: BruteForceDetected
          expr: security:auth_failures:rate5m > 0.5  # >30 failures per minute
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Possible brute force against {{ $labels.service }} from {{ $labels.source_ip }}"
            runbook_url: "https://systemshardening.com/runbooks/brute-force"
            description: "{{ $value | humanize }} auth failures/sec from {{ $labels.source_ip }}"
        # Alert: credential stuffing (many IPs, same pattern)
        - alert: CredentialStuffing
          expr: count by (service) (security:auth_failures:rate5m > 0.1) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Possible credential stuffing against {{ $labels.service }}, {{ $value }} source IPs"
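The threshold arithmetic matters: 0.5 failures/sec over a 5-minute window is 30 failures/minute, so a handful of mistyped passwords never trips the alert. The sliding-window semantics of the recording rule plus threshold can be sketched in pure Python (an illustration of the semantics only, not how Prometheus computes rate() internally; class and method names are invented for this sketch):

```python
from collections import deque


class FailureRateTracker:
    """Sliding-window failure rate per (source_ip, service), mirroring the
    semantics of the security:auth_failures:rate5m rule and its 0.5/sec cut."""

    def __init__(self, window_s: float = 300.0, threshold_per_s: float = 0.5):
        self.window_s = window_s
        self.threshold_per_s = threshold_per_s  # 0.5/sec == 30/min
        self._events: dict[tuple[str, str], deque] = {}

    def record_failure(self, source_ip: str, service: str, now: float) -> None:
        self._events.setdefault((source_ip, service), deque()).append(now)

    def rate(self, source_ip: str, service: str, now: float) -> float:
        q = self._events.get((source_ip, service), deque())
        # Drop events that have fallen out of the 5-minute window
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q) / self.window_s

    def brute_force_suspected(self, source_ip: str, service: str, now: float) -> bool:
        return self.rate(source_ip, service, now) > self.threshold_per_s


tracker = FailureRateTracker()
# 200 failures in 5 minutes is ~0.67/sec: above the 0.5/sec threshold
for i in range(200):
    tracker.record_failure("203.0.113.5", "login-api", now=float(i))
print(tracker.brute_force_suspected("203.0.113.5", "login-api", now=300.0))  # True
```

A user fat-fingering a password a few times produces a rate of well under 0.1/sec and stays invisible, which is exactly why the trade-off table below recommends per-minute thresholds rather than alerting on individual failures.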
Kubernetes RBAC Denial Monitoring
# RBAC denials from the API server
- alert: RBACDenialSpike
  expr: >
    rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])
      > 3 * avg_over_time(rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])[7d:5m])
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "RBAC denial rate is 3x above 7-day average"
    runbook_url: "https://systemshardening.com/runbooks/rbac-denial"
    description: |
      Current rate: {{ $value | humanize }}/sec.
      Investigate: is a service account misconfigured, or is someone probing for permissions?
# Mutating API calls outside the control plane. API server metrics carry no
# per-user label, so this approximates privileged activity rather than
# identifying cluster-admin usage directly.
- alert: ClusterAdminUsage
  expr: >
    increase(apiserver_request_total{
      verb=~"create|update|patch|delete",
      userAgent!~".*kube-controller-manager.*|.*kube-scheduler.*"
    }[5m]) > 0
    and on() (apiserver_authorization_decisions_total{decision="allow"} > 0)
  labels:
    severity: info
  annotations:
    summary: "Mutation API call detected, review for unauthorized changes"
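The RBACDenialSpike expression evaluates a 7-day subquery over raw samples on every rule cycle, which is one of the more expensive queries Prometheus can run. A cheaper option (a sketch; the rule names follow the naming convention used above but are otherwise my own) is to pre-aggregate the baseline with recording rules:

```yaml
# Record the 5m denial rate once; averaging the recorded series over 7 days
# is far cheaper than re-evaluating a 7d subquery against raw data.
- record: security:rbac_denials:rate5m
  expr: rate(apiserver_authorization_decisions_total{decision="forbid"}[5m])
- record: security:rbac_denials:avg7d
  expr: avg_over_time(security:rbac_denials:rate5m[7d])
# The alert expression then simplifies to:
#   security:rbac_denials:rate5m > 3 * security:rbac_denials:avg7d
```

The recorded baseline also makes the alert's behaviour inspectable: you can graph the two series side by side when tuning the 3x multiplier.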
Certificate Expiry Monitoring
- name: certificates
  interval: 1m
  rules:
    # cert-manager certificate expiry
    - alert: CertificateExpiringSoon
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value | humanizeDuration }}"
    - alert: CertificateExpiryCritical
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 24 * 3600
      labels:
        severity: critical
      annotations:
        summary: "Certificate {{ $labels.name }} expires in {{ $value | humanizeDuration }}. IMMEDIATE ACTION REQUIRED"
    # cert-manager renewal failures
    - alert: CertificateRenewalFailed
      expr: certmanager_certificate_ready_status{condition="False"} == 1
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate {{ $labels.name }} renewal has failed for over 1 hour"
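cert-manager metrics only cover certificates cert-manager itself manages. Certificates issued out-of-band (load balancers, external CAs, legacy services) need a different source; the blackbox exporter's probe_ssl_earliest_cert_expiry metric observes whatever cert is actually served on the wire. A sketch, assuming a blackbox exporter scrape job named "blackbox" probing your HTTPS endpoints:

```yaml
# Expiry of TLS certs as served, regardless of who issued or manages them
- alert: ProbedCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 7 * 24 * 3600
  labels:
    severity: warning
  annotations:
    summary: "TLS cert served by {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
```

Probing the served cert also catches the failure mode where cert-manager renewed a certificate but the workload never reloaded it.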
Network Policy Drop Monitoring
- name: network-security
  interval: 30s
  rules:
    # Cilium network policy drops
    - alert: NetworkPolicyDrop
      expr: rate(cilium_drop_count_total{reason="POLICY_DENIED"}[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Network policy dropping traffic in {{ $labels.namespace }}"
        description: "{{ $value | humanize }} packets/sec dropped. Check if a new service needs a policy update or if this is suspicious traffic."
    # New destination detection (lateral movement indicator).
    # The 7-day baseline is offset by 1h so it excludes the window under test;
    # without the offset the baseline always contains the last hour and the
    # alert can never fire. Grouping includes the destination so a known
    # workload talking to a NEW peer is caught, not just new workloads.
    - alert: NewNetworkDestination
      expr: >
        count by (source_workload, destination_workload) (
          rate(hubble_flows_processed_total{verdict="FORWARDED"}[1h]) > 0
        )
        unless
        count by (source_workload, destination_workload) (
          rate(hubble_flows_processed_total{verdict="FORWARDED"}[7d] offset 1h) > 0
        )
      labels:
        severity: info
      annotations:
        summary: "{{ $labels.source_workload }} connected to a destination not seen in the past 7 days"
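Workload labels on Hubble flow metrics are not emitted by default; the flow metric must be configured with source and destination context options. A sketch of the corresponding Cilium Helm values (the exact option syntax and the resulting label names vary by Cilium version, so check the Hubble metrics documentation for your release):

```yaml
hubble:
  metrics:
    enabled:
      # Label flows and drops with workload identity on both ends so
      # per-workload queries like NewNetworkDestination are possible
      - flow:sourceContext=workload-name;destinationContext=workload-name
      - drop:sourceContext=workload-name
```

Be deliberate here: per-workload context labels multiply series count, which is exactly the cardinality-explosion failure mode described below.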
Resource Exhaustion (Security-Relevant)
- name: resource-security
  interval: 30s
  rules:
    # OOM kills - could indicate crypto mining or a resource exhaustion attack.
    # Pairing the termination reason with a restart in the window makes the
    # alert fire once per kill, instead of relying on increase() over a gauge
    # that simply sits at 1.
    - alert: OOMKillDetected
      expr: >
        (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
        and on (namespace, pod, container)
        (increase(kube_pod_container_status_restarts_total[5m]) > 0)
      labels:
        severity: warning
      annotations:
        summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
    # Unexpected high CPU - crypto mining indicator. container!="" drops the
    # cAdvisor pod-level aggregate series; summing and matching on container
    # avoids many-to-many errors for multi-container pods.
    - alert: UnexpectedHighCPU
      expr: >
        (sum by (namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{container!=""}[5m])
        )
        / on (namespace, pod, container)
        kube_pod_container_resource_limits{resource="cpu"})
        > 0.95
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Container {{ $labels.container }} at >95% CPU limit for 15 minutes"
        description: "Sustained high CPU could indicate crypto mining. Investigate the process."
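UnexpectedHighCPU has a blind spot: the division only produces results for containers that declare a CPU limit, so a miner in a limitless container never appears. A companion query (a sketch using the same cAdvisor and kube-state-metrics series; the alert name and 0.5-core threshold are my own) surfaces those containers:

```yaml
# Containers burning CPU with no limit declared - invisible to UnexpectedHighCPU
- alert: CPUUsageWithoutLimit
  expr: >
    sum by (namespace, pod, container) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    ) > 0.5
    unless on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="cpu"}
  for: 15m
  labels:
    severity: info
  annotations:
    summary: "Container {{ $labels.container }} uses >0.5 cores with no CPU limit set"
```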
Complete PrometheusRule Deployment
# Save all rules to a single file and apply:
kubectl apply -f security-prometheus-rules.yaml
# Verify rules are loaded:
kubectl get prometheusrules -n monitoring
# Expected: security-auth-alerts listed
# Check rules in Prometheus UI:
# Navigate to http://prometheus:9090/rules
# Look for the 'authentication', 'certificates', 'network-security', 'resource-security' groups
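Rules can also be unit-tested offline with promtool before they reach the cluster. A sketch for BruteForceDetected (the test file name is an assumption; the input series pushes 20 failures per 30s, roughly 0.67/sec, which is above the 0.5/sec threshold):

```yaml
# security-rules-test.yaml - run with: promtool test rules security-rules-test.yaml
rule_files:
  - security-prometheus-rules.yaml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # 20 new failures every 30s from a single source
      - series: 'auth_failures_total{result="failure", source_ip="203.0.113.5", service="login-api"}'
        values: '0+20x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: BruteForceDetected
        exp_alerts:
          - exp_labels:
              severity: warning
              source_ip: "203.0.113.5"
              service: "login-api"
```

Running this in CI catches broken expressions and threshold regressions before a bad rule silently stops firing in production.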
Alert Routing
# Alertmanager configuration for security alerts.
# Route security alerts to the security team channel, not the general on-call.
route:
  receiver: default  # the 'default' receiver is assumed to be defined in the full config
  routes:
    - match:
        severity: critical
      receiver: security-pager
      continue: true
    - match_re:
        alertname: "BruteForce.*|CredentialStuffing|RBACDenial.*|CertificateExpiry.*"
      receiver: security-slack
      group_wait: 30s
      group_interval: 5m
receivers:
  - name: security-slack
    slack_configs:
      - channel: '#security-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: security-pager
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"  # Alertmanager does not expand env vars; substitute at deploy time
Expected Behaviour
- Security alerts fire within 1-2 minutes of threshold breach
- Brute force detection triggers on >30 auth failures per minute from a single source
- Certificate expiry alerts fire at 7 days (warning) and 1 day (critical) before expiry
- Network policy drops generate informational alerts (not pages, too noisy for paging)
- RBAC denial spikes alert when rate exceeds 3x the 7-day average
- All alerts include runbook URLs and actionable context (which service, which source, what to investigate)
- False positive rate below 2 per day after 2-week tuning period
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Auth failure alerting | Detects brute force within minutes | Legitimate login failures (wrong password) generate noise | Set threshold high enough to exclude individual failures (>30/min, not >1/min). |
| RBAC denial alerting | Catches permission escalation attempts | CI/CD pipelines with wrong permissions trigger false positives | Exclude known CI service accounts from the alert. |
| Network policy drop alerting | Detects lateral movement attempts | New deployments generate drops until policies are updated | Use info severity (not warning or critical). Suppress during deployment windows. |
| 30-second scrape interval | Near-real-time detection | Slightly higher Prometheus resource usage | Default 30s is fine. Only reduce to 15s if detection latency is critical. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Prometheus scrape target down | Missing metrics for a service; security gap | up == 0 alert fires for the missing target | Fix ServiceMonitor/PodMonitor. Check network policy allows Prometheus to scrape the target. |
| Alert threshold too sensitive | 10+ false positive pages per day | On-call fatigue; team starts ignoring security alerts | Increase thresholds. Add deployment-window suppression. Move high-noise alerts to info severity (Slack, not pager). |
| Metric cardinality explosion | Prometheus OOM or slow queries | Prometheus memory spike; query latency increase | Drop high-cardinality labels (per-IP auth metrics → aggregate to per-service). Use recording rules for pre-aggregation. |
| Alertmanager routing misconfigured | Security alerts go to the wrong channel or nobody | Test alerts not received; real incidents missed | Test alert routing monthly (send a test alert, verify delivery). Include alerting in security drill. |
When to Consider a Managed Alternative
Consider a managed alternative when self-managed Prometheus storage exceeds 500GB within 6 months for a 20-node cluster, when HA Prometheus (Thanos/Cortex) adds more operational complexity than the team can absorb, or when cross-cluster metric aggregation would require building federation or remote-write infrastructure.
- Grafana Cloud (#108): Prometheus-compatible remote write, managed storage, unified alerting across metrics and logs. Start free (10K metrics). The most natural migration path from self-hosted Prometheus.
- Chronosphere (#116): Handles high-cardinality metrics (per-IP, per-user) without cost explosion. Built on M3. For teams where cardinality is the primary scaling challenge.
- VictoriaMetrics (#111): Self-hosted but with much lower resource usage than Prometheus. Extends the self-hosted runway before a managed service becomes necessary.
Premium content pack: Complete PrometheusRule YAML for all security metrics (auth, RBAC, certs, network, resource), Alertmanager routing configuration, and Grafana dashboard JSON for security monitoring. See Article #74 for the dashboard design guide.