Security Dashboards That Engineers Actually Use: Grafana Designs for Hardening Verification

Problem

Most security dashboards are built from vanity metrics: total alerts this month, pie charts of vulnerability severity, traffic heatmaps that look impressive but answer no actionable question. Engineers glance at them once and never return. The result: security state is invisible until something breaks.

Effective security dashboards answer one question: are my security controls working right now? Each panel has an associated action: green means “nothing to do,” red means “fix this today.”

Threat Model

Adversary: Invisible security degradation. A network policy that was deleted. A certificate that is expiring. A seccomp profile that was removed during a deployment. A Falco rule that stopped matching. Without dashboards, these regressions go unnoticed until the next audit or incident.

Configuration

Dashboard Design Principles

Answer a question; do not merely display a metric. Every panel answers “Is X working?”, not “What is the value of Y?”
Red/amber/green status over raw numbers. A single-stat panel showing “3 pods without network policy” is more useful than a time-series graph of policy counts.
Action link on every panel. Red panel → link to the runbook or the article that fixes the issue.
No vanity metrics. Remove panels that nobody acts on. If a panel has been green for 6 months and nobody has ever clicked on it, delete it.
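
The principles above can be checked mechanically before a panel is added to the dashboard. The following is a minimal sketch, not part of any Grafana tooling: `lint_panel` is a hypothetical helper, and it assumes the simplified panel layout used in this article's fragments (top-level `description`, `thresholds`, and `links` keys).

```python
def lint_panel(panel: dict) -> list[str]:
    """Return the design-principle violations found in one panel dict."""
    problems = []
    # Principle 1: the description should state the question the panel answers.
    if not panel.get("description"):
        problems.append("no description stating the question the panel answers")
    # Principle 2: stat panels need at least a green and a red threshold step.
    steps = panel.get("thresholds", {}).get("steps", [])
    colors = {step.get("color") for step in steps}
    if panel.get("type") == "stat" and not ({"green", "red"} <= colors):
        problems.append("stat panel lacks green and red threshold steps")
    # Principle 3: every panel links to the runbook or article that fixes it.
    if not panel.get("links"):
        problems.append("no action link to a runbook or article")
    return problems


good = {
    "title": "Pods Without Network Policy",
    "type": "stat",
    "description": "Are all pods covered by a network policy?",
    "thresholds": {"steps": [{"value": 0, "color": "green"},
                             {"value": 1, "color": "red"}]},
    "links": [{"title": "Fix: Apply network policies",
               "url": "/articles/kubernetes/network-policies/"}],
}
bad = {"title": "Total Alerts This Month", "type": "stat"}

print(lint_panel(good))  # []
for problem in lint_panel(bad):
    print("-", problem)
```

Running a check like this in CI over the dashboard JSON is one way to keep vanity panels from creeping back in; the fourth principle (delete panels nobody acts on) still needs a human review.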

Panel 1: Network Policy Coverage

{
  "title": "Pods Without Network Policy",
  "type": "stat",
  "description": "Number of pods in production and staging namespaces that contain no NetworkPolicy at all. Should be 0. This is a namespace-level approximation: kube-state-metrics does not evaluate pod selectors.",
  "targets": [{
    "expr": "count(kube_pod_info{namespace=~'production|staging'} unless on(namespace) kube_networkpolicy_created) or count(kube_pod_info{namespace=~'production|staging'}) * 0",
    "legendFormat": "Unprotected pods"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 1, "color": "red"}
    ]
  },
  "links": [{
    "title": "Fix: Apply network policies",
    "url": "/articles/kubernetes/network-policies/"
  }]
}

Panel 2: Pod Security Standard Compliance

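{
  "title": "Namespaces Without PSS Enforcement",
  "type": "stat",
  "description": "Namespaces with no pod-security.kubernetes.io/enforce label (an empty matcher also matches a missing label). Requires kube-state-metrics to expose this namespace label via --metric-labels-allowlist. Should be 0.",
  "targets": [{
    "expr": "count(kube_namespace_labels{label_pod_security_kubernetes_io_enforce=''}) or count(kube_namespace_labels) * 0",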
    "legendFormat": "Non-enforcing namespaces"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 1, "color": "yellow"},
      {"value": 3, "color": "red"}
    ]
  }
}
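
How the threshold steps map a reading to a color can be sketched in a few lines. This is an illustration of the lookup rule (the highest step whose value is at or below the reading wins), not Grafana's own code; `color_for` and `PSS_STEPS` are names chosen for this example.

```python
def color_for(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= the reading."""
    ordered = sorted(steps, key=lambda step: step["value"])
    chosen = ordered[0]["color"]  # the lowest step acts as the base color
    for step in ordered:
        if value >= step["value"]:
            chosen = step["color"]
    return chosen


# The same steps as the PSS enforcement panel above.
PSS_STEPS = [
    {"value": 0, "color": "green"},
    {"value": 1, "color": "yellow"},
    {"value": 3, "color": "red"},
]

for namespaces in (0, 1, 2, 3):
    print(namespaces, color_for(namespaces, PSS_STEPS))
# 0 green
# 1 yellow
# 2 yellow
# 3 red
```

With these steps, one or two non-enforcing namespaces show amber and three or more show red, which is the behaviour to keep in mind when tuning the values per environment.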

Panel 3: Certificate Health

{
  "title": "Certificates Expiring Within 30 Days",
  "type": "stat",
  "targets": [{
    "expr": "count(certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600) or count(certmanager_certificate_expiration_timestamp_seconds) * 0",
    "legendFormat": "Expiring soon"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 1, "color": "yellow"},
      {"value": 3, "color": "red"}
    ]
  },
  "links": [{
    "title": "Fix: Certificate monitoring",
    "url": "/articles/observability/certificate-expiry/"
  }]
}
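
The arithmetic behind the expiry query can be worked through outside Prometheus. The sketch below mirrors the comparison `expiration - time() < 30 * 24 * 3600` on a few made-up certificates; the certificate names and dates are hypothetical sample data.

```python
from datetime import datetime, timedelta, timezone

WINDOW_SECONDS = 30 * 24 * 3600  # 2,592,000 seconds, the same constant as the query


def expiring_soon(not_after_ts: float, now_ts: float,
                  window: int = WINDOW_SECONDS) -> bool:
    """True when the certificate expires inside the window (or already has)."""
    return not_after_ts - now_ts < window


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = {
    "api-tls": now + timedelta(days=10),   # inside the 30-day window
    "web-tls": now + timedelta(days=90),   # healthy
    "old-tls": now - timedelta(days=1),    # already expired
}

count = sum(
    expiring_soon(expiry.timestamp(), now.timestamp())
    for expiry in certs.values()
)
print(count)  # 2
```

Note that an already-expired certificate yields a negative difference and is therefore counted as well, which is the desired behaviour for this panel.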

Panel 4: RBAC Health

{
  "title": "ClusterRoleBindings with cluster-admin",
  "type": "stat",
  "description": "Number of ClusterRoleBindings granting cluster-admin. Should be minimal (kube-system only).",
  "targets": [{
    "expr": "count(kube_clusterrolebinding_info{roleref_kind='ClusterRole', roleref_name='cluster-admin'})",
    "legendFormat": "cluster-admin bindings"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 3, "color": "yellow"},
      {"value": 5, "color": "red"}
    ]
  }
}

Panel 5: Vulnerability Status

{
  "title": "Images with Critical CVEs in Production",
  "type": "stat",
  "targets": [{
    "expr": "count(trivy_vulnerability_id{severity='CRITICAL', namespace=~'production|staging'}) or count(trivy_vulnerability_id) * 0",
    "legendFormat": "Critical CVEs"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 1, "color": "red"}
    ]
  }
}

Panel 6: Falco Alert Rate

{
  "title": "Falco Alerts (24h)",
  "type": "timeseries",
  "description": "Alert rate by priority. Spikes indicate potential security events.",
  "targets": [
    {"expr": "sum by (priority) (rate(falco_events_total[1h]))", "legendFormat": "{{ priority }}"}
  ],
  "fieldConfig": {
    "overrides": [
      {"matcher": {"id": "byName", "options": "Critical"}, "properties": [{"id": "color", "value": "red"}]},
      {"matcher": {"id": "byName", "options": "Warning"}, "properties": [{"id": "color", "value": "orange"}]}
    ]
  }
}

Panel 7: Seccomp Coverage

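{
  "title": "Pods Without Seccomp Profile",
  "type": "stat",
  "description": "Pods with no legacy seccomp annotation. Pods that set only securityContext.seccompProfile may not be detected by this query. Requires kube-state-metrics to expose the annotation via --metric-annotations-allowlist.",
  "targets": [{
    "expr": "count(kube_pod_info{namespace=~'production|staging'} unless on(pod, namespace) kube_pod_annotations{annotation_seccomp_security_alpha_kubernetes_io_pod!=''}) or count(kube_pod_info{namespace=~'production|staging'}) * 0",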
    "legendFormat": "No seccomp"
  }],
  "thresholds": {
    "steps": [
      {"value": 0, "color": "green"},
      {"value": 1, "color": "yellow"},
      {"value": 5, "color": "red"}
    ]
  }
}

Complete Dashboard JSON

The full Grafana dashboard JSON (all 7 panels + variables for namespace/cluster filtering) is available in the premium content pack. To import:

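# Import via Grafana API. The endpoint expects the dashboard wrapped in a
# {"dashboard": ..., "overwrite": ...} envelope. Assuming security-dashboard.json
# is a bare dashboard model, wrap it with jq before posting.
jq '{dashboard: (. + {id: null}), overwrite: true}' security-dashboard.json |
  curl -X POST -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d @- \
    "$GRAFANA_URL/api/dashboards/db"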

Dashboard Refresh and Variables

{
  "refresh": "30s",
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_namespace_labels, namespace)",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "cluster",
        "type": "query",
        "query": "label_values(kube_node_info, cluster)",
        "multi": false
      }
    ]
  }
}
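
The panel fragments in this article hard-code `production|staging`, while the settings above define a `$namespace` variable. The following sketch shows one way the pieces could be combined into a single dashboard: it swaps the hard-coded regex for the variable, lays panels out on Grafana's 24-column grid, and wraps the result in the envelope that `POST /api/dashboards/db` expects. The helper names, panel sizes, and dashboard title are assumptions for illustration, and the sample panels are placeholders rather than the real seven.

```python
import json
import re

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid
HARDCODED_NS = re.compile(r"namespace=~'production\|staging'")


def use_namespace_variable(expr: str) -> str:
    """Swap the hard-coded namespace regex for the $namespace variable."""
    return HARDCODED_NS.sub("namespace=~'$namespace'", expr)


def build_dashboard(title, panels, settings, width=6, height=4):
    """Combine panel fragments and top-level settings into one dashboard dict."""
    per_row = GRID_COLUMNS // width
    placed = []
    for index, panel in enumerate(panels):
        row, col = divmod(index, per_row)
        targets = [
            {**target, "expr": use_namespace_variable(target["expr"])}
            for target in panel.get("targets", [])
        ]
        placed.append({
            **panel,
            "id": index + 1,
            "targets": targets,
            "gridPos": {"x": col * width, "y": row * height,
                        "w": width, "h": height},
        })
    # id None asks Grafana to create a new dashboard rather than update one.
    return {"id": None, "title": title, "panels": placed, **settings}


def import_payload(dashboard, overwrite=True) -> str:
    """Wrap the dashboard in the envelope used by POST /api/dashboards/db."""
    return json.dumps({"dashboard": dashboard, "overwrite": overwrite})


settings = {
    "refresh": "30s",
    "templating": {"list": [{"name": "namespace", "type": "query"}]},
}
panels = [
    {"title": f"Panel {n}", "type": "stat",
     "targets": [{"expr": "count(kube_pod_info{namespace=~'production|staging'})"}]}
    for n in range(1, 8)
]
dashboard = build_dashboard("Hardening Verification", panels, settings)
print(dashboard["panels"][4]["gridPos"])  # {'x': 0, 'y': 4, 'w': 6, 'h': 4}
print(dashboard["panels"][0]["targets"][0]["expr"])
# count(kube_pod_info{namespace=~'$namespace'})
```

The regex matcher `=~` is kept deliberately: a multi-value variable expands to an alternation of the selected namespaces, which an equality matcher would not accept.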

Expected Behaviour

Dashboard loads in under 3 seconds
All panels show current state with red/amber/green thresholds
Zero unprotected pods (network policy panel green)
Zero namespaces without PSS enforcement
Zero certificates expiring within 30 days
Minimal cluster-admin bindings (kube-system only)
Zero critical CVEs in production images
Falco alert rate stable and within tuned baseline

Trade-offs

Decision	Impact	Risk	Mitigation
30-second auto-refresh	Always current; uses Grafana resources	Unnecessary load for dashboards not actively viewed	Use auto-refresh only on actively-displayed dashboards. Disable for dashboards viewed occasionally.
Red/amber/green thresholds	Clear actionability	Threshold values need tuning per environment	Start with conservative thresholds. Adjust based on what’s achievable for your environment.
Action links on panels	Engineers can fix issues directly from the dashboard	Links to articles may become stale	Verify links in quarterly dashboard review.

Failure Modes

Failure	Symptom	Detection	Recovery
Prometheus scrape missing	Panel shows “No data”	Panel-level “no data” state visible; data source health check	Fix ServiceMonitor for the missing metric source.
Threshold too sensitive	Panel always red; team ignores it	Panel has been red for 30+ days with no action taken	Adjust threshold to reflect achievable state. A dashboard that is always red is useless.
Metric renamed after upgrade	Panel query returns zero results	Panel shows zero or “no data” after a Prometheus, Kubernetes, or exporter (for example kube-state-metrics) upgrade	Update PromQL query to use the new metric name.
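
The "missing scrape" and "metric renamed" failures can be caught before anyone looks at the dashboard by comparing the metric names each panel uses with the names Prometheus currently knows. The sketch below is a rough heuristic, not a PromQL parser: `metric_names` strips selectors and grouping clauses with regular expressions and filters a small, hand-picked keyword list. In practice the `known` set could be fetched from Prometheus's `/api/v1/label/__name__/values` endpoint; here it is hard-coded sample data.

```python
import re

# Keywords and functions that appear in this article's queries; extend as needed.
PROMQL_WORDS = {
    "count", "sum", "rate", "time", "vector", "or", "and", "unless",
    "by", "on", "without", "ignoring", "group_left", "group_right",
}


def metric_names(expr: str) -> set[str]:
    """Heuristically extract metric names from a PromQL expression."""
    text = re.sub(r"\{[^}]*\}", "", expr)      # drop label matchers
    text = re.sub(r"\[[^\]]*\]", "", text)     # drop range selectors like [1h]
    text = re.sub(                             # drop grouping label lists
        r"\b(?:by|on|without|ignoring|group_left|group_right)\s*\([^)]*\)",
        "", text)
    words = re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", text)
    return {word for word in words if word not in PROMQL_WORDS}


def missing_metrics(panels: list[dict], known: set[str]) -> dict[str, list[str]]:
    """Map panel title to the metrics it queries that Prometheus does not have."""
    report = {}
    for panel in panels:
        used = set()
        for target in panel.get("targets", []):
            used |= metric_names(target["expr"])
        absent = sorted(used - known)
        if absent:
            report[panel["title"]] = absent
    return report


panels = [
    {"title": "Falco Alerts (24h)", "targets": [
        {"expr": "sum by (priority) (rate(falco_events_total[1h]))"}]},
    {"title": "Certificates Expiring Within 30 Days", "targets": [
        {"expr": "count(certmanager_certificate_expiration_timestamp_seconds"
                 " - time() < 30 * 24 * 3600)"}]},
]
known = {"certmanager_certificate_expiration_timestamp_seconds"}

print(missing_metrics(panels, known))
# {'Falco Alerts (24h)': ['falco_events_total']}
```

A non-empty report after an upgrade points directly at the panels whose queries need a new metric name or a repaired ServiceMonitor.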

When to Consider a Managed Alternative

Self-managed Grafana requires persistent storage, user management, OIDC integration, and backup. Grafana Cloud provides managed Grafana with team access, cross-cluster dashboards, and built-in alerting. The free tier (3 users, 10K metrics) covers small teams.

Primary premium content pack: Importable Grafana JSON dashboard covering all hardening categories: network policies, PSS, certificates, RBAC, vulnerabilities, Falco, and seccomp. Variables for namespace and cluster filtering. Pre-configured thresholds with action links to systemshardening.com articles.