GPU Cost and Security Monitoring: Detecting Abuse and Optimising Spend

Problem

GPU compute costs between $2 and $30 per device-hour. A single unauthorised cryptocurrency mining pod occupying an eight-GPU A100 node for a weekend (roughly 48 hours at around $3.50 per GPU-hour) wastes $1,300+ in compute. Most Kubernetes observability stacks monitor CPU, memory, and disk but expose no GPU metrics. Without GPU-specific monitoring, teams cannot detect unauthorised usage, cannot allocate costs to teams, and cannot identify idle GPUs that could be reclaimed.

Standard Kubernetes resource metrics (from metrics-server or cAdvisor) do not include GPU utilisation, GPU memory, power draw, or temperature. You need NVIDIA DCGM (Data Center GPU Manager) exporting to Prometheus to get visibility. Without it, a GPU running at 100% utilisation on an unauthorised workload looks identical to an idle GPU in your monitoring dashboards, because those dashboards simply have no GPU data.

Target systems: Kubernetes clusters with NVIDIA GPUs. NVIDIA DCGM exporter. Prometheus and Grafana for metrics and visualisation.

Threat Model

  • Adversary: External attacker who has gained pod creation privileges (through compromised CI/CD, exposed Kubernetes API, or supply chain attack), or an insider running unauthorised workloads.
  • Objective: Use GPU resources for cryptocurrency mining, unauthorised model training, or other compute-intensive tasks at the organisation’s expense.
  • Blast radius: Financial: uncapped GPU costs until detected. Performance: legitimate workloads may be starved of GPU resources. Security: if the attacker has pod creation privileges, GPU abuse is likely not their only activity. The GPU mining is the visible symptom of a deeper compromise.

Configuration

Deploy NVIDIA DCGM Exporter

DCGM exporter collects GPU metrics and exposes them in Prometheus format.

# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          ports:
            - containerPort: 9400
              name: metrics
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
          securityContext:
            runAsNonRoot: false  # DCGM requires root for GPU access
            capabilities:
              add:
                - SYS_ADMIN  # Required for GPU monitoring
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
          volumeMounts:
            - name: device-plugins
              mountPath: /var/lib/kubelet/device-plugins
              readOnly: true
      volumes:
        - name: device-plugins
          hostPath:
            path: /var/lib/kubelet/device-plugins
---
# ServiceMonitor for Prometheus Operator.
# Note: a ServiceMonitor selects Services, not pods, so a Service carrying the
# app: dcgm-exporter label is also required (or use a PodMonitor instead).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
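Since the ServiceMonitor above matches on Service labels rather than pod labels, a Service covering the DaemonSet pods is also needed for Prometheus to discover endpoints. A minimal sketch, with names and ports assumed to mirror the DaemonSet above:

```yaml
# dcgm-exporter-service.yaml — minimal Service so the ServiceMonitor
# has endpoints to discover; labels and port match the DaemonSet.
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
```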

GPU Security Alerts

# prometheus-gpu-security-rules.yaml
groups:
  - name: gpu-security
    rules:
      # Detect GPU utilisation from unexpected namespaces
      - alert: UnauthorisedGPUUsage
        expr: >
          DCGM_FI_DEV_GPU_UTIL{namespace!~"ai-training|ml-serving|ml-platform"} > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU usage detected in unexpected namespace {{ $labels.namespace }}"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is using GPU. Only ai-training, ml-serving, and ml-platform namespaces should have GPU workloads."
          runbook: "Investigate the pod. If unauthorised, delete it and audit how it was created."

      # Sustained high utilisation without a matching training job
      - alert: SustainedGPUWithoutJob
        expr: >
          DCGM_FI_DEV_GPU_UTIL > 90
          unless on (pod, namespace)
          kube_pod_labels{label_job_type="training"}
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU at >90% utilisation for 30m without training job label"
          description: "Pod {{ $labels.pod }} is consuming GPU heavily but is not labelled as a training job."

      # GPU memory nearly full (potential crypto miner or memory leak)
      - alert: GPUMemoryExhaustion
        expr: >
          DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory >95% used on {{ $labels.gpu }}"

      # Unexpected GPU temperature (mining or overloading)
      - alert: GPUTemperatureHigh
        expr: >
          DCGM_FI_DEV_GPU_TEMP > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85C on {{ $labels.gpu }}"
          description: "Sustained high temperature may indicate an unauthorised workload or cooling failure."
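The idle-GPU reclamation check described under Expected Behaviour (under 5% utilisation for over 1 hour) is not covered by the rules above; a sketch of such a rule, which could be appended to the same group:

```yaml
      # Sketch: flag idle GPUs for reclamation (thresholds from the
      # Expected Behaviour section: <5% utilisation for over 1 hour)
      - alert: GPUIdle
        expr: >
          DCGM_FI_DEV_GPU_UTIL < 5
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} under 5% utilisation for over 1h"
          description: "Consider reclaiming this GPU or scaling down the node."
```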

Cost Allocation and Tracking

# prometheus-gpu-cost-recording-rules.yaml
groups:
  - name: gpu-cost-tracking
    interval: 5m
    rules:
      # GPU-hours per namespace per day
      - record: gpu:namespace:hours_used:daily
        expr: >
          sum by (namespace) (
            count_over_time((DCGM_FI_DEV_GPU_UTIL > 0)[1d:5m]) / 12
          )
        # Each sample is 5m apart, 12 samples = 1 hour

      # Estimated cost per namespace (configurable rate)
      - record: gpu:namespace:estimated_cost_usd:daily
        expr: >
          gpu:namespace:hours_used:daily * 3.50
        # $3.50/hour is approximate A100 on-demand cost. Adjust for your rate.

      # GPU utilisation efficiency (actual util vs allocated)
      - record: gpu:namespace:utilisation_efficiency
        expr: >
          avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
          / 100
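
The arithmetic behind these recording rules is easy to sanity-check offline. A small Python sketch (function names are illustrative, not part of any Prometheus client API), assuming one sample every 5 minutes and the same $3.50/hour rate as above:

```python
# Sanity-check of the recording-rule arithmetic above.
# Assumes one utilisation sample every 5 minutes, as in the rules.
HOURLY_RATE_USD = 3.50  # same illustrative A100 rate as the rules


def gpu_hours(samples, interval_minutes=5):
    """GPU-hours: count non-idle samples, convert sample count to hours."""
    active = sum(1 for util in samples if util > 0)
    return active * interval_minutes / 60  # 12 samples per hour at 5m spacing


def estimated_cost_usd(samples, rate=HOURLY_RATE_USD):
    """Estimated spend for one GPU over the sampled window."""
    return gpu_hours(samples) * rate


# One day = 288 five-minute samples; busy for 12 hours, idle for 12
day = [80] * 144 + [0] * 144
print(gpu_hours(day))           # 12.0
print(estimated_cost_usd(day))  # 42.0
```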

Resource Quota Enforcement

# gpu-resource-quota.yaml
# Prevent namespaces from consuming more GPU than allocated
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-serving
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"

Grafana Dashboard Queries

Key panels for a GPU monitoring dashboard:

# GPU Utilisation by Namespace (time series)
avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)

# GPU Memory Used (gauge, per device)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Estimated Daily Cost by Namespace (stat panel)
gpu:namespace:estimated_cost_usd:daily

# GPU Temperature (time series)
DCGM_FI_DEV_GPU_TEMP

# Power Draw (time series, watts)
DCGM_FI_DEV_POWER_USAGE

# Top GPU Consumers (table, sorted by utilisation)
topk(10, avg by (pod, namespace) (DCGM_FI_DEV_GPU_UTIL))

Expected Behaviour

  • DCGM exporter runs on every GPU node and reports metrics to Prometheus every 15 seconds
  • GPU utilisation from non-approved namespaces triggers a critical alert within 5 minutes
  • Per-namespace GPU cost is tracked daily and visible in Grafana
  • Resource quotas prevent any namespace from exceeding its GPU allocation
  • GPU temperature and power anomalies generate warnings for infrastructure review
  • Idle GPUs (under 5% utilisation for over 1 hour) are flagged for reclamation

Trade-offs

  • Decision: DCGM exporter requires SYS_ADMIN. Impact: GPU monitoring needs elevated privileges. Risk: Privileged container on GPU nodes. Mitigation: DCGM exporter is a read-only monitoring agent; restrict it with an AppArmor profile that allows only GPU device reads.
  • Decision: 15-second scrape interval. Impact: Near real-time GPU visibility. Risk: Higher storage requirements for Prometheus. Mitigation: Use recording rules to pre-aggregate; downsample historical data beyond 7 days.
  • Decision: Namespace-based cost allocation. Impact: Simple attribution model. Risk: Multi-tenant namespaces split costs inaccurately. Mitigation: Use pod-level labels for finer-grained attribution; label each workload with team and project.
  • Decision: Static GPU hourly rate in recording rules. Impact: Simple cost estimation. Risk: Does not reflect spot pricing, reserved instances, or MIG partitions. Mitigation: Update the rate constant when pricing changes; for MIG, calculate per-partition cost as a fraction of the full GPU cost.
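
The pod-level attribution mentioned in the mitigations could be sketched as an additional recording rule, reusing the daily GPU-hours pattern with per-pod grouping (the 3.50 rate is the same illustrative constant used earlier):

```yaml
      # Sketch: per-pod daily cost, for finer attribution than namespace
      - record: gpu:pod:estimated_cost_usd:daily
        expr: >
          sum by (namespace, pod) (
            count_over_time((DCGM_FI_DEV_GPU_UTIL > 0)[1d:5m]) / 12
          ) * 3.50
```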

Failure Modes

  • Failure: DCGM exporter not running on a GPU node. Symptom: No GPU metrics for that node; gap in monitoring. Detection: Prometheus target health shows the DCGM exporter down; DaemonSet pod count is less than the GPU node count. Recovery: Check the DaemonSet status, verify the node selector matches GPU node labels, and check NVIDIA driver compatibility.
  • Failure: GPU metrics missing after a node upgrade. Symptom: DCGM exporter fails to start. Detection: Pod crashloops; logs show an NVIDIA driver version mismatch. Recovery: Update the DCGM exporter image to match the new driver version.
  • Failure: False positive on the crypto-mining alert. Symptom: A legitimate high-utilisation workload triggers the alert. Detection: The alert fires for a known training job. Recovery: Add the job label (job_type=training) to legitimate workloads and tune the alert to exclude labelled training jobs.
  • Failure: Resource quota blocks a legitimate workload. Symptom: Pod stuck in Pending with a quota-exceeded message. Detection: kubectl describe pod shows "exceeded quota". Recovery: Review the quota allocation and increase it if justified. If a previous job's GPUs were not released, clean up completed/failed jobs.

When to Consider a Managed Alternative

Grafana Cloud (#108) for managed Prometheus and Grafana with built-in GPU dashboard templates. Eliminates Prometheus storage management and provides long-term metric retention for cost trend analysis. Managed Kubernetes providers with GPU monitoring integrations reduce the DCGM deployment burden.

Premium content pack: GPU monitoring dashboard pack. Pre-built Grafana dashboards (utilisation, cost, security alerts), DCGM exporter DaemonSet manifests, Prometheus recording rules for cost allocation, and alert rules for unauthorised GPU usage.