GPU Workload Isolation: MIG, MPS, and vGPU Security Boundaries

Problem

Multi-tenant GPU sharing without isolation risks data leakage between workloads through shared GPU memory. NVIDIA offers three isolation mechanisms (MIG, MPS, and vGPU) with fundamentally different security properties. Most teams either skip isolation entirely (all workloads share the same GPU memory space) or pick the wrong mechanism for their security requirements.

Critical distinction: MIG provides hardware-level memory isolation. MPS provides no memory isolation at all; it is a performance feature, not a security feature. Choosing MPS for multi-tenant isolation is a security misconfiguration.

Target systems: NVIDIA A100, H100 (MIG). Any NVIDIA GPU (MPS). NVIDIA vGPU-licensed GPUs (vGPU). Kubernetes with NVIDIA device plugin.

Threat Model

  • Adversary: Workload from one tenant running on the same GPU as another tenant’s workload.
  • Objective: Read GPU memory contents from another workload (training data, model weights, inference inputs/outputs). Perform side-channel attacks through shared GPU resources.
  • Blast radius: Without isolation (raw GPU sharing or MPS), complete memory access between co-located workloads. With MIG, hardware-isolated partitions with no cross-partition memory access. With vGPU, hypervisor-level isolation with separate virtual GPU instances.

Configuration

MIG (Multi-Instance GPU) - Hardware Isolation

MIG partitions a single GPU into up to 7 isolated instances, each with dedicated memory, compute, and cache. Available on MIG-capable data-centre GPUs (e.g. A100, A30, H100).

# Enable MIG mode on an A100
sudo nvidia-smi -i 0 -mig 1

# Reboot required after enabling MIG mode
sudo reboot

# Create MIG instances (example: 2 instances on A100-80GB)
# Each instance gets dedicated memory and compute
sudo nvidia-smi mig -cgi 9,9 -i 0
# 9 = MIG profile 3g.40gb (3 compute slices, 40GB memory each)
# Note: two 3g.40gb instances fill the GPU; three (9,9,9) would not
# fit within the A100's 7 compute slices and 80GB of memory.

# Create compute instances within each GPU instance
sudo nvidia-smi mig -cci -i 0
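
It is worth verifying the layout before handing the node to tenants; these listing commands show what was actually created:

```shell
# List the GPU instances and the compute instances inside them
sudo nvidia-smi mig -lgi -i 0
sudo nvidia-smi mig -lci -i 0

# Each MIG device also appears with its own UUID
nvidia-smi -L
```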

Available MIG profiles (A100-80GB):

  Profile   GPU engines   Memory   Use case
  1g.10gb   1             10GB     Small inference, development
  2g.20gb   2             20GB     Medium inference
  3g.40gb   3             40GB     Large inference, fine-tuning
  7g.80gb   7             80GB     Full GPU (no sharing)

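The numeric profile IDs (such as 9 above) are driver-defined; rather than hard-coding them, they can be listed per GPU:

```shell
# Show supported GPU instance profiles, their numeric IDs, and how
# many of each can still be placed on this GPU
sudo nvidia-smi mig -lgip -i 0
```
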
# Kubernetes: request a specific MIG partition
apiVersion: v1
kind: Pod
metadata:
  name: inference-tenant-a
spec:
  containers:
    - name: model
      image: registry.example.com/model-a:v1
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1  # Request one 3g.40gb MIG instance
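
The nvidia.com/mig-3g.40gb resource name is only advertised when the NVIDIA device plugin runs with a MIG strategy; a sketch using the plugin's Helm chart (the nvdp repo alias is just a local name):

```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# "mixed" advertises each profile as its own resource (nvidia.com/mig-<profile>);
# "single" advertises uniform MIG devices as plain nvidia.com/gpu
helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set migStrategy=mixed
```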

Security verification:

# From inside tenant A's container, attempt to see tenant B's GPU memory:
nvidia-smi
# Expected: only the assigned MIG instance is visible.
# Tenant B's memory, compute, and processes are invisible.

# Verify isolation:
nvidia-smi -L
# Shows only the MIG device assigned to this container, not the full GPU.

MPS (Multi-Process Service) - NO Memory Isolation

MPS allows multiple processes to share a GPU with better scheduling (reduced context switching), but provides NO memory isolation.

# MPS is a performance feature, NOT a security feature.
# Any process using MPS can access any other MPS process's GPU memory.
# DO NOT use MPS for multi-tenant isolation.

# MPS is appropriate ONLY when:
# - All workloads belong to the same tenant/owner
# - Different models from the same team share a GPU
# - You need to maximize GPU utilization without security boundaries

When MPS is acceptable: Single-team development environments where all models belong to the same organisation and data sensitivity is low. Never for multi-tenant production.
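
For that single-tenant case, MPS is started via its control daemon; the pipe and log directories below are arbitrary local choices:

```shell
# Start the MPS control daemon (single-tenant node only; this is
# NOT a security boundary - all clients share one GPU address space)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d

# CUDA processes started with the same CUDA_MPS_PIPE_DIRECTORY now
# share the GPU through MPS. To stop the daemon:
echo quit | nvidia-cuda-mps-control
```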

vGPU - Hypervisor-Level Isolation

vGPU provides the strongest isolation by creating virtual GPU instances at the hypervisor level. Each vGPU has its own driver stack, memory space, and compute. Requires NVIDIA vGPU software license.

# vGPU is configured at the hypervisor level (VMware, KVM, Citrix).
# Each VM receives a virtual GPU that appears as a dedicated device.
# Configuration is hypervisor-specific - not Kubernetes-native.
#
# For Kubernetes on VMs with vGPU:
# - Each VM gets a vGPU device
# - The VM runs a Kubernetes node
# - The NVIDIA device plugin exposes the vGPU as a standard GPU resource
# - Pods request nvidia.com/gpu: 1 and receive the vGPU

# Security: strongest isolation (full hypervisor boundary)
# Cost: requires vGPU license ($$$) + hypervisor
# Performance: 5-10% overhead from virtualisation
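
On KVM, vGPU creation goes through the kernel's mediated-device (mdev) interface; a sketch in which the PCI address and the nvidia type name are placeholders for your system:

```shell
# List the vGPU types this physical GPU offers (placeholder PCI address)
ls /sys/bus/pci/devices/0000:3b:00.0/mdev_supported_types/

# Create a vGPU instance of one type; the UUID becomes the mdev device ID
UUID=$(uuidgen)
echo "$UUID" | sudo tee \
  /sys/bus/pci/devices/0000:3b:00.0/mdev_supported_types/nvidia-XYZ/create

# The new mdev device is then attached to a VM definition,
# e.g. a libvirt <hostdev> entry referencing the UUID
```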

Kubernetes GPU Node Pool Configuration

# gpu-node-pool.yaml - dedicated GPU nodes with taints
apiVersion: v1
kind: Node
metadata:
  labels:
    node.kubernetes.io/gpu: "true"
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
    gpu-isolation: "mig"
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
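
Node objects are created by the kubelet, so in practice the labels and taint above are applied with kubectl (the node name is a placeholder):

```shell
kubectl label node gpu-node-01 \
  node.kubernetes.io/gpu=true \
  gpu-isolation=mig
kubectl taint node gpu-node-01 nvidia.com/gpu=true:NoSchedule
```
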

# gpu-workload-deployment.yaml - workload requesting MIG partition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-tenant-a
  namespace: tenant-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-isolation: "mig"
      containers:
        - name: model
          image: registry.example.com/model-a:v1
          resources:
            limits:
              nvidia.com/mig-3g.40gb: 1
          securityContext:
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            seccompProfile:
              type: RuntimeDefault

Monitoring GPU Security

# Prometheus alert: detect GPU workloads without MIG isolation
groups:
  - name: gpu-security
    rules:
      - alert: GPUWorkloadWithoutMIG
        expr: >
          kube_pod_container_resource_limits{resource="nvidia.com/gpu"} > 0
          unless on (namespace, pod)
          kube_pod_container_resource_limits{resource=~"nvidia.com/mig.*"} > 0
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} using raw GPU without MIG isolation"
          description: "In multi-tenant environments, all GPU workloads should use MIG partitions."

      - alert: UnexpectedGPUProcess
        expr: >
          nvidia_gpu_processes_count > count by (gpu) (kube_pod_container_resource_limits{resource=~"nvidia.*"})
        labels:
          severity: critical
        annotations:
          summary: "Unexpected GPU process detected, possible unauthorized GPU usage"
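
These rules assume a GPU metrics exporter is scraping the nodes; NVIDIA's dcgm-exporter is the common choice. Metric names vary by exporter, so adjust the alert expressions to match yours. Installation sketch:

```shell
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring
```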

Expected Behaviour

  • Multi-tenant GPU workloads use MIG partitions with hardware isolation
  • Each tenant’s container sees only its assigned MIG device
  • nvidia-smi from inside a container shows only the MIG partition, not the full GPU
  • No GPU process from tenant A is visible to tenant B
  • GPU security alerts fire for workloads using raw GPU in multi-tenant namespaces
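
The bullets above can be spot-checked from the cluster; the deployment and namespace names follow the earlier examples and should be adjusted to yours:

```shell
# Tenant A should see exactly one MIG device, not the parent GPU
kubectl exec -n tenant-a deploy/inference-tenant-a -- nvidia-smi -L

# Cross-check: the MIG UUIDs reported inside different tenants' pods
# must differ, and neither pod should list the full GPU device entry
```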

Trade-offs

  Mechanism                  Isolation level                        Performance overhead   Hardware requirement                Cost
  MIG                        Hardware (strongest for partitioned)   Minimal (1-2%)         A100, H100 only                     No additional license
  MPS                        None (shared memory)                   Minimal                Any NVIDIA GPU                      No additional license
  vGPU                       Hypervisor (strongest overall)         5-10%                  vGPU-capable NVIDIA GPU + license   $$$$
  Separate nodes per tenant  Physical (guaranteed)                  Zero                   Dedicated GPU per tenant            Most hardware cost

Failure Modes

  • MIG not enabled. Symptom: GPU partitions unavailable; pods fail to schedule. Detection: nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv returns "Disabled". Recovery: enable MIG (nvidia-smi -i 0 -mig 1) and reboot.
  • MIG profile mismatch. Symptom: pod requests a profile that doesn't exist. Detection: pod stuck in Pending; kubectl describe pod shows "Insufficient nvidia.com/mig-3g.40gb". Recovery: create the required MIG profile or change the pod's resource request.
  • MPS used for multi-tenant. Symptom: memory leakage between tenants. Detection: security audit reveals raw GPU sharing without isolation. Recovery: migrate to MIG (requires A100/H100) or separate GPU nodes per tenant.

When to Consider a Managed Alternative

Managed K8s with GPU node pool support: Civo (#22), cloud providers with GPU instances. CoreWeave (#136) and RunPod (#134) for managed GPU cloud with pre-configured isolation. Lambda Labs (#135) for GPU cloud optimised for ML workloads.

Premium content pack: GPU isolation Kubernetes manifests. MIG configuration scripts, NVIDIA device plugin Helm values for MIG, GPU workload deployment templates with security contexts, and Prometheus alert rules for GPU security monitoring.