Network Segmentation for AI Training Infrastructure

Problem

AI training clusters frequently share networks with production services. A training job that can reach the production database is one compromised notebook away from a data breach. The problem is compounded by the unique networking requirements of distributed training: RDMA and InfiniBand for GPU-to-GPU communication operate outside standard TCP/IP network policies, and data pipelines need access to object storage that often contains the organisation’s most sensitive data.

Most teams deploy training workloads into the same Kubernetes cluster as production, relying on namespace separation alone. Namespaces provide no network isolation by default. Without explicit network policies, any pod in any namespace can reach any other pod. A compromised training job can scan the entire cluster network, reach databases, exfiltrate data through any egress path, and pivot to production workloads.

Target systems: Kubernetes clusters with GPU node pools for training. Object storage (S3, GCS, MinIO) for training data. Distributed training using NCCL, Horovod, or DeepSpeed over RDMA/InfiniBand or TCP.

Threat Model

  • Adversary: Attacker with code execution in a training pod. This could be a compromised dependency in the training code, a malicious dataset that exploits a deserialization vulnerability, or an insider with notebook access.
  • Objective: Exfiltrate training data (often the organisation’s most valuable proprietary data), pivot to production services through the shared network, or establish persistent access via the training cluster.
  • Blast radius: Without segmentation, a single compromised training pod has network access to every service in the cluster plus any external endpoint reachable from the node. With proper segmentation, the blast radius is limited to the training namespace and approved data sources.

Configuration

Dedicated Namespace and Node Pool

Isolate training workloads on dedicated nodes with taints that prevent non-training pods from scheduling.

# training-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-training
  labels:
    workload-type: training
    network-isolation: strict
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-quota
  namespace: ai-training
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    limits.nvidia.com/gpu: "16"
    pods: "50"

# gpu-training-node-taint.yaml
# Apply to GPU nodes dedicated to training. In practice, set the taint and
# labels with:
#   kubectl taint nodes <node> workload-type=training:NoSchedule
#   kubectl label nodes <node> node-role=gpu-training
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01  # example node name
  labels:
    node-role: gpu-training
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
spec:
  taints:
    - key: workload-type
      value: training
      effect: NoSchedule

Default-Deny Network Policy

Start by blocking all traffic, then allow only what training needs.

# default-deny-training.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-training
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
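
A CI check can guard against the default-deny policy being removed or narrowed. A minimal sketch, operating on NetworkPolicy objects as plain dicts (e.g. the `items` from `kubectl get networkpolicy -o json`); the function name is illustrative:

```python
def has_default_deny(policies):
    """Return True if any policy selects all pods and denies both directions.

    `policies` is a list of NetworkPolicy objects as plain dicts.
    """
    for policy in policies:
        spec = policy.get("spec", {})
        selects_all = spec.get("podSelector", {}) == {}
        types = set(spec.get("policyTypes", []))
        # A policy with an empty podSelector, both policy types listed, and
        # no ingress/egress rules denies all traffic by default.
        no_rules = not spec.get("ingress") and not spec.get("egress")
        if selects_all and {"Ingress", "Egress"} <= types and no_rules:
            return True
    return False


default_deny = {
    "metadata": {"name": "default-deny-all", "namespace": "ai-training"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}
print(has_default_deny([default_deny]))  # True
```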

Allow Training-Specific Traffic

# allow-training-communication.yaml
# Distributed training pods need to communicate with each other
# (parameter servers, all-reduce, gradient sync)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-training-inter-pod
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      workload-type: distributed-training
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              workload-type: distributed-training
      ports:
        - protocol: TCP
          port: 29500  # PyTorch distributed rendezvous default (MASTER_PORT)
        - protocol: TCP
          port: 29501  # commonly used secondary rendezvous port
        # Note: NCCL's TCP data channels use ephemeral ports. If NCCL runs
        # over TCP rather than RDMA, you may need to allow all TCP between
        # training pods instead of pinning ports.
  egress:
    - to:
        - podSelector:
            matchLabels:
              workload-type: distributed-training
      ports:
        - protocol: TCP
          port: 29500
        - protocol: TCP
          port: 29501
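
The policy applies only to pods whose labels match its selector, and a label mismatch is a common failure mode (see Failure Modes below). A quick sanity check, sketched with the subset semantics Kubernetes uses for `matchLabels`:

```python
def selector_matches(match_labels, pod_labels):
    """matchLabels semantics: every key/value pair in the selector must be
    present in the pod's labels (extra pod labels are fine)."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


selector = {"workload-type": "distributed-training"}
print(selector_matches(selector, {"workload-type": "distributed-training",
                                  "job": "llm-pretrain"}))        # True
print(selector_matches(selector, {"workload-type": "training"}))  # False
```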

Restrict Data Pipeline Egress

Training pods should only reach approved data sources. Block all other egress.

# allow-data-source-egress.yaml
# Allow training pods to reach object storage and DNS only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-data-sources
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      workload-type: distributed-training
  policyTypes:
    - Egress
  egress:
    # DNS resolution
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # MinIO / internal object storage
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: storage
          podSelector:
            matchLabels:
              app: minio
      ports:
        - protocol: TCP
          port: 9000

For external object storage (S3, GCS), use Cilium FQDN-based policies:

# cilium-fqdn-egress.yaml
# Cilium CiliumNetworkPolicy for FQDN-based egress control
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-s3-egress
  namespace: ai-training
spec:
  endpointSelector:
    matchLabels:
      workload-type: distributed-training
  egress:
    # FQDN policies require DNS traffic to pass through Cilium's DNS proxy
    # so the name-to-IP mappings can be learned
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: "my-training-bucket.s3.us-east-1.amazonaws.com"
        - matchName: "my-training-bucket.s3.amazonaws.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP

Securing Training Data Access with IAM

# training-service-account.yaml
# Use IRSA (AWS) or Workload Identity (GCP) for least-privilege access
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job
  namespace: ai-training
  annotations:
    # AWS IRSA
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/training-data-reader"

The IAM role referenced by the annotation should grant read-only access to the training data bucket and explicitly deny all other S3 actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-training-data",
        "arn:aws:s3:::my-training-data/*"
      ]
    },
    {
      "Effect": "Deny",
      "Action": "s3:*",
      "NotResource": [
        "arn:aws:s3:::my-training-data",
        "arn:aws:s3:::my-training-data/*"
      ]
    }
  ]
}
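
The deny statement makes the role fail closed: actions on anything other than the training bucket are denied even if another attached policy allows them. A simplified sketch of that evaluation order (real IAM also handles wildcards, conditions, and resource policies; this models only exact resource matches plus the `s3:*` and `NotResource` forms used above):

```python
def is_allowed(action, resource, statements):
    """Explicit deny wins; otherwise an allow is required (default deny)."""
    def action_matches(pattern):
        return pattern == action or pattern == "s3:*"

    def stmt_applies(stmt):
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if not any(action_matches(a) for a in actions):
            return False
        if "Resource" in stmt:
            return resource in stmt["Resource"]
        # NotResource: the statement applies to everything NOT listed
        return resource not in stmt["NotResource"]

    decisions = [s["Effect"] for s in statements if stmt_applies(s)]
    if "Deny" in decisions:
        return False
    return "Allow" in decisions


policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::my-training-data",
                  "arn:aws:s3:::my-training-data/*"]},
    {"Effect": "Deny", "Action": "s3:*",
     "NotResource": ["arn:aws:s3:::my-training-data",
                     "arn:aws:s3:::my-training-data/*"]},
]
print(is_allowed("s3:GetObject", "arn:aws:s3:::my-training-data/*", policy))  # True
print(is_allowed("s3:PutObject", "arn:aws:s3:::my-training-data/*", policy))  # False
print(is_allowed("s3:GetObject", "arn:aws:s3:::other-bucket/*", policy))      # False
```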

RDMA and InfiniBand Considerations

RDMA traffic bypasses the kernel TCP/IP stack and therefore bypasses Kubernetes network policies entirely. This is a fundamental limitation.

# For clusters using RDMA/InfiniBand for training:
# 1. Dedicate RDMA-capable nodes exclusively to training (physical isolation)
# 2. Use separate InfiniBand subnets for training vs production
# 3. Configure InfiniBand partition keys (pkeys) to isolate traffic

# Node affinity: ensure RDMA training only runs on isolated nodes
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
spec:
  nodeSelector:
    node-role: gpu-training
    rdma-capable: "true"
  tolerations:
    - key: workload-type
      value: training
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/training:v1
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 1
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
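
The pkey audit suggested above can be partially automated by reading the InfiniBand sysfs tree on each node. A sketch, assuming the standard `/sys/class/infiniband/<dev>/ports/<n>/pkeys/<i>` layout; the expected pkey value is an illustrative placeholder:

```python
import os


def read_pkeys(root="/sys/class/infiniband"):
    """Return {(device, port): set_of_pkeys} parsed from the sysfs tree."""
    pkeys = {}
    if not os.path.isdir(root):
        return pkeys
    for dev in os.listdir(root):
        ports_dir = os.path.join(root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in os.listdir(ports_dir):
            pkey_dir = os.path.join(ports_dir, port, "pkeys")
            values = set()
            for idx in os.listdir(pkey_dir):
                with open(os.path.join(pkey_dir, idx)) as f:
                    value = int(f.read().strip(), 16)
                if value:  # unused pkey table slots read as 0x0000
                    values.add(value)
            pkeys[(dev, port)] = values
    return pkeys


# Example audit: flag any port carrying a pkey outside the training partition
EXPECTED_TRAINING_PKEYS = {0x8001}  # illustrative partition key
for (dev, port), values in read_pkeys().items():
    unexpected = values - EXPECTED_TRAINING_PKEYS - {0xFFFF}  # 0xffff = default
    if unexpected:
        print(f"{dev} port {port}: unexpected pkeys {sorted(map(hex, unexpected))}")
```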

Monitoring for Data Exfiltration

# prometheus-training-network-alerts.yaml
groups:
  - name: training-network-security
    rules:
      # Label names assume Hubble drop metrics exported with source and
      # destination context enabled; adjust to your Hubble metrics config.
      - alert: TrainingPodUnexpectedEgress
        expr: >
          sum by (destination) (
            rate(hubble_drop_total{
              source_namespace="ai-training",
              reason="POLICY_DENIED"
            }[5m])
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Training pod attempted blocked egress to {{ $labels.destination }}"
          description: "A pod in ai-training namespace attempted to reach a destination blocked by network policy. Investigate for potential data exfiltration."

      - alert: TrainingDataEgressVolumeSpike
        expr: >
          sum(rate(container_network_transmit_bytes_total{namespace="ai-training"}[10m]))
          > 1.5 * avg_over_time(
            sum(rate(container_network_transmit_bytes_total{namespace="ai-training"}[10m]))[7d:1h]
          )
        labels:
          severity: warning
        annotations:
          summary: "Training namespace egress volume 1.5x above 7-day average"
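
The second alert compares current egress against 1.5x the trailing 7-day average computed by the PromQL subquery. The same comparison, sketched in plain Python for clarity (sample byte rates are made up):

```python
def egress_spike(current_rate, historical_rates, factor=1.5):
    """True if the current egress rate exceeds `factor` times the
    historical average -- the comparison the PromQL rule above makes."""
    baseline = sum(historical_rates) / len(historical_rates)
    return current_rate > factor * baseline


history = [100e6, 120e6, 110e6, 90e6]  # bytes/sec, illustrative
print(egress_spike(300e6, history))  # True  (baseline 105 MB/s, 1.5x = 157.5)
print(egress_spike(140e6, history))  # False
```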

Expected Behaviour

  • Training pods can communicate with each other on designated ports (distributed training)
  • Training pods can reach approved object storage endpoints and nothing else
  • Training pods cannot reach production namespaces, databases, or external services
  • RDMA traffic is physically isolated on dedicated nodes and InfiniBand subnets
  • Network policy violations generate alerts for investigation
  • Service accounts have read-only access to specific training data buckets

Trade-offs

  • Default-deny network policy — Impact: blocks all unexpected traffic. Risk: new training jobs may fail if egress rules are not updated. Mitigation: maintain a documented list of approved data sources; use CI to validate that network policies match training job requirements.
  • FQDN-based egress (Cilium) — Impact: controls egress to specific external endpoints. Risk: requires the Cilium CNI; standard Kubernetes network policies cannot match FQDNs. Mitigation: if not using Cilium, use IP-based egress rules with automation to refresh IP lists for cloud service endpoints.
  • Dedicated RDMA nodes — Impact: physical isolation for traffic that bypasses network policy. Risk: higher cost (dedicated GPU nodes for training only). Mitigation: share nodes across training jobs at the same trust level; never mix training and production on RDMA nodes.
  • Read-only IAM for training — Impact: prevents training jobs from writing to or deleting data. Risk: model checkpoints need a separate write path. Mitigation: create a dedicated checkpoint bucket with write access; keep training data read-only.

Failure Modes

  • Network policy too restrictive — Symptom: distributed training fails (workers cannot reach each other). Detection: training job timeout; NCCL errors in pod logs. Recovery: check that the network policy allows inter-pod communication on training ports; verify pod labels match policy selectors.
  • FQDN policy stale (S3 endpoint IP changed) — Symptom: training cannot download data. Detection: pod logs show connection timeouts to object storage. Recovery: Cilium FQDN policies resolve dynamically; if using IP-based policies, update the IP list.
  • RDMA traffic leaking between trust zones — Symptom: none directly (RDMA bypasses standard monitoring). Detection: periodic audit of InfiniBand subnet membership and pkey configuration. Recovery: reconfigure pkeys; verify node pool isolation.
  • Overly broad egress rule — Symptom: training pods can reach unintended destinations. Detection: network flow monitoring shows connections outside the approved list. Recovery: tighten egress rules; audit active policies with kubectl get networkpolicy -n ai-training -o yaml.

When to Consider a Managed Alternative

Isovalent (#54) Cilium Enterprise for advanced network policy with FQDN-based egress, DNS-aware policies, and network flow visibility. Sysdig (#122) for network monitoring and forensics across training and production namespaces. Managed Kubernetes providers with advanced networking support simplify CNI configuration.

Premium content pack: Network policy pack for AI training cluster isolation. Default-deny policies, distributed training inter-pod rules, FQDN egress for major cloud storage providers, and Prometheus alert rules for training network anomalies.