Securing Fine-Tuning Pipelines: Data Isolation, Checkpoint Integrity, and Access Control

Problem

Fine-tuning pipelines are high-value targets. They consume expensive GPU hours, process proprietary training data, and produce model checkpoints that will eventually serve production traffic. Most teams treat fine-tuning as an offline batch job with minimal security controls: training data stored in shared buckets without access restrictions, checkpoints written to world-readable volumes, and no verification that a checkpoint was produced by a legitimate training run.

An attacker who compromises a fine-tuning pipeline can poison training data to inject backdoors into the model, replace checkpoints with trojanized versions, or exfiltrate proprietary datasets. Because fine-tuning runs are long (hours to days), compromises can persist undetected for extended periods before the tainted model reaches production.

Target systems: Kubernetes-based fine-tuning pipelines using PyTorch, Hugging Face Transformers, or custom training frameworks. Applies to any orchestration layer (Argo Workflows, Kubeflow Pipelines, raw Jobs).

Threat Model

  • Adversary: Insider with cluster access (developer, data scientist) or external attacker who has compromised a CI/CD pipeline or container image.
  • Objective: Data poisoning (inject malicious samples into training data to create backdoored models). Checkpoint tampering (replace a legitimate checkpoint with a modified version containing unwanted behaviors). Training data exfiltration (steal proprietary or sensitive datasets). Compute theft (run unauthorized training jobs on GPU nodes).
  • Blast radius: Poisoned model deployed to production (safety/integrity). Proprietary data leaked (confidentiality). GPU budget exhausted by unauthorized jobs (financial).

Configuration

Training Data Access Controls

Isolate training data with dedicated namespaces and RBAC. Training data should only be accessible to the fine-tuning job itself, not to developers or other workloads.

# training-namespace.yaml - isolated namespace for fine-tuning
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    purpose: fine-tuning
    data-classification: confidential
---
# training-data-rbac.yaml - restrict who can access training data
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-data-reader
  namespace: ml-training
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["training-data-credentials"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    resourceNames: ["training-data-pvc"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-data-access
  namespace: ml-training
subjects:
  - kind: ServiceAccount
    name: fine-tuning-job
    namespace: ml-training
roleRef:
  kind: Role
  name: training-data-reader
  apiGroup: rbac.authorization.k8s.io
---
# Service account for fine-tuning jobs - no default token
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fine-tuning-job
  namespace: ml-training
automountServiceAccountToken: false

Network Isolation for Training Jobs

Training jobs should not have outbound internet access. All data and dependencies must be pre-staged.

# training-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      component: fine-tuning
  policyTypes:
    - Ingress
    - Egress
  ingress: []  # No inbound traffic needed
  egress:
    # Allow DNS resolution
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow access to internal object storage only
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - port: 9000
          protocol: TCP  # MinIO
        - port: 443
          protocol: TCP  # Internal S3-compatible
    # Block all public internet access

Hardened Fine-Tuning Job

# fine-tuning-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llama-v2-run-042
  namespace: ml-training
  labels:
    component: fine-tuning
    model: llama-v2
    run-id: "042"
spec:
  backoffLimit: 2
  activeDeadlineSeconds: 86400  # 24h max runtime
  template:
    metadata:
      labels:
        component: fine-tuning
        model: llama-v2
    spec:
      serviceAccountName: fine-tuning-job
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        # Verify training data integrity before starting
        - name: verify-data
          image: registry.internal/ml-tools:v1.4
          command: ["python", "/scripts/verify_data.py"]
          args:
            - "--manifest=/data/manifest.json"
            - "--checksums=/data/checksums.sha256"
          volumeMounts:
            - name: training-data
              mountPath: /data
              readOnly: true
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
      containers:
        - name: trainer
          image: registry.internal/fine-tuning:v2.1.0
          command: ["python", "train.py"]
          args:
            - "--config=/config/training-config.yaml"
            - "--data-dir=/data"
            - "--output-dir=/checkpoints"
            - "--run-id=042"
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: 4
            limits:
              cpu: "16"
              memory: "64Gi"
              nvidia.com/gpu: 4
          volumeMounts:
            - name: training-data
              mountPath: /data
              readOnly: true  # Training data is read-only
            - name: checkpoints
              mountPath: /checkpoints
            - name: config
              mountPath: /config
              readOnly: true
          env:
            - name: WANDB_DISABLED
              value: "true"  # Disable external telemetry
            - name: HF_HUB_OFFLINE
              value: "1"  # Prevent Hugging Face downloads
      restartPolicy: Never
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
            readOnly: true
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-storage-pvc
        - name: config
          configMap:
            name: training-config
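
The verify-data init container above calls /scripts/verify_data.py, which is not shown. A minimal sketch of what such a script might do, assuming a sha256sum-style checksums file (two spaces between digest and relative path); the function names and file layout are illustrative, not a fixed interface:

```python
# verify_data.py - hypothetical sketch of the init container's data check.
# Assumes checksums.sha256 contains lines of the form "<sha256>  <relative path>",
# matching sha256sum output. Names and layout are assumptions.
import hashlib
import sys
from pathlib import Path


def load_checksums(checksum_file: str) -> dict:
    """Parse a sha256sum-style file into {relative_path: hex_digest}."""
    expected = {}
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        digest, _, rel_path = line.partition("  ")
        expected[rel_path.strip()] = digest.strip()
    return expected


def verify_data_dir(data_dir: str, checksum_file: str) -> list:
    """Return a list of mismatched or missing files; empty means clean."""
    expected = load_checksums(checksum_file)
    root = Path(data_dir)
    failures = []
    for rel_path, want in expected.items():
        f = root / rel_path
        if not f.is_file():
            failures.append(f"MISSING: {rel_path}")
            continue
        h = hashlib.sha256()
        with open(f, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                h.update(chunk)
        if h.hexdigest() != want:
            failures.append(f"HASH MISMATCH: {rel_path}")
    return failures


if __name__ == "__main__":
    problems = verify_data_dir(sys.argv[1], sys.argv[2])
    for p in problems:
        print(p, file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Because the script runs as an init container and exits non-zero on any failure, a tampered or incomplete data volume prevents the trainer container from ever starting.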

Checkpoint Signing and Verification

Sign checkpoints after training completes. Verify signatures before promoting to production.

# sign_checkpoint.py - run as a post-training step
import hashlib
import json
import subprocess
import sys
from pathlib import Path

def compute_checkpoint_manifest(checkpoint_dir: str) -> dict:
    """Generate a manifest of all files with their SHA-256 hashes."""
    manifest = {"files": {}, "metadata": {}}
    checkpoint_path = Path(checkpoint_dir)

    for file_path in sorted(checkpoint_path.rglob("*")):
        if file_path.is_file():
            sha256 = hashlib.sha256()
            with open(file_path, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    sha256.update(chunk)
            relative = str(file_path.relative_to(checkpoint_path))
            manifest["files"][relative] = sha256.hexdigest()

    return manifest


def sign_manifest(manifest: dict, manifest_path: str, key_ref: str):
    """Sign the manifest using cosign with a KMS key."""
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

    result = subprocess.run(
        [
            "cosign", "sign-blob",
            "--key", key_ref,
            "--yes",  # non-interactive: skip the transparency log prompt
            "--output-signature", f"{manifest_path}.sig",
            # --output-certificate applies only to keyless (Fulcio) signing,
            # so it is omitted when signing with a KMS-backed key
            manifest_path,
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Signing failed: {result.stderr}", file=sys.stderr)
        sys.exit(1)

    print(f"Checkpoint signed: {manifest_path}.sig")


if __name__ == "__main__":
    checkpoint_dir = sys.argv[1]
    key_ref = sys.argv[2]  # e.g., "gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/checkpoint-signer"

    manifest = compute_checkpoint_manifest(checkpoint_dir)
    manifest_path = f"{checkpoint_dir}/manifest.json"
    sign_manifest(manifest, manifest_path, key_ref)

# verify_checkpoint.py - run before model promotion
import hashlib
import json
import subprocess
import sys
from pathlib import Path


def verify_signature(manifest_path: str, key_ref: str) -> bool:
    """Verify the cosign signature on the manifest."""
    result = subprocess.run(
        [
            "cosign", "verify-blob",
            "--key", key_ref,
            "--signature", f"{manifest_path}.sig",
            manifest_path,
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def verify_files(manifest_path: str, checkpoint_dir: str) -> bool:
    """Verify all file hashes match the signed manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    checkpoint_path = Path(checkpoint_dir)
    for relative_path, expected_hash in manifest["files"].items():
        file_path = checkpoint_path / relative_path
        if not file_path.exists():
            print(f"MISSING: {relative_path}")
            return False
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        if sha256.hexdigest() != expected_hash:
            print(f"HASH MISMATCH: {relative_path}")
            return False

    return True


if __name__ == "__main__":
    checkpoint_dir = sys.argv[1]
    key_ref = sys.argv[2]
    manifest_path = f"{checkpoint_dir}/manifest.json"

    if not verify_signature(manifest_path, key_ref):
        print("SIGNATURE VERIFICATION FAILED", file=sys.stderr)
        sys.exit(1)

    if not verify_files(manifest_path, checkpoint_dir):
        print("FILE INTEGRITY CHECK FAILED", file=sys.stderr)
        sys.exit(1)

    print("Checkpoint verified successfully")

Secure Model Promotion Workflow

# promotion-pipeline.yaml - Argo Workflow for gated promotion
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: model-promotion
  namespace: ml-training
spec:
  entrypoint: promote
  serviceAccountName: model-promoter
  templates:
    - name: promote
      steps:
        - - name: verify-checkpoint
            template: verify
        - - name: scan-checkpoint
            template: security-scan
        - - name: evaluate-model
            template: evaluate
        - - name: promote-to-registry
            template: push-registry
            when: "{{steps.evaluate-model.outputs.parameters.passed}} == true"

    - name: verify
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["python", "verify_checkpoint.py"]
        args:
          - "/checkpoints/run-042"
          - "gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/checkpoint-signer"

    - name: security-scan
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["python", "scan_model.py"]
        args:
          - "/checkpoints/run-042"
          - "--check-pickle-exploits"
          - "--check-embedded-code"

    - name: evaluate
      container:
        image: registry.internal/ml-eval:v1.2
        command: ["python", "evaluate.py"]
        args:
          - "--checkpoint=/checkpoints/run-042"
          - "--eval-dataset=/data/eval-set"
          - "--min-accuracy=0.85"
          - "--max-toxicity=0.02"

    - name: push-registry
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["python", "push_model.py"]
        args:
          - "--checkpoint=/checkpoints/run-042"
          - "--registry=registry.internal/models"
          - "--tag=llama-v2-ft-042"
          - "--sign=true"

Expected Behaviour

  • Training data volumes are mounted read-only and accessible only to the fine-tuning job’s service account
  • Training jobs cannot reach the public internet; all dependencies are pre-staged
  • Each checkpoint is accompanied by a signed manifest listing SHA-256 hashes for every file
  • Model promotion requires signature verification, security scanning, and evaluation passing minimum thresholds
  • GPU jobs have active deadline limits preventing runaway compute consumption
  • No Hugging Face Hub downloads occur during training (offline mode enforced)
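
The offline and read-only expectations can be smoke-tested from inside the trainer container before training begins. A minimal sketch, assuming it runs as a startup step in the trainer image; the environment variable names match the Job spec above, but the script itself is illustrative:

```python
# check_runtime_guardrails.py - hypothetical startup check run inside the
# trainer container. Variable names match the Job spec; the script itself
# is an illustration, not part of any framework.
import os


def check_env(env: dict) -> list:
    """Return a list of guardrail violations found in the environment."""
    violations = []
    if env.get("HF_HUB_OFFLINE") != "1":
        violations.append("HF_HUB_OFFLINE is not 1; Hub downloads possible")
    if env.get("WANDB_DISABLED", "").lower() != "true":
        violations.append("WANDB_DISABLED is not true; external telemetry possible")
    return violations


def check_read_only(data_dir: str) -> bool:
    """True if the training data mount rejects writes, as it should."""
    probe = os.path.join(data_dir, ".write-probe")
    try:
        with open(probe, "w") as f:
            f.write("probe")
    except OSError:
        return True  # write refused: mount is read-only
    os.remove(probe)  # clean up; the mount is unexpectedly writable
    return False


if __name__ == "__main__":
    problems = check_env(dict(os.environ))
    if not check_read_only("/data"):
        problems.append("/data is writable; expected a read-only mount")
    for p in problems:
        print(f"GUARDRAIL VIOLATION: {p}")
    raise SystemExit(1 if problems else 0)
```

Failing fast here turns a silent misconfiguration (a writable data mount, a missing offline flag) into an immediate job failure instead of a quiet policy gap.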

Trade-offs

  • Read-only training data. Impact: cannot augment data during training. Risk: some workflows generate intermediate data. Mitigation: write intermediate data to a separate ephemeral volume, not the source data volume.
  • No internet access. Impact: cannot download pre-trained weights or libraries during training. Risk: the job fails if a dependency is missing. Mitigation: pre-build container images with all dependencies; stage base model weights in internal storage before the run.
  • Checkpoint signing. Impact: adds 2-5 minutes per checkpoint for hashing and signing. Risk: slows iteration speed for researchers. Mitigation: sign only final checkpoints, not intermediate ones; researchers can skip signing in dev namespaces with separate RBAC.
  • Active deadline on jobs. Impact: long training runs may be killed. Risk: legitimate multi-day training runs get terminated. Mitigation: set the deadline to the expected run time plus a 50% buffer; monitor and extend if needed.
  • Offline Hugging Face mode. Impact: cannot use from_pretrained() with model hub names. Risk: requires manual model staging. Mitigation: download models once to internal storage and reference local paths in training configs.
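
The ephemeral-volume mitigation for read-only training data can be expressed as a fragment added to the fine-tuning Job spec above. The volume name "scratch", the mount path, and the size limit are illustrative:

```yaml
# Fragment for the fine-tuning Job spec: an ephemeral scratch volume for
# intermediate or augmented data, keeping the source data volume read-only.
# The name "scratch", /scratch, and 100Gi are assumptions; size to your jobs.
spec:
  template:
    spec:
      containers:
        - name: trainer
          volumeMounts:
            - name: scratch
              mountPath: /scratch  # intermediate data lands here, not in /data
      volumes:
        - name: scratch
          emptyDir:
            sizeLimit: 100Gi  # caps scratch usage; discarded when the pod ends
```

Because emptyDir volumes are deleted with the pod, nothing written here can leak into the next run or contaminate the source dataset.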

Failure Modes

  • Training data poisoned before ingestion. Symptom: model exhibits unexpected behavior on specific inputs. Detection: evaluation pipeline catches accuracy drops or toxicity spikes; manual red-teaming finds backdoors. Recovery: quarantine the training data, trace its provenance, and retrain from a known-good dataset.
  • Checkpoint replaced after signing. Symptom: signature verification fails during promotion. Detection: the promotion pipeline blocks on the verify step and an alert fires. Recovery: investigate who had write access to the checkpoint volume; re-run training from a known-good state.
  • GPU quota exhausted by unauthorized job. Symptom: legitimate training jobs stuck in Pending. Detection: ResourceQuota alerts; unexpected pods in the training namespace. Recovery: remove unauthorized workloads, apply ResourceQuota limits, and audit RBAC for who can create Jobs.
  • Container image tampered. Symptom: training job runs malicious code. Detection: image signature verification (cosign) fails; Trivy detects new vulnerabilities. Recovery: block unsigned images via an admission controller and rebuild from a trusted base.
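
The ResourceQuota mentioned in the GPU-exhaustion recovery can be sketched for the ml-training namespace; the specific limits below are illustrative and should be sized to your GPU pool:

```yaml
# gpu-quota.yaml - cap GPU requests and Job counts in the training namespace
# so a single unauthorized or runaway job cannot exhaust the GPU pool.
# The limits shown (8 GPUs, 4 Jobs) are assumptions, not recommendations.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # at most 8 GPUs requested concurrently
    limits.nvidia.com/gpu: "8"
    count/jobs.batch: "4"         # at most 4 training Jobs at a time
```

With the quota in place, an unauthorized job that would push the namespace over these limits is rejected at admission rather than queuing legitimate work behind it.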

When to Consider a Managed Alternative

If operating these controls in-house costs more than the pipeline warrants, managed fine-tuning platforms handle data isolation, checkpoint management, and access control out of the box.

  • Modal (#132): Serverless GPU fine-tuning with built-in secrets management and network isolation.
  • Replicate (#133): Managed fine-tuning with automatic checkpoint storage and versioning.
  • Baseten (#140): Fine-tuning with Truss framework. Built-in model registry.
  • Snyk (#48): Scan training container images for vulnerabilities before running on GPU nodes.
  • Cosign (#150): Keyless or KMS-backed signing for checkpoint integrity verification.

Premium content pack: Complete Argo Workflow templates for secure fine-tuning pipelines with checkpoint signing, data verification init containers, and promotion gate configurations.