Securing Fine-Tuning Pipelines: Data Isolation, Checkpoint Integrity, and Access Control
Problem
Fine-tuning pipelines are high-value targets. They consume expensive GPU hours, process proprietary training data, and produce model checkpoints that will eventually serve production traffic. Most teams treat fine-tuning as an offline batch job with minimal security controls: training data stored in shared buckets without access restrictions, checkpoints written to world-readable volumes, and no verification that a checkpoint was produced by a legitimate training run.
An attacker who compromises a fine-tuning pipeline can poison training data to inject backdoors into the model, replace checkpoints with trojanized versions, or exfiltrate proprietary datasets. Because fine-tuning runs are long (hours to days), compromises can persist undetected for extended periods before the tainted model reaches production.
Target systems: Kubernetes-based fine-tuning pipelines using PyTorch, Hugging Face Transformers, or custom training frameworks. Applies to any orchestration layer (Argo Workflows, Kubeflow Pipelines, raw Jobs).
Threat Model
- Adversary: Insider with cluster access (developer, data scientist) or external attacker who has compromised a CI/CD pipeline or container image.
- Objective: Data poisoning (inject malicious samples into training data to create backdoored models). Checkpoint tampering (replace a legitimate checkpoint with a modified version containing unwanted behaviors). Training data exfiltration (steal proprietary or sensitive datasets). Compute theft (run unauthorized training jobs on GPU nodes).
- Blast radius: Poisoned model deployed to production (safety/integrity). Proprietary data leaked (confidentiality). GPU budget exhausted by unauthorized jobs (financial).
Configuration
Training Data Access Controls
Isolate training data with dedicated namespaces and RBAC. Training data should only be accessible to the fine-tuning job itself, not to developers or other workloads.
```yaml
# training-namespace.yaml - isolated namespace for fine-tuning
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    purpose: fine-tuning
    data-classification: confidential
---
# training-data-rbac.yaml - restrict who can access training data
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-data-reader
  namespace: ml-training
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["training-data-credentials"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  resourceNames: ["training-data-pvc"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-data-access
  namespace: ml-training
subjects:
- kind: ServiceAccount
  name: fine-tuning-job
  namespace: ml-training
roleRef:
  kind: Role
  name: training-data-reader
  apiGroup: rbac.authorization.k8s.io
---
# Service account for fine-tuning jobs - no default token
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fine-tuning-job
  namespace: ml-training
automountServiceAccountToken: false
```
Network Isolation for Training Jobs
Training jobs should not have outbound internet access. All data and dependencies must be pre-staged.
```yaml
# training-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      component: fine-tuning
  policyTypes:
  - Ingress
  - Egress
  ingress: []  # No inbound traffic needed
  egress:
  # Allow DNS resolution
  - to:
    - namespaceSelector: {}
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
  # Allow access to internal object storage only
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8
    ports:
    - port: 9000
      protocol: TCP  # MinIO
    - port: 443
      protocol: TCP  # Internal S3-compatible
  # All other egress, including the public internet, is denied by default
```
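Whether the policy actually holds is worth checking before GPUs start burning. A pre-flight probe run from an init container (or a one-off debug pod) can assert that internal object storage is reachable and the public internet is not. A minimal sketch — the `minio.ml-training.svc` service name is an assumption:

```python
# egress_check.py - hypothetical pre-flight probe (service name assumed).
# Run inside a pod matching the NetworkPolicy's podSelector: internal
# object storage must answer, the public internet must not.
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_preflight() -> None:
    """Fail the pod early if egress is not restricted as expected."""
    storage_ok = can_connect("minio.ml-training.svc", 9000)  # assumed MinIO service
    internet_leaked = can_connect("1.1.1.1", 443)            # must be blocked
    if not storage_ok or internet_leaked:
        sys.exit("egress check failed: NetworkPolicy not enforced as expected")

# In the init container entrypoint: run_preflight()
```

Failing closed here means a misapplied NetworkPolicy aborts the run instead of silently training with internet access.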
Hardened Fine-Tuning Job
```yaml
# fine-tuning-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llama-v2-run-042
  namespace: ml-training
  labels:
    component: fine-tuning
    model: llama-v2
    run-id: "042"
spec:
  backoffLimit: 2
  activeDeadlineSeconds: 86400  # 24h max runtime
  template:
    metadata:
      labels:
        component: fine-tuning
        model: llama-v2
    spec:
      serviceAccountName: fine-tuning-job
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
      # Verify training data integrity before starting
      - name: verify-data
        image: registry.internal/ml-tools:v1.4
        command: ["python", "/scripts/verify_data.py"]
        args:
        - "--manifest=/data/manifest.json"
        - "--checksums=/data/checksums.sha256"
        volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      containers:
      - name: trainer
        image: registry.internal/fine-tuning:v2.1.0
        command: ["python", "train.py"]
        args:
        - "--config=/config/training-config.yaml"
        - "--data-dir=/data"
        - "--output-dir=/checkpoints"
        - "--run-id=042"
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false
          capabilities:
            drop: ["ALL"]
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            nvidia.com/gpu: 4
          limits:
            cpu: "16"
            memory: "64Gi"
            nvidia.com/gpu: 4
        volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true  # Training data is read-only
        - name: checkpoints
          mountPath: /checkpoints
        - name: config
          mountPath: /config
          readOnly: true
        env:
        - name: WANDB_DISABLED
          value: "true"  # Disable external telemetry
        - name: HF_HUB_OFFLINE
          value: "1"  # Prevent Hugging Face downloads
      restartPolicy: Never
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-data-pvc
          readOnly: true
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoint-storage-pvc
      - name: config
        configMap:
          name: training-config
```
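Because the job has no internet egress, base model weights and eval data must be staged into internal storage before the run. A hedged sketch of a staging helper using boto3 against an in-cluster MinIO endpoint — the bucket name, prefix, and service URL are assumptions, not part of the pipeline above:

```python
# stage_model.py - hypothetical pre-staging helper (bucket/endpoint assumed).
# Run from a machine that DOES have internet access, before the training job.
from pathlib import Path


def build_upload_plan(local_dir: str, prefix: str) -> list[tuple[str, str]]:
    """Map every file under local_dir to an object key under prefix."""
    root = Path(local_dir)
    return [
        (str(p), f"{prefix}/{p.relative_to(root).as_posix()}")
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]


def stage(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload model weights to internal S3-compatible storage."""
    import boto3  # assumed available in the staging environment
    s3 = boto3.client("s3", endpoint_url="http://minio.ml-training.svc:9000")
    for local_path, key in build_upload_plan(local_dir, prefix):
        s3.upload_file(local_path, bucket, key)

# Example: stage("./llama-v2-base", "training-assets", "base-models/llama-v2")
```

Training configs then reference the internal object path instead of a Hugging Face Hub model name.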
Checkpoint Signing and Verification
Sign checkpoints after training completes. Verify signatures before promoting to production.
```python
# sign_checkpoint.py - run as a post-training step
import hashlib
import json
import subprocess
import sys
from pathlib import Path


def compute_checkpoint_manifest(checkpoint_dir: str) -> dict:
    """Generate a manifest of all files with their SHA-256 hashes."""
    manifest = {"files": {}, "metadata": {}}
    checkpoint_path = Path(checkpoint_dir)
    for file_path in sorted(checkpoint_path.rglob("*")):
        if file_path.is_file():
            sha256 = hashlib.sha256()
            with open(file_path, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    sha256.update(chunk)
            relative = str(file_path.relative_to(checkpoint_path))
            manifest["files"][relative] = sha256.hexdigest()
    return manifest


def sign_manifest(manifest: dict, manifest_path: str, key_ref: str):
    """Sign the manifest using cosign with a KMS key."""
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    result = subprocess.run(
        [
            "cosign", "sign-blob",
            "--key", key_ref,
            "--output-signature", f"{manifest_path}.sig",
            "--output-certificate", f"{manifest_path}.cert",
            manifest_path,
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Signing failed: {result.stderr}", file=sys.stderr)
        sys.exit(1)
    print(f"Checkpoint signed: {manifest_path}.sig")


if __name__ == "__main__":
    checkpoint_dir = sys.argv[1]
    key_ref = sys.argv[2]  # e.g., "gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/checkpoint-signer"
    manifest = compute_checkpoint_manifest(checkpoint_dir)
    manifest_path = f"{checkpoint_dir}/manifest.json"
    sign_manifest(manifest, manifest_path, key_ref)
```
```python
# verify_checkpoint.py - run before model promotion
import hashlib
import json
import subprocess
import sys
from pathlib import Path

# Signing artifacts live alongside the checkpoint but are not listed in the manifest
SIGNING_ARTIFACTS = {"manifest.json", "manifest.json.sig", "manifest.json.cert"}


def verify_signature(manifest_path: str, key_ref: str) -> bool:
    """Verify the cosign signature on the manifest."""
    result = subprocess.run(
        [
            "cosign", "verify-blob",
            "--key", key_ref,
            "--signature", f"{manifest_path}.sig",
            manifest_path,
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def verify_files(manifest_path: str, checkpoint_dir: str) -> bool:
    """Verify all file hashes match the signed manifest and no extra files exist."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    checkpoint_path = Path(checkpoint_dir)
    for relative_path, expected_hash in manifest["files"].items():
        file_path = checkpoint_path / relative_path
        if not file_path.exists():
            print(f"MISSING: {relative_path}")
            return False
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        if sha256.hexdigest() != expected_hash:
            print(f"HASH MISMATCH: {relative_path}")
            return False
    # Also reject files added after signing (e.g., a planted pickle payload)
    for file_path in checkpoint_path.rglob("*"):
        if file_path.is_file():
            relative = str(file_path.relative_to(checkpoint_path))
            if relative not in manifest["files"] and relative not in SIGNING_ARTIFACTS:
                print(f"UNEXPECTED FILE: {relative}")
                return False
    return True


if __name__ == "__main__":
    checkpoint_dir = sys.argv[1]
    key_ref = sys.argv[2]
    manifest_path = f"{checkpoint_dir}/manifest.json"
    if not verify_signature(manifest_path, key_ref):
        print("SIGNATURE VERIFICATION FAILED", file=sys.stderr)
        sys.exit(1)
    if not verify_files(manifest_path, checkpoint_dir):
        print("FILE INTEGRITY CHECK FAILED", file=sys.stderr)
        sys.exit(1)
    print("Checkpoint verified successfully")
```
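The hashing scheme can be sanity-checked end to end without cosign or KMS access: build a manifest, tamper with one file, and confirm the mismatch is caught. A self-contained sketch reusing the same chunked SHA-256 logic as the scripts above:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Same chunked SHA-256 the sign/verify scripts use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def tampered_files(manifest: dict, ckpt: Path) -> list:
    """Return manifest entries whose on-disk hash no longer matches."""
    return [name for name, digest in manifest.items()
            if sha256_file(ckpt / name) != digest]


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d)
    (ckpt / "model.safetensors").write_bytes(b"original weights")
    (ckpt / "tokenizer.json").write_text("{}")
    manifest = {p.name: sha256_file(p) for p in sorted(ckpt.iterdir())}
    (ckpt / "model.safetensors").write_bytes(b"swapped weights")  # simulate tampering
    print(tampered_files(manifest, ckpt))  # ['model.safetensors']
```

Running this kind of round trip in CI catches regressions in the manifest format before they reach the promotion gate.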
Secure Model Promotion Workflow
```yaml
# promotion-pipeline.yaml - Argo Workflow for gated promotion
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: model-promotion
  namespace: ml-training
spec:
  entrypoint: promote
  serviceAccountName: model-promoter
  templates:
  - name: promote
    steps:
    - - name: verify-checkpoint
        template: verify
    - - name: scan-checkpoint
        template: security-scan
    - - name: evaluate-model
        template: evaluate
    - - name: promote-to-registry
        template: push-registry
        when: "{{steps.evaluate-model.outputs.parameters.passed}} == true"
  - name: verify
    container:
      image: registry.internal/ml-tools:v1.4
      command: ["python", "verify_checkpoint.py"]
      args:
      - "/checkpoints/run-042"
      - "gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/checkpoint-signer"
  - name: security-scan
    container:
      image: registry.internal/ml-tools:v1.4
      command: ["python", "scan_model.py"]
      args:
      - "/checkpoints/run-042"
      - "--check-pickle-exploits"
      - "--check-embedded-code"
  - name: evaluate
    container:
      image: registry.internal/ml-eval:v1.2
      command: ["python", "evaluate.py"]
      args:
      - "--checkpoint=/checkpoints/run-042"
      - "--eval-dataset=/data/eval-set"
      - "--min-accuracy=0.85"
      - "--max-toxicity=0.02"
  - name: push-registry
    container:
      image: registry.internal/ml-tools:v1.4
      command: ["python", "push_model.py"]
      args:
      - "--checkpoint=/checkpoints/run-042"
      - "--registry=registry.internal/models"
      - "--tag=llama-v2-ft-042"
      - "--sign=true"
```
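The scan step calls an internal `scan_model.py`, which is not shown here. What a `--check-pickle-exploits` pass might look like is sketched below using the stdlib `pickletools` — the opcode list is our assumption, and legitimate PyTorch `.bin` pickles also use `GLOBAL`, so a real scanner would allowlist known classes or simply require safetensors:

```python
# pickle_scan.py - hedged stand-in for an internal --check-pickle-exploits pass.
# Disassembles pickle streams WITHOUT unpickling them, so nothing executes.
import pickletools

# Opcodes that let a pickle import modules or call arbitrary callables
DANGEROUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ"}


def suspicious_opcodes(data: bytes) -> list:
    """Return the dangerous opcodes present in a pickle stream."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)
            if op.name in DANGEROUS_OPCODES]
```

A checkpoint whose pickles contain `REDUCE` plus a `GLOBAL`/`STACK_GLOBAL` resolving to something like `os.system` or `builtins.eval` is an immediate block in the promotion gate.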
Expected Behavior
- Training data volumes are mounted read-only and accessible only to the fine-tuning job’s service account
- Training jobs cannot reach the public internet; all dependencies are pre-staged
- Each checkpoint is accompanied by a signed manifest listing SHA-256 hashes for every file
- Model promotion requires signature verification, security scanning, and evaluation passing minimum thresholds
- GPU jobs have active deadline limits preventing runaway compute consumption
- No Hugging Face Hub downloads occur during training (offline mode enforced)
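The offline expectation can be enforced in code as well as in the Job spec. A small guard at the top of `train.py` — the env var names `HF_HUB_OFFLINE` and `WANDB_DISABLED` are the real Hugging Face and Weights & Biases variables set in the Job above; the guard itself is our addition:

```python
import os

# Values the hardened Job spec sets. TRANSFORMERS_OFFLINE="1" is a related
# real Hugging Face variable worth adding to the Job as well.
REQUIRED_ENV = {"HF_HUB_OFFLINE": "1", "WANDB_DISABLED": "true"}


def assert_offline(env=os.environ) -> None:
    """Refuse to start training unless downloads/telemetry are disabled."""
    wrong = {k: env.get(k) for k, v in REQUIRED_ENV.items() if env.get(k) != v}
    if wrong:
        raise RuntimeError(f"offline mode not enforced, found: {wrong}")

# Call before importing transformers or wandb:
# assert_offline()
```

Failing fast here means a misconfigured Job aborts immediately instead of quietly phoning home for hours of GPU time.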
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Read-only training data | Cannot augment data during training | Some workflows generate intermediate data | Write intermediate data to a separate ephemeral volume, not the source data volume |
| No internet access | Cannot download pre-trained weights or libraries during training | Job fails if a dependency is missing | Pre-build container images with all dependencies. Stage base model weights in internal storage before the run. |
| Checkpoint signing | Adds 2-5 minutes per checkpoint for hashing and signing | Slows iteration speed for researchers | Sign only final checkpoints, not intermediate ones. Researchers can skip signing in dev namespaces with separate RBAC. |
| Active deadline on jobs | Long training runs may be killed | Legitimate multi-day training runs get terminated | Set deadline based on expected run time plus 50% buffer. Monitor and extend if needed. |
| Offline Hugging Face mode | Cannot use from_pretrained() with model hub names | Requires manual model staging | Download models once to internal storage. Reference local paths in training configs. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Training data poisoned before ingestion | Model exhibits unexpected behavior on specific inputs | Evaluation pipeline catches accuracy drops or toxicity spikes; manual red-teaming detects backdoors | Quarantine the training data. Trace data provenance. Retrain from a known-good dataset. |
| Checkpoint replaced after signing | Signature verification fails during promotion | Promotion pipeline blocks on verify step; alert fires | Investigate who had write access to the checkpoint volume. Re-run training from known-good state. |
| GPU quota exhausted by unauthorized job | Legitimate training jobs stuck in Pending | ResourceQuota alerts; unexpected pods in training namespace | Remove unauthorized workloads. Apply ResourceQuota limits. Audit RBAC for who can create Jobs. |
| Container image tampered | Training job runs malicious code | Image signature verification (cosign) fails; Trivy detects new vulnerabilities | Block unsigned images via admission controller. Rebuild from trusted base. |
When to Consider a Managed Alternative
Managed fine-tuning platforms can take over data isolation, checkpoint management, and access control; supply-chain tooling covers the scanning and signing steps.
- Modal (#132): Serverless GPU fine-tuning with built-in secrets management and network isolation.
- Replicate (#133): Managed fine-tuning with automatic checkpoint storage and versioning.
- Baseten (#140): Fine-tuning with Truss framework. Built-in model registry.
- Snyk (#48): Scan training container images for vulnerabilities before running on GPU nodes.
- Cosign (#150): Keyless or KMS-backed signing for checkpoint integrity verification.
Premium content pack: Complete Argo Workflow templates for secure fine-tuning pipelines with checkpoint signing, data verification init containers, and promotion gate configurations.