Securing Model Artifact Pipelines: From Training to Serving
Problem
Model files are opaque binaries ranging from 1GB to over 1TB. You cannot code-review a set of weights. An attacker who tampers with model weights between training and serving controls the model’s behaviour without touching a single line of application code. A poisoned model passes every integration test because the tests do not verify the weights themselves.
Most teams store model artifacts in object storage (S3, GCS, MinIO) with no integrity verification. The model file downloaded for serving could have been modified at rest, in transit, or through a compromised training pipeline. There is no standard equivalent to package signing for model files. If someone replaces your model binary with a backdoored version, you have no mechanism to detect the swap before the model starts serving requests.
Target systems: Kubernetes clusters running model inference workloads. OCI-compatible registries for model storage. CI/CD pipelines producing and deploying model artifacts. MLflow, DVC, or custom model registries.
Threat Model
- Adversary: Insider with write access to model storage, or external attacker who has compromised the training pipeline or object storage credentials.
- Objective: Replace or modify model weights to alter model behaviour. Targets include injecting backdoor triggers (specific inputs produce attacker-chosen outputs), degrading model accuracy to cause business harm, or embedding data exfiltration channels in model outputs.
- Blast radius: A tampered model serves incorrect or malicious responses to every request. In safety-critical applications (medical, financial, autonomous systems), the consequences are not limited to data loss. Without integrity verification, the compromise persists until someone notices degraded outputs, which could be weeks.
Configuration
Store Models as OCI Artifacts
OCI registries provide content-addressable storage, versioning, and access control. Storing models as OCI artifacts gives you the same integrity guarantees that container images receive.
# Push a model to an OCI registry using ORAS (OCI Registry As Storage)
# Install ORAS CLI: https://oras.land/docs/installation
oras push registry.example.com/models/fraud-detector:v2.3 \
--artifact-type application/vnd.ml.model \
./model.safetensors:application/octet-stream \
./model_card.json:application/json \
./training_metadata.json:application/json
# The registry stores a content-addressable manifest.
# Any modification to the model file changes the digest.
# Pull and verify the digest matches what training produced
oras pull registry.example.com/models/fraud-detector:v2.3 \
--output ./serving/
# Verify SHA-256 matches the training pipeline output
sha256sum ./serving/model.safetensors
# Compare against the digest recorded during training
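The comparison step above can be scripted so a mismatch fails the deployment job instead of relying on a human reading hashes. A minimal sketch as a reusable shell function; the digest-file convention (bare hex written out by the training job) is an assumption of this example, not an ORAS feature:

```shell
# verify_digest MODEL_FILE DIGEST_FILE
# Recomputes SHA-256 of MODEL_FILE and compares it against the digest
# recorded by the training pipeline; returns non-zero on mismatch.
verify_digest() {
  expected="$(tr -d '[:space:]' < "$2")"
  actual="$(sha256sum "$1" | awk '{print $1}')"
  if [ "$actual" != "$expected" ]; then
    echo "digest mismatch: expected $expected got $actual" >&2
    return 1
  fi
  echo "digest ok: $actual"
}
```

Wiring it into the deployment job is one line: `verify_digest ./serving/model.safetensors ./serving/model.digest || exit 1`.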
Sign Models with Cosign
Cosign provides cryptographic signing for OCI artifacts. Signing the model after training and verifying the signature before serving creates a chain of trust from training to production.
# Generate a cosign key pair (do this once, store the private key in Vault)
cosign generate-key-pair
# Sign the model artifact after a successful training run
cosign sign --key cosign.key \
--annotations "training_run_id=run-20260422-001" \
--annotations "training_commit=$(git rev-parse HEAD)" \
--annotations "framework_version=pytorch-2.5.1" \
registry.example.com/models/fraud-detector:v2.3
# Verify the signature before serving
cosign verify --key cosign.pub \
registry.example.com/models/fraud-detector:v2.3
For keyless signing with Sigstore (eliminates key management):
# Keyless signing using Sigstore's Fulcio CA and Rekor transparency log
# Requires OIDC identity (GitHub Actions, GitLab CI, or workload identity)
cosign sign --yes \
--annotations "training_run_id=run-20260422-001" \
registry.example.com/models/fraud-detector:v2.3
# Verify against the OIDC identity that performed the signing
cosign verify \
--certificate-identity "https://github.com/myorg/training-pipeline/.github/workflows/train.yml@refs/heads/main" \
--certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
registry.example.com/models/fraud-detector:v2.3
Admission Control: Block Unsigned Models
Use Kyverno to prevent unsigned model artifacts from being deployed to serving infrastructure.
# kyverno-policy-model-signature.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-signed-model
spec:
validationFailureAction: Enforce
background: false
rules:
- name: check-model-signature
match:
any:
- resources:
kinds:
- Pod
namespaceSelector:
matchLabels:
workload-type: model-serving
verifyImages:
- imageReferences:
- "registry.example.com/models/*"
attestors:
- entries:
- keyless:
issuer: "https://token.actions.githubusercontent.com"
subject: "https://github.com/myorg/training-pipeline/*"
count: 1
required: true
Provenance Tracking
Record the complete chain from training data to serving deployment.
{
"model_id": "fraud-detector-v2.3",
"training": {
"run_id": "run-20260422-001",
"commit": "a1b2c3d4e5f6",
"dataset_digest": "sha256:9f86d08...",
"framework": "pytorch==2.5.1",
"base_image": "nvcr.io/nvidia/pytorch:24.03-py3",
"started_at": "2026-04-21T02:00:00Z",
"completed_at": "2026-04-21T18:00:00Z",
"gpu_type": "A100-80GB",
"gpu_count": 8
},
"artifact": {
"registry": "registry.example.com/models/fraud-detector:v2.3",
"digest": "sha256:abc123...",
"size_bytes": 2147483648,
"format": "safetensors",
"signed_by": "training-pipeline@github-actions",
"signature_digest": "sha256:def456..."
},
"serving": {
"deployment": "fraud-detector-prod",
"namespace": "ml-serving",
"deployed_at": "2026-04-22T10:00:00Z",
"deployed_by": "argocd"
}
}
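A CI step can assemble and sanity-check a record like the one above before attaching it to the artifact, so incomplete provenance never reaches the registry. A minimal sketch in Python; the field names mirror the JSON above, but the `REQUIRED` policy (which fields are mandatory) is an assumption you would tune to your pipeline:

```python
# Illustrative policy: top-level sections and the fields each must carry.
REQUIRED = {
    "training": {"run_id", "commit", "dataset_digest"},
    "artifact": {"registry", "digest", "signed_by"},
    "serving": {"deployment", "namespace", "deployed_at"},
}

def validate_provenance(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if not record.get("model_id"):
        problems.append("missing model_id")
    for section, fields in REQUIRED.items():
        body = record.get(section, {})
        # set(body) is the set of keys present in this section
        for field in sorted(fields - set(body)):
            problems.append(f"{section}.{field} missing")
    digest = record.get("artifact", {}).get("digest", "")
    if digest and not digest.startswith("sha256:"):
        problems.append("artifact.digest must be a sha256: reference")
    return problems
```

Failing the CI job when `validate_provenance` returns a non-empty list keeps the provenance store trustworthy enough to audit against later.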
SHA-256 Verification at Model Load Time
Add a verification step to your model serving container that checks the model digest before loading weights.
# verify_model.py - run before model.load_state_dict()
import hashlib
import os
import sys
def verify_model_integrity(model_path: str, expected_digest: str) -> bool:
    """Verify the model file's SHA-256 matches the expected digest."""
    # Accept digests recorded as "sha256:<hex>" (OCI style) or as bare hex.
    expected = expected_digest.lower().removeprefix("sha256:")
    sha256 = hashlib.sha256()
    with open(model_path, "rb") as f:
        # 1 MiB chunks keep memory flat while hashing multi-GB files
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)
    actual = sha256.hexdigest()
    if actual != expected:
        print(f"INTEGRITY CHECK FAILED: expected {expected}, got {actual}")
        return False
    print(f"Model integrity verified: {actual}")
    return True
if __name__ == "__main__":
model_path = os.environ.get("MODEL_PATH", "/models/model.safetensors")
expected = os.environ.get("MODEL_DIGEST")
if not expected:
print("MODEL_DIGEST environment variable not set. Refusing to load.")
sys.exit(1)
if not verify_model_integrity(model_path, expected):
sys.exit(1)
# model-serving-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: fraud-detector
namespace: ml-serving
spec:
replicas: 3
selector:
matchLabels:
app: fraud-detector
template:
metadata:
labels:
app: fraud-detector
spec:
initContainers:
- name: verify-model
image: registry.example.com/model-verifier:v1
command: ["python", "verify_model.py"]
env:
- name: MODEL_PATH
value: "/models/model.safetensors"
- name: MODEL_DIGEST
valueFrom:
configMapKeyRef:
name: fraud-detector-config
key: model-digest
volumeMounts:
- name: model-volume
mountPath: /models
readOnly: true
containers:
- name: inference
image: registry.example.com/fraud-detector-serving:v2.3
resources:
limits:
nvidia.com/gpu: 1
securityContext:
runAsNonRoot: true
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: model-volume
mountPath: /models
readOnly: true
volumes:
- name: model-volume
persistentVolumeClaim:
claimName: fraud-detector-model
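The `fraud-detector-config` ConfigMap the init container reads must be refreshed whenever a new model is promoted. One way to do that, sketched here under the assumption that the update runs in the same CI job that signs the artifact and that `kubectl` is pointed at the serving cluster:

```shell
# Record the digest produced by the training pipeline in the ConfigMap
# the init container reads; the dry-run/apply pattern makes this idempotent.
DIGEST="$(sha256sum ./model.safetensors | awk '{print $1}')"
kubectl create configmap fraud-detector-config \
  --namespace ml-serving \
  --from-literal=model-digest="$DIGEST" \
  --dry-run=client -o yaml | kubectl apply -f -
```

Keeping the digest in a ConfigMap (rather than baked into the image) means promoting a new model is a config change that GitOps tooling like Argo CD can track and roll back.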
Expected Behaviour
- Every model artifact in the OCI registry has a cosign signature tied to the training pipeline identity
- Kyverno blocks any unsigned model from being deployed to serving namespaces
- The init container verifies the SHA-256 digest before the inference container starts
- If the digest does not match, the pod fails to start and the deployment stalls at the init container
- Provenance metadata links every serving deployment back to a specific training run, commit, and dataset version
Trade-offs
| Decision | Impact | Risk | Mitigation |
|---|---|---|---|
| OCI registry for models | Leverages existing container registry infrastructure; content-addressable storage | Large model files (100GB+) strain registry storage and network bandwidth | Use a registry with chunked upload support. Consider a dedicated model registry for artifacts over 50GB. |
| Cosign signing in CI | Adds 30-60 seconds to pipeline for signing | Private key compromise allows signing malicious models | Use keyless signing with Sigstore (OIDC identity, no long-lived keys). Rotate keys if using key-pair signing. |
| SHA-256 init container | Catches any modification between registry and serving | Adds 1-5 minutes to pod startup for large models (hashing 100GB+ at disk speed) | Acceptable for production deployments. Skip for development environments if needed. |
| Kyverno admission control | Hard block on unsigned models | Kyverno outage blocks all model deployments | Run Kyverno in HA mode (3 replicas). Configure failure policy to Fail (not Ignore). |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Model tampered in storage | Init container digest verification fails; pod stuck in Init | Pod events show “INTEGRITY CHECK FAILED” message; deployment rollout stalled | Investigate storage access logs. Re-pull model from training pipeline. Rotate storage credentials. |
| Cosign signature missing | Kyverno blocks pod creation | kubectl describe pod shows Kyverno admission denial | Re-sign the model artifact from the training pipeline. Verify cosign key or OIDC identity configuration. |
| Registry unavailable during model pull | Pod stuck in ImagePullBackOff or init container cannot download | Pod events show pull errors | Use registry mirror or cache. For critical models, pre-pull to node-local storage. |
| Signing key compromised | Attacker can sign malicious models | No immediate symptom. Detected through provenance audit (signed model not linked to legitimate training run). | Revoke compromised key. Re-sign all models with new key. Update Kyverno policy with new key reference. Audit all models signed with compromised key. |
When to Consider a Managed Alternative
Snyk (#48) for scanning model-serving container images and their dependency trees. Backblaze (#161) and Wasabi (#162) for immutable object storage with versioning enabled and object lock preventing deletion or modification. Protect AI (#141) for model-specific security scanning that goes beyond container-level checks.
Premium content pack: Model signing pipeline templates. GitHub Actions workflow for cosign signing, Kyverno admission policies, init container verification scripts, and provenance metadata schema.