Model Registry Access Control: Versioning, Signing, and Promotion Gates

Problem

Model registries are the bridge between training and production. A model pushed to the production registry gets served to users. Most teams use generic artifact storage (S3 buckets, OCI registries, or MLflow) with minimal access controls: any developer can push a model, there is no signing or integrity verification, and promotion from dev to production is a manual copy with no gates.

This creates a supply chain gap. If an attacker can write to the model registry, or if a developer accidentally pushes a broken model, production inference changes immediately. Unlike container images, which have mature signing and scanning tooling, model artifacts often lack any verification. A poisoned model can serve traffic for hours before anyone notices degraded behavior.

Target systems: OCI-based model registries (Harbor, ECR, GCR), MLflow Model Registry, or custom registries running on Kubernetes. Cosign for artifact signing.

Configuration

OCI Registry RBAC for Model Artifacts

Store models as OCI artifacts in the same registry infrastructure you use for container images. This lets you reuse existing RBAC, scanning, and signing tooling.

# harbor-robot-accounts.yaml - separate credentials per environment
# Dev account: push and pull from dev repository
apiVersion: v1
kind: Secret
metadata:
  name: registry-dev-credentials
  namespace: ml-training
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded>
  # Robot account: robot$ml-dev
  # Permissions: push + pull on projects/ml-models-dev/*
  # No access to ml-models-staging or ml-models-prod
---
# Staging account: pull from dev, push to staging
apiVersion: v1
kind: Secret
metadata:
  name: registry-staging-credentials
  namespace: ml-staging
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded>
  # Robot account: robot$ml-staging
  # Permissions: pull on ml-models-dev/*, push + pull on ml-models-staging/*
---
# Production account: pull only from prod repository
apiVersion: v1
kind: Secret
metadata:
  name: registry-prod-credentials
  namespace: ai-inference
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded>
  # Robot account: robot$ml-prod
  # Permissions: pull only on ml-models-prod/*
  # CANNOT push - only the promotion pipeline can write here
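
The per-environment split only works if the three robot accounts are actually distinct; a shared credential quietly collapses the whole boundary. As a minimal sanity check, the `.dockerconfigjson` payloads can be decoded and compared. This is a hypothetical helper for a CI check, not a Harbor API, and it assumes one robot account per secret:

```python
import base64
import json


def decode_dockerconfig(b64_payload: str) -> dict:
    """Decode the base64 .dockerconfigjson payload of a pull secret."""
    return json.loads(base64.b64decode(b64_payload))


def robot_accounts_are_distinct(secrets: dict[str, str]) -> bool:
    """True if every environment's secret uses its own robot account.

    Assumes each secret carries a single auth entry per registry.
    """
    usernames = set()
    for payload in secrets.values():
        cfg = decode_dockerconfig(payload)
        for auth in cfg.get("auths", {}).values():
            usernames.add(auth["username"])
    return len(usernames) == len(secrets)
```

Run it against the dev, staging, and prod secrets before applying them; a `False` result means two environments are sharing a robot account.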

Pushing Models as OCI Artifacts with ORAS

#!/bin/bash
# push_model.sh - push a model to the OCI registry with metadata
set -euo pipefail

MODEL_DIR="$1"       # e.g., /checkpoints/run-042
MODEL_NAME="$2"      # e.g., llama-v2-finetuned
MODEL_VERSION="$3"   # e.g., v1.0.42
REGISTRY="$4"        # e.g., registry.internal/ml-models-dev

# Create model manifest with metadata
cat > "${MODEL_DIR}/model-card.json" <<MANIFEST
{
  "name": "${MODEL_NAME}",
  "version": "${MODEL_VERSION}",
  "training_run_id": "$(cat "${MODEL_DIR}/run_id.txt")",
  "base_model": "meta-llama/Llama-2-7b",
  "framework": "pytorch",
  "created_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "created_by": "$(whoami)",
  "eval_accuracy": "$(jq -r .accuracy "${MODEL_DIR}/eval_metrics.json")",
  "data_hash": "$(cat "${MODEL_DIR}/data_manifest.sha256")"
}
MANIFEST

# Push model as OCI artifact using ORAS
oras push "${REGISTRY}/${MODEL_NAME}:${MODEL_VERSION}" \
  --config "${MODEL_DIR}/model-card.json:application/vnd.ml.model.config.v1+json" \
  "${MODEL_DIR}/model.safetensors:application/vnd.ml.model.weights" \
  "${MODEL_DIR}/tokenizer.json:application/vnd.ml.model.tokenizer" \
  "${MODEL_DIR}/config.json:application/vnd.ml.model.config"

echo "Model pushed: ${REGISTRY}/${MODEL_NAME}:${MODEL_VERSION}"
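
The model card can also be assembled in Python, which makes it easier to hash artifacts consistently before pushing. A minimal sketch, assuming the same `run_id.txt` and `eval_metrics.json` layout the script above expects (the function names are illustrative, not part of ORAS):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through sha256 so large weight files never load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_model_card(model_dir: Path, name: str, version: str) -> dict:
    """Assemble the metadata pushed as the OCI config blob."""
    metrics = json.loads((model_dir / "eval_metrics.json").read_text())
    return {
        "name": name,
        "version": version,
        "training_run_id": (model_dir / "run_id.txt").read_text().strip(),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "eval_accuracy": metrics["accuracy"],
        "weights_sha256": sha256_file(model_dir / "model.safetensors"),
    }
```

Recording the weights hash in the card means the promotion pipeline can later detect a substituted weights file even if the tag is unchanged.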

Cosign Signing for Model Artifacts

#!/bin/bash
# sign_model.sh - sign a model artifact in the OCI registry
set -euo pipefail

MODEL_REF="$1"  # e.g., registry.internal/ml-models-dev/llama-v2-finetuned:v1.0.42
KMS_KEY="$2"    # e.g., gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/model-signer

# Sign the OCI artifact
cosign sign --key "${KMS_KEY}" "${MODEL_REF}"

# Attach attestation with training metadata
cosign attest --key "${KMS_KEY}" \
  --predicate training-provenance.json \
  --type https://systemshardening.com/model-provenance/v1 \
  "${MODEL_REF}"

echo "Model signed and attested: ${MODEL_REF}"

// training-provenance.json - SLSA-style provenance for model artifacts
{
  "buildType": "https://systemshardening.com/model-training/v1",
  "builder": {
    "id": "https://registry.internal/ml-training-pipeline"
  },
  "invocation": {
    "configSource": {
      "uri": "git+https://git.internal/ml-configs@refs/heads/main",
      "digest": {"sha256": "abc123..."},
      "entrypoint": "configs/llama-v2-finetune.yaml"
    }
  },
  "materials": [
    {
      "uri": "registry.internal/ml-models-base/llama-2-7b:v1.0",
      "digest": {"sha256": "def456..."}
    },
    {
      "uri": "s3://ml-training-data/dataset-v3/",
      "digest": {"sha256": "ghi789..."}
    }
  ]
}
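
It is worth validating the predicate's shape before calling cosign attest, so a malformed provenance file fails in CI rather than at verification time. A minimal structural check; the required-field list below is inferred from the example above, not taken from the SLSA schema:

```python
REQUIRED_TOP_LEVEL = ("buildType", "builder", "invocation", "materials")


def validate_provenance(predicate: dict) -> list[str]:
    """Return a list of problems; an empty list means structurally valid."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP_LEVEL if k not in predicate]
    if "builder" in predicate and "id" not in predicate["builder"]:
        problems.append("builder.id is required")
    for i, material in enumerate(predicate.get("materials", [])):
        if "uri" not in material or "digest" not in material:
            problems.append(f"materials[{i}] needs uri and digest")
        elif "sha256" not in material["digest"]:
            problems.append(f"materials[{i}].digest needs sha256")
    return problems
```

Wire it into the signing script as a pre-flight step: refuse to attest if the returned list is non-empty.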

Admission Controller for Model Verification

Block unsigned or unverified models from being loaded in production.

# cosign-policy.yaml - Kyverno policy to verify model signatures
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-model-signatures
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-model-artifact-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - ai-inference
                - ai-inference-staging
      verifyImages:
        - imageReferences:
            - "registry.internal/ml-models-prod/*"
            - "registry.internal/ml-models-staging/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
                      -----END PUBLIC KEY-----
          attestations:
            - type: https://systemshardening.com/model-provenance/v1
              conditions:
                - all:
                    - key: "{{ buildType }}"
                      operator: Equals
                      value: "https://systemshardening.com/model-training/v1"

Promotion Pipeline with Gates

# model-promotion-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: promote-model-to-staging
  namespace: ml-pipeline
spec:
  entrypoint: promotion-gates
  serviceAccountName: model-promoter
  arguments:
    parameters:
      - name: model-ref
        value: "registry.internal/ml-models-dev/llama-v2-finetuned:v1.0.42"
      - name: target-env
        value: "staging"
  templates:
    - name: promotion-gates
      steps:
        # Gate 1: Verify signature from training pipeline
        - - name: verify-signature
            template: cosign-verify
        # Gate 2: Security scan for known model exploits
        - - name: security-scan
            template: model-scan
        # Gate 3: Evaluation benchmarks
        - - name: eval-benchmarks
            template: run-evaluation
        # Gate 4: Copy to target environment registry
        - - name: promote
            template: copy-artifact
            when: >-
              {{steps.verify-signature.outputs.parameters.verified}} == true &&
              {{steps.security-scan.outputs.parameters.clean}} == true &&
              {{steps.eval-benchmarks.outputs.parameters.passed}} == true
        # Gate 5: Sign with staging key (gated the same way, so a skipped
        # promotion is never followed by a signature)
        - - name: sign-promoted
            template: cosign-sign-promoted
            when: >-
              {{steps.verify-signature.outputs.parameters.verified}} == true &&
              {{steps.security-scan.outputs.parameters.clean}} == true &&
              {{steps.eval-benchmarks.outputs.parameters.passed}} == true

    - name: cosign-verify
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["sh", "-c"]
        args:
          - |
            cosign verify \
              --key gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/model-signer \
              {{workflow.parameters.model-ref}} && echo "true" > /tmp/verified || echo "false" > /tmp/verified
      outputs:
        parameters:
          - name: verified
            valueFrom:
              path: /tmp/verified

    - name: model-scan
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["python", "scan_model.py"]
        args:
          - "--model-ref={{workflow.parameters.model-ref}}"
          - "--check-pickle-exploits"
          - "--check-safetensors-headers"
          - "--check-embedded-code"
          - "--output=/tmp/scan-result"
      outputs:
        parameters:
          - name: clean
            valueFrom:
              path: /tmp/scan-result

    - name: run-evaluation
      container:
        image: registry.internal/ml-eval:v1.2
        command: ["python", "evaluate.py"]
        args:
          - "--model-ref={{workflow.parameters.model-ref}}"
          - "--benchmark=mmlu,hellaswag,truthfulqa"
          - "--min-mmlu=0.65"
          - "--max-toxicity=0.02"
          - "--output=/tmp/eval-result"
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      outputs:
        parameters:
          - name: passed
            valueFrom:
              path: /tmp/eval-result

    - name: copy-artifact
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["sh", "-c"]
        args:
          - |
            oras copy \
              {{workflow.parameters.model-ref}} \
              registry.internal/ml-models-{{workflow.parameters.target-env}}/llama-v2-finetuned:v1.0.42

    - name: cosign-sign-promoted
      container:
        image: registry.internal/ml-tools:v1.4
        command: ["sh", "-c"]
        args:
          - |
            cosign sign \
              --key gcpkms://projects/my-proj/locations/global/keyRings/ml/cryptoKeys/staging-signer \
              registry.internal/ml-models-{{workflow.parameters.target-env}}/llama-v2-finetuned:v1.0.42
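
The safetensors header check performed by the model-scan gate can be done without deserializing any weights: the format begins with an 8-byte little-endian header length, followed by a JSON header describing every tensor. The internals of scan_model.py are not shown in this pack, so the following is an illustrative sketch of that one check, not the actual scanner:

```python
import json
import struct

REQUIRED_TENSOR_KEYS = {"dtype", "shape", "data_offsets"}


def check_safetensors_header(data: bytes, max_header: int = 25_000_000) -> bool:
    """Validate the safetensors header without loading any tensor data."""
    if len(data) < 8:
        return False
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len > max_header or len(data) < 8 + header_len:
        return False  # absurd header length or truncated file
    try:
        header = json.loads(data[8 : 8 + header_len])
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    for name, entry in header.items():
        if name == "__metadata__":
            continue  # free-form string metadata block is allowed
        if not isinstance(entry, dict) or not REQUIRED_TENSOR_KEYS <= entry.keys():
            return False
    return True
```

A file that fails this check is either corrupt or not really safetensors, which is exactly the case where a loader might fall back to pickle and execute embedded code.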

Model Version Tracking with Integrity

# model_registry_client.py - typed client for model registry operations
import hashlib
import json
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelVersion:
    name: str
    version: str
    registry: str
    digest: str
    signed: bool
    environment: str  # dev, staging, prod
    training_run_id: str
    eval_scores: dict


class ModelRegistryClient:
    """Client for managing model versions with integrity checks."""

    def __init__(self, registry_base: str, kms_key: str):
        self.registry_base = registry_base
        self.kms_key = kms_key

    def get_model_digest(self, model_ref: str) -> str:
        """Get the OCI digest for a model reference."""
        result = subprocess.run(
            ["oras", "manifest", "fetch", "--descriptor", model_ref],
            capture_output=True,
            text=True,
            check=True,
        )
        descriptor = json.loads(result.stdout)
        return descriptor["digest"]

    def verify_model(self, model_ref: str) -> bool:
        """Verify cosign signature on a model artifact."""
        result = subprocess.run(
            ["cosign", "verify", "--key", self.kms_key, model_ref],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0

    def list_versions(self, model_name: str, environment: str) -> list[str]:
        """List all versions of a model in a given environment."""
        repo = f"{self.registry_base}/ml-models-{environment}/{model_name}"
        result = subprocess.run(
            ["oras", "repo", "tags", repo],
            capture_output=True,
            text=True,
            check=True,
        )
        tags = result.stdout.strip().split("\n")
        return [t for t in tags if t]  # avoid [''] when the repo has no tags

    def compare_digests(
        self, model_name: str, version: str, env_a: str, env_b: str
    ) -> bool:
        """Verify that a model in two environments has the same content."""
        ref_a = (
            f"{self.registry_base}/ml-models-{env_a}/{model_name}:{version}"
        )
        ref_b = (
            f"{self.registry_base}/ml-models-{env_b}/{model_name}:{version}"
        )
        digest_a = self.get_model_digest(ref_a)
        digest_b = self.get_model_digest(ref_b)
        return digest_a == digest_b
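
Production deployments should reference the digest returned by get_model_digest rather than a mutable tag, so a re-pushed tag cannot silently change what is served. A small helper to rewrite a tagged reference (hypothetical, not part of ORAS or the client above):

```python
def pin_by_digest(model_ref: str, digest: str) -> str:
    """Rewrite name:tag as name@sha256:... so the reference is immutable."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a sha256: digest")
    repo, _, last = model_ref.rpartition("/")
    name = last.split(":", 1)[0]  # drop the tag, keep the repository path
    base = f"{repo}/{name}" if repo else name
    return f"{base}@{digest}"
```

Typical use: resolve once at promotion time (`pin_by_digest(ref, client.get_model_digest(ref))`) and write the pinned reference into the inference deployment manifest.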

Expected Behaviour

  • Separate registry credentials per environment (dev, staging, prod) with least-privilege permissions
  • Production registry is pull-only; no human or training job can push directly to it
  • Every model artifact is signed with cosign using a KMS-backed key
  • Promotion between environments requires passing signature verification, security scan, and evaluation benchmarks
  • Models carry SLSA-style provenance attestations linking them to their training run, config, and data
  • Admission controller blocks unsigned models from running in production namespaces

Trade-offs

  • Separate registries per environment
    Impact: More infrastructure to manage
    Risk: Registry drift or misconfiguration
    Mitigation: Use IaC (Terraform) to manage registry configuration consistently
  • Cosign signing on every model
    Impact: Adds 30-60 seconds per model push
    Risk: Developers skip signing in a rush
    Mitigation: Integrate signing into the CI/CD pipeline so it happens automatically
  • Admission controller enforcement
    Impact: Unsigned models cannot be deployed, even in emergencies
    Risk: Blocks a critical hotfix deployment
    Mitigation: Maintain a break-glass procedure with audit logging. Use a separate emergency signing key with two-person approval.
  • Evaluation gates before promotion
    Impact: Slows promotion by 10-30 minutes per model
    Risk: Legitimate model blocked by flaky benchmark
    Mitigation: Use stable benchmarks. Allow manual override with approval from two engineers (logged).
  • OCI-based model storage
    Impact: Requires ORAS tooling; not all ML frameworks support OCI natively
    Risk: Tool compatibility issues
    Mitigation: Wrap ORAS in helper scripts. Most modern registries (Harbor, ECR, GCR) support OCI artifacts natively.

Failure Modes

  • Unsigned model pushed to dev registry
    Symptom: Promotion pipeline rejects the model at signature verification
    Detection: Workflow fails at the verify-signature step
    Recovery: Re-push the model with signing enabled. Investigate why signing was skipped.
  • KMS key rotation breaks verification
    Symptom: All model signatures fail verification
    Detection: Promotion pipeline blocks all promotions; alerts fire
    Recovery: Use the old key for verification during the transition. Re-sign models with the new key. Update verification policies.
  • Model passes eval but behaves badly in production
    Symptom: Users report degraded quality or harmful outputs
    Detection: Monitoring dashboards show increased error rates or toxicity scores
    Recovery: Roll back to the previous model version. Add the failure case to the evaluation benchmark.
  • Unauthorized push to production registry
    Symptom: Unknown model serving production traffic
    Detection: Registry audit logs show a push from an unexpected account
    Recovery: Immediately roll back. Revoke compromised credentials. Audit all models pushed by that account.

When to Consider a Managed Alternative

Managed model registries provide versioning, access control, and promotion workflows out of the box.

  • Weights & Biases: Model registry with versioning, lineage tracking, and team-based access control.
  • MLflow (managed): Model registry with staging/production stages and approval workflows.
  • Modal (#132): Serverless deployment with built-in model versioning.
  • Baseten (#140): Model deployment platform with registry and promotion features.
  • Snyk (#48): Scan model container images and base layers for vulnerabilities.

Premium content pack: Harbor registry configuration for ML model RBAC, cosign signing automation scripts, Kyverno admission policies for model verification, and Argo Workflow promotion pipeline templates.