A/B Model Deployment Safety: Canary Rollouts, Traffic Splitting, and Automated Rollback for ML Models
Problem
Deploying a new ML model version is not the same as deploying a new application version. A container image that passes health checks can still serve a model that produces subtly wrong, toxic, or degraded outputs. Standard Kubernetes rolling updates check liveness and readiness, not output quality. A model that returns HTTP 200 with confidently wrong answers passes every infrastructure health check.
Teams that deploy models with kubectl apply and a rolling update strategy risk sending 100% of production traffic to a model that has regressed on accuracy, increased latency due to a larger architecture, or developed new failure modes on specific input categories. By the time someone notices, thousands of bad predictions have been served.
Model deployments need canary rollouts that evaluate model-specific metrics (accuracy, latency percentiles, toxicity scores) before increasing traffic, and automated rollback when those metrics degrade.
Target systems: Kubernetes model serving deployments with Istio service mesh or Envoy-based gateways. Works with any model serving framework (TorchServe, Triton, vLLM) behind an HTTP/gRPC endpoint.
Threat Model
- Adversary: Not primarily an external attacker. The threat is an untested or degraded model version reaching production traffic. However, an attacker who can trigger a model deployment (compromised CI/CD) can use this as a vector.
- Objective: Deploy a model that produces harmful, biased, or incorrect outputs at scale. Exhaust GPU resources with a model that has higher latency characteristics. Cause a denial-of-service by deploying a model that crashes on certain inputs.
- Blast radius: Degraded user experience (quality). Financial loss from incorrect predictions (integrity). Reputational damage from toxic or biased outputs (safety).
Configuration
Istio Traffic Splitting for Model Versions
Deploy the new model version alongside the existing one. Use Istio VirtualService to control what percentage of traffic reaches each version.
# model-v1-deployment.yaml - current production model
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference-v1
namespace: ai-inference
labels:
app: llm-inference
version: v1
model-version: "1.0.42"
spec:
replicas: 3
selector:
matchLabels:
app: llm-inference
version: v1
template:
metadata:
labels:
app: llm-inference
version: v1
model-version: "1.0.42"
annotations:
sidecar.istio.io/inject: "true"
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: inference
image: registry.internal/ml-serving:v2.1.0
args: ["--model=/models/llm-v1.0.42"]
ports:
- containerPort: 8080
name: http
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
volumeMounts:
- name: models
mountPath: /models
readOnly: true
volumes:
- name: models
persistentVolumeClaim:
claimName: model-storage-v1
readOnly: true
---
# model-v2-deployment.yaml - canary model (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference-v2
namespace: ai-inference
labels:
app: llm-inference
version: v2
model-version: "1.0.43"
spec:
replicas: 1 # Start with minimal replicas for canary
selector:
matchLabels:
app: llm-inference
version: v2
template:
metadata:
labels:
app: llm-inference
version: v2
model-version: "1.0.43"
annotations:
sidecar.istio.io/inject: "true"
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: inference
image: registry.internal/ml-serving:v2.1.0
args: ["--model=/models/llm-v1.0.43"]
ports:
- containerPort: 8080
name: http
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
volumeMounts:
- name: models
mountPath: /models
readOnly: true
volumes:
- name: models
persistentVolumeClaim:
claimName: model-storage-v2
readOnly: true
# istio-traffic-split.yaml - start with 5% canary traffic
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
name: llm-inference
namespace: ai-inference
spec:
hosts:
- llm-inference.ai-inference.svc.cluster.local
http:
- route:
- destination:
host: llm-inference.ai-inference.svc.cluster.local
subset: v1
weight: 95
- destination:
host: llm-inference.ai-inference.svc.cluster.local
subset: v2
weight: 5
timeout: 30s
retries:
attempts: 2
perTryTimeout: 15s
retryOn: 5xx,reset,connect-failure
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
name: llm-inference
namespace: ai-inference
spec:
host: llm-inference.ai-inference.svc.cluster.local
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
trafficPolicy:
connectionPool:
http:
h2UpgradePolicy: UPGRADE
maxRequestsPerConnection: 100
outlierDetection:
consecutive5xxErrors: 3
interval: 30s
baseEjectionTime: 60s
maxEjectionPercent: 100
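Once both subsets receive traffic, it is worth verifying that the observed split matches the configured weights. A minimal sketch, assuming you can sample which version served each request (for example via a response header the serving container sets; none of the configs above add one):

```python
from collections import Counter

def split_within_tolerance(observed_versions, expected_weights, tolerance=0.03):
    """Check whether observed version labels match expected traffic weights.

    observed_versions: list of version labels, one per sampled request
    expected_weights:  dict like {"v1": 0.95, "v2": 0.05}
    tolerance:         max allowed absolute deviation per subset
    """
    total = len(observed_versions)
    if total == 0:
        return False
    counts = Counter(observed_versions)
    for version, expected in expected_weights.items():
        observed = counts.get(version, 0) / total
        if abs(observed - expected) > tolerance:
            return False
    return True

# Example: 950 v1 responses and 50 v2 responses out of 1000 sampled requests
sample = ["v1"] * 950 + ["v2"] * 50
print(split_within_tolerance(sample, {"v1": 0.95, "v2": 0.05}))  # True
```

Small samples will deviate from the configured weights by chance, so size the tolerance to the sample count before treating a mismatch as a misconfiguration.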
Flagger Canary with ML-Specific Metrics
Use Flagger to automate the canary progression based on model-specific metrics, not just HTTP success rates.
# flagger-canary.yaml - automated canary with ML metrics
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: llm-inference
namespace: ai-inference
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference
service:
port: 8080
targetPort: 8080
gateways:
- mesh
hosts:
- llm-inference.ai-inference.svc.cluster.local
analysis:
# Canary progression schedule
interval: 2m # Check metrics every 2 minutes
threshold: 3 # Max failed checks before rollback
maxWeight: 50 # Max canary traffic percentage
stepWeight: 10 # Increase by 10% each step
# ML-specific metrics for canary analysis
metrics:
# Standard: request success rate must stay above 99%
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
# Standard: p99 latency must stay under 2 seconds
- name: request-duration
thresholdRange:
max: 2000
interval: 1m
# Custom: model-specific quality metric from Prometheus
- name: model-accuracy-score
templateRef:
name: model-accuracy
namespace: ai-inference
thresholdRange:
min: 0.85
interval: 2m
# Custom: toxicity score must stay below threshold
- name: model-toxicity-score
templateRef:
name: model-toxicity
namespace: ai-inference
thresholdRange:
max: 0.02
interval: 2m
---
# Prometheus metric template for model accuracy
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: model-accuracy
namespace: ai-inference
spec:
provider:
type: prometheus
address: http://prometheus.monitoring:9090
query: |
sum(rate(model_correct_predictions_total{
deployment=~"{{ target }}",
namespace="{{ namespace }}"
}[2m]))
/
sum(rate(model_total_predictions_total{
deployment=~"{{ target }}",
namespace="{{ namespace }}"
}[2m]))
---
# Prometheus metric template for toxicity
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: model-toxicity
namespace: ai-inference
spec:
provider:
type: prometheus
address: http://prometheus.monitoring:9090
query: |
avg(model_toxicity_score{
deployment=~"{{ target }}",
namespace="{{ namespace }}"
})
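These PromQL queries can be dry-run against Prometheus before Flagger depends on them. A small helper, shaped around the Prometheus HTTP API's instant-query JSON, that applies a thresholdRange and treats an empty result as a failed check:

```python
def passes_threshold_range(prom_response: dict, min_val=None, max_val=None) -> bool:
    """Evaluate a Prometheus instant-query response against a thresholdRange.

    Returns False when the query yields no samples, so a stalled metrics
    pipeline counts as a failed check rather than a silent pass.
    """
    results = prom_response.get("data", {}).get("result", [])
    if not results:
        return False
    value = float(results[0]["value"][1])  # value is [timestamp, "string"]
    if min_val is not None and value < min_val:
        return False
    if max_val is not None and value > max_val:
        return False
    return True

# Shaped like /api/v1/query output for the accuracy query above
resp = {"data": {"result": [{"value": [1700000000, "0.91"]}]}}
print(passes_threshold_range(resp, min_val=0.85))                 # True
print(passes_threshold_range({"data": {"result": []}}, min_val=0.85))  # False
```

Failing closed on missing data matters: without it, a stalled exporter makes every canary look healthy.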
Model Metrics Instrumentation
Instrument your inference endpoint to emit the custom metrics that Flagger uses for canary analysis.
# metrics_middleware.py - Prometheus metrics for model quality
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Call start_http_server(<port>) once at process startup to expose /metrics
# Counters for accuracy tracking
PREDICTIONS_TOTAL = Counter(
"model_total_predictions_total",
"Total number of predictions",
["deployment", "model_version"],
)
CORRECT_PREDICTIONS = Counter(
"model_correct_predictions_total",
"Predictions matching quality threshold",
["deployment", "model_version"],
)
# Histogram for inference latency
INFERENCE_LATENCY = Histogram(
"model_inference_duration_seconds",
"Time spent on model inference",
["deployment", "model_version"],
buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0],
)
# Gauge for toxicity score
TOXICITY_SCORE = Gauge(
"model_toxicity_score",
"Rolling average toxicity score of model outputs",
["deployment", "model_version"],
)
# Histogram for confidence score distribution
CONFIDENCE_SCORE = Histogram(
"model_confidence_score",
"Distribution of model confidence scores",
["deployment", "model_version"],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
)
class ModelMetricsMiddleware:
"""Wrap inference calls with Prometheus metrics."""
def __init__(self, deployment_name: str, model_version: str):
self.deployment = deployment_name
self.version = model_version
self.labels = {
"deployment": deployment_name,
"model_version": model_version,
}
def record_prediction(
self,
latency_seconds: float,
confidence: float,
toxicity: float,
quality_pass: bool,
):
PREDICTIONS_TOTAL.labels(**self.labels).inc()
if quality_pass:
CORRECT_PREDICTIONS.labels(**self.labels).inc()
INFERENCE_LATENCY.labels(**self.labels).observe(latency_seconds)
TOXICITY_SCORE.labels(**self.labels).set(toxicity)
CONFIDENCE_SCORE.labels(**self.labels).observe(confidence)
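One caveat: TOXICITY_SCORE is described as a rolling average, but record_prediction sets the raw per-request score, so the gauge jitters with individual outputs. A minimal exponential-moving-average helper (the alpha here is an illustrative choice) can smooth scores before calling set():

```python
class EmaSmoother:
    """Exponential moving average for per-request scores before gauge.set()."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha   # weight given to the newest sample
        self.value = None    # no samples seen yet

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # first sample seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

smoother = EmaSmoother(alpha=0.5)
print(smoother.update(1.0))  # 1.0
print(smoother.update(0.0))  # 0.5
print(smoother.update(0.0))  # 0.25
```

In the middleware, `TOXICITY_SCORE.labels(**self.labels).set(smoother.update(toxicity))` would feed the gauge the smoothed value instead of the raw one.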
Automated Rollback Script
For environments without Flagger, use a script that monitors metrics and triggers rollback.
#!/bin/bash
# model_rollback_monitor.sh - monitor canary and rollback if degraded
set -euo pipefail
NAMESPACE="ai-inference"
CANARY_DEPLOYMENT="llm-inference-v2"
STABLE_DEPLOYMENT="llm-inference-v1"
PROMETHEUS_URL="http://prometheus.monitoring:9090"
CHECK_INTERVAL=120 # seconds
MAX_FAILURES=3
FAILURE_COUNT=0
check_metric() {
local query="$1"
local threshold="$2"
local operator="$3" # "gt" or "lt"
  # Fall back to "0" when Prometheus is unreachable or returns no samples,
  # so `set -e` does not silently kill the monitor loop
  value=$(curl -sf "${PROMETHEUS_URL}/api/v1/query" \
    --data-urlencode "query=${query}" \
    | jq -r '.data.result[0].value[1] // "0"' || echo "0")
if [ "$operator" = "lt" ] && [ "$(echo "$value < $threshold" | bc -l)" -eq 1 ]; then
return 1 # Below minimum threshold
fi
if [ "$operator" = "gt" ] && [ "$(echo "$value > $threshold" | bc -l)" -eq 1 ]; then
return 1 # Above maximum threshold
fi
return 0
}
rollback() {
echo "ROLLBACK: Shifting all traffic to stable version"
# Set canary weight to 0
kubectl -n "$NAMESPACE" patch virtualservice llm-inference --type=json \
-p='[
{"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 100},
{"op": "replace", "path": "/spec/http/0/route/1/weight", "value": 0}
]'
# Scale down canary
kubectl -n "$NAMESPACE" scale deployment "$CANARY_DEPLOYMENT" --replicas=0
echo "Rollback complete. Canary traffic set to 0%."
exit 1
}
echo "Monitoring canary deployment: ${CANARY_DEPLOYMENT}"
while true; do
  echo "$(date): Checking canary metrics..."
  CYCLE_FAILED=0
  # Check accuracy (must stay above 0.85)
  if ! check_metric \
    "sum(rate(model_correct_predictions_total{deployment=\"${CANARY_DEPLOYMENT}\"}[5m])) / sum(rate(model_total_predictions_total{deployment=\"${CANARY_DEPLOYMENT}\"}[5m]))" \
    "0.85" "lt"; then
    echo "WARNING: Accuracy below threshold"
    CYCLE_FAILED=1
  fi
  # Check toxicity (must stay below 0.02)
  if ! check_metric \
    "avg(model_toxicity_score{deployment=\"${CANARY_DEPLOYMENT}\"})" \
    "0.02" "gt"; then
    echo "WARNING: Toxicity above threshold"
    CYCLE_FAILED=1
  fi
  # Check p99 latency (must stay below 2s)
  if ! check_metric \
    "histogram_quantile(0.99, rate(model_inference_duration_seconds_bucket{deployment=\"${CANARY_DEPLOYMENT}\"}[5m]))" \
    "2.0" "gt"; then
    echo "WARNING: P99 latency above threshold"
    CYCLE_FAILED=1
  fi
  # Count consecutive failed cycles; reset only when every check passes
  if [ "$CYCLE_FAILED" -eq 1 ]; then
    FAILURE_COUNT=$((FAILURE_COUNT + 1))
  else
    echo "$(date): All metrics healthy"
    FAILURE_COUNT=0
  fi
  if [ "$FAILURE_COUNT" -ge "$MAX_FAILURES" ]; then
    rollback
  fi
  sleep "$CHECK_INTERVAL"
done
Traffic Shifting Schedule
# progressive-traffic-shift.yaml - CronJob to gradually increase canary traffic
apiVersion: batch/v1
kind: CronJob
metadata:
name: canary-traffic-increase
namespace: ai-inference
spec:
schedule: "*/30 * * * *" # Every 30 minutes
jobTemplate:
spec:
template:
spec:
serviceAccountName: canary-manager
containers:
- name: traffic-shift
image: bitnami/kubectl:1.30
command: ["sh", "-c"]
args:
- |
# Get current canary weight (route[1] is the canary subset)
CURRENT=$(kubectl -n ai-inference get virtualservice llm-inference \
-o jsonpath='{.spec.http[0].route[1].weight}')
CURRENT=${CURRENT:-0}
if [ "$CURRENT" -ge 50 ]; then
echo "Canary at max weight (${CURRENT}%). No further increase."
exit 0
fi
NEW_WEIGHT=$((CURRENT + 10))
# Cap at 50: from a 5% start the steps run 5 -> 15 -> ... -> 45 -> 55 without it
if [ "$NEW_WEIGHT" -gt 50 ]; then
NEW_WEIGHT=50
fi
STABLE_WEIGHT=$((100 - NEW_WEIGHT))
echo "Increasing canary from ${CURRENT}% to ${NEW_WEIGHT}%"
kubectl -n ai-inference patch virtualservice llm-inference --type=json \
-p="[
{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": ${STABLE_WEIGHT}},
{\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": ${NEW_WEIGHT}}
]"
restartPolicy: OnFailure
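The CronJob above walks a fixed progression. A quick sketch (assuming the 5% starting weight from the VirtualService, 10-point steps, a 50% ceiling matching Flagger's maxWeight, and the 30-minute cron interval) computes the resulting schedule:

```python
def canary_schedule(start=5, step=10, ceiling=50, interval_min=30):
    """Return (weights, minutes_to_ceiling) for a stepped canary rollout."""
    weights = [start]
    while weights[-1] < ceiling:
        weights.append(min(weights[-1] + step, ceiling))
    # One interval elapses between each consecutive weight change
    return weights, (len(weights) - 1) * interval_min

weights, minutes = canary_schedule()
print(weights)  # [5, 15, 25, 35, 45, 50]
print(minutes)  # 150
```

150 minutes to reach the ceiling, plus final promotion, is consistent with the 2-3 hour estimate in the trade-offs table.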
Expected Behaviour
- New model versions start receiving 5% of traffic, increasing by 10 percentage points per step up to a 50% ceiling
- Canary progression halts and rolls back if accuracy drops below 85%, toxicity exceeds 2%, or p99 latency exceeds 2 seconds
- Istio outlier detection ejects unhealthy canary pods after 3 consecutive 5xx errors
- All model versions emit Prometheus metrics for accuracy, latency, toxicity, and confidence
- Rollback shifts 100% of traffic to the stable version and scales the canary to zero
- Traffic splitting is transparent to clients; all requests go through the same service endpoint
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| 5% initial canary traffic | New model is tested on real traffic but on a small percentage | 5% of users may see degraded results during testing | Use synthetic traffic for initial validation before real traffic exposure |
| Automated rollback on accuracy drop | Fast recovery from bad models | Flaky metrics cause unnecessary rollbacks | Set thresholds with appropriate margins. Require 3 consecutive failures before rollback. |
| Istio sidecar on GPU pods | Adds ~50MB memory and slight latency overhead | Resource overhead on expensive GPU nodes | Sidecar resource usage is negligible compared to GPU workload. Latency overhead is typically under 1ms. |
| Progressive traffic increase every 30 minutes | Full rollout takes 2-3 hours | Slower time to full deployment | Acceptable trade-off for production safety. Use faster schedules (10 min) for low-risk updates. |
| Separate PVCs per model version | Doubles storage during canary | Storage cost increase | Clean up old model PVCs after successful full rollout. Use shared storage with version-specific paths. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Canary model crashes on specific inputs | Istio reports 5xx errors from canary subset | Outlier detection ejects canary pods; Flagger detects success rate drop | Automatic rollback via Flagger or manual VirtualService patch. Investigate crash-inducing inputs. |
| Model produces subtly wrong outputs (no errors) | Users report incorrect answers; downstream systems produce bad results | Custom accuracy metric drops below threshold; Flagger triggers rollback | Roll back. Add the failing cases to the evaluation benchmark for future gate checks. |
| Metrics pipeline delay | Flagger sees stale metrics during canary analysis | Flagger reports “no data” for custom metrics | Configure Flagger to treat missing metrics as failure. Fix Prometheus scrape interval. |
| VirtualService misconfiguration | All traffic goes to canary or stable, ignoring weights | Traffic monitoring shows unexpected distribution | Validate VirtualService with istioctl analyze. Use admission webhook to validate traffic split configs. |
| Rollback fails (stable version also broken) | Both model versions serve bad results | Monitoring shows degradation across all subsets | Scale down both deployments. Deploy a known-good model version from the model registry. |
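The VirtualService misconfiguration row is partly checkable in code. A sketch of the weight validation an admission webhook or CI step might run (the function and its error strings are illustrative, not an existing tool):

```python
def validate_route_weights(virtual_service: dict) -> list:
    """Return a list of problems with a VirtualService's HTTP route weights."""
    problems = []
    for i, http_route in enumerate(virtual_service.get("spec", {}).get("http", [])):
        routes = http_route.get("route", [])
        weights = [r.get("weight", 0) for r in routes]
        # Istio requires multi-destination weights to sum to 100
        if len(routes) > 1 and sum(weights) != 100:
            problems.append(f"http[{i}]: weights {weights} sum to {sum(weights)}, not 100")
        if any(w < 0 or w > 100 for w in weights):
            problems.append(f"http[{i}]: weight outside 0-100 range")
    return problems

bad = {"spec": {"http": [{"route": [{"weight": 95}, {"weight": 10}]}]}}
print(validate_route_weights(bad))  # ["http[0]: weights [95, 10] sum to 105, not 100"]
```

Running this against the rendered manifest in CI catches the weight-sum mistakes that istioctl analyze flags only after they reach the cluster.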
When to Consider a Managed Alternative
Managed ML deployment platforms handle canary rollouts, traffic splitting, and automated rollback.
- Modal (#132): Serverless model deployment with built-in rollback.
- Baseten (#140): Model deployment with traffic splitting and monitoring.
- Replicate (#133): Managed model hosting with versioning.
- Cloudflare (#29): Edge-level traffic management and load balancing in front of model endpoints.
- Kong (#86): API gateway with built-in canary release plugins.
Premium content pack: Complete Istio and Flagger configurations for ML canary deployments, Prometheus metric templates for model quality monitoring, and automated rollback scripts.