Implementing AI Guardrails: Input Validation, Output Filtering, and Safety Classifiers in Production

Problem

Deploying an LLM without guardrails means deploying an application in which any user can make it say or do anything. Guardrails are the engineering controls that sit between users and the model: validating input before it reaches inference, filtering output before it reaches users, and classifying content for safety violations at both stages.

Most teams bolt guardrails on as an afterthought: a regex check here, a content filter there. This produces gaps. The input filter catches “how to hack” but not “explain the security testing methodology for unauthorized access.” The output filter blocks profanity but misses a detailed social engineering script. The PII detector catches US Social Security numbers but not UK National Insurance numbers.

Production guardrails require a pipeline architecture: pre-processing, inference, post-processing, each stage with independent controls, each stage monitored independently.

Threat Model

  • Adversary: (1) Users attempting to misuse the model (jailbreaks, harmful content). (2) Attackers injecting malicious content through the application. (3) The model itself generating unsafe content without adversarial input (hallucination, off-topic responses, PII leakage).
  • Objective: Generate harmful, biased, or policy-violating content. Extract PII from training data or context. Bypass topic restrictions. Exfiltrate data through model outputs.
  • Blast radius: Regulatory penalties (PII exposure). Brand damage (harmful content). Liability (incorrect advice in regulated domains). Data breach (context exfiltration).

Configuration

Guardrails Pipeline Architecture

# guardrails_pipeline.py - three-stage guardrails pipeline
from dataclasses import dataclass, field
from typing import List, Optional, Callable
import time

@dataclass
class GuardrailResult:
    passed: bool
    stage: str
    checks: List[dict] = field(default_factory=list)
    blocked_reason: Optional[str] = None
    latency_ms: float = 0.0

class GuardrailsPipeline:
    """
    Three-stage guardrails pipeline:
    1. Pre-processing: validate and sanitise input before inference
    2. Inference: the model call (not managed by guardrails)
    3. Post-processing: filter and validate output before returning to user
    """

    def __init__(self):
        # Each entry is a dict: {"name": str, "fn": Callable, "blocking": bool}
        self.pre_checks: List[dict] = []
        self.post_checks: List[dict] = []

    def add_pre_check(self, name: str, check_fn: Callable, blocking: bool = True):
        self.pre_checks.append({"name": name, "fn": check_fn, "blocking": blocking})

    def add_post_check(self, name: str, check_fn: Callable, blocking: bool = True):
        self.post_checks.append({"name": name, "fn": check_fn, "blocking": blocking})

    def run_pre_processing(self, user_input: str) -> GuardrailResult:
        start = time.time()
        checks = []

        for check in self.pre_checks:
            result = check["fn"](user_input)
            checks.append({"name": check["name"], "result": result})

            if not result.get("passed", True) and check["blocking"]:
                return GuardrailResult(
                    passed=False,
                    stage="pre-processing",
                    checks=checks,
                    blocked_reason=result.get("reason", check["name"]),
                    latency_ms=(time.time() - start) * 1000,
                )

        return GuardrailResult(
            passed=True,
            stage="pre-processing",
            checks=checks,
            latency_ms=(time.time() - start) * 1000,
        )

    def run_post_processing(self, model_output: str, user_input: str) -> GuardrailResult:
        start = time.time()
        checks = []

        for check in self.post_checks:
            result = check["fn"](model_output, user_input)
            checks.append({"name": check["name"], "result": result})

            if not result.get("passed", True) and check["blocking"]:
                return GuardrailResult(
                    passed=False,
                    stage="post-processing",
                    checks=checks,
                    blocked_reason=result.get("reason", check["name"]),
                    latency_ms=(time.time() - start) * 1000,
                )

        return GuardrailResult(
            passed=True,
            stage="post-processing",
            checks=checks,
            latency_ms=(time.time() - start) * 1000,
        )

Content Safety Classifier

# safety_classifier.py - content safety classification for input and output
from transformers import pipeline
from typing import Dict

class ContentSafetyClassifier:
    """
    Classify content for safety violations using a fine-tuned model.
    Categories: harassment, hate_speech, self_harm, sexual_content,
    violence, dangerous_content, regulated_advice.
    """

    def __init__(self, model_name: str = "models/content-safety-classifier"):
        self.classifier = pipeline(
            "text-classification",
            model=model_name,
            top_k=None,  # return all labels with scores
            truncation=True,
            max_length=512,
        )
        self.thresholds = {
            "harassment": 0.7,
            "hate_speech": 0.7,
            "self_harm": 0.5,       # lower threshold for safety-critical
            "sexual_content": 0.7,
            "violence": 0.7,
            "dangerous_content": 0.6,
            "regulated_advice": 0.6,
        }

    def classify(self, text: str) -> Dict:
        # Pass a batch of one so the pipeline reliably returns a nested list;
        # [0] then yields all {label, score} dicts for this text.
        results = self.classifier([text])[0]
        scores = {r["label"]: r["score"] for r in results}

        violations = []
        for category, threshold in self.thresholds.items():
            if scores.get(category, 0) > threshold:
                violations.append({
                    "category": category,
                    "score": round(scores[category], 4),
                    "threshold": threshold,
                })

        return {
            "passed": len(violations) == 0,
            "violations": violations,
            "scores": {k: round(v, 4) for k, v in scores.items()},
            "reason": f"safety_violation: {violations[0]['category']}" if violations else None,
        }

PII Detection and Redaction

# pii_guardrail.py - detect and redact PII in both input and output
import re
from typing import Dict, List, Tuple

class PIIGuardrail:
    """
    Detect and redact personally identifiable information.
    Runs on both input (prevent PII from reaching the model)
    and output (prevent PII from reaching the user).
    """

    PII_PATTERNS = {
        "email": (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
        "phone_us": (r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]"),
        "phone_uk": (r"\b(?:\+44[-.\s]?|0)(?:\d[-.\s]?){9,10}\b", "[PHONE]"),
        "ssn": (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]"),
        "nino": (r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b", "[NINO]"),
        "credit_card": (r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"),
        "ipv4": (r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", "[IP]"),
        "date_of_birth": (r"\b(?:0[1-9]|[12]\d|3[01])[/-](?:0[1-9]|1[0-2])[/-]\d{4}\b", "[DOB]"),
    }

    def detect(self, text: str) -> List[dict]:
        findings = []
        for pii_type, (pattern, _) in self.PII_PATTERNS.items():
            matches = re.findall(pattern, text)
            if matches:
                findings.append({"type": pii_type, "count": len(matches)})
        return findings

    def redact(self, text: str) -> Tuple[str, List[dict]]:
        findings = self.detect(text)
        redacted = text
        for pii_type, (pattern, replacement) in self.PII_PATTERNS.items():
            redacted = re.sub(pattern, replacement, redacted)
        return redacted, findings

    def check_input(self, text: str) -> dict:
        findings = self.detect(text)
        return {
            "passed": len(findings) == 0,
            "pii_found": findings,
            "reason": f"pii_in_input: {findings[0]['type']}" if findings else None,
        }

    def check_output(self, output: str, user_input: str) -> dict:
        redacted, findings = self.redact(output)
        return {
            "passed": len(findings) == 0,
            "pii_found": findings,
            "redacted_output": redacted,
            "reason": f"pii_in_output: {findings[0]['type']}" if findings else None,
        }
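The credit_card pattern above matches any 16-digit run, which also catches order numbers and tracking IDs (the "PII regex too broad" failure mode below). One way to cut false positives, sketched here as a standalone helper rather than as part of the class, is to confirm candidates with a Luhn checksum before flagging them:

```python
import re

# Same candidate pattern as the class above.
CARD_CANDIDATE = re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from doubles above 9, and require the sum % 10 == 0."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return only 16-digit candidates that pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[\s-]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits
```

A 16-digit order ID fails the checksum roughly 90% of the time, so this filter removes most non-card matches without loosening the pattern itself.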

Topic Restriction Enforcement

# topic_restrictor.py - enforce topic boundaries for the LLM
from transformers import pipeline

class TopicRestrictor:
    """
    Restrict LLM responses to approved topics.
    Uses zero-shot classification to determine if input/output
    falls within allowed topic boundaries.
    """

    def __init__(
        self,
        allowed_topics: list,
        blocked_topics: list,
        model_name: str = "facebook/bart-large-mnli",
    ):
        self.allowed_topics = allowed_topics
        self.blocked_topics = blocked_topics
        self.classifier = pipeline(
            "zero-shot-classification",
            model=model_name,
        )
        self.allowed_threshold = 0.4
        self.blocked_threshold = 0.6

    def check(self, text: str) -> dict:
        # Check blocked topics first
        if self.blocked_topics:
            blocked_result = self.classifier(text, self.blocked_topics)
            top_blocked = blocked_result["labels"][0]
            top_blocked_score = blocked_result["scores"][0]

            if top_blocked_score > self.blocked_threshold:
                return {
                    "passed": False,
                    "reason": f"blocked_topic: {top_blocked} (score: {top_blocked_score:.2f})",
                    "blocked_topic": top_blocked,
                    "score": round(top_blocked_score, 4),
                }

        # Check allowed topics
        allowed_result = self.classifier(text, self.allowed_topics)
        top_allowed = allowed_result["labels"][0]
        top_allowed_score = allowed_result["scores"][0]

        if top_allowed_score < self.allowed_threshold:
            return {
                "passed": False,
                "reason": f"off_topic: best match '{top_allowed}' scored {top_allowed_score:.2f}",
                "best_topic": top_allowed,
                "score": round(top_allowed_score, 4),
            }

        return {
            "passed": True,
            "matched_topic": top_allowed,
            "score": round(top_allowed_score, 4),
        }

Kubernetes Deployment with Latency-Aware Architecture

# guardrails-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guardrails-service
  namespace: ai-services
spec:
  replicas: 3
  selector:
    matchLabels:
      app: guardrails-service
  template:
    metadata:
      labels:
        app: guardrails-service
    spec:
      containers:
        # Fast checks (pattern matching, PII regex) - runs synchronously
        - name: fast-guardrails
          image: internal-registry/guardrails-fast:2.0.0
          ports:
            - containerPort: 8080
          env:
            - name: CHECKS
              value: "pii,injection_patterns,length_limit"
            - name: TIMEOUT_MS
              value: "50"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
        # Slow checks (ML classifiers) - runs asynchronously where possible
        - name: ml-guardrails
          image: internal-registry/guardrails-ml:2.0.0
          ports:
            - containerPort: 8081
          env:
            - name: CHECKS
              value: "safety_classifier,topic_restriction,jailbreak_classifier"
            - name: TIMEOUT_MS
              value: "500"
            - name: ASYNC_MODE
              value: "true"
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: guardrails-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: guardrails-service
  namespace: ai-services
spec:
  selector:
    app: guardrails-service
  ports:
    - name: fast
      port: 8080
      targetPort: 8080
    - name: ml
      port: 8081
      targetPort: 8081

Prometheus Monitoring

# prometheus-guardrails.yaml
groups:
  - name: guardrails
    interval: 1m
    rules:
      - alert: GuardrailBlockRate
        expr: >
          rate(guardrail_blocked_total[5m])
          / rate(guardrail_requests_total[5m]) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ">20% of requests blocked by guardrails"
          description: "Check guardrail_blocked_total by reason label for breakdown."

      - alert: GuardrailLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(guardrail_duration_seconds_bucket{stage="pre-processing"}[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pre-processing guardrail P99 latency exceeds 200ms"

      - alert: PIILeakage
        expr: increase(guardrail_pii_detected_total{stage="output"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "PII detected in model output"
          description: "PII was found in model output and redacted. Investigate the source."

      - alert: SafetyViolation
        expr: increase(guardrail_safety_violation_total{stage="output"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Safety violation detected in model output"

Expected Behaviour

  • Fast guardrails (pattern matching, PII regex) complete within 5-50ms
  • ML-based guardrails (safety classifier, topic restriction) complete within 100-500ms
  • PII detected in input is logged and optionally redacted before reaching the model
  • PII detected in output is always redacted before reaching the user
  • Safety violations in output trigger blocking and alerting
  • Off-topic queries are rejected with a helpful redirect message
  • Block rate above 20% triggers investigation (may indicate overly aggressive rules or an attack)

Trade-offs

  • Synchronous ML guardrails. Impact: adds 100-500ms to every request. Risk: user-facing latency exceeds SLA. Mitigation: run ML checks asynchronously; block only on fast checks, flag on slow checks.
  • PII redaction on input. Impact: removes PII before the model sees it. Risk: model cannot reference user-provided PII even when legitimate (e.g., “what does my email john@example.com do?”). Mitigation: allow PII passthrough for authenticated, consented use cases; redact by default.
  • Topic restriction. Impact: keeps the model on-topic. Risk: legitimate edge-case queries are rejected. Mitigation: tune allowed topics broadly; log rejected queries for periodic review and topic expansion.
  • Safety classifier threshold 0.5 for self-harm. Impact: very aggressive blocking for self-harm content. Risk: false positives on mental health support queries. Mitigation: route self-harm detections to human review rather than hard blocking; provide crisis resources.
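The synchronous-versus-asynchronous trade-off can be sketched with asyncio: fast checks gate the response on the request path, while the slow ML check runs as a background task that only flags. Names, timings, and check logic here are illustrative stand-ins, not the production checks:

```python
import asyncio

def fast_check(user_input: str) -> dict:
    """Blocking pattern check: must pass before the request proceeds."""
    if len(user_input) > 4000:
        return {"passed": False, "reason": "input_too_long"}
    return {"passed": True, "reason": None}

async def slow_ml_check(user_input: str) -> dict:
    """Stand-in for an ML classifier call; flags but never blocks."""
    await asyncio.sleep(0.01)  # simulated model latency
    return {"passed": "forbidden" not in user_input, "reason": "safety_flag"}

async def handle_request(user_input: str) -> dict:
    result = fast_check(user_input)
    if not result["passed"]:
        return {"response": None, "blocked": True, "flags": []}
    # Launch the ML check without awaiting it on the request path:
    # the response is formed before the classifier finishes.
    ml_task = asyncio.create_task(slow_ml_check(user_input))
    response = {"response": f"echo: {user_input}", "blocked": False, "flags": []}
    # In production the flag would feed logging/alerting after the fact;
    # here we await it so the sketch is deterministic.
    ml = await ml_task
    if not ml["passed"]:
        response["flags"].append(ml["reason"])
    return response
```

The user never waits on the classifier, at the cost that a flagged response has already been delivered; that is the trade the table above describes.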

Failure Modes

  • ML guardrail service down. Symptom: all requests pass without ML-level checking. Detection: health check failures; guardrail check count drops to zero. Recovery: fast guardrails continue operating; alert on ML guardrail downtime; queue requests for retroactive scanning.
  • PII regex too broad. Symptom: legitimate numbers (order IDs, dates) redacted. Detection: user reports; redaction rate spikes for specific PII types. Recovery: refine regex patterns; add context-aware rules (only flag numbers in certain positions).
  • Topic classifier drift. Symptom: allowed topics score too low; everything is rejected. Detection: block rate climbs over time without changes to traffic. Recovery: retrain or recalibrate the topic classifier; update the allowed topic list.
  • Safety classifier false negative. Symptom: harmful content passes all checks. Detection: user report; external disclosure; output monitoring on historical data. Recovery: add the missed content pattern to the training set; retrain the classifier; tighten output monitoring.
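The "queue requests for retroactive scanning" recovery can be as simple as a bounded buffer that captures traffic while the ML service is down and is drained once it recovers. This is an in-memory sketch with assumed names; a real deployment would back it with a persistent queue (e.g., Kafka or SQS):

```python
from collections import deque

class RetroScanQueue:
    """Buffer request/response pairs that skipped ML checks for later scanning."""

    def __init__(self, maxlen: int = 10_000):
        # Bounded so a long outage cannot exhaust memory; oldest entries drop first.
        self.pending = deque(maxlen=maxlen)

    def enqueue(self, request_id: str, user_input: str, model_output: str):
        self.pending.append({
            "request_id": request_id,
            "input": user_input,
            "output": model_output,
        })

    def drain(self, scan_fn, batch_size: int = 100) -> list:
        """Run scan_fn over queued items once the ML service recovers.
        scan_fn follows the check contract: output -> {"passed": bool, ...}.
        Returns the request IDs that fail the retroactive scan."""
        flagged = []
        for _ in range(min(batch_size, len(self.pending))):
            item = self.pending.popleft()
            if not scan_fn(item["output"]).get("passed", True):
                flagged.append(item["request_id"])
        return flagged
```

Flagged IDs then feed incident response: the harmful output already reached a user, so the goal is disclosure and follow-up, not prevention.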

When to Consider a Managed Alternative

Building and maintaining a full guardrails pipeline (pattern matching, PII detection, safety classification, topic restriction, monitoring) is a significant engineering investment. Model retraining, pattern updates, and threshold tuning are ongoing costs.

  • Lakera (#142): Managed guardrails API with input/output filtering, jailbreak detection, PII detection, and content safety. Sub-50ms latency. Continuously updated models.
  • Cloudflare (#29) AI Gateway: Edge-level guardrails for AI endpoints. Content filtering, rate limiting, and observability.

Premium content pack: Guardrails implementation pack. Three-stage pipeline framework (Python), content safety classifier training pipeline, PII detection library with international patterns, topic restriction configuration, Kubernetes deployment manifests, and Prometheus monitoring rules.