LLM Jailbreak Defence: Detecting and Preventing System Prompt Bypasses in Production

LLM Jailbreak Defence: Detecting and Preventing System Prompt Bypasses in Production

Problem

LLM jailbreaks are inputs that cause a model to ignore its system prompt, safety training, or usage policies. Unlike prompt injection (which targets application-level instructions), jailbreaks target the model’s own alignment. A successful jailbreak makes the model produce content it was trained to refuse: harmful instructions, policy-violating outputs, or bypasses of content restrictions.

Jailbreak techniques evolve rapidly. New methods appear weekly: role-playing scenarios, multi-turn context manipulation, encoding-based evasion, payload splitting across messages, and hypothetical framing. A defence that catches today’s jailbreaks will miss tomorrow’s variants unless it combines static pattern matching with learned classification and output verification.

The core challenge is that jailbreaks exploit the model’s instruction-following capability. The model is designed to follow instructions, and jailbreaks are instructions. The defence must distinguish between legitimate instructions and adversarial ones without breaking normal usage.

Threat Model

  • Adversary: Any user with access to the LLM interface. Ranges from curious users testing boundaries to motivated attackers seeking to weaponize the model.
  • Objective: Generate harmful content (weapons, exploitation, harassment). Bypass content restrictions for competitive advantage. Extract system prompt or confidential context. Demonstrate model vulnerabilities publicly (reputation damage).
  • Blast radius: Harmful content generation leads to regulatory action, brand damage, and liability. System prompt exposure reveals business logic. Weaponized outputs enable real-world harm.

Configuration

Multi-Layer Input Filtering

# jailbreak_input_filter.py - multi-layer jailbreak detection on input
import re
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class JailbreakDetection:
    detected: bool
    layer: str
    category: str
    confidence: float
    details: str

class JailbreakInputFilter:
    """
    Multi-layer jailbreak detection.
    Layer 1: Pattern matching (fast, catches known techniques)
    Layer 2: Heuristic analysis (medium, catches structural patterns)
    Layer 3: Classifier-based (slow, catches novel techniques)
    """

    JAILBREAK_PATTERNS = [
        # Role-playing jailbreaks
        (r"(you are|act as|pretend to be|roleplay as)\s+(DAN|evil|unrestricted|unfiltered)", "roleplay", 0.9),
        (r"(jailbreak|jailbroken)\s+mode", "direct_jailbreak", 0.95),
        (r"developer\s+mode\s+(enabled|activated|on)", "developer_mode", 0.9),

        # Hypothetical framing
        (r"(hypothetically|in theory|for educational purposes|for research)\s*(,\s*)?(how|what|can you)", "hypothetical_framing", 0.6),
        (r"(imagine|suppose|what if)\s+you\s+(had no|were free of|could ignore)\s+(restrictions|rules|guidelines)", "restriction_removal", 0.85),

        # Multi-persona
        (r"(two|dual|split)\s+(personality|persona|mode)", "multi_persona", 0.8),
        (r"respond\s+(twice|two ways|as both)", "multi_persona", 0.8),

        # Encoding evasion
        (r"(base64|rot13|hex|binary)\s*(encode|decode|translate)", "encoding_evasion", 0.7),
        (r"translate\s+(to|into|from)\s+(leetspeak|pig latin|morse)", "encoding_evasion", 0.7),

        # Payload splitting
        (r"(first part|second part|combine|concatenate)\s+(of the|these|the following)", "payload_split", 0.6),
    ]

    STRUCTURAL_SIGNALS = {
        "excessive_system_references": r"(system\s+prompt|instructions|guidelines|rules){3,}",
        "nested_quotes": r'["\'].*["\'].*["\'].*["\']',
        "token_manipulation": r"(\[/?(?:INST|SYS)\]|<\|(?:im_start|im_end|system)\|>)",
        "repeat_override": r"(ignore|forget|disregard|override).*\b(rules?|instructions?|prompt)\b.*\1",
    }

    def check_patterns(self, text: str) -> List[JailbreakDetection]:
        detections = []
        for pattern, category, confidence in self.JAILBREAK_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                detections.append(JailbreakDetection(
                    detected=True,
                    layer="pattern",
                    category=category,
                    confidence=confidence,
                    details=f"Matched pattern: {category}",
                ))
        return detections

    def check_structural(self, text: str) -> List[JailbreakDetection]:
        detections = []
        for signal_name, pattern in self.STRUCTURAL_SIGNALS.items():
            if re.search(pattern, text, re.IGNORECASE):
                detections.append(JailbreakDetection(
                    detected=True,
                    layer="structural",
                    category=signal_name,
                    confidence=0.7,
                    details=f"Structural signal: {signal_name}",
                ))
        return detections

    def check_length_anomaly(self, text: str) -> List[JailbreakDetection]:
        """Jailbreak prompts tend to be much longer than normal queries."""
        detections = []
        word_count = len(text.split())
        if word_count > 500:
            detections.append(JailbreakDetection(
                detected=True,
                layer="heuristic",
                category="length_anomaly",
                confidence=min(0.3 + (word_count - 500) / 2000, 0.8),
                details=f"Input length: {word_count} words (threshold: 500)",
            ))
        return detections

    def analyse(self, text: str) -> dict:
        all_detections = []
        all_detections.extend(self.check_patterns(text))
        all_detections.extend(self.check_structural(text))
        all_detections.extend(self.check_length_anomaly(text))

        max_confidence = max((d.confidence for d in all_detections), default=0.0)

        return {
            "jailbreak_detected": max_confidence > 0.5,
            "max_confidence": max_confidence,
            "detections": [
                {"layer": d.layer, "category": d.category,
                 "confidence": d.confidence, "details": d.details}
                for d in all_detections
            ],
            "action": "block" if max_confidence > 0.8 else "flag" if max_confidence > 0.5 else "allow",
        }

Classifier-Based Jailbreak Detection

# jailbreak_classifier.py - ML-based jailbreak classifier for Layer 3
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

class JailbreakClassifier:
    """
    ML-based jailbreak detection using a fine-tuned classifier.
    This catches novel jailbreak techniques that patterns miss.
    Models: fine-tuned DeBERTa or similar on jailbreak datasets.
    """

    def __init__(self, model_path: str = "models/jailbreak-classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.eval()
        self.threshold = 0.7

    def classify(self, text: str) -> dict:
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True,
        )

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1).numpy()[0]

        # Assuming binary classifier: [benign, jailbreak]
        jailbreak_prob = float(probs[1])

        return {
            "jailbreak_probability": round(jailbreak_prob, 4),
            "is_jailbreak": jailbreak_prob > self.threshold,
            "confidence": round(max(probs), 4),
        }

Output Monitoring for Policy Violations

# output_policy_monitor.py - check model outputs for policy violations
import re
from typing import List

class OutputPolicyMonitor:
    """
    Monitor model outputs for content that indicates a successful jailbreak.
    Even if input filtering is bypassed, output monitoring catches the result.
    """

    VIOLATION_PATTERNS = [
        # Model acknowledging jailbreak
        (r"(DAN|developer)\s+mode\s+(activated|enabled)", "jailbreak_acknowledgment", "critical"),
        (r"(I('ll| will) (ignore|disregard|bypass)\s+(my|the)\s+(safety|content))", "safety_bypass", "critical"),

        # Content policy violations
        (r"(step[- ]by[- ]step|instructions?|guide)\s+(to|for)\s+(hack|exploit|attack|break into)", "harmful_instructions", "critical"),
        (r"(how\s+to\s+make|recipe\s+for|synthesize|manufacture)\s+(a\s+)?(bomb|explosive|weapon|drug)", "dangerous_content", "critical"),

        # System prompt leak indicators
        (r"(my\s+(system\s+)?instructions?\s+(are|say|tell)|here\s+(are|is)\s+my\s+(system\s+)?prompt)", "prompt_leak", "high"),
    ]

    def check_output(self, output: str) -> dict:
        violations = []
        for pattern, category, severity in self.VIOLATION_PATTERNS:
            if re.search(pattern, output, re.IGNORECASE):
                violations.append({
                    "category": category,
                    "severity": severity,
                })

        has_critical = any(v["severity"] == "critical" for v in violations)

        return {
            "violations": violations,
            "violation_count": len(violations),
            "has_critical": has_critical,
            "action": "block" if has_critical else "log" if violations else "pass",
        }

Automated Response Pipeline

# jailbreak-response-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jailbreak-defence
  namespace: ai-services
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jailbreak-defence
  template:
    metadata:
      labels:
        app: jailbreak-defence
    spec:
      containers:
        - name: input-filter
          image: internal-registry/jailbreak-input-filter:2.1.0
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: CLASSIFIER_MODEL_PATH
              value: "/models/jailbreak-classifier"
            - name: PATTERN_CONFIDENCE_THRESHOLD
              value: "0.8"
            - name: CLASSIFIER_THRESHOLD
              value: "0.7"
            - name: BLOCK_ON_DETECTION
              value: "true"
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 30
        - name: output-monitor
          image: internal-registry/output-policy-monitor:1.3.0
          ports:
            - containerPort: 8081
              name: http
          env:
            - name: BLOCK_CRITICAL
              value: "true"
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: jailbreak-classifier-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: jailbreak-defence
  namespace: ai-services
spec:
  selector:
    app: jailbreak-defence
  ports:
    - name: input-filter
      port: 8080
      targetPort: 8080
    - name: output-monitor
      port: 8081
      targetPort: 8081

Prometheus Alerting

# prometheus-jailbreak.yaml
groups:
  - name: jailbreak-detection
    interval: 1m
    rules:
      - alert: JailbreakAttemptSpike
        expr: rate(jailbreak_detected_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated jailbreak attempts: {{ $value | humanize }}/sec"
          description: "Check jailbreak_detected_total by category for attack type breakdown."

      - alert: JailbreakBypassDetected
        expr: increase(output_policy_violation_total{severity="critical"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Jailbreak bypass detected - policy violation in model output"
          description: >
            A critical policy violation was detected in model output, indicating
            that a jailbreak attempt bypassed input filters. Review the request
            and output immediately.

      - alert: ClassifierLatencyHigh
        expr: histogram_quantile(0.99, rate(jailbreak_classifier_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jailbreak classifier P99 latency exceeds 500ms"
          description: "Classifier inference is slow. Consider scaling replicas or optimising the model."

Expected Behaviour

  • Pattern-based detection blocks known jailbreak techniques within 1-5ms
  • Classifier-based detection catches novel techniques within 50-200ms
  • Output monitoring detects successful bypasses and blocks critical policy violations
  • Jailbreak attempt spikes trigger alerts within 5 minutes
  • Critical output policy violations trigger immediate alerts
  • Defence layers are independent: bypass of one layer is caught by others

Trade-offs

Control Impact Risk Mitigation
Pattern-based detection Blocks known jailbreaks instantly Misses novel techniques. High false positive rate on creative writing prompts. Use as first filter only. Combine with classifier for confirmation.
ML classifier (Layer 3) Catches novel jailbreaks with generalisation Adds 50-200ms latency. Requires training data and regular retraining. Run in parallel with inference (not sequentially). Cache classifier results for repeated inputs.
Output policy monitoring Catches bypasses that input filters miss Post-hoc: the model already generated the harmful content (even if not returned) Combine with streaming output scanning to abort generation mid-stream.
Aggressive blocking (threshold 0.5) Catches more jailbreaks Blocks legitimate creative and educational content Use 0.8 threshold for blocking, 0.5 for flagging and logging. Human review for flagged content.

Failure Modes

Failure Symptom Detection Recovery
Novel jailbreak bypasses all layers Harmful content reaches user User reports, output monitoring on historical logs, external disclosure Add the technique to pattern list. Retrain classifier with the new example. Review and patch immediately.
Classifier model corrupted All inputs classified as benign or all as jailbreaks Classification distribution becomes uniform; false positive/negative rate spikes Rollback classifier to previous version. Validate model integrity with test suite before deployment.
False positive storm Legitimate users blocked en masse Support ticket volume spikes; user engagement drops Emergency threshold raise. Deploy pattern exceptions. Investigate which layer triggered the false positives.
Output monitor latency Responses delayed while scanning output P99 latency increases; user experience degrades Stream output and scan in parallel. Abort generation only on critical violations.

When to Consider a Managed Alternative

Jailbreak defence is an arms race. New techniques appear weekly, and keeping pattern lists, classifiers, and policies current requires dedicated security research.

  • Lakera (#142): Managed jailbreak detection API. ML-based classification updated continuously with new jailbreak techniques. Real-time detection with sub-50ms latency.
  • Cloudflare (#29) AI Gateway: Edge-level input/output filtering for AI endpoints. Managed content safety policies.

Premium content pack: Jailbreak defence pack. Multi-layer input filter (Python), jailbreak classifier training pipeline and dataset, output policy monitor, Kubernetes deployment manifests, Prometheus alerting rules, and a jailbreak red team test suite with 500+ known techniques.