LLM Jailbreak Defence: Detecting and Preventing System Prompt Bypasses in Production
Problem
LLM jailbreaks are inputs that cause a model to ignore its system prompt, safety training, or usage policies. Unlike prompt injection (which targets application-level instructions), jailbreaks target the model’s own alignment. A successful jailbreak makes the model produce content it was trained to refuse: harmful instructions, policy-violating outputs, or bypasses of content restrictions.
Jailbreak techniques evolve rapidly. New methods appear weekly: role-playing scenarios, multi-turn context manipulation, encoding-based evasion, payload splitting across messages, and hypothetical framing. A defence that catches today’s jailbreaks will miss tomorrow’s variants unless it combines static pattern matching with learned classification and output verification.
The core challenge is that jailbreaks exploit the model’s instruction-following capability. The model is designed to follow instructions, and jailbreaks are instructions. The defence must distinguish between legitimate instructions and adversarial ones without breaking normal usage.
Threat Model
- Adversary: Any user with access to the LLM interface. Ranges from curious users testing boundaries to motivated attackers seeking to weaponize the model.
- Objective: Generate harmful content (weapons, exploitation, harassment). Bypass content restrictions for competitive advantage. Extract system prompt or confidential context. Demonstrate model vulnerabilities publicly (reputation damage).
- Blast radius: Harmful content generation leads to regulatory action, brand damage, and liability. System prompt exposure reveals business logic. Weaponized outputs enable real-world harm.
Configuration
Multi-Layer Input Filtering
# jailbreak_input_filter.py - multi-layer jailbreak detection on input
import re
from typing import List, Tuple
from dataclasses import dataclass
@dataclass
class JailbreakDetection:
detected: bool
layer: str
category: str
confidence: float
details: str
class JailbreakInputFilter:
"""
Multi-layer jailbreak detection.
Layer 1: Pattern matching (fast, catches known techniques)
Layer 2: Heuristic analysis (medium, catches structural patterns)
Layer 3: Classifier-based (slow, catches novel techniques)
"""
JAILBREAK_PATTERNS = [
# Role-playing jailbreaks
(r"(you are|act as|pretend to be|roleplay as)\s+(DAN|evil|unrestricted|unfiltered)", "roleplay", 0.9),
(r"(jailbreak|jailbroken)\s+mode", "direct_jailbreak", 0.95),
(r"developer\s+mode\s+(enabled|activated|on)", "developer_mode", 0.9),
# Hypothetical framing
(r"(hypothetically|in theory|for educational purposes|for research)\s*(,\s*)?(how|what|can you)", "hypothetical_framing", 0.6),
(r"(imagine|suppose|what if)\s+you\s+(had no|were free of|could ignore)\s+(restrictions|rules|guidelines)", "restriction_removal", 0.85),
# Multi-persona
(r"(two|dual|split)\s+(personality|persona|mode)", "multi_persona", 0.8),
(r"respond\s+(twice|two ways|as both)", "multi_persona", 0.8),
# Encoding evasion
(r"(base64|rot13|hex|binary)\s*(encode|decode|translate)", "encoding_evasion", 0.7),
(r"translate\s+(to|into|from)\s+(leetspeak|pig latin|morse)", "encoding_evasion", 0.7),
# Payload splitting
(r"(first part|second part|combine|concatenate)\s+(of the|these|the following)", "payload_split", 0.6),
]
STRUCTURAL_SIGNALS = {
"excessive_system_references": r"(system\s+prompt|instructions|guidelines|rules){3,}",
"nested_quotes": r'["\'].*["\'].*["\'].*["\']',
"token_manipulation": r"(\[/?(?:INST|SYS)\]|<\|(?:im_start|im_end|system)\|>)",
"repeat_override": r"(ignore|forget|disregard|override).*\b(rules?|instructions?|prompt)\b.*\1",
}
def check_patterns(self, text: str) -> List[JailbreakDetection]:
detections = []
for pattern, category, confidence in self.JAILBREAK_PATTERNS:
if re.search(pattern, text, re.IGNORECASE):
detections.append(JailbreakDetection(
detected=True,
layer="pattern",
category=category,
confidence=confidence,
details=f"Matched pattern: {category}",
))
return detections
def check_structural(self, text: str) -> List[JailbreakDetection]:
detections = []
for signal_name, pattern in self.STRUCTURAL_SIGNALS.items():
if re.search(pattern, text, re.IGNORECASE):
detections.append(JailbreakDetection(
detected=True,
layer="structural",
category=signal_name,
confidence=0.7,
details=f"Structural signal: {signal_name}",
))
return detections
def check_length_anomaly(self, text: str) -> List[JailbreakDetection]:
"""Jailbreak prompts tend to be much longer than normal queries."""
detections = []
word_count = len(text.split())
if word_count > 500:
detections.append(JailbreakDetection(
detected=True,
layer="heuristic",
category="length_anomaly",
confidence=min(0.3 + (word_count - 500) / 2000, 0.8),
details=f"Input length: {word_count} words (threshold: 500)",
))
return detections
def analyse(self, text: str) -> dict:
all_detections = []
all_detections.extend(self.check_patterns(text))
all_detections.extend(self.check_structural(text))
all_detections.extend(self.check_length_anomaly(text))
max_confidence = max((d.confidence for d in all_detections), default=0.0)
return {
"jailbreak_detected": max_confidence > 0.5,
"max_confidence": max_confidence,
"detections": [
{"layer": d.layer, "category": d.category,
"confidence": d.confidence, "details": d.details}
for d in all_detections
],
"action": "block" if max_confidence > 0.8 else "flag" if max_confidence > 0.5 else "allow",
}
Classifier-Based Jailbreak Detection
# jailbreak_classifier.py - ML-based jailbreak classifier for Layer 3
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
class JailbreakClassifier:
"""
ML-based jailbreak detection using a fine-tuned classifier.
This catches novel jailbreak techniques that patterns miss.
Models: fine-tuned DeBERTa or similar on jailbreak datasets.
"""
def __init__(self, model_path: str = "models/jailbreak-classifier"):
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
self.model.eval()
self.threshold = 0.7
def classify(self, text: str) -> dict:
inputs = self.tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=512,
padding=True,
)
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1).numpy()[0]
# Assuming binary classifier: [benign, jailbreak]
jailbreak_prob = float(probs[1])
return {
"jailbreak_probability": round(jailbreak_prob, 4),
"is_jailbreak": jailbreak_prob > self.threshold,
"confidence": round(max(probs), 4),
}
Output Monitoring for Policy Violations
# output_policy_monitor.py - check model outputs for policy violations
import re
from typing import List
class OutputPolicyMonitor:
"""
Monitor model outputs for content that indicates a successful jailbreak.
Even if input filtering is bypassed, output monitoring catches the result.
"""
VIOLATION_PATTERNS = [
# Model acknowledging jailbreak
(r"(DAN|developer)\s+mode\s+(activated|enabled)", "jailbreak_acknowledgment", "critical"),
(r"(I('ll| will) (ignore|disregard|bypass)\s+(my|the)\s+(safety|content))", "safety_bypass", "critical"),
# Content policy violations
(r"(step[- ]by[- ]step|instructions?|guide)\s+(to|for)\s+(hack|exploit|attack|break into)", "harmful_instructions", "critical"),
(r"(how\s+to\s+make|recipe\s+for|synthesize|manufacture)\s+(a\s+)?(bomb|explosive|weapon|drug)", "dangerous_content", "critical"),
# System prompt leak indicators
(r"(my\s+(system\s+)?instructions?\s+(are|say|tell)|here\s+(are|is)\s+my\s+(system\s+)?prompt)", "prompt_leak", "high"),
]
def check_output(self, output: str) -> dict:
violations = []
for pattern, category, severity in self.VIOLATION_PATTERNS:
if re.search(pattern, output, re.IGNORECASE):
violations.append({
"category": category,
"severity": severity,
})
has_critical = any(v["severity"] == "critical" for v in violations)
return {
"violations": violations,
"violation_count": len(violations),
"has_critical": has_critical,
"action": "block" if has_critical else "log" if violations else "pass",
}
Automated Response Pipeline
# jailbreak-response-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jailbreak-defence
namespace: ai-services
spec:
replicas: 3
selector:
matchLabels:
app: jailbreak-defence
template:
metadata:
labels:
app: jailbreak-defence
spec:
containers:
- name: input-filter
image: internal-registry/jailbreak-input-filter:2.1.0
ports:
- containerPort: 8080
name: http
env:
- name: CLASSIFIER_MODEL_PATH
value: "/models/jailbreak-classifier"
- name: PATTERN_CONFIDENCE_THRESHOLD
value: "0.8"
- name: CLASSIFIER_THRESHOLD
value: "0.7"
- name: BLOCK_ON_DETECTION
value: "true"
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: "2"
memory: 2Gi
volumeMounts:
- name: model-volume
mountPath: /models
readOnly: true
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 30
- name: output-monitor
image: internal-registry/output-policy-monitor:1.3.0
ports:
- containerPort: 8081
name: http
env:
- name: BLOCK_CRITICAL
value: "true"
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
volumes:
- name: model-volume
persistentVolumeClaim:
claimName: jailbreak-classifier-pvc
---
apiVersion: v1
kind: Service
metadata:
name: jailbreak-defence
namespace: ai-services
spec:
selector:
app: jailbreak-defence
ports:
- name: input-filter
port: 8080
targetPort: 8080
- name: output-monitor
port: 8081
targetPort: 8081
Prometheus Alerting
# prometheus-jailbreak.yaml
groups:
- name: jailbreak-detection
interval: 1m
rules:
- alert: JailbreakAttemptSpike
expr: rate(jailbreak_detected_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Elevated jailbreak attempts: {{ $value | humanize }}/sec"
description: "Check jailbreak_detected_total by category for attack type breakdown."
- alert: JailbreakBypassDetected
expr: increase(output_policy_violation_total{severity="critical"}[5m]) > 0
labels:
severity: critical
annotations:
summary: "Jailbreak bypass detected - policy violation in model output"
description: >
A critical policy violation was detected in model output, indicating
that a jailbreak attempt bypassed input filters. Review the request
and output immediately.
- alert: ClassifierLatencyHigh
expr: histogram_quantile(0.99, rate(jailbreak_classifier_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Jailbreak classifier P99 latency exceeds 500ms"
description: "Classifier inference is slow. Consider scaling replicas or optimising the model."
Expected Behaviour
- Pattern-based detection blocks known jailbreak techniques within 1-5ms
- Classifier-based detection catches novel techniques within 50-200ms
- Output monitoring detects successful bypasses and blocks critical policy violations
- Jailbreak attempt spikes trigger alerts within 5 minutes
- Critical output policy violations trigger immediate alerts
- Defence layers are independent: bypass of one layer is caught by others
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Pattern-based detection | Blocks known jailbreaks instantly | Misses novel techniques. High false positive rate on creative writing prompts. | Use as first filter only. Combine with classifier for confirmation. |
| ML classifier (Layer 3) | Catches novel jailbreaks with generalisation | Adds 50-200ms latency. Requires training data and regular retraining. | Run in parallel with inference (not sequentially). Cache classifier results for repeated inputs. |
| Output policy monitoring | Catches bypasses that input filters miss | Post-hoc: the model already generated the harmful content (even if not returned) | Combine with streaming output scanning to abort generation mid-stream. |
| Aggressive blocking (threshold 0.5) | Catches more jailbreaks | Blocks legitimate creative and educational content | Use 0.8 threshold for blocking, 0.5 for flagging and logging. Human review for flagged content. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Novel jailbreak bypasses all layers | Harmful content reaches user | User reports, output monitoring on historical logs, external disclosure | Add the technique to pattern list. Retrain classifier with the new example. Review and patch immediately. |
| Classifier model corrupted | All inputs classified as benign or all as jailbreaks | Classification distribution becomes uniform; false positive/negative rate spikes | Rollback classifier to previous version. Validate model integrity with test suite before deployment. |
| False positive storm | Legitimate users blocked en masse | Support ticket volume spikes; user engagement drops | Emergency threshold raise. Deploy pattern exceptions. Investigate which layer triggered the false positives. |
| Output monitor latency | Responses delayed while scanning output | P99 latency increases; user experience degrades | Stream output and scan in parallel. Abort generation only on critical violations. |
When to Consider a Managed Alternative
Jailbreak defence is an arms race. New techniques appear weekly, and keeping pattern lists, classifiers, and policies current requires dedicated security research.
- Lakera (#142): Managed jailbreak detection API. ML-based classification updated continuously with new jailbreak techniques. Real-time detection with sub-50ms latency.
- Cloudflare (#29) AI Gateway: Edge-level input/output filtering for AI endpoints. Managed content safety policies.
Premium content pack: Jailbreak defence pack. Multi-layer input filter (Python), jailbreak classifier training pipeline and dataset, output policy monitor, Kubernetes deployment manifests, Prometheus alerting rules, and a jailbreak red team test suite with 500+ known techniques.