RLHF Data Protection: Securing Human Feedback Loops, Preference Data, and Reward Models

Problem

Reinforcement Learning from Human Feedback (RLHF) pipelines introduce attack surfaces that standard ML training workflows do not have. Human annotators interact with a labeling interface to rank model outputs, generating preference data that directly shapes model behavior. The reward model trained on this data becomes the optimization target for the policy model. Compromise any part of this chain and the final model's behavior changes.

Most RLHF setups run the annotation platform, preference database, reward model training, and PPO training as loosely connected services with shared credentials, flat network access, and no integrity checks on feedback data. An attacker who manipulates even a small percentage of preference labels can shift model behavior toward generating harmful, biased, or incorrect outputs without triggering obvious metrics regressions.

Target systems: Kubernetes-hosted RLHF pipelines using Label Studio, Argilla, or custom annotation tools, with PostgreSQL or MongoDB preference stores and PyTorch-based reward model training.

Threat Model

  • Adversary: Malicious annotator (insider threat), compromised annotation platform, or attacker with network access to the preference database.
  • Objective: Preference data manipulation (flip labels to reward harmful outputs). Reward model poisoning (tamper with reward model weights to shift optimization). Annotator data exfiltration (steal PII from annotator metadata or prompt content). Feedback pipeline disruption (corrupt or delete preference data to stall training).
  • Blast radius: Subtly poisoned model deployed to production (safety). Annotator PII leaked (compliance/legal). Training pipeline stalled for days while data is re-collected (availability/financial).

Configuration

Annotation Platform Isolation

Run the annotation platform in a dedicated namespace with strict network controls. Annotators should only reach the labeling UI, never the preference database or training infrastructure.

# annotation-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: annotation
  labels:
    purpose: human-feedback
    data-classification: pii
---
# annotation-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: annotation-platform-isolation
  namespace: annotation
spec:
  podSelector:
    matchLabels:
      component: annotation-ui
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only allow traffic from ingress controller (annotator access)
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-system
      ports:
        - port: 8080
          protocol: TCP
  egress:
    # Only allow writes to the preference database
    - to:
        - podSelector:
            matchLabels:
              component: preference-db
      ports:
        - port: 5432
          protocol: TCP
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP

Preference Database Hardening

# preference-db.yaml - PostgreSQL with encryption and access controls
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: preference-db
  namespace: annotation
spec:
  serviceName: preference-db
  replicas: 1
  selector:
    matchLabels:
      component: preference-db
  template:
    metadata:
      labels:
        component: preference-db
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        fsGroup: 999
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: postgres
          image: postgres:16-bookworm
          ports:
            - containerPort: 5432
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          env:
            - name: POSTGRES_DB
              value: preferences
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: preference-db-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: preference-db-credentials
                  key: password
          args:
            - "-c"
            - "ssl=on"
            - "-c"
            - "ssl_cert_file=/certs/tls.crt"
            - "-c"
            - "ssl_key_file=/certs/tls.key"
            # pgaudit must be preloaded and installed in the image; stock
            # postgres images do not ship it, so use a custom image.
            - "-c"
            - "shared_preload_libraries=pgaudit"
            - "-c"
            - "log_statement=mod"
            - "-c"
            - "log_connections=on"
            - "-c"
            - "log_disconnections=on"
            - "-c"
            - "pgaudit.log=write,ddl"
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: certs
              mountPath: /certs
              readOnly: true
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
      volumes:
        - name: certs
          secret:
            secretName: preference-db-tls
            defaultMode: 0440
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
        storageClassName: encrypted-gp3

Annotator Authentication and Audit Logging

# annotator_middleware.py - Flask middleware for annotation platform
import functools
import hashlib
import json
import logging
import time
from datetime import datetime, timezone

from flask import Flask, g, request, jsonify

audit_logger = logging.getLogger("annotator.audit")
audit_handler = logging.FileHandler("/var/log/annotation/audit.jsonl")
audit_logger.addHandler(audit_handler)
audit_logger.setLevel(logging.INFO)


def require_annotator_auth(f):
    """Verify annotator identity via OIDC token from the ingress."""
    @functools.wraps(f)
    def decorated(*args, **kwargs):
        # Token validated by ingress/gateway; headers contain verified claims
        annotator_id = request.headers.get("X-Auth-User-Id")
        annotator_email = request.headers.get("X-Auth-User-Email")
        annotator_groups = request.headers.get("X-Auth-Groups", "")

        if not annotator_id:
            return jsonify({"error": "authentication required"}), 401

        if "annotators" not in [grp.strip() for grp in annotator_groups.split(",")]:
            return jsonify({"error": "not authorized as annotator"}), 403

        g.annotator_id = annotator_id
        g.annotator_email = annotator_email
        return f(*args, **kwargs)
    return decorated


def log_feedback_submission(annotator_id: str, task_id: str, preference: dict):
    """Write an immutable audit log entry for every feedback submission."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "feedback_submitted",
        "annotator_id": annotator_id,
        "task_id": task_id,
        "preference_hash": hashlib.sha256(
            json.dumps(preference, sort_keys=True).encode()
        ).hexdigest(),
        "source_ip": request.remote_addr,
        "user_agent": request.headers.get("User-Agent", ""),
    }
    audit_logger.info(json.dumps(entry))


def detect_anomalous_labeling(annotator_id: str, db_session) -> bool:
    """Flag annotators with suspicious labeling patterns."""
    # Check for rapid-fire submissions (bot or gaming)
    recent_count = db_session.execute(
        """
        SELECT COUNT(*) FROM feedback
        WHERE annotator_id = :aid
        AND submitted_at > NOW() - INTERVAL '5 minutes'
        """,
        {"aid": annotator_id},
    ).scalar()

    if recent_count > 50:  # More than 50 submissions in 5 minutes
        audit_logger.warning(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "anomalous_labeling_rate",
            "annotator_id": annotator_id,
            "count_5min": recent_count,
        }))
        return True

    # Check for disagreement with consensus
    disagreement_rate = db_session.execute(
        """
        SELECT
            COUNT(*) FILTER (WHERE f.chosen != consensus.chosen) * 1.0
            / NULLIF(COUNT(*), 0)
        FROM feedback f
        JOIN consensus ON f.task_id = consensus.task_id
        WHERE f.annotator_id = :aid
        AND f.submitted_at > NOW() - INTERVAL '1 hour'
        """,
        {"aid": annotator_id},
    ).scalar() or 0.0

    if disagreement_rate > 0.7:  # 70%+ disagreement with consensus
        audit_logger.warning(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "high_disagreement_rate",
            "annotator_id": annotator_id,
            "disagreement_rate": float(disagreement_rate),
        }))
        return True

    return False
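Because the audit log records a SHA-256 of each canonicalized preference, a later integrity sweep can recompute the hash from the stored database row and compare it against the audit entry, making after-the-fact label flips detectable. A minimal sketch of that check, with illustrative function and field names:

```python
# audit_verify.py - detect preference rows modified after submission.
# verify_preference_row and the record shapes are illustrative.
import hashlib
import json


def preference_hash(preference: dict) -> str:
    """Canonical SHA-256 over a preference record (matches the audit logger)."""
    return hashlib.sha256(
        json.dumps(preference, sort_keys=True).encode()
    ).hexdigest()


def verify_preference_row(db_row: dict, audit_entry: dict) -> bool:
    """True if the stored preference still matches its audit-time hash."""
    return preference_hash(db_row) == audit_entry["preference_hash"]


# Example: a label flipped after submission no longer matches the log.
original = {"task_id": "t-42", "chosen": "response_a", "rejected": "response_b"}
audit_entry = {
    "event": "feedback_submitted",
    "preference_hash": preference_hash(original),
}
tampered = dict(original, chosen="response_b", rejected="response_a")

assert verify_preference_row(original, audit_entry)
assert not verify_preference_row(tampered, audit_entry)
```

Because `sort_keys=True` canonicalizes key order, the hash is stable regardless of how the row is serialized when it is read back.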

Reward Model Tampering Detection

# reward_model_integrity.py - verify reward model before PPO training
import hashlib
import json
import subprocess
from pathlib import Path

import torch


def verify_reward_model(model_path: str, reference_dataset_path: str) -> bool:
    """
    Verify reward model integrity by checking:
    1. Cryptographic signature on model files
    2. Output consistency on a held-out reference dataset
    """
    # Step 1: Verify signature
    manifest_path = f"{model_path}/manifest.json"
    sig_result = subprocess.run(
        [
            "cosign", "verify-blob",
            "--key", "k8s://ml-training/reward-model-signing-key",
            "--signature", f"{manifest_path}.sig",
            manifest_path,
        ],
        capture_output=True,
        text=True,
    )
    if sig_result.returncode != 0:
        print(f"Signature verification failed: {sig_result.stderr}")
        return False

    # Step 2: Verify file hashes against signed manifest
    with open(manifest_path) as f:
        manifest = json.load(f)

    for relative_path, expected_hash in manifest["files"].items():
        file_path = Path(model_path) / relative_path
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                sha256.update(chunk)
        if sha256.hexdigest() != expected_hash:
            print(f"Hash mismatch: {relative_path}")
            return False

    # Step 3: Output consistency check on reference prompts.
    # weights_only=True restricts deserialization to tensors, blocking
    # pickle-based code execution. The checkpoint is a state dict, so the
    # model architecture must be rebuilt before loading it.
    state_dict = torch.load(
        f"{model_path}/reward_model.pt",
        map_location="cpu",
        weights_only=True,
    )
    model = build_reward_model()  # architecture constructor, defined elsewhere
    model.load_state_dict(state_dict)
    model.eval()

    with open(reference_dataset_path) as f:
        reference = json.load(f)

    for sample in reference["samples"]:
        # run_reward_inference wraps tokenization and the forward pass;
        # defined elsewhere in the training codebase.
        output = run_reward_inference(model, sample["input"])
        expected = sample["expected_score"]
        if abs(output - expected) > 0.1:  # Tolerance threshold
            print(
                f"Output drift detected: expected {expected}, got {output}"
            )
            return False

    return True
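The signing side of this workflow is not shown above; a sketch of producing the `manifest.json` that `verify_reward_model` checks might look like the following. The cosign key reference and file layout are assumptions; align them with your signing setup.

```python
# sign_reward_model.py - build and sign the manifest checked before PPO.
import hashlib
import json
import subprocess
from pathlib import Path


def build_manifest(model_dir: str) -> dict:
    """Hash every file under model_dir, keyed by path relative to the root."""
    root = Path(model_dir)
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "manifest.json":
            # read_bytes keeps the sketch short; chunk reads for
            # multi-gigabyte checkpoints.
            files[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return {"files": files}


def write_and_sign(
    model_dir: str,
    key_ref: str = "k8s://ml-training/reward-model-signing-key",  # assumption
) -> None:
    """Write manifest.json and produce the detached cosign signature."""
    manifest_path = Path(model_dir) / "manifest.json"
    manifest_path.write_text(
        json.dumps(build_manifest(model_dir), indent=2, sort_keys=True)
    )
    subprocess.run(
        ["cosign", "sign-blob", "--key", key_ref,
         "--output-signature", f"{manifest_path}.sig", str(manifest_path)],
        check=True,
    )
```

Run `write_and_sign` as the final step of the reward model training job, after the checkpoint is written and before it is published to checkpoint storage.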

Feedback Pipeline Network Policy

Restrict the reward model training job so it can only read from the preference database and write to checkpoint storage. It should not be able to reach the annotation UI or the internet.

# reward-training-network.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: reward-training-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      component: reward-training
  policyTypes:
    - Ingress
    - Egress
  ingress: []
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: annotation
          podSelector:
            matchLabels:
              component: preference-db
      ports:
        - port: 5432
          protocol: TCP
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP

RBAC for RLHF Infrastructure

# rlhf-rbac.yaml - separate roles for each RLHF component
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: annotation-platform-admin
  namespace: annotation
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["annotation-ui"]
    verbs: ["get", "list", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
  # Cannot access preference-db secrets or PVCs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: reward-model-trainer
  namespace: ml-training
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["preference-db-readonly-credentials"]
    verbs: ["get"]

Expected Behaviour

  • Annotators authenticate via OIDC and can only reach the labeling UI, not the database or training infrastructure
  • Every feedback submission is audit-logged with annotator ID, timestamp, and content hash
  • Anomalous labeling patterns (rapid submissions, high disagreement) trigger alerts and optional automatic suspension
  • Preference database uses TLS, encrypted storage, and audit logging for all write operations
  • Reward model is signed after training and verified before use in PPO
  • Network policies prevent lateral movement between annotation, database, and training components
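The anomaly thresholds above (50 submissions per five minutes, 70% disagreement with consensus) can be factored into a pure function so they are unit-testable without a live database. A sketch with illustrative names:

```python
# anomaly_thresholds.py - DB-free version of the labeling anomaly checks.
# Function and constant names are illustrative, not part of the pipeline.
RATE_LIMIT_5MIN = 50
DISAGREEMENT_LIMIT = 0.7


def labeling_anomalies(count_5min: int, disagreements: int, compared: int) -> list[str]:
    """Return the anomaly events that would fire for these counts."""
    events = []
    if count_5min > RATE_LIMIT_5MIN:
        events.append("anomalous_labeling_rate")
    # Guard against division by zero when no consensus tasks exist yet.
    rate = disagreements / compared if compared else 0.0
    if rate > DISAGREEMENT_LIMIT:
        events.append("high_disagreement_rate")
    return events
```

For example, an annotator with 120 submissions in five minutes and 8 of 10 labels against consensus trips both checks, while one with no consensus overlap trips neither.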

Trade-offs

| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Per-submission audit logging | Storage growth of 1-5 KB per label | Log volume becomes significant at scale (millions of labels) | Rotate logs to object storage; retain only hashes after 90 days |
| Anomaly detection on annotators | Legitimate annotators may be flagged | False positives slow annotation throughput | Use warning thresholds before suspension; require manual review for suspension |
| Read-only preference DB access for training | Training job cannot write intermediate state to the preference DB | Some frameworks expect write access | Use a separate staging database for training state; export preferences as read-only snapshots |
| Network isolation between components | Debugging becomes harder | Engineers cannot easily query the preference DB from their workstations | Provide a read-only database replica in a dev namespace for analysis |
| Reward model signature verification | Adds latency to PPO training startup | Delays training pipeline start by 1-3 minutes | Run verification in parallel with other setup tasks; cache verification results for unchanged models |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Preference data manipulation | Reward model learns to reward harmful outputs | Evaluation benchmarks show toxicity increase; held-out reference dataset scores drift | Identify compromised annotations via audit log; remove tainted labels; retrain reward model from clean data |
| Annotator account compromise | Suspicious labeling patterns from a trusted account | Anomaly detection flags rapid or inconsistent submissions | Suspend account; review all submissions from that account; re-label affected tasks with other annotators |
| Reward model replaced with backdoored version | PPO training optimizes for wrong objective | Signature verification fails; reference dataset consistency check shows output drift | Block PPO training; restore reward model from last verified checkpoint |
| Preference database breach | Annotator PII or prompt content leaked | Database audit logs show unauthorized queries; connection from unexpected source IP | Rotate database credentials; notify affected annotators per breach policy; review network policies |

When to Consider a Managed Alternative

Managed RLHF platforms handle annotator management, data security, and pipeline orchestration.

  • Scale AI: Managed annotation workforce with built-in quality controls and data security.
  • Surge AI: Specialized RLHF annotation with consensus mechanisms and annotator vetting.
  • Cloudflare (#29): WAF and bot protection in front of self-hosted annotation platforms.
  • Snyk (#48): Scan annotation platform and training containers for vulnerabilities.

Premium content pack: Complete RLHF pipeline security configurations including annotator authentication middleware, preference database hardening scripts, reward model signing workflows, and anomaly detection queries.