Hardening Model Inference Endpoints: Authentication, Rate Limiting, and Input Validation

Problem

Model inference endpoints are GPU-backed and expensive: $2-30 per hour per GPU. A single unprotected endpoint exposed to the internet can accumulate thousands of dollars in compute costs within hours of abuse, intentional or accidental. Most model serving frameworks (TorchServe, Triton, vLLM) ship with unauthenticated management APIs, no rate limiting, and no input validation by default. An attacker can exfiltrate data through carefully crafted prompts, exhaust GPU resources with oversized inputs, or abuse the model for purposes it was never intended for.

Target systems: Any model inference endpoint running on Kubernetes. Specific configurations for NGINX ingress, Kong (#86) gateway, and direct integration patterns.

Threat Model

  • Adversary: Unauthenticated user or compromised API key holder accessing the inference endpoint over HTTPS.
  • Objective: Cost exhaustion (flood endpoint with large requests, consuming GPU hours). Data exfiltration (prompt injection to extract training data or system prompts). Model abuse (use the model for unintended purposes, generating harmful content, automated spam, etc.).
  • Blast radius: GPU cost spike (financial). Data leakage (confidentiality). Model reputation damage (safety).

Configuration

API Key Authentication at the Gateway

Do not rely on the model serving framework for authentication. Place authentication at the API gateway or ingress layer:

# kong-inference-auth.yaml - Kong gateway with API key auth
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: inference-key-auth
spec:
  plugin: key-auth
  config:
    key_names: ["X-API-Key"]
    hide_credentials: true
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-api
  annotations:
    konghq.com/plugins: inference-key-auth
spec:
  ingressClassName: kong
  rules:
    - host: inference.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: inference-service
                port:
                  number: 8080

For NGINX ingress without an API gateway:

# nginx-inference-auth.conf
# Simple API key validation at the NGINX level.
# In production, generate this map from a secrets store at deploy
# time rather than committing keys to the config file.
map $http_x_api_key $api_key_valid {
    default 0;
    "sk-production-abc123def456" 1;
    "sk-staging-xyz789ghi012" 1;
}

server {
    listen 443 ssl;
    server_name inference.example.com;

    # Reject requests without a valid API key
    if ($api_key_valid = 0) {
        return 401 '{"error": "Invalid or missing API key"}';
    }

    location /v1/ {
        proxy_pass http://inference-service:8080;
    }
}
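The nginx map above does exact string matching at the proxy. If the application verifies keys itself as a second layer behind the proxy, compare them in constant time so response timing cannot leak key prefixes. A sketch (the in-memory key set is a stand-in; load keys from a secrets manager in practice):

```python
import hmac

# Stand-in key store; in practice, load (ideally hashed) keys from a
# secrets manager rather than hardcoding them.
VALID_KEYS = {"sk-production-abc123def456", "sk-staging-xyz789ghi012"}

def key_is_valid(presented: str) -> bool:
    """Constant-time membership check over all known keys.

    hmac.compare_digest does not short-circuit on the first differing
    byte, and checking every key (no early break) keeps timing flat
    regardless of which key, if any, matches.
    """
    valid = False
    for known in VALID_KEYS:
        if hmac.compare_digest(presented, known):
            valid = True
    return valid
```

Naive `==` comparison returns faster the earlier the first byte differs, which an attacker can measure to recover a key prefix byte by byte.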

GPU-Aware Rate Limiting

Standard rate limiting (requests per second) does not capture the real cost of inference requests. A single request with a 10,000-token input costs 100x more GPU time than a 100-token request. Rate limit by both request count AND token/input size:

# Rate limit by request count (baseline protection)
limit_req_zone $http_x_api_key zone=inference_rate:10m rate=10r/s;

# Stricter rate limit for large request bodies (proxy for token count).
# Large request bodies → large token inputs → more GPU time.
# The map yields an empty key for small bodies, and limit_req skips
# requests whose zone key is empty, so only large requests are limited.
map $content_length $large_body_key {
    default          "";
    "~^[0-9]{5,}$"   $http_x_api_key;   # bodies >= 10,000 bytes (~2,000 tokens)
}
limit_req_zone $large_body_key zone=inference_size:10m rate=2r/s;

location /v1/completions {
    # Standard per-key rate limit
    limit_req zone=inference_rate burst=20 nodelay;

    # Strict limit applied only to large requests (>10KB body ≈ >2000 tokens)
    limit_req zone=inference_size burst=5 nodelay;

    # Hard limit on request body size
    client_max_body_size 100k;  # ~25,000 tokens max

    proxy_pass http://inference-service:8080;
}
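nginx can only approximate token cost by body size. At the application layer, a token bucket charged per token makes the GPU cost explicit. A minimal sketch (capacity and refill rate are illustrative; keep one bucket per API key and return 429 when try_consume fails):

```python
import time

class TokenBudget:
    """Per-API-key budget measured in model tokens, not requests.

    The bucket refills continuously at `rate` tokens/second up to
    `capacity`, so a 10,000-token request really does cost 100x a
    100-token request, which request-per-second limits cannot express.
    """

    def __init__(self, capacity: float = 50_000, rate: float = 500.0):
        self.capacity = capacity
        self.rate = rate          # sustained tokens/second
        self.tokens = capacity    # current balance
        self.updated = time.monotonic()

    def try_consume(self, cost: int) -> bool:
        """Charge `cost` tokens; return False if the budget is exhausted."""
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Checking the budget before the request is queued for the GPU means an exhausted key is rejected cheaply at the front door rather than after GPU time is already spent.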

For edge-level rate limiting before traffic reaches GPU infrastructure:

# Cloudflare (#29) rate limiting rule (via dashboard or API):
# - Match: hostname = inference.example.com AND path begins with /v1/
# - Rate: 100 requests per minute per API key
# - Action: Block (return 429)
#
# This absorbs abuse at the edge before it consumes GPU resources.

Input Validation

# input_validator.py - middleware for inference endpoints
# Validate and sanitise inputs before they reach the model.

import re
from fastapi import Request, HTTPException

MAX_INPUT_LENGTH = 25000  # characters (~6000 tokens)
MAX_TOKENS_REQUESTED = 4096

async def validate_inference_input(request: Request):
    try:
        body = await request.json()
    except ValueError:
        raise HTTPException(status_code=400, detail="Request body must be valid JSON")

    # 1. Input length limit (guard against an empty messages list,
    #    which would make [-1] raise IndexError)
    messages = body.get("messages") or [{}]
    prompt = str(body.get("prompt") or messages[-1].get("content") or "")
    if len(prompt) > MAX_INPUT_LENGTH:
        raise HTTPException(
            status_code=400,
            detail=f"Input exceeds maximum length of {MAX_INPUT_LENGTH} characters"
        )

    # 2. Max tokens limit
    max_tokens = body.get("max_tokens") or 0
    if not isinstance(max_tokens, (int, float)) or max_tokens > MAX_TOKENS_REQUESTED:
        raise HTTPException(
            status_code=400,
            detail=f"max_tokens exceeds limit of {MAX_TOKENS_REQUESTED}"
        )

    # 3. Basic prompt injection detection (pattern-based)
    # This catches obvious injection attempts. For production,
    # use Lakera (#142) for ML-based detection.
    injection_patterns = [
        r"ignore (previous|all|above) instructions",
        r"you are now",
        r"disregard (your|the) (instructions|system prompt)",
        r"repeat (your|the) system (prompt|message|instructions)",
        r"what (is|are) your (instructions|system prompt|rules)",
    ]
    for pattern in injection_patterns:
        if re.search(pattern, prompt, re.IGNORECASE):
            # Log the attempt but don't reveal the detection
            # (attacker would adapt their technique)
            raise HTTPException(status_code=400, detail="Invalid input")

    return body
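The character limits above assume roughly 4 characters per token for English-like text. When the model's own tokenizer is available, count tokens directly; a character-based proxy like the following (an illustrative helper, not part of the middleware above) only guards against grossly oversized inputs:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English-like text.

    Prefer the serving model's real tokenizer when you can load it;
    this heuristic under-counts for code, non-Latin scripts, and
    adversarially constructed inputs.
    """
    return max(1, round(len(text) / chars_per_token))
```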

Output Filtering

# output_filter.py - filter model outputs for sensitive data leakage

import logging
import re

logger = logging.getLogger("inference.output_filter")

# PII patterns (basic - use a dedicated NLP model for production).
# Groups are non-capturing (?:...) so substitution sees the full match.
PII_PATTERNS = {
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
    "credit_card": r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
    "api_key": r'\b(?:sk|pk|api[_-]?key)[_-][a-zA-Z0-9]{20,}\b',
}

def filter_output(text: str) -> str:
    """Redact PII from model output before returning to the client."""
    for pii_type, pattern in PII_PATTERNS.items():
        # re.subn replaces every match and reports the count; unlike
        # findall + replace, it is not broken by capturing groups.
        text, count = re.subn(pattern, f"[REDACTED:{pii_type}]", text)
        if count:
            # Log the redaction for audit
            logger.warning("redacted %d %s match(es) from output", count, pii_type)
    return text

Observability for Inference Security

# Prometheus metrics for inference endpoint monitoring
# Instrument the inference service or gateway to export these:

# Token usage per API key (cost tracking)
# inference_tokens_total{api_key="sk-xxx", type="prompt|completion"}

# Request latency by API key (detect abuse - automated requests are faster)
# inference_request_duration_seconds{api_key="sk-xxx"}

# Input validation rejections
# inference_input_rejected_total{reason="too_long|injection_detected|invalid_format"}

# Output redactions
# inference_output_redacted_total{pii_type="email|ssn|api_key"}

# Alert: cost spike per API key
groups:
  - name: inference-security
    rules:
      - alert: InferenceCostSpike
        expr: >
          sum by (api_key) (rate(inference_tokens_total{type="completion"}[1h]))
          > 5 * avg_over_time(sum by (api_key) (rate(inference_tokens_total{type="completion"}[1h]))[7d:1h])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "API key {{ $labels.api_key }} token usage 5x above baseline"

      - alert: PromptInjectionSpike
        expr: rate(inference_input_rejected_total{reason="injection_detected"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated prompt injection attempts: {{ $value | humanize }}/sec"
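The metric names the alerts query can be exported from the Python middleware with prometheus_client. A sketch (label values are illustrative; hash API keys before using them as label values so raw secrets never land in the metrics store):

```python
from prometheus_client import Counter, Histogram

# Token usage per API key (cost tracking)
TOKENS = Counter(
    "inference_tokens",   # client appends the _total suffix
    "Tokens processed per API key",
    ["api_key", "type"],
)

# Request latency by API key (abuse detection)
LATENCY = Histogram(
    "inference_request_duration_seconds",
    "Inference request latency",
    ["api_key"],
)

# Input validation rejections
REJECTED = Counter(
    "inference_input_rejected",
    "Requests rejected by input validation",
    ["reason"],
)

# Output redactions
REDACTED = Counter(
    "inference_output_redacted",
    "PII redactions in model output",
    ["pii_type"],
)

# In the request handler, for example:
# TOKENS.labels(api_key=key_hash, type="prompt").inc(prompt_tokens)
# REJECTED.labels(reason="too_long").inc()
#
# Expose /metrics with prometheus_client.start_http_server(port)
# or the framework's ASGI integration.
```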

Network Isolation

# Inference service should only be reachable from the API gateway.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress
  namespace: ai-inference
spec:
  podSelector:
    matchLabels:
      app: inference-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kong-system
      ports:
        - port: 8080
          protocol: TCP

Expected Behaviour

  • All inference requests authenticated via API key at the gateway layer
  • Rate limiting blocks >100 requests/minute per API key at the edge (Cloudflare #29)
  • Input length capped at 25,000 characters; max_tokens capped at 4,096
  • Obvious prompt injection patterns rejected with 400 (without revealing detection logic)
  • PII in model outputs redacted before returning to client
  • Token usage per API key tracked; cost spike alerts fire within 15 minutes
  • Inference service only reachable from the API gateway (network policy enforced)

Trade-offs

| Control | Impact | Risk | Mitigation |
| --- | --- | --- | --- |
| API key auth at gateway | Adds 1-5ms latency per request | Gateway becomes a single point of failure | Run gateway in HA (3+ replicas). |
| Input length limit (25K chars) | Blocks legitimate long-context use cases | Some users may need longer inputs | Offer a higher tier with longer limits (paid plan, higher rate). |
| Prompt injection pattern matching | Blocks obvious injections | Sophisticated injections bypass regex patterns | Supplement with Lakera (#142) ML-based detection for production. Pattern matching is a first layer, not the only layer. |
| Output PII filtering | Redacts sensitive data | False positives: redacts strings that look like PII but aren't | Review redaction logs weekly. Tune patterns. For regulated industries: use a dedicated PII detection model. |
| GPU-aware rate limiting | Prevents cost exhaustion | Legitimate batch users may hit limits | Per-key rate tiers. Exempt internal service accounts with higher limits. |

Failure Modes

| Failure | Symptom | Detection | Recovery |
| --- | --- | --- | --- |
| API key leaked | Unauthorized usage of the inference endpoint | Cost spike alert; usage from unexpected IPs | Revoke the key immediately. Issue a new key. Audit usage during the exposure window. |
| Rate limit too aggressive | Legitimate users get 429 errors | User reports; 429 rate metric increases for known-good keys | Increase per-key rate limit, or move the user to a higher tier. |
| Prompt injection bypasses detection | Attacker extracts system prompt or training data | Output monitoring detects unexpected content (system prompt text in response) | Add the bypass technique to the pattern list. Implement Lakera (#142) for ML-based detection. |
| Input validator crashes | All requests fail with 500 | Error rate metric spikes; all inference requests return 500 | Decide fail-open vs fail-closed up front. Safety-critical: fail-closed (reject all). Availability-critical: fail-open (allow through without validation). |
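The fail-open versus fail-closed decision in the last row is better made explicitly in code than left to whatever the unhandled exception happens to do. A sketch of a policy wrapper (function and exception names are illustrative; adapt to your middleware):

```python
import logging

logger = logging.getLogger("inference.validation")

class ValidationError(Exception):
    """Raised when an input fails a validation check."""

def guarded_validate(validate, payload, fail_closed=True):
    """Run `validate(payload)` and apply an explicit crash policy.

    Genuine rejections (ValidationError) always propagate. Only an
    unexpected crash inside the validator triggers the policy:
    fail-closed rejects the request, fail-open lets it through
    unvalidated and logs loudly.
    """
    try:
        return validate(payload)
    except ValidationError:
        raise
    except Exception:
        logger.exception("input validator crashed")
        if fail_closed:
            raise ValidationError("validation unavailable")
        return payload  # fail-open: pass through unvalidated
```

Making the policy a parameter also lets different routes choose differently, for example fail-closed on the public endpoint but fail-open for trusted internal service accounts.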

When to Consider a Managed Alternative

  • Cloudflare (#29): Edge rate limiting before traffic reaches GPU infrastructure. Bot detection. API Shield for endpoint-specific security.
  • Lakera (#142): Managed prompt injection detection API. ML-based, not pattern-based. Real-time classification.
  • Kong (#86) Enterprise: Per-key rate limiting, analytics, and access control. Managed Konnect platform.

Premium content pack: Inference endpoint security configuration pack. NGINX rate limiting configs, Kong gateway setup, input validation middleware (Python, Go, Node), output PII filtering, Prometheus alert rules for cost and injection monitoring, and Kubernetes network policies for inference namespace isolation.