LLM Observability in Production: Monitoring Latency, Token Usage, Safety Violations, and Drift

Problem

Traditional application monitoring (CPU, memory, HTTP status codes, latency) tells you nothing about what an LLM is doing. A model can return 200 OK while generating hallucinated medical advice, leaking PII, or producing biased content. LLM observability requires a new layer of metrics: token consumption (cost), generation latency (user experience), safety violations (risk), content quality (drift), and user feedback (satisfaction).

Most teams deploy LLM applications with the same monitoring they use for REST APIs: is it up? Is it fast? This misses the fundamental question: is it behaving correctly? A model that drifts in quality, starts hallucinating more frequently, or subtly changes its response style will not trigger any traditional alert.

LLM observability covers four dimensions: operational metrics (latency, throughput, errors), economic metrics (tokens, costs), safety metrics (violations, guardrail triggers), and quality metrics (semantic drift, user feedback, hallucination rate).

Threat Model

  • Adversary: This article addresses operational risk, not adversarial threats. The “adversary” is entropy: model degradation, distribution shift, cost overruns, and quality drift.
  • Objective: Detect and alert on model behaviour changes before they impact users or costs. Maintain visibility into model performance across all four dimensions.
  • Blast radius: Unmonitored degradation leads to: cost overruns (token usage spikes), user experience degradation (latency increases), safety incidents (undetected violations), and quality erosion (gradual drift that is invisible until users complain).

Configuration

Metrics Collection: Custom Prometheus Metrics

# llm_metrics.py - comprehensive LLM metrics collection
from prometheus_client import (
    Counter, Histogram, Gauge, Summary,
    CollectorRegistry, generate_latest,
)
import time

# Create a dedicated registry for LLM metrics
LLM_REGISTRY = CollectorRegistry()

# Operational metrics
REQUEST_LATENCY = Histogram(
    "llm_request_duration_seconds",
    "End-to-end request latency including guardrails",
    ["model", "endpoint"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
    registry=LLM_REGISTRY,
)

INFERENCE_LATENCY = Histogram(
    "llm_inference_duration_seconds",
    "Model inference latency only (excluding guardrails)",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
    registry=LLM_REGISTRY,
)

TIME_TO_FIRST_TOKEN = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request to first token in streaming responses",
    ["model"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    registry=LLM_REGISTRY,
)

REQUEST_TOTAL = Counter(
    "llm_requests_total",
    "Total LLM requests",
    ["model", "endpoint", "status"],
    registry=LLM_REGISTRY,
)

# Economic metrics
INPUT_TOKENS = Counter(
    "llm_input_tokens_total",
    "Total input tokens consumed",
    ["model"],
    registry=LLM_REGISTRY,
)

OUTPUT_TOKENS = Counter(
    "llm_output_tokens_total",
    "Total output tokens generated",
    ["model"],
    registry=LLM_REGISTRY,
)

ESTIMATED_COST = Counter(
    "llm_estimated_cost_usd_total",
    "Estimated cost in USD",
    ["model"],
    registry=LLM_REGISTRY,
)

# Safety metrics
SAFETY_VIOLATIONS = Counter(
    "llm_safety_violations_total",
    "Safety violations detected in output",
    ["model", "violation_type", "severity"],
    registry=LLM_REGISTRY,
)

GUARDRAIL_BLOCKS = Counter(
    "llm_guardrail_blocks_total",
    "Requests blocked by guardrails",
    ["model", "stage", "reason"],
    registry=LLM_REGISTRY,
)

PII_DETECTIONS = Counter(
    "llm_pii_detections_total",
    "PII detected in model output",
    ["model", "pii_type"],
    registry=LLM_REGISTRY,
)

# Quality metrics
USER_FEEDBACK = Counter(
    "llm_user_feedback_total",
    "User feedback (thumbs up/down)",
    ["model", "feedback_type"],
    registry=LLM_REGISTRY,
)

RESPONSE_LENGTH = Histogram(
    "llm_response_length_tokens",
    "Response length in tokens",
    ["model"],
    buckets=[10, 50, 100, 250, 500, 1000, 2000, 4000],
    registry=LLM_REGISTRY,
)


class LLMMetricsCollector:
    """Collect and record LLM metrics for each request."""

    COST_PER_TOKEN = {
        # Approximate costs per 1K tokens (USD)
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
        "claude-3-sonnet": {"input": 0.003, "output": 0.015},
    }

    def record_request(self, model: str, endpoint: str, input_tokens: int,
                       output_tokens: int, latency_seconds: float,
                       inference_seconds: float, status: str = "success",
                       ttft_seconds: float | None = None):
        REQUEST_LATENCY.labels(model=model, endpoint=endpoint).observe(latency_seconds)
        INFERENCE_LATENCY.labels(model=model).observe(inference_seconds)
        REQUEST_TOTAL.labels(model=model, endpoint=endpoint, status=status).inc()
        INPUT_TOKENS.labels(model=model).inc(input_tokens)
        OUTPUT_TOKENS.labels(model=model).inc(output_tokens)
        RESPONSE_LENGTH.labels(model=model).observe(output_tokens)

        if ttft_seconds is not None:
            TIME_TO_FIRST_TOKEN.labels(model=model).observe(ttft_seconds)

        # Calculate and record cost
        costs = self.COST_PER_TOKEN.get(model, {"input": 0.01, "output": 0.03})
        cost = (input_tokens / 1000 * costs["input"]) + (output_tokens / 1000 * costs["output"])
        ESTIMATED_COST.labels(model=model).inc(cost)

    def record_safety_violation(self, model: str, violation_type: str, severity: str):
        SAFETY_VIOLATIONS.labels(
            model=model, violation_type=violation_type, severity=severity
        ).inc()

    def record_guardrail_block(self, model: str, stage: str, reason: str):
        GUARDRAIL_BLOCKS.labels(model=model, stage=stage, reason=reason).inc()

    def record_pii_detection(self, model: str, pii_type: str):
        PII_DETECTIONS.labels(model=model, pii_type=pii_type).inc()

    def record_feedback(self, model: str, feedback_type: str):
        USER_FEEDBACK.labels(model=model, feedback_type=feedback_type).inc()
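
As a sanity check on the cost arithmetic in record_request, the per-request estimate can be computed in isolation. A stdlib-only sketch — the price table mirrors COST_PER_TOKEN above, and the fallback prices for unlisted models are the same assumption the collector makes:

```python
# Standalone sketch of the cost estimate used in record_request.
COST_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Fall back to conservative default prices for unlisted models.
    prices = COST_PER_1K.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens / 1000 * prices["input"]
            + output_tokens / 1000 * prices["output"])

# 1,000 input + 500 output tokens on gpt-4:
# 1.0 * 0.03 + 0.5 * 0.06 = 0.06 USD
print(round(estimate_cost("gpt-4", 1000, 500), 4))
```

Reconciling this estimate against actual invoices (see Failure Modes) catches price-table staleness before budget alerts silently stop firing.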

Semantic Drift Detection

# drift_detector.py - detect semantic drift in LLM responses
import numpy as np
from collections import deque
from typing import Optional

class SemanticDriftDetector:
    """
    Detect when LLM responses drift from expected behaviour.
    Uses embedding similarity to compare recent responses against a baseline.
    """

    def __init__(self, embedding_fn, baseline_window: int = 1000,
                 detection_window: int = 100, threshold: float = 0.15):
        self.embedding_fn = embedding_fn
        self.baseline_embeddings = deque(maxlen=baseline_window)
        self.recent_embeddings = deque(maxlen=detection_window)
        self.threshold = threshold
        self.baseline_centroid: Optional[np.ndarray] = None

    def add_response(self, response_text: str) -> dict:
        embedding = self.embedding_fn(response_text)
        self.recent_embeddings.append(embedding)

        # Build baseline from first N responses
        if len(self.baseline_embeddings) < self.baseline_embeddings.maxlen:
            self.baseline_embeddings.append(embedding)
            if len(self.baseline_embeddings) == self.baseline_embeddings.maxlen:
                self.baseline_centroid = np.mean(list(self.baseline_embeddings), axis=0)
            return {"drift_detected": False, "status": "building_baseline"}

        if self.baseline_centroid is None:
            return {"drift_detected": False, "status": "building_baseline"}

        # Compare recent window centroid to baseline centroid
        recent_centroid = np.mean(list(self.recent_embeddings), axis=0)
        drift_distance = float(np.linalg.norm(recent_centroid - self.baseline_centroid))

        # Cosine similarity
        cosine_sim = float(
            np.dot(recent_centroid, self.baseline_centroid) /
            (np.linalg.norm(recent_centroid) * np.linalg.norm(self.baseline_centroid) + 1e-8)
        )

        drift_detected = drift_distance > self.threshold

        return {
            "drift_detected": drift_detected,
            "drift_distance": round(drift_distance, 4),
            "cosine_similarity": round(cosine_sim, 4),
            "baseline_size": len(self.baseline_embeddings),
            "recent_window_size": len(self.recent_embeddings),
            "status": "monitoring",
        }
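
The centroid-distance test above reduces to plain vector arithmetic. A stdlib-only sketch with toy 2-D vectors standing in for embeddings (a real deployment would use a sentence-embedding model and publish the distance to a Prometheus gauge — e.g. the llm_semantic_drift_distance gauge the alerting rules assume, which is not itself defined in the code above):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_distance(baseline, recent):
    """Euclidean distance between the two window centroids."""
    b, r = centroid(baseline), centroid(recent)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b, r)))

# Baseline responses cluster near (1, 0); recent ones near (0, 1).
baseline = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
recent = [[0.0, 1.0], [0.1, 0.9]]

print(drift_distance(baseline, recent) > 0.15)  # well above the 0.15 threshold
```

The 0.15 threshold is embedding-model-specific: distances are not comparable across embedding spaces, so the threshold must be re-calibrated whenever the embedding model changes.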

Prometheus Alerting Rules

# prometheus-llm-observability.yaml
groups:
  - name: llm-operational
    interval: 1m
    rules:
      # Latency alerts
      - alert: LLMLatencyP99High
        expr: >
          histogram_quantile(0.99,
            sum by (le, model) (rate(llm_request_duration_seconds_bucket[5m]))
          ) > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P99 latency exceeds 5s for {{ $labels.model }}"

      - alert: LLMTimeToFirstTokenSlow
        expr: >
          histogram_quantile(0.95,
            sum by (le, model) (rate(llm_time_to_first_token_seconds_bucket[5m]))
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Time to first token P95 exceeds 2s for {{ $labels.model }}"

      # Error rate
      - alert: LLMErrorRateHigh
        expr: >
          sum by (model, endpoint) (rate(llm_requests_total{status="error"}[5m]))
          / sum by (model, endpoint) (rate(llm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate exceeds 5% for {{ $labels.model }}"

  - name: llm-economic
    interval: 5m
    rules:
      # Cost alerts
      - alert: LLMHourlyCostHigh
        expr: >
          increase(llm_estimated_cost_usd_total[1h]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost exceeds $100/hour for {{ $labels.model }}"

      - alert: LLMDailyCostSpike
        expr: >
          increase(llm_estimated_cost_usd_total[1h])
          > 2 * avg_over_time(increase(llm_estimated_cost_usd_total[1h])[24h:1h])
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost spike: current hourly rate is 2x the 24h average"

      # Token usage anomaly
      - alert: LLMOutputTokenSpike
        expr: >
          rate(llm_output_tokens_total[5m])
          > 2 * avg_over_time(rate(llm_output_tokens_total[5m])[24h:5m])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Output token rate is 2x the 24h average for {{ $labels.model }}"

  - name: llm-safety
    interval: 1m
    rules:
      # Safety violation alerts
      - alert: LLMSafetyViolation
        expr: increase(llm_safety_violations_total{severity="critical"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical safety violation detected in {{ $labels.model }} output"
          description: "Type: {{ $labels.violation_type }}. Investigate immediately."

      - alert: LLMGuardrailBlockSpike
        expr: >
          sum by (model) (rate(llm_guardrail_blocks_total[5m]))
          / sum by (model) (rate(llm_requests_total[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ">20% of requests blocked by guardrails for {{ $labels.model }}"

      - alert: LLMPIILeakage
        expr: increase(llm_pii_detections_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "PII detected in output of {{ $labels.model }}: {{ $labels.pii_type }}"

  - name: llm-quality
    interval: 5m
    rules:
      # Drift detection
      - alert: LLMSemanticDrift
        expr: llm_semantic_drift_distance > 0.15
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Semantic drift detected in {{ $labels.model }} responses"
          description: "Drift distance: {{ $value }}. Response style may have changed."

      # User feedback degradation
      - alert: LLMNegativeFeedbackSpike
        expr: >
          sum by (model) (rate(llm_user_feedback_total{feedback_type="negative"}[1h]))
          / sum by (model) (rate(llm_user_feedback_total[1h])) > 0.3
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: ">30% negative feedback for {{ $labels.model }} in the last hour"

Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "LLM Production Observability",
    "uid": "llm-observability",
    "panels": [
      {
        "title": "Request Latency (P50/P95/P99)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.5, sum by (le, model) (rate(llm_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P50 - {{ model }}"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le, model) (rate(llm_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P95 - {{ model }}"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le, model) (rate(llm_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P99 - {{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "title": "Hourly Cost by Model",
        "type": "timeseries",
        "targets": [
          {
            "expr": "increase(llm_estimated_cost_usd_total[1h])",
            "legendFormat": "{{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "title": "Token Throughput",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(llm_input_tokens_total[5m])",
            "legendFormat": "Input - {{ model }}"
          },
          {
            "expr": "rate(llm_output_tokens_total[5m])",
            "legendFormat": "Output - {{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "title": "Safety Violations",
        "type": "stat",
        "targets": [
          {
            "expr": "increase(llm_safety_violations_total[24h])",
            "legendFormat": "{{ violation_type }}"
          }
        ],
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8}
      },
      {
        "title": "Guardrail Block Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_guardrail_blocks_total[5m])) / sum by (model) (rate(llm_requests_total[5m]))",
            "legendFormat": "{{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 8}
      },
      {
        "title": "User Feedback Ratio",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_user_feedback_total{feedback_type='positive'}[1h])) / sum by (model) (rate(llm_user_feedback_total[1h]))",
            "legendFormat": "Positive ratio - {{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
      },
      {
        "title": "Semantic Drift Distance",
        "type": "timeseries",
        "targets": [
          {
            "expr": "llm_semantic_drift_distance",
            "legendFormat": "{{ model }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16}
      }
    ]
  }
}

Expected Behaviour

  • All four dimensions of LLM observability are monitored: operational, economic, safety, quality
  • Latency alerts fire when P99 exceeds 5 seconds or time-to-first-token exceeds 2 seconds
  • Cost alerts fire when hourly spend exceeds budget or spikes 2x above the 24-hour average
  • Safety violations trigger immediate critical alerts
  • Semantic drift alerts fire after 30 minutes of sustained drift above threshold
  • User feedback ratio below 70% positive triggers investigation
  • Grafana dashboards provide real-time visibility across all dimensions

Trade-offs

  • Comprehensive metrics collection — Impact: full visibility into LLM behaviour. Risk: metric cardinality grows with models and endpoints. Mitigation: use bounded label values; aggregate by model, not by user.
  • Semantic drift detection — Impact: catches quality degradation. Risk: requires computing embeddings for every response (CPU/cost). Mitigation: sample responses (embed ~10% of them); use lightweight embedding models.
  • Cost tracking — Impact: prevents budget overruns. Risk: estimated costs may not match actual provider bills. Mitigation: reconcile estimated costs with provider invoices weekly; adjust cost-per-token tables.
  • User feedback collection — Impact: direct signal on quality. Risk: low feedback rate makes the signal noisy. Mitigation: make feedback low-friction (thumbs up/down); prompt for feedback selectively.
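
The sampling mitigation above can be a thin wrapper that forwards only every Nth response to the drift detector. A sketch under stated assumptions — the SampledDriftFeeder name and the counter-based 1-in-N scheme are illustrative; hash- or random-based sampling works equally well:

```python
class SampledDriftFeeder:
    """Forward only 1 in every `sample_every` responses to the (expensive)
    embedding + drift check, bounding CPU cost to roughly 1/N of full coverage."""

    def __init__(self, add_response, sample_every: int = 10):
        self.add_response = add_response  # e.g. SemanticDriftDetector.add_response
        self.sample_every = sample_every
        self._seen = 0

    def maybe_add(self, response_text: str):
        self._seen += 1
        if self._seen % self.sample_every == 0:
            return self.add_response(response_text)
        return None  # skipped: no embedding computed
```

With sample_every=10 about 10% of responses are embedded; the smaller sample takes longer to shift the window centroid, which is one reason the drift alert above waits 30 minutes before firing.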

Failure Modes

  • Metrics exporter down — Symptom: no metrics flowing to Prometheus. Detection: Prometheus target-down alert; gaps in dashboards. Recovery: restart the metrics exporter; use a sidecar pattern for reliability.
  • Cost estimate drift — Symptom: budget alerts not firing despite high actual spend. Detection: monthly reconciliation shows divergence. Recovery: update cost-per-token tables; add provider billing API integration.
  • Drift detector false alarm — Symptom: drift alert on an expected model update. Detection: alert fires after an intentional model version change. Recovery: reset the drift baseline after planned model updates; add deployment annotations to dashboards.
  • Alert fatigue — Symptom: too many low-severity alerts lead to critical alerts being missed. Detection: alert response time increases; incidents discovered late. Recovery: tune thresholds quarterly; route critical alerts to PagerDuty, warnings to Slack.
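
The cost-estimate-drift failure mode suggests a periodic reconciliation job. A stdlib-only sketch — the reconcile_costs name and the 10% tolerance are illustrative, and the invoiced figures would come from the provider's billing export:

```python
def reconcile_costs(estimated: dict, invoiced: dict, tolerance: float = 0.10) -> dict:
    """Return {model: fractional_divergence} for every model whose estimated
    spend differs from the invoiced amount by more than `tolerance`."""
    flagged = {}
    for model, actual in invoiced.items():
        if actual <= 0:
            continue  # nothing billed for this model; skip
        estimate = estimated.get(model, 0.0)
        divergence = abs(estimate - actual) / actual
        if divergence > tolerance:
            flagged[model] = round(divergence, 3)
    return flagged

# A 2% gap on claude-3-opus passes; a 20% gap on gpt-4 is flagged,
# signalling a stale cost-per-token table.
print(reconcile_costs({"gpt-4": 80.0, "claude-3-opus": 98.0},
                      {"gpt-4": 100.0, "claude-3-opus": 100.0}))
```

Running this weekly against the llm_estimated_cost_usd_total counter keeps the budget alerts honest between invoice cycles.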

When to Consider a Managed Alternative

LLM observability requires custom metrics, specialised dashboards, and domain-specific alerting that generic APM tools do not provide out of the box.

  • Grafana Cloud (#108): Managed Prometheus and Grafana with long-term metric storage. ML-powered anomaly detection. Custom dashboards for LLM metrics.
  • Datadog (#104): APM with LLM observability features. Token tracking, cost analytics, and integration with major LLM providers.

Premium content pack: LLM observability dashboard pack. Prometheus metrics library (Python), complete alerting rules across four dimensions, Grafana dashboard JSON (7 panels), semantic drift detection service, cost tracking middleware, and user feedback collection widget.