Observability for LLM Applications: Token Usage, Latency Anomalies, and Output Classification

Problem

LLM-powered applications have unique observability requirements that standard APM tools do not address: token-based cost tracking (not just request count), latency distributions with cold start vs warm inference, output quality monitoring (safety, accuracy, relevance), and prompt injection attempt detection. Without LLM-specific observability, you cannot detect model degradation, cost overruns, or abuse patterns.

Threat Model

  • Threats: cost abuse (automated requests consuming expensive GPU inference), model abuse (using the model for unintended purposes), and quality degradation (model performance declining without detection).

Configuration

Token Usage Metrics with OpenTelemetry

# otel_llm_instrumentation.py
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# A MeterProvider without a reader exports nothing; push metrics via OTLP
reader = PeriodicExportingMetricReader(OTLPMetricExporter())
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("llm-service")

# Token counters
prompt_tokens = meter.create_counter(
    "llm.tokens.prompt",
    description="Number of prompt/input tokens processed",
    unit="tokens"
)
completion_tokens = meter.create_counter(
    "llm.tokens.completion",
    description="Number of completion/output tokens generated",
    unit="tokens"
)
total_cost = meter.create_counter(
    "llm.cost.usd",
    description="Estimated cost in USD",
    unit="usd"
)

# Latency histograms
request_duration = meter.create_histogram(
    "llm.request.duration",
    description="End-to-end request duration",
    unit="ms"
)
first_token_latency = meter.create_histogram(
    "llm.first_token.latency",
    description="Time to first token (TTFT)",
    unit="ms"
)

def record_inference(api_key: str, model: str, input_tokens: int,
                      output_tokens: int, duration_ms: float, ttft_ms: float):
    labels = {"api_key": api_key, "model": model}
    prompt_tokens.add(input_tokens, labels)
    completion_tokens.add(output_tokens, labels)

    # Cost estimation (illustrative rates; adjust to the provider's current pricing)
    cost = (input_tokens * 0.000003) + (output_tokens * 0.000015)
    total_cost.add(cost, labels)

    request_duration.record(duration_ms, labels)
    first_token_latency.record(ttft_ms, labels)
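
To feed `record_inference`, the caller needs `duration_ms` and `ttft_ms` from the inference call itself. One way is to wrap the token stream with a small timer; the sketch below assumes a hypothetical `stream_tokens()` generator standing in for your model client's streaming API.

```python
import time
from typing import Iterator, List, Tuple

def timed_stream(stream: Iterator[str]) -> Tuple[List[str], float, float]:
    """Consume a token stream, measuring TTFT and total duration in milliseconds."""
    start = time.monotonic()
    ttft_ms = 0.0
    chunks: List[str] = []
    for i, chunk in enumerate(stream):
        if i == 0:
            # First chunk arrived: record time to first token
            ttft_ms = (time.monotonic() - start) * 1000
        chunks.append(chunk)
    duration_ms = (time.monotonic() - start) * 1000
    return chunks, duration_ms, ttft_ms

# Usage (stream_tokens is a placeholder for your client's streaming call):
# chunks, duration_ms, ttft_ms = timed_stream(stream_tokens(prompt))
# record_inference(api_key, model, input_tokens, len(chunks), duration_ms, ttft_ms)
```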

Prometheus Alert Rules for LLM Monitoring

groups:
  - name: llm-monitoring
    rules:
      # Cost spike per API key
      - alert: LLMCostSpike
        expr: >
          sum by (api_key) (rate(llm_cost_usd_total[1h]))
          > 5 * avg_over_time(sum by (api_key) (rate(llm_cost_usd_total[1h]))[7d:1h])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "API key {{ $labels.api_key }} cost 5x above baseline"

      # First-token latency degradation (model performance issue)
      - alert: LLMLatencyDegradation
        expr: >
          histogram_quantile(0.95, sum by (le, model) (rate(llm_first_token_latency_bucket[5m])))
          > 2 * histogram_quantile(0.95, sum by (le, model) (rate(llm_first_token_latency_bucket[1h])))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Model {{ $labels.model }} P95 TTFT doubled, possible GPU saturation or model issue"

      # Token throughput drop (model serving degradation)
      - alert: LLMThroughputDrop
        expr: >
          sum(rate(llm_tokens_completion_total[5m])) < 0.5 * avg_over_time(sum(rate(llm_tokens_completion_total[5m]))[7d:5m])
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "LLM throughput dropped below 50% of baseline, check GPU health and model serving status"
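
The cost-spike alert re-evaluates the per-key sum inside a subquery on every step. A recording rule can pre-compute that aggregate, which also tames cardinality for dashboards; the rule name below is a suggested convention, not an existing one.

```yaml
groups:
  - name: llm-recording
    rules:
      # Pre-compute the per-key cost rate once per evaluation interval;
      # alerts and panels can then read this single recorded series.
      - record: api_key:llm_cost_usd:rate1h
        expr: sum by (api_key) (rate(llm_cost_usd_total[1h]))
```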

Output Quality Monitoring

# output_monitor.py - classify model outputs for safety and quality

import re
from typing import Dict

def classify_output(output: str) -> Dict[str, bool]:
    """Basic output classification. For production, use a dedicated classifier model."""
    classifications = {
        "contains_pii": bool(re.search(
            r'\b\d{3}-\d{2}-\d{4}\b|\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',
            output, re.IGNORECASE
        )),
        "contains_code": bool(re.search(
            r'(import |def |class |function |const |var |let )', output
        )),
        "excessive_length": len(output) > 10000,
        "empty_response": len(output.strip()) < 10,
        "possible_system_prompt_leak": bool(re.search(
            r'(you are|your instructions|system prompt|your role is)',
            output, re.IGNORECASE
        )),
    }
    return classifications
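
Classification does not need to sit on the response path. A minimal sketch of the asynchronous pattern, using a background worker thread and the standard-library queue (the `sink` list stands in for emitting a metric or log):

```python
import queue
import threading
from typing import Callable, Dict, List

def start_classifier_worker(
    classify: Callable[[str], Dict[str, bool]],
    sink: List[Dict[str, bool]],
) -> "queue.Queue[str | None]":
    """Start a daemon thread that classifies queued outputs off the response path."""
    q: "queue.Queue[str | None]" = queue.Queue()

    def worker() -> None:
        while True:
            output = q.get()
            if output is None:  # sentinel: shut the worker down
                break
            sink.append(classify(output))  # in production: increment a metric, emit a log
            q.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return q

# Usage: the request handler calls q.put(model_output) and returns immediately.
# q = start_classifier_worker(classify_output, results)
```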

Grafana Dashboard Design

Key panels for an LLM observability dashboard:

  1. Token usage per API key per hour, time series, stacked by key
  2. Cost per API key per day, table with daily/weekly/monthly projections
  3. P50/P95/P99 TTFT (time to first token), heatmap by model
  4. Tokens per second throughput, gauge showing current vs capacity
  5. Output classification distribution, pie chart (normal, PII detected, system prompt leak, excessive length)
  6. Request error rate, 4xx/5xx by endpoint
  7. Active inference requests, gauge showing current GPU utilisation
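
The panels above can be backed by PromQL along these lines (metric names assume Prometheus appends `_total` to the OTel counters, as in the alert rules):

```promql
# Panel 1: token usage per API key (stacked time series)
sum by (api_key) (rate(llm_tokens_prompt_total[1h]) + rate(llm_tokens_completion_total[1h]))

# Panel 3: P95 TTFT by model (heatmap source)
histogram_quantile(0.95, sum by (le, model) (rate(llm_first_token_latency_bucket[5m])))

# Panel 4: completion tokens per second (throughput gauge)
sum(rate(llm_tokens_completion_total[5m]))
```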

Expected Behaviour

  • Token usage tracked per API key with cost estimation
  • Cost spike alerts fire within 15 minutes of 5x baseline
  • P95 TTFT degradation detected within 10 minutes
  • Output classification runs on all responses, flagging PII and system prompt leaks
  • Dashboard provides real-time visibility into model performance, cost, and safety

Trade-offs

| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Per-API-key token tracking | High-cardinality metrics (one series per key × model) | Prometheus storage grows with key count | Use recording rules to pre-aggregate, or use Grafana Cloud (#108) / Axiom (#112) for high-cardinality data. |
| Output classification on every response | Adds 5-20 ms per response; CPU overhead | Latency increase on the response path | Run classification asynchronously (log output, classify in background). |
| Cost estimation in metrics | Provides real-time cost visibility | Pricing changes require a metric update | Use a configurable cost-per-token map; update when pricing changes. |
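
The configurable cost-per-token map from the mitigation above can be as simple as a dict keyed by model, loaded from config at startup. The rates below are illustrative, not current provider pricing.

```python
# USD per token, per model; a pricing change is a config edit, not a code change.
PRICING = {
    "gpt-4": {"prompt": 0.000003, "completion": 0.000015},
    "default": {"prompt": 0.000001, "completion": 0.000002},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD, falling back to default rates for unknown models."""
    rates = PRICING.get(model, PRICING["default"])
    return input_tokens * rates["prompt"] + output_tokens * rates["completion"]
```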

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| OTel exporter fails | Metrics stop updating; dashboards show gaps | Prometheus `up == 0` for the LLM service scrape target | Fix the OTel exporter configuration; check network connectivity to the collector. |
| Cost estimation wrong | Budgets based on incorrect cost data | Manual audit reveals a discrepancy between estimated and actual costs | Update the cost-per-token configuration; validate against the provider invoice monthly. |
| Output classifier false positives | Legitimate outputs flagged as PII | Output monitoring shows a high PII rate with no actual PII in reviewed samples | Tune the regex patterns; in production, use a dedicated NLP model for PII detection. |

When to Consider a Managed Alternative

LLM metrics are high-cardinality (per-key × per-model × per-endpoint). Self-managed Prometheus struggles past 50K active series.

  • Grafana Cloud (#108): Handles high-cardinality metrics natively. Managed dashboards with team sharing. Start free (10K metrics).
  • Axiom (#112): Unlimited retention for LLM event data. Serverless query. 500GB/month free.

Premium content pack: LLM observability dashboard pack. Grafana dashboard JSON for token usage, cost tracking, latency distributions, output quality, and OTel instrumentation templates for Python, Go, and Node.