# Observability for LLM Applications: Token Usage, Latency Anomalies, and Output Classification
## Problem

LLM-powered applications have observability requirements that standard APM tools do not address: token-based cost tracking (not just request counts), latency distributions that separate cold starts from warm inference, output quality monitoring (safety, accuracy, relevance), and detection of prompt injection attempts. Without LLM-specific observability, you cannot detect model degradation, cost overruns, or abuse patterns.
## Threat Model

- Adversaries and risks: cost abuse (automated requests consuming expensive GPU inference), model abuse (using the model for unintended purposes), and silent quality degradation (model performance declines without detection).
## Configuration

### Token Usage Metrics with OpenTelemetry
```python
# otel_llm_instrumentation.py
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Wire the OTLP exporter into the provider via a periodic reader
reader = PeriodicExportingMetricReader(OTLPMetricExporter())
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("llm-service")

# Token counters
prompt_tokens = meter.create_counter(
    "llm.tokens.prompt",
    description="Number of prompt/input tokens processed",
    unit="tokens",
)
completion_tokens = meter.create_counter(
    "llm.tokens.completion",
    description="Number of completion/output tokens generated",
    unit="tokens",
)
total_cost = meter.create_counter(
    "llm.cost.usd",
    description="Estimated cost in USD",
    unit="usd",
)

# Latency histograms
request_duration = meter.create_histogram(
    "llm.request.duration",
    description="End-to-end request duration",
    unit="ms",
)
first_token_latency = meter.create_histogram(
    "llm.first_token.latency",
    description="Time to first token (TTFT)",
    unit="ms",
)

def record_inference(api_key: str, model: str, input_tokens: int,
                     output_tokens: int, duration_ms: float, ttft_ms: float):
    labels = {"api_key": api_key, "model": model}
    prompt_tokens.add(input_tokens, labels)
    completion_tokens.add(output_tokens, labels)
    # Cost estimation: example per-token USD rates; adjust per model pricing
    cost = (input_tokens * 0.000003) + (output_tokens * 0.000015)
    total_cost.add(cost, labels)
    request_duration.record(duration_ms, labels)
    first_token_latency.record(ttft_ms, labels)
```
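To feed `record_inference`, the serving layer needs to time the stream itself. A minimal sketch of the timing logic, decoupled from any particular SDK (`measure_stream` is a hypothetical helper; `chunks` can be any iterable of text chunks from a streaming response):

```python
import time
from typing import Iterable, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a token stream, returning (output, ttft_ms, duration_ms).

    TTFT is measured at the first chunk; duration covers the full stream.
    Pass the results, along with token counts from the provider's usage
    object, to record_inference.
    """
    start = time.monotonic()
    ttft_ms = 0.0
    parts = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            ttft_ms = (time.monotonic() - start) * 1000
        parts.append(chunk)
    duration_ms = (time.monotonic() - start) * 1000
    return "".join(parts), ttft_ms, duration_ms
```

Keeping the timing helper SDK-agnostic makes it easy to reuse across providers and to unit-test with a plain list of strings.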
### Prometheus Alert Rules for LLM Monitoring
```yaml
groups:
  - name: llm-monitoring
    rules:
      # Cost spike per API key
      - alert: LLMCostSpike
        expr: >
          sum by (api_key) (rate(llm_cost_usd_total[1h]))
          > 5 * avg_over_time(sum by (api_key) (rate(llm_cost_usd_total[1h]))[7d:1h])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "API key {{ $labels.api_key }} cost 5x above baseline"

      # First-token latency degradation (model performance issue)
      - alert: LLMLatencyDegradation
        expr: >
          histogram_quantile(0.95, sum by (le, model) (rate(llm_first_token_latency_bucket[5m])))
          > 2 * histogram_quantile(0.95, sum by (le, model) (rate(llm_first_token_latency_bucket[1h])))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Model {{ $labels.model }} P95 TTFT doubled, possible GPU saturation or model issue"

      # Token throughput drop (model serving degradation)
      - alert: LLMThroughputDrop
        expr: >
          sum(rate(llm_tokens_completion_total[5m])) < 0.5 * avg_over_time(sum(rate(llm_tokens_completion_total[5m]))[7d:5m])
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "LLM throughput dropped to 50% of baseline, check GPU health and model serving status"
```
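With per-key labels these alert queries can get expensive at high cardinality. One option is to pre-aggregate with a recording rule so alerts query a single pre-computed series per key; a sketch (the rule name is illustrative):

```yaml
groups:
  - name: llm-recording
    rules:
      # Pre-aggregate hourly cost rate per API key; alert rules can then
      # reference llm:cost_usd:rate1h instead of re-aggregating raw series
      - record: llm:cost_usd:rate1h
        expr: sum by (api_key) (rate(llm_cost_usd_total[1h]))
```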
### Output Quality Monitoring
```python
# output_monitor.py - classify model outputs for safety and quality
import re
from typing import Dict, Optional

def classify_output(output: str, expected_topic: Optional[str] = None) -> Dict[str, bool]:
    """Basic output classification. For production, use a dedicated classifier model.

    `expected_topic` is reserved for future topic-relevance checks.
    """
    classifications = {
        # US SSN pattern or email address
        "contains_pii": bool(re.search(
            r'\b\d{3}-\d{2}-\d{4}\b|\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',
            output, re.IGNORECASE
        )),
        "contains_code": bool(re.search(
            r'(import |def |class |function |const |var |let )', output
        )),
        "excessive_length": len(output) > 10000,
        "empty_response": len(output.strip()) < 10,
        "possible_system_prompt_leak": bool(re.search(
            r'(you are|your instructions|system prompt|your role is)',
            output, re.IGNORECASE
        )),
    }
    return classifications
```
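Since classification adds latency on the response path, the trade-off noted below is to run it asynchronously. A minimal sketch using a background worker thread (`start_classifier_worker` and `on_flag` are hypothetical names; any classifier with the same shape as `classify_output` works):

```python
import queue
import threading
from typing import Callable, Dict

def start_classifier_worker(classify: Callable[[str], Dict[str, bool]],
                            on_flag: Callable[[str, Dict[str, bool]], None]) -> queue.Queue:
    """Classify model outputs off the response path.

    Responses are enqueued and classified in a daemon thread, so the
    classification cost never adds to user-facing latency. `on_flag`
    is invoked only when at least one classification is True.
    """
    q: queue.Queue = queue.Queue()

    def worker():
        while True:
            output = q.get()
            if output is None:  # shutdown sentinel
                break
            flags = classify(output)
            if any(flags.values()):
                on_flag(output, flags)
            q.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return q
```

The serving handler just calls `q.put(response_text)` and returns immediately; flagged outputs land in whatever sink `on_flag` writes to (log, metric, review queue).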
### Grafana Dashboard Design
Key panels for an LLM observability dashboard:
- Token usage per API key per hour, time series, stacked by key
- Cost per API key per day, table with daily/weekly/monthly projections
- P50/P95/P99 TTFT (time to first token), heatmap by model
- Tokens per second throughput, gauge showing current vs capacity
- Output classification distribution, pie chart (normal, PII detected, system prompt leak, excessive length)
- Request error rate, 4xx/5xx by endpoint
- Active inference requests, gauge showing current GPU utilisation
## Expected Behaviour
- Token usage tracked per API key with cost estimation
- Cost spike alerts fire within 15 minutes of 5x baseline
- P95 TTFT degradation detected within 10 minutes
- Output classification runs on all responses, flagging PII and system prompt leaks
- Dashboard provides real-time visibility into model performance, cost, and safety
## Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Per-API-key token tracking | High-cardinality metrics (one series per key × model) | Prometheus storage grows with key count | Use recording rules to pre-aggregate. Or: use Grafana Cloud (#108) / Axiom (#112) for high-cardinality. |
| Output classification on every response | Adds 5-20ms per response; CPU overhead | Latency increase on the response path | Run classification asynchronously (log output, classify in background). |
| Cost estimation in metrics | Provides real-time cost visibility | Pricing changes require metric update | Use configurable cost-per-token map. Update when pricing changes. |
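The configurable cost-per-token map mentioned above might look like this (rates and model names are illustrative; in practice load the table from config so pricing changes don't require a code deploy):

```python
from typing import Dict, Tuple

# Hypothetical (input_rate, output_rate) in USD per token; load from config
MODEL_RATES: Dict[str, Tuple[float, float]] = {
    "gpt-4": (0.000003, 0.000015),
    "default": (0.000001, 0.000002),  # fallback for unrecognised models
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from a configurable rate table."""
    in_rate, out_rate = MODEL_RATES.get(model, MODEL_RATES["default"])
    return input_tokens * in_rate + output_tokens * out_rate
```

`record_inference` can then call `estimate_cost` instead of hard-coding rates, and a monthly reconciliation against the provider invoice validates the table.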
## Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| OTel exporter fails | Metrics stop updating; dashboards show gaps | Prometheus `up == 0` for the LLM service scrape target | Fix OTel exporter configuration. Check network connectivity to the collector. |
| Cost estimation wrong | Budgets based on incorrect cost data | Manual audit reveals discrepancy between estimated and actual costs | Update cost-per-token configuration. Validate against provider invoice monthly. |
| Output classifier false positive | Legitimate outputs flagged as PII | Output monitoring shows high PII rate with no actual PII in reviewed samples | Tune regex patterns. For production: use a dedicated NLP model for PII detection. |
## When to Consider a Managed Alternative
LLM metrics are high-cardinality (per-key × per-model × per-endpoint). Self-managed Prometheus struggles past 50K active series.
- Grafana Cloud (#108): Handles high-cardinality metrics natively. Managed dashboards with team sharing. Start free (10K metrics).
- Axiom (#112): Unlimited retention for LLM event data. Serverless query. 500GB/month free.
Premium content pack: LLM observability dashboard pack. Grafana dashboard JSON for token usage, cost tracking, latency distributions, output quality, and OTel instrumentation templates for Python, Go, and Node.