OpenTelemetry for Security: Distributed Tracing of Authentication and Authorization Flows

Problem

Distributed tracing is standard for performance debugging, but almost no team uses it for security. Authentication flows that span multiple services (identity provider to API gateway to backend to database) are invisible without end-to-end tracing. When a credential stuffing attack hits your login endpoint, you see elevated error rates in metrics, but you cannot trace a single malicious request through every hop it touches.

The specific gaps:

  • Auth failures are logged, not traced. A failed login generates a log line. But you cannot see which downstream services were contacted during the failed attempt, how long each step took, or whether the request triggered any side effects.
  • Latency anomalies in auth flows go unnoticed. Brute force attacks cause elevated P99 latency on auth endpoints. Without tracing, you cannot distinguish between a slow database query and 10,000 concurrent credential stuffing requests.
  • Authorization decisions are scattered. A user requests access to a resource. The API gateway checks the JWT, the backend checks RBAC, the database checks row-level security. If any step fails, the context is lost. You see a 403 but not which authorization layer rejected the request or why.
  • Cross-service correlation requires trace context. When a suspicious request hits service A and you want to see what happened in service B, you need a trace ID that propagates across service boundaries. Without OTel instrumentation, each service is an island.
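To make the last gap concrete: the trace ID that ties services together travels in the W3C Trace Context `traceparent` header, which OTel SDKs inject into and extract from every hop automatically. A stdlib-only sketch of the header's shape (the example values are illustrative, taken from the W3C spec's sample header):

```python
# The W3C Trace Context "traceparent" header that OTel propagates between
# services. Format: version-traceid-spanid-flags, all lowercase hex.
import re

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return the trace context fields, or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header.strip())
    return m.groupdict() if m else None

# Example header as emitted by an instrumented service:
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["trace_id"] is the ID you pivot on between service A and service B
```

The `trace_id` field is what a load balancer or gateway must pass through unmodified; if any hop drops the header, each service starts a fresh trace and the cross-service view is lost.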

This article instruments auth flows with OpenTelemetry spans, configures the OTel Collector for security-relevant span filtering, and builds detection rules from trace data.

Target systems: Any application instrumented with OpenTelemetry SDK (Go, Python, Java, Node.js). OTel Collector for pipeline processing. Prometheus and Grafana for derived metrics.

Threat Model

  • Adversary: An attacker performing credential stuffing, brute force, or session hijacking against authentication endpoints. They may also exploit authorization bypass vulnerabilities to access resources they should not reach.
  • Blast radius: Without auth flow tracing, you detect attacks only through aggregate metrics (elevated error rates) with no request-level visibility. Investigation requires correlating logs across 3-5 services manually. With tracing, a single trace ID shows the complete attack path through every service hop.

Configuration

Instrumenting Auth Flows with OTel Spans

Add security-relevant attributes to authentication and authorization spans:

# Python example using OpenTelemetry SDK.
# Instrument the login handler to create a span with security attributes.
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("auth-service")

def login_handler(request):
    with tracer.start_as_current_span("auth.login") as span:
        # Tag every auth span with security-relevant attributes.
        span.set_attribute("auth.method", "password")
        span.set_attribute("auth.username", request.username)
        span.set_attribute("net.peer.ip", request.client_ip)
        span.set_attribute("auth.user_agent", request.headers.get("User-Agent", ""))

        # Token validation as a child span.
        with tracer.start_as_current_span("auth.token_validate") as token_span:
            token = validate_credentials(request.username, request.password)
            if token is None:
                span.set_attribute("auth.result", "failure")
                span.set_attribute("auth.failure_reason", "invalid_credentials")
                span.set_status(StatusCode.ERROR, "Authentication failed")
                return Response(status=401)
            token_span.set_attribute("auth.token_type", "jwt")

        # Permission check as another child span.
        with tracer.start_as_current_span("auth.permission_check") as perm_span:
            permissions = check_permissions(token.user_id, request.resource)
            perm_span.set_attribute("auth.permissions_granted", len(permissions))
            perm_span.set_attribute("auth.resource", request.resource)

        span.set_attribute("auth.result", "success")
        return Response(status=200, body={"token": token.value})

// Go example: authorization middleware with OTel spans.
// (Assumes a package-level tracer plus the otel trace, attribute, and codes imports.)
func AuthorizationMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "auth.authorize",
            trace.WithAttributes(
                attribute.String("auth.method", "bearer"),
                attribute.String("net.peer.ip", r.RemoteAddr),
                attribute.String("http.target", r.URL.Path),
            ),
        )
        defer span.End()

        claims, err := validateJWT(r.Header.Get("Authorization"))
        if err != nil {
            span.SetAttributes(
                attribute.String("auth.result", "failure"),
                attribute.String("auth.failure_reason", err.Error()),
            )
            span.SetStatus(codes.Error, "Authorization failed")
            http.Error(w, "Unauthorized", 401)
            return
        }

        span.SetAttributes(
            attribute.String("auth.result", "success"),
            attribute.String("auth.subject", claims.Subject),
            attribute.StringSlice("auth.roles", claims.Roles),
        )
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

OTel Collector Configuration

Filter and process security-relevant spans before export:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Filter: keep only auth-related spans for the security pipeline.
  filter/security:
    spans:
      include:
        match_type: regexp
        span_names:
          - "auth\\..*"
          - ".*login.*"
          - ".*authorize.*"
          - ".*permission.*"

  # Add cluster metadata to all spans.
  resource:
    attributes:
      - key: k8s.cluster.name
        value: "production"
        action: upsert

  # Tail sampling: keep 100% of failed auth traces, 10% of successful.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: auth-failures-always
        type: string_attribute
        string_attribute:
          key: auth.result
          values: ["failure"]
      - name: high-latency-auth
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

# Generate metrics from span data (RED metrics for auth flows).
# Recent Collector releases replace the deprecated spanmetrics
# *processor* with the spanmetrics *connector*, configured here.
connectors:
  spanmetrics:
    dimensions:
      - name: auth.result
      - name: auth.method
      - name: auth.failure_reason
      - name: net.peer.ip
    histogram:
      explicit:
        buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]

exporters:
  otlp/tempo:
    endpoint: "tempo.monitoring:4317"
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.pem

  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "otel_security"

service:
  pipelines:
    traces/security:
      receivers: [otlp]
      processors: [filter/security, resource, tail_sampling]
      exporters: [otlp/tempo]
    # Feed spanmetrics before tail sampling so derived metrics count
    # every span, not just the sampled ones.
    traces/spanmetrics:
      receivers: [otlp]
      processors: [filter/security]
      exporters: [spanmetrics]
    metrics/security:
      receivers: [spanmetrics]
      exporters: [prometheus]

Detection Rules from Trace-Derived Metrics

The spanmetrics component generates Prometheus metrics from auth spans. Exact metric names vary by Collector version; the rules below assume otel_security_calls_total and otel_security_duration_bucket, so adjust to your exporter's naming:

# Prometheus alerting rules based on OTel-derived auth metrics.
groups:
  - name: otel-auth-security
    rules:
      # Brute force detection: auth failure rate from a single IP.
      - alert: BruteForceFromTraces
        expr: >
          sum by (net_peer_ip) (
            rate(otel_security_calls_total{
              span_name=~"auth\\.login",
              auth_result="failure"
            }[5m])
          ) > 0.5
        for: 2m
        labels:
          severity: warning
          detection_type: brute_force
        annotations:
          summary: >
            Brute force detected from {{ $labels.net_peer_ip }}:
            {{ $value | humanize }} failed auths/sec
          runbook_url: "https://systemshardening.com/runbooks/brute-force"

      # Auth latency anomaly: P99 > 2x baseline indicates load from attack.
      - alert: AuthLatencyAnomaly
        expr: >
          histogram_quantile(0.99,
            sum by (le) (
              rate(otel_security_duration_bucket{
                span_name=~"auth\\..*"
              }[5m])
            )
          ) > 2 * histogram_quantile(0.99,
            sum by (le) (
              avg_over_time(
                rate(otel_security_duration_bucket{
                  span_name=~"auth\\..*"
                }[5m])[7d:5m]
              )
            )
          )
        for: 5m
        labels:
          severity: warning
          detection_type: auth_anomaly
        annotations:
          summary: "Auth P99 latency is 2x above baseline: {{ $value | humanizeDuration }}"

      # Credential stuffing: many unique usernames failing from the same IP.
      # Requires auth.username as a spanmetrics dimension, a deliberate
      # cardinality trade-off (see Trade-offs).
      - alert: CredentialStuffing
        expr: >
          count by (net_peer_ip) (
            count by (net_peer_ip, auth_username) (
              rate(otel_security_calls_total{
                auth_result="failure"
              }[10m]) > 0
            )
          ) > 20
        for: 5m
        labels:
          severity: critical
          detection_type: credential_stuffing
        annotations:
          summary: >
            Credential stuffing: {{ $labels.net_peer_ip }} tried
            {{ $value }} unique usernames in 10 minutes

      # Authorization bypass attempt: successful auth followed by
      # excessive 403 responses (probing for accessible resources).
      - alert: AuthorizationProbing
        expr: >
          sum by (auth_subject) (
            rate(otel_security_calls_total{
              span_name="auth.authorize",
              auth_result="failure"
            }[10m])
          ) > 0.1
          and
          sum by (auth_subject) (
            rate(otel_security_calls_total{
              span_name="auth.authorize",
              auth_result="success"
            }[10m])
          ) > 0
        for: 5m
        labels:
          severity: warning
          detection_type: authz_probing
        annotations:
          summary: >
            Authorization probing: {{ $labels.auth_subject }}
            is authenticated but hitting excessive 403s
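The nested count in the CredentialStuffing rule is dense; the same logic replayed in plain Python (with made-up failure data) shows what it computes: the number of distinct failing usernames per source IP.

```python
# Replaying the CredentialStuffing rule's nested count in Python:
# inner count = distinct (ip, username) failure series in the window,
# outer count = number of distinct usernames per ip.
from collections import defaultdict

# (source_ip, username) pairs with at least one failure in the window.
failures = [("203.0.113.57", f"user{i:03d}") for i in range(25)]
failures += [("198.51.100.9", "alice"), ("198.51.100.9", "alice")]

users_per_ip = defaultdict(set)
for ip, user in failures:
    users_per_ip[ip].add(user)

# Alert condition: more than 20 distinct usernames failing from one IP.
suspects = {ip for ip, users in users_per_ip.items() if len(users) > 20}
# suspects == {"203.0.113.57"}: 25 unique usernames, well past the threshold
```

A single user repeatedly fat-fingering a password (the 198.51.100.9 case) never trips the rule, because repeated failures for one username collapse into a single inner series.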

Security Dashboard Queries

# Grafana dashboard panels for auth security.

# Panel 1: Auth failure rate by source IP (top 10).
topk(10,
  sum by (net_peer_ip) (
    rate(otel_security_calls_total{auth_result="failure"}[5m])
  )
)

# Panel 2: Auth success/failure ratio over time.
sum(rate(otel_security_calls_total{auth_result="success"}[5m]))
/
sum(rate(otel_security_calls_total[5m]))

# Panel 3: P99 auth latency by span name.
histogram_quantile(0.99,
  sum by (le, span_name) (
    rate(otel_security_duration_bucket[5m])
  )
)

# Panel 4: Unique usernames per source IP (credential stuffing indicator).
# Requires auth.username as a spanmetrics dimension; scope to a recent
# window so stale series do not inflate the count.
count by (net_peer_ip) (
  count by (net_peer_ip, auth_username) (
    rate(otel_security_calls_total{auth_result="failure"}[10m]) > 0
  )
)

Expected Behaviour

  • Every authentication attempt (success and failure) produces a trace with security attributes
  • Authorization decisions are visible as child spans within the request trace
  • Failed auth traces retained at 100%; successful auth traces sampled at 10%
  • Brute force attacks detected within 2 minutes via trace-derived metrics
  • Credential stuffing (many unique usernames from one source) detected within 5 minutes
  • Auth P99 latency anomalies (2x above baseline) trigger alerts within 5 minutes
  • Trace IDs in log entries enable single-click pivot from alert to full request trace
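The last expectation, trace IDs in log entries, only needs the logging layer to stamp each record with the active trace ID. A stdlib-only sketch; a real deployment would read the ID from the OTel SDK's current span rather than a hand-set context variable:

```python
# Stamp each log record with the active trace ID so an alert can pivot
# straight to the matching trace. The context variable stands in for the
# OTel SDK's current span context.
import contextvars
import io
import logging

current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceFormatter(logging.Formatter):
    def format(self, record):
        record.trace_id = current_trace_id.get()
        return super().format(record)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceFormatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
log = logging.getLogger("auth-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.warning("auth.login failed for request")
# stream now contains the trace_id field alongside the message
```

With the ID in every log line, a Grafana data link from a log panel to the trace backend turns the alert-to-trace pivot into one click.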

Trade-offs

  • Tail sampling (100% failures, 10% success). Impact: 90% storage reduction for auth traces. Risk: may miss slow-burn attacks that succeed on each attempt. Mitigation: keep 100% of traces with unusual attributes (new user-agent, new geo); sample only routine successes.
  • Spanmetrics export to Prometheus. Impact: enables alerting without querying the trace backend. Risk: high-cardinality attributes (username, IP) cause metric explosion. Mitigation: limit dimensions to IP and result; use trace queries for username-level investigation.
  • Security-only filter in the Collector. Impact: reduces trace pipeline volume by 80-90%. Risk: non-auth security events (data access, admin actions) are not traced. Mitigation: add span name patterns as security instrumentation grows.
  • Username in span attributes. Impact: enables per-user attack analysis. Risk: PII in trace data may violate data retention policies. Mitigation: hash usernames before setting the span attribute if PII compliance requires it; set a TTL on trace storage.
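The username-hashing mitigation is a few lines; a sketch, assuming the key comes from a secret store and is rotated on a schedule (the key shown is a placeholder):

```python
# Pseudonymize usernames before attaching them as span attributes: a keyed
# HMAC blocks dictionary reversal of common usernames while still letting
# you group traces by user.
import hashlib
import hmac

PSEUDONYM_KEY = b"load-me-from-a-secret-store"  # placeholder, not a real key

def pseudonymize(username: str) -> str:
    # Normalize case so "Alice" and "alice" map to the same pseudonym.
    digest = hmac.new(PSEUDONYM_KEY, username.lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # 16 hex chars is plenty for grouping

# span.set_attribute("auth.username_hash", pseudonymize(request.username))
```

Note that rotating the key breaks continuity of per-user analysis across the rotation boundary, which is usually the point: old traces can no longer be joined to a live user.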

Failure Modes

  • OTel Collector down. Symptom: no new traces or metrics; alerts stop firing. Detection: up{job="otel-collector"} goes to 0. Recovery: deploy the Collector as a DaemonSet or with replicas; SDKs buffer spans locally during the outage.
  • Tail sampling drops attack traces. Symptom: investigation finds no trace for a known attack timestamp. Detection: post-incident, the trace ID from logs is not found in the backend. Recovery: increase decision_wait; add an explicit keep policy for high-severity attributes.
  • Spanmetrics cardinality explosion. Symptom: Prometheus OOM or slow queries; Collector memory spikes. Detection: Collector memory usage exceeds 80%; Prometheus scrape duration increases. Recovery: remove high-cardinality dimensions (username); aggregate IPs to /24 subnets.
  • Trace context not propagated. Symptom: auth spans appear as separate traces per service instead of one connected trace. Detection: traces show single-span roots instead of multi-service chains. Recovery: verify W3C Trace Context headers are propagated through load balancers and API gateways; check SDK auto-instrumentation configuration.
  • SDK instrumentation missing in one service. Symptom: gap in the trace chain; the auth flow is incomplete. Detection: trace visualization shows a missing hop between services. Recovery: audit every service in the auth path for OTel SDK initialization; use auto-instrumentation agents where manual instrumentation is missing.
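One of the cardinality recoveries above is aggregating source IPs to /24 subnets before using them as a metric dimension; a small stdlib sketch:

```python
# Collapse source IPs to a subnet before emitting them as a metric
# dimension, capping cardinality at one series per /24 (or /64 for IPv6).
import ipaddress

def aggregate_ip(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 64
    # strict=False masks off the host bits instead of raising.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

print(aggregate_ip("203.0.113.57"))  # 203.0.113.0/24
```

Subnet aggregation keeps brute force detection working (attacks from one botnet range still cluster) while the exact offending IP stays available in the retained failure traces.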

When to Consider a Managed Alternative

Self-managed OTel security tracing requires Collector deployment, SDK instrumentation across all auth services, span storage backend, and metric pipeline maintenance (6-10 hours/month).

  • Grafana Cloud (#108): Managed Tempo for traces, Mimir for metrics, Loki for logs. Unified view across all three signals. Native OTel Collector support with managed scaling.
  • Axiom (#112): Unified observability platform. Traces, metrics, and logs in one backend. Native OTel ingestion. No cardinality limits on trace attributes.
  • SigNoz (#117): OTel-native observability platform. Built specifically for OpenTelemetry data. Integrated trace-to-metric derivation without separate spanmetrics configuration.

Premium content pack: OTel security dashboard collection. Pre-built Grafana dashboards for auth flow analysis, brute force detection, and credential stuffing visualization. Includes OTel Collector configs and SDK instrumentation examples for Go, Python, Java, and Node.js.