Lateral Movement Detection: Network Patterns, Authentication Anomalies, and Alert Correlation
Problem
East-west traffic inside a Kubernetes cluster is a blind spot for most security teams. Once an attacker compromises a single pod, they pivot to other services using internal network paths that look identical to normal inter-service communication. Without baseline-aware monitoring of network flows and authentication events, lateral movement is invisible until the attacker reaches a high-value target like a database or secrets store.
The specific challenges:
- Internal traffic is trusted by default. Kubernetes flat networking means every pod can reach every other pod unless network policies restrict it. A compromised frontend pod can probe the entire cluster.
- Service meshes add encryption but not visibility. mTLS between sidecars encrypts east-west traffic, preventing network-level inspection. You need flow-level metadata (source, destination, port, protocol) rather than payload inspection.
- Authentication logs are scattered. Each service logs its own auth events. Correlating a failed login attempt on service A with a successful login on service B from the same source requires centralized log aggregation and join logic.
- Static allowlists break at scale. Manually maintaining a list of “allowed” communication paths between 200 microservices is impractical. You need automatic baseline generation from observed traffic.
This article covers baseline establishment with Cilium and Hubble, anomaly detection rules for new or unexpected network flows, authentication correlation across services, and automated response for confirmed lateral movement.
Target systems: Kubernetes clusters running Cilium CNI with Hubble enabled. Prometheus for metrics. Falco or Tetragon for runtime detection.
Threat Model
- Adversary: An attacker who has compromised a single pod through an application vulnerability (SSRF, RCE, dependency exploit). Their goal is to move laterally to higher-value targets: databases, secrets management, CI/CD systems, or cluster control plane components.
- Blast radius: Without lateral movement detection, the attacker can map the entire internal network, enumerate services, and pivot freely; dwell time for undetected lateral movement is typically measured in weeks. With flow-level monitoring and baseline alerting, initial pivot attempts generate alerts within minutes.
Configuration
Hubble Flow Monitoring
Enable Hubble to capture structured flow logs for all east-west traffic:
```yaml
# cilium-config ConfigMap (or Helm values)
# Enable Hubble with L7 visibility for HTTP flows.
hubble:
  enabled: true
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
    serviceMonitor:
      enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
```
Baseline Traffic Pattern Recording
Capture a rolling seven-day window of normal traffic patterns as Prometheus recording rules:
```yaml
# Prometheus recording rules: build a baseline of normal communication pairs.
groups:
  - name: lateral-movement-baselines
    interval: 5m
    rules:
      # Count unique source-destination pairs observed over 7 days.
      - record: hubble:flow_pairs:count_7d
        expr: >
          count by (source_workload, destination_workload, destination_port) (
            rate(hubble_flows_processed_total{verdict="FORWARDED"}[7d]) > 0
          )
      # Average flow rate between known service pairs (bytes/sec).
      - record: hubble:flow_rate:avg_7d
        expr: >
          avg_over_time(
            rate(hubble_tcp_bytes_total[5m])[7d:5m]
          )
      # Auth failure rate per source workload across all destinations.
      - record: security:lateral_auth_failures:rate5m_7d
        expr: >
          avg_over_time(
            sum by (source_workload) (
              rate(auth_failures_total{result="failure"}[5m])
            )[7d:5m]
          )
```
Anomaly Detection Rules
New Network Destination (Never Seen Before)
```yaml
# Alert: a workload is communicating with a destination not seen in the
# 7-day baseline. This is the highest-signal lateral movement indicator.
# The 1h offset on the baseline keeps a just-started flow from entering
# its own baseline and self-suppressing before `for: 3m` elapses.
- alert: NewNetworkDestination
  expr: >
    (
      count by (source_workload, destination_workload, destination_port) (
        rate(hubble_flows_processed_total{verdict="FORWARDED"}[5m]) > 0
      )
    )
    unless
    (hubble:flow_pairs:count_7d offset 1h > 0)
  for: 3m
  labels:
    severity: warning
    detection_type: lateral_movement
  annotations:
    summary: >
      New flow: {{ $labels.source_workload }} ->
      {{ $labels.destination_workload }}:{{ $labels.destination_port }}
    runbook_url: "https://systemshardening.com/runbooks/lateral-movement"
    false_positive_notes: |
      Common FP sources: new deployments, canary rollouts, feature flags
      enabling new service calls. Check if a deployment occurred in the
      last 30 minutes before escalating.
```
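The `unless` clause is just a set difference: pairs observed now, minus pairs recorded in the baseline. A minimal Python sketch of the same logic (workload names and ports are illustrative, not real Hubble output):

```python
# Sketch of the NewNetworkDestination logic as a set difference.
# Each flow is a (source_workload, destination_workload, destination_port)
# tuple; the baseline is the set of pairs seen during the learning window.

def new_destinations(current_flows, baseline):
    """Return flows present now but never seen in the baseline."""
    return sorted(set(current_flows) - baseline)

baseline = {
    ("frontend", "api", 8080),
    ("api", "postgres", 5432),
}

current = [
    ("frontend", "api", 8080),       # known pair: no alert
    ("frontend", "postgres", 5432),  # new pair: alert
]

print(new_destinations(current, baseline))
# [('frontend', 'postgres', 5432)]
```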
Port Scanning Detection
```yaml
# Alert: a single source contacts more than 10 unique destination ports
# within a 5-minute window. Normal services contact 1-3 ports.
- alert: PortScanDetected
  expr: >
    count by (source_workload) (
      count by (source_workload, destination_port) (
        rate(hubble_flows_processed_total{verdict="FORWARDED"}[5m]) > 0
      )
    ) > 10
  for: 2m
  labels:
    severity: critical
    detection_type: lateral_movement
  annotations:
    summary: "Port scan: {{ $labels.source_workload }} contacted {{ $value }} unique ports"
    runbook_url: "https://systemshardening.com/runbooks/port-scan"
```
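The nested `count by` aggregation can be mirrored in a few lines of Python: count distinct destination ports per source, then flag sources above the threshold (the flow tuples are illustrative):

```python
from collections import defaultdict

# Sketch of the PortScanDetected aggregation: distinct destination ports
# per source workload, flagged when the count exceeds the threshold.

def port_scan_sources(flows, threshold=10):
    """flows: iterable of (source_workload, destination_port) tuples."""
    ports = defaultdict(set)
    for source, port in flows:
        ports[source].add(port)
    return {s: len(p) for s, p in ports.items() if len(p) > threshold}

# A compromised pod probing ports 1-20 trips the threshold; a normal
# service talking to two ports does not.
flows = [("compromised-pod", p) for p in range(1, 21)]
flows += [("frontend", 8080), ("frontend", 5432)]

print(port_scan_sources(flows))
# {'compromised-pod': 20}
```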
Authentication Anomaly Correlation
```yaml
# Alert: a workload has auth failures against 3+ distinct services
# within 10 minutes. Normal services authenticate to 1-2 backends.
- alert: LateralAuthSweep
  expr: >
    count by (source_workload) (
      sum by (source_workload, destination_service) (
        rate(auth_failures_total{result="failure"}[10m])
      ) > 0
    ) >= 3
  for: 5m
  labels:
    severity: critical
    detection_type: lateral_movement
  annotations:
    summary: >
      Auth sweep: {{ $labels.source_workload }} failed auth against
      {{ $value }} services in 10 minutes
    runbook_url: "https://systemshardening.com/runbooks/credential-sweep"
```
Correlated Multi-Signal Alert
Single anomalies are noisy. Combine network and auth signals for high-confidence detection:
```yaml
# High-confidence lateral movement: new network destination AND auth
# failures from the same source workload.
- alert: ConfirmedLateralMovement
  expr: >
    (count by (source_workload) (
      ALERTS{alertname="NewNetworkDestination", alertstate="firing"}
    ) > 0)
    and on (source_workload)
    (count by (source_workload) (
      ALERTS{alertname="LateralAuthSweep", alertstate="firing"}
    ) > 0)
  labels:
    severity: critical
    detection_type: correlated_lateral_movement
  annotations:
    summary: >
      CORRELATED LATERAL MOVEMENT: {{ $labels.source_workload }}
      has new destinations AND auth failures
    description: |
      HIGH CONFIDENCE. This workload exhibits:
      1. Communication to destinations never seen in the 7-day baseline
      2. Authentication failures against multiple services
      IMMEDIATE ACTION: Isolate the workload. Begin investigation.
    runbook_url: "https://systemshardening.com/runbooks/lateral-movement-confirmed"
```
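The `and on (source_workload)` join reduces to a set intersection over workloads that currently have a firing alert. A sketch with hypothetical workload names:

```python
# Sketch of the correlated alert: a workload must appear in BOTH the
# network-anomaly set and the auth-sweep set before it is treated as
# confirmed lateral movement (mirrors PromQL's `and on (source_workload)`).

def confirmed_lateral_movement(new_destination_alerts, auth_sweep_alerts):
    """Each argument is a set of source_workload names with a firing alert."""
    return sorted(new_destination_alerts & auth_sweep_alerts)

network = {"frontend-7d9f", "batch-worker"}  # NewNetworkDestination firing
auth = {"frontend-7d9f"}                     # LateralAuthSweep firing

print(confirmed_lateral_movement(network, auth))
# ['frontend-7d9f']
```

The single-signal `batch-worker` still pages via its own alert, but only the intersection triggers automated response.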
Automated Response with Cilium Network Policy
When a correlated lateral movement alert fires, apply an isolation policy:
```yaml
# CiliumNetworkPolicy: quarantine a compromised workload.
# Applied automatically via Alertmanager webhook or Falcosidekick.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: quarantine-compromised
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      security.quarantine: "true"
  egressDeny:
    - toEntities:
        - cluster
        - world
  ingressDeny:
    - fromEntities:
        - cluster
        - world
```
The response webhook labels the suspected pod with `security.quarantine: "true"`, and Cilium immediately drops all traffic to and from that pod.
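A minimal sketch of the webhook side, assuming Alertmanager forwards the correlated alert with `namespace` and `pod` labels (a relabelling assumption; adapt to your label set). It parses the payload and builds the label patch that the quarantine policy matches on; the actual Kubernetes API call appears only as a comment so the sketch stays self-contained:

```python
import json

# Sketch of the quarantine webhook: parse an Alertmanager payload and
# build the JSON merge patch that labels the pod for the
# quarantine-compromised CiliumNetworkPolicy to match. The `pod` and
# `namespace` labels on the alert are assumptions about your relabelling.

QUARANTINE_PATCH = {"metadata": {"labels": {"security.quarantine": "true"}}}

def pods_to_quarantine(alertmanager_payload):
    """Return (namespace, pod) pairs for firing correlated alerts."""
    targets = []
    for alert in alertmanager_payload.get("alerts", []):
        labels = alert.get("labels", {})
        if (alert.get("status") == "firing"
                and labels.get("alertname") == "ConfirmedLateralMovement"):
            targets.append((labels.get("namespace", "default"), labels["pod"]))
    return targets

payload = json.loads("""{
  "alerts": [{
    "status": "firing",
    "labels": {
      "alertname": "ConfirmedLateralMovement",
      "namespace": "production",
      "pod": "frontend-7d9f"
    }
  }]
}""")

for namespace, pod in pods_to_quarantine(payload):
    # In a real receiver this would be a Kubernetes client call, e.g.:
    #   core_v1.patch_namespaced_pod(pod, namespace, QUARANTINE_PATCH)
    print(namespace, pod, json.dumps(QUARANTINE_PATCH))
```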
Expected Behaviour
- New service-to-service communication paths generate alerts within 5 minutes
- Port scanning from a single source triggers a critical alert within 2 minutes
- Authentication sweep across 3+ services triggers a critical alert within 5 minutes
- Correlated alerts (network + auth) reduce false positive rate by 70-80% compared to single-signal detection
- Automated quarantine isolates confirmed threats within 30 seconds of correlated alert
- False positive rate below 3 per day after the seven-day baseline period
Trade-offs
| Decision | Impact | Risk | Mitigation |
|---|---|---|---|
| Seven-day baseline learning period | Accurate traffic map; fewer false positives | No anomaly detection for new workloads during learning | Use strict Cilium network policies (deny by default) for new workloads instead of behavioural detection. |
| Hubble L7 metrics enabled | HTTP-level visibility (paths, methods) | 15-25% increase in Hubble relay memory usage | Limit L7 visibility to security-critical namespaces. Use L3/L4 for everything else. |
| Automated quarantine on correlated alert | Fast containment (seconds vs minutes) | False positive quarantine disrupts legitimate traffic | Require two correlated signals before auto-quarantine. Single-signal alerts page but do not isolate. |
| `unless` baseline matching | Zero alerts for known traffic pairs | Baseline includes attacker traffic if compromise occurred before monitoring | Re-baseline periodically. Audit baseline entries against expected architecture. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Hubble relay down | No flow data; all network alerts stop firing | Absent metric alert: `absent(hubble_flows_processed_total)` | Restart Hubble relay. Check Cilium agent health on each node. |
| Baseline too broad | Attacker traffic matches existing patterns; no alert fires | Post-incident review shows lateral movement within baseline | Review baseline entries quarterly. Remove overly broad pairs. Tighten destination port specificity. |
| Deployment triggers flood of NewNetworkDestination alerts | 20+ alerts during a rollout; on-call ignores real alerts | Deployment suppression rule not configured | Add inhibit rule: suppress `lateral_movement` alerts when DeploymentInProgress is firing in the same namespace. |
| Auto-quarantine false positive | Production service isolated; downstream failures | Service health checks fail; dependent services report errors | Quarantine policy has a 5-minute TTL. Require manual confirmation to extend. Automated rollback if health checks fail within 60 seconds. |
| Prometheus recording rules lagging | Baseline calculations stale; false positives increase | Recording rule evaluation duration exceeds interval | Reduce baseline calculation frequency (5m to 15m). Increase Prometheus resources. |
When to Consider a Managed Alternative
Self-managed lateral movement detection requires Cilium + Hubble operation, Prometheus recording rules, and ongoing baseline maintenance (4-6 hours/month for tuning).
- Sysdig (#122): Network security monitoring with automatic baseline generation. ML-powered lateral movement detection across multi-cluster environments. Managed Falco rules updated for emerging techniques.
- Isovalent (#54): Cilium Enterprise with built-in network flow analytics, automatic policy recommendation, and threat detection. Native Hubble integration without self-managed relay scaling.
- Grafana Cloud (#108): Centralized Hubble metric storage with managed Prometheus. Pre-built dashboards for network flow analysis. Alert correlation across metrics and logs.
Premium content pack: Lateral movement detection rule library. 15+ Prometheus alert rules, Cilium network policies for automated quarantine, Alertmanager webhook configurations, and Grafana dashboards for flow visualization.