Building a Security Audit Log Pipeline That Scales: auditd to Elasticsearch
Problem
Linux audit logs are the ground truth for security investigation. auditd captures kernel-level events that no userspace tool can see: file access by any process, syscall execution, user command execution, privilege changes, and authentication events. But auditd’s raw output consists of cryptic multi-line records, it is high-volume (1-5GB per host per day under standard rules), and it is local to each host. Without a pipeline that collects, normalises, ships, and indexes these logs centrally, they are useless for incident response and invisible to security monitoring.
The specific challenges:
- Raw format is unreadable. auditd produces multi-line records with numeric syscall codes and hex-encoded arguments. Searching for “who read /etc/shadow” requires joining multiple record types (SYSCALL + PATH + CWD + PROCTITLE) by audit ID.
- Volume grows fast. Standard CIS-level audit rules generate 1-5GB per host per day. A 20-host fleet produces 20-100GB per day. Without retention management, storage costs are unbounded.
- Local logs are useless for central monitoring. auditd writes to /var/log/audit/audit.log on each host. If the host is compromised, the attacker deletes the local logs. If you need to search across hosts, you log into each one individually.
- Self-managed Elasticsearch is a full-time job. Running Elasticsearch for security logs requires index lifecycle management, shard sizing, capacity planning, cluster health monitoring, and backup, effectively a dedicated engineering role at scale.
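The join problem from the first bullet can be seen in miniature: every record belonging to one event shares the serial number inside `msg=audit(timestamp:serial)`. A minimal shell sketch with two hypothetical raw records (in practice, `ausearch` performs this join for you, and `-i` translates the numeric fields):

```shell
# Two hypothetical raw records from the same audit event:
rec1='type=SYSCALL msg=audit(1700000000.123:456): syscall=257 auid=1000 key="shadow_access"'
rec2='type=PATH msg=audit(1700000000.123:456): name="/etc/shadow"'

# Extract the serial after the timestamp; this is the join key.
serial() { printf '%s' "$1" | sed -n 's/.*msg=audit([0-9.]*:\([0-9]*\)).*/\1/p'; }

if [ "$(serial "$rec1")" = "$(serial "$rec2")" ]; then
  echo "records belong to audit event $(serial "$rec1")"
fi
# The real query: sudo ausearch -k shadow_access -i
```

`syscall=257` here is `openat` on x86_64; without the join you would know a file was opened but not which file, and without `-i` you would see only the numeric code.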
This article provides the complete pipeline: auditd rules → log shipping → structured transformation → centralized storage → security alerting.
Target systems: Ubuntu 24.04 LTS, RHEL 9, any Linux with auditd. Vector or Fluentd for shipping. Elasticsearch, Loki, or managed backend for storage.
Threat Model
- Adversary: Any attacker operating on Linux hosts. Audit logs detect: privilege escalation (sudo, setuid), unauthorized file access (shadow, SSH keys), user creation/modification, suspicious process execution, and network connections from unexpected processes.
- Blast radius: Without centralized audit logs, an attacker deletes local logs after compromise, leaving zero evidence. With centralized logs shipped in near-real-time, forensic evidence is preserved off-host before the attacker can destroy it.
Configuration
auditd Rule Design
# /etc/audit/rules.d/hardening.rules
# Security-relevant audit rules for production hosts.
# Applied with: sudo augenrules --load
# --- Kernel buffer ---
# Backlog for audit events awaiting delivery to auditd. If it fills,
# events are lost (or the system hangs, depending on the failure mode).
# 8192 is a common default; use 32768 on busy systems.
-b 32768
# --- Rule ordering for performance ---
# Rules are evaluated in order within each filter list. Put high-frequency
# exclusions at the top to reduce processing overhead.
# Exclude high-volume, low-value events
-a always,exclude -F msgtype=CWD
-a always,exclude -F msgtype=EOE
# --- File access monitoring ---
# Monitor reads/writes to sensitive files.
-w /etc/shadow -p rwa -k shadow_access
-w /etc/passwd -p rwa -k passwd_access
-w /etc/group -p rwa -k group_access
-w /etc/sudoers -p rwa -k sudoers_access
-w /etc/ssh/sshd_config -p rwa -k sshd_config
-w /root/.ssh -p rwa -k root_ssh
-w /etc/crontab -p rwa -k cron_access
-w /etc/cron.d -p rwa -k cron_access
-w /var/spool/cron -p rwa -k cron_access
# --- User/group changes ---
-w /usr/sbin/useradd -p x -k user_modification
-w /usr/sbin/userdel -p x -k user_modification
-w /usr/sbin/usermod -p x -k user_modification
-w /usr/sbin/groupadd -p x -k group_modification
-w /usr/sbin/groupmod -p x -k group_modification
# --- Privilege escalation ---
# Log execve running with root privileges (euid=0) by a non-root login
# user; auid 4294967295 means the login UID is unset (daemons, boot).
-a always,exit -F arch=b64 -S execve -F euid=0 -F auid!=0 -F auid!=4294967295 -k privilege_escalation
# Monitor su and sudo usage
-w /usr/bin/su -p x -k su_usage
-w /usr/bin/sudo -p x -k sudo_usage
# --- Process execution logging ---
# Log all process execution (WARNING: high volume on busy systems)
# Enable selectively or use a lower-volume alternative below.
# -a always,exit -F arch=b64 -S execve -k exec_log
# Lower-volume alternative: log execution only by non-system users
-a always,exit -F arch=b64 -S execve -F auid>=1000 -F auid!=4294967295 -k user_exec
# --- Kernel module loading ---
-a always,exit -F arch=b64 -S init_module -S finit_module -k module_load
-a always,exit -F arch=b64 -S delete_module -k module_unload
# --- Network connections (optional - high volume) ---
# Example: log IPv4 socket creation. For socket(2), a0 is the address
# family (2 = AF_INET); for connect(2), a0 is a file descriptor, so
# filtering connect by a0 is unreliable.
# -a always,exit -F arch=b64 -S socket -F a0=2 -k network_connect
# --- Make rules immutable (must reboot to change) ---
# Uncomment after testing is complete:
# -e 2
auditd.conf Tuning
# /etc/audit/auditd.conf - tuned for production
# Prevent audit event loss under high load.
# Local log rotation: 10 files x 50MB = 500MB on-host buffer.
max_log_file_action = rotate
max_log_file = 50
num_logs = 10
# Note: the kernel backlog limit is NOT an auditd.conf setting. Set it in
# the rules file with '-b 32768'. If the backlog fills, events are LOST
# (or the system hangs, depending on the failure mode).
# What happens when the disk fills (values in megabytes):
space_left = 100
space_left_action = email
admin_space_left = 50
admin_space_left_action = halt
# 'halt' stops the system when audit can't log. Use 'syslog' if availability > audit integrity.
# Flush policy: flush every 'freq' records. Lower freq = less data loss
# on crash, but more I/O.
flush = incremental_async
freq = 50
# Apply rules and verify:
sudo augenrules --load
sudo auditctl -l # List active rules
sudo auditctl -s # Show audit status (check 'lost' counter = 0)
# If 'lost' > 0: raise the '-b' backlog limit in the rules and reload.
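The lost counter is worth watching continuously, not just at setup time. A minimal textfile-collector sketch for node_exporter (the collector directory and the metric name are assumptions; run it from cron or a systemd timer as root):

```shell
#!/bin/sh
# Export auditd's 'lost' counter as a Prometheus metric via the
# node_exporter textfile collector (directory path is an assumption).
TEXTFILE_DIR=${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}

# Parse the 'lost N' line out of `auditctl -s` output.
parse_lost() { awk '/^lost/ {print $2}'; }

write_metric() {
  lost=$1
  tmp=$(mktemp)
  {
    printf '# HELP auditd_lost_events Audit events dropped by the kernel.\n'
    printf '# TYPE auditd_lost_events counter\n'
    printf 'auditd_lost_events %s\n' "$lost"
  } > "$tmp"
  mv "$tmp" "$TEXTFILE_DIR/auditd.prom"   # atomic: no partial reads
}

# Requires root and a running auditd:
# write_metric "$(auditctl -s | parse_lost)"
```

With this in place, an alert on any increase of `auditd_lost_events` catches buffer overflow before an investigation discovers the gap.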
Log Shipping with Vector
Vector (maintained by Datadog, MPL-2.0 licensed) is recommended over Fluentd for new deployments: lower memory usage, faster processing, and native structured log support.
# /etc/vector/vector.yaml
# Ship audit logs from auditd to a centralized backend.
sources:
  audit_logs:
    type: file
    include:
      - /var/log/audit/audit.log
    # Vector needs read access to audit.log (run as root or adjust log_group).
    # Alternative: a journald source. Note that matching auditd.service only
    # captures the daemon's own messages; kernel audit records reach the
    # journal via the audit transport, not the service unit.
transforms:
  parse_audit:
    type: remap
    inputs:
      - audit_logs
    source: |
      # Parse the auditd key=value format into structured fields
      . = parse_key_value!(.message, key_value_delimiter: "=", field_delimiter: " ")
      .timestamp = now()
      .host = get_hostname!()
      .source = "auditd"
  enrich:
    type: remap
    inputs:
      - parse_audit
    source: |
      # Add environment metadata
      .environment = "production"
      .cluster = "web-fleet"
sinks:
  # Option 1: Elasticsearch
  elasticsearch:
    type: elasticsearch
    inputs:
      - enrich
    endpoints:
      - "https://elasticsearch.example.com:9200"
    bulk:
      index: "audit-logs-%Y.%m.%d"
    auth:
      strategy: basic
      user: "${ES_USER}"
      password: "${ES_PASSWORD}"
  # Option 2: Grafana Cloud Loki
  # loki:
  #   type: loki
  #   inputs:
  #     - enrich
  #   endpoint: "https://logs-prod-us-central1.grafana.net"
  #   encoding:
  #     codec: json
  #   auth:
  #     strategy: basic
  #     user: "${LOKI_USER}"
  #     password: "${LOKI_API_KEY}"
  #   labels:
  #     host: "{{ host }}"
  #     source: "auditd"
  #     environment: "{{ environment }}"
  # Option 3: Axiom
  # axiom:
  #   type: axiom
  #   inputs:
  #     - enrich
  #   dataset: "audit-logs"
  #   token: "${AXIOM_API_TOKEN}"
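The daily index suffix exists so that retention becomes a cheap index deletion rather than a per-document purge. A hedged sweep sketch (the endpoint, index prefix, and 30-day window are assumptions; `date -d` is GNU date, as shipped on Ubuntu and RHEL):

```shell
#!/bin/bash
# Delete audit-logs-YYYY.MM.DD indices older than the retention window.
ES=${ES:-https://elasticsearch.example.com:9200}
RETENTION_DAYS=${RETENTION_DAYS:-30}

# Dated index names are zero-padded, so lexicographic order is calendar
# order and a plain string comparison is enough.
expired() {  # expired <index-name> <cutoff YYYY.MM.DD>
  local day=${1#audit-logs-}
  [[ $day < $2 ]]
}

cutoff=$(date -u -d "$RETENTION_DAYS days ago" +%Y.%m.%d)
for idx in $(curl -s "$ES/_cat/indices/audit-logs-*?h=index"); do
  if expired "$idx" "$cutoff"; then
    echo "deleting $idx"
    curl -s -XDELETE "$ES/$idx" > /dev/null
  fi
done
```

For archival beyond the hot window, ship a copy to object storage first; deletion here only frees the query tier.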
Security Alert Rules
# Loki ruler alert rules (the expressions are LogQL; with an Elasticsearch
# backend, implement equivalents via Kibana alerting or ElastAlert).
# These detect the most critical security events in audit logs.
groups:
  - name: audit-security-alerts
    rules:
      - alert: ShadowFileAccessed
        expr: count_over_time({source="auditd"} |= "shadow_access" [5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "/etc/shadow was accessed"
          runbook: "Investigate: which process, which user, from which host. Check for unauthorized access."
      - alert: UserCreatedOrModified
        expr: count_over_time({source="auditd"} |= "user_modification" [5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "User account created or modified"
          runbook: "Verify this was an authorised change. Check for attacker persistence (new user account)."
      - alert: KernelModuleLoaded
        expr: count_over_time({source="auditd"} |= "module_load" [5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Kernel module loaded"
          runbook: "Verify the module is expected. Unexpected module loading may indicate rootkit installation."
      - alert: PrivilegeEscalation
        expr: count_over_time({source="auditd"} |= "privilege_escalation" [5m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Multiple privilege escalation events detected"
          runbook: "Check for brute-force sudo/su attempts or exploitation of setuid binaries."
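Each expression is just a windowed count of matching log lines. The ShadowFileAccessed logic, reduced to plain shell against a sample of shipped records (the sample lines are hypothetical):

```shell
# Local equivalent of:
#   count_over_time({source="auditd"} |= "shadow_access" [5m]) > 0
sample='type=SYSCALL msg=audit(1700000000.123:456): syscall=257 key="shadow_access"
type=SYSCALL msg=audit(1700000000.200:457): syscall=59 key="user_exec"'

count=$(printf '%s\n' "$sample" | grep -c 'shadow_access')
if [ "$count" -gt 0 ]; then
  echo "ALERT ShadowFileAccessed: $count matching events"
fi
```

This is also why the rule keys in hardening.rules matter: the alert matches on the key string, so renaming a key silently breaks the corresponding alert.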
Expected Behaviour
- Audit logs from all hosts arrive in centralized storage within 30 seconds of generation
- auditctl -s shows lost = 0 (no events dropped)
- Security queries return results within 5 seconds for the 30-day retention window
- Alert fires within 2 minutes of a security event (shadow access, user creation, privilege escalation)
- No audit event loss under sustained load (verified with auditctl status)
- 30-day retention minimum for investigation; 12-month archival for compliance
Trade-offs
| Decision | Impact | Risk | Mitigation |
|---|---|---|---|
| Comprehensive audit rules (CIS-level) | 1-5GB/host/day log volume | Storage costs grow linearly with fleet size; auditd adds 1-3% CPU overhead | Exclude high-volume, low-value events. Ship to cost-effective storage (Loki over Elasticsearch). |
| Elasticsearch backend | Full-text search, mature alerting ecosystem | Cluster management is a full-time job past 20 hosts | This is the primary transition trigger: switch to a managed backend when ES management exceeds available engineering time. |
| Loki backend | 5-10x cheaper than Elasticsearch; simpler operations | Label-based queries only (no full-text search across log content) | Use Loki for cost-effective retention; supplement with Grafana dashboards for common security queries. |
| Real-time shipping (Vector) | Sub-30-second delivery; attacker cannot delete logs before shipping | Network bandwidth (1-5GB/host/day); Vector resource usage (50-100MB RAM) | Vector’s disk buffer prevents loss during network outages. |
| Backlog limit = 32768 | Prevents audit event loss on busy systems | Higher kernel memory usage (~256KB) | Negligible on modern systems. Monitor the auditctl -s lost counter. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| auditd buffer overflow | Events lost; auditctl -s shows lost > 0 | Monitor the lost counter via a Prometheus node_exporter textfile collector | Increase the kernel backlog limit (-b). Reduce rule scope for non-critical hosts. |
| Vector crashes | Logs buffer on disk but don’t ship | Vector health check fails; log delivery delay metric increases | Restart Vector. Disk buffer replays missed events automatically. Check Vector logs for the crash cause. |
| Elasticsearch cluster unhealthy | New logs rejected; queries fail or timeout | ES cluster health API shows yellow/red; Prometheus ES exporter alerts | Fix shard allocation; add nodes; or migrate to managed backend (Grafana Cloud #108, Axiom #112). |
| Attacker deletes local logs before shipping | Gap in centralized logs for the compromised host (if shipping delay > attacker speed) | Log gap detection: expected log rate per host drops to zero | Minimize shipping delay (sub-30-second with Vector). Use the af_unix audisp plugin for lowest possible latency. Ship to immutable storage (Backblaze B2 (#161) with write-only credentials). |
| Audit rules too broad | High volume; disk fills; performance degradation | Disk usage alerts; auditd CPU usage; host performance metrics | Disable exec_log rule (highest volume). Use user_exec (logs only non-system users). Exclude specific high-frequency programs. |
When to Consider a Managed Alternative
Audit log pain is universal and the transition trigger is clear:
- Self-managed Elasticsearch cluster management is a full-time job past 20 hosts (index lifecycle, shard management, capacity planning, version upgrades, backup verification).
- Storage costs for audit logs grow to terabytes within months for a fleet of any size.
- Query performance degrades without continuous index optimisation.
Recommended providers:
- Grafana Cloud (#108): Managed Loki for log storage, Prometheus-style alerting, native Grafana dashboards. Start free (50GB logs/month), scale as needed. Loki’s label-based query model covers 80% of security queries.
- Axiom (#112): 500GB/month free ingestion, unlimited retention, serverless query (zero cluster management). Best for teams that want to ingest everything and query later without managing infrastructure.
- Better Stack (#113): Integrated logging + uptime monitoring + incident management. Good for teams that want observability and incident response in one platform.
- Backblaze (#161) B2 / Wasabi (#162): For long-term immutable archival. Ship a copy of all audit logs to write-only object storage for 12-month compliance retention at $0.006/GB/month.
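The $0.006/GB/month figure makes archival cost easy to bound. A back-of-envelope sketch (fleet size and per-host volume are assumptions, taken from the middle of the 1-5GB range above):

```shell
# Assumptions: 20 hosts, 2 GB/host/day, 12-month rolling archive,
# $0.006 per GB per month (Backblaze B2 list price cited above).
hosts=20
gb_per_host_day=2
days=365

total_gb=$((hosts * gb_per_host_day * days))   # 14600 GB in the archive
awk -v gb="$total_gb" \
  'BEGIN { printf "archive: %d GB, steady-state cost: $%.2f/month\n", gb, gb * 0.006 }'
```

Roughly $88/month for a year of immutable evidence across the fleet, versus terabytes of hot Elasticsearch storage at a multiple of that price.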
What you still control: auditd rule design (what to log), Vector/Fluentd pipeline configuration (how to ship), and alert rule logic (what to alert on). The managed provider handles storage, indexing, query infrastructure, and retention management.
Premium content pack: auditd rule collection, with rules for CIS Level 1, CIS Level 2, SOC 2, and NIST 800-53 AU controls. Includes Vector pipeline configurations for each managed backend and Grafana dashboard templates for security investigation.