Cgroup v2 Resource Isolation: Preventing Resource Exhaustion Attacks on Shared Systems
Problem
Without resource limits, a single service, container, or compromised process can consume all available CPU, memory, I/O bandwidth, or PIDs on a host. This denies service to every other workload on the same machine:
- A fork bomb (`:(){ :|:& };:`) creates processes exponentially until the system runs out of PIDs and becomes unresponsive.
- A memory leak in one service triggers the kernel OOM killer, which may kill a different, healthy service that happens to be the largest consumer.
- A runaway log rotation or backup job saturates disk I/O, causing database queries on the same host to time out.
- A cryptocurrency miner deployed through a compromised dependency pins all CPU cores at 100%, starving legitimate workloads.
Cgroup v2 (the unified cgroup hierarchy) is the mechanism Linux provides to enforce per-service and per-container resource limits. It is built into the kernel and managed through systemd on modern distributions. Most teams do not configure limits until after an incident, because profiling workloads to set correct limits takes effort.
Target systems: Ubuntu 24.04 LTS, Debian 12, RHEL 9 / Rocky Linux 9, any system running systemd 252+ with kernel 5.15+.
Threat Model
- Adversary: Compromised service performing resource exhaustion (either intentionally by an attacker, or unintentionally through a bug or misconfiguration). Or: an unprivileged local user running a fork bomb or memory-consuming process.
- Access level: Any process running on the host, including containerized workloads.
- Objective: Denial of service to other workloads on the same host. Force the OOM killer to terminate critical services. Saturate I/O to cause cascading timeouts.
- Blast radius: All services on the host. Without resource isolation, one workload’s resource consumption affects every other workload sharing the same kernel.
Configuration
Verify Cgroup v2 is Active
# Check cgroup version
stat -fc %T /sys/fs/cgroup
# Expected output: "cgroup2fs" (cgroup v2)
# If output is "tmpfs", you are on cgroup v1
# Verify systemd is using the unified hierarchy
grep -o 'systemd.unified_cgroup_hierarchy=[0-9]' /proc/cmdline
# Expected: systemd.unified_cgroup_hierarchy=1 (or absent, which defaults to v2 on modern distros)
If you are still on cgroup v1, migrate by adding the kernel parameter:
# /etc/default/grub
GRUB_CMDLINE_LINUX="$EXISTING_VALUES systemd.unified_cgroup_hierarchy=1"
sudo update-grub && sudo systemctl reboot
systemd Slice Configuration
systemd organises services into slices. Configure resource limits per slice to isolate categories of workloads.
Create a slice for web-facing services:
# /etc/systemd/system/web.slice
[Slice]
Description=Web Services Slice
# CPU: relative weight (1-10000, default 100)
# This slice gets 4x the CPU of a default slice when there is contention.
# When CPU is idle, there is no restriction.
CPUWeight=400
# Memory: hard limit. OOM killer activates if this is exceeded.
MemoryMax=4G
# Memory: high watermark. The kernel reclaims memory aggressively above this.
# Processes are not killed but are slowed by reclaim pressure.
MemoryHigh=3G
# I/O: relative weight (1-10000, default 100)
IOWeight=200
# PIDs: maximum number of tasks (processes + threads)
TasksMax=4096
Assign a service to the slice:
# /etc/systemd/system/myapp.service.d/resources.conf
[Service]
Slice=web.slice
# Per-service limits within the slice
MemoryMax=2G
MemoryHigh=1536M
CPUQuota=200%
TasksMax=512
Create a slice for background/batch work:
# /etc/systemd/system/batch.slice
[Slice]
Description=Batch Processing Slice
CPUWeight=50
MemoryMax=2G
MemoryHigh=1536M
IOWeight=50
TasksMax=1024
Apply the changes:
sudo systemctl daemon-reload
# Move a running service to the new slice
sudo systemctl set-property myapp.service Slice=web.slice
# Verify cgroup placement
systemd-cgls
Preventing Fork Bombs with PID Limits
The most effective fork bomb defence is a PID limit. Without one, a fork bomb can exhaust the system-wide PID space (kernel.pid_max: historically 32768, though systemd raises it much higher on modern 64-bit systems) or consume enough memory and scheduler time to make the host unresponsive.
# /etc/systemd/system/user-.slice.d/pid-limit.conf
[Slice]
# Limit each user session to 512 tasks
TasksMax=512
For the system-wide default:
# /etc/systemd/system.conf.d/pid-limits.conf
[Manager]
DefaultTasksMax=4096
Test the fork bomb defence:
# As an unprivileged user with the PID limit applied:
:(){ :|:& };:
# Expected: the fork bomb hits the TasksMax limit quickly.
# The user's session becomes slow but other services are unaffected.
# Check with: systemctl status user-1000.slice
Container Runtime Cgroup Settings
For containerd, configure default resource limits:
# /etc/containerd/config.toml
[plugins."io.containerd.cri.v1.runtime"]
[plugins."io.containerd.cri.v1.runtime".containerd]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc.options]
SystemdCgroup = true
`SystemdCgroup = true` ensures containerd uses the systemd cgroup driver, which places containers in the systemd hierarchy and makes them visible to systemd-cgtop and systemctl tooling. (The `io.containerd.cri.v1.runtime` plugin path shown above is the containerd 2.x layout; on containerd 1.x the section is `[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]`.)
Kubernetes Resource Limits
In Kubernetes, resource limits map to cgroup constraints on the node:
# Pod specification with resource limits
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:v1.2.3
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "1000m"
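To make the mapping concrete, the sketch below shows how the kubelet translates pod resource values into cgroup v2 interface files (`cpu.max`, `cpu.weight`, `memory.max`). The helper names are mine; the formulas follow the kubelet's documented cgroup v1-to-v2 conversion, so treat this as an illustration rather than the kubelet's actual code.

```python
# Illustrative sketch of Kubernetes resource -> cgroup v2 mapping.
# Helper names are hypothetical; formulas mirror the kubelet's conversion.

CPU_PERIOD_US = 100_000  # default CFS period used by the kubelet (microseconds)

def cpu_limit_to_cpu_max(milli_cpu: int) -> str:
    """cpu limit (e.g. 1000m) -> contents of cpu.max: "<quota> <period>"."""
    quota = milli_cpu * CPU_PERIOD_US // 1000
    return f"{quota} {CPU_PERIOD_US}"

def cpu_request_to_cpu_weight(milli_cpu: int) -> int:
    """cpu request -> cpu.weight (1-10000), via the legacy cpu.shares value."""
    shares = milli_cpu * 1024 // 1000            # cgroup v1 cpu.shares
    return 1 + ((shares - 2) * 9999) // 262142   # kubelet's v1 -> v2 mapping

def memory_limit_to_memory_max(mebibytes: int) -> int:
    """memory limit (e.g. 512Mi) -> memory.max in bytes."""
    return mebibytes * 1024 * 1024

# The pod above: cpu limit 1000m, cpu request 250m, memory limit 512Mi
print(cpu_limit_to_cpu_max(1000))       # 100000 100000
print(cpu_request_to_cpu_weight(250))   # 10
print(memory_limit_to_memory_max(512))  # 536870912
```

With a 1000m limit, `cpu.max` becomes `100000 100000`: the container may use one full CPU per 100ms period, after which the scheduler throttles it.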
Enforce defaults across a namespace with a LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "128Mi"
      cpu: "100m"
    max:
      memory: "4Gi"
      cpu: "4000m"
    type: Container
Monitoring with systemd-cgtop
# Real-time cgroup resource usage (like top, but for cgroups)
systemd-cgtop
# Check a specific slice
systemctl status web.slice
# Shows: CPU, memory, and task count for the slice and all services in it
# Check cgroup limits for a service
systemctl show myapp.service | grep -E 'Memory|CPU|Tasks'
# MemoryMax=2147483648
# CPUQuota=200%
# TasksMax=512
Detecting Cgroup Escapes
Monitor for processes running outside expected cgroup hierarchies:
#!/bin/bash
# detect-cgroup-escape.sh
# Alert if any non-kernel process is in the root cgroup
for pid in /proc/[0-9]*; do
pid_num=$(basename "$pid")
cgroup=$(cat "$pid/cgroup" 2>/dev/null)
# Processes in the root cgroup (0::/) should only be kernel threads
if echo "$cgroup" | grep -q "^0::/$" ; then
name=$(cat "$pid/comm" 2>/dev/null)
# Kernel threads have PPID 2 (kthreadd)
ppid=$(awk '/^PPid:/{print $2}' "$pid/status" 2>/dev/null)
if [ "$ppid" != "2" ] && [ "$ppid" != "0" ]; then
echo "WARNING: Process $pid_num ($name) is in root cgroup"
fi
fi
done
Expected Behaviour
After configuring cgroup v2 resource limits:
- `stat -fc %T /sys/fs/cgroup` returns `cgroup2fs`
- `systemd-cgtop` shows resource usage broken down by slice and service
- A service exceeding its `MemoryMax` is killed by the OOM killer within the cgroup (not the system-wide OOM killer)
- A fork bomb in a user session hits `TasksMax`, and new `fork()` calls return `EAGAIN` instead of creating new processes
- CPU-bound processes in a low-weight slice are throttled when high-weight slices need CPU
- I/O-bound batch jobs do not starve latency-sensitive web services
- Container resource limits appear as cgroup constraints under `/sys/fs/cgroup/system.slice/`
- Other services on the host continue operating normally during a resource exhaustion event in an isolated slice
Trade-offs
| Control | Benefit | Cost | Mitigation |
|---|---|---|---|
| CPU limits (CPUQuota) | Prevents CPU starvation | Causes throttling and latency spikes during CPU contention even if other CPUs are idle. CFS bandwidth throttling can add up to 5ms per scheduling period. | Use CPUWeight (relative) instead of CPUQuota (absolute) when possible. CPUWeight only restricts when there is contention. |
| Memory limits (MemoryMax) | Prevents one service from consuming all RAM | Triggers OOM kill when the limit is hit. If the limit is too low, the service restarts repeatedly. | Set MemoryHigh to 75% of MemoryMax. This applies memory pressure (reclaim) before the hard kill. Profile workloads for 1-2 weeks before setting final limits. |
| PID limits (TasksMax) | Prevents fork bombs and runaway thread creation | Applications with large thread pools (Java, Go with many goroutines) may hit the limit under normal load. | Profile the application’s peak task count and set TasksMax to 2x that value. Monitor tasks_current via systemd or Prometheus. |
| I/O limits (IOWeight) | Prevents I/O starvation | Batch jobs take longer to complete when I/O-sensitive services need bandwidth. | Use IOWeight (relative) for fairness. Use IOReadBandwidthMax/IOWriteBandwidthMax only when you need a hard ceiling. |
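The mitigations above can be reduced to a small calculation: once a workload has been profiled, derive `MemoryMax` from peak usage plus headroom, set `MemoryHigh` to 75% of `MemoryMax`, and set `TasksMax` to twice the observed peak task count. A minimal sketch (the helper name and the 25% headroom default are my assumptions, not values from this article):

```python
# Sketch: derive systemd limits from profiled peaks, per the mitigations
# in the trade-offs table. Function name and headroom default are illustrative.

def derive_limits(peak_memory_bytes: int, peak_tasks: int,
                  headroom: float = 1.25) -> dict:
    """Map profiled peaks to MemoryMax / MemoryHigh / TasksMax values."""
    memory_max = int(peak_memory_bytes * headroom)  # peak + 25% headroom
    return {
        "MemoryMax": memory_max,
        "MemoryHigh": int(memory_max * 0.75),  # pressure before the hard kill
        "TasksMax": peak_tasks * 2,            # 2x observed peak task count
    }

# Example: service peaked at ~1.2 GiB and 180 tasks during profiling
limits = derive_limits(peak_memory_bytes=1_288_490_188, peak_tasks=180)
print(limits)
```

Feed the output values back into the slice or service drop-in files shown earlier, then continue monitoring `memory.peak` to confirm the headroom holds.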
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| MemoryMax too low | Service is OOM-killed repeatedly, enters a restart loop | `journalctl -u myapp.service` shows "Out of memory" and rapid restart/stop cycles; `systemctl status` shows "oom-kill" | Increase MemoryMax. Check `memory.peak` in the cgroup to find the actual peak usage: `cat /sys/fs/cgroup/web.slice/myapp.service/memory.peak` |
| CPUQuota too restrictive | Service responds slowly, request timeouts increase | Application latency metrics spike; `systemd-cgtop` shows the service at 100% of its quota while the host has idle CPU | Switch from CPUQuota to CPUWeight, or increase the quota. CPUWeight is almost always the better choice for production services. |
| TasksMax hit during normal operation | Application fails to create new threads or processes | Application logs show "Resource temporarily unavailable" or "Cannot allocate memory" (misleading); `systemctl show myapp.service -p TasksCurrent` shows current equals max | Increase TasksMax. Profile the application to understand its thread/process model. |
| Cgroup v1/v2 mismatch | Container runtime fails to start or cannot apply limits | containerd/Docker logs show "cgroup driver mismatch" or "failed to create cgroup" | Ensure both the kernel and the container runtime use the same cgroup version. Set `SystemdCgroup = true` in containerd config and `systemd.unified_cgroup_hierarchy=1` in boot params. |
| OOM killer targets wrong process | The OOM killer in a cgroup kills a critical subprocess instead of the one consuming the most memory | Post-mortem shows the wrong process was killed; `dmesg` shows OOM details | Set `OOMPolicy=kill` in the systemd service to kill the entire service instead of individual processes. Use `oom_score_adj` to prioritise which processes survive. |
When to Consider a Managed Alternative
Transition point: When you are fine-tuning cgroup limits per workload across more than 10 services, spending 2-4 hours profiling each service, and need to maintain those limits as workload patterns change.
What managed providers handle:
Managed Kubernetes providers (Civo (#22), DigitalOcean (#21), Vultr (#12), Linode (#13)) enforce resource isolation at the platform level. Kubernetes resource requests and limits translate to cgroup v2 constraints on the node, and the kubelet handles the cgroup hierarchy. You define resource requirements in your pod specs, and the platform handles the low-level enforcement.
Runtime security platforms (Sysdig (#122)) monitor for resource abuse patterns and cgroup escape attempts. They can detect when a process breaks out of its expected cgroup hierarchy, when resource consumption patterns indicate cryptomining, or when a fork bomb is in progress.
What you still control: Even on managed Kubernetes, you must set resource requests and limits in your pod specifications. The platform enforces them, but you define the values. Use Kubernetes LimitRange and ResourceQuota objects to set namespace-level defaults and ceilings so that no team can deploy without resource limits.
Automation path: For self-managed hosts, start with the systemd slice configuration in this article. Profile workloads for 1-2 weeks using systemd-cgtop and memory.peak readings before setting hard limits. For fleet-wide enforcement, integrate resource limit verification into your configuration management tool.
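One way to verify limits fleet-wide is to parse `systemctl show <unit> -p MemoryMax,TasksMax` output and flag any property still at `infinity` (systemd's value for "unlimited"). A sketch of that check, operating on captured output rather than calling systemctl itself (the function name and sample text are illustrative):

```python
# Sketch: audit check for services missing resource limits. Input is the
# text of `systemctl show <unit> -p MemoryMax,TasksMax`; function name
# and sample are illustrative.

def missing_limits(show_output: str) -> list:
    """Return property names whose value is 'infinity' (unlimited)."""
    missing = []
    for line in show_output.strip().splitlines():
        key, _, value = line.partition("=")
        if value == "infinity":
            missing.append(key)
    return missing

# A service with no memory limit configured:
sample = "MemoryMax=infinity\nTasksMax=512\n"
print(missing_limits(sample))  # ['MemoryMax']
```

Run this per unit from your configuration management tool and fail the compliance check when the returned list is non-empty.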