Pod Security Context Deep Dive: runAsNonRoot, readOnlyRootFilesystem, and Capabilities

Pod Security Context Deep Dive: runAsNonRoot, readOnlyRootFilesystem, and Capabilities

Problem

Kubernetes SecurityContext has over 15 configurable fields, but most teams only set runAsNonRoot: true and consider the job done. The remaining fields control critical security boundaries: whether the container can write to its filesystem, which Linux capabilities it holds, whether child processes can gain more privileges than the parent, and which seccomp profile restricts syscall access.

The specific challenges:

  • Missing fields leave default-open gaps. A container with runAsNonRoot: true but without readOnlyRootFilesystem: true can still write malicious binaries to the container filesystem. Without allowPrivilegeEscalation: false, a process can use setuid binaries to gain root. Without dropping capabilities, the container retains NET_RAW (enabling ARP spoofing) and other capabilities it does not need.
  • Pod-level vs. container-level settings cause confusion. SecurityContext exists at both spec.securityContext (pod level) and spec.containers[].securityContext (container level). Container-level settings override pod-level settings, but only for the fields that are set. Missing fields fall through to defaults, not to the pod-level value for all fields.
  • Common mistakes break workloads silently. Setting runAsUser: 0 alongside runAsNonRoot: true causes an admission error. Setting readOnlyRootFilesystem: true without providing writable volumes for /tmp or application caches causes crashes. Dropping ALL capabilities without adding back NET_BIND_SERVICE prevents web servers from binding to ports below 1024.
  • No built-in decision framework. Different workload types (web servers, databases, workers, init containers) need different SecurityContext configurations, but Kubernetes provides no guidance on which settings to apply to which workload type.

This article covers every SecurityContext field with practical examples, a decision matrix by workload type, common mistakes and how to avoid them, and enforcement using admission policies.

Target systems: Kubernetes 1.29+ with Pod Security Standards or a policy engine (Kyverno, OPA Gatekeeper) for enforcement.

Threat Model

  • Adversary: Attacker with code execution inside a container (via application vulnerability, compromised dependency, or malicious image).
  • Access level: Unprivileged process running inside a container with default SecurityContext settings.
  • Objective: Escalate from unprivileged container user to root (via setuid binaries or capability abuse), write persistent backdoors to the container filesystem, perform network attacks (ARP spoofing via NET_RAW), access host resources (via privileged mode or hostPID/hostNetwork), or escape the container entirely.
  • Blast radius: Without SecurityContext hardening, a compromised container can gain root inside the container, write and execute malicious binaries, spoof network traffic, and potentially escape to the host. With proper SecurityContext, the attacker is confined to a non-root, read-only, capability-dropped environment where privilege escalation paths are eliminated.

Configuration

Step 1: The Hardened Baseline SecurityContext

This is the recommended starting configuration for most workloads. Every field is explicitly set rather than relying on defaults:

# hardened-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: web
          image: registry.example.com/web-app:2.1.0
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache/app
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 100Mi
        - name: cache
          emptyDir:
            sizeLimit: 500Mi

Step 2: SecurityContext Field Reference

Pod-level fields (under spec.securityContext):

Field Purpose Recommended Value
runAsNonRoot Prevents containers from running as UID 0 true
runAsUser Sets the UID for all containers Application-specific (1000+)
runAsGroup Sets the primary GID for all containers Match runAsUser
fsGroup Sets the GID for volume mounts; files created on volumes get this GID Match runAsGroup
fsGroupChangePolicy Controls when fsGroup ownership is applied to volumes OnRootMismatch (faster than default Always)
supplementalGroups Additional GIDs for the container process Only add groups needed for file access
seccompProfile Restricts which syscalls the container can make RuntimeDefault minimum
sysctls Kernel parameter tuning for the pod’s network namespace Only set when required (e.g., net.core.somaxconn)

Container-level fields (under spec.containers[].securityContext):

Field Purpose Recommended Value
allowPrivilegeEscalation Controls whether a process can gain more privileges than its parent false
readOnlyRootFilesystem Mounts the container’s root filesystem as read-only true
capabilities.drop Linux capabilities to remove ALL
capabilities.add Linux capabilities to add back after dropping Only what is needed
privileged Gives the container full host access false (never set to true)
procMount Controls what /proc exposes Default (masked proc)
seccompProfile Per-container seccomp override Set if container needs a different profile than pod default

Step 3: Workload-Specific Configurations

Web server (nginx, reverse proxy) that needs to bind to port 80/443:

# nginx-security-context.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 101
        runAsGroup: 101
        fsGroup: 101
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: nginx
          image: registry.example.com/nginx:1.27.0
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache/nginx
            - name: run
              mountPath: /var/run
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
        - name: run
          emptyDir: {}

Note: Modern nginx images support running as non-root on ports above 1024. Configure nginx to listen on 8080 instead of 80, and use a Service to map port 80 to 8080. This avoids needing the NET_BIND_SERVICE capability entirely.

Database (PostgreSQL) with persistent storage:

# postgres-security-context.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        runAsGroup: 999
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: postgres
          image: registry.example.com/postgres:16.2
          ports:
            - containerPort: 5432
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: run
              mountPath: /var/run/postgresql
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: run
          emptyDir: {}
        - name: tmp
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi

Init container that needs temporary elevated access:

# init-container-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-init
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-with-init
  template:
    metadata:
      labels:
        app: app-with-init
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: fix-permissions
          image: registry.example.com/busybox:1.36
          command: ["sh", "-c", "chown -R 1000:1000 /data"]
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
              add:
                - CHOWN
                - FOWNER
          volumeMounts:
            - name: data
              mountPath: /data
      containers:
        - name: app
          image: registry.example.com/app:1.0.0
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: app-data

Note: The init container runs as root with only CHOWN and FOWNER capabilities, then exits. The main container runs as non-root with all capabilities dropped.

Step 4: Decision Matrix by Workload Type

Workload Type runAsNonRoot readOnlyRootFilesystem Capabilities allowPrivilegeEscalation Notes
Stateless web app true true Drop ALL false Add emptyDir for /tmp
API server (Go, Java) true true Drop ALL false Add emptyDir for temp files and caches
nginx/reverse proxy true true Drop ALL false Listen on 8080+; Service maps to 80
PostgreSQL/MySQL true true Drop ALL false fsGroup must match image UID; emptyDir for /run
Redis true true Drop ALL false emptyDir for /data if not using persistence
Worker/queue consumer true true Drop ALL false Simplest case; no special requirements
Init container (chown) false (root) false Drop ALL, add CHOWN + FOWNER false Runs briefly, then exits
CronJob/batch true true Drop ALL false Same as worker
Monitoring agent true true Drop ALL false May need hostPath mounts for node metrics

Step 5: Enforce with Admission Policy

Use Kyverno to enforce SecurityContext requirements across the cluster:

# kyverno-require-security-context.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-security-context
  annotations:
    policies.kyverno.io/title: Require Security Context
    policies.kyverno.io/description: >-
      Requires all containers to set readOnlyRootFilesystem,
      drop ALL capabilities, and disable privilege escalation.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-read-only-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must set readOnlyRootFilesystem: true"
        pattern:
          spec:
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true
    - name: require-drop-all-capabilities
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must drop ALL capabilities"
        pattern:
          spec:
            containers:
              - securityContext:
                  capabilities:
                    drop:
                      - ALL
    - name: require-no-privilege-escalation
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must set allowPrivilegeEscalation: false"
        pattern:
          spec:
            containers:
              - securityContext:
                  allowPrivilegeEscalation: false

Step 6: Test SecurityContext Configurations

Verify that the settings are applied correctly inside the running container:

# Check the running user
kubectl exec -n production deploy/web-app -- id
# Expected: uid=1000 gid=1000 groups=1000

# Check filesystem is read-only
kubectl exec -n production deploy/web-app -- touch /test-file 2>&1
# Expected: touch: /test-file: Read-only file system

# Check writable emptyDir volumes
kubectl exec -n production deploy/web-app -- touch /tmp/test-file
# Expected: no error

# Check capabilities
kubectl exec -n production deploy/web-app -- cat /proc/1/status | grep Cap
# Expected: CapBnd and CapEff should show 0000000000000000 (no capabilities)

# Verify no privilege escalation
kubectl exec -n production deploy/web-app -- cat /proc/1/status | grep NoNewPrivs
# Expected: NoNewPrivs: 1

# Test that a privileged pod is rejected by admission policy
kubectl run test-privileged --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"test","image":"busybox","securityContext":{"privileged":true}}]}}'
# Expected: Error from server: admission webhook denied the request

Expected Behaviour

After applying SecurityContext configurations:

  • All containers run as non-root (UID 1000+), verified by id command output
  • Container root filesystems are read-only; writes to non-volume paths fail with “Read-only file system”
  • Application writes to emptyDir volumes at /tmp and application-specific cache directories succeed normally
  • Linux capabilities are fully dropped; cat /proc/1/status shows zeroed capability bitmasks
  • Privilege escalation is disabled; setuid binaries inside the container have no effect
  • Admission policies block pods that do not meet SecurityContext requirements
  • Init containers that require temporary elevated access run successfully with minimal capabilities, then exit before the main container starts

Trade-offs

Control Impact Risk Mitigation
readOnlyRootFilesystem Prevents writing backdoors or modifying binaries in the container Applications that write to the local filesystem (log files, temp files, caches, PID files) crash Add emptyDir volumes for every writable path. Check application documentation for writable directories
Drop ALL capabilities Eliminates capability-based privilege escalation and network attacks Containers that need specific capabilities (NET_BIND_SERVICE for port 80, SYS_PTRACE for debugging) fail Drop ALL, then add back only the specific capabilities needed. Never add SYS_ADMIN
runAsNonRoot + specific UID Prevents root-level access inside the container Images built to run as root (many Docker Hub images) fail to start Use -nonroot image variants or rebuild images with a non-root USER instruction
allowPrivilegeEscalation: false Blocks setuid binaries and capability inheritance Some legacy applications depend on setuid for operation (older versions of ping, su, sudo) Replace setuid-dependent functionality with capability-based or redesigned alternatives
Admission policy enforcement Prevents non-compliant pods cluster-wide Blocks legitimate workloads that have not been updated to meet requirements Roll out in audit mode first. Exclude system namespaces (kube-system). Give teams time to update manifests

Failure Modes

Failure Symptom Detection Recovery
readOnlyRootFilesystem without writable volumes Application crashes on startup with “read-only file system” errors Application logs; pod enters CrashLoopBackOff Identify which paths the application writes to (strace or error messages), add emptyDir volumes for those paths
runAsUser conflicts with image Container process cannot read its own binary or config files because they are owned by a different UID Permission denied errors in application logs Set fsGroup to match the expected GID, or rebuild the image with correct file ownership
runAsNonRoot: true with image that defaults to root Pod fails admission with “container has runAsNonRoot and image will run as root” kubectl describe pod shows the error; pod stays in Pending Set an explicit runAsUser to a non-root UID, or use an image built with a non-root USER
Capabilities dropped that application needs Application-specific functionality fails (e.g., cannot bind to port 443, cannot send raw packets) Feature-specific errors in application logs Identify the required capability and add it back minimally. Never re-add ALL
Kyverno policy blocks system pods kube-system pods fail to deploy after cluster upgrade System pods in Pending state; Kyverno audit logs show denials Exclude kube-system and other system namespaces from the policy using exclude rules

When to Consider a Managed Alternative

Transition point: Writing SecurityContext for a handful of workloads is straightforward. When your cluster runs 50+ deployments across multiple teams, ensuring every workload has a correct SecurityContext becomes a governance challenge. If teams regularly deploy pods that fail admission policies or run with incomplete security settings, automated scanning and remediation tools reduce friction.

Recommended providers:

  • Snyk (#48): Scans Kubernetes manifests, Helm charts, and Kustomize overlays for missing or misconfigured SecurityContext fields during CI/CD. Identifies containers running as root, missing readOnlyRootFilesystem, or retaining unnecessary capabilities before deployment.

What you still control: The SecurityContext values for each workload, the decision matrix for which settings apply to which workload type, admission policy configuration and exceptions, and the testing process for validating security settings against running containers.