Pod Security Context Deep Dive: runAsNonRoot, readOnlyRootFilesystem, and Capabilities
Problem
Kubernetes SecurityContext has over 15 configurable fields, but most teams only set runAsNonRoot: true and consider the job done. The remaining fields control critical security boundaries: whether the container can write to its filesystem, which Linux capabilities it holds, whether child processes can gain more privileges than the parent, and which seccomp profile restricts syscall access.
The specific challenges:
- Missing fields leave default-open gaps. A container with runAsNonRoot: true but without readOnlyRootFilesystem: true can still write malicious binaries to the container filesystem. Without allowPrivilegeEscalation: false, a process can use setuid binaries to gain root. Without dropping capabilities, the container retains NET_RAW (enabling ARP spoofing) and other capabilities it does not need.
- Pod-level vs. container-level settings cause confusion. SecurityContext exists at both spec.securityContext (pod level) and spec.containers[].securityContext (container level). For fields that exist at both levels (runAsUser, runAsGroup, runAsNonRoot, seccompProfile), a value set on the container overrides the pod-level value, and an unset container field inherits the pod-level value. Fields that exist only at container level (allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities, privileged) cannot be set once for the whole pod; a container that omits them gets the runtime default, not a hardened value.
- Common mistakes break workloads silently. Setting runAsUser: 0 alongside runAsNonRoot: true prevents the container from starting: the kubelet refuses to run it and reports CreateContainerConfigError. Setting readOnlyRootFilesystem: true without providing writable volumes for /tmp or application caches causes crashes. Dropping ALL capabilities without adding back NET_BIND_SERVICE prevents web servers from binding to ports below 1024.
- No built-in decision framework. Different workload types (web servers, databases, workers, init containers) need different SecurityContext configurations, but Kubernetes provides no guidance on which settings to apply to which workload type.
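The interaction between the two levels is easier to see in a manifest. The following is a minimal sketch, assuming a hypothetical pod named inheritance-demo with placeholder images; the comments mark which values each container ends up with.

```yaml
# Hypothetical pod illustrating pod-level vs. container-level SecurityContext.
apiVersion: v1
kind: Pod
metadata:
  name: inheritance-demo          # illustrative name
spec:
  securityContext:                # pod level: shared by every container
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: inherits
      image: registry.example.com/app:1.0.0
      # No container-level securityContext:
      #   runAsUser 1000, runAsNonRoot, and seccompProfile come from the pod.
      #   allowPrivilegeEscalation, readOnlyRootFilesystem, and capabilities
      #   have no pod-level equivalent, so this container keeps the defaults.
    - name: overrides
      image: registry.example.com/app:1.0.0
      securityContext:
        runAsUser: 2000           # replaces the pod-level 1000 for this container only
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
```

Only the second container is hardened against filesystem writes and capability abuse, even though both share the pod-level settings.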
This article covers every SecurityContext field with practical examples, a decision matrix by workload type, common mistakes and how to avoid them, and enforcement using admission policies.
Target systems: Kubernetes 1.29+ with Pod Security Standards or a policy engine (Kyverno, OPA Gatekeeper) for enforcement.
Threat Model
- Adversary: Attacker with code execution inside a container (via application vulnerability, compromised dependency, or malicious image).
- Access level: Unprivileged process running inside a container with default SecurityContext settings.
- Objective: Escalate from unprivileged container user to root (via setuid binaries or capability abuse), write persistent backdoors to the container filesystem, perform network attacks (ARP spoofing via NET_RAW), access host resources (via privileged mode or hostPID/hostNetwork), or escape the container entirely.
- Blast radius: Without SecurityContext hardening, a compromised container can gain root inside the container, write and execute malicious binaries, spoof network traffic, and potentially escape to the host. With proper SecurityContext, the attacker is confined to a non-root, read-only, capability-dropped environment where privilege escalation paths are eliminated.
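For reference, this is roughly what the adversary's starting point looks like. The sketch below assumes a hypothetical pod with no SecurityContext at all; the comments list the effective settings, with the capability set described as it is on containerd (other runtimes may differ).

```yaml
# Hypothetical pod with no SecurityContext; comments list what is left open.
apiVersion: v1
kind: Pod
metadata:
  name: default-settings-demo     # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      # Effective settings when nothing is specified:
      #   user:                      whatever the image declares (root if it declares none)
      #   root filesystem:           writable
      #   capabilities:              the runtime's default set (includes NET_RAW on containerd)
      #   allowPrivilegeEscalation:  true
      #   seccomp:                   Unconfined, unless the kubelet enables seccompDefault
      #   privileged:                false
```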
Configuration
Step 1: The Hardened Baseline SecurityContext
This is the recommended starting configuration for most workloads. Every field is explicitly set rather than relying on defaults:
```yaml
# hardened-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: web
          image: registry.example.com/web-app:2.1.0
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache/app
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 100Mi
        - name: cache
          emptyDir:
            sizeLimit: 500Mi
```
Step 2: SecurityContext Field Reference
Pod-level fields (under spec.securityContext):
| Field | Purpose | Recommended Value |
|---|---|---|
| runAsNonRoot | Prevents containers from running as UID 0 | true |
| runAsUser | Sets the UID for all containers | Application-specific (1000+) |
| runAsGroup | Sets the primary GID for all containers | Match runAsUser |
| fsGroup | Sets the GID for volume mounts; files created on volumes get this GID | Match runAsGroup |
| fsGroupChangePolicy | Controls when fsGroup ownership is applied to volumes | OnRootMismatch (faster than default Always) |
| supplementalGroups | Additional GIDs for the container process | Only add groups needed for file access |
| seccompProfile | Restricts which syscalls the container can make | RuntimeDefault minimum |
| sysctls | Kernel parameter tuning for the pod’s network namespace | Only set when required (e.g., net.core.somaxconn) |
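The less common pod-level fields can be combined as in the fragment below. This is a sketch only: the GID 3000 and the port range are illustrative assumptions, not values the workloads in this article require.

```yaml
# Fragment of a pod template; GIDs and sysctl value are illustrative.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    fsGroupChangePolicy: OnRootMismatch      # skip the recursive ownership change when the volume root already matches
    supplementalGroups:
      - 3000                                 # hypothetical extra GID that owns a shared directory
    seccompProfile:
      type: RuntimeDefault
    sysctls:
      - name: net.ipv4.ip_local_port_range   # part of the kubelet's safe sysctl set
        value: "32768 60999"
```

Note that net.core.somaxconn is not in the safe sysctl set, so a pod that sets it is only admitted on nodes whose kubelet explicitly allows it.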
Container-level fields (under spec.containers[].securityContext):
| Field | Purpose | Recommended Value |
|---|---|---|
| allowPrivilegeEscalation | Controls whether a process can gain more privileges than its parent | false |
| readOnlyRootFilesystem | Mounts the container’s root filesystem as read-only | true |
| capabilities.drop | Linux capabilities to remove | ALL |
| capabilities.add | Linux capabilities to add back after dropping | Only what is needed |
| privileged | Gives the container full host access | false (never set to true) |
| procMount | Controls what /proc exposes | Default (masked proc) |
| seccompProfile | Per-container seccomp override | Set if container needs a different profile than pod default |
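A per-container seccomp override might look like the following sketch. The profile file name profiles/app-audit.json is hypothetical; a Localhost profile has to exist on every node under the kubelet's seccomp directory before the pod can start.

```yaml
# Fragment of a pod template; profiles/app-audit.json is a hypothetical file name.
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault              # pod-wide default
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: Localhost               # overrides the pod-level profile for this container
          localhostProfile: profiles/app-audit.json   # relative to the kubelet's seccomp directory
```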
Step 3: Workload-Specific Configurations
Web server (nginx, reverse proxy) that would traditionally bind to port 80/443:
```yaml
# nginx-security-context.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 101
        runAsGroup: 101
        fsGroup: 101
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: nginx
          image: registry.example.com/nginx:1.27.0
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache/nginx
            - name: run
              mountPath: /var/run
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
        - name: run
          emptyDir: {}
```
Note: Modern nginx images support running as non-root on ports above 1024. Configure nginx to listen on 8080 instead of 80, and use a Service to map port 80 to 8080. This avoids needing the NET_BIND_SERVICE capability entirely.
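The port mapping described in the note could be expressed with a Service like the one below. This is a sketch that assumes the app: nginx label and production namespace from the Deployment above; the file name and port name are illustrative.

```yaml
# nginx-service.yaml (illustrative file name)
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: production
spec:
  selector:
    app: nginx
  ports:
    - name: http
      protocol: TCP
      port: 80            # port clients connect to
      targetPort: 8080    # unprivileged port nginx listens on inside the container
```

Clients keep using port 80 while the container never needs to bind a privileged port.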
Database (PostgreSQL) with persistent storage:
```yaml
# postgres-security-context.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        runAsGroup: 999
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: postgres
          image: registry.example.com/postgres:16.2
          ports:
            - containerPort: 5432
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: run
              mountPath: /var/run/postgresql
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: run
          emptyDir: {}
        - name: tmp
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
Init container that needs temporary elevated access:
```yaml
# init-container-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-init
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-with-init
  template:
    metadata:
      labels:
        app: app-with-init
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: fix-permissions
          image: registry.example.com/busybox:1.36
          command: ["sh", "-c", "chown -R 1000:1000 /data"]
          securityContext:
            # Overrides the pod-level runAsNonRoot/runAsUser for this container only
            runAsNonRoot: false
            runAsUser: 0
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
              add:
                - CHOWN
                - FOWNER
          volumeMounts:
            - name: data
              mountPath: /data
      containers:
        - name: app
          image: registry.example.com/app:1.0.0
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: app-data
```
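Note: The init container runs as root with only CHOWN and FOWNER capabilities, then exits. The main container runs as non-root with all capabilities dropped. A namespace that enforces the restricted Pod Security Standard rejects this pod, because restricted applies to init containers too and forbids runAsUser: 0 and any added capability other than NET_BIND_SERVICE; this pattern needs the baseline level or an explicit exemption.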
Step 4: Decision Matrix by Workload Type
| Workload Type | runAsNonRoot | readOnlyRootFilesystem | Capabilities | allowPrivilegeEscalation | Notes |
|---|---|---|---|---|---|
| Stateless web app | true | true | Drop ALL | false | Add emptyDir for /tmp |
| API server (Go, Java) | true | true | Drop ALL | false | Add emptyDir for temp files and caches |
| nginx/reverse proxy | true | true | Drop ALL | false | Listen on 8080+; Service maps to 80 |
| PostgreSQL/MySQL | true | true | Drop ALL | false | runAsUser and fsGroup must match the UID/GID the image expects; emptyDir for /run |
| Redis | true | true | Drop ALL | false | emptyDir for /data if not using persistence |
| Worker/queue consumer | true | true | Drop ALL | false | Simplest case; no special requirements |
| Init container (chown) | false (root) | false | Drop ALL, add CHOWN + FOWNER | false | Runs briefly, then exits |
| CronJob/batch | true | true | Drop ALL | false | Same as worker |
| Monitoring agent | true | true | Drop ALL | false | May need hostPath mounts for node metrics |
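As one worked row of the matrix, the CronJob/batch case might look like the sketch below. The name nightly-report, the schedule, and the image are placeholders; the SecurityContext values are the same hardened baseline used for the web app.

```yaml
# Hypothetical CronJob; name, schedule, and image are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: report
              image: registry.example.com/report-job:1.0.0
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                capabilities:
                  drop:
                    - ALL
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
          volumes:
            - name: tmp
              emptyDir:
                sizeLimit: 100Mi
```

The SecurityContext sits four levels deeper than in a Deployment (under jobTemplate.spec.template.spec), which is an easy place to misindent it.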
Step 5: Enforce with Admission Policy
Use Kyverno to enforce SecurityContext requirements across the cluster:
```yaml
# kyverno-require-security-context.yaml
# Note: these patterns validate spec.containers only; initContainers and
# ephemeralContainers are not checked by the rules below.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-security-context
  annotations:
    policies.kyverno.io/title: Require Security Context
    policies.kyverno.io/description: >-
      Requires all containers to set readOnlyRootFilesystem,
      drop ALL capabilities, and disable privilege escalation.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-read-only-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must set readOnlyRootFilesystem: true"
        pattern:
          spec:
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true
    - name: require-drop-all-capabilities
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must drop ALL capabilities"
        pattern:
          spec:
            containers:
              - securityContext:
                  capabilities:
                    drop:
                      - ALL
    - name: require-no-privilege-escalation
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must set allowPrivilegeEscalation: false"
        pattern:
          spec:
            containers:
              - securityContext:
                  allowPrivilegeEscalation: false
```
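Clusters that rely on Pod Security Standards instead of a policy engine can get part of this enforcement from namespace labels. The sketch below assumes the production namespace used throughout this article and the built-in Pod Security admission controller.

```yaml
# Namespace labels for the built-in Pod Security admission controller.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # return a warning to the client
    pod-security.kubernetes.io/audit: restricted     # record violations in the audit log
```

The restricted profile requires runAsNonRoot, dropping ALL capabilities, allowPrivilegeEscalation: false, and a seccomp profile, but it does not check readOnlyRootFilesystem, so a policy-engine rule for that field remains useful alongside it.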
Step 6: Test SecurityContext Configurations
Verify that the settings are applied correctly inside the running container:
```bash
# Check the running user
kubectl exec -n production deploy/web-app -- id
# Expected: uid=1000 gid=1000 groups=1000

# Check filesystem is read-only
kubectl exec -n production deploy/web-app -- touch /test-file 2>&1
# Expected: touch: /test-file: Read-only file system

# Check writable emptyDir volumes
kubectl exec -n production deploy/web-app -- touch /tmp/test-file
# Expected: no error

# Check capabilities
kubectl exec -n production deploy/web-app -- cat /proc/1/status | grep Cap
# Expected: CapBnd and CapEff should show 0000000000000000 (no capabilities)

# Verify no privilege escalation
kubectl exec -n production deploy/web-app -- cat /proc/1/status | grep NoNewPrivs
# Expected: NoNewPrivs: 1

# Test that a privileged pod is rejected by admission policy
kubectl run test-privileged --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"test","image":"busybox","securityContext":{"privileged":true}}]}}'
# Expected: Error from server: admission webhook denied the request
```
Expected Behaviour
After applying SecurityContext configurations:
- All application containers run as non-root with the configured UID (1000 in the baseline example), verified by id command output
- Container root filesystems are read-only; writes to non-volume paths fail with “Read-only file system”
- Application writes to emptyDir volumes at /tmp and application-specific cache directories succeed normally
- Linux capabilities are fully dropped; cat /proc/1/status shows zeroed capability bitmasks
- Privilege escalation is disabled; setuid binaries inside the container no longer grant elevated privileges
- Admission policies block pods that do not meet SecurityContext requirements
- Init containers that require temporary elevated access run successfully with minimal capabilities, then exit before the main container starts
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| readOnlyRootFilesystem | Prevents writing backdoors or modifying binaries in the container | Applications that write to the local filesystem (log files, temp files, caches, PID files) crash | Add emptyDir volumes for every writable path. Check application documentation for writable directories |
| Drop ALL capabilities | Eliminates capability-based privilege escalation and network attacks | Containers that need specific capabilities (NET_BIND_SERVICE for port 80, SYS_PTRACE for debugging) fail | Drop ALL, then add back only the specific capabilities needed. Never add SYS_ADMIN |
| runAsNonRoot + specific UID | Prevents root-level access inside the container | Images built to run as root (many Docker Hub images) fail to start | Use -nonroot image variants or rebuild images with a non-root USER instruction |
| allowPrivilegeEscalation: false | Blocks setuid binaries and capability inheritance | Some legacy applications depend on setuid for operation (older versions of ping, su, sudo) | Replace setuid-dependent functionality with capability-based or redesigned alternatives |
| Admission policy enforcement | Prevents non-compliant pods cluster-wide | Blocks legitimate workloads that have not been updated to meet requirements | Roll out in audit mode first. Exclude system namespaces (kube-system). Give teams time to update manifests |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| readOnlyRootFilesystem without writable volumes | Application crashes on startup with “read-only file system” errors | Application logs; pod enters CrashLoopBackOff | Identify which paths the application writes to (strace or error messages), add emptyDir volumes for those paths |
| runAsUser conflicts with image | Container process cannot read its own binary or config files because they are owned by a different UID | Permission denied errors in application logs | Set runAsUser and runAsGroup to the UID/GID that owns the files in the image, or rebuild the image with correct file ownership. fsGroup only affects mounted volumes, not files in the image |
| runAsNonRoot: true with image that defaults to root | Container fails to start with “container has runAsNonRoot and image will run as root” | kubectl describe pod shows the error (CreateContainerConfigError); pod stays in Pending | Set an explicit runAsUser to a non-root UID, or use an image built with a non-root USER |
| Capabilities dropped that application needs | Application-specific functionality fails (e.g., cannot bind to port 443, cannot send raw packets) | Feature-specific errors in application logs | Identify the required capability and add it back minimally. Never re-add ALL |
| Kyverno policy blocks system pods | kube-system pods fail to deploy after cluster upgrade | System pods in Pending state; Kyverno audit logs show denials | Exclude kube-system and other system namespaces from the policy using exclude rules |
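The audit-first rollout and system-namespace exclusion mentioned in the tables above could be sketched as follows. This assumes the same kyverno.io/v1 ClusterPolicy schema used in Step 5 and shows a single rule for brevity; the list of excluded namespaces is an assumption to adapt per cluster.

```yaml
# Sketch: one rule from the Step 5 policy, in audit mode with kube-system excluded.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-security-context
spec:
  validationFailureAction: Audit    # report violations without blocking; switch to Enforce later
  background: true
  rules:
    - name: require-no-privilege-escalation
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system       # add other system namespaces as needed
      validate:
        message: "All containers must set allowPrivilegeEscalation: false"
        pattern:
          spec:
            containers:
              - securityContext:
                  allowPrivilegeEscalation: false
```

Running in Audit mode first surfaces the workloads that would be blocked, so teams can fix manifests before the policy is switched to Enforce.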
When to Consider a Managed Alternative
Transition point: Writing SecurityContext for a handful of workloads is straightforward. When your cluster runs 50+ deployments across multiple teams, ensuring every workload has a correct SecurityContext becomes a governance challenge. If teams regularly deploy pods that fail admission policies or run with incomplete security settings, automated scanning and remediation tools reduce friction.
Recommended providers:
- Snyk: Scans Kubernetes manifests, Helm charts, and Kustomize overlays for missing or misconfigured SecurityContext fields during CI/CD. Identifies containers running as root, missing readOnlyRootFilesystem, or retaining unnecessary capabilities before deployment.
What you still control: The SecurityContext values for each workload, the decision matrix for which settings apply to which workload type, admission policy configuration and exceptions, and the testing process for validating security settings against running containers.