Hardening the Kubernetes Scheduler: Topology Constraints and Security-Aware Placement
Problem
The Kubernetes scheduler places pods on nodes based on resource availability and basic constraints. By default, it does not consider security boundaries. A sensitive payment-processing pod can land on the same node as an untrusted third-party integration pod. If the third-party pod is compromised and the attacker achieves container escape, every pod on that node is exposed, including the payment processor.
This is not just a theoretical concern:
- Co-location of sensitive and untrusted workloads. Without scheduling constraints, the scheduler optimizes for resource packing. It will place high-security and low-trust workloads on the same node if that node has available resources.
- Replicas on the same node defeat high availability. If all replicas of a critical service land on one node, a single node failure takes down the entire service. The scheduler can spread replicas, but only if you configure topology spread constraints.
- Multi-tenant clusters share node pools by default. Without taints and tolerations, tenant A’s pods can run on the same nodes as tenant B’s pods. A noisy neighbour or a compromised pod affects co-located tenants.
- Compliance requirements may mandate physical separation. PCI-DSS and similar standards require that cardholder data environments are isolated. Logical namespace separation is not sufficient; workloads must run on dedicated infrastructure.
This article covers node affinity, taints and tolerations, topology spread constraints, pod anti-affinity, and multi-tenant scheduling patterns.
Target systems: Kubernetes 1.29+ with the default kube-scheduler. Works with managed and self-managed clusters.
Threat Model
- Adversary: Attacker who has compromised a low-trust workload (third-party integration, development pod, or untrusted tenant pod) and is attempting to pivot to sensitive workloads via container escape or shared-node resources.
- Access level: Code execution inside a container, escalating to node-level access via kernel exploit or runtime vulnerability.
- Objective: Access sensitive data or processes on co-located pods. Exploit shared resources (node filesystem, container runtime socket, kubelet API, network namespace).
- Blast radius: Without scheduling constraints, all pods on the same node are in the blast radius of a container escape. With security-aware scheduling, sensitive workloads run on dedicated nodes where only trusted pods are present, reducing the blast radius to the dedicated node pool.
Configuration
Step 1: Dedicated Node Pools with Labels
Create separate node pools for workloads with different security levels:
```bash
# Label nodes for security tiers
kubectl label node worker-01 worker-02 \
  security-tier=sensitive
kubectl label node worker-03 worker-04 \
  security-tier=general
kubectl label node worker-05 \
  security-tier=untrusted

# Verify labels
kubectl get nodes -L security-tier
```
Step 2: Taints and Tolerations for Hard Isolation
Taints prevent pods from scheduling on a node unless they explicitly tolerate the taint, making them the strongest exclusion mechanism the scheduler offers. Note that a toleration only permits placement; it does not attract the pod to tainted nodes. Pair tolerations with a nodeSelector or node affinity to pin sensitive workloads to the dedicated pool.
```bash
# Taint sensitive nodes so only approved workloads can run there
kubectl taint nodes worker-01 worker-02 \
  security-tier=sensitive:NoSchedule

# Taint untrusted nodes so general workloads avoid them
kubectl taint nodes worker-05 \
  security-tier=untrusted:NoSchedule
```
Deploy a sensitive workload that tolerates the taint:
```yaml
# payment-processor.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-processor
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-processor
  template:
    metadata:
      labels:
        app: payment-processor
    spec:
      # Tolerate the sensitive node taint
      tolerations:
        - key: "security-tier"
          operator: "Equal"
          value: "sensitive"
          effect: "NoSchedule"
      # Require placement on sensitive nodes
      nodeSelector:
        security-tier: sensitive
      containers:
        - name: processor
          image: registry.example.com/payment-processor:2.3.1
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
```
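The toleration check behind this step can be sketched in a few lines. This is a simplified model of the scheduler's filter phase, not the real kube-scheduler code; function names are illustrative:

```python
# Simplified model of the kube-scheduler taint filter: a pod is feasible on a
# node only if every NoSchedule/NoExecute taint on the node is tolerated.

def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    if toleration.get("operator", "Equal") == "Exists":
        # Exists matches any value; an omitted key matches every taint.
        key_ok = toleration.get("key") in (None, "", taint["key"])
        value_ok = True
    else:  # Equal
        key_ok = toleration.get("key") == taint["key"]
        value_ok = toleration.get("value") == taint.get("value")
    # An omitted effect matches all effects.
    effect_ok = toleration.get("effect") in (None, "", taint["effect"])
    return key_ok and value_ok and effect_ok

def schedulable(pod_tolerations, node_taints):
    """A node is feasible if every scheduling-relevant taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )

sensitive_taint = {"key": "security-tier", "value": "sensitive", "effect": "NoSchedule"}
payment_tolerations = [
    {"key": "security-tier", "operator": "Equal", "value": "sensitive", "effect": "NoSchedule"}
]

print(schedulable(payment_tolerations, [sensitive_taint]))  # True
print(schedulable([], [sensitive_taint]))                   # False: no toleration
```

The second call shows why general workloads stay off the sensitive pool: without a matching toleration, the tainted node is filtered out entirely.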
Step 3: Node Affinity for Preferred Placement
Node affinity provides more flexible placement rules than nodeSelector, including preferred (soft) and required (hard) constraints:
```yaml
# database-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: data
spec:
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: must be on sensitive nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: security-tier
                    operator: In
                    values:
                      - sensitive
          # Soft preference: prefer nodes with SSD storage
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: disk-type
                    operator: In
                    values:
                      - ssd
      containers:
        - name: postgres
          image: registry.example.com/postgres:16.2
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
```
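The required/preferred split works in two phases: required terms filter nodes out, preferred terms add their weight to the score of nodes that survive. A simplified sketch of that logic (node names and labels are hypothetical, and the real scheduler combines many more scoring plugins):

```python
# Sketch of node affinity evaluation: required terms filter, preferred
# terms score. Not the actual scheduler implementation.

def matches(expr, labels):
    if expr["operator"] == "In":
        return labels.get(expr["key"]) in expr["values"]
    if expr["operator"] == "NotIn":
        return labels.get(expr["key"]) not in expr["values"]
    if expr["operator"] == "Exists":
        return expr["key"] in labels
    raise ValueError(f"unsupported operator {expr['operator']}")

def feasible(node_labels, required_terms):
    # nodeSelectorTerms are ORed; matchExpressions within a term are ANDed.
    return any(
        all(matches(e, node_labels) for e in term["matchExpressions"])
        for term in required_terms
    )

def score(node_labels, preferred):
    # Each preferred term contributes its weight if all expressions match.
    return sum(
        p["weight"]
        for p in preferred
        if all(matches(e, node_labels) for e in p["preference"]["matchExpressions"])
    )

required = [{"matchExpressions": [
    {"key": "security-tier", "operator": "In", "values": ["sensitive"]}]}]
preferred = [{"weight": 80, "preference": {"matchExpressions": [
    {"key": "disk-type", "operator": "In", "values": ["ssd"]}]}}]

nodes = {
    "worker-01": {"security-tier": "sensitive", "disk-type": "ssd"},
    "worker-02": {"security-tier": "sensitive", "disk-type": "hdd"},
    "worker-03": {"security-tier": "general", "disk-type": "ssd"},
}
ranked = sorted(
    (n for n, labels in nodes.items() if feasible(labels, required)),
    key=lambda n: score(nodes[n], preferred),
    reverse=True,
)
print(ranked)  # ['worker-01', 'worker-02']: worker-03 is filtered out entirely
```

Note that worker-03 has the preferred SSD label but fails the required term, so it never reaches scoring: a soft preference can never override a hard requirement.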
Step 4: Pod Anti-Affinity for Replica Spreading
Prevent replicas of the same service from landing on the same node:
```yaml
# web-frontend.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAntiAffinity:
          # Hard: never put two replicas on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-frontend
              topologyKey: kubernetes.io/hostname
      containers:
        - name: frontend
          image: registry.example.com/web-frontend:3.1.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
```
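Hard anti-affinity rejects any node whose topology domain already hosts a pod matching the selector. With `kubernetes.io/hostname` as the topology key, each node is its own domain, so each replica needs its own node. A toy model of that check (node names and pod records are hypothetical):

```python
# Sketch of the hard pod anti-affinity filter: a candidate node is rejected
# if any existing pod matching the selector runs in the same topology domain.
# With topologyKey kubernetes.io/hostname, the domain is the node itself.

def anti_affinity_feasible(candidate_node, existing_pods, selector_app, node_to_domain):
    domain = node_to_domain[candidate_node]
    return not any(
        pod["app"] == selector_app and node_to_domain[pod["node"]] == domain
        for pod in existing_pods
    )

# Hostname topology: every node is its own domain.
node_to_domain = {"worker-03": "worker-03", "worker-04": "worker-04"}
existing = [{"app": "web-frontend", "node": "worker-03"}]

print(anti_affinity_feasible("worker-03", existing, "web-frontend", node_to_domain))  # False
print(anti_affinity_feasible("worker-04", existing, "web-frontend", node_to_domain))  # True
```

This is also why 4 replicas with required anti-affinity need at least 4 schedulable nodes; the fifth replica of a scaled-up deployment would sit Pending.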
Step 5: Topology Spread Constraints
Distribute pods evenly across failure domains (zones, nodes) for both availability and security:
```yaml
# distributed-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      topologySpreadConstraints:
        # Spread across availability zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-gateway
        # Spread across nodes within each zone
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api-gateway
      containers:
        - name: gateway
          image: registry.example.com/api-gateway:1.8.0
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
```
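The skew for a topology key is the difference between the most and least loaded domains, and `DoNotSchedule` rejects any placement that would push it above `maxSkew`. A simplified model (zone names are hypothetical, and the real scheduler also accounts for node affinity and taints when deciding which domains count):

```python
# Sketch of the maxSkew check: skew = (max pods in any domain) -
# (min pods in any domain). DoNotSchedule rejects placements that
# would exceed maxSkew. Simplified model, not scheduler code.

from collections import Counter

def skew_after_placement(domain_counts, domain):
    """Skew across domains if one more matching pod lands in `domain`."""
    counts = Counter(domain_counts)
    counts[domain] += 1
    return max(counts.values()) - min(counts.values())

def allowed_domains(domain_counts, max_skew):
    """Domains where a new pod can go without violating the constraint."""
    return [d for d in domain_counts
            if skew_after_placement(domain_counts, d) <= max_skew]

# Mid scale-up: 5 of 6 api-gateway replicas placed across 3 zones.
zone_counts = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(allowed_domains(zone_counts, max_skew=1))  # ['zone-c']: only the emptiest zone
```

This also illustrates the DoNotSchedule failure mode: if zone-c had no schedulable capacity, the sixth replica would stay Pending even though zone-a and zone-b have room.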
Step 6: Multi-Tenant Scheduling
Combine taints, tolerations, and node affinity to isolate tenant workloads on dedicated node pools:
```bash
# Create per-tenant taints
kubectl taint nodes worker-10 worker-11 \
  tenant=alpha:NoSchedule
kubectl taint nodes worker-12 worker-13 \
  tenant=beta:NoSchedule

# Label nodes for tenant affinity
kubectl label nodes worker-10 worker-11 tenant=alpha
kubectl label nodes worker-12 worker-13 tenant=beta
```
```yaml
# tenant-alpha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alpha-app
  namespace: team-alpha
spec:
  replicas: 2
  selector:
    matchLabels:
      app: alpha-app
  template:
    metadata:
      labels:
        app: alpha-app
        tenant: alpha
    spec:
      tolerations:
        - key: "tenant"
          operator: "Equal"
          value: "alpha"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: tenant
                    operator: In
                    values:
                      - alpha
      containers:
        - name: app
          image: registry.example.com/alpha-app:1.0.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
```
Enforce tenant scheduling with Kyverno:
```yaml
# enforce-tenant-scheduling.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-tenant-node-affinity
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-tenant-affinity
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - "team-*"
      validate:
        message: "Pods in tenant namespaces must include a nodeAffinity for the tenant label."
        pattern:
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: tenant
                          operator: In
```
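What the policy enforces can be expressed as a plain structural check on the pod spec: the required nodeAffinity path must contain a `tenant`/`In` expression. The sketch below is a simplified equivalent for illustration; Kyverno's pattern matching is richer (anchors, wildcards, autogen rules for Deployments):

```python
# Plain-Python equivalent of the Kyverno pattern: walk the pod spec and
# confirm a required nodeAffinity expression on the "tenant" key exists.
# Illustrative only; Kyverno evaluates the real pattern server-side.

def has_tenant_affinity(pod_spec: dict) -> bool:
    terms = (
        pod_spec.get("affinity", {})
        .get("nodeAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", {})
        .get("nodeSelectorTerms", [])
    )
    return any(
        expr.get("key") == "tenant" and expr.get("operator") == "In"
        for term in terms
        for expr in term.get("matchExpressions", [])
    )

compliant = {"affinity": {"nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{"matchExpressions": [
            {"key": "tenant", "operator": "In", "values": ["alpha"]}]}]}}}}

print(has_tenant_affinity(compliant))  # True: admission would pass
print(has_tenant_affinity({}))         # False: admission would be rejected
```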
Expected Behaviour
After implementing scheduler hardening:
- Sensitive workloads run exclusively on tainted nodes that reject all other pods
- Untrusted workloads are confined to their own node pool and cannot be scheduled alongside sensitive services
- Pod replicas are distributed across nodes and zones, preventing single-node failures from causing full outages
- Topology spread constraints maintain even distribution as pods scale up and down
- Multi-tenant workloads are isolated to per-tenant node pools, with Kyverno policies preventing tenants from scheduling on other tenants’ nodes
- General workloads continue to schedule on the general-tier nodes without modification
Trade-offs
| Control | Impact | Risk | Mitigation |
|---|---|---|---|
| Taints on sensitive nodes | Dedicated nodes may be underutilized if sensitive workloads are small | Wasted compute resources on dedicated nodes | Right-size the dedicated node pool. Use cluster autoscaler with node pool-specific scaling |
| Required pod anti-affinity | Pods cannot schedule if there are not enough nodes (e.g., 4 replicas need 4 nodes) | Pods stuck in Pending state during node shortages | Use preferredDuringScheduling instead of required for non-critical services. Ensure node count exceeds replica count |
| Topology spread with DoNotSchedule | Pods reject placement if the spread constraint cannot be met | Pods stuck Pending when zones or nodes are unevenly sized | Use ScheduleAnyway (soft) for less critical services. Ensure zones have similar node counts |
| Per-tenant node pools | Each tenant needs dedicated nodes, increasing infrastructure cost | Higher cost per tenant compared to shared node pools | Use node autoscaling to scale down idle tenant pools. Evaluate whether namespace isolation is sufficient for the trust level |
| Kyverno scheduling enforcement | Additional admission webhook latency on pod creation | Slight deployment slowdown (50-100ms per pod) | Acceptable for most workloads. Exempt system namespaces from the policy |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| All sensitive nodes full | New sensitive-tier pods stuck in Pending state | kubectl get pods shows Pending; kubectl describe pod shows “0/N nodes are available: N node(s) had untolerated taint” | Add nodes to the sensitive pool. Enable cluster autoscaler for the sensitive node group |
| Taint removed from node | Non-sensitive pods schedule on previously dedicated nodes, breaking isolation | Audit node taints periodically; Kyverno can enforce taints exist on labelled nodes | Re-apply the taint. Investigate how it was removed (accidental kubectl command, node replacement without taint) |
| Topology spread prevents scaling | HPA tries to add replicas but the spread constraint blocks placement in an imbalanced cluster | HPA events show “unable to schedule”; pods in Pending with topology spread errors | Rebalance nodes across zones. Switch the imbalanced constraint from DoNotSchedule to ScheduleAnyway |
| Anti-affinity blocks rolling updates | During a rolling update, new pods cannot schedule because old pods still occupy the required topology | Deployment rollout stalls; new pods Pending while old pods still Running | Configure maxSurge and maxUnavailable in the deployment strategy. Use preferredDuringScheduling for anti-affinity during rollouts |
| Tenant schedules on wrong node pool | Kyverno policy not applied, or namespace not matching the policy selector | Pods from tenant A running on tenant B’s nodes (check with kubectl get pods -o wide) | Fix the Kyverno policy match selector. Evict misplaced pods. Audit and re-apply taints |
When to Consider a Managed Alternative
Transition point: Managing dedicated node pools, taints, and autoscaling across multiple security tiers or tenants adds significant operational overhead. Each node pool needs its own autoscaling configuration, its own monitoring, and its own capacity planning. At 3+ node pools, the management burden is substantial.
Recommended providers:
- Civo (#22): Managed Kubernetes with node pool support. Create dedicated node pools for different security tiers through the API or UI. Autoscaling is managed by the provider.
- Sysdig (#122): Provides workload placement visualization, showing which pods are co-located on which nodes. Useful for auditing whether scheduling constraints are working as intended.
What you still control: The scheduling constraints (node affinity, taints, topology spread) are workload-level configurations that you define regardless of whether the infrastructure is managed. Managed providers handle node provisioning and autoscaling; you define the placement rules.
Premium content pack: Kyverno policy pack for scheduling enforcement, including policies for tenant node affinity, anti-affinity requirements for critical services, and topology spread validation. Includes Terraform modules for creating labelled and tainted node pools on major cloud providers.