Securing CI/CD Runners: Isolation, Credential Scoping, and Ephemeral Environments

Securing CI/CD Runners: Isolation, Credential Scoping, and Ephemeral Environments

Problem

CI/CD runners are the most privileged, least monitored components in most infrastructure. A self-hosted runner has persistent access to deployment credentials, can reach production networks, and executes arbitrary code from every pull request. A compromised runner gives an attacker everything: secrets, deployment keys, container registry access, and a direct path to production.

The specific gaps:

  • Persistent runners accumulate state between jobs, a malicious job can leave a backdoor that executes during the next job on the same runner.
  • Over-scoped credentials: runners have access to all secrets in the repository/project, not just the ones the current job needs.
  • No network isolation: runners can reach production infrastructure, internal services, and the internet.
  • No monitoring: runner activity is not audited. A compromised runner operates undetected until someone notices the damage.

Target systems: GitHub Actions self-hosted runners, GitLab CI runners. Principles apply to Jenkins, Drone, Buildkite, and Woodpecker.

Threat Model

  • Adversary: Attacker submitting a malicious pull request (open-source projects), compromised developer account (enterprise), or compromised third-party CI action/step.
  • Access level: Code execution on the CI runner with access to all pipeline secrets and network connectivity.
  • Objective: Extract secrets (cloud credentials, signing keys, registry tokens). Inject malicious code into build artifacts. Pivot to production infrastructure via deployment credentials.
  • Blast radius: All secrets accessible to the runner. All infrastructure the runner can deploy to. All container images the runner can push to.

Configuration

Ephemeral Runners

The most impactful change: every job runs on a fresh runner instance that is destroyed after the job completes. No state persists between jobs.

GitHub Actions: Ephemeral self-hosted runners

# Register a self-hosted runner with --ephemeral flag.
# After one job, the runner deregisters and the VM/container is destroyed.
./config.sh --url https://github.com/your-org/your-repo \
  --token YOUR_REGISTRATION_TOKEN \
  --ephemeral \
  --name "ephemeral-runner-$(date +%s)"

./run.sh
# Runner accepts one job, executes it, then exits.
# The orchestrator (systemd, K8s Job, or cloud autoscaler) creates a new instance.

GitHub Actions: Autoscaling ephemeral runners with Actions Runner Controller (ARC):

# runner-deployment.yaml - ARC on Kubernetes
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ephemeral-runners
spec:
  replicas: 3
  template:
    spec:
      ephemeral: true
      repository: your-org/your-repo
      labels:
        - self-hosted
        - linux
        - ephemeral
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      # Security context for the runner pod
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        readOnlyRootFilesystem: false  # Runner needs write access
        seccompProfile:
          type: RuntimeDefault

GitLab CI: Docker executor with ephemeral containers

# /etc/gitlab-runner/config.toml
[[runners]]
  name = "ephemeral-docker"
  executor = "docker"
  [runners.docker]
    image = "ubuntu:24.04"
    privileged = false
    # Each job gets a fresh container - destroyed after completion
    pull_policy = ["always"]
    # Disable Docker socket access (prevents container escape)
    volumes = []
    # Network isolation
    network_mode = "bridge"
    # Resource limits
    cpus = "2"
    memory = "4g"

OIDC Federation (No Static Credentials)

Replace static AWS/GCP credentials with OIDC federation. The runner receives a short-lived token that expires in minutes.

GitHub Actions → AWS via OIDC:

# .github/workflows/deploy.yml
name: Deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required for OIDC
  contents: read

jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: eu-west-1
          # No static credentials stored anywhere.
          # The role trust policy restricts to this specific repo and branch.

      - name: Deploy
        run: |
          aws ecs update-service --cluster prod --service app --force-new-deployment

AWS IAM trust policy for the OIDC role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}

The Condition block is critical: it restricts the role to a specific repository AND branch. A different repository or a pull request branch cannot assume this role.

Runner Network Isolation

Runners should only be able to reach: the container registry, the cloud provider APIs, and the artifact storage. Nothing else.

# runner-network-policy.yaml (for ARC runners on Kubernetes)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-egress-restrict
  namespace: actions-runner-system
spec:
  podSelector:
    matchLabels:
      app: runner
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow HTTPS to GitHub API
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # Block everything else (no SSH to production, no internal services)

Runner Compromise Detection

# Prometheus alert rules for CI/CD runner monitoring
groups:
  - name: cicd-runner-security
    rules:
      - alert: UnexpectedRunnerRegistration
        expr: increase(github_runner_registration_total[1h]) > 5
        labels:
          severity: warning
        annotations:
          summary: "Unusual number of runner registrations in the past hour"
          runbook: "Check for unauthorized runner registration. Verify all runners are expected."

      - alert: LongRunningJob
        expr: github_job_duration_seconds > 3600
        labels:
          severity: warning
        annotations:
          summary: "CI job running for over 1 hour, possible compromise or stuck job"

      - alert: SecretsAccessedByUnexpectedWorkflow
        expr: increase(github_secret_access_total{workflow!~"deploy|release|build"}[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Secrets accessed by unexpected workflow: {{ $labels.workflow }}"

Expected Behaviour

  • Every CI job runs on a fresh, ephemeral runner instance, no state persists between jobs
  • No static cloud credentials stored on runners or in repository secrets; OIDC tokens expire in minutes
  • Runner network restricted to container registry, cloud APIs, and artifact storage only
  • Monitoring alerts on: unexpected runner registrations, long-running jobs, secrets access by unusual workflows
  • GITHUB_TOKEN permissions explicitly declared per-job (not default write-all)

Trade-offs

Control Impact Risk Mitigation
Ephemeral runners 10-30 second cold start per job; cache miss for dependencies Slower CI; no persistent build cache Use external cache (S3, GCS) for dependency caching. Cache hits restore in 5-10 seconds.
OIDC federation No static credentials; automatic rotation Requires AWS/GCP OIDC trust setup (4-8 hours per provider per repo) Document the setup once; template for new repos.
Network egress restrictions Blocks unexpected outbound connections npm install or pip install from public registries may be blocked Allow egress to port 443 (HTTPS) broadly, but block non-HTTPS egress. For stricter control: proxy all dependency downloads through an internal registry.
Runner monitoring Visibility into runner activity Alert fatigue from legitimate long-running jobs Tune thresholds per workflow. Exclude known long jobs (integration tests, large builds).

Failure Modes

Failure Symptom Detection Recovery
OIDC token request fails CI job fails with “unable to assume role” Job logs show 401/403 on AWS STS; configure-aws-credentials step fails Check OIDC trust policy: verify repo, branch, and audience conditions match. Check the IAM role exists and has the correct permissions.
Ephemeral runner doesn’t clean up Old runner instances accumulate; cost and security risk Runner instance count exceeds expected; stale instances visible in cloud console Implement max-lifetime policy (terminate instances older than 2 hours). Add cleanup cron job.
Network policy blocks dependency download npm install, pip install, or docker pull fails in CI Job fails at dependency installation step; network timeout in job logs Add the registry domain to the egress allowlist. Or proxy dependencies through an internal registry.
Compromised CI action Third-party action exfiltrates secrets Secrets appear in unexpected locations; unusual API calls from runner IP Pin all actions by SHA (not tag). Review action source code. Use permissions to limit GITHUB_TOKEN scope.

When to Consider a Managed Alternative

Self-hosted runners require VM/container infrastructure, patching, and monitoring. GitHub-hosted larger runners ($0.008-0.064/minute) provide managed isolation, ephemeral by default, with no infrastructure to maintain. Buildkite (#94) provides managed orchestration with self-hosted runner security (you control where jobs run; Buildkite handles scheduling and monitoring).

For runner audit logging: Grafana Cloud (#108) or Axiom (#112) for centralized CI/CD audit log analysis. Sysdig (#122) for runtime monitoring of runner containers on Kubernetes.

Premium content pack: CI/CD hardening templates. GitHub Actions workflow templates with OIDC, minimal permissions, and pinned actions; GitLab CI config with ephemeral Docker executor; ARC runner deployment manifests with security context and network policies.