Automated OS Hardening with Ansible: A Production-Ready Playbook Collection

Problem

Manual OS hardening does not scale. Applying the sysctl settings from Article #1, the systemd overrides from Article #2, and the SSH configuration from Article #7 across 10, 50, or 200 hosts by hand is error-prone, time-consuming, and impossible to verify consistently.

The specific problems:

  • Configuration drift. A host hardened 6 months ago may have had settings reverted by a package update, a troubleshooting session, or a configuration management conflict. Without automated verification, you do not know which hosts are hardened and which have drifted.
  • Inconsistent application. Different engineers apply different subsets of hardening settings. Server A has sysctl hardened but not systemd. Server B has SSH hardened but not filesystem mount options. There is no single source of truth.
  • Testing gap. Hardening settings that work on a web server can break a database server. Without per-role configuration and testing, every new hardening push is a gamble.
  • Compliance verification. Answering “are all our hosts CIS Level 1 compliant?” requires logging into each host and running checks, or, more realistically, it means nobody checks.

Ansible solves all four problems: declarative state management, idempotent application, role-based configuration, and automated verification. This article provides a production-ready playbook architecture that implements CIS Benchmark-level hardening for Ubuntu 24.04 LTS and RHEL 9, with per-role customisation, Molecule testing, and staged rollout.

Target systems: Ubuntu 24.04 LTS, RHEL 9 / Rocky Linux 9. Ansible 2.15+. Molecule for testing.

Threat Model

  • Adversary: Any attacker exploiting unhardened defaults across a fleet. The threat is not a specific attack technique; it is the inconsistent security posture that creates gaps attackers can find and exploit.
  • Blast radius: Without automation, some hosts are hardened, some are not, and you do not know which. An attacker who scans your fleet will find the unhardened hosts. With automation, every host in the fleet has the same baseline, verified by CI.

Configuration

Playbook Repository Structure

ansible-hardening/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   │       ├── all.yml           # Defaults for all hosts
│   │       ├── webservers.yml    # Web server overrides
│   │       ├── databases.yml     # Database overrides
│   │       └── kubernetes.yml    # K8s node overrides
│   └── staging/
│       └── hosts.yml
├── roles/
│   ├── base/                     # Applied to ALL hosts
│   │   ├── tasks/
│   │   │   ├── main.yml
│   │   │   ├── sysctl.yml
│   │   │   ├── systemd.yml
│   │   │   ├── ssh.yml
│   │   │   ├── filesystem.yml
│   │   │   ├── kernel-modules.yml
│   │   │   ├── auditd.yml
│   │   │   └── packages.yml
│   │   ├── templates/
│   │   │   ├── sysctl-hardening.conf.j2
│   │   │   ├── sshd_config.j2
│   │   │   └── audit.rules.j2
│   │   ├── handlers/
│   │   │   └── main.yml
│   │   └── defaults/
│   │       └── main.yml
│   ├── webserver/                # Additional hardening for web servers
│   ├── database/                 # Additional hardening for databases
│   └── kubernetes-node/          # Additional hardening for K8s nodes
├── site.yml                      # Main playbook
├── molecule/
│   └── default/
│       ├── molecule.yml
│       ├── converge.yml
│       └── verify.yml
└── requirements.yml
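The tree above references requirements.yml without showing it. A plausible minimal version follows; the exact collections and version pins are assumptions (the tasks shown later use only ansible.builtin, but filesystem mounts and firewall tasks commonly pull in these collections):

```yaml
# requirements.yml (sketch; pin versions that match your Ansible 2.15+ install)
---
collections:
  - name: ansible.posix        # mount, sysctl, firewalld modules
    version: ">=1.5.0"
  - name: community.general
    version: ">=8.0.0"
```

Install with `ansible-galaxy collection install -r requirements.yml` before the first playbook run.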

Inventory Configuration

# inventory/production/hosts.yml
all:
  children:
    webservers:
      hosts:
        web-01.example.com:
        web-02.example.com:
    databases:
      hosts:
        db-01.example.com:
    kubernetes:
      hosts:
        k8s-node-01.example.com:
        k8s-node-02.example.com:
        k8s-node-03.example.com:
# inventory/production/group_vars/all.yml
# Defaults applied to every host. Override per group as needed.

# sysctl hardening
hardening_sysctl_rp_filter: 1
hardening_sysctl_accept_source_route: 0
hardening_sysctl_accept_redirects: 0
hardening_sysctl_tcp_syncookies: 1
hardening_sysctl_kptr_restrict: 2
hardening_sysctl_dmesg_restrict: 1
hardening_sysctl_unprivileged_bpf_disabled: 1

# SSH hardening
hardening_ssh_permit_root_login: "no"
hardening_ssh_password_auth: "no"
hardening_ssh_max_auth_tries: 3
hardening_ssh_max_startups: "10:30:60"
hardening_ssh_allow_tcp_forwarding: "no"
hardening_ssh_x11_forwarding: "no"
hardening_ssh_allow_agent_forwarding: "no"

# Filesystem
hardening_tmp_noexec: true
hardening_tmp_nosuid: true
hardening_tmp_nodev: true

# Packages to remove
hardening_packages_remove:
  - telnet
  - rsh-client
  - talk
# inventory/production/group_vars/databases.yml
# Database servers need some settings adjusted.

# PostgreSQL needs more shared memory
hardening_sysctl_shmmax: 2147483648

# Database servers may need TCP forwarding for replication
hardening_ssh_allow_tcp_forwarding: "local"

Base Role - sysctl Task

# roles/base/tasks/sysctl.yml
---
- name: Deploy network hardening sysctl config
  ansible.builtin.template:
    src: sysctl-hardening.conf.j2
    dest: /etc/sysctl.d/60-hardening.conf
    owner: root
    group: root
    mode: '0644'
  notify: reload sysctl
  tags: [sysctl, network]

- name: Apply sysctl settings immediately
  ansible.builtin.command: sysctl --system
  changed_when: false
  tags: [sysctl]

- name: Verify critical sysctl settings
  ansible.builtin.command: "sysctl -n {{ item.key }}"
  register: sysctl_check
  failed_when: sysctl_check.stdout | trim != item.value | string
  changed_when: false
  loop:
    - { key: "net.ipv4.conf.all.rp_filter", value: "{{ hardening_sysctl_rp_filter }}" }
    - { key: "net.ipv4.conf.all.accept_source_route", value: "{{ hardening_sysctl_accept_source_route }}" }
    - { key: "net.ipv4.conf.all.accept_redirects", value: "{{ hardening_sysctl_accept_redirects }}" }
    - { key: "net.ipv4.tcp_syncookies", value: "{{ hardening_sysctl_tcp_syncookies }}" }
    - { key: "kernel.kptr_restrict", value: "{{ hardening_sysctl_kptr_restrict }}" }
    - { key: "kernel.dmesg_restrict", value: "{{ hardening_sysctl_dmesg_restrict }}" }
  tags: [sysctl, verify]
{# roles/base/templates/sysctl-hardening.conf.j2 #}
# Managed by Ansible - do not edit manually.
# Source: ansible-hardening/roles/base/templates/sysctl-hardening.conf.j2

# Network stack hardening
net.ipv4.conf.all.rp_filter = {{ hardening_sysctl_rp_filter }}
net.ipv4.conf.default.rp_filter = {{ hardening_sysctl_rp_filter }}
net.ipv4.conf.all.accept_source_route = {{ hardening_sysctl_accept_source_route }}
net.ipv4.conf.default.accept_source_route = {{ hardening_sysctl_accept_source_route }}
net.ipv4.conf.all.accept_redirects = {{ hardening_sysctl_accept_redirects }}
net.ipv4.conf.default.accept_redirects = {{ hardening_sysctl_accept_redirects }}
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
net.ipv4.tcp_syncookies = {{ hardening_sysctl_tcp_syncookies }}
net.ipv4.tcp_timestamps = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0

# Kernel protections
kernel.randomize_va_space = 2
kernel.kptr_restrict = {{ hardening_sysctl_kptr_restrict }}
kernel.dmesg_restrict = {{ hardening_sysctl_dmesg_restrict }}
kernel.perf_event_paranoid = 3
kernel.yama.ptrace_scope = 1
kernel.unprivileged_bpf_disabled = {{ hardening_sysctl_unprivileged_bpf_disabled }}
net.core.bpf_jit_harden = 2
kernel.kexec_load_disabled = 1

# Filesystem
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
fs.protected_fifos = 2
fs.protected_regular = 2
fs.suid_dumpable = 0

{% if hardening_sysctl_shmmax is defined %}
# Database overrides
kernel.shmmax = {{ hardening_sysctl_shmmax }}
{% endif %}
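The task files notify handlers that the repository tree places in roles/base/handlers/main.yml but the article does not show. A minimal sketch consistent with the notify names used above:

```yaml
# roles/base/handlers/main.yml (sketch; handler names must match the
# notify: strings in the task files exactly)
---
- name: reload sysctl
  ansible.builtin.command: sysctl --system
  changed_when: true

- name: restart sshd
  ansible.builtin.service:
    # The unit is sshd.service on RHEL and ssh.service on Ubuntu/Debian.
    name: "{{ 'ssh' if ansible_facts['os_family'] == 'Debian' else 'sshd' }}"
    state: restarted
```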

Base Role - SSH Task

# roles/base/tasks/ssh.yml
---
- name: Deploy hardened sshd_config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: '0600'
    validate: '/usr/sbin/sshd -t -f %s'
  notify: restart sshd
  tags: [ssh]

- name: Verify sshd configuration is valid
  ansible.builtin.command: /usr/sbin/sshd -t
  changed_when: false
  tags: [ssh, verify]
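The sshd_config.j2 template this task renders is not shown. An abridged sketch wired to the group_vars variables defined earlier; a production template would set many more directives (ciphers, key exchange algorithms, logging):

```jinja2
{# roles/base/templates/sshd_config.j2 (abridged sketch) #}
# Managed by Ansible - do not edit manually.
PermitRootLogin {{ hardening_ssh_permit_root_login }}
PasswordAuthentication {{ hardening_ssh_password_auth }}
MaxAuthTries {{ hardening_ssh_max_auth_tries }}
MaxStartups {{ hardening_ssh_max_startups }}
AllowTcpForwarding {{ hardening_ssh_allow_tcp_forwarding }}
X11Forwarding {{ hardening_ssh_x11_forwarding }}
AllowAgentForwarding {{ hardening_ssh_allow_agent_forwarding }}
KbdInteractiveAuthentication no
UsePAM yes
```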

Main Playbook

# site.yml
---
- name: Apply base hardening to all hosts
  hosts: all
  become: true
  roles:
    - base
  tags: [base]

- name: Apply web server hardening
  hosts: webservers
  become: true
  roles:
    - webserver
  tags: [webserver]

- name: Apply database hardening
  hosts: databases
  become: true
  roles:
    - database
  tags: [database]

- name: Apply Kubernetes node hardening
  hosts: kubernetes
  become: true
  roles:
    - kubernetes-node
  tags: [kubernetes]

Molecule Testing

# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  # The stock ubuntu:24.04 / rockylinux:9 images are not systemd-init-ready
  # (and Ubuntu's ships without Python), so Molecule scenarios typically use
  # images with systemd and Python preinstalled, e.g. geerlingguy's test images.
  - name: ubuntu-2404
    image: geerlingguy/docker-ubuntu2404-ansible:latest
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    cgroupns_mode: host
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    tmpfs:
      - /run
      - /tmp
  - name: rhel-9
    image: geerlingguy/docker-rockylinux9-ansible:latest
    pre_build_image: true
    privileged: true
    command: /usr/lib/systemd/systemd
    cgroupns_mode: host
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    tmpfs:
      - /run
      - /tmp
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible
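molecule.yml points at a converge.yml that the article does not show. A minimal sketch that applies only the base role (the container scenario defines no webserver/database/kubernetes groups):

```yaml
# molecule/default/converge.yml (sketch)
---
- name: Converge
  hosts: all
  become: true
  roles:
    - base
```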
# molecule/default/verify.yml
---
- name: Verify hardening
  hosts: all
  become: true
  tasks:
    - name: Check sysctl rp_filter
      ansible.builtin.command: sysctl -n net.ipv4.conf.all.rp_filter
      register: rp_filter
      failed_when: rp_filter.stdout | trim != "1"
      changed_when: false

    - name: Check kernel pointer restriction
      ansible.builtin.command: sysctl -n kernel.kptr_restrict
      register: kptr
      failed_when: kptr.stdout | trim != "2"
      changed_when: false

    - name: Check sshd config is valid
      ansible.builtin.command: /usr/sbin/sshd -t
      changed_when: false

    - name: Verify sshd password auth is disabled
      ansible.builtin.command: grep -E "^PasswordAuthentication no" /etc/ssh/sshd_config
      changed_when: false

    - name: Check unneeded packages are removed
      ansible.builtin.package:
        name: telnet
        state: absent
      check_mode: true
      register: telnet_check
      failed_when: telnet_check.changed
# Run Molecule tests locally:
cd ansible-hardening
molecule test

# Expected output:
# --> Test matrix
# --> Dependency
# --> Create
# --> Converge
# --> Idempotence   ← Re-runs playbook; expects zero changes
# --> Verify        ← Runs verification playbook
# --> Destroy

# All stages should pass. Idempotence is critical -
# it proves the playbook can be safely re-run.

Staged Rollout

Never apply hardening to your entire fleet at once. Use a canary strategy:

# Stage 1: Apply to a single canary host
ansible-playbook site.yml -i inventory/production -l web-01.example.com --diff

# Verify the canary host is healthy:
# - Check application health endpoint
# - Check monitoring dashboards for anomalies
# - Wait 30 minutes

# Stage 2: Apply to 25% of each group
ansible-playbook site.yml -i inventory/production --limit '~web-0[12]:~db-01:~k8s-node-01' --diff

# Wait 1 hour. Verify.

# Stage 3: Apply to all hosts
ansible-playbook site.yml -i inventory/production --diff
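The three manual stages can also be collapsed into a single run with Ansible's serial keyword, which executes a play in successive batches. A sketch, with batch sizes mirroring the staged rollout above:

```yaml
# site.yml (variant): wave-based rollout in one invocation.
- name: Apply base hardening in waves
  hosts: all
  become: true
  serial:
    - 1          # canary: a single host
    - "25%"      # then a quarter of the fleet
    - "100%"     # then the remainder
  max_fail_percentage: 0   # abort remaining waves if any host fails
  roles:
    - base
```

The trade-off: serial cannot pause for the 30-minute and 1-hour observation windows, so keep the --limit approach when you want a human in the loop between stages.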

Drift Detection

Schedule regular compliance checks to detect configuration drift:

# Run the playbook in check mode (dry-run) - reports what WOULD change:
ansible-playbook site.yml -i inventory/production --check --diff

# If the output shows zero changes: fleet is in compliance.
# If changes are reported: a host has drifted.

# Automate this as a cron job or CI pipeline:
# 0 6 * * * ansible-playbook site.yml -i inventory/production --check --diff 2>&1 | mail -s "Hardening drift report" security@example.com

For Prometheus-based monitoring:

# Export drift check results as a Prometheus metric:
# Create a textfile exporter gauge:
# 'changed=[1-9]' matches recap lines with a non-zero changed count;
# a plain 'changed=' would also count hosts reporting changed=0.
echo "hardening_drift_detected $(ansible-playbook site.yml -i inventory/production --check --diff 2>&1 | grep -c 'changed=[1-9]')" \
  > /var/lib/node_exporter/hardening_drift.prom
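Counting matched lines works, but counting hosts whose PLAY RECAP line reports a non-zero changed count makes the intent explicit. A sketch against a captured sample recap (recap.txt here is illustrative, not real output):

```shell
# Build a sample PLAY RECAP capture; in practice, pipe the real
# ansible-playbook --check output into this file instead.
cat > recap.txt <<'EOF'
web-01.example.com : ok=12 changed=0 unreachable=0 failed=0
db-01.example.com  : ok=12 changed=2 unreachable=0 failed=0
EOF

# Count hosts with changed >= 1 ('changed=[1-9]' skips changed=0).
drifted=$(awk '/changed=[1-9]/ { n++ } END { print n+0 }' recap.txt)
echo "hardening_drift_detected ${drifted}"   # prints: hardening_drift_detected 1
```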

Expected Behaviour

After setting up and running the playbook:

  • ansible-playbook site.yml applies all hardening across the fleet idempotently
  • Re-running produces zero changes (idempotent)
  • molecule test passes on both Ubuntu 24.04 and RHEL 9
  • Canary host rollout catches breaking changes before fleet-wide application
  • Drift detection (scheduled --check --diff) reports zero changes when fleet is compliant
  • Each server role (web, database, K8s node) has appropriate hardening with role-specific overrides
  • New hosts added to inventory are automatically hardened on the next playbook run

Trade-offs

  • CIS Level 1 baseline (not Level 2). Impact: Level 2 adds 20+ additional controls with a higher breakage risk. Risk: some compliance frameworks require Level 2. Mitigation: start with Level 1, then add Level 2 controls incrementally after testing each one.
  • Ansible over Salt/Puppet. Impact: broadest adoption, lowest learning curve, agentless. Risk: Salt is faster for event-driven remediation, and Puppet scales better for very large fleets (>1000 hosts). Mitigation: Ansible is the right choice for most teams; migrate later if needed.
  • Molecule Docker testing. Impact: fast (2-3 minutes) and runs in CI. Risk: Docker does not perfectly replicate bare-metal sysctl behaviour. Mitigation: supplement with a staging VM for sysctl-specific tests; Docker catches the large majority of issues.
  • Template-based configs. Impact: configuration is generated from variables, giving one source of truth. Risk: template errors can produce invalid configs. Mitigation: validation steps in the tasks (sshd -t, sysctl --system) catch template errors before they take effect.
  • Staged rollout (canary → 25% → 100%). Impact: catches breaking changes before they hit the whole fleet. Risk: slower than a full-fleet deployment. Mitigation: the 1-2 hours a staged rollout costs is negligible next to the recovery cost of a fleet-wide breakage.

Failure Modes

  • Playbook locks out SSH. Symptom: cannot SSH in after the hardening run. Detection: lost SSH access; console access is required to recover. Recovery: always test from a second SSH session before closing the first, keep out-of-band console access available, and include an SSH connectivity check as the last task so a broken config fails the run loudly.
  • sysctl change breaks an application. Symptom: the application fails after sysctl hardening. Detection: application monitoring shows errors; the canary host catches this first. Recovery: override the offending sysctl in the host's group_vars and re-run the playbook; the canary deployment limits the blast radius.
  • Template syntax error. Symptom: sshd_config or the sysctl config would be rendered invalid. Detection: the validate parameter on the template task catches it; the task fails before deploying the broken config. Recovery: fix the template. The old config stays in place until the new one passes validation.
  • Molecule passes but production fails. Symptom: hardening works in Docker but breaks on real hardware. Detection: production monitoring; the canary strategy limits the blast radius. Recovery: fix the playbook, add the failing scenario to the Molecule tests, and supplement Docker testing with staging-VM tests for hardware-specific settings.
  • Drift after a package upgrade. Symptom: apt upgrade resets an sshd_config setting to its default. Detection: the scheduled --check --diff drift run reports changes. Recovery: re-run the playbook; the drift is remediated automatically. Investigate which package caused the reset.
  • Missing role-specific override. Symptom: a database server breaks because a sysctl setting is too restrictive. Detection: database monitoring shows errors or degraded performance; the canary catches this first. Recovery: add the necessary override to the database group_vars and re-run the playbook.

When to Consider a Managed Alternative

Transition point: maintaining hardening playbooks across 2+ OS versions and 3+ server roles costs roughly 4-8 hours per month. Consider a managed alternative when the maintenance burden exceeds that, or when the team is moving to containers and managed Kubernetes, where host-level hardening is abstracted away.

What managed providers handle:

  • Managed Kubernetes (Civo #22, DigitalOcean #21): node OS hardening is the provider's responsibility. You do not run Ansible against managed K8s nodes.
  • Aqua (#123) and Sysdig (#122): Verify compliance across a fleet and alert on drift. They do not remediate (Ansible does that), but they provide the monitoring layer that detects when remediation is needed.
  • Chef InSpec (#36): Compliance verification. Use InSpec as the verifier in your Ansible workflow: Ansible remediates, InSpec verifies.

What you still control: For any hosts you manage directly (bare metal, VMs, self-managed K8s nodes), Ansible hardening remains your responsibility. This playbook collection applies directly to those hosts.

Premium content pack: the complete Ansible playbook collection, with roles for base, webserver, database, and kubernetes-node and tested templates for all configurations covered in Articles #1, #2, #5, #7, #8, and #10. Tested with Molecule on Ubuntu 24.04 LTS and RHEL 9. Includes CI pipeline configuration for automated drift detection.