Container Escape Prevention: Kernel Namespaces, Seccomp, and AppArmor Deep Dive

Excerpt: Container escape vulnerabilities allow attackers to break out of a container and gain access to the host system or other containers. This deep dive covers Linux kernel namespace isolation mechanisms, custom seccomp syscall filter profiles, AppArmor mandatory access control policies, rootless container architectures, and Falco runtime threat detection.

Introduction

Containers are not virtual machines. While they provide process isolation, they share the host kernel — and that shared surface area is where container escape vulnerabilities live. Understanding how Linux namespaces, seccomp, and AppArmor work at a technical level is essential for anyone responsible for container security in production environments.

This article builds a mental model of the Linux isolation primitives that underpin container security, then translates that understanding into actionable defense configurations: custom seccomp profiles that minimize syscall attack surface, AppArmor policies enforcing MAC, rootless container deployments, and Falco rules that detect escape attempts at runtime.

Linux Kernel Namespaces: The Foundation of Container Isolation

Containers are fundamentally composed of Linux kernel namespaces — kernel features that partition global system resources so that each namespace has its own isolated instance. The kernel currently provides eight namespace types relevant to container isolation:

pid: isolates process ID numbering. A container sees its own processes starting from PID 1 and cannot see host PIDs
net: isolates the network stack, including interfaces, routing tables, firewall rules, and sockets
mnt: isolates filesystem mount points, preventing container processes from seeing the host filesystem tree
uts: isolates hostname and domain name, allowing each container to have its own hostname
ipc: isolates System V IPC and POSIX message queues
user: isolates user and group IDs, the foundation of rootless containers
cgroup: isolates the cgroup root, preventing containers from seeing the full cgroup hierarchy
time: isolates offsets for the monotonic and boot-time clocks (Linux 5.6+)

You can inspect a running container’s namespaces directly:

# Find the host PID of the container's main process
docker inspect mycontainer --format '{{ .State.Pid }}'
# Returns: 12345

# List namespaces
ls -la /proc/12345/ns/
# lrwxrwxrwx cgroup -> cgroup:[4026531835]
# lrwxrwxrwx ipc    -> ipc:[4026532456]
# lrwxrwxrwx mnt    -> mnt:[4026532457]
# lrwxrwxrwx net    -> net:[4026532459]
# lrwxrwxrwx pid    -> pid:[4026532458]
# lrwxrwxrwx user   -> user:[4026531837]
# lrwxrwxrwx uts    -> uts:[4026532455]

# Compare with a host process — different inode numbers = different namespaces
ls -la /proc/1/ns/
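
The manual comparison above can be scripted. The following is a minimal sketch, assuming it runs on a Linux host with permission to read both processes' /proc entries; the function names are illustrative and not part of any container tooling.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026532458]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unrecognized namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid):
    """Return {entry name: inode} for a live process.

    Reading another user's process usually requires root.
    """
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        for name in sorted(os.listdir(ns_dir))
    }


def shared_namespaces(a, b):
    """List namespace entries whose inode numbers are identical in both maps."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])
```

Calling shared_namespaces(read_namespaces(12345), read_namespaces(1)) lists every namespace the container process shares with the host's PID 1. Each entry in that list is a resource the container is not isolated from.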

Namespace escape vulnerabilities typically involve a container process gaining a file descriptor or capability that lets it reach back into the host. The classic example is CVE-2019-5736, where a malicious container could overwrite the host's runc binary through a file descriptor obtained via /proc/self/exe. The overwrite was triggered when runc started a crafted image or executed a command inside an already compromised container.

Linux Capabilities: Reducing Privilege

Root inside a container is not the same as root on the host (when user namespaces are used), but Linux capabilities are still a major attack surface. By default, Docker grants containers a subset of capabilities that includes NET_RAW, MKNOD, and CHOWN, which most applications do not need. More dangerous capabilities such as SYS_PTRACE and SYS_ADMIN are not in the default set and should stay out of it.

Always drop all capabilities and add back only what is required:

# Docker run with minimal capabilities
docker run \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges:true \
  example-corp.com/myapp:latest

# Kubernetes pod security context (equivalent settings)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: example-corp.com/myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 1000
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]

The no-new-privileges flag prevents setuid and setgid executables inside the container from gaining elevated privileges, which closes a common privilege escalation path.
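
To confirm what a container process actually holds after capabilities are dropped, you can decode the CapEff bitmask from /proc/<pid>/status. The sketch below assumes the capability bit numbering from the kernel's capability header (bit 0 is CHOWN, bit 40 is CHECKPOINT_RESTORE); the function names are illustrative.

```python
# Capability names indexed by bit position, following linux/capability.h.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a hex bitmask such as '0000000000000400' into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def effective_caps(pid):
    """Read and decode the CapEff line for a live process."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    raise LookupError(f"no CapEff line found for pid {pid}")
```

A container started with --cap-drop ALL --cap-add NET_BIND_SERVICE and running as root should report only NET_BIND_SERVICE. Anything beyond that suggests the flags did not take effect.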

Custom Seccomp Profiles

Seccomp (Secure Computing Mode) is a Linux kernel feature that filters system calls. The default Docker seccomp profile blocks about 44 of the 300+ available syscalls, but a custom profile tailored to your application’s actual syscall usage provides much stronger protection.

The approach is to trace your application’s syscall usage, then generate a minimal allowlist profile:

# Step 1: Trace syscalls during normal operation using strace
strace -f -e trace=all -o /tmp/syscalls.txt ./myapp

# Step 2: Or use a more comprehensive approach with perf
perf trace -o /tmp/perf-syscalls.txt ./myapp &
# Run through all application code paths (startup, requests, shutdown)

# Step 3: Generate a seccomp profile from the trace
# Tools like syscall2seccomp or oci-seccomp-bpf-hook can help
# Manual approach: extract unique syscall names
# (with -f and -o, each line looks like: 12345 openat(AT_FDCWD, ...) = 3)
grep -E '^[0-9]+ +[a-z_0-9]+\(' /tmp/syscalls.txt | awk '{print $2}' | cut -d'(' -f1 | sort -u
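
The same extraction can be done in Python when you want to feed the result into further tooling. This is a rough sketch, assuming trace lines in the format produced by strace -f -o (an optional PID, then the syscall name followed by an opening parenthesis); the function name is illustrative.

```python
import re

# Matches "12345 openat(" or "openat(" at the start of a line. Signal lines
# ("--- SIGCHLD ... ---"), exit lines ("+++ exited ... +++") and
# "<... read resumed>" continuations do not match.
SYSCALL_LINE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines):
    """Return the sorted, de-duplicated syscall names found in strace output."""
    names = set()
    for line in lines:
        match = SYSCALL_LINE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)
```

For a trace file, call it as syscalls_from_strace(open("/tmp/syscalls.txt")). A trace only covers the code paths that were exercised, so run startup, normal traffic, error handling, and shutdown before trusting the list.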

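A minimal seccomp profile for a Node.js HTTP server looks something like the following. Treat the list as illustrative: derive the real one from your own trace, which will usually contain additional calls (for example execve, which the runtime needs in order to start the container's entrypoint).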

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "brk", "clock_gettime", "clone",
        "close", "connect", "epoll_create1", "epoll_ctl", "epoll_wait",
        "eventfd2", "exit", "exit_group", "fcntl", "fstat",
        "futex", "getdents64", "getpid", "getrandom", "gettid",
        "gettimeofday", "ioctl", "listen", "lseek", "madvise",
        "mmap", "mprotect", "munmap", "nanosleep", "open",
        "openat", "pipe2", "poll", "read", "recvfrom",
        "recvmsg", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "sendmsg", "sendto", "set_robust_list", "setsockopt",
        "socket", "stat", "tgkill", "uname", "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
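
Before deploying a profile like this, it helps to check it against a fresh trace. The sketch below builds a profile with the same JSON layout as above and reports traced syscalls that the profile would deny. The helper names are hypothetical, and the default architecture is an assumption you should adjust for your hosts.

```python
import json


def build_profile(syscalls, arch="SCMP_ARCH_X86_64"):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


def allowed_syscalls(profile):
    """Collect every syscall name listed under an SCMP_ACT_ALLOW entry."""
    allowed = set()
    for entry in profile.get("syscalls", []):
        if entry.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(entry.get("names", []))
    return allowed


def missing_from_profile(profile, traced):
    """Return traced syscalls the profile would reject, sorted by name."""
    return sorted(set(traced) - allowed_syscalls(profile))


def save_profile(profile, path):
    """Write the profile as JSON so it can be passed to the container runtime."""
    with open(path, "w") as handle:
        json.dump(profile, handle, indent=2)
```

A non-empty result from missing_from_profile means the application would receive an error for those calls in production, so either extend the allowlist or confirm the code path is unreachable.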

Apply the profile:

# Docker
docker run \
  --security-opt seccomp=/path/to/profile.json \
  example-corp.com/myapp:latest

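# Kubernetes (seccompProfile in the security context; the older
# seccomp.security.alpha.kubernetes.io/pod annotation is deprecated)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Path is relative to the kubelet's seccomp directory on the node
      localhostProfile: myapp-profile.json
  containers:
    - name: myapp
      image: example-corp.com/myapp:latest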

AppArmor Mandatory Access Control

AppArmor is a Linux Security Module implementing mandatory access control through per-program profiles. Where seccomp filters syscall numbers, AppArmor controls file access, network operations, and capabilities at a semantic level — it understands paths and operations rather than kernel ABI.

A custom AppArmor profile for a containerized web application:

# /etc/apparmor.d/docker-myapp
#include <tunables/global>

profile myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Allow reading application files
  /app/** r,
  /app/node_modules/** r,
  /app/dist/** r,

  # Allow writing to temp directory only
  /tmp/** rw,

  # Deny write to sensitive paths
  deny /etc/passwd w,
  deny /etc/shadow rw,
  deny /proc/sysrq-trigger rw,

  # Network: allow TCP sockets (this rule does not restrict ports;
  # enforce port limits with firewall rules or network policy)
  network tcp,

  # Deny ptrace (prevents escape via process injection)
  deny ptrace,

  # Deny mount (prevents namespace manipulation)
  deny mount,

  # Capabilities
  capability net_bind_service,
  deny capability sys_admin,
  deny capability sys_ptrace,
}

Load and enforce the profile:

# Load profile into kernel
apparmor_parser -r -W /etc/apparmor.d/docker-myapp

# Apply to Docker container
docker run \
  --security-opt "apparmor=myapp" \
  example-corp.com/myapp:latest

# Verify profile is active
cat /proc/$(docker inspect --format '{{ .State.Pid }}' mycontainer)/attr/current
# Returns: myapp (enforce)
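
The verification step is easy to automate in a deployment check. The sketch below parses the label format shown above ("profile (mode)", or a bare label such as "unconfined"); the function names are illustrative.

```python
def parse_apparmor_label(raw):
    """Split the contents of /proc/<pid>/attr/current into (profile, mode).

    Labels without a mode suffix, such as "unconfined", return mode None.
    """
    label = raw.strip("\x00 \t\r\n")
    if label.endswith(")") and " (" in label:
        name, _, mode = label[:-1].rpartition(" (")
        return name, mode
    return label, None


def is_enforcing(raw, expected_profile):
    """True only if the expected profile is attached and in enforce mode."""
    name, mode = parse_apparmor_label(raw)
    return name == expected_profile and mode == "enforce"
```

Checking the mode as well as the name matters: a profile in complain mode only logs violations and does not block them.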

Rootless Containers

Rootless containers run the container runtime itself as a non-root user, using user namespaces to map container UIDs to unprivileged host UIDs. This fundamentally changes the security model: even if an attacker escapes the container, they are an unprivileged user on the host.

# Configure rootless Docker (per-user installation)
dockerd-rootless-setuptool.sh install

# Or use Podman, which runs rootless when invoked as an unprivileged user
podman run example-corp.com/myapp:latest

# Check that the container runtime runs as non-root
ps aux | grep rootlesskit
# soulofall  12345  0.0  0.0  rootlesskit --net=slirp4netns ...

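# User namespace mapping (-n picks the newest matching process)
cat /proc/$(pgrep -n -f "myapp")/uid_map
#          0       1000          1
#          1     100000      65536
# Container UID 0 maps to the invoking user's host UID (1000 here);
# container UIDs 1-65536 map to an unprivileged subordinate range from /etc/subuid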
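
Each uid_map line has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The following sketch translates a container UID into its host UID from that format; the function names are illustrative.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map contents into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def to_host_id(entries, container_id):
    """Return the host ID for a container ID, or None if it is unmapped."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None
```

With a single-range map such as "0 100000 65536", container UID 0 resolves to host UID 100000 and container UID 1000 to 101000. An ID that resolves to None has no identity on the host at all.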

Rootless containers have trade-offs: by default they cannot bind ports below 1024 (unless the host's net.ipv4.ip_unprivileged_port_start sysctl is lowered), some networking modes are unavailable, and certain volume operations behave differently. For most web application workloads, these limitations are acceptable given the security improvement.

Falco Runtime Threat Detection

Falco is a cloud-native runtime security tool that uses eBPF or kernel module probes to observe system calls and detect anomalous behavior in real time. It is widely used for detecting container escape attempts and post-exploitation activity.

Key Falco rules for container escape detection:

# Detect attempts to write to the host filesystem via /proc
- rule: Write to Host Filesystem via /proc
  desc: A container attempted to write to host filesystem via /proc escape
  condition: >
    container and
    fd.name startswith /proc and
    evt.type = write
  output: >
    Container writing to /proc (container=%container.name pid=%proc.pid
    file=%fd.name user=%user.name)
  priority: CRITICAL

# Detect namespace escape via setns syscall
- rule: Container Namespace Escape Attempt
  desc: A process attempted to change to a different namespace
  condition: >
    container and
    evt.type = setns
  output: >
    Namespace escape attempt (container=%container.name pid=%proc.pid
    user=%user.name)
  priority: CRITICAL

# Detect unexpected outbound connections (C2 beaconing after escape)
# Requires a Falco list named allowed_ips to be defined elsewhere in your rules
- rule: Unexpected Outbound Network Connection
  desc: Container making outbound connection to unexpected destination
  condition: >
    container and
    evt.type = connect and
    fd.typechar = 4 and
    not fd.sip in (allowed_ips)
  output: >
    Unexpected outbound connection (container=%container.name
    dst=%fd.rip:%fd.rport pid=%proc.pid)
  priority: WARNING

# Detect privileged binary execution
- rule: Privileged Binary Executed in Container
  desc: A setuid binary was executed inside a container
  condition: >
    container and
    evt.type = execve and
    proc.is_suid_exe = true
  output: >
    Setuid binary executed in container (container=%container.name
    file=%proc.exepath user=%user.name)
  priority: ERROR
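
The allowlist test at the heart of the outbound-connection rule can be prototyped offline before you commit to a list of destinations. The sketch below models only that comparison and is not how Falco evaluates rules internally; accepting CIDR ranges as well as single addresses is an assumption made here for convenience.

```python
import ipaddress


def is_unexpected_destination(dest_ip, allowed):
    """Return True if dest_ip matches none of the allowed addresses or networks.

    `allowed` may contain single addresses ("192.0.2.10") or CIDR ranges
    ("10.0.0.0/24"). IPv4 and IPv6 entries never match each other.
    """
    address = ipaddress.ip_address(dest_ip)
    for item in allowed:
        if address in ipaddress.ip_network(item, strict=False):
            return False
    return True
```

Replaying recorded connection logs through a function like this gives a quick estimate of how noisy the rule would be with a given allowlist.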

Defense in Depth: Combining All Layers

The strongest container security posture uses all of these controls together. No single layer is sufficient — seccomp may block the specific syscall used in a known exploit, AppArmor prevents filesystem access even if seccomp is bypassed, user namespaces limit the blast radius if both are defeated, and Falco provides detection when prevention fails.

A security checklist for production container workloads:

Run as non-root user inside the container (USER 1000 in Dockerfile)
Drop all capabilities, add back only what is needed
Set --security-opt no-new-privileges:true
Use a custom seccomp profile derived from actual syscall usage
Apply a least-privilege AppArmor profile
Mount filesystems read-only where possible
Use rootless container runtime in environments where it is feasible
Deploy Falco with custom rules for your environment
Enforce these requirements via Kyverno or OPA/Gatekeeper admission policies
Keep container runtimes (runc, containerd) and host kernels patched promptly
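
Several checklist items can be verified mechanically, which is essentially what an admission policy does. The sketch below audits a pod manifest that has already been loaded into a Python dict. It is a simplified illustration rather than a replacement for Kyverno or Gatekeeper: it inspects only container-level securityContext fields and ignores pod-level defaults.

```python
def audit_container(container):
    """Return a list of checklist violations for one container spec."""
    context = container.get("securityContext") or {}
    capabilities = context.get("capabilities") or {}
    dropped = [cap.upper() for cap in capabilities.get("drop") or []]

    findings = []
    if context.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation must be false")
    if context.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot must be true")
    if context.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem should be true")
    if "ALL" not in dropped:
        findings.append("capabilities.drop must include ALL")
    return findings


def audit_pod(pod):
    """Return {container name: findings} for containers with violations."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    report = {}
    for container in containers:
        findings = audit_container(container)
        if findings:
            report[container.get("name", "<unnamed>")] = findings
    return report
```

An empty report means every container passed these four checks. Seccomp, AppArmor, and runtime patch levels still need to be verified separately.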

Conclusion

Container escape prevention is fundamentally about minimizing the attack surface exposed to container processes. Linux namespaces provide the isolation foundation, but they are not the full story. Seccomp profiles limit which kernel operations a container can invoke, AppArmor enforces what files and capabilities are accessible, rootless architectures reduce the value of a successful escape, and Falco detects and alerts on anomalous behavior in real time.

The investment in understanding these mechanisms pays dividends not just in security but in system understanding. Engineers who know why their containers are isolated are better positioned to debug isolation failures, design secure architectures, and respond effectively when alerts fire.