Container runtimes present a unique security challenge: they provide lightweight isolation, but that isolation is far thinner than a full virtual machine. A container shares the host kernel. An exploit that achieves arbitrary kernel code execution in a container breaks the isolation boundary entirely. Defense-in-depth for containers therefore focuses on shrinking the attack surface exposed to workloads through syscall filtering, mandatory access controls, and runtime behavioral monitoring.
The Threat Model
Understanding what you are protecting against clarifies which controls matter:
- Container escape: An attacker who compromises a container process exploits a kernel vulnerability to gain host-level access. Seccomp and AppArmor reduce the kernel attack surface available to this path.
- Privilege escalation within a container: A non-root container process exploits a SUID binary or capability to become root inside the container. Pod Security Standards and capability dropping address this.
- Malicious container image: A compromised or malicious image executes unexpected processes, exfiltrates data, or joins a botnet. Runtime detection (Falco) catches behavioral anomalies.
- Lateral movement: A compromised container attempts to communicate with other services in the cluster. NetworkPolicy controls east-west traffic.
Seccomp: Syscall Filtering
Linux has approximately 400 syscalls, and a typical containerized application uses only a small fraction of them. Seccomp (Secure Computing Mode) lets you specify an allowlist of permitted syscalls; for any other call the kernel either kills the process (SCMP_ACT_KILL) or fails the call with an error such as EPERM (SCMP_ACT_ERRNO), depending on configuration.
Default Seccomp Profile
Docker and containerd ship a default seccomp profile that blocks ~44 of the syscalls most dangerous for container escape, including ptrace, personality, keyctl, and the clock-setting syscalls. This is a reasonable baseline, but Kubernetes does not apply it by default: unless the kubelet runs with --seccomp-default, pods run unconfined and you must opt in:
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault # Uses the container runtime's default profile
Custom Seccomp Profiles
For workloads with well-understood syscall requirements, a custom profile that allows only the specific syscalls your application uses is significantly more restrictive than the default:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "brk", "clone", "close", "connect",
        "epoll_create1", "epoll_ctl", "epoll_wait", "execve", "exit",
        "exit_group", "fstat", "futex", "getpid", "getuid", "listen",
        "lstat", "mmap", "mprotect", "munmap", "nanosleep", "open",
        "openat", "read", "recvfrom", "rt_sigaction", "rt_sigprocmask",
        "sendto", "set_robust_list", "setitimer", "socket", "stat",
        "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
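To make the allowlist semantics concrete, here is a minimal Python sketch of how such a profile resolves a syscall name to an action. It is a simplified model, not the kernel's or libseccomp's actual evaluation: it ignores the architectures list and any per-argument filters, and the helper name action_for and the shortened profile are assumptions made for illustration.

```python
import json

# Shortened, illustrative profile in the same format as the one above.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "openat", "exit_group"],
     "action": "SCMP_ACT_ALLOW"}
  ]
}
"""


def action_for(profile: dict, syscall: str) -> str:
    """Return the action the profile assigns to a syscall name.

    Simplified model: the first rule naming the syscall wins, and
    anything unlisted falls through to defaultAction.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
for name in ("read", "ptrace", "mount"):
    print(f"{name}: {action_for(profile, name)}")
```

Here read is allowed, while ptrace and mount fall through to the default action and fail with an error instead of reaching the kernel code that implements them.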
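Generate a minimal profile by recording the syscalls your application actually makes during testing, using strace or seccomp-bpf based tooling, then convert that list to an allowlist. Exercise every code path while recording: any syscall the recording misses will fail once the profile is enforced.
Copy the profile into the kubelet's seccomp directory on every node (by default /var/lib/kubelet/seccomp) and reference it in the pod spec. The localhostProfile value is a path relative to that directory: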
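The conversion step can be sketched in a few lines of Python. This is an illustrative sketch, not a complete tool: it assumes plain strace output (optionally with the PID prefixes that strace -f adds), does not handle timestamp options, and the function names and sample lines are made up for the example.

```python
import json
import re

# Syscall name at the start of an strace line, with an optional PID prefix:
# "[pid 123] " (strace -f) or "123 " (strace -f -o file).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines):
    """Collect the set of syscall names seen in strace output lines.

    Signal lines ("--- SIG... ---"), exit lines ("+++ ... +++") and
    "<... resumed>" continuations do not match and are skipped.
    """
    seen = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            seen.add(match.group(1))
    return seen


def build_profile(syscalls):
    """Build an allowlist profile: listed syscalls allowed, the rest fail."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
        "syscalls": [{"names": sorted(syscalls), "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical trace excerpt used only to demonstrate the parser.
SAMPLE = [
    'execve("/app/server", ["/app/server"], 0x7ffc1a2b3c40 /* 12 vars */) = 0',
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '[pid 4242] read(3, "data", 4) = 4',
    '[pid 4242] <... futex resumed>) = 0',
    '4243 write(1, "ok", 2) = 2',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '+++ exited with 0 +++',
]

print(json.dumps(build_profile(syscalls_from_strace(SAMPLE)), indent=2))
```

Treat the result as a starting point for review. A trace taken in a test environment reflects only the paths that test exercised, and the container runtime itself may need syscalls during startup that the application never calls directly.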
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/api-service-seccomp.json
AppArmor for Containers
AppArmor complements seccomp by restricting file system access, network operations, and capability usage at the MAC (Mandatory Access Control) layer. While seccomp filters syscalls by number, AppArmor policies express restrictions in terms of file paths, network protocols, and capabilities.
Default AppArmor Profile
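Docker’s default AppArmor profile (docker-default) blocks dangerous operations like mounting filesystems and writing to sensitive paths. Enable the runtime's default profile by annotating the pod. The annotation is per container: its key ends with the container name (api-container here), so add one entry for each container. On Kubernetes 1.30 and later, the securityContext.appArmorProfile field supersedes this annotation.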
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/api-container: runtime/default
Custom AppArmor Profile
A custom profile for an API service that serves HTTP and writes only to specific directories:
#include <tunables/global>
profile api-service flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  #include <abstractions/nameservice>
  # Allow read access to application files
  /app/** r,
  /app/server ix,
  # Allow writes to the log directory and /tmp only
  /var/log/api/** rw,
  /tmp/** rw,
  # Network: allow TCP and UDP sockets. These rules filter by socket
  # type, not by port, so restricting traffic to HTTP/HTTPS has to be
  # done elsewhere (for example with NetworkPolicy).
  network tcp,
  network udp,
  # Deny sensitive paths explicitly
  deny /etc/shadow r,
  deny /proc/sys/** w,
  deny /sys/** w,
  # Capabilities: only what's needed
  capability net_bind_service,
  deny capability sys_admin,
  deny capability sys_ptrace,
}
Load the profile on each node:
sudo apparmor_parser -r -W /etc/apparmor.d/api-service
# Verify
sudo aa-status | grep api-service
Use a DaemonSet to distribute and load custom AppArmor profiles across all cluster nodes automatically.
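A loader running on each node could end with a check that the profile is actually active. The Python sketch below parses the kernel's profile listing, which securityfs exposes as one "name (mode)" entry per line in /sys/kernel/security/apparmor/profiles. The function names and sample text are illustrative assumptions; reading the real file normally requires root on the node.

```python
from pathlib import Path

# securityfs file listing loaded profiles, one "name (mode)" entry per line.
PROFILES_FILE = Path("/sys/kernel/security/apparmor/profiles")


def parse_loaded_profiles(text: str) -> dict:
    """Map profile name -> mode (for example "enforce" or "complain")."""
    profiles = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(")") or " (" not in line:
            continue  # skip blank or unexpected lines
        name, mode = line[:-1].rsplit(" (", 1)
        profiles[name] = mode
    return profiles


def is_enforced(text: str, profile: str) -> bool:
    """True only if the profile is loaded and in enforce mode."""
    return parse_loaded_profiles(text).get(profile) == "enforce"


# Sample listing for demonstration; on a node, use PROFILES_FILE.read_text().
sample = (
    "docker-default (enforce)\n"
    "api-service (enforce)\n"
    "/usr/sbin/tcpdump (complain)\n"
)
print(is_enforced(sample, "api-service"))
```

Checking for enforce mode rather than mere presence matters because a profile loaded in complain mode logs violations but does not block them.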
Pod Security Standards
Kubernetes Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replace PodSecurityPolicy, which was removed in Kubernetes 1.25. Three policy levels are available:
- Privileged: Unrestricted. No controls applied.
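- Baseline: Prevents known privilege escalation paths. Disallows privileged containers, host namespaces (hostNetwork/hostPID/hostIPC), hostPath volumes, and dangerous capabilities.
- Restricted: Enforces current hardening best practices. On top of Baseline, it requires running as non-root, allowPrivilegeEscalation: false, dropping all capabilities (only NET_BIND_SERVICE may be added back), and a seccomp profile of RuntimeDefault or Localhost. A read-only root filesystem is not part of the standard, though it remains good hardening.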
Apply PSS at the namespace level:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: v1.29
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
With the restricted policy enforced, pods must include a compliant security context:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: api
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true # not required by Restricted, but recommended
      capabilities:
        drop: ["ALL"]
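It can help to lint manifests against these requirements before the admission controller rejects them. The Python sketch below checks a subset of the Restricted rules on a pod spec that has already been parsed into a dict. It is an approximation for illustration, not a reimplementation of Pod Security Admission: it skips initContainers, ephemeral containers, volume types, and added capabilities, and the function name restricted_violations is an assumption.

```python
ALLOWED_SECCOMP = {"RuntimeDefault", "Localhost"}


def restricted_violations(pod_spec: dict) -> list:
    """Return readable violations for a subset of the Restricted checks."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")

        # Container-level values take precedence over pod-level ones.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            violations.append(f"{name}: runAsNonRoot must be true")

        # Must be set explicitly to false; unset does not pass.
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")

        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: capabilities.drop must include ALL")

        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ALLOWED_SECCOMP:
            violations.append(f"{name}: seccompProfile.type must be "
                              "RuntimeDefault or Localhost")
    return violations


# The spec shown above, reduced to the fields this sketch inspects.
COMPLIANT = {
    "securityContext": {
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "api",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
BARE = {"containers": [{"name": "api"}]}

print(restricted_violations(COMPLIANT))
for problem in restricted_violations(BARE):
    print(problem)
```

For authoritative results, the warn and audit labels on a namespace report what the real admission controller would reject without blocking anything.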
Falco Runtime Detection
Seccomp and AppArmor enforce static policies. Falco provides dynamic behavioral detection — it observes syscalls in real time and raises alerts when behavior matches threat signatures, regardless of whether a policy explicitly blocked an action.
Deploy Falco as a DaemonSet:
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set driver.kind=modern_ebpf \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/..."
Example Falco rules for container threat detection. The lists they reference (known_shell_spawning_binaries, allowed_network_tools, trusted_server_ips) must exist in your loaded rules files; define them if they do not:
- rule: Terminal Shell in Container
  desc: Detects shell spawned in a container
  condition: >
    spawned_process and container
    and shell_procs
    and not proc.pname in (known_shell_spawning_binaries)
  output: >
    Shell spawned in container (user=%user.name container=%container.id
    image=%container.image.repository cmd=%proc.cmdline)
  priority: WARNING
- rule: Sensitive File Read in Container
  desc: Detects reads of sensitive files from inside a container
  condition: >
    open_read and container
    and (fd.name startswith /etc/shadow
    or fd.name startswith /root/.ssh
    or fd.name startswith /etc/kubernetes/pki)
  output: >
    Sensitive file read from container (file=%fd.name
    container=%container.id image=%container.image.repository)
  priority: CRITICAL
- rule: Unexpected Network Connection
  desc: Container connects to unexpected external host
  condition: >
    outbound and container
    and not proc.name in (allowed_network_tools)
    and not fd.sip in (trusted_server_ips)
  output: >
    Unexpected outbound connection (dest=%fd.rip:%fd.rport
    container=%container.id proc=%proc.name)
  priority: WARNING
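Once rules like these fire, alerts usually need triage by severity. The Python sketch below filters Falco events by priority, assuming Falco is configured with json_output: true so that each alert is one JSON object per line with rule, priority, and output fields. The sample events are invented for the example, and the function name at_least is an assumption.

```python
import json

# Falco priority names, most severe first.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]


def at_least(lines, threshold="critical"):
    """Yield parsed events whose priority is at or above the threshold."""
    limit = PRIORITIES.index(threshold)
    for raw in lines:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(event, dict):
            continue
        priority = str(event.get("priority", "")).lower()
        if priority in PRIORITIES and PRIORITIES.index(priority) <= limit:
            yield event


# Invented sample lines in the assumed one-object-per-line format.
SAMPLE_LINES = [
    '{"rule": "Terminal Shell in Container", "priority": "Warning", '
    '"output": "Shell spawned in container"}',
    '{"rule": "Sensitive File Read in Container", "priority": "Critical", '
    '"output": "Sensitive file read from container"}',
    'falco: plain text startup message',
]

for event in at_least(SAMPLE_LINES):
    print(event["rule"], "-", event["output"])
```

In the deployment above, Falcosidekick already forwards alerts to Slack; a filter like this is mainly useful for ad hoc analysis of collected logs or for a custom consumer.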
Image Security
Runtime defenses are the last line of defense. Start earlier in the supply chain:
- Minimal base images: distroless or Alpine-based images eliminate hundreds of binaries an attacker could leverage post-exploitation.
- Image scanning: Scan images in CI with Trivy or Grype before they reach production. Fail the build on CRITICAL vulnerabilities so that affected images are never pushed to your registry.
- Image signing: Use Cosign to sign images and Kyverno or OPA Gatekeeper to enforce that only signed images from trusted registries run in production namespaces.
- No privileged containers: Enforce via PSS or an admission webhook. There are virtually no production workloads that legitimately require privileged: true.
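The scanning gate from the list above can be sketched as a small CI step. The Python below reads a Trivy JSON report (as produced with --format json) and returns a non-zero exit code when any CRITICAL finding is present. It relies on the report's Results, Vulnerabilities, Severity, and VulnerabilityID fields; the sample report, image name, and CVE identifier are placeholders, not real scan data.

```python
import json


def critical_findings(report: dict) -> list:
    """Return (target, vulnerability ID) pairs for CRITICAL findings."""
    findings = []
    # Both keys may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL":
                findings.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"))
                )
    return findings


def gate(report: dict) -> int:
    """Return a process exit code: 1 blocks the push, 0 lets it through."""
    findings = critical_findings(report)
    for target, vuln_id in findings:
        print(f"CRITICAL: {vuln_id} in {target}")
    return 1 if findings else 0


# Placeholder report for demonstration only.
SAMPLE_REPORT = json.loads("""
{
  "Results": [
    {
      "Target": "example/api:1.0 (alpine 3.19)",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
        {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
      ]
    },
    {"Target": "app/requirements.txt", "Vulnerabilities": null}
  ]
}
""")

print("exit code:", gate(SAMPLE_REPORT))
```

Trivy can also fail the build on its own with --exit-code 1 --severity CRITICAL; a script like this is mostly useful when the pipeline needs custom reporting or an exception list.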
Conclusion
Container runtime security is layered: seccomp reduces the kernel attack surface syscall by syscall, AppArmor constrains filesystem and network access at the MAC layer, Pod Security Standards enforce baseline hardening policies across the cluster, and Falco detects behavioral anomalies that static policies miss. No single control is sufficient — the value comes from the combination. Implement all layers, and integrate image scanning and signing upstream in your CI/CD pipeline so that threats are caught before they reach the runtime at all.
