Restrict a Container's Syscalls with seccomp
Kubernetes v1.19 [stable]
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.
Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.
Objectives
- Learn how to load seccomp profiles on a node
- Learn how to apply a seccomp profile to a container
- Observe auditing of syscalls made by a container process
- Observe behavior when a missing profile is specified
- Observe a violation of a seccomp profile
- Learn how to create fine-grained seccomp profiles
- Learn how to apply a container runtime default seccomp profile
Before you begin
In order to complete all steps in this tutorial, you must install kind and kubectl.
This tutorial shows some examples that are still alpha (since v1.22) and others that use only generally available seccomp functionality. Make sure that your cluster is configured correctly for the version you are using.
The tutorial also uses the curl tool for downloading examples to your computer. You can adapt the steps to use a different tool if you prefer.
Note: It is not possible to apply a seccomp profile to a container running with privileged: true set in the container's securityContext. Privileged containers always run as Unconfined.
Download example seccomp profiles
The contents of these profiles will be explored later on, but for now go ahead and download them into a directory named profiles/ so that they can be loaded into the cluster.
profiles/audit.json:
{
  "defaultAction": "SCMP_ACT_LOG"
}
profiles/violation.json:
{
  "defaultAction": "SCMP_ACT_ERRNO"
}
profiles/fine-grained.json:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept4",
        "epoll_wait",
        "pselect6",
        "futex",
        "madvise",
        "epoll_ctl",
        "getsockname",
        "setsockopt",
        "vfork",
        "mmap",
        "read",
        "write",
        "close",
        "arch_prctl",
        "sched_getaffinity",
        "munmap",
        "brk",
        "rt_sigaction",
        "rt_sigprocmask",
        "sigaltstack",
        "gettid",
        "clone",
        "bind",
        "socket",
        "openat",
        "readlinkat",
        "exit_group",
        "epoll_create1",
        "listen",
        "rt_sigreturn",
        "sched_yield",
        "clock_gettime",
        "connect",
        "dup2",
        "epoll_pwait",
        "execve",
        "exit",
        "fcntl",
        "getpid",
        "getuid",
        "ioctl",
        "mprotect",
        "nanosleep",
        "open",
        "poll",
        "recvfrom",
        "sendto",
        "set_tid_address",
        "setitimer",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
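The three profiles share one shape: a defaultAction, and optionally a list of architectures and per-syscall rules. As a minimal sketch of that shape (the helper name build_profile is hypothetical and not part of any Kubernetes tooling), the following Python builds an allowlist-style profile from a list of syscall names:

```python
import json


def build_profile(allowed_syscalls, default_action="SCMP_ACT_ERRNO"):
    """Return a seccomp profile dict that allows only the named syscalls.

    Hypothetical helper for illustration; the architecture list is copied
    from the fine-grained.json example in this tutorial.
    """
    profile = {
        "defaultAction": default_action,
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
    }
    names = sorted(set(allowed_syscalls))  # de-duplicate and order the names
    if names:
        profile["syscalls"] = [{"names": names, "action": "SCMP_ACT_ALLOW"}]
    return profile


example = build_profile(["read", "write", "close", "read"])
print(json.dumps(example, indent=2))
```

Writing the result to a file under profiles/ would make it available to the cluster in the same way as the downloaded examples.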
Run these commands:
mkdir ./profiles
curl -L -o profiles/audit.json https://k8s.io/examples/pods/security/seccomp/profiles/audit.json
curl -L -o profiles/violation.json https://k8s.io/examples/pods/security/seccomp/profiles/violation.json
curl -L -o profiles/fine-grained.json https://k8s.io/examples/pods/security/seccomp/profiles/fine-grained.json
ls profiles
You should see three profiles listed at the end of the final step:
audit.json fine-grained.json violation.json
Create a local Kubernetes cluster with kind
For simplicity, kind can be used to create a single-node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows files to be mounted into the filesystem of each container, similar to loading files onto a node.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
  - hostPath: "./profiles"
    containerPath: "/var/lib/kubelet/seccomp/profiles"
Download that example kind configuration, and save it to a file named kind.yaml:
curl -L -O https://k8s.io/examples/pods/security/seccomp/kind.yaml
You can set a specific Kubernetes version by setting the node's container image. See the kind documentation on node configuration for details.
As an alpha feature, you can configure Kubernetes to use the profile that the container runtime prefers by default, rather than falling back to Unconfined. If you want to try that, see "Enable the use of RuntimeDefault as the default seccomp profile for all workloads" before you continue.
Once you have a kind configuration in place, create the kind cluster with that configuration:
kind create cluster --config=kind.yaml
After the new Kubernetes cluster is ready, identify the Docker container running as the single-node cluster:
docker ps
You should see output indicating that a container is running with the name kind-control-plane. The output is similar to:
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                       NAMES
6a96207fed4b   kindest/node:v1.18.2   "/usr/local/bin/entr…"   27 seconds ago   Up 24 seconds   127.0.0.1:42223->6443/tcp   kind-control-plane
If you inspect the filesystem of that container, you should see that the profiles/ directory has been loaded into the default seccomp path of the kubelet. Use docker exec to run a command in the node container:
# Change 6a96207fed4b to the container ID you saw from "docker ps"
docker exec -it 6a96207fed4b ls /var/lib/kubelet/seccomp/profiles
audit.json fine-grained.json violation.json
You have verified that these seccomp profiles are available to the kubelet running within kind.
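The mount target matters because a Localhost profile is referenced by a path relative to the kubelet's seccomp directory. The sketch below illustrates that mapping, assuming the default directory /var/lib/kubelet/seccomp used in this tutorial; the function name node_profile_path is hypothetical.

```python
import posixpath

# Assumption: the kubelet uses its default root directory, so profiles
# live under /var/lib/kubelet/seccomp on the node.
KUBELET_SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def node_profile_path(localhost_profile, root=KUBELET_SECCOMP_ROOT):
    """Map a relative localhostProfile value to the file path on the node."""
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile must be a relative path")
    joined = posixpath.normpath(posixpath.join(root, localhost_profile))
    # Reject values such as "../x" that would leave the seccomp directory.
    if not joined.startswith(root + "/"):
        raise ValueError("localhostProfile must stay below the seccomp root")
    return joined


print(node_profile_path("profiles/audit.json"))
```

With the kind configuration above, a value such as profiles/audit.json therefore points at one of the files mounted from the local profiles/ directory.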
Enable the use of RuntimeDefault as the default seccomp profile for all workloads
Kubernetes v1.22 [alpha]
SeccompDefault is an optional kubelet feature gate with a corresponding --seccomp-default command line flag. Both have to be enabled simultaneously to use the feature.
If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. The default profiles may differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.
Note: Enabling the feature will neither change the Kubernetes securityContext.seccompProfile API field nor add the deprecated annotations of the workload. This lets you roll back at any time without changing the workload configuration. Tools like crictl inspect can be used to verify which seccomp profile a container is using.
Some workloads may require fewer syscall restrictions than others. This means that they can fail at runtime even with the RuntimeDefault profile. To mitigate such a failure, you can:
- Run the workload explicitly as Unconfined.
- Disable the SeccompDefault feature for the nodes. Also make sure that workloads get scheduled on nodes where the feature is disabled.
- Create a custom seccomp profile for the workload.
If you were introducing this feature into a production-like cluster, the Kubernetes project recommends that you enable this feature gate on a subset of your nodes and then test workload execution before rolling the change out cluster-wide.
More detailed information about a possible upgrade and downgrade strategy can be found in the related Kubernetes Enhancement Proposal (KEP).
Since the feature is in alpha state, it is disabled by default. To enable it, pass the flags --feature-gates=SeccompDefault=true --seccomp-default to the kubelet CLI or enable it via the kubelet configuration file. To enable the feature gate in kind, ensure that kind provides the minimum required Kubernetes version and enables the SeccompDefault feature in the kind configuration:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  SeccompDefault: true
nodes:
  - role: control-plane
    image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            seccomp-default: "true"
  - role: worker
    image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            feature-gates: SeccompDefault=true
            seccomp-default: "true"
Once the cluster is ready, run a Pod:
kubectl run --rm -it --restart=Never --image=alpine alpine -- sh
The Pod should now have the default seccomp profile attached. You can verify this by using docker exec to run crictl inspect for the container on the kind worker:
docker exec -it kind-worker bash -c \
    'crictl inspect $(crictl ps --name=alpine -q) | jq .info.runtimeSpec.linux.seccomp'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["..."]
    }
  ]
}
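The rules in this section, together with the earlier note about privileged containers, can be summarized as a small decision function. This is a simplified model for illustration, not the kubelet's implementation; the function name and parameters are hypothetical, and it relies on the API rule that a container-level seccompProfile overrides the Pod-level one.

```python
def effective_seccomp_type(pod_profile=None, container_profile=None,
                           seccomp_default=False, privileged=False):
    """Return the seccomp profile type a container would end up with.

    pod_profile / container_profile mimic the seccompProfile field,
    for example {"type": "Localhost", "localhostProfile": "profiles/audit.json"}.
    """
    if privileged:
        # Privileged containers always run as Unconfined.
        return "Unconfined"
    # A container-level setting takes precedence over the Pod-level one.
    for profile in (container_profile, pod_profile):
        if profile is not None:
            return profile["type"]
    # Nothing specified: the SeccompDefault feature decides the fallback.
    return "RuntimeDefault" if seccomp_default else "Unconfined"


print(effective_seccomp_type())                      # no profile, feature off
print(effective_seccomp_type(seccomp_default=True))  # no profile, feature on
```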
Create a Pod with a seccomp profile for syscall auditing
To start off, apply the audit.json profile, which will log all syscalls of the process, to a new Pod.
Here's a manifest for that Pod:
apiVersion: v1
kind: Pod
metadata:
  name: audit-pod
  labels:
    app: audit-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
Note: The functional support for the already deprecated seccomp annotations seccomp.security.alpha.kubernetes.io/pod (for the whole pod) and container.seccomp.security.alpha.kubernetes.io/[name] (for a single container) is going to be removed with the release of Kubernetes v1.25. Always use the native API fields instead of the annotations.
Create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml
This profile does not restrict any syscalls, so the Pod should start successfully.
kubectl get pod/audit-pod
NAME        READY   STATUS    RESTARTS   AGE
audit-pod   1/1     Running   0          30s
To interact with the endpoint exposed by this container, create a NodePort Service that allows access to the endpoint from inside the kind control plane container.
kubectl expose pod audit-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node.
kubectl get service audit-pod
The output is similar to:
NAME        TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
audit-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s
Now you can use curl to access that endpoint from inside the kind control plane container, at the port exposed by this Service. Use docker exec to run the curl command within the control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You can see that the process is running, but what syscalls did it actually make? Because this Pod is running in a local cluster, you should be able to see those in /var/log/syslog. Open up a new terminal window and tail the output for calls from http-echo:
tail -f /var/log/syslog | grep 'http-echo'
You should already see some logs of syscalls made by http-echo, and if you curl the endpoint in the control plane container you will see more written.
For example:
Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all syscalls it uses, they can serve as a basis for a seccomp profile for this container.
Clean up that Pod and Service before moving to the next section:
kubectl delete service audit-pod --wait
kubectl delete pod audit-pod --wait --now
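Reading syscall= values by eye gets tedious. As a rough sketch, the following Python collects the distinct syscall numbers logged for one command name and translates them with a small, hand-written x86_64 table. The table covers only the numbers that appear in the sample log lines in this section; the function name and the abbreviated sample lines are illustrative assumptions.

```python
import re

# Matches the comm="..." and syscall=N fields of a seccomp audit record.
AUDIT_RE = re.compile(r'comm="(?P<comm>[^"]+)".*?\bsyscall=(?P<nr>\d+)')

# Partial x86_64 syscall table (arch=c000003e), limited to the numbers
# seen in the sample output of this tutorial.
X86_64_NAMES = {
    0: "read",
    51: "getsockname",
    54: "setsockopt",
    202: "futex",
    270: "pselect6",
    288: "accept4",
}


def observed_syscalls(lines, comm):
    """Return the sorted, distinct syscall numbers logged for `comm`."""
    seen = set()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match and match.group("comm") == comm:
            seen.add(int(match.group("nr")))
    return sorted(seen)


# Abbreviated log lines in the same format as the syslog output.
sample = [
    'audit: type=1326 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0',
    'audit: type=1326 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
    'audit: type=1326 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
    'audit: type=1326 pid=31000 comm="other" exe="/other" sig=0 arch=c000003e syscall=1 compat=0',
]

numbers = observed_syscalls(sample, "http-echo")
names = [X86_64_NAMES.get(n, "unknown(%d)" % n) for n in numbers]
print(numbers, names)
```

On a real node you would feed it the lines from your system log and use a complete syscall table for the node's architecture.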
Create a Pod with a seccomp profile that causes a violation
For demonstration, apply a profile to the Pod that does not allow any syscalls.
The manifest for this demonstration is:
apiVersion: v1
kind: Pod
metadata:
  name: violation-pod
  labels:
    app: violation-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/violation.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
Attempt to create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/violation-pod.yaml
The Pod is created, but there is an issue. If you check the status of the Pod, you should see that it failed to start.
kubectl get pod/violation-pod
NAME            READY   STATUS             RESTARTS   AGE
violation-pod   0/1     CrashLoopBackOff   1          6s
As seen in the previous example, the http-echo process requires quite a few syscalls. Here seccomp has been instructed to error on any syscall by setting "defaultAction": "SCMP_ACT_ERRNO". This is extremely secure, but removes the ability to do anything meaningful. What you really want is to give workloads only the privileges they need.
Clean up that Pod before moving to the next section:
kubectl delete pod violation-pod --wait --now
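To see why every syscall fails under this profile, it can help to model how a profile is evaluated. The sketch below is a deliberately simplified model: it ignores architectures and argument filters, and the function name action_for is hypothetical.

```python
def action_for(profile, syscall_name):
    """Return the action a profile assigns to a syscall (simplified model)."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the profile's defaultAction applies.
    return profile["defaultAction"]


# Same content as profiles/violation.json.
violation = {"defaultAction": "SCMP_ACT_ERRNO"}

for name in ("execve", "read", "futex"):
    print(name, action_for(violation, name))
```

Because the profile has no rules at all, every lookup falls through to the default action, including the syscalls the process needs just to start.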
Create a Pod with a seccomp profile that only allows necessary syscalls
If you take a look at the fine-grained.json profile, you will notice some of the syscalls seen in syslog in the first example, where the profile set "defaultAction": "SCMP_ACT_LOG". Now the profile sets "defaultAction": "SCMP_ACT_ERRNO", but explicitly allows a set of syscalls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container will run successfully and you will see no messages sent to syslog.
The manifest for this example is:
apiVersion: v1
kind: Pod
metadata:
  name: fine-pod
  labels:
    app: fine-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/fine-grained.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
Create the Pod in your cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/fine-pod.yaml
kubectl get pod fine-pod
The Pod should show as having started successfully:
NAME       READY   STATUS    RESTARTS   AGE
fine-pod   1/1     Running   0          30s
Open up a new terminal window and use tail to monitor for log entries that mention calls from http-echo:
# The log path on your computer might be different from "/var/log/syslog"
tail -f /var/log/syslog | grep 'http-echo'
Next, expose the Pod with a NodePort Service:
kubectl expose pod fine-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node:
kubectl get service fine-pod
The output is similar to:
NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
fine-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s
Use curl to access that endpoint from inside the kind control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You should see no output in the syslog. This is because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but it required some effort in analyzing the program. It would be nice if there were a simple way to get closer to this security without requiring as much effort.
Clean up that Pod and Service before moving to the next section:
kubectl delete service fine-pod --wait
kubectl delete pod fine-pod --wait --now
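When crafting an allowlist like this, one practical check is to compare the syscalls you observed during auditing against what the profile allows. The following sketch does that comparison; the function name missing_syscalls and the shortened profile are illustrative assumptions, not part of the tutorial's files.

```python
def missing_syscalls(profile, observed_names):
    """Return observed syscalls that the profile does not explicitly allow."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return sorted(set(observed_names) - allowed)


# Shortened stand-in for profiles/fine-grained.json.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["accept4", "futex", "read", "write"], "action": "SCMP_ACT_ALLOW"}
    ],
}

print(missing_syscalls(profile, ["read", "futex", "ptrace"]))
```

An empty result suggests the profile covers everything seen so far, although auditing is unlikely to exercise every code path of the program.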
Create a Pod that uses the container runtime default seccomp profile
Most container runtimes provide a sane default set of syscalls that are allowed or denied. You can adopt these defaults for your workload by setting the seccomp type in the security context of a pod or container to RuntimeDefault.
Note: If you have the SeccompDefault feature gate enabled, then Pods use the RuntimeDefault seccomp profile whenever no other seccomp profile is specified. Otherwise, the default is Unconfined.
Here's a manifest for a Pod that requests the RuntimeDefault seccomp profile for all its containers:
apiVersion: v1
kind: Pod
metadata:
  name: default-pod
  labels:
    app: default-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some more syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
Create that Pod:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/default-pod.yaml
kubectl get pod default-pod
The Pod should show as having started successfully:
NAME          READY   STATUS    RESTARTS   AGE
default-pod   1/1     Running   0          20s
Finally, now that you have seen that work, clean up:
kubectl delete pod default-pod --wait --now
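The manifests in this tutorial differ mainly in their seccompProfile field. As a small sketch of its shape, the Python below builds that field and a matching Pod manifest as plain dictionaries. The helper names are hypothetical; the check reflects the API rule that localhostProfile is set only when the type is Localhost.

```python
import json

VALID_TYPES = {"RuntimeDefault", "Unconfined", "Localhost"}


def seccomp_profile(profile_type, localhost_profile=None):
    """Build a seccompProfile field, rejecting inconsistent combinations."""
    if profile_type not in VALID_TYPES:
        raise ValueError("unknown seccomp profile type: %r" % profile_type)
    if profile_type == "Localhost":
        if not localhost_profile:
            raise ValueError("Localhost requires localhostProfile")
        return {"type": "Localhost", "localhostProfile": localhost_profile}
    if localhost_profile is not None:
        raise ValueError("localhostProfile is only valid with type Localhost")
    return {"type": profile_type}


def pod_manifest(name, image, profile):
    """Return a minimal Pod manifest with a Pod-level seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "securityContext": {"seccompProfile": profile},
            "containers": [{
                "name": "test-container",
                "image": image,
                "securityContext": {"allowPrivilegeEscalation": False},
            }],
        },
    }


manifest = pod_manifest("default-pod", "hashicorp/http-echo:0.2.3",
                        seccomp_profile("RuntimeDefault"))
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and passed to kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.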
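What's next
You can learn more about Linux seccomp:
- A seccomp overview
- Seccomp security profiles for Docker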