Restrict a Container's Syscalls with seccomp
Kubernetes v1.19 [stable]
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.
Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.
Objectives
- Learn how to load seccomp profiles on a node
- Learn how to apply a seccomp profile to a container
- Observe auditing of syscalls made by a container process
- Observe behavior when a missing profile is specified
- Observe a violation of a seccomp profile
- Learn how to create fine-grained seccomp profiles
- Learn how to apply a container runtime default seccomp profile
Before you begin
To complete all steps in this tutorial, you must install kind and kubectl.
The commands used in the tutorial assume that you are using Docker as your container runtime. (The cluster that kind creates may use a different container runtime internally.) You could also use Podman, but in that case you would have to follow specific instructions in order to complete the tasks successfully.
This tutorial shows some examples that are still beta (since v1.25) and others that use only generally available seccomp functionality. You should make sure that your cluster is configured correctly for the version you are using.
The tutorial also uses the curl tool for downloading examples to your computer. You can adapt the steps to use a different tool if you prefer.
Note:
It is not possible to apply a seccomp profile to a container running with privileged: true set in the container's securityContext. Privileged containers always run as Unconfined.
Download example seccomp profiles
The contents of these profiles will be explored later on, but for now go ahead and download them into a directory named profiles/ so that they can be loaded into the cluster.
audit.json:

    {
      "defaultAction": "SCMP_ACT_LOG"
    }

violation.json:

    {
      "defaultAction": "SCMP_ACT_ERRNO"
    }

fine-grained.json:

    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ],
      "syscalls": [
        {
          "names": [
            "accept4", "epoll_wait", "pselect6", "futex", "madvise", "epoll_ctl",
            "getsockname", "setsockopt", "vfork", "mmap", "read", "write", "close",
            "arch_prctl", "sched_getaffinity", "munmap", "brk", "rt_sigaction",
            "rt_sigprocmask", "sigaltstack", "gettid", "clone", "bind", "socket",
            "openat", "readlinkat", "exit_group", "epoll_create1", "listen",
            "rt_sigreturn", "sched_yield", "clock_gettime", "connect", "dup2",
            "epoll_pwait", "execve", "exit", "fcntl", "getpid", "getuid", "ioctl",
            "mprotect", "nanosleep", "open", "poll", "recvfrom", "sendto",
            "set_tid_address", "setitimer", "writev", "fstatfs", "getdents64",
            "pipe2", "getrlimit"
          ],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }

Run these commands:

    mkdir ./profiles
    curl -L -o profiles/audit.json https://k8s.io/examples/pods/security/seccomp/profiles/audit.json
    curl -L -o profiles/violation.json https://k8s.io/examples/pods/security/seccomp/profiles/violation.json
    curl -L -o profiles/fine-grained.json https://k8s.io/examples/pods/security/seccomp/profiles/fine-grained.json
    ls profiles

You should see three profiles listed at the end of the final step:

    audit.json  fine-grained.json  violation.json

Create a local Kubernetes cluster with kind
For simplicity, kind can be used to create a single-node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows files to be mounted in the filesystem of each container, similar to loading files onto a node.

    apiVersion: kind.x-k8s.io/v1alpha4
    kind: Cluster
    nodes:
    - role: control-plane
      extraMounts:
      - hostPath: "./profiles"
        containerPath: "/var/lib/kubelet/seccomp/profiles"

Download that example kind configuration, and save it to a file named kind.yaml:

    curl -L -O https://k8s.io/examples/pods/security/seccomp/kind.yaml

You can set a specific Kubernetes version by setting the node's container image. See Nodes within the kind documentation about configuration for more details on this. This tutorial assumes you are using Kubernetes v1.35.
You can configure Kubernetes to use the profile that the container runtime prefers by default, rather than falling back to Unconfined. If you want to try that, see enable the use of RuntimeDefault as the default seccomp profile for all workloads before you continue.
Once you have a kind configuration in place, create the kind cluster with that configuration:

    kind create cluster --config=kind.yaml

After the new Kubernetes cluster is ready, identify the Docker container running as the single-node cluster:

    docker ps

You should see output indicating that a container is running with the name kind-control-plane. The output is similar to:

    CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                       NAMES
    6a96207fed4b   kindest/node:v1.18.2   "/usr/local/bin/entr…"   27 seconds ago   Up 24 seconds   127.0.0.1:42223->6443/tcp   kind-control-plane

If you inspect the filesystem of that container, you should see that the profiles/ directory has been successfully loaded into the default seccomp path of the kubelet. Use docker exec to run a command in the node container:

    docker exec -it kind-control-plane ls /var/lib/kubelet/seccomp/profiles

    audit.json  fine-grained.json  violation.json

You have verified that these seccomp profiles are available to the kubelet running within kind.
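The Pod manifests later in this tutorial refer to profiles with values such as profiles/audit.json, which are relative to the kubelet's seccomp directory shown above. The following is a minimal sketch of that mapping, not kubelet code: the helper name node_profile_path is hypothetical, and the checks for absolute paths and ".." are sanity checks chosen for this example.

```python
from pathlib import PurePosixPath

# Default kubelet seccomp directory, matching the containerPath prefix
# used in this tutorial's kind configuration.
KUBELET_SECCOMP_ROOT = PurePosixPath("/var/lib/kubelet/seccomp")


def node_profile_path(localhost_profile: str) -> str:
    """Map a localhostProfile value to the file path expected on the node.

    Hypothetical helper for illustration only.
    """
    relative = PurePosixPath(localhost_profile)
    # Sanity checks for this sketch: the value should stay inside the root.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"expected a relative path without '..': {localhost_profile!r}")
    return str(KUBELET_SECCOMP_ROOT / relative)


print(node_profile_path("profiles/audit.json"))
```

With the kind configuration used here, the printed path is the same location that the docker exec command above lists.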
Create a Pod that uses the container runtime default seccomp profile
Most container runtimes provide a sensible default set of syscalls that are allowed or blocked. You can adopt these defaults for your workload by setting the seccomp type in the security context of a pod or container to RuntimeDefault.
Note:
If you have the seccompDefault configuration enabled, then Pods use the RuntimeDefault seccomp profile whenever no other seccomp profile is specified. Otherwise, the default is Unconfined.
Here's a manifest for a Pod that requests the RuntimeDefault seccomp profile for all its containers:

    apiVersion: v1
    kind: Pod
    metadata:
      name: default-pod
      labels:
        app: default-pod
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: test-container
        image: hashicorp/http-echo:1.0
        args:
        - "-text=just made some more syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

Create that Pod:

    kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/default-pod.yaml
    kubectl get pod default-pod

The Pod should be showing as having started successfully:

    NAME          READY   STATUS    RESTARTS   AGE
    default-pod   1/1     Running   0          20s

Delete the Pod before moving to the next section:

    kubectl delete pod default-pod --wait --now

Create a Pod with a seccomp profile for syscall auditing
To start off, apply the audit.json profile, which will log all syscalls of the process, to a new Pod.
Here's a manifest for that Pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: audit-pod
      labels:
        app: audit-pod
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/audit.json
      containers:
      - name: test-container
        image: hashicorp/http-echo:1.0
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

Note:
Older versions of Kubernetes allowed you to configure seccomp behavior using annotations. Kubernetes 1.35 only supports using fields within .spec.securityContext to configure seccomp, and this tutorial explains that approach.
Create the Pod in the cluster:

    kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml

This profile does not restrict any syscalls, so the Pod should start successfully.

    kubectl get pod audit-pod

    NAME        READY   STATUS    RESTARTS   AGE
    audit-pod   1/1     Running   0          30s

To interact with the endpoint exposed by this container, create a NodePort Service that allows access to the endpoint from inside the kind control plane container.

    kubectl expose pod audit-pod --type NodePort --port 5678

Check what port the Service has been assigned on the node.

    kubectl get service audit-pod

The output is similar to:

    NAME        TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
    audit-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s

Now you can use curl to access that endpoint from inside the kind control plane container, at the port exposed by this Service. Use docker exec to run the curl command within that control plane container:

    # Change 32373 to the port number you saw from "kubectl get service audit-pod"
    docker exec -it kind-control-plane curl localhost:32373

    just made some syscalls!

You can see that the process is running, but what syscalls did it actually make? Because this Pod is running in a local cluster, you should be able to see those in /var/log/syslog on your local system. Open up a new terminal window and tail the output for calls from http-echo:

    # The log path on your computer might be different from "/var/log/syslog"
    tail -f /var/log/syslog | grep 'http-echo'

You should already see some logs of syscalls made by http-echo, and if you run curl again inside the control plane container you will see more output written to the log.
For example:

    Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
    Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
    Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000

You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all the syscalls it uses, they can serve as a basis for a seccomp profile for this container.
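Reading syscall= values by eye gets tedious, so a small script can tally them. The sketch below assumes x86_64 (arch=c000003e in the records above) and includes only the six syscall numbers that appear in the sample log; the names count_syscalls, AUDIT_FIELDS, and SAMPLE are hypothetical, and the sample records are abbreviated copies of the lines shown above.

```python
import re
from collections import Counter

# Partial x86_64 syscall table: only the numbers seen in the sample log.
# Syscall numbers differ between architectures, so this table is an
# assumption tied to arch=c000003e (x86_64).
X86_64_SYSCALLS = {
    0: "read",
    51: "getsockname",
    54: "setsockopt",
    202: "futex",
    270: "pselect6",
    288: "accept4",
}

# Matches the comm="..." and syscall=N fields of a type=1326 (seccomp) record.
AUDIT_FIELDS = re.compile(r'type=1326 .*?comm="(?P<comm>[^"]+)".*?\bsyscall=(?P<nr>\d+)')


def count_syscalls(lines, comm="http-echo"):
    """Count logged syscalls for one command name, keyed by syscall name."""
    counts = Counter()
    for line in lines:
        match = AUDIT_FIELDS.search(line)
        if match and match.group("comm") == comm:
            nr = int(match.group("nr"))
            # Numbers missing from the partial table are kept, but flagged.
            counts[X86_64_SYSCALLS.get(nr, f"unknown({nr})")] += 1
    return counts


# Abbreviated records in the same shape as the syslog output above.
SAMPLE = [
    'kernel: audit: type=1326 audit(1594067860.484:14536): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0',
    'kernel: audit: type=1326 audit(1594067860.484:14538): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
    'kernel: audit: type=1326 audit(1594067920.488:14560): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
]

for name, count in count_syscalls(SAMPLE).most_common():
    print(name, count)
```

To run it against a real log, pass an open file object instead of SAMPLE. A complete number-to-name table for your architecture would be needed before relying on the result.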
Delete the Service and the Pod before moving to the next section:

    kubectl delete service audit-pod --wait
    kubectl delete pod audit-pod --wait --now

Create a Pod with a seccomp profile that causes violation
For demonstration, apply a profile to the Pod that does not allow any syscalls.
The manifest for this demonstration is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: violation-pod
      labels:
        app: violation-pod
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/violation.json
      containers:
      - name: test-container
        image: hashicorp/http-echo:1.0
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

Attempt to create the Pod in the cluster:

    kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/violation-pod.yaml

The Pod is created, but there is an issue. If you check the status of the Pod, you should see that it failed to start.

    kubectl get pod violation-pod

    NAME            READY   STATUS             RESTARTS   AGE
    violation-pod   0/1     CrashLoopBackOff   1          6s

As seen in the previous example, the http-echo process requires quite a few syscalls. Here seccomp has been instructed to error on any syscall by setting "defaultAction": "SCMP_ACT_ERRNO". This is extremely secure, but removes the ability to do anything meaningful. What you really want is to give workloads only the privileges they need.
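To see why the container cannot start, it can help to model how a profile of this shape assigns an action to a syscall. The sketch below is a simplified model rather than the kernel's or the runtime's implementation: it ignores the architectures list and argument filters, and action_for is a hypothetical helper name.

```python
import json


def action_for(profile: dict, syscall: str) -> str:
    """Return the action a profile assigns to a syscall name.

    Simplified model: the first rule naming the syscall decides,
    otherwise the profile's defaultAction applies.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


# The two single-line profiles used so far in this tutorial.
violation = json.loads('{"defaultAction": "SCMP_ACT_ERRNO"}')
audit = json.loads('{"defaultAction": "SCMP_ACT_LOG"}')

for name in ("execve", "read", "futex"):
    print(name, action_for(violation, name), action_for(audit, name))
```

Because violation.json has no syscalls rules, every lookup falls through to SCMP_ACT_ERRNO, while audit.json falls through to SCMP_ACT_LOG.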
Delete the Pod before moving to the next section:

    kubectl delete pod violation-pod --wait --now

Create a Pod with a seccomp profile that only allows necessary syscalls
If you take a look at the fine-grained.json profile, you will notice some of the syscalls seen in syslog of the first example, where the profile set "defaultAction": "SCMP_ACT_LOG". Now the profile is setting "defaultAction": "SCMP_ACT_ERRNO", but explicitly allowing a set of syscalls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container will run successfully and you will see no messages sent to syslog.
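A profile with this deny-by-default shape can be generated from a list of syscall names. The following is a minimal sketch: build_allowlist_profile is a hypothetical helper, the default architectures are copied from fine-grained.json, and the three names passed in are only an illustration, far too few for http-echo to run.

```python
import json


def build_allowlist_profile(
    names,
    architectures=("SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"),
):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [
            {
                # Deduplicate and sort so that repeated runs give stable output.
                "names": sorted(set(names)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative input only: a real list would come from auditing the workload.
profile = build_allowlist_profile(["read", "futex", "accept4", "futex"])
print(json.dumps(profile, indent=2))
```

Sorting and deduplicating the names keeps the generated file easy to diff when the allowlist changes.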
The manifest for this example is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: fine-pod
      labels:
        app: fine-pod
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/fine-grained.json
      containers:
      - name: test-container
        image: hashicorp/http-echo:1.0
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

Create the Pod in your cluster:

    kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/fine-pod.yaml
    kubectl get pod fine-pod

The Pod should be showing as having started successfully:

    NAME       READY   STATUS    RESTARTS   AGE
    fine-pod   1/1     Running   0          30s

Open up a new terminal window and use tail to monitor for log entries that mention calls from http-echo:

    # The log path on your computer might be different from "/var/log/syslog"
    tail -f /var/log/syslog | grep 'http-echo'

Next, expose the Pod with a NodePort Service:

    kubectl expose pod fine-pod --type NodePort --port 5678

Check what port the Service has been assigned on the node:

    kubectl get service fine-pod

The output is similar to:

    NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
    fine-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s

Use curl to access that endpoint from inside the kind control plane container:

    # Change 32373 to the port number you saw from "kubectl get service fine-pod"
    docker exec -it kind-control-plane curl localhost:32373

    just made some syscalls!

You should see no output in the syslog. This is because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but it required some effort in analyzing the program. It would be nice if there was a simple way to get closer to this security without requiring as much effort.
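Part of that analysis effort is checking whether an allowlist covers every syscall observed during auditing. The sketch below shows one way to compare the two; missing_syscalls is a hypothetical helper, and the trimmed profile and the observed list are illustrative stand-ins rather than the full fine-grained.json.

```python
def missing_syscalls(profile: dict, observed) -> list:
    """Return observed syscall names that the profile does not explicitly allow."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return sorted(set(observed) - allowed)


# Trimmed, illustrative profile: three names taken from fine-grained.json.
trimmed_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["accept4", "futex", "read"], "action": "SCMP_ACT_ALLOW"}
    ],
}

# Names corresponding to some of the syscalls seen in the audit example.
observed = ["read", "futex", "pselect6", "accept4", "getsockname"]

print(missing_syscalls(trimmed_profile, observed))
```

An empty result means every observed syscall is allowed. It does not prove that the workload never needs others, because auditing only records the code paths that actually ran.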
Delete the Service and the Pod before moving to the next section:

    kubectl delete service fine-pod --wait
    kubectl delete pod fine-pod --wait --now

Enable the use of RuntimeDefault as the default seccomp profile for all workloads
Kubernetes v1.27 [stable]
To use seccomp profile defaulting, you must run the kubelet with the --seccomp-default command line flag enabled for each node where you want to use it.
If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.
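The defaulting rules described in this tutorial can be summarized as a small decision function. This is a sketch of the documented behavior, not kubelet code: effective_seccomp_type is a hypothetical name, and the rule that a container-level setting takes precedence over the Pod-level one is general securityContext behavior that this tutorial does not spell out.

```python
from typing import Optional


def effective_seccomp_type(
    pod_type: Optional[str],
    container_type: Optional[str],
    seccomp_default: bool,
    privileged: bool = False,
) -> str:
    """Model which seccomp profile type ends up applied to a container."""
    if privileged:
        # Privileged containers always run as Unconfined.
        return "Unconfined"
    if container_type is not None:
        # Assumption: container-level securityContext overrides the Pod level.
        return container_type
    if pod_type is not None:
        return pod_type
    # Nothing specified: the kubelet's --seccomp-default setting decides.
    return "RuntimeDefault" if seccomp_default else "Unconfined"


print(effective_seccomp_type(None, None, seccomp_default=True))
print(effective_seccomp_type(None, None, seccomp_default=False))
print(effective_seccomp_type("Localhost", None, seccomp_default=True))
```

The first two calls show the only case that the --seccomp-default flag changes: a workload that specifies no profile at all.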
Note:
Enabling the feature will neither change the Kubernetes securityContext.seccompProfile API field nor add the deprecated annotations of the workload. This lets you roll back at any time without actually changing the workload configuration. Tools like crictl inspect can be used to verify which seccomp profile is being used by a container.
Some workloads may require fewer syscall restrictions than others. This means that they can fail during runtime even with the RuntimeDefault profile. To mitigate such a failure, you can:
- Run the workload explicitly as Unconfined.
- Disable the SeccompDefault feature for the nodes, and make sure that workloads get scheduled on nodes where the feature is disabled.
- Create a custom seccomp profile for the workload.
If you were introducing this feature into a production-like cluster, the Kubernetes project recommends that you enable this feature gate on a subset of your nodes and then test workload execution before rolling the change out cluster-wide.
You can find more detailed information about a possible upgrade and downgrade strategy in the related Kubernetes Enhancement Proposal (KEP): Enable seccomp by default.
Kubernetes 1.35 lets you configure the seccomp profile that applies when the spec for a Pod doesn't define a specific seccomp profile. However, you still need to enable this defaulting for each node where you would like to use it.
If you are running a Kubernetes 1.35 cluster and want to enable the feature, either run the kubelet with the --seccomp-default command line flag, or enable it through the kubelet configuration file. To enable the feature gate in kind, ensure that kind provides the minimum required Kubernetes version and enables the SeccompDefault feature in the kind configuration:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane
        image: kindest/node:v1.28.0@sha256:9f3ff58f19dcf1a0611d11e8ac989fdb30a28f40f236f59f0bea31fb956ccf5c
        kubeadmConfigPatches:
          - |
            kind: JoinConfiguration
            nodeRegistration:
              kubeletExtraArgs:
                seccomp-default: "true"
      - role: worker
        image: kindest/node:v1.28.0@sha256:9f3ff58f19dcf1a0611d11e8ac989fdb30a28f40f236f59f0bea31fb956ccf5c
        kubeadmConfigPatches:
          - |
            kind: JoinConfiguration
            nodeRegistration:
              kubeletExtraArgs:
                seccomp-default: "true"

Once the cluster is ready, run a Pod:

    kubectl run --rm -it --restart=Never --image=alpine alpine -- sh

The Pod should now have the default seccomp profile attached. This can be verified by using docker exec to run crictl inspect for the container on the kind worker:

    docker exec -it kind-worker bash -c \
        'crictl inspect $(crictl ps --name=alpine -q) | jq .info.runtimeSpec.linux.seccomp'

    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
      "syscalls": [
        {
          "names": ["..."]
        }
      ]
    }

What's next
You can learn more about Linux seccomp: