# Running Kubernetes Node Components as a Non-root User

FEATURE STATE: Kubernetes v1.22 [alpha]

This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI, and CNI without root privileges, by using a user namespace. This technique is also known as rootless mode.

> Note: This document describes how to run Kubernetes Node components (and hence pods) as a non-root user. If you are just looking for how to run a pod as a non-root user, see SecurityContext.

## Before you begin

Your Kubernetes server must be at or later than version 1.22. To check the version, enter `kubectl version`.
- Ensure that your unprivileged user is listed in `/etc/subuid` and `/etc/subgid`
- Enable the `KubeletInUserNamespace` feature gate
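The subordinate ID requirement above can be sanity-checked with a short script. This is only a sketch: the helper name, the sample entry, and the 65536-ID minimum are illustrative assumptions, not anything mandated by Kubernetes.

```shell
#!/bin/sh
# Sketch: check that an /etc/subuid- or /etc/subgid-style entry
# ("user:start:count") allocates a reasonably large ID range.
# The 65536 minimum is a common convention, not a hard requirement.
check_subid_range() {
  count=$(printf '%s' "$1" | cut -d: -f3)
  if [ "$count" -ge 65536 ]; then
    echo "ok: $count subordinate IDs"
  else
    echo "too small: $count subordinate IDs"
  fi
}

# Example with a typical entry (hypothetical user "k8suser"):
check_subid_range "k8suser:100000:65536"
```

In practice you would feed it the output of `grep "^$USER:" /etc/subuid /etc/subgid`.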
## Running Kubernetes inside Rootless Docker/Podman

### kind

kind supports running Kubernetes inside Rootless Docker. See the kind documentation for details.

### minikube

minikube also supports running Kubernetes inside Rootless Docker. Rootless Podman is not supported. See the minikube documentation for details.

## Running Kubernetes inside Unprivileged Containers

### sysbox

Sysbox supports running Kubernetes inside unprivileged containers without requiring Cgroup v2 and without the `KubeletInUserNamespace` feature gate. It does this by exposing specially crafted `/proc` and `/sys` filesystems inside the container, plus several other advanced OS virtualization techniques.

## Running Rootless Kubernetes directly on a host

### K3s

K3s experimentally supports rootless mode. See the K3s documentation for usage.

### Usernetes

Usernetes supports both containerd and CRI-O as CRI runtimes, and supports multi-node clusters using Flannel (VXLAN). See the Usernetes documentation for usage.

## Manually deploy a node that runs the kubelet in a user namespace

This section provides hints for running Kubernetes in a user namespace manually.

### Creating a user namespace

The first step is to create a user namespace. If you are trying to run Kubernetes in a user-namespaced container such as Rootless Docker/Podman or LXC/LXD, you are all set, and you can go to the next subsection. Otherwise you have to create a user namespace yourself, by calling `unshare(2)` with `CLONE_NEWUSER`, or by using a command line tool that wraps it, such as `unshare(1)`.

After unsharing the user namespace, you will also have to unshare other namespaces, such as the mount namespace. You do not need to call `chroot()` nor `pivot_root()` after unsharing the mount namespace; however, you have to mount writable filesystems on several directories in the namespace.

At least, the following directories need to be writable in the namespace (not outside the namespace):

- `/etc`
- `/run`
- `/var/logs`
- `/var/lib/kubelet`
- `/var/lib/cni`
- `/var/lib/containerd` (for containerd)
- `/var/lib/containers` (for CRI-O)

### Creating a delegated cgroup tree

In addition to the user namespace, you also need a writable cgroup tree with cgroup v2. If you are trying to run Kubernetes in Rootless Docker/Podman or LXC/LXD on a systemd-based host, you are all set. Otherwise you have to create a systemd unit with the `Delegate=yes` property to delegate a cgroup tree with writable permission. On your node, systemd must already be configured to allow delegation; for more details, see the systemd documentation on cgroup delegation.

### Configuring network

The network namespace of the Node components has to have a non-loopback interface, which can for example be configured with slirp4netns, VPNKit, or lxc-user-nic(1).

The network namespaces of the Pods can be configured with regular CNI plugins. For multi-node networking, Flannel (VXLAN, 8472/UDP) is known to work; some other CNI plugins may not work in a user namespace.

Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from the Node network namespace to the host with an external port forwarder, such as RootlessKit, slirp4netns, or socat(1). You can use the port forwarder from K3s; see the K3s rootless mode documentation for more details. The implementation can be found in the K3s rootlessports package.

### Configuring CRI

The kubelet relies on a container runtime. You should deploy a container runtime such as containerd or CRI-O and ensure that it is running within the user namespace before the kubelet starts.

#### containerd

Running the CRI plugin of containerd in a user namespace is supported since containerd 1.4. Running containerd within a user namespace requires the following configurations:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Disable AppArmor
  disable_apparmor = true
  # Ignore an error during setting oom_score_adj
  restrict_oom_score_adj = true
  # Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
  disable_hugetlb_controller = true

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  snapshotter = "fuse-overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  # We use cgroupfs that is delegated by systemd, so we do not use the SystemdCgroup driver
  # (unless you run another systemd in the namespace)
  SystemdCgroup = false
```

The default path of the configuration file is `/etc/containerd/config.toml`. The path can be specified with `containerd -c /path/to/containerd/config.toml`.

#### CRI-O

Running CRI-O in a user namespace is supported since CRI-O 1.22. CRI-O requires the environment variable `_CRIO_ROOTLESS=1` to be set. The following configurations are also recommended:

```toml
[crio]
  storage_driver = "overlay"
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]

[crio.runtime]
  # We use cgroupfs that is delegated by systemd, so we do not use the "systemd" driver
  # (unless you run another systemd in the namespace)
  cgroup_manager = "cgroupfs"
```

The default path of the configuration file is `/etc/crio/crio.conf`. The path can be specified with `crio --config /path/to/crio/crio.conf`.

### Configuring kubelet

Running the kubelet in a user namespace requires the following configuration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use the "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"
```

When the `KubeletInUserNamespace` feature gate is enabled, the kubelet ignores errors that may happen while setting the following sysctl values on the node:

- `vm.overcommit_memory`
- `vm.panic_on_oom`
- `kernel.panic`
- `kernel.panic_on_oops`
- `kernel.keys.root_maxkeys`
- `kernel.keys.root_maxbytes`

Within a user namespace, the kubelet also ignores any error raised from trying to open `/dev/kmsg`. This feature gate also allows kube-proxy to ignore an error during setting `RLIMIT_NOFILE`.

The `KubeletInUserNamespace` feature gate was introduced in Kubernetes v1.22 with "alpha" status. Running the kubelet in a user namespace without using this feature gate is also possible, by mounting a specially crafted proc filesystem (as done by some existing rootless Kubernetes distributions), but this is not officially supported.

### Configuring kube-proxy

Running kube-proxy in a user namespace requires the following configuration:
```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
  # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
```
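The zeroed conntrack values above follow the same "best effort" pattern that the `KubeletInUserNamespace` feature gate applies to sysctls: attempt a privileged write and tolerate the permission error that an unprivileged user namespace produces, rather than failing hard. A minimal sketch of that pattern (the function name and echoed messages are illustrative, not part of Kubernetes):

```shell
#!/bin/sh
# Sketch of the best-effort sysctl pattern: write the value if permitted,
# otherwise skip it instead of failing, since an unprivileged user
# namespace cannot set most sysctls.
write_sysctl_best_effort() {
  key="$1"; value="$2"
  # Map a dotted sysctl key to its /proc/sys path, e.g.
  # net.netfilter.nf_conntrack_max -> /proc/sys/net/netfilter/nf_conntrack_max
  path="/proc/sys/$(printf '%s' "$key" | tr '.' '/')"
  if printf '%s' "$value" > "$path" 2>/dev/null; then
    echo "set $key=$value"
  else
    echo "skipped $key (not permitted here)"
  fi
}

write_sysctl_best_effort net.netfilter.nf_conntrack_max 131072
```

Whether the write succeeds or is skipped depends on the privileges of the namespace the script runs in; the point is that either outcome is acceptable.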
## Caveats

Most "non-local" volume drivers, such as `nfs` and `iscsi`, do not work. Local volumes like `local`, `hostPath`, `emptyDir`, `configMap`, `secret`, and `downwardAPI` are known to work.