Fixing the Subpath Volume Vulnerability in Kubernetes
On March 12, 2018, the Kubernetes Product Security team disclosed
The vulnerability has been fixed and released in the latest Kubernetes patch releases. We recommend that all users upgrade to get the fix. For more details on the impact and how to get the fix, please see the disclosure announcement. This post presents a technical deep dive on the vulnerability and the solution.

To understand the vulnerability, one must first understand how volume and subpath mounting work in Kubernetes.

Before a container is started on a node, the kubelet volume manager locally mounts all the volumes specified in the PodSpec under a directory for that Pod on the host system. Once all the volumes are successfully mounted, it constructs the list of volume mounts to pass to the container runtime. Each volume mount contains information that the container runtime needs, the most relevant being:

When starting the container, the container runtime creates the path in the container root filesystem, if necessary, and then bind mounts it to the provided host path.

Subpath mounts are passed to the container runtime just like any other volume. The container runtime does not distinguish between a base volume and a subpath volume, and handles them the same way. Instead of passing the host path to the root of the volume, Kubernetes constructs the host path by appending the Pod-specified subpath (a relative path) to the base volume’s host path.

For example, here is a spec for a subpath volume mount:

In this example, when the Pod gets scheduled to a node, the system will:

The vulnerability with subpath volumes was discovered by Maxim Ivanov, by making a few observations:

The basic example below demonstrates the vulnerability. It takes advantage of the observations outlined above by:

For this example, the system will:

This is a manifestation of a
It should be noted that init containers are not always required for this exploit, depending on the volume type. One is used in the EmptyDir example because EmptyDir volumes cannot be shared with other Pods: they are only created when a Pod is created, and destroyed when the Pod is destroyed. For persistent volume types, this exploit can also be done across two different Pods sharing the same volume.

The underlying issue is that the host path for a subpath is untrusted and can point anywhere in the system. The fix needs to ensure that this host path is both:

The Kubernetes product security team went through many iterations of possible solutions before finally agreeing on a design.

Our first design was relatively simple. For each subpath mount in each container:

However, this design is prone to the classic time-of-check-to-time-of-use (TOCTTOU) problem.
We went a bit wild with this idea:

While this design does ensure that the symlinks cannot point outside of the volume, it was ultimately rejected due to difficulties of implementing the chroot mechanism in 4) across all the various distros and environments that Kubernetes has to support, including containerized kubelets.

Coming back to earth a little bit, our next idea was to:

In theory, this sounded pretty simple, but in reality, 2) was quite difficult to implement correctly. Many scenarios had to be handled where volumes (like EmptyDir) could be on a shared filesystem, on a separate filesystem, on the root filesystem, or not on the root filesystem. NFS volumes ended up handling all bind mounts as a separate mount, instead of as a child to the base volume. There was additional uncertainty about how out-of-tree volume types (that we couldn’t test) would behave.

Given the number of scenarios and corner cases that had to be handled with the previous design, we really wanted to find a solution that was more generic across all volume types. The final design that we ultimately went with was to:

Note that this solution is different for Windows hosts, where the mounting semantics are different from those on Linux. In Windows, the design is to:

Both solutions are able to address all the requirements of:

Special thanks to many folks involved with handling this vulnerability:

If you find a vulnerability in Kubernetes, please follow our responsible disclosure process and
-- Michelle Au, Software Engineer, Google; and Jan Šafránek, Software Engineer, Red Hat

Kubernetes Background
Host path of the volume: /var/lib/kubelet/pods/<pod uid>/volumes/<volume type>/<volume name>

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    <snip>
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: dataset1
  volumes:
  - name: my-volume
    emptyDir: {}
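
1. Set up an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
2. Construct the host path for the subpath mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + dataset1
3. Pass the following mount information to the container runtime:
   Container path: /mnt/data
   Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1
4. The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1 on the host.

The Vulnerability
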
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: prep-symlink
    image: "busybox"
    command: ["/bin/sh", "-ec", "ln -s / /mnt/data/symlink-door"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: symlink-door
  volumes:
  - name: my-volume
    emptyDir: {}
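
1. Set up an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
2. Pass the following mount information for the init container to the container runtime:
   Container path: /mnt/data
   Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
3. The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume on the host.
4. The init container creates a symlink inside the volume, /mnt/data/symlink-door -> /, and then exits.
5. The kubelet constructs the host path for the subpath volume mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + symlink-door.
6. It passes the following mount information to the container runtime:
   Container path: /mnt/data
   Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door
7. The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door
8. However, the bind mount resolves symlinks, so the mount actually lands on / on the host! Now the container can see all of the host’s filesystem through its mount point /mnt/data.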
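
The escape described above can be reproduced in miniature without a cluster. The following is only a sketch under stated assumptions: the helper names `naive_subpath_host_path` and `is_within` are hypothetical, temporary directories stand in for the host filesystem, and the symlink points at a sibling directory rather than `/`. It shows that appending a user-controlled subpath to the volume's host path, and then letting the filesystem resolve it, lands outside the volume.

```python
import os
import tempfile


def naive_subpath_host_path(volume_host_path: str, subpath: str) -> str:
    """Append the Pod-specified subpath to the base volume's host path."""
    return os.path.join(volume_host_path, subpath)


def is_within(base: str, path: str) -> bool:
    """Report whether path, after symlink resolution, stays inside base."""
    base_real = os.path.realpath(base)
    path_real = os.path.realpath(path)
    return os.path.commonpath([base_real, path_real]) == base_real


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for the host: an area outside the volume, and the volume.
    outside = os.path.join(root, "host-secrets")
    volume = os.path.join(root, "my-volume")
    os.mkdir(outside)
    os.mkdir(volume)
    with open(os.path.join(outside, "token"), "w") as f:
        f.write("s3cr3t")

    # What the init container does: plant a symlink inside the volume.
    os.symlink(outside, os.path.join(volume, "symlink-door"))
    # An ordinary subdirectory, for comparison.
    os.mkdir(os.path.join(volume, "dataset1"))

    host_path = naive_subpath_host_path(volume, "symlink-door")
    # Listing the "subpath" follows the symlink out of the volume.
    escaped_listing = sorted(os.listdir(host_path))
    door_contained = is_within(volume, host_path)
    dataset_contained = is_within(
        volume, naive_subpath_host_path(volume, "dataset1")
    )

print(escaped_listing)                     # ['token']
print(door_contained, dataset_contained)   # False True
```

The joined string always looks like a path inside the volume. Only resolving it reveals where it really points, which is why the host path cannot be trusted as given.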

The Fix
Idea 1
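
The time-of-check-to-time-of-use weakness can be illustrated with a small sketch. It assumes, for illustration only, a design that resolves symlinks, validates that the result lies inside the volume, and only afterwards hands the path on. The function `validate_subpath` is a hypothetical name, and the "attacker" step is simulated inline rather than run concurrently.

```python
import os
import tempfile


def validate_subpath(volume: str, subpath: str) -> str:
    """Time of check: resolve symlinks and require the result to stay in the volume."""
    base = os.path.realpath(volume)
    resolved = os.path.realpath(os.path.join(volume, subpath))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"subpath {subpath!r} escapes the volume")
    return resolved


with tempfile.TemporaryDirectory() as root:
    outside = os.path.join(root, "host-secrets")
    volume = os.path.join(root, "my-volume")
    os.mkdir(outside)
    os.makedirs(os.path.join(volume, "dataset1"))
    subpath_on_host = os.path.join(volume, "dataset1")

    # The check passes: dataset1 is an ordinary directory right now.
    checked_path = validate_subpath(volume, "dataset1")

    # Between check and use, a container sharing the volume swaps the
    # directory for a symlink that points outside the volume.
    os.rmdir(subpath_on_host)
    os.symlink(outside, subpath_on_host)

    # Time of use: whoever consumes checked_path now follows the symlink.
    escaped = os.path.realpath(checked_path) == os.path.realpath(outside)

    # Re-running the check would reject the path, but it ran too early.
    try:
        validate_subpath(volume, "dataset1")
        recheck_rejected = False
    except ValueError:
        recheck_rejected = True

print(escaped, recheck_rejected)  # True True
```

The validated value is still just a path name, so it is re-resolved whenever it is used. Any design that checks a name and later acts on the same name leaves this window open.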
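
Idea 2

1. Create a working directory, dir1.
2. Bind mount the base volume under the working directory, at dir1/volume.
3. Chroot to the working directory dir1.
4. Inside the chroot, bind mount volume/subpath to subpath. This ensures that any symlinks get resolved to inside the chroot environment.
5. Exit the chroot.
6. Pass the bind-mounted dir1/subpath to the container runtime.

Idea 3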

The Solution

1. Open the subpath directory with the openat() syscall, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
2. Bind mount /proc/<kubelet pid>/fd/<final fd> to a working directory under the kubelet’s pod directory. The proc file is a link to the opened file. If that file gets replaced while kubelet still has it open, then the link will still point to the original file.
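
The segment-by-segment open can be sketched in a few lines. This is not the kubelet implementation (which is written in Go), only an illustration of the technique on a POSIX system. The function name `open_subpath_no_symlinks` is hypothetical, and rejecting `..` outright is a simplifying assumption that stands in for the per-segment "still inside the base volume" validation. Python's `os.open` with `dir_fd` maps to `openat()`, and `O_NOFOLLOW` makes the call fail if the segment is a symlink.

```python
import os
import tempfile


def open_subpath_no_symlinks(volume: str, subpath: str) -> int:
    """Open volume/subpath one segment at a time, refusing symlinks.

    Returns an open directory file descriptor that the caller must close.
    """
    flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    current_fd = os.open(volume, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for segment in subpath.split("/"):
            if segment in ("", "."):
                continue
            if segment == "..":
                raise ValueError("subpath must not contain '..'")
            # openat(): resolve the segment relative to the directory fd
            # already held, so a swapped parent cannot redirect the lookup.
            next_fd = os.open(segment, flags, dir_fd=current_fd)
            os.close(current_fd)
            current_fd = next_fd
        return current_fd
    except BaseException:
        os.close(current_fd)
        raise


with tempfile.TemporaryDirectory() as root:
    outside = os.path.join(root, "host-secrets")
    volume = os.path.join(root, "my-volume")
    os.mkdir(outside)
    os.makedirs(os.path.join(volume, "dataset1", "nested"))
    os.symlink(outside, os.path.join(volume, "symlink-door"))

    # A plain nested directory opens fine.
    fd = open_subpath_no_symlinks(volume, "dataset1/nested")
    opened_ok = os.path.samestat(
        os.fstat(fd), os.stat(os.path.join(volume, "dataset1", "nested"))
    )
    os.close(fd)

    # A symlinked segment is refused instead of followed.
    try:
        os.close(open_subpath_no_symlinks(volume, "symlink-door"))
        symlink_rejected = False
    except OSError:
        symlink_rejected = True

print(opened_ok, symlink_rejected)
```

Because the result is a file descriptor rather than a path name, later changes to the directory tree cannot change what it refers to. On Linux that descriptor is what the second step addresses as /proc/<kubelet pid>/fd/<final fd>.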
Acknowledgements