Basic knowledge of Kubernetes vocabulary is assumed in this article. You can find a quick introduction to Kubernetes at the beginning of my previous article about a tool, kdigger, along with a link to the Kubernetes glossary.
Introduction
One of the ideas behind Kubernetes is to abstract the underlying infrastructure running the workloads. As such, ideally, the Pods should not rely on their host’s filesystem or configuration. They should run the same, regardless of their scheduling inside of the cluster.
The volume abstraction was introduced to Kubernetes, first, to solve the data persistence problem: as Pods can be scheduled anywhere, volumes need to follow them. Second, it addresses the data-sharing problem for data manipulated by multiple containers. The various types of volumes can be explored in the storage section of the Kubernetes concepts documentation.
The hostPath type is one of those, and it is worth noticing that its documentation opens with a warning: “HostPath volumes present many security risks […]”. Indeed, hostPath is described as follows in the documentation:
A hostPath volume mounts a file or directory from the host node’s filesystem into your Pod. This is not something that most Pods will need, but it offers a powerful escape hatch for some applications.
As you see, the reality is not ideal, and some specific workloads need direct access to some resources of their nodes. That’s the reason why the hostPath volume type was introduced to Kubernetes.
Thus, multiple warnings explain that it presents security risks and should be avoided as much as possible. Indeed, by nature, hostPath breaks part of the isolation promised by containers: filesystem isolation. Concretely, and as specified in the documentation, a badly configured hostPath could expose privileged system credentials, such as those of the kubelet, and enable privilege escalation within the cluster.
Without going further into the potential configuration risks, we are going to investigate how this feature suffered from vulnerabilities that led to container breakout possibilities for Kubernetes users in the past.
Context
To give a little bit of context, we started investigating the kubelet CVE-2021-25741 vulnerability a month after its public disclosure. It was revealed in issue #104980 on the Kubernetes GitHub repository and we began by gathering information and links. We shared a link to a post on the Kubernetes blog that seemed to explain the issue very well (Fixing the Subpath Volume Vulnerability in Kubernetes), and then noticed that this post was three years older than the vulnerability we were investigating. After the initial surprise, we started to understand the situation.
The kubelet CVE-2021-25741 is related to two other vulnerabilities: one from late 2017, CVE-2017-1002101, also related to the kubelet, and one from late 2020, CVE-2021-30465, related to runc. Here is how they relate: CVE-2021-25741 was discovered because of CVE-2021-30465 and results from the “incomplete” patch for CVE-2017-1002101.
I mean,
- CVE-2017-1002101 - subpath volume mount handling allows arbitrary file access in host filesystem
- CVE-2021-25741: …user may be able to create a container with subpath volume mounts to access files & directories outside of the volume, including on the host filesystem. https://t.co/0VJWRL8Xsk
Yawar Amin.cad (@yawaramin), October 3, 2021
CVEs
kubelet CVE-2017-1002101
In 2018, one of the most severe Kubernetes vulnerabilities was disclosed: CVE-2017-1002101. It concerns the kubelet, the agent that runs on every node of the cluster and communicates with the API. With a symlink race, it is possible to reliably mount an arbitrary hostPath into a Pod’s container without having the actual authorization. Indeed, hostPath volumes could be authorized at the path level via the Pod Security Policy, which has been deprecated since Kubernetes v1.21, or its replacement, the new Pod Security admission, in beta since v1.23. More specifically, the subPath feature of volume mounts is the entry point: the user controls its value, which the kubelet reads and mounts during the container creation process. This vulnerability leverages the fact that the kubelet runs as root on the host and that symbolic links are interpreted relative to whoever reads them.
Note that there are initiatives to run the core components of Kubernetes as unprivileged users. This is known as rootless Kubernetes: it is still difficult to do but could potentially de facto eliminate this class of issues.
Here is a simplified example, by Brad Geesaman, mounting the host filesystem at /rootfs/host. You can find other, more complete examples in his POC repository.
apiVersion: v1
kind: Pod
metadata:
  name: subpath
spec:
  containers:
  - image: nginx:latest
    name: setup
    imagePullPolicy: "Always"
    command: ["/bin/bash"]
    args: ["-c", "cd /rootfs && rm -rf hostetc && ln -s / /rootfs/host && touch /status/done && sleep infinity"]
    volumeMounts:
    - mountPath: /rootfs
      name: escape-volume
    - mountPath: /status
      name: status-volume
  - image: nginx:latest
    name: exploit
    imagePullPolicy: "Always"
    command: ["/bin/bash"]
    args: ["-c", "if [[ -f /status/done ]];then sleep infinity; else sleep 1; fi"]
    volumeMounts:
    - mountPath: /rootfs
      name: escape-volume
      subPath: host
    - mountPath: /status
      name: status-volume
  volumes:
  - name: escape-volume
    emptyDir: {}
  - name: status-volume
    emptyDir: {}
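To make the role of the symlink in the manifest above more concrete, here is a minimal Python sketch of what a host-side process sees when it resolves such a subPath by following symlinks. It is only an illustration under stated assumptions, not the kubelet's code: the helper names (`resolve_from_host`, `is_inside`, `demo`) are hypothetical, and a temporary directory stands in for the emptyDir volume.

```python
import os
import tempfile


def resolve_from_host(volume_dir: str, sub_path: str) -> str:
    """Resolve volume_dir/sub_path as a host-side process would, following symlinks."""
    return os.path.realpath(os.path.join(volume_dir, sub_path))


def is_inside(base_dir: str, resolved_path: str) -> bool:
    """Return True if the already-resolved path is base_dir itself or lies below it."""
    base = os.path.realpath(base_dir)
    return os.path.commonpath([base, resolved_path]) == base


def demo() -> dict:
    # The temporary directory is a stand-in for the emptyDir volume.
    with tempfile.TemporaryDirectory() as volume_dir:
        # Same trick as the "setup" container: ln -s / /rootfs/host
        os.symlink("/", os.path.join(volume_dir, "host"))
        os.mkdir(os.path.join(volume_dir, "data"))
        escaped = resolve_from_host(volume_dir, "host")
        regular = resolve_from_host(volume_dir, "data")
        return {
            "host": (escaped, is_inside(volume_dir, escaped)),
            "data_inside": is_inside(volume_dir, regular),
        }


print(demo())
```

Inside the container, the `host` symlink would point to the container's own root. Resolved from the host, the very same link designates the node's root filesystem, which is why a root process that follows it ends up outside the volume.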
For more explanations on this vulnerability, Twistlock wrote a comprehensive deep dive on the issue. Also, a recap post was published on the Kubernetes blog about this vulnerability and the process the community went through to fix the issue.
The solution to this issue is smart, but quite common, as we will see later. To summarize, the idea is to use file descriptors so that the file, for which all symlinks were resolved and the validation was done, cannot be replaced afterwards. The kubelet resolves all the symlinks in the subPath, making sure that every path is within the base volume by using the openat() syscall and disallowing symlinks. It then bind mounts /proc/<kubelet pid>/fd/<final fd> to a working directory under the kubelet’s Pod directory, and finally closes the file descriptor and passes the bind mount to the container runtime.
As said, using file descriptors to avoid race conditions that can occur when using pathnames is a common solution. You can find detailed information about it in the book The Art of Software Security Assessment, published in 2006. Other pathname race conditions have been fixed with this solution in the past.
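The file descriptor approach can be sketched in a few lines of Python. This is a simplified illustration, not the kubelet's Go implementation: the function name `open_subpath_no_symlinks` is hypothetical, it only handles directories, and it simply rejects `..` components. It relies on `os.open()` with `dir_fd`, which maps to openat(), and on the `O_NOFOLLOW` flag.

```python
import os
import tempfile


def open_subpath_no_symlinks(base_dir: str, sub_path: str) -> int:
    """Walk sub_path below base_dir one component at a time, refusing symlinks.

    Returns a file descriptor pinned to the final directory. The caller owns
    the descriptor and must close it.
    """
    fd = os.open(base_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for component in sub_path.split("/"):
            if component in ("", "."):
                continue
            if component == "..":
                raise ValueError("'..' is not allowed in a subpath")
            # openat() relative to the parent descriptor; with O_NOFOLLOW the
            # call fails with an OSError if the component is a symlink.
            next_fd = os.open(
                component,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=fd,
            )
            os.close(fd)
            fd = next_fd
    except BaseException:
        os.close(fd)
        raise
    return fd


with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "a", "b"))
    os.symlink("/", os.path.join(base, "host"))

    pinned = open_subpath_no_symlinks(base, "a/b")
    # On Linux, this path keeps designating the validated directory even if
    # base/a/b is later replaced by something else.
    print(f"/proc/{os.getpid()}/fd/{pinned}")
    os.close(pinned)

    try:
        open_subpath_no_symlinks(base, "host")
    except OSError as err:
        print("symlink rejected:", err.strerror)
```

The point of the design is that, once the walk is done, nothing refers to the directory by name anymore: the descriptor designates the inode that was validated, so swapping a path component afterwards has no effect on what gets mounted.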
runc CVE-2021-30465
Three years after the 2018 vulnerability, in late 2020, Etienne Champetier discovered a new vulnerability in runc that is also a symlink race, exploiting a TOCTOU (time-of-check to time-of-use) flaw. It is worth noting that this vulnerability was discovered thanks to a comment in the runc source code. Etienne released a great blog article about the vulnerability and how it can be exploited in the context of Kubernetes. Indeed, kubelet daemons use container runtimes that, in most cases, rely on runc. If you want more information, Kubernetes SIG Security also took an interest: its members presented how they built a POC in a panel at KubeCon North America 2021 and released their HackMD notes.
This vulnerability results in the same consequences as the previous one but is somewhat more sophisticated to exploit because it involves a race condition. The idea is that runc calls a function named securejoin.SecureJoin() to resolve the destination and target and then calls mount(), so trouble can happen in between. As Kubernetes users, we have full control over the target of the mount, so we can manipulate the value between the two statements to force the mount to follow a chosen symlink.
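The check-then-use pattern described above can be simulated deterministically. The following Python sketch is not runc's code and does not reimplement securejoin.SecureJoin(): the names `check_then_use` and `demo` are hypothetical, the `attacker` callback is a stand-in for a concurrent process winning the race, and the final `os.path.realpath()` call stands in for a mount() that walks the path again.

```python
import os
import tempfile


def check_then_use(rootfs: str, dest: str, attacker=lambda: None) -> str:
    """Validate rootfs/dest, then reuse the same path string later."""
    root = os.path.realpath(rootfs)
    target = os.path.join(root, dest)

    # Time of check: the resolved path must stay below the rootfs.
    checked = os.path.realpath(target)
    if os.path.commonpath([root, checked]) != root:
        raise PermissionError(f"{dest!r} escapes the rootfs")

    # Window of the race: simulated here by an explicit callback.
    attacker()

    # Time of use: the path string is walked again, symlinks included.
    return os.path.realpath(target)


def demo() -> bool:
    with tempfile.TemporaryDirectory() as rootfs, tempfile.TemporaryDirectory() as outside:
        mount_point = os.path.join(rootfs, "mnt")
        os.mkdir(mount_point)

        def swap_directory_for_symlink() -> None:
            os.rmdir(mount_point)
            os.symlink(outside, mount_point)

        used = check_then_use(rootfs, "mnt", attacker=swap_directory_for_symlink)
        # True: the path that passed validation now leads outside the rootfs.
        return used == os.path.realpath(outside)


print(demo())
```

In a real exploit the swap is not a callback but a loop running in another container, which is why the attack needs several attempts to land inside the window.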
The patch is similar to the previous kubelet vulnerability patch and consists in using file descriptors of the process as a “stop-gap” after the verifications have been made on the path.
kubelet CVE-2021-25741
The runc CVE-2021-30465 piqued the curiosity of Google engineers, who discovered almost the same vulnerability directly in the kubelet binary. The CVE ID is CVE-2021-25741 and it was disclosed in late 2021. As with the two previous vulnerabilities, there is a window of time between the resolution of the path and the actual mount that made an exploit possible.
What is interesting in this vulnerability is that it was made possible by what happened to be the “incomplete” patch of the first kubelet CVE-2017-1002101. Indeed, efforts were made to resolve all symlinks and lock the resolved file or folder by using file descriptors, but the kubelet uses the mount(8) utility from util-linux instead of the mount(2) syscall. Unfortunately, by default, mount(8) follows symlinks again, whereas mount(2) does not. Etienne Champetier summarized that in a comment in the issue thread.
The patch for this issue is pretty straightforward. It only consists in passing the --no-canonicalize flag to the mount(8) utility when calling it. This flag was introduced in late 2009 and was already used as a patch for a vulnerability in FUSE, CVE-2011-0543.
This vulnerability was disclosed in September 2021 and Google released a post on their security blog in December. Recently, in January 2022, Arkadiy Litvinenko released a POC on his GitHub that tries to exploit the TOCTOU to mount the host filesystem into a Pod.
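As a rough illustration of what the patched invocation looks like, the sketch below only builds the argument vector for such a bind mount and does not run it, since mounting requires root. The helper name `bind_mount_argv` and the example values are hypothetical, and the kubelet's real code is written in Go.

```python
import os
from typing import List, Optional


def bind_mount_argv(source_fd: int, target: str, pid: Optional[int] = None) -> List[str]:
    """Build a mount(8) command line that bind mounts a pinned file descriptor.

    --no-canonicalize asks mount(8) not to resolve the paths again, so the
    /proc/<pid>/fd/<fd> source is passed through as given.
    """
    pid = os.getpid() if pid is None else pid
    source = f"/proc/{pid}/fd/{source_fd}"
    return ["mount", "--no-canonicalize", "-o", "bind", source, target]


# Example values only: fd 7 of process 42, and an arbitrary target directory.
print(" ".join(bind_mount_argv(7, "/tmp/target", pid=42)))
```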
Conclusion
The Kubernetes documentation was somewhat right: “HostPath volumes present many security risks […]”. To be completely honest, the risks mentioned are related to the user configuration of this feature, not really its vulnerabilities.
Containers, by default, have their own filesystem and are trapped inside of it via different mechanisms. The volume feature of Kubernetes allows attaching diverse filesystems to the container’s existing one. But allowing the user to feed arbitrary filesystem constructions, especially with symlinks, into the arguments of mount led to various vulnerabilities that are difficult to fix correctly. This group of three vulnerabilities shows how imperfect a patch can be and how discoveries can lead to other findings in similar or linked software.
To conclude, here is a summary timeline of the main events and resources related to these three vulnerabilities.
DATE | CVE | DESCRIPTION |
---|---|---|
March 2018 | 2017-1002101 | CVE-2017-1002101 is disclosed on GitHub in issue #60813. |
March 2018 | 2017-1002101 | A deep dive article by Twistlock on the most severe Kubernetes vulnerabilities to date and POCs by Brad Geesaman are published. |
April 2018 | 2017-1002101 | Fixing the Subpath Volume Vulnerability in Kubernetes post on Kubernetes blog is published. |
December 2018 | 2017-1002101 | How Symlinks Pwned Kubernetes (And How We Fixed It) by Michelle Au and Jan Šafránek is presented at KubeCon NA 2018. |
December 2020 | 2021-30465 | CVE-2021-30465 is discovered by Etienne Champetier in runc. |
March 2021 | 2021-30465 | The vulnerability is disclosed on GitHub and OSS-Security; Etienne Champetier publishes a write-up and a POC. |
September 2021 | 2021-25741 | CVE-2021-25741 is disclosed on GitHub in issue #104980, reported by Google engineers Fabricio Voznika and Mark Wolters. |
October 2021 | 2021-30465 | SIG Security members Ian Coldwater, Brad Geesaman, Rory McCune, and Duffie Cooley publish and present their POC at KubeCon NA 2021. |
December 2021 | 2021-25741 | Google publishes Exploring Container Security: A Storage Vulnerability Deep Dive on its security blog. |
January 2022 | 2021-25741 | Arkadiy Litvinenko publishes a POC. |
Bonus: mount + symbolic link = evil
While investigating CVE-2021-25741 and learning about the previous issues, we came across this structure of symlinks and bind mounts. It piqued our curiosity because it looked like a brainteaser. Once simplified by resolving the symlinks, it is much easier to figure out what is going on, but it illustrates a little bit how symlinks can create a mount can of worms.
Question
See the following diagram: the grey boxes are directories and the yellow ones are symlinks.
Figure: a filesystem structure that starts with a base folder containing two folders, dir1 and dir2. dir1 contains a symlink named loop that points to the base folder, and dir2 contains a symlink named root that points to the root of the filesystem.
What happens if you run the following mount(8) command when you are in the base folder?
mount -o bind base/dir2 base/dir1/loop
Note that when you bind mount with mount -o bind src dest, you mount src on dest. If you want to experiment, please do so in a separate virtual machine. You can recreate the filesystem structure with the following command:
mkdir -p base/dir1 base/dir2 && ln -s ../../base base/dir1/loop && ln -s / base/dir2/root
Solution
You could resolve the symlinks by eye, and the command would become:
mount -o bind base/dir2 base
As a result, dir2 is bind mounted on base, covering everything under base in the filesystem and hiding the original dir2. It is completely valid to mount a child onto its parent.
Figure: the result of the bind mount command, a filesystem with the base folder containing the symlink named root that points to the root of the filesystem.
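The "by eye" resolution can also be checked without root privileges, and without mounting anything, by recreating the structure in a temporary directory and asking for the canonical paths. This is a small sketch under the assumption that the canonical paths are what the mount(8) command above ends up using; the helper names `build_puzzle` and `resolve_mount_args` are hypothetical.

```python
import os
import tempfile


def build_puzzle(parent: str) -> str:
    """Recreate the base/dir1/loop and base/dir2/root structure under parent."""
    base = os.path.join(parent, "base")
    os.makedirs(os.path.join(base, "dir1"))
    os.makedirs(os.path.join(base, "dir2"))
    # Same links as: ln -s ../../base base/dir1/loop && ln -s / base/dir2/root
    os.symlink("../../base", os.path.join(base, "dir1", "loop"))
    os.symlink("/", os.path.join(base, "dir2", "root"))
    return base


def resolve_mount_args(base: str):
    """Return the canonical source and destination of the bind mount."""
    source = os.path.realpath(os.path.join(base, "dir2"))
    destination = os.path.realpath(os.path.join(base, "dir1", "loop"))
    return source, destination


with tempfile.TemporaryDirectory() as parent:
    base = build_puzzle(parent)
    source, destination = resolve_mount_args(base)
    print("source:     ", source)
    print("destination:", destination)
    # The destination collapses to base itself, as in the simplified command.
    print(destination == os.path.realpath(base))
```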
Article Link: Kubernetes and HostPath, a Love-Hate Relationship - Quarkslab's blog