Writing a Container
I use containers on the job daily, but to be honest, don’t have much
idea other than they’re vaguely chroot() +
cgroupsv2 + seccomp. So, to rectify that,
let’s build our own containers.
Dependencies
First off, some dependencies. Since we want to run binaries inside of our chroot, and our chroot is going to be far away from root, we have two options:
- compile with vanilla
gccand then copy our shared libraries to our chroot. - Statically compile our dependencies.
I went with 2. But that leaves the compiler itself. Luckily, there’s
a project called crosstool-NG that allows you
to generate cross-compilers pretty easily. My target is
x86_64-unknown-linux-gnu, so I just needed a build of
x86_64-unknown-linux-musl. After adding the compiler to
your $PATH, you’re also ready to statically compile some
code.
A Shell
To get interactivity in our chroot, we’ll need a shell.
I chose dash because it’s easier to compile. Clone it, and
let’s compile it with our cross compiler.
git clone https://git.kernel.org/pub/scm/utils/dash/dash.git/ --depth=1In bash or zsh, set your compiler, and build. Dash will be built in
src/dash. Copy it for later.
cd dash
export PATH="$HOME/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
export CC="x86_64-unknown-linux-musl-gcc"
export LD="x86_64-unknown-linux-musl-ld"
export CFLAGS="-Os"
export LDFLAGS="--static"
makeA Userspace
For a userspace, I decided to use toybox.
Same drill, compile it statically.
git clone https://github.com/landley/toybox.git --depth=1cd toybox
export PATH="$HOME/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
export CROSS_COMPILE=x86_64-unknown-linux-musl-
export LDFLAGS="--static"
make distclean
make defconfig
make toybox CROSS_COMPILE=$CROSS_COMPILE
maketoybox will be in the root directory. Keep it for
later.
An operating system?
Now we’ll need the directory structure of an operating system.
Luckily, toybox also has us covered. There’s a script called
mkroot/mkroot.sh which when run, will create our root
filesystem.
$ mkroot/mkroot.shRun it, and copy the root dir to where you want your OS
to be. Now, copy over dash and toybox into
root/bin, and you have a userspace and a shell. What more
could you want?
Chrooting to our OS
If you cd into root and then run
bin/toybox ls, you actually have an OS inside your OS. Of
course, it’s pretty useless since you can cd out. This is
basically what chroot does for us – it puts the root dir
wherever you specify, and that’s it.
With that out of the way, let’s write a program that uses
chroot.
#define _GNU_SOURCE
#include <errno.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
/*
* Stage 0: Minimal chroot launcher.
*
* This version demonstrates the tiniest "container": it simply chroots into a
* user-specified directory and execs a command. You must run it as root and
* it is trivial to escape by pointing -r at "/" or any other host path
and escaping.
*/
static void die(const char *fmt, ...) {
va_list ap;
va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
if (!*fmt || fmt[strlen(fmt) - 1] != '\n')
fputc('\n', stderr);
exit(EXIT_FAILURE);
}
static void usage(const char *prog) {
fprintf(stderr, "Usage: %s -r <rootfs> -- <cmd> [args...]\n", prog);
exit(2);
}
int main(int argc, char **argv) {
const char *root = NULL;
/* Parse: expect -r <root> followed by -- and the command */
for (int i = 1; i < argc; i++) {
if (!strcmp(argv[i], "-r") && i + 1 < argc) {
root = argv[++i];
} else if (!strcmp(argv[i], "--")) {
argv += i + 1;
argc -= i + 1;
break;
} else if (argv[i][0] == '-') {
usage(argv[0]);
} else {
argv += i;
argc -= i;
break;
}
}
if (!root || argc < 1)
usage(argv[0]);
if (chroot(root) == -1)
die("chroot(%s): %s", root, strerror(errno));
if (chdir("/") == -1)
die("chdir(/): %s", strerror(errno));
execvp(argv[0], argv);
die("execvp('%s') failed: %s", argv[0], strerror(errno));
}Compile this and we have to run it as root. Point it to our little filesystem and we can drop into the shell.
gcc container.c -o container -lseccompNote that we have to use sudo here otherwise we can’t
get into the chroot.
sudo ./container -r ./root -- /bin/dashWe’re actually not able to do too much to just leave:
# pwd
/
# cd /root/takashi
/bin/dash: 5: cd: can't cd to /root/takashiBut remember, we’re root. We can escape using proc.
# toybox mkdir /proc
# toybox mount -t proc proc /proc
# toybox chroot /proc/1/root /bin/sh
$ whoami
rootAnd you’re in a root shell as root. Freedom, so very easily.
Some more security
We’ll want to prevent running as root. And also escaping so easily.
This time around, we’ll make the container runnable without root, mount
/proc privately, clean up the environment, and change our
user to nobody. We can still run any syscall we want though.
static void make_private_mounts(void) {
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
die("mount(/, MS_PRIVATE): %s", strerror(errno));
}
static void bind_mount_self(const char *path) {
if (mount(path, path, NULL, MS_BIND | MS_REC, NULL) == -1)
die("bind-mount %s: %s", path, strerror(errno));
}
static void close_all_fds_except012(void) {
int dir = open("/proc/self/fd", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
if (dir < 0) {
for (int fd = 3; fd < 1024; ++fd)
close(fd);
return;
}
char name[PATH_MAX];
for (;;) {
ssize_t n = syscall(SYS_getdents64, dir, name, sizeof(name));
if (n <= 0)
break;
for (ssize_t i = 0; i < n;) {
struct linux_dirent64 {
ino64_t d_ino;
off64_t d_off;
unsigned short d_reclen;
unsigned char d_type;
char d_name[];
};
struct linux_dirent64 *d = (struct linux_dirent64 *)(name + i);
i += d->d_reclen;
if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
continue;
int fd = atoi(d->d_name);
if (fd > 2 && fd != dir)
close(fd);
}
}
close(dir);
}
static void drop_privileges(uid_t uid, gid_t gid) {
if (setgroups(0, NULL) == -1 && errno != EPERM)
die("setgroups: %s", strerror(errno));
if (setgid(gid) == -1)
die("setgid: %s", strerror(errno));
if (setuid(uid) == -1)
die("setuid: %s", strerror(errno));
}
static void write_file_fmt(const char *path, const char *fmt, ...) {
int fd = open(path, O_WRONLY);
if (fd == -1)
die("open(%s): %s", path, strerror(errno));
char buf[256];
va_list ap;
va_start(ap, fmt);
int len = vsnprintf(buf, sizeof(buf), fmt, ap);
va_end(ap);
if (len < 0 || (size_t)len >= sizeof(buf))
die("formatting %s failed", path);
if (write(fd, buf, len) != len)
die("write(%s): %s", path, strerror(errno));
close(fd);
}
static void setup_user_namespace(uid_t host_uid, gid_t host_gid) {
if (unshare(CLONE_NEWUSER) == -1)
die("unshare(CLONE_NEWUSER): %s", strerror(errno));
if (access("/proc/self/setgroups", F_OK) == 0)
write_file_fmt("/proc/self/setgroups", "deny\n");
write_file_fmt("/proc/self/uid_map", "0 %u 1\n", host_uid);
write_file_fmt("/proc/self/gid_map", "0 %u 1\n", host_gid);
}And the changes to main:
int main(int argc, char **argv) {
const char *root = NULL;
uid_t host_uid = geteuid();
gid_t host_gid = getegid();
bool rootless = (host_uid != 0);
uid_t uid = rootless ? 0 : 65534;
gid_t gid = rootless ? 0 : 65534;
for (int i = 1; i < argc; i++) {
if (!strcmp(argv[i], "-r") && i + 1 < argc) {
root = argv[++i];
} else if (!strcmp(argv[i], "-u") && i + 1 < argc) {
uid = (uid_t)strtoul(argv[++i], NULL, 10);
} else if (!strcmp(argv[i], "-g") && i + 1 < argc) {
gid = (gid_t)strtoul(argv[++i], NULL, 10);
} else if (!strcmp(argv[i], "--")) {
argv += i + 1;
argc -= i + 1;
break;
} else if (argv[i][0] == '-') {
usage(argv[0]);
} else {
argv += i;
argc -= i;
break;
}
}
if (!root || argc < 1)
usage(argv[0]);
if (rootless && (uid != 0 || gid != 0))
die("non-root users must use -u/-g 0 in this stage");
setup_user_namespace(host_uid, host_gid);
if (unshare(CLONE_NEWNS) == -1)
die("unshare(CLONE_NEWNS): %s", strerror(errno));
make_private_mounts();
bind_mount_self(root);
if (chroot(root) == -1)
die("chroot(%s): %s", root, strerror(errno));
if (chdir("/") == -1)
die("chdir(/): %s", strerror(errno));
mkdir("/proc", 0555);
mount("proc", "/proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);
clearenv();
close_all_fds_except012();
drop_privileges(uid, gid);
execvp(argv[0], argv);
die("execvp('%s') failed: %s", argv[0], strerror(errno));
}Now, do we have an escape? If you do it right, no. However, there’s a convoluted way of getting out.
If we know there’s a process that has access to the root file system,
mounting, etc, we “hack” it in a bit of an odd way with
ptrace. Ptrace allows us to “observe” and
“control” the execution of other processes. With this, we can have it be
a confused deputy and mount our file system and therefore get a way to
chroot out. Compile this program and point it to a victim pid on the
host that can mount for us:
int main(int argc, char **argv) {
pid_t pid = atoi(argv[1]);
struct user_regs_struct regs;
ptrace(PTRACE_ATTACH, pid, 0, 0);
waitpid(pid, NULL, 0);
ptrace(PTRACE_GETREGS, pid, 0, ®s);
regs.orig_rax = SYS_mount;
regs.rdi = (unsigned long)"/host";
regs.rsi = (unsigned long)"/mnt";
regs.rdx = MS_BIND;
regs.r10 = 0;
ptrace(PTRACE_SETREGS, pid, 0, ®s);
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, NULL, 0);
ptrace(PTRACE_DETACH, pid, 0, 0);
return 0;
}And we are out:
$ toybox chroot /mnt /bin/shEven More Security
So we’ve learned that ptrace is evil in this case. What
we need is a way to only allow certain syscalls. Thankfully,
seccomp will do that for us – if you try to run a syscall,
it will get killed.
Let’s add some extra headers and some helpers:
#include <seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/resource.h>
static void setup_user_namespace(uid_t host_uid, gid_t host_gid,
uid_t target_uid, gid_t target_gid) {
char map_buf[256];
int len = 0;
if (host_uid == 0) {
len = snprintf(map_buf, sizeof(map_buf), "0 %u 1\n", host_uid);
if (target_uid != 0)
len += snprintf(map_buf + len, sizeof(map_buf) - len, "%u %u 1\n",
target_uid, target_uid);
} else {
if (target_uid != 0)
die("non-root users must run with -u 0 (got %u)", target_uid);
len = snprintf(map_buf, sizeof(map_buf), "0 %u 1\n", host_uid);
}
if (len < 0 || (size_t)len >= sizeof(map_buf))
die("uid_map formatting overflow");
write_file_fmt("/proc/self/uid_map", "%s", map_buf);
len = 0;
if (host_uid == 0) {
len = snprintf(map_buf, sizeof(map_buf), "0 %u 1\n", host_gid);
if (target_gid != 0)
len += snprintf(map_buf + len, sizeof(map_buf) - len, "%u %u 1\n",
target_gid, target_gid);
} else {
if (target_gid != 0)
die("non-root users must run with -g 0 (got %u)", target_gid);
len = snprintf(map_buf, sizeof(map_buf), "0 %u 1\n", host_gid);
}
if (len < 0 || (size_t)len >= sizeof(map_buf))
die("gid_map formatting overflow");
write_file_fmt("/proc/self/gid_map", "%s", map_buf);
}
static void sigsys_handler(int _nr, siginfo_t *info, void *_uctx) {
(void)_nr;
(void)_uctx;
if (!info) {
fprintf(stderr, "blocked syscall (no info)\n");
} else {
fprintf(stderr, "blocked syscall %d (arch %#x, addr %p)\n",
info->si_syscall, info->si_arch, info->si_call_addr);
}
_exit(1);
}
static void install_sigsys_logger(void) {
struct sigaction sa;
memset(&sa, 0, sizeof(sa));
sa.sa_sigaction = sigsys_handler;
sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
if (sigaction(SIGSYS, &sa, NULL) == -1)
die("sigaction(SIGSYS): %s", strerror(errno));
}
static void install_seccomp_allowlist() {
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
die("prctl(NO_NEW_PRIVS): %s", strerror(errno));
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);
if (!ctx)
die("seccomp_init failed");
#define ALLOW(syscall) \
do { \
if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(syscall), 0)) \
die("allow " #syscall " failed"); \
} while (0)
#define ALLOW_ARG(syscall, idx, cmp, val) \
do { \
if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(syscall), 1, \
SCMP_CMP(idx, cmp, val))) \
die("allow " #syscall " arg failed"); \
} while (0)
ALLOW(rt_sigaction);
ALLOW(rt_sigprocmask);
ALLOW(umask);
ALLOW(rt_sigreturn);
ALLOW(exit);
ALLOW(exit_group);
ALLOW(getpid);
ALLOW(getppid);
ALLOW(gettid);
ALLOW(futex);
ALLOW(clone);
ALLOW(set_tid_address);
ALLOW(set_robust_list);
ALLOW(clock_gettime);
ALLOW(clock_nanosleep);
ALLOW(nanosleep);
ALLOW(getrandom);
ALLOW(brk);
ALLOW(mmap);
ALLOW(munmap);
ALLOW(mremap);
ALLOW(mprotect);
ALLOW(arch_prctl);
ALLOW(getuid);
ALLOW(geteuid);
ALLOW(getgid);
ALLOW(getegid);
ALLOW(prlimit64);
ALLOW(prctl);
ALLOW(madvise);
ALLOW(access);
ALLOW(faccessat2);
ALLOW(pread64);
ALLOW(rseq);
ALLOW(fadvise64);
ALLOW(read);
ALLOW(write);
ALLOW(writev);
ALLOW(close);
ALLOW(lseek);
ALLOW(getdents64);
ALLOW(ioctl);
ALLOW(statx);
ALLOW(fstat);
ALLOW(fstatfs);
ALLOW(newfstatat);
ALLOW(fcntl);
ALLOW(dup);
ALLOW(dup2);
ALLOW(dup3);
ALLOW(pipe);
ALLOW(pipe2);
ALLOW(readlink);
ALLOW(readlinkat);
ALLOW(uname);
ALLOW(open);
ALLOW(openat);
ALLOW(execve);
ALLOW(execveat);
ALLOW(socket);
ALLOW(connect);
ALLOW(bind);
ALLOW(listen);
ALLOW(accept);
ALLOW(accept4);
ALLOW(getsockopt);
ALLOW(setsockopt);
ALLOW(getsockname);
ALLOW(getpeername);
ALLOW(sendto);
ALLOW(recvfrom);
ALLOW(sendmsg);
ALLOW(recvmsg);
ALLOW(shutdown);
ALLOW(clone3);
ALLOW(eventfd2);
ALLOW(poll);
ALLOW(statfs);
ALLOW(getcwd);
ALLOW(getpgid);
ALLOW(setpgid);
ALLOW(stat);
ALLOW(vfork);
ALLOW(wait4);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(chroot), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(pivot_root), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(mount), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(umount2), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(open_by_handle_at), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(name_to_handle_at), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(bpf), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(kexec_load), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(kexec_file_load), 0);
seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(ptrace), 0);
if (seccomp_load(ctx) != 0) {
seccomp_release(ctx);
die("seccomp_load failed");
}
seccomp_release(ctx);
#undef ALLOW
#undef ALLOW_ARGAnd fix up main.
int main(int argc, char **argv) {
const char *root = NULL;
uid_t host_uid = geteuid();
gid_t host_gid = getegid();
bool rootless = (host_uid != 0);
uid_t uid = rootless ? 0 : 65534;
gid_t gid = rootless ? 0 : 65534;
for (int i = 1; i < argc; i++) {
if (!strcmp(argv[i], "-r") && i + 1 < argc) {
root = argv[++i];
} else if (!strcmp(argv[i], "-u") && i + 1 < argc) {
uid = (uid_t)strtoul(argv[++i], NULL, 10);
} else if (!strcmp(argv[i], "-g") && i + 1 < argc) {
gid = (gid_t)strtoul(argv[++i], NULL, 10);
} else if (!strcmp(argv[i], "--")) {
argv += i + 1;
argc -= i + 1;
break;
} else if (argv[i][0] == '-') {
usage(argv[0]);
} else {
argv += i;
argc -= i;
break;
}
}
if (!root || argc < 1)
usage("container");
if (rootless && (uid != 0 || gid != 0))
die("non-root users can only use -u/-g 0 (try running as root for other "
"ids)");
setup_user_namespace(host_uid, host_gid, uid, gid);
if (unshare(CLONE_NEWNS) == -1)
die("unshare(CLONE_NEWNS): %s", strerror(errno));
make_private_mounts();
bind_mount_self(root);
if (chroot(root) == -1)
die("chroot(%s): %s", root, strerror(errno));
if (chdir("/") == -1)
die("chdir(/): %s", strerror(errno));
mkdir("/proc", 0555);
if (mount("proc", "/proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL) ==
-1) {
}
clearenv();
close_all_fds_except_std();
drop_privileges(uid, gid);
install_sigsys_logger();
install_seccomp_allowlist();
execvp(argv[0], argv);
die("execvp('%s') failed: %s", argv[0], strerror(errno));
}Now with this, we can’t try any ptrace tricks, since
we’ve explicitly disallowed it. As well, we’ve added a helper that lets
us strace a task to see which syscall killed our task (good for
debugging).
I’ve allowed enough syscalls to allow for one last party trick – a way to call the network (on your own computer).
I spun up an http server serving the container.c file at
127.0.0.1:8080/container.c:
And look, there’s the code.
./container_3 -r root/ -- /bin/dash
$ toybox wget http://127.0.0.1:8080/container.c -O -
# the code we've been writing today.If you’re a little too overzealous with the system calls, there’s probably an escape out, but we can cover the most egregious ones with the seccomp denylist. You can configure the list how you like and you can get a passable toy container and play around with it.