Kernel-Less Containers

Note: You should probably have a rough understanding of the terms ‘kernel/user-space’ and ‘syscall’; otherwise this post is going to be hard to follow!

Namespaces

Linux namespaces give a process its own private view of a kernel resource, such as the mount table. For example, we can enter a new mount namespace and remount a path read-write without affecting the rest of the system:

$ sudo unshare -m

# mount -o remount,rw /nix/store/

# ... you can now edit files in the nix store
# ... but only in this shell!

We can fully customize the filesystem view and mount or unmount things, and the changes are visible only to the process tree running inside the new mount namespace.

Containers

What is usually referred to as a ‘container’ is a collection of namespaces, bundled with some mechanism to provision a container image.
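
To make that concrete, here is a minimal sketch of what a runtime does under the hood: combine several namespaces and chroot into an unpacked image (./my-rootfs is a placeholder for wherever your image-provisioning mechanism unpacked the image):

$ sudo unshare --mount --uts --ipc --net --pid --fork chroot ./my-rootfs /bin/sh
# ... a very primitive ‘container’: its own mounts, hostname, PIDs and network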

Rootless containers

Rootless containers use unprivileged user namespaces to create containers without requiring special privileges on the host. This is also used by some tools like bubblewrap and flatpak.
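
For example, on a kernel that allows it, a completely unprivileged user can do:

$ unshare --user --map-root-user --mount
# whoami
root
# mount -t tmpfs none /mnt
# ... ‘root’ inside the namespace, still an ordinary user outside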

This only works if unprivileged user namespaces are enabled, i.e. the sysctl kernel.unprivileged_userns_clone is set to 1 (on kernels that expose this knob). Because this approach greatly extends the attack surface of the kernel, it is usually disabled by shared hosting providers.
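
You can check whether your kernel allows this (assuming a kernel that exposes this particular sysctl; not all do):

$ sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1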

About ptrace(2) and proot(1)

The ptrace syscall is used by debuggers to trap into a program and monitor the syscalls it performs. It works even on ancient kernels, on shared hosting, and under OpenVZ.

proot is a tool which mimics the functionality of chroot (and more!) without the need for special privileges or user namespaces. It uses ptrace to hook into the application’s syscalls and write the appropriate answers into its registers.
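
For example, assuming an extracted root filesystem in ./rootfs (the path and bind mounts here are just illustrative):

$ proot -r ./rootfs -b /proc -b /dev /bin/sh
# ... a chroot-like environment, no root privileges required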

It sure would be cool if we could intercept all syscalls and emulate unprivileged network namespaces and mount namespaces without ever telling the kernel… 🤔

User-Mode Linux

I want to mention that everything I am about to describe is already possible using User-Mode Linux, which is a way of running a Linux kernel as a normal userspace process on top of another Linux kernel. There is probably a larger performance penalty to this approach, but I haven’t benchmarked it yet.
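
Launching UML is as simple as running an executable; a sketch, with rootfs.img as a placeholder for a root filesystem image:

$ ./linux ubd0=rootfs.img root=/dev/ubda mem=256M
# ... boots a guest Linux kernel entirely as a userspace process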

gVisor

And then I discovered that someone has already implemented all of this, albeit with a slightly different intention/use-case: gVisor, originating from Google, attempts to build a security sandbox for containers. The application inside the container does not talk directly to the host kernel (as is the case with namespacing implemented in the host kernel); instead, gVisor intercepts the syscalls made by the application and handles them within its own codebase, which is written in Go and thus less likely to suffer from memory safety issues than the Linux kernel. Additionally, this gVisor sandbox is still contained within a ‘normal’ kernel-namespace- and seccomp-based container, so even if someone breaks out of the inner sandbox, they are still confined to the outer container as a second line of defense.

And they have even poured a lot of work into optimizing this approach. Their blog and documentation give a lot of useful insight into how gVisor works.
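
The usual way to use gVisor is through its OCI runtime runsc, e.g. registered as an additional Docker runtime (assuming runsc is installed and hooked into Docker as their documentation describes):

$ docker run --rm --runtime=runsc alpine dmesg
# ... the boot messages you see come from gVisor’s userspace kernel, not the host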

So I wondered: what if we take only the inner sandbox of gVisor and use it to run our OCI containers on a host where the kernel doesn’t allow us to use namespaces? It appears they disable parts of the outer container for their unit tests using a parameter -TESTONLY-no-chroot; however, I could not get this running yet.

Endless Opportunities

If we could present a userspace filesystem such as sshfs, or a virtual network such as a WireGuard tunnel, to the application without ever telling the kernel, that would give us a lot of flexibility: it would make it easier to tinker with things and to deploy whole stacks of software to remote locations where we have no special privileges.

We could hook this up to existing filesystem abstractions such as FUSE (e.g. rclone) to present any cloud storage to the application, and encrypt data on the fly.
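
Today this already works through the kernel via FUSE; a sketch with rclone (the remote name mycloud: and the paths are placeholders):

$ rclone mount mycloud:bucket /mnt/data --vfs-cache-mode writes
# ... any rclone-supported cloud storage appears as a normal directory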

If we use some DPDK magic it might even bypass the kernel completely and go super duper fast!


Thanks for reading all of this. I’m really excited about everything I discovered here, and I hope you learned something.