Running a Container off the Host /usr/

Post Syndicated from original https://0pointer.net/blog/running-an-container-off-the-host-usr.html

Apparently, in some parts of this world, the /usr/-merge transition
is still ongoing. Let’s take the opportunity to have a look at one
specific way to take advantage of the /usr/-merge (and associated
work) IRL.

I develop system-level software as you might know. Oftentimes I want
to run my development code on my PC but be reasonably sure it cannot
destroy or otherwise negatively affect my host system. Now I could set
up a container tree for that, and boot into that. But often I am too
lazy for that: I don’t want to bother with a slow package manager
setting up a new OS tree for me. So here’s what I often do instead,
and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs
the exact same software as my host — but is also isolated from the
host. I do not need to prepare any separate OS tree or anything
else. It just works. And my host user lennart is just there,
ready for me to log into.

So here’s what these systemd-nspawn options specifically do:

--directory=/ tells systemd-nspawn to run off the host OS’
file hierarchy. That smells like danger of course, running two
OS instances off the same directory hierarchy. But don’t be
scared, because:

--volatile=yes enables volatile mode. Specifically, this means
the tree we configured with --directory=/ as root file system is
slightly rearranged. Instead of mounting that tree as it is, we’ll
mount a tmpfs instance as actual root file system, and then
mount the /usr/ subdirectory of the specified hierarchy into the
/usr/ subdirectory of the container file hierarchy in read-only
fashion – and only that directory. So now we have a container
directory tree that is basically empty, but imports all host OS
binaries and libraries into its /usr/ tree. All software
installed on the host is also available in the container with no
manual work. This mechanism only works because on /usr/-merged
OSes vendor resources are monopolized at a single place:
/usr/. It’s sufficient to share that one directory with the
container to get a second instance of the host OS running. Note
that this means /etc/ and /var/ will be entirely empty
initially when this second system boots up. Thankfully,
forward-looking distributions (such as Fedora) have adopted
systemd-tmpfiles and systemd-sysusers
quite pervasively, so that system users and files/directories
required for operation are created automatically should they be
missing. Thus, even though at boot the mentioned directories are
initially empty, once the system is booted up they are
sufficiently populated for things to just work.

-U means we’ll enable user namespacing, in fully automatic
mode. This does three things: it picks a free host UID range
dynamically for the container, then sets up user namespacing for
the container processes, mapping that host UID range to UIDs 0…65534
in the container. It then sets up a similar UID mapped mount on the
/usr/ tree of the container. Net effect: file ownerships as set
on the host OS tree appear as if they belong to the very same users
inside of the container environment, except that we use user
namespacing for everything, and thus the users are actually
neatly isolated from the host.

--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are
bits of data that you can pass to systemd services and whole
systems. They are actually awesome concepts (e.g. they support
TPM2 authentication/encryption that just works!) but I am not going
to go into details around that, given it’s off-topic in this
specific scenario. Here we just take advantage of the fact that
systemd-sysusers looks for a credential called
passwd.hashed-password.root to initialize the root password of
the system from. We set it to the hash of mysecret. This means once
the system is booted up we can log in as root with the supplied
password. Yay. (Remember, /etc/ is initially empty on this
container, and thus also carries no /etc/passwd or
/etc/shadow, and thus has no root user record, and thus no root
password.)

mkpasswd is a tool that converts a plain text password into a UNIX
hashed password, which is what this specific credential expects.

Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the
systemd-firstboot
service in the container to initialize /etc/locale.conf with
this locale.

--bind-user=lennart binds the host user lennart into the
container, also as user lennart. This does two things: it mounts
the host user’s home directory into the container. It also copies
a minimal user record of the specified user into the container
that nss-systemd then picks up and includes in the regular user
database. This means, once the container is booted up I can log in
as lennart with my regular password, and once I have logged in I
will see my regular host home directory, and can make changes to
it. Yippieh! (This does a couple more things, such as UID mapping,
but let’s not get lost in too many details.)
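
To see what the -U mapping described above looks like from a shell, one can read /proc/self/uid_map inside the container. The following is only a minimal sketch: the helper name describe_uid_map is made up for illustration and is not part of systemd. It reformats the three columns the kernel exposes per line (first UID inside the namespace, first UID it maps to outside, length of the range).

```shell
#!/bin/sh
# Hypothetical helper (not part of systemd): pretty-print a uid_map file.
# Each uid_map line has three fields: first UID inside the namespace,
# first UID it maps to outside, and the number of UIDs in the range.
describe_uid_map() {
    while read -r inside outside count; do
        echo "container UIDs $inside..$(( $inside + $count - 1 )) -> host UIDs $outside..$(( $outside + $count - 1 ))"
    done < "${1:-/proc/self/uid_map}"
}
```

Called without an argument inside the container, this reads the mapping of the current shell. On a host shell that is not in a user namespace it would typically report an identity mapping instead.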

So, if I run this, I will very quickly get a login prompt, where I can
log in as my regular user. I have full access to my host home
directory, but otherwise everything is nicely isolated from the host,
and changes outside of the home directory are either prohibited or are
volatile, i.e. go to a tmpfs instance whose lifetime is bound to the
container’s lifetime: when I shut down the container I just started,
then any changes outside of my user’s home directory are lost.
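
One way to convince yourself of this layout from inside the container is to look at the mount table. The snippet below is a rough sketch under the assumption that /proc/self/mounts is readable; mount_info is a hypothetical helper name, and the optional second argument only exists so the function can be tried against a saved copy of a mount table.

```shell
#!/bin/sh
# Hypothetical helper: print "<fstype> <options>" for the mount at path $1.
# Reads the mount table in $2 (defaults to /proc/self/mounts). If several
# entries share the mount point, the last one listed is reported.
mount_info() {
    awk -v p="$1" '$2 == p { line = $3 " " $4 }
                   END { if (line != "") print line }' \
        "${2:-/proc/self/mounts}"
}
```

Going by the description above, mount_info / should report a tmpfs inside the container, and mount_info /usr should list ro among its options.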

Note that while here I use --volatile=yes in combination with
--directory=/ you can actually use it on any OS hierarchy, i.e. just
about any directory that contains OS binaries.

Similarly, the --bind-user= stuff works with any OS hierarchy too (but
do note that only systemd 249 and newer will pick up the user records
passed to the container that way, i.e. this requires at least v249
both on the host and in the container to work).
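
Since both switches work with any OS hierarchy, the command line from above can be wrapped in a small function that takes the tree and the user as parameters. This is just a sketch: the function name volatile_container and the DRY_RUN variable are invented for illustration, and the root password credential is left out so the only login is the bound user.

```shell
#!/bin/sh
# Hypothetical wrapper around the nspawn invocation shown earlier.
# $1: directory containing an OS tree, $2: host user to bind in.
# With DRY_RUN set, the command is only printed, not executed.
volatile_container() {
    tree=$1
    user=$2
    set -- systemd-nspawn \
        --directory="$tree" \
        --volatile=yes \
        -U \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user="$user" \
        -b
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, volatile_container / lennart would boot the same kind of container as the command at the top, minus the root password.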

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

A recent kernel (5.15 should suffice, as it brings UID mapped
mounts for the most common file systems, so that -U and
--bind-user= can work well.)

A recent systemd (249 should suffice, which brings --bind-user=,
and a -U switch backed by UID mapped mounts).

A distribution that adopted systemd-tmpfiles and
systemd-sysusers so that the directory hierarchy and user
databases are automatically populated when empty at boot. (Fedora
35 should suffice.)
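
The first two requirements can be checked quickly from a shell. The following is a minimal sketch, assuming a sort that supports -V (GNU coreutils does) and that the first line of systemctl --version starts with "systemd" followed by the version number. The function names are made up for illustration.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2 (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print a short report; nothing on the system is changed.
check_requirements() {
    kernel=$(uname -r | cut -d- -f1)
    if version_ge "$kernel" 5.15; then
        echo "kernel $kernel: ok"
    else
        echo "kernel $kernel: older than 5.15"
    fi
    if command -v systemctl >/dev/null 2>&1; then
        # First line looks like "systemd 249 (...)"; take the second word.
        sd=$(systemctl --version | awk 'NR == 1 { print $2 }')
        if version_ge "$sd" 249; then
            echo "systemd $sd: ok"
        else
            echo "systemd $sd: older than 249"
        fi
    else
        echo "systemd: systemctl not found"
    fi
}
```

Running check_requirements on the host prints one line per component. It says nothing about the third requirement, which depends on how the distribution is packaged.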

Limitations

While a lot of today’s software actually works well out of the box on
systems that come up with an unpopulated /etc/ and /var/, and
either falls back to reasonable built-in defaults, or deploys
systemd-tmpfiles to create what is missing, things aren’t perfect:
some software typically installed on desktop OSes will fail to start
when invoked in such a container, and be visible as ugly failed
services, but it won’t stop me from logging in and using the system
for what I want to use it for. It would be excellent to get that fixed,
though. This can either be fixed in the relevant software upstream
(i.e. if opening your configuration file fails with ENOENT, then
just fall back to reasonable defaults), or in the distribution packaging
(i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton
configuration from /usr/share/factory/etc/ via the C or L line types).
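
For the packaging route, such a tmpfiles.d/ drop-in could look roughly like the following. The package name foo and its file names are purely hypothetical; the sketch assumes the package ships its skeleton configuration below /usr/share/factory/etc/.

```
# /usr/lib/tmpfiles.d/foo.conf (hypothetical package "foo")
# Type Path                   Mode User Group Age Argument

# Copy the skeleton file if /etc/foo.conf does not exist yet. With no
# argument, the source is the same path below /usr/share/factory/.
C /etc/foo.conf               -    -    -     -

# Or create a symlink into /usr/share/factory/ instead of a copy.
L /etc/foo-defaults.conf      -    -    -     -
```

Which of the two to use is a design choice: a copy can be edited locally afterwards, while a symlink keeps following whatever the vendor ships in /usr/.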

And then there’s certain software dealing with hardware management and
similar that simply cannot reasonably work in a container (as device
APIs on Linux are generally not virtualized for containers). It
would be excellent if software like that were updated to carry
ConditionVirtualization=!container or
ConditionPathIsReadWrite=/sys conditionalization in their unit
files, so that it is automatically, and cleanly, skipped when executed
in such a container environment.
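
In a unit file that would look something like this sketch. The service name and binary path are hypothetical; only the two Condition settings are taken from the text above.

```ini
# /usr/lib/systemd/system/foo-hwmon.service (hypothetical service)
[Unit]
Description=Hypothetical hardware management daemon
# Skip the unit (instead of letting it fail) inside any container.
ConditionVirtualization=!container
# Alternative check: only start if /sys is mounted writable.
#ConditionPathIsReadWrite=/sys

[Service]
ExecStart=/usr/bin/foo-hwmond
```

A unit whose condition is not met is skipped rather than marked as failed, which is exactly the difference that keeps the ugly failed services out of the container.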

And that’s all for now.
