Running a Container off the Host /usr/

Post Syndicated from original https://0pointer.net/blog/running-an-container-off-the-host-usr.html

Apparently, in some parts of this world, the /usr/-merge transition is
still ongoing. Let’s take the opportunity to have a look at one
specific way to take advantage of the /usr/-merge (and associated
work) IRL.

I develop system-level software as you might know. Oftentimes I want
to run my development code on my PC but be reasonably sure it cannot
destroy or otherwise negatively affect my host system. Now I could set
up a container tree for that, and boot into that. But often I am too
lazy for that: I don’t want to bother with a slow package manager
setting up a new OS tree for me. So here’s what I often do instead,
and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs
the exact same software as my host — but is also isolated from the
host. I do not need to prepare any separate OS tree or anything
else. It just works. And my host user lennart is just there,
ready for me to log into.

So here’s what these systemd-nspawn options specifically do:

  • --directory=/ tells systemd-nspawn to run off the host OS’
    file hierarchy. That smells like danger of course, running two
    OS instances off the same directory hierarchy. But don’t be
    scared, because:

  • --volatile=yes enables volatile mode. Specifically this means
    what we configured with --directory=/ as root file system is
    slightly rearranged. Instead of mounting that tree as it is, we’ll
    mount a tmpfs instance as actual root file system, and then
    mount the /usr/ subdirectory of the specified hierarchy into the
    /usr/ subdirectory of the container file hierarchy in read-only
    fashion – and only that directory. So now we have a container
    directory tree that is basically empty, but imports all host OS
    binaries and libraries into its /usr/ tree. All software
    installed on the host is also available in the container with no
    manual work. This mechanism only works because on /usr/-merged
    OSes vendor resources are monopolized at a single place:
    /usr/. It’s sufficient to share that one directory with the
    container to get a second instance of the host OS running. Note
    that this means /etc/ and /var/ will be entirely empty
    initially when this second system boots up. Thankfully,
    forward-looking distributions (such as Fedora) have adopted
    systemd-tmpfiles and systemd-sysusers quite pervasively, so that
    system users and files/directories
    required for operation are created automatically should they be
    missing. Thus, even though at boot the mentioned directories are
    initially empty, once the system is booted up they are
    sufficiently populated for things to just work.

  • -U means we’ll enable user namespacing, in fully automatic
    mode. This does three things: it picks a free host UID range
    dynamically for the container, then sets up user namespacing for
    the container processes, mapping that host UID range to UIDs
    0…65534 in the container. It then sets up a similar UID mapped
    mount on the /usr/ tree of the container. Net effect: file
    ownerships as set on the host OS tree appear as if they belong to
    the very same users
    inside of the container environment, except that we use user
    namespacing for everything, and thus the users are actually
    neatly isolated from the host.

  • --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret)
    passes a credential to the container. Credentials are
    bits of data that you can pass to systemd services and whole
    systems. They are actually awesome concepts (e.g. they support
    TPM2 authentication/encryption that just works!) but I am not going
    to go into details around that, given it’s off-topic in this
    specific scenario. Here we just take advantage of the fact that
    systemd-sysusers looks for a credential called
    passwd.hashed-password.root and initializes the root password of
    the system from it. We set it to the hash of mysecret. This means
    once the system is booted up we can log in as root with the
    supplied password. Yay. (Remember, /etc/ is initially empty on this
    container, and thus also carries no /etc/passwd or
    /etc/shadow, and thus has no root user record, and thus no root
    password.)

    mkpasswd is a tool that
    converts a plain text password into a UNIX hashed password, which
    is what this specific credential expects.

  • Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the
    systemd-firstboot service in the container to initialize
    /etc/locale.conf with this locale.

  • --bind-user=lennart binds the host user lennart into the
    container, also as user lennart. This does two things: it mounts
    the host user’s home directory into the container. It also copies
    a minimal user record of the specified user into the container
    that nss-systemd then picks up and includes in the regular user
    database. This means, once the container is booted up I can log in
    as lennart with my regular password, and once I am logged in I will
    see my regular host home directory, and can make changes to it.
    Yippieh! (This does a couple more things, such as UID mapping, but
    let’s not get lost in too many details.)

So, if I run this, I will very quickly get a login prompt, where I can
log in as my regular user. I have full access to my host home
directory, but otherwise everything is nicely isolated from the host,
and changes outside of the home directory are either prohibited or are
volatile, i.e. go to a tmpfs instance whose lifetime is bound to the
container’s lifetime: when I shut down the container I just started,
then any changes outside of my user’s home directory are lost.

Note that while here I use --volatile=yes in combination with
--directory=/ you can actually use it on any OS hierarchy, i.e. just
about any directory that contains OS binaries.
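
As a rough sketch of that, a small shell wrapper can boot a throwaway
container off any OS tree. The function name boot_volatile and the
example path in the usage comment are made up for illustration; the
systemd-nspawn switches are the ones discussed above.

```shell
# boot_volatile is a hypothetical helper name, not part of systemd.
# It boots a volatile container from the /usr/ tree below the given directory.
boot_volatile() {
    tree=$1
    # Volatile mode takes only /usr/ from the tree, so check that it exists.
    if [ ! -d "$tree/usr" ]; then
        echo "no usr/ directory in $tree" >&2
        return 1
    fi
    systemd-nspawn --directory="$tree" --volatile=yes -U -b
}

# Usage (as root), with an example path:
#   boot_volatile /var/lib/machines/mytree
```

Anything written outside of /usr/ in such a container lands on the
tmpfs root and is gone after shutdown, exactly as in the
--directory=/ case.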

Similarly, the --bind-user= stuff works with any OS hierarchy too (but
do note that only systemd 249 and newer will pick up the user records
passed to the container that way, i.e. this requires at least v249
both on the host and in the container to work).

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

  1. A recent kernel (5.15 should suffice, as it brings UID mapped
    mounts for the most common file systems, so that -U and
    --bind-user= can work well.)

  2. A recent systemd (249 should suffice, which brings --bind-user=,
    and a -U switch backed by UID mapped mounts).

  3. A distribution that adopted systemd-tmpfiles and
    systemd-sysusers so that the directory hierarchy and user
    databases are automatically populated when empty at boot. (Fedora
    35 should suffice.)
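
The two version thresholds above can be checked with a small shell
sketch. version_ge is a hypothetical helper, and it assumes a sort
that supports -V (GNU coreutils does). It does not check the third
point, i.e. how far the distribution has adopted systemd-tmpfiles and
systemd-sysusers.

```shell
# version_ge is a hypothetical helper: succeeds if version $1 >= version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix, e.g. "6.8.0" from "6.8.0-45-generic".
kernel_version=$(uname -r | cut -d- -f1)
# The first line of the output looks like "systemd 255 (...)".
systemd_version=$(systemctl --version 2>/dev/null | awk 'NR==1 {print $2}')

if version_ge "$kernel_version" 5.15; then
    echo "kernel $kernel_version: ok"
else
    echo "kernel $kernel_version: older than 5.15"
fi

if [ -n "$systemd_version" ] && version_ge "$systemd_version" 249; then
    echo "systemd $systemd_version: ok"
else
    echo "systemd ${systemd_version:-not found}: need 249 or newer"
fi
```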

Limitations

While a lot of today’s software actually works well out of the box on
systems that come up with an unpopulated /etc/ and /var/, and
either falls back to reasonable built-in defaults or deploys
systemd-tmpfiles to create what is missing, things aren’t perfect:
some software typically installed on desktop OSes will fail to start
when invoked in such a container, and be visible as ugly failed
services, but it won’t stop me from logging in and using the system
for what I want to use it for. It would be excellent to get that
fixed, though. This can either be fixed in the relevant software
upstream (i.e. if opening your configuration file fails with ENOENT,
then just fall back to reasonable defaults), or in the distribution
packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in
skeleton configuration from /usr/share/factory/etc/ via the C or L
line types).
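
For the packaging route, a tmpfiles.d/ snippet could look roughly like
the one below. The package name myapp and its paths are hypothetical.
The shell code merely writes the file to a scratch directory so that
it can be inspected; a real package would ship it as
/usr/lib/tmpfiles.d/myapp.conf.

```shell
# "myapp" and all of its paths are made up for illustration.
scratch=$(mktemp -d)
cat > "$scratch/myapp.conf" <<'EOF'
# Type Path                 Mode User Group Age Argument
# Copy the default configuration from the factory tree if it is missing.
C      /etc/myapp.conf      -    -    -     -   /usr/share/factory/etc/myapp.conf
# Or symlink a directory of defaults instead of copying it.
L      /etc/myapp.d         -    -    -     -   /usr/share/factory/etc/myapp.d
EOF
cat "$scratch/myapp.conf"
```

With a file like this in place, systemd-tmpfiles populates the empty
/etc/ at boot before the service that needs the configuration runs.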

And then there’s certain software dealing with hardware management and
similar that simply cannot reasonably work in a container (as device
APIs on Linux are generally not virtualized for containers). It
would be excellent if software like that were updated to carry
ConditionVirtualization=!container or
ConditionPathIsReadWrite=/sys conditions in its unit
files, so that it is automatically and cleanly skipped when executed
in such a container environment.
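
A sketch of what that could look like, using a made-up unit name
myhardware.service. It is written as a drop-in so that it can be tried
without editing the unit itself; upstream would simply put the line
into the [Unit] section of the shipped unit file. As before, the shell
code only writes the file to a scratch directory.

```shell
# "myhardware.service" is a hypothetical unit name.
scratch=$(mktemp -d)
mkdir -p "$scratch/myhardware.service.d"
cat > "$scratch/myhardware.service.d/10-container.conf" <<'EOF'
[Unit]
# Skip this unit when running inside a container.
ConditionVirtualization=!container
# Alternative: skip it whenever /sys is not writable.
#ConditionPathIsReadWrite=/sys
EOF
cat "$scratch/myhardware.service.d/10-container.conf"
```

A unit whose condition does not hold is skipped rather than marked as
failed, which is what keeps the container boot free of those ugly
failed services.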

And that’s all for now.