Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/stateless.html

(Just a small heads-up: I don’t blog as much as I used to, I
nowadays update my Google+
page
a lot more frequently. You might want to subscribe to that if
you are interested in more frequent technical updates on what we are
working on.)

In the past weeks we have been working on a couple of features for
systemd
that enable a number of new usecases I’d like to shed some light
on. Taking advantage of the /usr
merge
that a number of distributions have completed we want to
bring runtime behaviour of Linux systems to the next level. With the
/usr merge completed most static vendor-supplied OS data is
found exclusively in /usr, only a few additional bits in
/var and /etc are necessary to make a system
boot. On this we can build to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out
    /etc and /var, but keep the vendor-supplied
    /usr, bringing the system back into a well-defined, pristine
    vendor state with no local state or configuration. This functionality
    is useful across the board from servers, to desktops, to embedded
    devices.
  2. A Stateless System goes one step further: a system like
    this never stores /etc or /var on persistent
    storage, but always comes up with pristine vendor state. On systems
    like this every reboot acts as a factory reset. This functionality is
    particularly useful for simple containers or systems that boot off the
    network or read-only media, and receive all configuration they need
    during runtime from vendor packages or protocols like DHCP or are
    capable of discovering their parameters automatically from the
    available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many
    containers or systems. Only local configuration or state is stored
    per-system, while the vendor operating system is pulled in from the
    same, immutable, shared snapshot. Each system hence has its private
    /etc and /var for receiving local configuration,
    however the OS tree in /usr is pulled in via bind mounts (in
    case of containers) or technologies like NFS (in case of physical
    systems), or btrfs snapshots from a golden master image. This is
    particularly interesting for containers where the goal is to run
    thousands of container images from the same OS tree. However, it also
    has a number of other usecases, for example thin client systems, which
    can boot the same NFS share a number of times. Furthermore this
    mechanism is useful to implement very simple OS installers, that
    simply unserialize a /usr snapshot into a file system,
    install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless
    systems: if the underlying storage technology can cryptographically
    ensure that the vendor-supplied OS is trusted and in a consistent
    state, then it must be made sure that /etc or /var
    are either included in the OS image, or simply unnecessary for booting.

Concepts

A number of Linux-based operating systems have tried to implement
some of the schemes described above in one way or
another. Particularly interesting are GNOME’s OSTree, CoreOS and Google’s Android and
ChromeOS. They generally found different solutions for the specific
problems you have when implementing schemes like this, sometimes taking
shortcuts that keep only the specific case in mind, and cannot cover
the general purpose. With systemd now being at the core of so many
distributions and deeply involved in bringing up and maintaining the
system we came to the conclusion that we should attempt to add generic
support for setups like this to systemd itself, to open this up for
the general purpose distributions to build on. We decided to focus on
three kinds of systems:

  1. The stateful system, the traditional system as we know it with
    machine-specific /etc, /usr and /var, all
    properly populated.
  2. Startup without a populated /var, but with configured
    /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will
    call these stateless systems.)

A factory reset is just a special case of the latter two modes,
where the system boots up without /var and /etc but
the next boot is a normal stateful boot like the first described
mode. Note that a mode where /etc is flushed, but
/var is not is nothing we intend to cover (why? well, the
user ID question becomes much harder, see below, and we simply saw no
usecase for it worth the trouble).
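
To make the vocabulary concrete, here is a minimal Python sketch that maps the state of /etc and /var at boot to the three modes listed above. The function names classify_boot and is_populated are hypothetical and exist only for this illustration; systemd does not expose anything by these names.

```python
import os


def is_populated(path: str) -> bool:
    """Return True if the directory exists and holds at least one entry."""
    try:
        with os.scandir(path) as entries:
            return any(True for _ in entries)
    except (FileNotFoundError, NotADirectoryError):
        return False


def classify_boot(etc_populated: bool, var_populated: bool) -> str:
    """Name the boot mode for a given state of /etc and /var."""
    if etc_populated and var_populated:
        return "stateful"     # traditional system
    if etc_populated:
        return "volatile"     # configured /etc, unpopulated /var
    if not var_populated:
        return "stateless"    # neither /etc nor /var
    return "unsupported"      # /etc flushed but /var kept: not covered
```

On a running machine one could call classify_boot(is_populated("/etc"), is_populated("/var")). A factory reset then corresponds to a single boot in the volatile or stateless mode, followed by stateful boots.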

Problems

Booting up a system without a populated /var is relatively
straight-forward. With a
few lines of tmpfiles configuration
it is possible to populate
/var with its basic structure in a way that is sufficient to
make a system boot cleanly. systemd version 214 and newer ship with
support for this. Of course, support for this scheme in systemd is
only a small part of the solution. While a lot of software
reconstructs the directory hierarchy it needs in /var
automatically, much software does not. In cases like this it is
necessary to ship a couple of additional tmpfiles lines that set up
at boot-time the necessary files or directories in /var to
make the software operate, similar to what RPM or DEB packages would
set up at installation time.
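
As an illustration of what such lines do, the following Python sketch mimics the effect of tmpfiles entries of the directory type: each entry names a path and a mode, and the directory is created only if it is missing. This is not the systemd-tmpfiles implementation. The paths, the reduced three-field line format (ownership is left out so the sketch runs unprivileged) and the function name apply_lines are assumptions made for the example.

```python
import os

# Hypothetical entries in the spirit of tmpfiles.d "d" lines:
# type, path, mode. A real line carries further fields (owner, group, age).
LINES = """\
# state and log directories for a hypothetical package "myapp"
d /var/lib/myapp 0750
d /var/log/myapp 0755
"""


def apply_lines(text: str, root: str = "/") -> list:
    """Create each listed directory below root if it is missing.

    Returns the list of paths that were created. Directories that already
    exist are left untouched, so running this on every boot is harmless.
    """
    created = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3 or fields[0] != "d":
            continue  # only the directory type is sketched here
        path, mode = fields[1], fields[2]
        target = os.path.join(root, path.lstrip("/"))
        if not os.path.isdir(target):
            os.makedirs(target, exist_ok=True)
            os.chmod(target, int(mode, 8))  # set the mode regardless of umask
            created.append(target)
    return created
```

Passing a scratch directory as root makes it possible to try this without touching the real /var. The important property is idempotence: a second run creates nothing.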

Booting up a system without a populated /etc is a more
difficult task. In /etc we have a lot of configuration bits
that are essential for the system to operate, for example and most
importantly system user and group information in /etc/passwd
and /etc/group. If the system boots up without /etc
there must be a way to replicate the minimal information necessary in
it, so that the system manages to boot up fully.

To make this even more complex, in order to support “offline”
updates of /usr that are replicated into a number of systems
possessing private /etc and /var there needs to be a
way for these directories to be upgraded transparently when
necessary, for example by recreating caches like
/etc/ld.so.cache or adding missing system users to
/etc/passwd on next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will
ship with a number of features in systemd that make /etc-less
boots functional:

  • A new tool systemd-sysusers has been added. It introduces
    a new drop-in directory /usr/lib/sysusers.d/. Minimal
    descriptions of necessary system users and groups can be placed
    there. Whenever the tool is invoked it will create these users in
    /etc/passwd and /etc/group should they be
    missing. It is only suitable for creating system users and groups, not
    for normal users. It will write to the files directly via the
    appropriate glibc APIs, which is the right thing to do for system
    users. (For normal users no such APIs exist, as the users might be
    stored centrally on LDAP or suchlike, and they are out of focus for
    our usecase.) The major benefit of this tool is that system user
    definition can happen offline: a package simply has to drop in a new
    file to register a user. This makes system user registration
    declarative instead of imperative, the latter being the way
    system users are traditionally created from RPM or DEB
    installation scripts. By being declarative it is easy to replicate the
    users on next boot to a number of system instances.

    To make this new
    tool interesting for packaging scripts we make it easy to
    alternatively invoke it during package installation time, thus being a
    good alternative to invocations of useradd -r and
    groupadd -r.

    Some OS designs use a static, fixed user/group list stored in
    /usr as primary database for users/groups, with fixed
    UID/GID mappings. While this works for specific systems, this cannot
    cover the general purpose. As the UID/GID range for system
    users/groups is very small (only containing 998 users and groups on most systems), the
    best has to be made from this space and only UIDs/GIDs necessary on
    the specific system should be allocated. This means allocation has to
    be dynamic and adjust to what is necessary.

    Also note that this tool has
    one very nice feature: in addition to fully dynamic, and fully static
    UID/GID assignment for the users to create, it supports reading
    UID/GID numbers off existing files in /usr, so that vendors
    can make use of setuid/setgid binaries owned by specific users.

  • We also added a default
    user definition list
    which creates the most basic users the system
    and systemd need. Of course, very likely downstream distributions
    might need to alter this default list, add new entries and possibly
    map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been
    added. With this mechanism it is possible to conditionalize execution
    of services depending on whether /usr is newer than
    /etc or /var. The idea is that various services that
    need to be added into the boot process on upgrades make use of this to
    not delay boot-ups on normal boots, but run as necessary should
    /usr have been updated since the last boot. This is
    implemented based on the mtime timestamp of
    /usr: if the OS has been updated the packaging software
    should touch the directory, thus informing all instances that
    an upgrade of /etc and /var might be necessary.
  • We added a number of service files, that make use of the new
    ConditionNeedsUpdate= switch, and run a couple of services
    after each update. Among them are the aforementioned
    systemd-sysusers tool, as well as services that rebuild the
    udev hardware database, the journal catalog database and the library
    cache in /etc/ld.so.cache.
  • If systemd detects an empty /etc at early boot it will
    now use the unit
    preset
    information to enable all services by default that the
    vendor or packager declared. It will then proceed booting.
  • We added a
    new tmpfiles snippet
    that is able to reconstruct the
    most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability to copy entire directory trees into
    place should they be missing. This is particularly useful for copying
    certain essential files or directories into /etc without
    which the system refuses to boot. Currently the most prominent
    candidates for this are /etc/pam.d and
    /etc/dbus-1. In the long run we hope that packages can be
    fixed so that they always work correctly without configuration in
    /etc. Depending on the software this means that they should
    come with compiled-in defaults that just work should their
    configuration file be missing, or that they should fall back to static
    vendor-supplied configuration in /usr that is used whenever
    /etc doesn’t have any configuration. Both the PAM and the
    D-Bus case are probably candidates for the latter. Given that there
    are probably many cases like this we are working with a number of
    folks to introduce a new directory called /usr/share/etc
    (name is not settled yet) to major distributions, that always
    contains the full, original, vendor-supplied configuration of all
    packages. This is very useful here, so that there’s an obvious place
    to copy the original configuration from, but it is also useful
    completely independently as this provides administrators with an easy
    place to diff their own configuration in /etc
    against to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn
    to make testing of systems with unpopulated /etc and
    /var easy. For example, to run a fully stateless container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in
    /srv/mycontainer in a read-only way, but with a (writable)
    tmpfs mounted to /var and /etc. With a very recent
    git snapshot of systemd invoking a Fedora rawhide system should mostly
    work OK, modulo the D-Bus and PAM problems mentioned above. A later
    version of systemd-nspawn is likely to gain a high-level
    switch --mode={stateful|volatile|stateless} that
    combines this into a simple switch reusing the vocabulary introduced
    earlier.
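
The mtime comparison behind ConditionNeedsUpdate=, described in the list above, can be sketched in a few lines of Python. This is only an illustration of the idea, not systemd's code: the function names are made up, and the stamp file that records the last update is an assumption of this sketch (here it is simply whatever path the caller passes in).

```python
import os


def needs_update(usr: str, marker: str) -> bool:
    """Return True if usr was modified after marker, or marker is missing.

    marker is a hypothetical stamp file kept in /etc or /var. Its mtime
    records the state of /usr that this instance was last updated for.
    """
    try:
        marker_mtime = os.stat(marker).st_mtime_ns
    except FileNotFoundError:
        return True  # never updated, e.g. first boot with an empty /etc
    return os.stat(usr).st_mtime_ns > marker_mtime


def mark_updated(usr: str, marker: str) -> None:
    """Record that the update jobs ran by copying usr's mtime to marker."""
    st = os.stat(usr)
    with open(marker, "a"):
        pass  # create the stamp file if it does not exist yet
    os.utime(marker, ns=(st.st_atime_ns, st.st_mtime_ns))
```

The intended flow is: at boot, run the update jobs only if needs_update() is true, then call mark_updated(). When the packaging software touches /usr after an upgrade, the next boot of every instance sharing that tree sees a newer mtime and runs the jobs once.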

What’s Next

Pulling this all together we are very close to making boots with
empty /etc and /var on general purpose Linux
operating systems a reality. Of course, while doing the groundwork in
systemd gets us some distance, there’s a lot of work left. Most
importantly: the majority of Linux packages are simply incompatible
with this scheme the way they are currently set up. They do not work
without configuration in /etc or state directories in
/var; they do not drop system user information in
/usr/lib/sysusers.d. However, we believe it’s our job to do
the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently
very little of this is available in any distribution (if only
because 215 isn’t even released yet). However, this will hopefully
change quickly. As soon as that is accomplished we can start working
on making the other components of the OS work nicely in this
scheme. If you are an upstream developer, please consider making your
software work correctly if /etc and/or /var are not
populated. This means:

  • When you need a state directory in /var and it is missing,
    create it first. If you cannot do that, because you dropped privileges
    or suchlike, please consider dropping in a tmpfiles snippet that
    creates the directory with the right permissions early at boot, should
    it be missing.
  • When you need configuration files in /etc to work
    properly, consider changing your application to work nicely when these
    files are missing, and automatically fall back to either built-in
    defaults, or to static vendor-supplied configuration files shipped in
    /usr, so that administrators can override configuration in
    /etc but if they don’t the default configuration counts.
  • When you need a system user or group, consider dropping in a file
    into /usr/lib/sysusers.d describing the users. (Currently
    documentation on this is minimal, we will provide more docs on this
    shortly.)
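
For the configuration case, the fallback order can be sketched as follows. This is a minimal Python example under stated assumptions: the key=value file format, the directory names passed as defaults, the BUILTIN_DEFAULTS values and the function names are all hypothetical and stand in for whatever your application actually uses.

```python
import os

# Hypothetical compiled-in defaults of an application.
BUILTIN_DEFAULTS = {"port": "8080", "log_level": "info"}


def parse_kv(path: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    result = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def load_config(name: str, etc_dir: str = "/etc",
                vendor_dir: str = "/usr/lib") -> dict:
    """Load settings with the administrator's file taking precedence.

    The first existing file wins: etc_dir, then vendor_dir. Keys the chosen
    file does not set, and the case where no file exists at all, fall back
    to the built-in defaults.
    """
    config = dict(BUILTIN_DEFAULTS)
    for directory in (etc_dir, vendor_dir):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            config.update(parse_kv(path))
            break
    return config
```

With this shape an empty /etc is not an error: the application starts with vendor or built-in settings, and a file dropped into /etc later overrides them.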

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in
    tmpfiles snippets that copy the bare minimum of configuration files to
    make your software work from somewhere in /usr into
    /etc.
  • Consider moving from imperative useradd commands in
    packaging scripts, to declarative sysusers files. Ideally,
    this is shipped upstream too, but if that’s not possible then simply
    adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you
should consult with your distribution whether their packaging policy
even allows that. Currently, most distributions will not, so we have
to work to get this changed first.
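
To show why the declarative form replicates so easily, here is a rough Python model of the idea: a list of declared system users is compared against the existing user database, and only the missing ones are added, with either a fixed or a dynamically chosen UID. This is not how systemd-sysusers is implemented. The simplified line format (type u, a name, and a number or "-" for automatic), the UID range of 1 to 998 (picked to match the count mentioned earlier) and the top-down allocation order are assumptions of this sketch.

```python
def parse_declarations(text: str) -> list:
    """Parse lines like: u NAME ID, where ID is a number or '-' (automatic)."""
    decls = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] != "u" or len(fields) < 2:
            continue  # only user lines are modelled here
        wanted = fields[2] if len(fields) > 2 else "-"
        decls.append((fields[1], None if wanted == "-" else int(wanted)))
    return decls


def ensure_users(existing: dict, decls: list, lo: int = 1, hi: int = 998) -> dict:
    """Return a copy of existing (name -> uid) with missing users added."""
    users = dict(existing)
    taken = set(users.values())
    # First pass: honour fixed UIDs so dynamic allocation cannot steal them.
    for name, uid in decls:
        if name in users or uid is None:
            continue
        if uid in taken:
            raise ValueError(f"UID {uid} for {name} is already in use")
        users[name] = uid
        taken.add(uid)
    # Second pass: give the remaining users the highest free UID in range.
    for name, uid in decls:
        if name in users:
            continue
        free = next((u for u in range(hi, lo - 1, -1) if u not in taken), None)
        if free is None:
            raise RuntimeError("system UID range exhausted")
        users[name] = free
        taken.add(free)
    return users
```

Because the function only adds what is missing, running it again with the same declarations changes nothing, which is what makes it safe to run on every instance after an update of /usr.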

Anyway, so much about what we have been working on and where we want to take this.

Conclusion

Before we finish, let me stress again why we are doing all
this:

  1. For end-user machines like desktops, tablets or mobile phones, we
    want a generic way to implement factory reset, which the user can make
    use of when the system is broken (saves you support costs), or when he
    wants to sell it and get rid of his private data, and renew that “fresh
    car smell”.
  2. For embedded machines we want a generic way to reset
    devices. We also want a way for every single boot to be identical to
    a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in
    /usr so that it can be strictly read-only, and fully
    cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply
    deserialize a vendor OS /usr snapshot into a new file system,
    install a boot loader and reboot, leaving all first-time configuration
    to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and
    manage a number of vendor OS /usr snapshots in verified states, and
    which can then update /etc and /var simply by
    rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single
    golden master /usr tree with a large number of instances that
    simply maintain their own private /etc and /var for
    their private configuration and state, while still allowing clean
    updates of /usr.
  7. We want to make thin clients that share /usr across the
    network work by allowing stateless bootups. During all discussions on
    how /usr was to be organized this was frequently mentioned. A
    setup like this so far only worked in very specific cases; with this
    scheme we want to make this work in the general case.

Of course, we have no illusions, just doing the groundwork for all
of this in systemd doesn’t make this all a real-life solution
yet. Also, it’s very unlikely that all of Fedora (or any other general
purpose distribution) will support this scheme for all its packages
soon, however, we are quite confident that the idea is convincing,
that we need to start somewhere, and that getting the most core
packages adapted to this shouldn’t be out of reach.

Oh, and of course, the concepts behind this are really not new, we
know that. However, what’s new here is that we try to make them
available in a general purpose OS core, instead of special purpose
systems.

Anyway, let’s get the ball rolling! Let’s make stateless systems a
reality!

And that’s all I have for now. I am sure this leaves a lot of
questions open. If you have any, join us on IRC on #systemd
on freenode or comment on Google+.