Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/stateless.html
(Just a small heads-up: I don’t blog as much as I used to, I
nowadays update my Google+
page a lot more frequently. You might want to subscribe to that if
you are interested in more frequent technical updates on what we are
working on.)
In the past weeks we have been working on a couple of features for
systemd
that enable a number of new use cases I’d like to shed some light
on. Taking advantage of the /usr
merge that a number of distributions have completed, we want to
bring the runtime behaviour of Linux systems to the next level. With the
/usr merge completed, most static vendor-supplied OS data is
found exclusively in /usr; only a few additional bits in
/var and /etc are necessary to make a system
boot. We can build on this to enable a couple of new features:
- A mechanism we call Factory Reset shall flush out
/etc and /var, but keep the vendor-supplied
/usr, bringing the system back into a well-defined, pristine
vendor state with no local state or configuration. This functionality
is useful across the board, from servers to desktops to embedded
devices.
- A Stateless System goes one step further: a system like
this never stores /etc or /var on persistent
storage, but always comes up with pristine vendor state. On systems
like this every reboot acts as a factory reset. This functionality is
particularly useful for simple containers or systems that boot off the
network or read-only media, and receive all configuration they need
during runtime from vendor packages or protocols like DHCP, or are
capable of discovering their parameters automatically from the
available hardware or periphery.
- Reproducible Systems multiply a vendor image into many
containers or systems. Only local configuration or state is stored
per-system, while the vendor operating system is pulled in from the
same, immutable, shared snapshot. Each system hence has its private
/etc and /var for receiving local configuration,
however the OS tree in /usr is pulled in via bind mounts (in
case of containers) or technologies like NFS (in case of physical
systems), or btrfs snapshots from a golden master image. This is
particularly interesting for containers where the goal is to run
thousands of container images from the same OS tree. However, it also
has a number of other use cases, for example thin client systems, which
can boot the same NFS share a number of times. Furthermore this
mechanism is useful to implement very simple OS installers that
simply deserialize a /usr snapshot into a file system,
install a boot loader, and reboot.
- Verifiable Systems are closely related to stateless
systems: if the underlying storage technology can cryptographically
ensure that the vendor-supplied OS is trusted and in a consistent
state, then it must be ensured that /etc and /var
are either included in the OS image, or simply unnecessary for booting.
Concepts
A number of Linux-based operating systems have tried to implement
some of the schemes described above in one way or
another. Particularly interesting are GNOME’s OSTree, CoreOS and Google’s Android and
ChromeOS. They generally found different solutions for the specific
problems you have when implementing schemes like this, sometimes taking
shortcuts that keep only the specific case in mind and cannot cover
the general case. With systemd now being at the core of so many
distributions and deeply involved in bringing up and maintaining the
system, we came to the conclusion that we should attempt to add generic
support for setups like this to systemd itself, to open this up for
the general purpose distributions to build on. We decided to focus on
three kinds of systems:
- The stateful system, the traditional system as we know it with
machine-specific /etc, /usr and /var, all
properly populated.
- Startup without a populated /var, but with configured
/etc. (We will call these volatile systems.)
- Startup without either /etc or /var. (We will
call these stateless systems.)
A factory reset is just a special case of the latter two modes,
where the system boots up without /var and /etc but
the next boot is a normal stateful boot like the first described
mode. Note that a mode where /etc is flushed but
/var is not is something we do not intend to cover (why? well, the
user ID question becomes much harder, see below, and we simply saw no
use case for it worth the trouble).
Problems
Booting up a system without a populated /var is relatively
straightforward. With a
few lines of tmpfiles configuration it is possible to populate
/var with its basic structure in a way that is sufficient to
make a system boot cleanly. systemd version 214 and newer ship with
support for this. Of course, support for this scheme in systemd is
only a small part of the solution. While a lot of software
reconstructs the directory hierarchy it needs in /var
automatically, much software does not. In cases like this it is
necessary to ship a couple of additional tmpfiles lines that set up
at boot time the necessary files or directories in /var to
make the software operate, similar to what RPM or DEB packages would
set up at installation time.
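To illustrate the idea, here is a minimal Python sketch of what such boot-time lines amount to: a declarative list of directories that is applied idempotently, creating only what is missing. This is not how systemd-tmpfiles is implemented; the function name populate, the myapp paths and the modes are all made up for the example.

```python
import os
import tempfile

# Hypothetical declarative list: (path relative to the root, mode).
# The "myapp" entries are illustrative and not taken from any real package.
VAR_DIRS = [
    ("var/lib/myapp", 0o750),
    ("var/cache/myapp", 0o755),
    ("var/log/myapp", 0o755),
]


def populate(root, entries):
    """Create every listed directory that is missing; leave existing ones alone.

    Returns the relative paths that were actually created.
    """
    created = []
    for rel, mode in entries:
        path = os.path.join(root, rel)
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            # makedirs is subject to the umask, so set the final mode explicitly.
            os.chmod(path, mode)
            created.append(rel)
    return created


if __name__ == "__main__":
    # Use a scratch directory as a stand-in for an empty root file system.
    scratch = tempfile.mkdtemp()
    print(populate(scratch, VAR_DIRS))  # first "boot": directories get created
    print(populate(scratch, VAR_DIRS))  # later "boots": nothing left to do
```

Because the list is data rather than a script, running it on every boot is harmless: on a populated /var it simply does nothing.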
Booting up a system without a populated /etc is a more
difficult task. In /etc we have a lot of configuration bits
that are essential for the system to operate, for example and most
importantly system user and group information in /etc/passwd
and /etc/group. If the system boots up without /etc
there must be a way to replicate the minimal information necessary in
it, so that the system manages to boot up fully.
To make this even more complex, in order to support “offline”
updates of /usr that are replicated into a number of systems
possessing private /etc and /var, there needs to be a
way for these directories to be upgraded transparently when
necessary, for example by recreating caches like
/etc/ld.so.cache or adding missing system users to
/etc/passwd on next reboot.
Starting with systemd 215 (yet unreleased, as I type this) we will
ship with a number of features in systemd that make /etc-less
boots functional:
- A new tool systemd-sysusers has been added. It introduces
a new drop-in directory /usr/lib/sysusers.d/. Minimal
descriptions of necessary system users and groups can be placed
there. Whenever the tool is invoked it will create these users in
/etc/passwd and /etc/group should they be
missing. It is only suitable for creating system users and groups, not
for normal users. It will write to the files directly via the
appropriate glibc APIs, which is the right thing to do for system
users. (For normal users no such APIs exist, as the users might be
stored centrally on LDAP or suchlike, and they are out of focus for
our use case.) The major benefit of this tool is that system user
definition can happen offline: a package simply has to drop in a new
file to register a user. This makes system user registration
declarative instead of imperative, the latter being the way
system users are traditionally created from RPM or DEB
installation scripts. By being declarative it is easy to replicate the
users on next boot to a number of system instances.
To make this new
tool interesting for packaging scripts we make it easy to
alternatively invoke it during package installation time, thus being a
good alternative to invocations of useradd -r and
groupadd -r.
Some OS designs use a static, fixed user/group list stored in
/usr as primary database for users/groups, with fixed
UID/GID mappings. While this works for specific systems, this cannot
cover the general case. As the UID/GID range for system
users/groups is very small (only containing 998 users and groups on most systems), the
best has to be made of this space and only UIDs/GIDs necessary on
the specific system should be allocated. This means allocation has to
be dynamic and adjust to what is necessary.
Also note that this tool has
one very nice feature: in addition to fully dynamic and fully static
UID/GID assignment for the users to create, it supports reading
UID/GID numbers off existing files in /usr, so that vendors
can make use of setuid/setgid binaries owned by specific users.
- We also added a default
user definition list which creates the most basic users the system
and systemd need. Of course, very likely downstream distributions
might need to alter this default list, add new entries and possibly
map specific users to particular numeric UIDs.
- A new condition ConditionNeedsUpdate= has been
added. With this mechanism it is possible to conditionalize execution
of services depending on whether /usr is newer than
/etc or /var. The idea is that various services that
need to be added into the boot process on upgrades make use of this to
not delay boot-ups on normal boots, but run as necessary should
/usr have been updated since the last boot. This is
implemented based on the mtime timestamp of
/usr: if the OS has been updated the packaging software
should touch the directory, thus informing all instances that
an upgrade of /etc and /var might be necessary.
- We added a number of service files that make use of the new
ConditionNeedsUpdate= switch, and run a couple of services
after each update. Among them are the aforementioned
systemd-sysusers tool, as well as services that rebuild the
udev hardware database, the journal catalog database and the library
cache in /etc/ld.so.cache.
- If systemd detects an empty /etc at early boot it will
now use the unit
preset information to enable all services by default that the
vendor or packager declared. It will then proceed booting.
- We added a
new tmpfiles snippet that is able to reconstruct the
most basic structure of /etc if it is missing.
- tmpfiles also gained the ability to copy entire directory trees into
place should they be missing. This is particularly useful for copying
certain essential files or directories into /etc without
which the system refuses to boot. Currently the most prominent
candidates for this are /etc/pam.d and
/etc/dbus-1. In the long run we hope that packages can be
fixed so that they always work correctly without configuration in
/etc. Depending on the software this means that they should
come with compiled-in defaults that just work should their
configuration file be missing, or that they should fall back to static
vendor-supplied configuration in /usr that is used whenever
/etc doesn’t have any configuration. Both the PAM and the
D-Bus case are probably candidates for the latter. Given that there
are probably many cases like this we are working with a number of
folks to introduce a new directory called /usr/share/etc
(name is not settled yet) to major distributions, that always
contains the full, original, vendor-supplied configuration of all
packages. This is very useful here, so that there’s an obvious place
to copy the original configuration from, but it is also useful
completely independently as this provides administrators with an easy
place to diff their own configuration in /etc
against to see what local changes are in place.
- We added a new --tmpfs= switch to systemd-nspawn
to make testing of systems with unpopulated /etc and
/var easy. For example, to run a fully stateless container, use a command line like this:
# systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b
This command line will invoke the container tree stored in
/srv/mycontainer in a read-only way, but with a (writable)
tmpfs mounted to /var and /etc. With a very recent
git snapshot of systemd invoking a Fedora rawhide system should mostly
work OK, modulo the D-Bus and PAM problems mentioned above. A later
version of systemd-nspawn is likely to gain a high-level
switch --mode={stateful|volatile|stateless} that
combines this into simple switches reusing the vocabulary introduced
earlier.
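The mtime-based check behind ConditionNeedsUpdate= can be sketched roughly as follows. This is an illustration in Python, not systemd's actual code: it assumes a per-directory stamp file (called .updated here purely as an assumption of the sketch) that remembers which /usr timestamp the directory was last brought in sync with, and the function names needs_update and mark_updated are hypothetical.

```python
import os
import tempfile

STAMP = ".updated"  # assumed stamp-file name, used for illustration only


def needs_update(usr, target):
    """Return True if `target` (think /etc or /var) is older than `usr`.

    A missing stamp file counts as "never updated".
    """
    usr_mtime = os.stat(usr).st_mtime_ns
    try:
        stamp_mtime = os.stat(os.path.join(target, STAMP)).st_mtime_ns
    except FileNotFoundError:
        return True
    return usr_mtime > stamp_mtime


def mark_updated(usr, target):
    """Record that `target` is in sync with `usr` by copying the /usr mtime
    onto the stamp file."""
    stamp = os.path.join(target, STAMP)
    with open(stamp, "a"):
        pass  # create the stamp if it does not exist yet
    st = os.stat(usr)
    os.utime(stamp, ns=(st.st_atime_ns, st.st_mtime_ns))


if __name__ == "__main__":
    # Scratch directories standing in for /usr and /etc.
    root = tempfile.mkdtemp()
    usr, etc = os.path.join(root, "usr"), os.path.join(root, "etc")
    os.mkdir(usr)
    os.mkdir(etc)
    print(needs_update(usr, etc))  # no stamp yet, so update services would run
    mark_updated(usr, etc)
    print(needs_update(usr, etc))  # in sync, so normal boots are not delayed
```

The point of the design is that a packaging tool only has to touch /usr once after an offline update; every instance sharing that tree then notices on its own next boot that its private /etc and /var are behind.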
What’s Next
Pulling this all together we are very close to making boots with
empty /etc and /var on general purpose Linux
operating systems a reality. Of course, while doing the groundwork in
systemd gets us some distance, there’s a lot of work left. Most
importantly: the majority of Linux packages are simply incompatible
with this scheme the way they are currently set up. They do not work
without configuration in /etc or state directories in
/var; they do not drop system user information in
/usr/lib/sysusers.d. However, we believe it’s our job to do
the groundwork, and to start somewhere.
So what does this mean for the next steps? Of course, currently
very little of this is available in any distribution (if only
because 215 isn’t even released yet). However, this will hopefully
change quickly. As soon as that is accomplished we can start working
on making the other components of the OS work nicely in this
scheme. If you are an upstream developer, please consider making your
software work correctly if /etc and/or /var are not
populated. This means:
- When you need a state directory in /var and it is missing,
create it first. If you cannot do that, because you dropped privileges
or suchlike, please consider dropping in a tmpfiles snippet that
creates the directory with the right permissions early at boot, should
it be missing.
- When you need configuration files in /etc to work
properly, consider changing your application to work nicely when these
files are missing, and automatically fall back to either built-in
defaults, or to static vendor-supplied configuration files shipped in
/usr, so that administrators can override configuration in
/etc but if they don’t the default configuration counts.
- When you need a system user or group, consider dropping in a file
into /usr/lib/sysusers.d describing the users. (Currently
documentation on this is minimal, we will provide more docs on this
shortly.)
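The configuration fallback described in the second item could look roughly like the following Python sketch. It is only one possible arrangement: the file name myapp.conf, the key=value format, the vendor directory and the built-in defaults are all assumptions made for the example, not conventions prescribed by systemd.

```python
import os
import tempfile

# Hypothetical compiled-in defaults of an imaginary application.
BUILTIN_DEFAULTS = {"port": "8080", "loglevel": "info"}


def parse(path):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def load_config(name, etc_dir="/etc", vendor_dir="/usr/lib"):
    """Start from built-in defaults, then apply the first file found:
    the administrator's copy in etc_dir wins over the vendor copy."""
    config = dict(BUILTIN_DEFAULTS)
    for directory in (etc_dir, vendor_dir):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            config.update(parse(path))
            break
    return config


if __name__ == "__main__":
    # With empty stand-ins for /etc and /usr the built-in defaults apply.
    empty_etc = tempfile.mkdtemp()
    empty_usr = tempfile.mkdtemp()
    print(load_config("myapp.conf", empty_etc, empty_usr))
```

With a lookup order like this an application keeps working when /etc is empty, and an administrator who wants different behaviour only has to drop a file into /etc.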
If you are a packager, you can also help make this all work:
- Ask upstream to implement what we describe above, possibly even preparing a patch for this.
- If upstream will not make these changes, then consider dropping in
tmpfiles snippets that copy the bare minimum of configuration files to
make your software work from somewhere in /usr into
/etc.
- Consider moving from imperative useradd commands in
packaging scripts to declarative sysusers files. Ideally,
this is shipped upstream too, but if that’s not possible then simply
adding this to packages should be good enough.
Of course, before moving to declarative system user definitions you
should consult with your distribution whether their packaging policy
even allows that. Currently, most distributions will not, so we have
to work to get this changed first.
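To make the difference between imperative and declarative user creation more concrete, here is a small Python sketch of the declarative approach: a list of wanted system users is compared against what already exists, and only the missing ones are added, with either a fixed or a dynamically chosen UID. This is a toy model and not the systemd-sysusers implementation; the range bounds, the top-down allocation order and all names are assumptions of the sketch, and it leaves out groups as well as the option of taking the UID from a file's owner.

```python
# Assumed bounds of the system UID range; real distributions configure this.
SYSTEM_UID_MIN = 1
SYSTEM_UID_MAX = 998


def ensure_system_users(declared, existing):
    """Return a name -> UID map with `existing` plus any missing declared users.

    declared: list of (name, uid_or_None); None requests dynamic allocation.
    existing: dict name -> uid, standing in for the contents of /etc/passwd.
    """
    result = dict(existing)
    used = set(result.values())

    # Static assignments first, so dynamic allocation cannot take their UIDs.
    for name, uid in declared:
        if name in result or uid is None:
            continue
        if uid in used:
            raise ValueError(f"UID {uid} requested for {name} is already taken")
        result[name] = uid
        used.add(uid)

    # Dynamic assignments: walk down from the top of the range (an assumption).
    candidate = SYSTEM_UID_MAX
    for name, uid in declared:
        if name in result:
            continue
        while candidate in used:
            candidate -= 1
        if candidate < SYSTEM_UID_MIN:
            raise RuntimeError("system UID range exhausted")
        result[name] = candidate
        used.add(candidate)
    return result


if __name__ == "__main__":
    existing = {"root": 0, "daemon": 998}
    declared = [("httpd", 440), ("authd", None), ("daemon", None)]
    print(ensure_system_users(declared, existing))
```

Applying the same declaration twice changes nothing the second time, which is what makes it safe to run on every boot of every instance that shares the same /usr.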
Anyway, so much for what we have been working on and where we want to take this.
Conclusion
Before we finish, let me stress again why we are doing all
this:
- For end-user machines like desktops, tablets or mobile phones, we
want a generic way to implement factory reset, which the user can make
use of when the system is broken (saves you support costs), or when he
wants to sell it and get rid of his private data, and renew that “fresh
car smell”.
- For embedded machines we want a generic way to reset
devices. We also want a way for every single boot to be identical to
a factory reset, in a stateless system design.
- For all kinds of systems we want to centralize vendor data in
/usr so that it can be strictly read-only, and fully
cryptographically verified as one unit.
- We want to enable new kinds of OS installers that simply
deserialize a vendor OS /usr snapshot into a new file system,
install a boot loader and reboot, leaving all first-time configuration
to the next boot.
- We want to enable new kinds of OS updaters that build on this, and
manage a number of vendor OS /usr snapshots in verified states, and
which can then update /etc and /var simply by
rebooting into a newer version.
- We want to scale container setups naturally, by sharing a single
golden master /usr tree with a large number of instances that
simply maintain their own private /etc and /var for
their private configuration and state, while still allowing clean
updates of /usr.
- We want to make thin clients that share /usr across the
network work by allowing stateless bootups. During all discussions on
how /usr was to be organized this was frequently mentioned. A
setup like this so far only worked in very specific cases; with this
scheme we want to make this work in the general case.
Of course, we have no illusions, just doing the groundwork for all
of this in systemd doesn’t make this all a real-life solution
yet. Also, it’s very unlikely that all of Fedora (or any other general
purpose distribution) will support this scheme for all its packages
soon, however, we are quite confident that the idea is convincing,
that we need to start somewhere, and that getting the most core
packages adapted to this shouldn’t be out of reach.
Oh, and of course, the concepts behind this are really not new, we
know that. However, what’s new here is that we try to make them
available in a general purpose OS core, instead of special purpose
systems.
Anyway, let’s get the ball rolling! Let’s make stateless systems a
reality!
And that’s all I have for now. I am sure this leaves a lot of
questions open. If you have any, join us on IRC on #systemd
on freenode or comment on Google+.