
REMINDER! systemd.conf 2016 CfP Ends in Two Weeks!

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/reminder-systemdconf-2016-cfp-ends-in-two-weeks.html

Please note that the systemd.conf 2016
Call for Participation ends in less than two weeks, on Aug. 1st!
Please send in your talk proposal by then! We’ve already got a good
number of excellent submissions, but we are interested in yours even
more!

We are looking for talks on all facets of systemd: deployment,
maintenance, administration, development. Regardless of whether you
use it in the cloud, on embedded, on IoT, on the desktop, on mobile,
in a container or on the server: we are interested in your
submissions!

In addition to proposals for talks for the main conference, we are
looking for proposals for workshop sessions held during our
Workshop Day (the first day of the conference). The workshop format
consists of a day of 2-3h training sessions that may cover any
systemd-related topic you’d like. We are interested in submissions
both from the developer community and from organizations making use
of systemd! Introductory workshop sessions
are particularly welcome, as the Workshop Day is intended to open up
our conference to newcomers and people who aren’t systemd gurus yet,
but would like to become more fluent.

For further details on the submissions we are looking for and the CfP
process, please consult the CfP page and submit your proposal using
the provided form!

And keep in mind:

REMINDER: Please sign up for the conference soon! Only a
limited number of tickets are available, hence make sure to secure
yours quickly before they run out! (Last year we sold out.) Please
sign up here for the
conference!

AND OF COURSE: We are also looking for more sponsors for
systemd.conf! If you are working on systemd-related projects, or make
use of it in your company, please consider becoming a sponsor of
systemd.conf 2016!
Without our sponsors we couldn’t organize systemd.conf 2016!

Thank you very much, and see you in Berlin!

CfP is now open

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/cfp-is-now-open.html

The systemd.conf 2016 Call for Participation is Now Open!

We’d like to invite presentation and workshop proposals for systemd.conf 2016!

The conference will consist of three parts:

  • One day of workshops, consisting of in-depth (2-3hr) training and learning-by-doing sessions (Sept. 28th)
  • Two days of regular talks (Sept. 29th-30th)
  • One day of hackfest (Oct. 1st)

We are now accepting submissions for the first three days: proposals
for workshops, training sessions and regular talks. In particular, we
are looking for sessions including, but not limited to, the following
topics:

  • Use Cases: systemd in today’s and tomorrow’s devices and applications
  • systemd and containers, in the cloud and on servers
  • systemd in distributions
  • systemd in embedded devices and IoT
  • systemd on the desktop
  • Networking with systemd
  • … and everything else related to systemd

Please submit your proposals by August 1st, 2016. Notification of acceptance will be sent out 1-2 weeks later.

If submitting a workshop proposal please contact the organizers for more details.

To submit a talk, please visit our CfP submission page.

For further information on systemd.conf 2016, please visit our conference web site.

Introducing sd-event

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/introducing-sd-event.html

The Event Loop API of libsystemd

When we began working on
systemd we built
it around a hand-written ad-hoc event loop, wrapping Linux
epoll
. The more
our project grew the more we realized the limitations of using raw
epoll:

  • As we used
    timerfd
    for our timer events, each event source cost one file descriptor and
    we had many of them! File descriptors are a scarce resource on UNIX,
    as
    RLIMIT_NOFILE
    is typically set to 1024 or similar, limiting the number of
    available file descriptors per process to 1021, which isn’t
    particularly many.

  • Ordering of event dispatching became a nightmare. In many cases, we
    wanted to make sure that a certain kind of event would always be
    dispatched before another kind of event, if both happen at the same
    time. For example, when the last process of a service dies, we might
    be notified about that via a SIGCHLD signal, via an
    sd_notify() “STATUS=”
    message, and via a control group notification. We wanted to get
    these events in the right order, to know when it’s safe to process
    and subsequently release the runtime data systemd keeps about the
    service or process: it shouldn’t be done if there are still events
    about it pending.

  • For each program we added to the systemd project we noticed we were
    adding similar code, over and over again, to work with epoll’s
    complex interfaces. For example, finding the right file descriptor
    and callback function to dispatch an epoll event to, without running
    into invalidated-pointer issues, is outright difficult and requires
    non-trivial code.

  • Integrating child process watching into our event loops was much
    more complex than one could hope, and even more so if child process
    events should be ordered against each other and unrelated kinds of
    events.

Eventually, we started working on
sd-bus. At
the same time we decided to seize the opportunity, put together a
proper event loop API in C, and then not only port sd-bus on top of
it, but also the rest of systemd. The result of this is
sd-event. After
almost two years of development we declared sd-event stable in systemd
version 221, and published it as an official API of libsystemd.

Why?

sd-event.h,
of course, is not the first event loop API around, and it doesn’t
implement any really novel concepts. When we started working on it we
tried to do our homework, and checked the various existing event loop
APIs, both to look for candidates to adopt instead of writing our own,
and to learn about the strengths and weaknesses of the existing
implementations. Ultimately, we found no implementation that
could deliver what we needed, or where it would be easy to add the
missing bits: as usual in the systemd project, we wanted something
that allows us access to all the Linux-specific bits, instead of
limiting itself to the least common denominator of UNIX. We weren’t
looking for an abstraction API, but simply one that makes epoll usable
in system code.

With this blog story I’d like to take the opportunity to introduce you
to sd-event, and explain why it might be a good candidate to adopt as
event loop implementation in your project, too.

So, here are some features it provides:

  • I/O event sources, based on epoll’s file descriptor watching,
    including edge triggered events (EPOLLET). See
    sd_event_add_io(3).

  • Timer event sources, based on timerfd_create(), supporting the
    CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME clocks, as well
    as the CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM clocks that
    can resume the system from suspend. When creating timer events a
    required accuracy parameter may be specified which allows coalescing
    of timer events to minimize power consumption (a sketch of this
    follows after this list). For each clock only a
    single timer file descriptor is kept, and all timer events are
    multiplexed with a priority queue. See
    sd_event_add_time(3).

  • UNIX process signal events, based on
    signalfd(2),
    including full support for real-time signals, and queued
    parameters. See sd_event_add_signal(3).

  • Child process state change events, based on
    waitid(2). See
    sd_event_add_child(3).

  • Static event sources, of three types: defer, post and exit, for
    invoking calls in each event loop iteration, after other event
    sources or at event loop termination. See
    sd_event_add_defer(3).

  • Event sources may be assigned a 64-bit priority value that controls
    the order in which event sources are dispatched if multiple are
    pending simultaneously. See
    sd_event_source_set_priority(3).

  • The event loop may automatically send watchdog notification messages
    to the service manager. See sd_event_set_watchdog(3).

  • The event loop may be integrated into foreign event loops, such as
    the GLib one. The event loop API is hence composable, the same way
    the underlying epoll logic is. See
    sd_event_get_fd(3)
    for an example; a short sketch of this pattern is also shown below.

  • The API is fully OOM safe.

  • A complete set of documentation in UNIX man page format is
    available, with
    sd-event(3)
    as the entry page.

  • It’s pretty widely available, and requires no extra
    dependencies. Since systemd is built on it, most major distributions
    ship the library in their default install set.

  • After two years of development, and after being used in all of
    systemd’s components, it has received a fair share of testing already,
    even though we only recently decided to declare it stable and turned
    it into a public API.
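
As promised above, here is a minimal sketch of a timer event source making
use of the accuracy parameter for coalescing, and of an explicit dispatch
priority. This is not from the systemd sources; the handler name and the
10s interval / 1s accuracy values are arbitrary choices for illustration:

#include <inttypes.h>
#include <stdio.h>
#include <time.h>

#include <systemd/sd-event.h>

static int timer_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        printf("Timer elapsed at %" PRIu64 " µs\n", usec);

        /* Timer sources fire once and are then disabled; re-arm for ~10s later */
        sd_event_source_set_time(s, usec + 10 * 1000000ULL);
        sd_event_source_set_enabled(s, SD_EVENT_ONESHOT);
        return 0;
}

int main(void) {
        sd_event *event = NULL;
        sd_event_source *timer = NULL;
        uint64_t now;
        int r;

        r = sd_event_default(&event);
        if (r < 0)
                return 1;

        /* Current CLOCK_MONOTONIC time, in µs */
        r = sd_event_now(event, CLOCK_MONOTONIC, &now);
        if (r < 0)
                return 1;

        /* Fire in ~10s; the 1s accuracy window lets sd-event coalesce this
         * wakeup with other timers, minimizing power consumption. */
        r = sd_event_add_time(event, &timer, CLOCK_MONOTONIC,
                              now + 10 * 1000000ULL, 1000000ULL,
                              timer_handler, NULL);
        if (r < 0)
                return 1;

        /* Dispatch this source before normal-priority sources when both are pending */
        (void) sd_event_source_set_priority(timer, SD_EVENT_PRIORITY_IMPORTANT);

        r = sd_event_loop(event);

        sd_event_source_unref(timer);
        sd_event_unref(event);
        return r < 0 ? 1 : 0;
}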

Note that sd-event has some potential drawbacks too:

  • If portability is essential to you, sd-event is not your best
    option. sd-event is a wrapper around Linux-specific APIs, and that’s
    visible in the API. For example: our event callbacks receive
    structures defined by Linux-specific APIs such as signalfd.

  • It’s a low-level C API, and it doesn’t isolate you from the OS
    underpinnings. While I like to think that it is relatively nice and
    easy to use from C, it doesn’t compromise on exposing the low-level
    functionality. It just fills the gaps in what’s missing between
    epoll, timerfd, signalfd and related concepts, and it does not hide
    that away.
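
To illustrate the composability point from the feature list above, here is a
rough sketch of driving an sd-event loop from an outer poll()-based loop,
following the pattern documented in sd_event_get_fd(3) (the function name
drive_sd_event_once is just an illustrative choice):

#include <poll.h>

#include <systemd/sd-event.h>

/* Run one iteration of an sd-event loop, driven by a foreign outer loop */
static int drive_sd_event_once(sd_event *event) {
        struct pollfd p = {
                .fd = sd_event_get_fd(event),
                .events = POLLIN,
        };

        /* Wait until sd-event's epoll fd reports pending events... */
        if (poll(&p, 1, -1) < 0)
                return -1;

        /* ...then let sd-event dispatch them without blocking (zero timeout) */
        return sd_event_run(event, 0);
}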

Either way, I believe that sd-event is a great choice when looking for
an event loop API, in particular if you work on system-level or
embedded software, where functionality like timer coalescing or
watchdog support matters.

Getting Started

Here’s a short example of how to use sd-event in a simple daemon. In this
example, we’ll not just use sd-event.h, but also sd-daemon.h to
implement a system service.

#include <alloca.h>
#include <endian.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <systemd/sd-daemon.h>
#include <systemd/sd-event.h>

static int io_handler(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
        void *buffer;
        ssize_t n;
        int sz;

        /* UDP enforces a somewhat reasonable maximum datagram size of 64K, so we can just allocate the buffer on the stack */
        if (ioctl(fd, FIONREAD, &sz) < 0)
                return -errno;
        buffer = alloca(sz);

        n = recv(fd, buffer, sz, 0);
        if (n < 0) {
                if (errno == EAGAIN)
                        return 0;

                return -errno;
        }

        if (n == 5 && memcmp(buffer, "EXIT\n", 5) == 0) {
                /* Request a clean exit */
                sd_event_exit(sd_event_source_get_event(es), 0);
                return 0;
        }

        fwrite(buffer, 1, n, stdout);
        fflush(stdout);
        return 0;
}

int main(int argc, char *argv[]) {
        union {
                struct sockaddr_in in;
                struct sockaddr sa;
        } sa;
        sd_event_source *event_source = NULL;
        sd_event *event = NULL;
        int fd = -1, r;
        sigset_t ss;

        r = sd_event_default(&event);
        if (r < 0)
                goto finish;

        if (sigemptyset(&ss) < 0 ||
            sigaddset(&ss, SIGTERM) < 0 ||
            sigaddset(&ss, SIGINT) < 0) {
                r = -errno;
                goto finish;
        }

        /* Block SIGTERM and SIGINT first, so that the event loop can handle them */
        if (sigprocmask(SIG_BLOCK, &ss, NULL) < 0) {
                r = -errno;
                goto finish;
        }

        /* Let's make use of the default handler and "floating" reference features of sd_event_add_signal() */
        r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
        if (r < 0)
                goto finish;
        r = sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
        if (r < 0)
                goto finish;

        /* Enable automatic service watchdog support */
        r = sd_event_set_watchdog(event, true);
        if (r < 0)
                goto finish;

        fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
        if (fd < 0) {
                r = -errno;
                goto finish;
        }

        sa.in = (struct sockaddr_in) {
                .sin_family = AF_INET,
                .sin_port = htobe16(7777),
        };
        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                r = -errno;
                goto finish;
        }

        r = sd_event_add_io(event, &event_source, fd, EPOLLIN, io_handler, NULL);
        if (r < 0)
                goto finish;

        (void) sd_notifyf(false,
                          "READY=1\n"
                          "STATUS=Daemon startup completed, processing events.");

        r = sd_event_loop(event);

finish:
        event_source = sd_event_source_unref(event_source);
        event = sd_event_unref(event);

        if (fd >= 0)
                (void) close(fd);

        if (r < 0)
                fprintf(stderr, "Failure: %s\n", strerror(-r));

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

The example above shows how to write a minimal UDP/IP server that
listens on port 7777. Whenever a datagram is received it outputs its
contents to STDOUT, unless it is precisely the string EXIT\n, in
which case the service exits. The service reacts to SIGTERM and
SIGINT with a clean exit. If it runs under a service manager, it also
notifies the manager about its completed startup, and sends watchdog
keep-alive messages if the manager asked for that.

When run as a systemd service, this service’s STDOUT will be connected
to the logging framework, which means the service can act as a
minimal UDP-based remote logging service.
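
For instance, a matching service unit might look roughly like this (a
sketch only; the binary path is hypothetical). Type=notify corresponds to
the sd_notifyf() call above, and WatchdogSec= is what activates the
watchdog logic that sd_event_set_watchdog() hooks into:

[Unit]
Description=Minimal sd-event based UDP logging service

[Service]
ExecStart=/usr/local/bin/event-example
Type=notify
WatchdogSec=20s

[Install]
WantedBy=multi-user.target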

To compile and link this example, save it as event-example.c, then run:

$ gcc event-example.c -o event-example `pkg-config --cflags --libs libsystemd`

For a first test, simply run the resulting binary from the command
line, and test it against the following netcat command line:

$ nc -u localhost 7777

For the sake of brevity, error checking is minimal, and in a real-world
application should, of course, be more comprehensive. However, it
hopefully gets the idea across of how to write a daemon that reacts to
external events with sd-event.

For further details on the functions used in the example above, please
consult the manual pages:
sd-event(3),
sd_event_exit(3),
sd_event_source_get_event(3),
sd_event_default(3),
sd_event_add_signal(3),
sd_event_set_watchdog(3),
sd_event_add_io(3),
sd_notifyf(3),
sd_event_loop(3),
sd_event_source_unref(3),
sd_event_unref(3).

Conclusion

So, is this the event loop to end all other event loops? Certainly
not. I actually believe in “event loop plurality”. There are many
reasons for that, but most importantly: sd-event is supposed to be an
event loop suitable for writing a wide range of applications, but it’s
definitely not going to solve all event loop problems. For example,
while the priority logic is important for many use cases, it comes with
drawbacks for others: if not used carefully, high-priority event
sources can easily starve low-priority event sources. Also, in order
to implement the priority logic, sd-event needs to linearly iterate
through the event structures returned by
epoll_wait(2)
to sort the events by their priority, resulting in worst case
O(n*log(n)) complexity on each event loop wakeup (for n = number of
file descriptors). Then, to implement priorities fully, sd-event only
dispatches a single event before going back to the kernel and asking
for new events. sd-event will hence not provide the theoretically
best possible scalability to huge numbers of file descriptors. Of
course, this could be optimized, by improving epoll, and making it
support how today’s event loops actually work (after all, this is
the problem set all event loops that implement priorities — including
GLib’s — have to deal with), but even then: the design of sd-event is
focused on running one event loop per thread, and it dispatches events
strictly ordered. In many other important use cases a very different
design is preferable: one where events are distributed to a set of
worker threads and are dispatched out-of-order.

Hence, don’t mistake sd-event for what it isn’t. It’s not supposed to
unify everybody on a single event loop. It’s just supposed to be a
very good implementation of an event loop suitable for a large part of
the typical use cases.

Note that our APIs, including
sd-bus, integrate nicely into
sd-event event loops, but do not require it, and may be integrated
into other event loops too, as long as they support watching for time
and I/O events.

And that’s all for now. If you are considering using sd-event for your
project and need help or have questions, please direct them to the
systemd mailing list.

Second Round of systemd.conf 2015 Sponsors

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/second-round-of-systemdconf-2015-sponsors.html

Second Round of systemd.conf 2015 Sponsors

We are happy to announce the second round of systemd.conf 2015
sponsors! In addition to those from the first announcement, we have:

Our second Gold sponsor is Red Hat!

What began as a better way to build software—openness, transparency, collaboration—soon shifted the balance of power in an entire industry. The revolution of choice continues. Today Red Hat® is the world’s leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux®, and middleware technologies.

A Bronze sponsor is Samsung:

From the beginning we have established a very fast pace and are currently one of the biggest and fastest growing modern-technology R&D centers in East-Central Europe.
We started with designing subsystems for digital satellite television; however, we have quickly expanded the scope of our interest. Currently, it includes advanced systems of digital television, platform convergence, mobile systems, smart solutions, and enterprise solutions.
The quality and certification center also plays a vital role in our activity, controlling the conformity of Samsung Electronics products with the highest standards of quality and reliability.

A Bronze sponsor is travelping:

Travelping is passionate about networks, communications and devices. We empower our customers to deploy and operate networks using our state of the art products, solutions and services.
Our products and solutions are based on our industry-proven physical and virtual appliance platforms. These purpose-built platforms ensure best-in-class performance, scalability and reliability combined with consistent end-to-end management capabilities.
To build these products, Travelping has developed its own embedded, cross-platform Linux distribution called CAROS.io, which incorporates the systemd service manager and tools.

A Bronze sponsor is Collabora:

Collabora has over 10 years of experience working with top tier OEMs & silicon manufacturers worldwide to develop products based on Open Source software. Through the use of Open Source technologies and methodologies, Collabora helps clients in multiple market segments gain faster time to market and save millions of dollars in licensing and maintenance costs. Collabora has already brought to market several products relying on systemd extensively.

A Bronze sponsor is Endocode:

Endocode AG. An employee-owned, software engineering company from Berlin. Open Source is our heart and soul.

A Bronze sponsor is the Linux Foundation:

The Linux Foundation advances the growth of Linux and offers its collaborative principles and practices to any endeavor.

We are Cooperating with LinuxTag e.V. on the organization:

LinuxTag is Europe’s leading organizer of Linux and Open Source events. Born of the community and in business for 20 years, we organize LinuxTag, an annual conference and exhibition attracting thousands of visitors. We also participate and cooperate in organizing workshops, tutorials, seminars, and other events together with and for the Open Source community. Selected events include non-profit workshops, the German Kernel Summit at FrOSCon, participation in the Open Tech Summit, and others. We take care of the organizational framework of systemd.conf 2015. LinuxTag e.V. is a non-profit organization and welcomes donations of ideas and workforce.

A Media Partner is Golem:

Golem.de is an up-to-date online publication intended for professional computer users. It provides technology insights into the IT and telecommunications industry. Golem.de offers profound and up-to-date information on significant and trending topics. Online and IT professionals, marketing managers, purchasers, and readers inspired by technology receive substantial information on product, market and branding potentials through tests, interviews and market analysis.

We’d like to thank our sponsors for their support! Without sponsors our conference would not be possible!

The conference has been SOLD OUT for a few weeks now. We no longer accept registrations or paper submissions.

For further details about systemd.conf consult the conference website.

See the first round of sponsor announcements!

See you in Berlin!

Preliminary systemd.conf 2015 Schedule

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/preliminary-systemdconf-2015-schedule.html

A Preliminary systemd.conf 2015 Schedule is Now Online!

We are happy to announce that an initial, preliminary version of the
systemd.conf 2015 schedule is now
online! (Please ignore that some rows in the schedule link the same
session twice on that page. That’s a bug in the website CMS that we
are working to fix.)

We got an overwhelming number of high-quality submissions during the
CfP! Because there were so many good talks we really wanted to
accept, we decided to do two full days of talks now, leaving one more
day for the hackfest and BoFs. We also shortened many of the slots, to
make room for more. All in all we now have a schedule packed with
fantastic presentations!

The areas covered range from containers, to system provisioning,
stateless systems, distributed init systems, the kdbus IPC, control
groups, systemd on the desktop, systemd in embedded devices,
configuration management and systemd, and systemd in downstream
distributions.

We’d like to thank everybody who submitted a presentation proposal!

Also, don’t forget to register for the conference! Only a limited number of
registrations are available due to space constraints!
Register here!

We are still looking for sponsors. If you’d like to join the ranks of
systemd.conf 2015 sponsors, please have a look at our Becoming a
Sponsor page!

For further details about systemd.conf consult the conference website.

First Round of systemd.conf 2015 Sponsors

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/first-round-of-systemdconf-2015-sponsors.html

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015
sponsors!

Our first Gold sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS’s commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS at https://coreos.com/ and Tectonic at https://tectonic.com/, or follow CoreOS on Twitter: @coreoslinux.

A Silver sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world’s top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to industrial customers. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to low-level ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We’d like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We’ll shortly announce our second round of sponsors, please stay tuned!

If you’d like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on Monday, August 31st! Please make sure to submit your proposals by then on the CfP page!

Also, don’t forget to register for the conference! Only a limited number of
registrations are available due to space constraints!
Register here!

For further details about systemd.conf consult the conference website.

systemd.conf 2015 Call for Presentations

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemdconf-2015-call-for-presentations.html

REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

We’d like to remind you that the systemd.conf 2015 Call for Presentations ends
on August 31st! Please submit your presentation proposals before that date
on our website.

We are specifically interested in submissions from projects and vendors building
today’s and tomorrow’s products, services and devices with systemd. We’d like to
learn about the problems you encounter and the benefits you see! Hence, if
you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution
maintainers of systemd! If you develop or maintain systemd packages in a
distribution, please submit a presentation reporting about the state, future
and the problems of systemd packaging so that we can improve downstream
collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud,
on servers, on the desktop, in mobile and in embedded are highly welcome! Talks
about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations by August 31st!

And don’t forget to register for the conference! Only a limited number of
registrations are available due to space constraints!
Register here!

Also, limited travel and entry fee sponsorship is available for community contributors. Please contact us for details!

For further details about the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

Revisiting How We Put Together Linux Systems

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

In a previous blog story I discussed
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems.
I now want to take the opportunity to explain a bit where we want to
take this with
systemd in the
longer run, and what we want to build out of it. This is going to be a
longer story, so better grab a cold bottle of
Club Mate before you start
reading.

Traditional Linux distributions are built around packaging systems
like RPM or dpkg, and an organization model where upstream developers
and downstream packagers are relatively clearly separated: an upstream
developer writes code, and puts it somewhere online, in a tarball. A
packager then grabs it and turns it into RPMs/DEBs. The user then
grabs these RPMs/DEBs and installs them locally on the system. For a
variety of uses this is a fantastic scheme: users have a large
selection of readily packaged software available, in mostly uniform
packaging, from a single source they can trust. In this scheme the
distribution vets all software it packages, and as long as the user
trusts the distribution all should be good. The distribution takes the
responsibility of ensuring the software is not malicious, of fixing
security problems in a timely manner, and of helping the user if
something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn’t fit
many use-cases of our software particularly well. Let’s have a look at
the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream
    distributions to package their stuff. It’s the downstream
    distribution that decides on schedules, packaging details, and how
    to handle support. Often upstream vendors want much faster release
    cycles than the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to
    impossible. Since the end-user can run a variety of different
    package versions together, and expects the software he runs to just
    work on any combination, the test matrix explodes. If upstream tests
    its version on distribution X release Y, then there’s no guarantee
    that that’s the precise combination of packages that the end user
    will eventually run. In fact, it is very unlikely that the end user
    will, since most distributions probably updated a number of
    libraries the package relies on by the time the package ends up being
    made available to the user. The fact that each package can be
    individually updated by the user, and each user can combine library
    versions, plug-ins and executables relatively freely, results in a high
    risk of something going wrong.

  • Since there are so many different distributions in so many different
    versions around, if upstream tries to build and test software for
    them it needs to do so for a large number of distributions, which is
    a massive effort.

  • The distributions are actually quite different in many ways. In
    fact, they are different in a lot of the most basic
    functionality. For example, the path where x86-64 libraries are
    put differs between Fedora and Debian derived systems.

  • Developing software for a number of distributions and versions is
    hard: if you want to do it, you need to actually install them, each
    one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and
    trademark requirements (and rightly so), any kind of closed source
    software (or otherwise non-free) does not fit into this scheme at
    all.

All this together makes it really hard for many upstreams to work
nicely with the current way Linux works. Often they try to improve
the situation for themselves, for example by bundling libraries, to
make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for
people who want to put together their individual system, nicely
adjusted to exactly what they need. However, this is not really how
many of today’s Linux systems are built, installed or updated. If you
build any kind of embedded device, a server system, or even user
systems, you frequently do your work based on complete system images
that are linearly versioned. You build these images somewhere, and
then you replicate them atomically to a larger number of systems. On
these systems, you don’t install or remove packages, you get a defined
set of files, and besides installing or updating the system there is
no way to change the set of tools you get.

The current Linux distributions are not particularly good at providing
for this major use-case of Linux. Their strict focus on individual
packages as well as package managers as end-user install and update
tool is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users
want, either. Many users are used to app markets like the ones Android,
Windows or iOS/Mac have. Markets are a platform that doesn’t package, build or
maintain software like distributions do, but simply allows users to
quickly find and download the software they need, with the app vendor
responsible for keeping the app updated, secured, and all that on the
vendor’s release cycle. Users tend to be impatient. They want their
software quickly, and the fine distinction between trusting a single
distribution or a myriad of app developers individually is usually not
important for them. The companies behind the marketplaces usually try
to improve this trust problem by providing sand-boxing technologies: as
a replacement for the distribution that audits, vets, builds and
packages the software and thus allows users to trust it to a certain
level, these vendors try to find technical solutions to ensure that
the software they offer for download can’t be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are
sometimes quite successful attempts to do something about it. Ubuntu
Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of
this problem set, usually with a strict focus on one facet of Linux
systems. For example, Ubuntu Apps focus strictly on end user (desktop)
applications, and don’t care about how we build/update/install the OS
itself, or containers. Docker OTOH focuses on containers only, and
doesn’t care about end-user apps. Software Collections tries to focus
on the development environments. ChromeOS focuses on the OS itself,
but only for end-user devices. CoreOS also focuses on the OS, but
only for server systems.

The approaches they find are usually good at specific things, and use
a variety of different technologies, on different layers. However,
none of these projects tried to fix these problems in a generic way,
for all uses, right in the core components of the OS itself.

Linux has achieved tremendous success because its kernel is so
generic: you can build supercomputers and tiny embedded devices out of
it. It’s time we came up with a basic, reusable scheme for solving
the problem set described above that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom
Gundersen, David Herrmann, and yours truly) recently met in Berlin
to discuss all these things, and tried to come up with a scheme that is
somewhat simple, but tries to solve the issues generically, for all
use-cases, as part of the systemd project. All that in a way that is
somewhat compatible with the current scheme of distributions, to allow
a slow, gradual adoption. Also, and that’s something one cannot stress
enough: the toolbox scheme of classic Linux distributions is
actually a good one, and for many cases the right one. However, we
need to make sure we make distributions relevant again for all
use-cases, not just those of highly individualized systems.

Anyway, so let’s summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their
    software (regardless if just an app, or the whole OS) directly for
    the end user, and know the precise combination of libraries and
    packages it will operate with.

  • We want to allow end users and administrators to install these
    packages on their systems, regardless which distribution they have
    installed on it.

  • We want a unified solution that ultimately can cover updates for
    full systems, OS containers, end user apps, programming ABIs, and
    more. These updates shall be double-buffered, (at least). This is an
    absolute necessity if we want to prepare the ground for operating
    systems that manage themselves, that can update safely without
    administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a
    fully trustable OS, with images that can be verified by a full
    trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the
    kernel, and initrd. Cryptographically secure verification of the
    code we execute is relevant on the desktop (like ChromeOS does), but
    also for apps, for embedded devices and even on servers (in a post-Snowden
    world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So,
now, let’s discuss the technical bits we came up with:

The scheme we propose is built around a variety of concepts from btrfs
and Linux file system name-spacing. btrfs at this point already has a
large number of features that fit neatly in our concept, and the
maintainers are busy working on a couple of others we want to
eventually make use of.

As the first part of our proposal we make heavy use of btrfs sub-volumes and
introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> — This refers to a full
    vendor operating system tree. It’s basically a /usr tree (and no
    other directories), in a specific version, with everything you need to boot
    it up inside it. The <vendorid> field is replaced by some vendor
    identifier, maybe a scheme like
    org.fedoraproject.FedoraWorkstation. The <architecture> field
    specifies a CPU architecture the OS is designed for, for example
    x86-64. The <version> field specifies a specific OS version, for
    example 23.4. An example sub-volume name could hence look like this:
    usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> — This refers to an
    instance of an operating system. It’s basically a root directory,
    containing primarily /etc and /var (but possibly more). Sub-volumes
    of this type do not contain a populated /usr tree though. The
    <name> field refers to some instance name (maybe the host name of
    the instance). The other fields are defined as above. An example
    sub-volume name is
    root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> — This refers to a
    vendor runtime. A runtime here is supposed to be a set of
    libraries and other resources that are needed to run apps (for the
    concept of apps see below), all in a /usr tree. In this regard this
    is very similar to the usr sub-volumes explained above, however,
    while a usr sub-volume is a full OS and contains everything
    necessary to boot, a runtime is really only a set of
    libraries. You cannot boot it, but you can run apps with it. An
    example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> — This is very
    similar to a vendor runtime, as described above, it contains just a
    /usr tree, but goes one step further: it additionally contains all
    development headers, compilers and build tools, that allow
    developing against a specific runtime. For each runtime there should
    be a framework. When you develop against a specific framework in a
    specific architecture, then the resulting app will be compatible
    with the runtime of the same vendor ID and architecture. Example:
    framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> — This
    encapsulates an application bundle. It contains a tree that at
    runtime is mounted to /opt/<vendorid>, and contains all the
    application’s resources. The <vendorid> could be a string like
    org.libreoffice.LibreOffice, the <runtime> field refers to the
    vendor id of one specific runtime the application is built for, for
    example org.gnome.GNOME3_20:3.20.1. The <architecture> and
    <version> refer to the architecture the application is built for,
    and of course its version. Example:
    app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> — This sub-volume shall refer to the home
    directory of the specific user. The <user> field contains the user
    name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs
    of the user. The idea here is that in the long run the list of
    sub-volumes is sufficient as a user database (but see
    below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly
identifiable. It is our intention to introduce a new GPT partition type
ID for this.
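
As a rough illustration of this naming scheme, assuming a btrfs file
system mounted at a hypothetical mount point /volume, such sub-volumes
could be created manually like this:

# btrfs subvolume create "/volume/usr:org.fedoraproject.FedoraWorkstation:x86_64:24.7"
# btrfs subvolume create "/volume/root:revolution:org.fedoraproject.FedoraWorkstation:x86_64"
# btrfs subvolume create "/volume/home:lennart:1000:1000"
# btrfs subvolume list /volume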

How To Use It

After we introduced this naming scheme let’s see what we can build out
of this:

  • When booting up a system we mount the root directory from one of the
    root sub-volumes, and then mount /usr from a matching usr
    sub-volume. Matching here means it carries the same <vendorid>
    and <architecture>. Of course, we should pick the matching
    usr sub-volume with the newest version by default (a sketch of
    this follows after this list).

  • When we boot up an OS container, we do exactly the same as when
    we boot up a regular system: we simply combine a usr sub-volume
    with a root sub-volume.

  • When we enumerate the system’s users we simply go through the
    list of home snapshots.

  • When a user authenticates and logs in we mount his home
    directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the
    app sub-volume to /opt/<vendorid>/, and the appropriate runtime
    sub-volume the app picked to /usr, as well as the user’s
    /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he
    installs the right framework, and then temporarily transitions into
    a name space where /usr is mounted from the framework sub-volume, and
    /home/$USER from his own home directory. In this name space he then
    runs his build commands. He can build in multiple name spaces at the
    same time, if he intends to build software for multiple runtimes or
    architectures at the same time.
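
As a sketch of the first two items above, combining a root and a usr
sub-volume could look like this, using btrfs’ subvol= mount option (with
/dev/sda3 standing in for the btrfs volume, purely as an example). The
developer case is analogous, just performed inside a private mount
name-space, e.g. one entered with unshare --mount:

# mount -o subvol="root:revolution:org.fedoraproject.FedoraWorkstation:x86_64" /dev/sda3 /sysroot
# mount -o subvol="usr:org.fedoraproject.FedoraWorkstation:x86_64:24.8" /dev/sda3 /sysroot/usr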

Instantiating a new system or OS container (which is exactly the same
in this scheme) just consists of creating a new appropriately named
root sub-volume. Completely naturally you can share one vendor OS
copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because
usr, runtime, framework, app sub-volumes can exist in multiple
versions. Of course, by default the execution logic should always pick
the newest release of each sub-volume, but it is up to the user to keep
multiple versions around, and possibly execute older versions, if he
desires to do so. In fact, like on ChromeOS this could even be handled
automatically: if a system fails to boot with a newer snapshot, the
boot loader can automatically revert to an older version of the
OS.

An Example

Note that as a result this allows installing not only multiple end-user
applications into the same btrfs volume, but also multiple operating
systems, multiple system instances, multiple runtimes, multiple
frameworks. Or to spell this out in an example:

Let’s say Fedora, Mageia and ArchLinux all implement this scheme,
and provide ready-made end-user images. Also, the GNOME, KDE, SDL
projects all define a runtime+framework to develop against. Finally,
both LibreOffice and Firefox provide their stuff according to this
scheme. You can now trivially install all of these into the same btrfs
volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems
installed. All of them in three versions, and one even in a beta
version. We have four system instances around. Two of them of Fedora,
maybe one of them we usually boot from, the other we run for very
specific purposes in an OS container. We also have the runtimes for
two GNOME releases in multiple versions, plus one for KDE. Then, we
have the development trees for one version of KDE and GNOME around, as
well as two apps that make use of two releases of the GNOME
runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can
actually relatively freely mix and match apps and OSes, or develop
against specific frameworks in specific versions on any operating
system. It doesn’t matter if you booted your ArchLinux instance, or
your Fedora one, you can execute both LibreOffice and Firefox just
fine, because at execution time they get matched up with the right
runtime, and all of them are available from all the operating systems
you installed. You get the precise runtime that the upstream vendor of
Firefox/LibreOffice did their testing with. It doesn’t matter anymore
which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the
sub-volume list, it doesn’t matter which system you boot, the
distribution should be able to find your local users automatically,
without any configuration in /etc/passwd.

Building Blocks

With this naming scheme plus the way we can combine sub-volumes on
execution we have already come quite far, but how do we actually get
these sub-volumes onto the final machines, and how do we update them? Well,
btrfs has a feature they call “send-and-receive”. It basically allows
you to “diff” two file system versions, and generate a binary
delta. You can generate these deltas on a developer’s machine and then
push them into the user’s system, and he’ll get the exact same
sub-volume too. This is how we envision installation and updating of
operating systems, applications, runtimes, frameworks. At installation
time, we simply deserialize an initial send-and-receive delta into
our btrfs volume, and later, when a new version is released we just
add in the few bits that are new, by dropping in another
send-and-receive delta under a new sub-volume name. And we do it
exactly the same for the OS itself, for a runtime, a framework or an
app. There’s no technical distinction anymore. The underlying
operation for installing apps, runtime, frameworks, vendor OSes, as well
as the operation for updating them is done the exact same way for all.
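
A rough sketch of what this could look like with today’s btrfs tooling
(the stream file names are made up, and note that btrfs send requires
the sub-volumes to be read-only):

$ btrfs send "/volume/usr:org.fedoraproject.FedoraWorkstation:x86_64:24.7" > full-24.7.stream
$ btrfs send -p "/volume/usr:org.fedoraproject.FedoraWorkstation:x86_64:24.7" \
             "/volume/usr:org.fedoraproject.FedoraWorkstation:x86_64:24.8" > delta-24.8.stream

On the receiving system the full stream performs the initial
installation, and the delta performs the update:

# btrfs receive /volume < full-24.7.stream
# btrfs receive /volume < delta-24.8.stream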

Of course, keeping multiple full /usr trees around sounds like an
awful lot of waste, after all they will contain a lot of very similar
data, since a lot of resources are shared between distributions,
frameworks and runtimes. However, thankfully btrfs actually is able to
de-duplicate this for us. If we add in a new app snapshot, this simply
adds in the new files that changed. Moreover different runtimes and
operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user,
desktop side of things, the concept is also extremely powerful in
server scenarios. For example, it is easy to build your own usr
trees and deliver them to your hosts using this scheme. The usr
sub-volumes are supposed to be something that administrators can put
together. After deserializing them into a couple of hosts, you can
trivially instantiate them as OS containers there, simply by adding a
new root sub-volume for each instance, referencing the usr tree you
just put together. Instantiating OS containers hence becomes as easy
as creating a new btrfs sub-volume. And you can still update the images
nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies great to embedded
use-cases. Regardless if you build a TV, an IVI system or a phone: you
can put together your OS versions as usr trees, and then use
btrfs’ send-and-receive facilities to deliver them to the systems, and
update them there.

Many people when they hear the word “btrfs” instantly reply with “is
it ready yet?”. Thankfully, most of the functionality we really need
here is strictly read-only. With the exception of the home
sub-volumes (see below) all snapshots are strictly read-only, and are
delivered as immutable vendor trees onto the devices. They never are
changed. Even if btrfs might still be immature, for this kind of
read-only logic it should be more than good enough.

Note that this scheme also enables building fat systems: for example,
an installer image could include a Fedora version compiled for x86-64,
one for i386, one for ARM, all in the same btrfs volume. Due to btrfs’
de-duplication they will share as much as possible, and when the image
is booted up the right sub-volume is automatically picked. Something
similar of course applies to the apps too!

This also allows us to implement something that we like to call
Operating-System-As-A-Virus. Installing a new system is little more
than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting
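
Expressed as a rough shell sketch (with /dev/sdb as a hypothetical
target disk; note that these commands destroy its contents, and assume
a pre-generated usr-tree.stream as the serialized usr sub-volume):

# sgdisk --zap-all /dev/sdb                          # 1: new GPT partition table
# sgdisk --new=1:0:+512M --typecode=1:ef00 /dev/sdb  # 2: EFI System Partition (FAT)
# sgdisk --new=2:0:0 /dev/sdb                        # 3: partition for the btrfs volume
# mkfs.vfat /dev/sdb1
# mkfs.btrfs /dev/sdb2
# mount /dev/sdb2 /mnt
# btrfs receive /mnt < usr-tree.stream               # 4: deserialize the usr sub-volume
# umount /mnt

Installing a boot loader of your choice into the EFI System Partition
and rebooting then complete the list.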

Now, since the only real vendor data you need is the usr sub-volume,
you can trivially duplicate this onto any block device you want. Let’s
say you are a happy Fedora user, and you want to provide a friend with
his own installation of this awesome system, all on a USB stick. All
you have to do for this is follow the steps above, using your installed
usr tree as the source to copy. And there you go! And you don’t have to
be afraid that any of your personal data is copied too, as the usr
sub-volume is the exact version your vendor provided you with. Or with
other words: there’s no distinction anymore between installer images
and installed systems. It’s all the same. Installation becomes
replication, not more. Live-CDs and installed systems can be fully
identical.

Note that in this design apps are actually developed against a single,
very specific runtime, that contains all libraries it can link against
(including a specific glibc version!). Any library that is not
included in the runtime the developer picked must be included in the
app itself. This is similar to how apps on Android declare one very
specific Android version they are developed against. This greatly
simplifies application installation, as there’s no dependency hell:
each app pulls in one runtime, and the app is actually free to pick
which one, as you can have multiple installed, though only one is used
by each app.

Also note that operating systems built this way will never see
“half-updated” systems, as it is common when a system is updated using
RPM/dpkg. When updating the system the code will either run the old or
the new version, but it will never see part of the old files and part
of the new files. This is the same for apps, runtimes, and frameworks,
too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for
this. This scheme relies on the ability to monopolize the
vendor OS resources in /usr, which is the key to what I described in
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems
a few weeks back. Then, of course, for the full desktop app concept we
need a strong sandbox, that does more than just hiding files from the
file system view. After all with an app concept like the above the
primary interfacing between the executed desktop apps and the rest of the
system is via IPC (which is why we work on kdbus and teach it all
kinds of sand-boxing features), and the kernel itself. Harald Hoyer has
started working on generating the btrfs send-and-receive images based
on Fedora.

Getting to the full scheme will take a while. Currently we have many
of the building blocks ready, but some major items are missing. For
example, we push quite a few problems into btrfs that other solutions
try to solve in user space. One of them is actually
signing/verification of images. The btrfs maintainers are working on
adding this to the code base, but currently nothing exists. This
functionality is essential though to come to a fully verified system
where a trust chain exists all the way from the firmware to the
apps. Also, to make the home sub-volume scheme fully workable we
actually need encrypted sub-volumes, so that the sub-volume’s
pass-phrase can be used for authenticating users in PAM. This doesn’t
exist either.

Working towards this scheme is a gradual process. Many of the steps we
require for this are useful outside of the grand scheme though, which
means we can slowly work towards the goal, and our users can already
benefit from what we are working on as we go.

Also, and most importantly, this is not really a departure from
traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with
/usr, /home, /opt, /var, /etc. It executes in an environment that is
pretty much identical to how it would be run on traditional systems.

There’s no need to fully move to a system that uses only btrfs and
follows strictly this sub-volume scheme. For example, we intend to
provide implicit support for systems that are installed on ext4 or
xfs, or that are put together with traditional packaging tools such as
RPM or dpkg: if the user tries to install a
runtime/app/framework/os image on a system that doesn’t use btrfs so
far, it can just create a loop-back btrfs image in /var, and push the
data into that (a rough sketch of this follows below). Even we
developers will run our stuff like this for a while; after all, this
new scheme is not particularly useful for highly individualized
systems, and we developers usually tend to run systems like that.
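
Such a loop-back fallback might look roughly like this (all paths and
sizes hypothetical):

# truncate -s 10G /var/lib/volumes.img       # sparse file backing the btrfs volume
# mkfs.btrfs /var/lib/volumes.img
# mkdir -p /var/lib/volumes
# mount -o loop /var/lib/volumes.img /var/lib/volumes
# btrfs receive /var/lib/volumes < runtime-image.stream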

Also note that this is in no way a departure from packaging systems like
RPM or DEB. Even if the new scheme we propose is used for installing
and updating a specific system, it is RPM/DEB that is used to put
together the vendor OS tree initially. Hence, even in this scheme
RPM/DEB are highly relevant, though not strictly as an end-user tool
anymore, but as a build tool.

So Let’s Summarize Again What We Propose

  • We want a unified scheme for installing and updating OS images,
    user apps, runtimes and frameworks.

  • We want a unified scheme for how you can relatively freely mix OS
    images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of
    all executed code can be done, all the way to the firmware, as a
    standard feature of the system.

  • We want to allow app vendors to write their programs against very
    specific frameworks, under the knowledge that they will end up being
    executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions
    of them, multiple runtimes in multiple versions, as well as multiple
    frameworks in multiple versions. And of course, multiple apps in
    multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to
    ensure we can reliably update/rollback versions, in particular to
    safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS
    container is as simple as adding in a new snapshot and restarting
    the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS
    instances from a single vendor tree, with zero difference whether we
    boot it on bare metal, in a VM or in a container.

  • We want to enable Linux to have an open scheme that people can use
    to build app markets and similar schemes, not restricted to a
    specific vendor.

Final Words

I’ll be talking about this at LinuxCon Europe in October. I originally
intended to discuss this at the Linux Plumbers Conference (which I
assumed was the right forum for this kind of major plumbing level
improvement), and at linux.conf.au, but there was no interest in my
session submissions there…

Of course this is all work in progress. These are our current ideas we
are working towards. As we progress we will likely change a number of
things. For example, the precise naming of the sub-volumes might look
very different in the end.

Of course, we are the developers of the systemd project. Implementing
this scheme is not just a job for the systemd developers, however. This
is a reinvention of how distributions work, and hence needs strong
support from the distributions. We really hope we can trigger some
interest by publishing this proposal now, to get the distributions on
board. This, after all, is explicitly not supposed to be a solution for
one specific project and one specific vendor product; we care about
making this open, and about solving it for the generic case, without
cutting corners.

If you have any questions about this, you know how you can reach us
(IRC, mail, G+, …).

The future is going to be awesome!

Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/stateless.html

(Just a small heads-up: I don’t blog as much as I used to, I
nowadays update my Google+ page a lot more frequently. You might want
to subscribe to that if you are interested in more frequent technical
updates on what we are working on.)

In the past weeks we have been working on a couple of features for
systemd that enable a number of new usecases I’d like to shed some
light on. Building on the /usr merge that a number of distributions
have completed, we want to bring the runtime behaviour of Linux
systems to the next level. With the /usr merge completed, most static
vendor-supplied OS data is found exclusively in /usr; only a
few additional bits in /var and /etc are necessary
to make a system boot. On this we can build to enable a couple of new
features:

  1. A mechanism we call Factory Reset shall flush out
    /etc and /var, but keep the vendor-supplied
    /usr, bringing the system back into a well-defined, pristine
    vendor state with no local state or configuration. This functionality
    is useful across the board from servers, to desktops, to embedded
    devices.
  2. A Stateless System goes one step further: a system like
    this never stores /etc or /var on persistent
    storage, but always comes up with pristine vendor state. On systems
    like this every reboot acts as a factory reset. This functionality is
    particularly useful for simple containers or systems that boot off the
    network or read-only media, and receive all configuration they need
    during runtime from vendor packages or protocols like DHCP, or are
    capable of discovering their parameters automatically from the
    available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many
    containers or systems. Only local configuration or state is stored
    per-system, while the vendor operating system is pulled in from the
    same, immutable, shared snapshot. Each system hence has its private
    /etc and /var for receiving local configuration,
    however the OS tree in /usr is pulled in via bind mounts (in
    case of containers) or technologies like NFS (in case of physical
    systems), or btrfs snapshots from a golden master image. This is
    particularly interesting for containers where the goal is to run
    thousands of container images from the same OS tree. However, it also
    has a number of other usecases, for example thin client systems, which
    can boot the same NFS share a number of times. Furthermore this
    mechanism is useful to implement very simple OS installers, that
    simply unserialize a /usr snapshot into a file system,
    install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless
    systems: if the underlying storage technology can cryptographically
    ensure that the vendor-supplied OS is trusted and in a consistent
    state, then it must be made sure that /etc or /var
    are either included in the OS image, or simply unnecessary for booting.

Concepts

A number of Linux-based operating systems have tried to implement
some of the schemes described above in one way or
another. Particularly interesting are GNOME’s OSTree, CoreOS and Google’s Android and
ChromeOS. They generally found different solutions for the specific
problems you face when implementing schemes like this, sometimes
taking shortcuts that address only their specific case and cannot
cover the general purpose. With systemd now being at the core of so many
distributions and deeply involved in bringing up and maintaining the
system we came to the conclusion that we should attempt to add generic
support for setups like this to systemd itself, to open this up for
the general purpose distributions to build on. We decided to focus on
three kinds of systems:

  1. The stateful system, the traditional system as we know it with
    machine-specific /etc, /usr and /var, all
    properly populated.
  2. Startup without a populated /var, but with configured
    /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will
    call these stateless systems.)

A factory reset is just a special case of the latter two modes,
where the system boots up without /var and /etc but
the next boot is a normal stateful boot like the first described
mode. Note that a mode where /etc is flushed but
/var is not, is nothing we intend to cover (why? well, the
user ID question becomes much harder, see below, and we simply saw no
usecase for it worth the trouble).

Problems

Booting up a system without a populated /var is relatively
straightforward. With a few lines of tmpfiles configuration it is
possible to populate /var with its basic structure in a way
that is sufficient to make a system boot cleanly. systemd version 214
and newer ship with support for this. Of course, support for this
scheme in systemd is only a small part of the solution. While a lot of
software reconstructs the directory hierarchy it needs in /var
automatically, much software does not. In cases like this it is
necessary to ship a couple of additional tmpfiles lines that set up at
boot time the necessary files or directories in /var to make
the software operate, similar to what RPM or DEB packages would set up
at installation time.
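
As a rough sketch of what such a line looks like (the service name and
directory here are hypothetical, purely for illustration), a tmpfiles
drop-in that recreates a missing state directory could read:

# /usr/lib/tmpfiles.d/myservice.conf (hypothetical example)
# Type  Path                Mode  UID        GID        Age  Argument
d       /var/lib/myservice  0750  myservice  myservice  -    -

At boot systemd-tmpfiles will then create /var/lib/myservice with the
specified ownership and mode, should it be missing.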

Booting up a system without a populated /etc is a more
difficult task. In /etc we have a lot of configuration bits
that are essential for the system to operate, for example and most
importantly system user and group information in /etc/passwd
and /etc/group. If the system boots up without /etc
there must be a way to replicate the minimal information necessary in
it, so that the system manages to boot up fully.

To make this even more complex, in order to support “offline”
updates of /usr that are replicated into a number of systems
possessing private /etc and /var there needs to be a
way these directories can be upgraded transparently when
necessary, for example by recreating caches like
/etc/ld.so.cache or adding missing system users to
/etc/passwd on next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will
ship with a number of features in systemd that make /etc-less
boots functional:

  • A new tool systemd-sysusers has been added. It introduces
    a new drop-in directory /usr/lib/sysusers.d/. Minimal
    descriptions of necessary system users and groups can be placed
    there. Whenever the tool is invoked it will create these users in
    /etc/passwd and /etc/group should they be
    missing. It is only suitable for creating system users and groups, not
    for normal users. It will write to the files directly via the
    appropriate glibc APIs, which is the right thing to do for system
    users. (For normal users no such APIs exist, as the users might be
    stored centrally on LDAP or suchlike, and they are out of focus for
    our usecase.) The major benefit of this tool is that system user
    definition can happen offline: a package simply has to drop in a new
    file to register a user. This makes system user registration
    declarative instead of imperative — the latter being how
    system users are traditionally created from RPM or DEB
    installation scripts. By being declarative it is easy to replicate the
    users on next boot to a number of system instances.

    To make this new
    tool interesting for packaging scripts we make it easy to
    alternatively invoke it during package installation time, thus being a
    good alternative to invocations of useradd -r and
    groupadd -r.

    Some OS designs use a static, fixed user/group list stored in
    /usr as primary database for users/groups, with fixed
    UID/GID mappings. While this works for specific systems, this cannot
    cover the general purpose. As the UID/GID range for system
    users/groups is very small (only containing 998 users and groups on most systems), the
    best has to be made from this space and only UIDs/GIDs necessary on
    the specific system should be allocated. This means allocation has to
    be dynamic and adjust to what is necessary.

    Also note that this tool has
    one very nice feature: in addition to fully dynamic, and fully static
    UID/GID assignment for the users to create, it supports reading
    UID/GID numbers off existing files in /usr, so that vendors
    can make use of setuid/setgid binaries owned by specific users.

  • We also added a default
    user definition list
    which creates the most basic users the system
    and systemd need. Of course, very likely downstream distributions
    might need to alter this default list, add new entries and possibly
    map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been
    added. With this mechanism it is possible to conditionalize execution
    of services depending on whether /usr is newer than
    /etc or /var. The idea is that various services that
    need to be added into the boot process on upgrades make use of this to
    not delay boot-ups on normal boots, but run as necessary should
    /usr have been updated since the last boot. (See the unit
    sketch after this list.) This is implemented based on the mtime
    timestamp of /usr: if the OS has been updated the packaging
    software should touch the directory, thus informing all instances
    that an upgrade of /etc and /var might be necessary.
  • We added a number of service files, that make use of the new
    ConditionNeedsUpdate= switch, and run a couple of services
    after each update. Among them are the aforementioned
    systemd-sysusers tool, as well as services that rebuild the
    udev hardware database, the journal catalog database and the library
    cache in /etc/ld.so.cache.
  • If systemd detects an empty /etc at early boot it will
    now use the unit preset information to enable all services by
    default that the vendor or packager declared. It will then proceed
    booting.
  • We added a new tmpfiles snippet that is able to reconstruct the
    most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability to copy entire directory trees into
    place should they be missing. This is particularly useful for copying
    certain essential files or directories into /etc without
    which the system refuses to boot. Currently the most prominent
    candidates for this are /etc/pam.d and
    /etc/dbus-1. In the long run we hope that packages can be
    fixed so that they always work correctly without configuration in
    /etc. Depending on the software this means that they should
    come with compiled-in defaults that just work should their
    configuration file be missing, or that they should fall back to static
    vendor-supplied configuration in /usr that is used whenever
    /etc doesn’t have any configuration. Both the PAM and the
    D-Bus case are probably candidates for the latter. Given that there
    are probably many cases like this we are working with a number of
    folks to introduce a new directory called /usr/share/etc
    (name is not settled yet) to major distributions, that always
    contains the full, original, vendor-supplied configuration of all
    packages. This is very useful here, so that there’s an obvious place
    to copy the original configuration from, but it is also useful
    completely independently as this provides administrators with an easy
    place to diff their own configuration in /etc
    against, to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn
    to make testing of systems with unpopulated /etc and
    /var easy. For example, to run a fully stateless container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in
    /srv/mycontainer in a read-only way, but with a (writable)
    tmpfs mounted to /var and /etc. With a very recent
    git snapshot of systemd invoking a Fedora rawhide system should mostly
    work OK, modulo the D-Bus and PAM problems mentioned above. A later
    version of systemd-nspawn is likely to gain a high-level
    switch --mode={stateful|volatile|stateless} that combines this
    into simple switches, reusing the vocabulary introduced earlier.
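
Here’s the unit sketch referenced above for the
ConditionNeedsUpdate= mechanism: a minimal oneshot service (the
unit name and the command it invokes are hypothetical) that is skipped
on normal boots, but runs if /usr has been updated more recently
than /etc:

[Unit]
Description=Rebuild local caches after an offline update of /usr
ConditionNeedsUpdate=/etc

[Service]
Type=oneshot
ExecStart=/usr/bin/rebuild-my-caches

A real unit of this kind would additionally need to be ordered
appropriately into early boot; the condition is the interesting bit
here.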

What’s Next

Pulling this all together we are very close to making boots with
empty /etc and /var on general purpose Linux
operating systems a reality. Of course, while doing the groundwork in
systemd gets us some distance, there’s a lot of work left. Most
importantly: the majority of Linux packages are simply incompatible
with this scheme the way they are currently set up. They do not work
without configuration in /etc or state directories in
/var; they do not drop system user information in
/usr/lib/sysusers.d. However, we believe it’s our job to do
the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently
very little of this is available in any distribution (if only
because 215 isn’t even released yet). However, this will hopefully
change quickly. As soon as that is accomplished we can start working
on making the other components of the OS work nicely in this
scheme. If you are an upstream developer, please consider making your
software work correctly if /etc and/or /var are not
populated. This means:

  • When you need a state directory in /var and it is missing,
    create it first. If you cannot do that, because you dropped privileges
    or suchlike, please consider dropping in a tmpfiles snippet that
    creates the directory with the right permissions early at boot, should
    it be missing.
  • When you need configuration files in /etc to work
    properly, consider changing your application to work nicely when these
    files are missing, and automatically fall back to either built-in
    defaults, or to static vendor-supplied configuration files shipped in
    /usr, so that administrators can override configuration in
    /etc, but if they don’t, the default configuration applies.
  • When you need a system user or group, consider dropping in a file
    into /usr/lib/sysusers.d describing the users; a sketch of
    what such a file looks like follows this list. (Currently
    documentation on this is minimal, we will provide more docs on this
    shortly.)
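
As promised, here’s a sketch of what such a sysusers.d file could look
like. Given that the documentation is still minimal, treat this as an
illustration with a hypothetical service name, not as an authoritative
reference:

# /usr/lib/sysusers.d/myservice.conf (hypothetical example)
# Type  Name       ID  GECOS
u       myservice  -   "My Service Daemon"

The "-" in the ID column requests dynamic UID allocation;
systemd-sysusers will then create the user on the next boot (or at
package installation time), should it be missing.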

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in
    tmpfiles snippets that copy the bare minimum of configuration files to
    make your software work from somewhere in /usr into
    /etc.
  • Consider moving from imperative useradd commands in
    packaging scripts, to declarative sysusers files. Ideally,
    this is shipped upstream too, but if that’s not possible then simply
    adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you
should consult with your distribution whether their packaging policy
even allows that. Currently, most distributions will not, so we have
to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.

Conclusion

Before we finish, let me stress again why we are doing all
this:

  1. For end-user machines like desktops, tablets or mobile phones, we
    want a generic way to implement factory reset, which the user can make
    use of when the system is broken (saves you support costs), or when he
    wants to sell it and get rid of his private data, and renew that “fresh
    car smell”.
  2. For embedded machines we want a generic way to reset
    devices. We also want a way for every single boot to be identical to
    a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in
    /usr so that it can be strictly read-only, and fully
    cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply
    deserialize a vendor OS /usr snapshot into a new file system,
    install a boot loader and reboot, leaving all first-time configuration
    to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and
    manage a number of vendor OS /usr snapshots in verified states, and
    which can then update /etc and /var simply by
    rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single
    golden master /usr tree with a large number of instances that
    simply maintain their own private /etc and /var for
    their private configuration and state, while still allowing clean
    updates of /usr.
  7. We want to make thin clients that share /usr across the
    network work by allowing stateless bootups. During all discussions on
    how /usr was to be organized this was frequently mentioned. A
    setup like this so far only worked in very specific cases; with this
    scheme we want to make it work in the general case.

Of course, we have no illusions, just doing the groundwork for all
of this in systemd doesn’t make this all a real-life solution
yet. Also, it’s very unlikely that all of Fedora (or any other general
purpose distribution) will support this scheme for all its packages
soon, however, we are quite confident that the idea is convincing,
that we need to start somewhere, and that getting the most core
packages adapted to this shouldn’t be out of reach.

Oh, and of course, the concepts behind this are really not new, we
know that. However, what’s new here is that we try to make them
available in a general purpose OS core, instead of special purpose
systems.

Anyway, let’s get the ball rolling! Let’s make stateless systems a
reality!

And that’s all I have for now. I am sure this leaves a lot of
questions open. If you have any, join us on IRC on #systemd
on freenode or comment on Google+.

The Biggest Myths

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/the-biggest-myths.html

Since we first proposed systemd
for inclusion in the distributions it has been frequently discussed in
many forums, mailing lists and conferences. In these discussions one
can often hear certain myths about systemd, that are repeated over and
over again, but certainly don’t gain any truth by constant
repetition. Let’s take the time to debunk a few of them:

  1. Myth: systemd is monolithic.

    If you build systemd with all configuration options enabled you
    will build 69 individual binaries. These binaries all serve different
    tasks, and are neatly separated for a number of reasons. For example,
    we designed systemd with security in mind, hence most daemons run at
    minimal privileges (using kernel capabilities, for example) and are
    responsible for very specific tasks only, to minimize their security
    surface and impact. Also, systemd parallelizes the boot more than any
    prior solution. This parallelization happens by running more processes
    in parallel. Thus it is essential that systemd is nicely split up into
    many binaries and thus processes. In fact, many of these
    binaries[1] are separated out so nicely, that they are very
    useful outside of systemd, too.

    A package involving 69 individual binaries can hardly be called
    monolithic. What is different from prior solutions however,
    is that we ship more components in a single tarball, and maintain them
    upstream in a single repository with a unified release cycle.

  2. Myth: systemd is about speed.

    Yes, systemd is fast (a
    pretty complete userspace boot-up in ~900ms, anyone?), but that’s
    primarily just a side-effect of doing things right. In fact, we
    never really sat down and optimized the last tiny bit of performance
    out of systemd. Instead, we actually frequently knowingly picked the
    slightly slower code paths in order to keep the code more
    readable. This doesn’t mean being fast was irrelevant for us, but
    reducing systemd to its speed is certainly quite a misconception,
    since that is certainly not anywhere near the top of our list of
    goals.

  3. Myth: systemd’s fast boot-up is irrelevant for
    servers.

    That is just completely not true. Many administrators actually are
    keen on reduced downtimes during maintenance windows. In High
    Availability setups it’s kinda nice if the failed machine comes back
    up really fast. In cloud setups with a large number of VMs or
    containers the price of slow boots multiplies with the number of
    instances. Spending minutes of CPU and IO on really slow boots of
    hundreds of VMs or containers reduces your system’s density
    drastically, heck, it even costs you more energy. Slow boots can be
    quite financially expensive. Then, fast booting of containers allows
    you to implement a logic such as socket activated
    containers, allowing you to drastically increase the
    density of your cloud system.

    Of course, in many server setups boot-up is indeed irrelevant, but
    systemd is supposed to cover the whole range. And yes, I am aware
    that often it is the server firmware that costs the most time at
    boot-up, and the OS anyways fast compared to that, but well, systemd
    is still supposed to cover the whole range (see above…), and no,
    not all servers have such bad firmware, and certainly not VMs and
    containers, which are servers of a kind, too.[2]

  4. Myth: systemd is incompatible with shell scripts.

    This is entirely bogus. We just don’t use them for the boot
    process, because we believe they aren’t the best tool for that
    specific purpose, but that doesn’t mean systemd is incompatible with
    them. You can easily run shell scripts as systemd services, heck, you
    can run scripts written in any language as systemd services,
    systemd doesn’t care the slightest bit what’s inside your
    executable. Moreover, we heavily use shell scripts for our own
    purposes, for installing, building, testing systemd. And you can stick
    your scripts in the early boot process, use them for normal services,
    you can run them during late shutdown, there are practically no
    limits.

  5. Myth: systemd is difficult.

    This also is entirely nonsense. A systemd platform is actually much
    simpler than traditional Linuxes because it unifies
    system objects and their dependencies as systemd units. The
    configuration file language is very simple, and we got rid of
    redundant configuration files. We provide uniform tools for much
    of the configuration of the system. The system is much less of a
    conglomerate than traditional Linuxes are. We also have pretty
    comprehensive documentation (all linked from the homepage) about
    pretty much every detail of systemd, and this not only covers
    admin/user-facing interfaces, but also developer APIs.

    systemd certainly comes with a learning curve. Everything
    does. However, we like to believe that it is actually simpler to
    understand systemd than a Shell-based boot for most people. Surprised
    we say that? Well, as it turns out, Shell is not a pretty language to
    learn, its syntax is arcane and complex. systemd unit files are
    substantially easier to understand, they do not expose a programming
    language, but are simple and declarative by nature. That all said, if
    you are experienced in shell, then yes, adopting systemd will take a
    bit of learning.

    To make learning easy we tried hard to provide the maximum
    compatibility to previous solutions. But not only that, on many
    distributions you’ll find that some of the traditional tools will now
    even tell you — while executing what you are asking for — how you
    could do it with the newer tools instead, in a possibly nicer way.

    Anyway, the take-away is that systemd is probably as
    simple as such a system can be, and that we try hard to make it easy
    to learn. But yes, if you know sysvinit then adopting systemd will
    require a bit of learning, but quite frankly if you mastered sysvinit,
    then systemd should be easy for you.

  6. Myth: systemd is not modular.

    Not true at all. At compile time you have a number of
    configure switches to select what you want to build, and what
    not. And we document how you can select in even more detail what
    you need, going beyond our configure switches.

    This modularity is not totally unlike the one of the Linux kernel,
    where you can select many features individually at compile time. If the
    kernel is modular enough for you then systemd should be pretty close,
    too.

  7. Myth: systemd is only for desktops.

    That is certainly not true. With systemd we try to cover pretty
    much the same range as Linux itself does. While we care for desktop
    uses, we also care pretty much the same way for server uses, and
    embedded uses as well. You can bet that Red Hat wouldn’t make it a
    core piece of RHEL7 if it wasn’t the best option for managing services
    on servers.

    People from numerous companies work on systemd. Car manufacturers
    build it into cars, Red Hat uses it for a server operating system, and
    GNOME uses many of its interfaces for improving the desktop. You find
    it in toys, in space telescopes, and in wind turbines.

    Most features I most recently worked on are probably relevant
    primarily on servers, such as container support, resource management
    or the security features. We cover desktop systems pretty well
    already, and there are a number of companies doing systemd
    development for embedded; some even offer consulting services for
    it.

  8. Myth: systemd was created as result of the NIH syndrome.

    This is not true. Before we began working on systemd we were
    pushing for Canonical’s Upstart to be widely adopted (and Fedora/RHEL
    used it too for a while). However, we eventually came to the
    conclusion that its design was inherently flawed at its core (at least
    in our eyes: most fundamentally, it leaves dependency management to
    the admin/developer, instead of solving this hard problem in code),
    and if something’s wrong in the core you better replace it, rather
    than fix it. This was hardly the only reason though; other things
    came into play too, such as the licensing/contribution agreement mess
    around it. NIH wasn’t one of the reasons, though…[3]

  9. Myth: systemd is a freedesktop.org project.

    Well, systemd is certainly hosted at fdo, but freedesktop.org is
    little else but a repository for code and documentation. Pretty much
    any coder can request a repository there and dump his stuff there (as
    long as it’s somewhat relevant for the infrastructure of free
    systems). There’s no cabal involved, no “standardization” scheme, no
    project vetting, nothing. It’s just a nice, free, reliable place to
    have your repository. In that regard it’s a bit like SourceForge,
    github, kernel.org, just not commercial and without over-the-top
    requirements, and hence a good place to keep our stuff.

    So yes, we host our stuff at fdo, but the implied assumption of
    this myth in that there was a group of people who meet and then agree
    on how the future free systems look like, is entirely bogus.

  10. Myth: systemd is not UNIX.

    There’s certainly some truth in that. systemd’s sources do not
    contain a single line of code originating from original UNIX. However,
    we derive inspiration from UNIX, and thus there’s a ton of UNIX in
    systemd. For example, the UNIX idea of “everything is a file” finds
    reflection in that in systemd all services are exposed at runtime in a
    kernel file system, the cgroupfs. Then, one of the original
    features of UNIX was multi-seat support, based on built-in terminal
    support. Text terminals are hardly the state of the art in how you
    interface with your computer these days, however. With systemd we
    brought native multi-seat
    support back, but this time with full support for today’s hardware,
    covering graphics, mice, audio, webcams and more, and all that fully
    automatic, hotplug-capable and without configuration. In fact the
    design of systemd as a suite of integrated tools that each have their
    individual purposes but when used together are more than just the sum
    of the parts, that’s pretty much at the core of UNIX philosophy. Then,
    the way our project is handled (i.e. maintaining much of the core OS
    in a single git repository) is much closer to the BSD model (which is
    a true UNIX, unlike Linux) of doing things (where most of the core OS
    is kept in a single CVS/SVN repository) than things on Linux ever
    were.

    Ultimately, UNIX is something different for everybody. For us
    systemd maintainers it is something we derive inspiration from. For
    others it is a religion, and much like the other world religions there
    are different readings and understandings of it. Some define UNIX
    based on specific pieces of code heritage, others see it just as a set
    of ideas, others as a set of commands or APIs, and even others as a
    definition of behaviours. Of course, it is impossible to ever make all
    these people happy.

    Ultimately the question whether something is UNIX or not matters
    very little. Being technically excellent is hardly exclusive to
    UNIX. For us, UNIX is a major influence (heck, the biggest one), but
    we also have other influences. Hence in some areas systemd will be
    very UNIXy, and in others a little bit less.

  11. Myth: systemd is complex.

    There’s certainly some truth in that. Modern computers are complex
    beasts, and the OS running on them will hence have to be complex
    too. However, systemd is certainly not more complex than prior
    implementations of the same components. Rather, it’s simpler, and
    has less redundancy (see above). Moreover, building a simple OS based
    on systemd will involve much fewer packages than a traditional Linux
    did. Fewer packages make it easier to build your system, and get rid
    of interdependencies and of much of the differing behaviour of every
    component involved.

  12. Myth: systemd is bloated.

    Well, bloated certainly has many different definitions. But in
    most definitions systemd is probably the opposite of bloat. Since
    systemd components share a common code base, they tend to share much
    more code for common code paths. Here’s an example: in a traditional
    Linux setup, sysvinit, start-stop-daemon, inetd, cron, dbus, all
    implemented a scheme to execute processes with various configuration
    options in a certain, hopefully clean environment. On systemd the code
    paths for all of this, for the configuration parsing, as well as the
    actual execution is shared. This means less code, less place for
    mistakes, less memory and cache pressure, and is thus a very good
    thing. And as a side-effect you actually get a ton more functionality
    for it…

    As mentioned above, systemd is also pretty modular. You can choose
    at build time which components you need, and which you don’t
    need. People can hence specifically choose the level of “bloat” they
    want.

    When you build systemd, it only requires three dependencies: glibc,
    libcap and dbus. That’s it. It can make use of more dependencies, but
    these are entirely optional.

    So, yeah, whichever way you look at it, it’s really not
    bloated.

  13. Myth: systemd being Linux-only is not nice to the BSDs.

    Completely wrong. The BSD folks are pretty much uninterested in
    systemd. If systemd was portable, this would change nothing, they
    still wouldn’t adopt it. And the same is true for the other Unixes in
    the world. Solaris has SMF, BSD has their own “rc” system, and they
    always maintained it separately from Linux. The init system is very
    close to the core of the entire OS. And these other operating systems
    hence define themselves among other things by their core
    userspace. The assumption that they’d adopt our core userspace if we
    just made it portable, is completely without any foundation.

  14. Myth: systemd being Linux-only makes it impossible for Debian to adopt it as default.

    Debian supports non-Linux kernels in their distribution. systemd
    won’t run on those. Is that a problem though, and should that hinder
    them from adopting systemd as default? Not really. The folks who
    ported Debian to these other kernels were willing to invest time in a
    massive porting effort, they set up test and build systems, and
    patched and built numerous packages for their goal. The maintenance
    of both a systemd unit file and a classic init script for the
    packaged services is a negligible amount of work compared to that,
    especially since those scripts more often than not exist already.

  15. Myth: systemd could be ported to other kernels if its maintainers just wanted to.

    That is simply not true. Porting systemd to other kernels is not
    feasible. We just use too many Linux-specific interfaces. For a few
    one might find replacements on other kernels, some features one might
    want to turn off, but for most this is not really possible. Here’s a
    small, very incomplete list: cgroups, fanotify, umount2(),
    /proc/self/mountinfo
    (including notification), /dev/swaps (same),
    udev, netlink,
    the structure of /sys, /proc/$PID/comm,
    /proc/$PID/cmdline, /proc/$PID/loginuid, /proc/$PID/stat,
    /proc/$PID/session, /proc/$PID/exe, /proc/$PID/fd, tmpfs, devtmpfs,
    capabilities, namespaces of all kinds, various prctl()s, numerous
    ioctls,
    the mount() system call and its semantics, selinux, audit,
    inotify, statfs, O_DIRECTORY, O_NOATIME, /proc/$PID/root, waitid(),
    SCM_CREDENTIALS, SCM_RIGHTS, mkostemp(), /dev/input, ...

    And no, if you look at this list and pick out the few where you can
    think of obvious counterparts on other kernels, then think again, and
    look at the others you didn’t pick, and the complexity of replacing
    them.

  16. Myth: systemd is not portable for no reason.

    Non-sense! We use the Linux-specific functionality because we need
    it to implement what we want. Linux has so many features that
    UNIX/POSIX didn’t have, and we want to empower the user with
    them. These features are incredibly useful, but only if they are
    actually exposed in a friendly way to the user, and that’s what we do
    with systemd.

  17. Myth: systemd uses binary configuration files.

    No idea who came up with this crazy myth, but it’s absolutely not
    true. systemd is configured pretty much exclusively via simple text
    files. A few settings you can also alter with the kernel command line
    and via environment variables. There’s nothing binary in its
    configuration (not even XML). Just plain, simple, easy-to-read text
    files.

  18. Myth: systemd is a feature creep.

    Well, systemd certainly covers more ground than it used to. It’s
    not just an init system anymore, but the basic userspace building
    block to build an OS from. However, we carefully make sure to keep
    most of the features optional. You can turn a lot off at compile
    time, and
    even more at runtime. Thus you can choose freely how much feature
    creeping you want.

  19. Myth: systemd forces you to do something.

    systemd is not the mafia. It’s Free Software, you can do with it
    whatever you want, and that includes not using it. That’s pretty much
    the opposite of “forcing”.

  20. Myth: systemd makes it impossible to run syslog.

    Not true, we carefully made sure when we introduced
    the journal
    that all data is also passed on to any syslog daemon
    running. In fact, if anything changed, it is that syslog now gets
    more complete data than it got before, since we now cover early
    boot stuff as well as STDOUT/STDERR of any system service.

  21. Myth: systemd is incompatible.

    We try very hard to provide the best possible compatibility with
    sysvinit. In fact, the vast majority of init scripts should work just
    fine on systemd, unmodified. However, there actually are indeed a few
    incompatibilities, but we try to document these and explain what to
    do about them. Ultimately every system that is not actually sysvinit
    itself will have a certain amount of incompatibilities with it, since
    it will not share the exact same code paths.

    It is our goal to ensure that differences between the various
    distributions are kept at a minimum. That means unit files usually
    work just fine on a different distribution than the one you wrote
    them on, which
    is a big improvement over classic init scripts which are very hard to
    write in a way that they run on multiple Linux distributions, due to
    numerous incompatibilities between them.

  22. Myth: systemd is not scriptable, because of its D-Bus use.

    Not true. Pretty much every single D-Bus interface systemd provides
    is also available in a command line tool, for example in systemctl,
    loginctl, timedatectl, hostnamectl, localectl and suchlike. You can
    easily call these tools from shell scripts, they
    open up pretty much the entire API from the command line with
    easy-to-use commands.

    That said, D-Bus actually has bindings for almost any scripting
    language this world knows. Even from the shell you can invoke
    arbitrary D-Bus methods with dbus-send
    or gdbus. If
    anything, this improves scriptability due to the good support of D-Bus
    in the various scripting languages.

  23. Myth: systemd requires you to use some arcane configuration
    tools instead of allowing you to edit your configuration files
    directly.

    Not true at all. We offer some configuration tools, and using them
    gets you a bit of additional functionality (for example, command line
    completion for all settings!), but there’s no need at all to use
    them. You can always edit the files in question directly if you wish,
    and that’s fully supported. Of course sometimes you need to explicitly
    reload configuration of some daemon after editing the configuration,
    but that’s pretty much true for most UNIX services.

  24. Myth: systemd is unstable and buggy.

    Certainly not according to our data. We have been monitoring the
    Fedora bug tracker (and some others) closely for a long long time. The
    number of bugs is very low for such a central component of the OS,
    especially if you discount the numerous RFE bugs we track for the
    project. We are pretty good in keeping systemd out of the list of
    blocker bugs of the distribution. We have a relatively fast
    development cycle with mostly incremental changes to keep quality and
    stability high.

  25. Myth: systemd is not debuggable.

    False. Some people try to imply that the shell was a good
    debugger. Well, it isn’t really. In systemd we provide you with actual
    debugging features instead. For example: interactive debugging,
    verbose tracing, the ability to mask any component during boot, and
    more. Also, we provide documentation for it.

    It’s certainly well debuggable, we needed that for our own
    development work, after all. But we’ll grant you one thing: it uses
    different debugging tools, we believe more appropriate ones for the
    purpose, though.

  26. Myth: systemd makes changes for change’s sake.

    Very much untrue. We pretty much exclusively have technical
    reasons for the changes we make, and we explain them in the various
    pieces of documentation, wiki pages, blog articles, mailing list
    announcements. We try hard to avoid making incompatible changes, and
    if we do we try to document the why and how in detail. And if you
    wonder about something, just ask us!

  27. Myth: systemd is a Red-Hat-only project, is private property
    of some smart-ass developers, who use it to push their views to the
    world.

    Not true. Currently, there are 16 hackers with commit powers to the
    systemd git tree. Of these 16 only six are employed by Red Hat. The 10
    others are folks from ArchLinux, from Debian, from Intel, even from
    Canonical, Mandriva, Pantheon and a number of community folks with
    full commit rights. And they frequently commit big stuff, major
    changes. Then, there are 374 individuals with patches in our tree, and
    they too came from a number of different companies and backgrounds,
    and many of those have way more than one patch in the tree. The
    discussions about where we want to take systemd are done in the open,
    on our IRC channel (#systemd on freenode, you are always
    welcome), on our mailing list, and on public hackfests (such as our
    next one in Brno, you are invited). We regularly attend various
    conferences, to collect feedback, to explain what we are doing and
    why, like few others do. We maintain blogs, engage in social
    networks (we actually have some pretty interesting content on
    Google+, and our Google+ Community is pretty alive, too), and try
    really hard to explain the why and the how of what we do, and to
    listen to feedback and figure out where the current issues are (for
    example, from that feedback we compiled this list of often-heard
    myths about systemd…).

    What most systemd contributors probably share is a rough idea of
    what a good OS should look like, and the desire to make it
    happen. However, by the very nature of the project being Open Source
    and rooted in the community, systemd is just what people want it to
    be, and if it’s not
    what they want then they can drive the direction with patches and
    code, and if that’s not feasible, then there are numerous other
    options to use, too, systemd is never exclusive.

    One goal of systemd is to unify the dispersed Linux landscape a
    bit. We try to get rid of many of the more pointless differences of
    the various distributions in various areas of the core OS. As part of
    that we sometimes adopt schemes that were previously used by only one
    of the distributions and push it to a level where it’s the default of
    systemd, trying to gently push everybody towards the same set of basic
    configuration. This is never exclusive though, distributions can
    continue to deviate from that if they wish, however, if they end up
    using the well-supported default their work becomes much easier and
    they might gain a feature or two. Now, as it turns out, more
    frequently than not we actually adopted schemes that were Debianisms,
    rather than Fedoraisms/Redhatisms, as the scheme best supported by
    systemd. For example, systems running systemd now generally store
    their hostname in /etc/hostname, something that used to be
    specific to Debian and now is used across distributions.

    One thing we’ll grant you though, we sometimes can be
    smart-asses. We try to be prepared whenever we open our mouth, in
    order to be able to back-up with facts what we claim. That might make
    us appear as smart-asses.

    But in general, yes, some of the more influential contributors of
    systemd work for Red Hat, but they are in the minority, and systemd is
    a healthy, open community with different interests, different
    backgrounds, just unified by a few rough ideas of where the trip
    should go, a community where code and its design counts, and certainly
    not
    company affiliation.

  28. Myth: systemd doesn’t support /usr split from the root directory.

    Non-sense. Since its beginnings systemd supports the
    --with-rootprefix= option to its configure script
    which allows you to tell systemd to neatly split up the stuff needed
    for early boot and the stuff needed for later on. All this logic is
    fully present and we keep it up-to-date right there in systemd’s build
    system.

    Of course, we still don’t think that actually booting with
    /usr unavailable is a good idea, but we
    support this just fine in our build system. This won’t fix the
    inherent problems of the scheme that you’ll encounter all across the
    board, but you can’t blame that on systemd, because in systemd we
    support this just fine.

  29. Myth: systemd doesn’t allow you to replace its components.

    Not true, you can turn off and replace pretty much any part of
    systemd, with very few exceptions. And those exceptions (such as
    journald) generally allow you to run an alternative side by side to
    it, while cooperating nicely with it.

  30. Myth: systemd’s use of D-Bus instead of sockets makes it intransparent.

    This claim is already contradictory in itself: D-Bus uses sockets
    as transport, too. Hence whenever D-Bus is used to send something
    around, a socket is used for that too. D-Bus is mostly a standardized
    serialization of messages to send over these sockets. If anything this
    makes it more transparent, since this serialization is well
    documented, understood and there are numerous tracing tools and
    language bindings for it. This is very much unlike the usual
    homegrown protocols the various classic UNIX daemons use to
    communicate locally.

Hmm, did I write I just wanted to debunk a “few” myths? Maybe these
were more than just a few… Anyway, I hope I managed to clear up a
couple of misconceptions. Thanks for your time.

Footnotes

[1] For example, systemd-detect-virt, systemd-tmpfiles and
systemd-udevd.

[2] Also, we are trying to do our little part on maybe
making this better. By exposing boot-time performance of the firmware
more prominently in systemd’s boot output we hope to shame the
firmware writers to clean up their stuff.

[3] And anyways, guess which project includes a library “libnih” — Upstart or systemd?[4]

[4] Hint: it’s not systemd!

systemd for Administrators, Part XVIII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/resources.html

Hot on the heels of the previous story, here’s now the eighteenth
installment of my ongoing series on systemd for Administrators:

Managing Resources

An important facet of modern computing is resource management: if
you run more than one program on a single machine you want to assign
the available resources to them enforcing particular policies. This is
particularly crucial on smaller, embedded or mobile systems where the
scarce resources are the main constraint, but equally for large
installations such as cloud setups, where resources are plenty, but
the number of programs/services/containers on a single node is
drastically higher.

Traditionally, on Linux only one policy was really available: all
processes got about the same CPU time, or IO bandwidth, modulated a bit
via the process nice value. This approach is very simple and
covered the various uses for Linux quite well for a long
time. However, it has drawbacks: not all processes deserve to be
treated equally, and services involving lots of processes (think:
Apache with a lot of CGI workers) would this way get more resources
than services with very few (think: syslog).

When thinking about service management for systemd, we quickly
realized that resource management must be core functionality of it. In
a modern world — regardless if server or embedded — controlling CPU,
Memory, and IO resources of the various services cannot be an
afterthought, but must be built-in as first-class service settings. And
it must be per-service and not per-process as the traditional nice
values or POSIX
Resource Limits
were.

In this story I want to shed some light on what you can do to
enforce resource policies on systemd services. Resource Management in
one way or another has been available in systemd for a while already,
so it’s really time we introduce this to the broader audience.

In an earlier blog post I highlighted the difference between Linux
Control Groups (cgroups) as a labelled, hierarchical grouping mechanism,
and Linux cgroups as a resource controlling subsystem. While systemd
requires the former, the latter is optional. And this optional latter
part is now what we can make use of to manage per-service
resources. (At this point, it’s probably a good idea to read up on
cgroups before reading on, to get at least a basic idea what they are
and what they accomplish. Even though the explanations below will be pretty
high-level, it all makes a lot more sense if you grok the background a
bit.)

The main Linux cgroup controllers for resource management are cpu,
memory and blkio. To
make use of these, they need to be enabled in the kernel, which many
distributions (including Fedora) do. systemd exposes a couple of high-level service
settings to make use of these controllers without requiring too much
knowledge of the gory kernel details.

Managing CPU

As a nice default, if the cpu controller is enabled in the
kernel, systemd will create a cgroup for each service when starting
it. Without any further configuration this already has one nice
effect: on a systemd system every system service will get an even
amount of CPU, regardless of how many processes it consists of. Or in
other words: on your web server MySQL will get roughly the same amount
of CPU as Apache, even if the latter consists of a 1000 CGI script
processes, but the former only of a few worker tasks. (This behavior can
be turned off, see DefaultControllers=
in /etc/systemd/system.conf.)

On top of this default, it is possible to explicitly configure the
CPU shares a service gets with the CPUShares=
setting. The default value is 1024, if you increase this number you’ll
assign more CPU to a service than an unaltered one at 1024, if you decrease it, less.

Let’s see in more detail, how we can make use of this. Let’s say we
want to assign Apache 1500 CPU shares instead of the default of
1024. For that, let’s create a new administrator service file for
Apache in /etc/systemd/system/httpd.service, overriding the
vendor supplied one in /usr/lib/systemd/system/httpd.service,
but let’s change the CPUShares= parameter:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500

The first line will pull in the vendor service file. Now, let’s
reload systemd’s configuration and restart Apache so that the new
service file is taken into account:

systemctl daemon-reload
systemctl restart httpd.service

And yeah, that’s already it, you are done!

(Note that setting CPUShares= in a unit file will cause the
specific service to get its own cgroup in the cpu hierarchy,
even if cpu is not included in
DefaultControllers=.)

Analyzing Resource Usage

Of course, changing resource assignments without actually
understanding the resource usage of the services in question is like
flying blind. To help you understand the resource usage of all
services, we created the tool systemd-cgtop,
which will enumerate all cgroups of the system, determine their
resource usage (CPU, Memory, and IO) and present them in a top-like fashion. Building
on the fact that systemd services are managed in cgroups this tool
hence can present to you for services what top shows you for
processes.
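
Invoking it is straightforward; for example (the option shown is just
one way to run it, -d selects the refresh delay in seconds):

# systemd-cgtop -d 5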

Unfortunately, by default cgtop will only be able to chart
CPU usage per-service for you; IO and memory are only tracked as totals
for the entire machine. The reason for this is simply that by default
there are no per-service cgroups in the blkio and
memory controller hierarchies but that’s what we need to
determine the resource usage. The best way to get this data for all
services is to simply add the memory and blkio
controllers to the aforementioned DefaultControllers= setting
in system.conf.
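
As a sketch (assuming your kernel has all three controllers enabled),
the relevant bit of /etc/systemd/system.conf would then look something
like this:

# /etc/systemd/system.conf
[Manager]
DefaultControllers=cpu memory blkio

Newly started services will then get their own cgroups in all three
hierarchies, and systemd-cgtop can chart memory and IO per service,
too.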

Managing Memory

To enforce limits on memory systemd provides the
MemoryLimit= and MemorySoftLimit= settings for
services, summing up the memory of all their processes. These settings
take memory sizes in bytes that are the total memory limit for the
service, and understand the usual K, M, G, T suffixes for
Kilobyte, Megabyte, Gigabyte, Terabyte (to the base of 1024).

.include /usr/lib/systemd/system/httpd.service

[Service]
MemoryLimit=1G

(Analogous to CPUShares= above, setting this option will cause
the service to get its own cgroup in the memory cgroup
hierarchy.)

Managing Block IO

To control block IO multiple settings are available. First of all
BlockIOWeight= may be used which assigns an IO weight
to a specific service. In behaviour the weight concept is not
unlike the shares concept of CPU resource control (see
above). However, the default weight is 1000, and the valid range is
from 10 to 1000:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500

Optionally, per-device weights can be specified:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750

Instead of specifying an actual device node you may also specify any
path in the file system:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/home/lennart 750

If the specified path does not refer to a device node systemd will
determine the block device /home/lennart is on, and assign
the bandwidth weight to it.

You can even add per-device and normal lines at the same time,
which will set the per-device weight for the device, and the other
value as default for everything else.
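
As a sketch combining the two examples above, that could look like
this:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500
BlockIOWeight=/home/lennart 750

Here 750 is the weight for the device backing /home/lennart, and 500
the default weight for all other devices.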

Alternatively one may control explicit bandwidth limits with the
BlockIOReadBandwidth= and BlockIOWriteBandwidth=
settings. These settings take a pair of device node and bandwidth rate
(in bytes per second) or of a file path and bandwidth rate:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOReadBandwidth=/var/log 5M

This sets the maximum read bandwidth on the block device backing
/var/log to 5 MB/s.
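
The write direction works the same way; as a sketch (the device node
here is just an example):

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWriteBandwidth=/dev/sda 10M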

(Analogous to CPUShares= and MemoryLimit=, using
any of these three settings will result in the service getting its own
cgroup in the blkio hierarchy.)

Managing Other Resource Parameters

The options described above cover only a small subset of the
available controls the various Linux control group controllers
expose. We picked these and added high-level options for them since we
assumed that these are the most relevant for most folks, and that they
really needed a nice interface that can handle units properly and
resolve block device names.

In many cases the options explained above might not be sufficient
for your usecase, but a low-level kernel cgroup setting might help. It
is easy to make use of these options from systemd unit files, without
having them covered with a high-level setting. For example, sometimes
it might be useful to set the swappiness of a service. The
kernel makes this controllable via the memory.swappiness
cgroup attribute, but systemd does not expose it as a high-level
option. Here’s how you use it nonetheless, using the low-level
ControlGroupAttribute= setting:

.include /usr/lib/systemd/system/httpd.service

[Service]
ControlGroupAttribute=memory.swappiness 70

(Analogous to the other cases, this too causes the service to be
added to the memory hierarchy.)

Later on we might add more high-level controls for the
various cgroup attributes. In fact, please ping us if you frequently
use one and believe it deserves more focus. We’ll consider adding a
high-level option for it then. (Even better: send us a patch!)

Disclaimer: note that making use of the various resource
controllers does have a runtime impact on the system. Enforcing
resource limits comes at a price. If you do use them, certain
operations do get slower. Especially the memory controller
has (or used to have?) a bad reputation for coming at a performance
cost.

For more details on all of this, please have a look at the
documentation of the mentioned
unit settings,
and of the cpu,
memory
and blkio
controllers.

And that’s it for now. Of course, this blog story only focused on
the per-service resource settings. On top of this, you can also
set the more traditional, well-known per-process resource
settings, which will then be inherited by the various subprocesses,
but always only be enforced per-process. More specifically these are
IOSchedulingClass=, IOSchedulingPriority=,
CPUSchedulingPolicy=, CPUSchedulingPriority=,
CPUAffinity=, LimitCPU= and related. These do not
make use of cgroup controllers and have a much lower performance
cost. We might cover those in a later article in more detail.

systemd for Administrators, Part XVI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/serial-console.html

And, yes, here’s now the sixteenth installment of my ongoing series on
systemd for Administrators:

Gettys on Serial Consoles (and Elsewhere)

TL;DR: To make use of a serial console, just use
console=ttyS0 on the kernel command line, and systemd will
automatically start a getty on it for you.
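
For instance, the full kernel command line in your boot loader configuration might then look roughly like this (the root device and baud rate are just examples):

root=/dev/sda1 quiet console=ttyS0,115200n8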

While physical RS232 serial ports
have become exotic in today’s PCs, they play an important role in
modern servers and embedded hardware. They provide a relatively robust
and minimalistic way to access the console of your device, which works
even when the network is hosed or the primary UI is unresponsive. VMs
frequently emulate a serial port as well.

Of course, Linux has always had good support for serial consoles,
but with systemd we
tried to make serial console support even simpler to use. In the
following text I’ll try to give an overview of how serial console gettys on
systemd work, and how TTYs of any kind are handled.

Let’s start with the key take-away: in most cases, to get a login
prompt on your serial console you don’t need to do anything. systemd
checks the kernel configuration for the selected kernel console and
will simply spawn a serial getty on it. That way it is entirely
sufficient to configure your kernel console properly (for example, by
adding console=ttyS0 to the kernel command line) and that’s
it. But let’s have a look at the details:

In systemd, two template units are responsible for bringing up a
login prompt on text consoles:

  1. [email protected] is responsible for virtual
    terminal
    (VT) login prompts, i.e. those on your VGA screen as
    exposed in /dev/tty1 and similar devices.
  2. [email protected] is responsible for all other
    terminals, including serial ports such as /dev/ttyS0. It
    differs in a couple of ways from [email protected]: among other
    things the $TERM environment variable is set to
    vt102 (hopefully a good default for most serial terminals)
    rather than linux (which is the right choice for VTs only),
    and special logic that clears the VT scrollback buffer (which only
    works on VTs) is skipped.
Virtual Terminals

Let’s have a closer look at how getty@.service is started,
i.e. how login prompts on the virtual terminals (i.e. non-serial TTYs)
work. Traditionally, the init system on Linux machines was configured
to spawn a fixed number of login prompts at boot. In most cases six
instances of the getty program were spawned, on the first six VTs,
tty1 to tty6.

In a systemd world we made this more dynamic: in order to make
things more efficient login prompts are now started on demand only. As
you switch to the VTs the getty service is instantiated to
[email protected], [email protected] and so
on. Since we don’t have to unconditionally start the getty processes
anymore this allows us to save a bit of resources, and makes start-up
a bit faster. This behaviour is mostly transparent to the user: if the
user activates a VT the getty is started right away, so that the user
will hardly notice that it wasn’t running all the time. If he then
logs in and types ps he’ll notice however that getty
instances are only running for the VTs he switched to so far.

By default this automatic spawning is done for the VTs up to VT6
only (in order to be close to the traditional default configuration of
Linux systems)[1]. Note that the auto-spawning of gettys
is only attempted if no other subsystem took possession of the VTs
yet. More specifically, if a user makes frequent use of fast user
switching
via GNOME he’ll get his X sessions on the first six VTs,
too, since the lowest available VT is allocated for each session.

Two VTs are handled specially by the auto-spawning logic: firstly
tty1 gets special treatment: if we boot into graphical mode
the display manager takes possession of this VT. If we boot into
multi-user (text) mode a getty is started on it — unconditionally,
without any on-demand logic[2].

Secondly, tty6 is
especially reserved for auto-spawned gettys and unavailable to other
subsystems such as X[3]. This is done in order to ensure
that there’s always a way to get a text login, even if due to
fast user switching X took possession of more than 5 VTs.

Serial Terminals

Handling of login prompts on serial terminals (and all other kinds
of non-VT terminals) is different from that of VTs. By default systemd
will instantiate one serial-getty@.service on the main
kernel[4] console, if it is not a virtual terminal. The
kernel console is where the kernel outputs its own log messages and is
usually configured on the kernel command line in the boot loader via
an argument such as console=ttyS0[5]. This logic ensures that
when the user asks the kernel to redirect its output onto a certain
serial terminal, he will automatically also get a login prompt on it
as the boot completes[6]. systemd will also spawn a login
prompt on the first special VM console (that’s /dev/hvc0,
/dev/xvc0, /dev/hvsi0), if the system is run in a VM
that provides these devices. This logic is implemented in a generator
called systemd-getty-generator
that is run early at boot and pulls in the necessary services
depending on the execution environment.

In many cases, this automatic logic should already suffice to get
you a login prompt when you need one, without any specific
configuration of systemd. However, sometimes there’s the need to
manually configure a serial getty, for example, if more than one
serial login prompt is needed or the kernel console should be
redirected to a different terminal than the login prompt. To
facilitate this it is sufficient to instantiate
[email protected] once for each serial port you want it
to run on[7]:

# systemctl enable [email protected]
# systemctl start [email protected]

And that’s it. This will make sure you get the login prompt on the
chosen port on all subsequent boots, and starts it right away,
too.

Sometimes, there’s the need to configure the login prompt in even
more detail. For example, if the default baud rate configured by the
kernel is not correct or other agetty parameters need to
be changed. In such a case simply copy the default unit template to
/etc/systemd/system and edit it there:

# cp /usr/lib/systemd/system/[email protected] /etc/systemd/system/[email protected]
# vi /etc/systemd/system/[email protected]
 .... now make your changes to the agetty command line ...
# ln -s /etc/systemd/system/[email protected] /etc/systemd/system/getty.target.wants/
# systemctl daemon-reload
# systemctl start [email protected]

This creates a unit file that is specific to serial port
ttyS2, so that you can make specific changes to this port and
this port only.

And this is pretty much all there is to say about serial ports, VTs
and login prompts on them. I hope this was interesting, and please
come back soon for the next installment of this series!

Footnotes

[1] You can easily modify this by changing
NAutoVTs= in logind.conf.

[2] Note that whether the getty on VT1 is started on-demand
or not hardly makes a difference, since VT1 is the default active VT,
so the demand is there at boot anyway.

[3] You can easily change this special reserved VT by
modifying ReserveVT= in logind.conf.

[4] If multiple kernel consoles are used simultaneously, the
main console is the one listed first in
/sys/class/tty/console/active, which is the last one
listed on the kernel command line.

[5] See kernel-parameters.txt
for more information on this kernel command line
option.

[6] Note that agetty -s is used here so that the
baud rate configured on the kernel command line is not altered and
continues to be used by the login prompt.

[7] Note that this systemctl enable syntax only
works with systemd 188 and newer (i.e. F18). On older versions use
ln -s /usr/lib/systemd/system/[email protected]
/etc/systemd/system/getty.target.wants/[email protected] ; systemctl
daemon-reload
instead.

systemd for Administrators, Part XV

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/watchdog.html

Quickly following the previous iteration, here’s now the fifteenth
installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd:
the embedded/mobile folks, the desktop people and the server
folks. While the systems used by embedded/mobile tend to be
underpowered and have few resources available, desktops tend to be
much more powerful machines — but still much less resourceful than
servers. Nonetheless there are surprisingly many features that matter
to both extremes of this axis (embedded and servers), but not the
center (desktops). One of them is support for watchdogs in
hardware and software.

Embedded devices frequently rely on watchdog hardware that resets
the device automatically if software stops responding (more specifically,
stops signalling the hardware in fixed intervals that it is still
alive). This is required to increase reliability and make sure that,
regardless of what happens, the best is attempted to get the system
working again. Functionality like this makes little sense on the
desktop[1]. However, on
high-availability servers watchdogs are frequently used as well.

Starting with version 183 systemd provides full support for
hardware watchdogs (as exposed in /dev/watchdog to
userspace), as well as supervisor (software) watchdog support for
individual system services. The basic idea is the following: if enabled,
systemd will regularly ping the watchdog hardware. If systemd or the
kernel hang this ping will not happen anymore and the hardware will
automatically reset the system. This way systemd and the kernel are
protected from boundless hangs — by the hardware. To make the chain
complete, systemd then exposes a software watchdog interface for
individual services so that they can also be restarted (or some other
action taken) if they begin to hang. This software watchdog logic can
be configured individually for each service, in both the ping frequency
and the action to take. Putting both parts together (i.e. hardware
watchdogs supervising systemd and the kernel, as well as systemd
supervising all other services) we have a reliable way to watchdog
every single component of the system.

To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in
/etc/systemd/system.conf. It defaults to 0 (i.e. no hardware
watchdog use). Set it to a value like 20s and the watchdog is
enabled. After 20s of no keep-alive pings the hardware will reset
itself. Note that systemd will send a ping to the hardware at half the
specified interval, i.e. every 10s. And that’s already all there is to
it. By enabling this single, simple option you have turned on
supervision by the hardware of systemd and the kernel beneath
it.[2]

Note that the hardware watchdog device (/dev/watchdog) is
single-user only. That means that you can either enable this
functionality in systemd, or use a separate external watchdog daemon,
such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be
configured in /etc/systemd/system.conf. It controls the
watchdog interval to use during reboots. It defaults to 10min, and
adds extra reliability to the system reboot logic: if a clean reboot
is not possible and shutdown hangs, we rely on the watchdog hardware
to reset the system abruptly, as an extra safety net.
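
Here’s a minimal sketch of the relevant part of /etc/systemd/system.conf with both options set (the values are just examples):

[Manager]
RuntimeWatchdogSec=20s
ShutdownWatchdogSec=10min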

So much for the hardware watchdog logic. These two options are
really everything that is necessary to make use of hardware
watchdogs. Now, let’s have a look at how to add watchdog logic to
individual services.

First of all, to make a daemon watchdog-supervisable it needs to be
patched to send out “I am alive” signals at regular intervals from its
event loop. Patching this is relatively easy. First, a daemon needs to
read the WATCHDOG_USEC= environment variable. If it is set,
it will contain the watchdog interval in usec, formatted as an ASCII
text string, as it is configured for the service. The daemon should then
issue sd_notify("WATCHDOG=1")
calls every half of that interval. A daemon patched this way should
transparently support watchdog functionality by checking whether the
environment variable is set and honouring the value it is set to.
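
Here’s a minimal sketch of what such a patched main loop might look like, assuming the daemon links against libsystemd-daemon for sd_notify() (the per-iteration work is left out):

#include <stdlib.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

int main(void) {
        const char *e;
        unsigned long long watchdog_usec = 0;

        /* WATCHDOG_USEC= is only set if WatchdogSec= is configured for this service */
        e = getenv("WATCHDOG_USEC");
        if (e)
                watchdog_usec = strtoull(e, NULL, 10);

        for (;;) {
                /* ... one iteration of the daemon's actual work goes here ... */

                if (watchdog_usec > 0) {
                        /* Ping the manager at half the configured interval
                         * (assumes the interval fits into useconds_t) */
                        sd_notify(0, "WATCHDOG=1");
                        usleep(watchdog_usec / 2);
                }
        }
}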

To enable the software watchdog logic for a service (which has been
patched to support the logic pointed out above) it is sufficient to
set WatchdogSec= to the desired failure latency. See systemd.service(5)
for details on this setting. This causes WATCHDOG_USEC= to be
set for the service’s processes and will cause the service to enter a
failure state as soon as no keep-alive ping is received within the
configured interval.

If a service enters a failure state as soon as the watchdog logic
detects a hang, then this is hardly sufficient to build a reliable
system. The next step is to configure whether the service shall be
restarted and how often, and what to do if it then still fails. To
enable automatic service restarts on failure set
Restart=on-failure for the service. To configure how many
times a service shall be attempted to be restarted use the combination
of StartLimitBurst= and StartLimitInterval= which
allow you to configure how often a service may restart within a time
interval. If that limit is reached, a special action can be
taken. This action is configured with StartLimitAction=. The
default is none, i.e. no further action is taken and
the service simply remains in the failure state without any further
attempted restarts. The other three possible values are
reboot, reboot-force and
reboot-immediate. reboot attempts a clean reboot,
going through the usual, clean shutdown logic. reboot-force
is more abrupt: it will not actually try to cleanly shut down any
services, but immediately kills all remaining services and unmounts
all file systems and then forcibly reboots (this way all file systems
will be clean but reboot will still be very fast). Finally,
reboot-immediate does not attempt to kill any process or
unmount any file systems. Instead it just hard reboots the machine
without delay. reboot-immediate hence comes closest to a
reboot triggered by a hardware watchdog. All these settings are
documented in systemd.service(5).

Putting this all together we now have pretty flexible options to
watchdog-supervise a specific service and configure automatic restarts
of the service if it hangs, plus take ultimate action if that doesn’t
help.

Here’s an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn’t pinged
the system manager for longer than 30s, or if it fails otherwise. If it
is restarted this way more often than 4 times in 5min, action is taken
and the system quickly rebooted, with all file systems being clean
when it comes up again.

And that’s already all I wanted to tell you about! With hardware
watchdog support right in PID 1, as well as supervisor watchdog
support for individual services, we should provide everything you need
for most watchdog use cases. Regardless of whether you are building an
embedded or mobile appliance, or working with high-availability
servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with
/dev/watchdog, and why this shouldn’t be kept in a separate
daemon, then please read this again and try to understand that this is
all about the supervisor chain we are building here, where the hardware watchdog
supervises systemd, and systemd supervises the individual
services. Also, we believe that a service not responding should be
treated in a similar way as any other service error. Finally, pinging
/dev/watchdog is one of the most trivial operations in the OS
(basically little more than an ioctl() call), so the support for this
is no more than a handful of lines of code. Maintaining this externally
with complex IPC between PID 1 (and the daemons) and this watchdog
daemon would be drastically more complex, error-prone and resource
intensive.)

Note that the built-in hardware watchdog support of systemd does
not conflict with other watchdog software by default. systemd does not
make use of /dev/watchdog by default, and you are welcome to
use external watchdog daemons in conjunction with systemd, if this
better suits your needs.

And one last thing: if you wonder whether your hardware has a
watchdog, then the answer is: almost definitely yes — if it is anything
more recent than a few years old. If you want to verify this, try the wdctl
tool from recent util-linux, which shows you everything you need to
know about your watchdog hardware.

I’d like to thank the great folks from Pengutronix for contributing
most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog
hardware these days too, as this is cheap to build and available in
most modern PC chipsets.

[2] So, here’s a free tip for you if you hack on the core
OS: don’t enable this feature while you hack. Otherwise your system
might suddenly reboot if you are in the middle of tracing through PID
1 with gdb and cause it to be stopped for a moment, so that no
hardware ping can be done…

The Most Awesome, Least-Advertised Fedora 17 Feature

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/multi-seat.html

There’s one feature in the upcoming Fedora 17 release that is
immensely useful but very little known, since its feature page
‘ckremoval’
does not explicitly refer to it in its name: true
automatic multi-seat support for Linux.

A multi-seat computer is a system that offers not only one local
seat for a user, but multiple, at the same time. A seat refers to a
combination of a screen, a set of input devices (such as mice and
keyboards), and maybe an audio card or webcam, as individual local
workplace for a user. A multi-seat computer can drive an entire class
room of seats with only a fraction of the cost in hardware, energy,
administration and space: you only have one PC, which usually has way
enough CPU power to drive 10 or more workplaces. (In fact, even a
Netbook has fast enough to drive a couple of seats!) Automatic
multi-seat
refers to an entirely automatically managed seat setup:
whenever a new seat is plugged in a new login screen immediately
appears — without any manual configuration –, and when the seat is
unplugged all user sessions on it are removed without delay.

In Fedora 17 we added this functionality to the low-level user and
device tracking of systemd, replacing the previous ConsoleKit logic
that lacked support for automatic multi-seat. With all the ground work
done in systemd, udev and the other components of our plumbing layer
the last remaining bits were surprisingly easy to add.

Currently, the automatic multi-seat logic works best with the USB
multi-seat hardware from Plugable that you can buy cheaply on Amazon
(US). These devices require exactly zero configuration with the
new scheme implemented in Fedora 17: just plug them in at any time,
login screens pop up on them, and you have your additional
seats. Alternatively you can also assemble your seat manually with a
few easy loginctl attach commands, from any kind of hardware you might
have lying around. To get a full seat you need multiple graphics cards,
keyboards and mice: one set for each seat. (Later on we’ll probably have
a graphical setup utility for additional seats, but that’s not a
pressing issue we believe, as the plug-n-play multi-seat support with
the Plugable devices is so awesomely nice.)
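
For example, assigning a graphics device to a new seat might look roughly like this (the seat name and sysfs path are hypothetical):

# loginctl attach seat-two /sys/devices/pci0000:00/0000:00:02.0/drm/card1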

Plugable provided us for free with hardware for testing
multi-seat. They are also involved with the upstream development of
the USB DisplayLink driver for Linux. Due to their positive
involvement with Linux we can only recommend to buy their
hardware. They are good guys, and support Free Software the way all
hardware vendors should! (And besides that, their hardware is also
nicely put together. For example, in contrast to most similar vendors
they actually assign proper vendor/product IDs to their USB hardware
so that we can easily recognize their hardware when plugged in to set
up automatic seats.)

Currently, all this magic is only implemented in the GNOME stack
with the biggest component getting updated being the GNOME Display
Manager. On the Plugable USB hardware you get a full GNOME Shell
session with all the usual graphical gimmicks, the same way as on any
other hardware. (Yes, GNOME 3 works perfectly fine on simpler graphics
cards such as these USB devices!) If you are hacking on a different
desktop environment, or on a different display manager, please have a
look at the
multi-seat documentation
we put together, and particularly at
our short piece about writing
display managers
which are multi-seat capable.

If you work on a major desktop environment or display manager and
would like to implement multi-seat support for it, but lack the
aforementioned Plugable hardware, we might be able to provide you with
the hardware for free. Please contact us directly, and we might be
able to send you a device. Note that we don’t have unlimited devices
available, hence we’ll probably not be able to pass hardware to
everybody who asks, and we will pass the hardware preferably to people
who work on well-known software or otherwise have contributed good
code to the community already. Anyway, if in doubt, ping us, and
explain to us why you should get the hardware, and we’ll consider you!
(Oh, and this not only applies to display managers, if you hack on some other
software where multi-seat awareness would be truly useful, then don’t
hesitate and ping us!)

Phoronix has this
story about this new multi-seat
support which is quite interesting and
full of pictures. Please have a look.

Plugable started a Pledge
drive
to lower the price of the Plugable USB multi-seat terminals
further. It’s full of pictures (and a video showing all this in action!), and uses the code we now make
available in Fedora 17 as a base. Please consider pledging a few
bucks.

Recently David Zeuthen added
multi-seat support to udisks
as well. With this in place, a user
logged in on a specific seat can only see the USB storage plugged into
his individual seat, but does not see any USB storage plugged into any
other local seat. With this in place we closed the last missing bit of
multi-seat support in our desktop stack.

With this code in Fedora 17 we cover the big use cases of
multi-seat already: internet cafes, class rooms and similar
installations can provide PC workplaces cheaply and easily without any
manual configuration. Later on we want to build on this and make this
useful for different uses too: for example, the ability to get a login
screen as easily as plugging in a USB connector makes this useful not
only for saving money in setups for many people, but also in embedded
environments (consider monitoring/debugging screens made available via
this hotplug logic) or servers (get trivially quick local access to
your otherwise head-less server). To be truly useful in these areas we
need one more thing though: the ability to run a simple getty
(i.e. text login) on the seat, without necessarily involving a
graphical UI.

The well-known X successor Wayland already comes out of the box with multi-seat
support based on this logic.

Oh, and BTW, as Ubuntu appears to be “focussing” on “clarity” in the
“cloud” now ;-), and chose Upstart instead of systemd, this feature
won’t be available in Ubuntu any time soon. That’s (one detail of) the
price Ubuntu has to pay for choosing to maintain its own (largely
legacy, such as ConsoleKit) plumbing stack.

Multi-seat has a long history on Unix. Since the earliest days Unix
systems could be accessed by multiple local terminals at the same
time. Since then local terminal support (and hence multi-seat)
gradually moved out of view in computing. Very few machines these
days have more than one seat; the concept of terminals survived almost
exclusively in the context of PTYs (i.e. fully virtualized API
objects, disconnected from any real hardware seat) and VCs (i.e. a
single virtualized local seat), but hardly in any other way (well,
server setups still use serial terminals for emergency remote access,
but they almost never have more than one serial terminal). Everything we
do in systemd is based on the ideas originally brought forward in
Unix; with systemd we now try to bring back a number of the good ideas
of Unix that since the old times were lost on the roadside. For
example, in true Unix style we already started to expose the concept
of a service in the file system (in
/sys/fs/cgroup/systemd/system/), something where on Linux the
(often misunderstood) “everything is a file” mantra previously
fell short. With automatic multi-seat support we bring back support
for terminals, but updated with all the features of today’s desktops:
plug and play, zero configuration, full graphics, and not limited to
input devices and screens, but extending to all kinds of devices, such
as audio, webcams or USB memory sticks.

Anyway, this is all for now; I’d like to thank everybody who was
involved with making multi-seat work so nicely and natively on the
Linux platform. You know who you are! Thanks a ton!

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-3.html

It has been way too long since my last status update on
systemd. Here’s another short, non-comprehensive status update on
what we worked on for systemd since then.

We have been working hard to turn systemd into the most viable set
of components to build operating systems, appliances and devices from,
and make it the best choice for servers, for desktops and for embedded
environments alike. I think we have a really convincing set of
features now, but we are actively working on making it even
better.

Here’s a list of some more and some less interesting features, in
no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar
    to how git has it.
  2. systemctl learnt a new switch --failed, to show only
    failed services.
  3. You may now start services immediately, overriding all dependency
    logic by passing --ignore-dependencies to
    systemctl. This is mostly a debugging tool and nothing people
    should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown
    logic of services is now optional and may be configured with the
    SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into their own project systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO
    special files as well as character and block device nodes, and
    symlinks. It also is capable of relabelling certain directories at
    boot now (in the SELinux sense).
  7. Immediately before shutting down we will now invoke all binaries
    found in /lib/systemd/system-shutdown/, which is useful for
    debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes
    (unless individual service configuration overrides it).
  9. There’s a new ConditionVirtualization= option, that makes
    systemd skip a specific service if a certain virtualization technology
    is found or not found. Similarly, we now have a new option to detect
    whether a certain security technology (such as SELinux) is available,
    called ConditionSecurity=. There’s also
    ConditionCapability= to check whether a certain process
    capability is in the capability bounding set of the system. There’s
    also a new ConditionFileIsExecutable=,
    ConditionPathIsMountPoint=,
    ConditionPathIsReadWrite=,
    ConditionPathIsSymbolicLink=.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be “triggering” and “mandatory”, meaning that
    they can be a necessary requirement to hold for a service to start, or
    simply one trigger among many.
  12. At boot time we now print warnings if: /usr is on a split-off
    partition but not already mounted by an initrd; if /etc/mtab
    is not a symlink to /proc/mounts; or if CONFIG_CGROUPS is
    not enabled in the kernel. We’ll also expose this as a
    tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in
    Linux namespace containers and will get a clean boot in both
    cases. This is more complicated than it sounds since device management
    with udev or write access to /sys, /proc/sys or
    things like /dev/kmsg is not available in a container. This
    makes systemd a first-class choice for managing thin container
    setups. This is all tested with systemd’s own systemd-nspawn
    tool but should work fine in LXC setups, too. Basically this means
    that you do not have to adjust your OS manually to make it work in a
    container environment, but will just work out of the box. It also
    makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of
    D-Bus machine ID logic. See this blog story for more
    information. On stateless/read-only systems
    the machine ID is initialized randomly at boot. In virtualized
    environments it may be passed in from the machine manager (with qemu’s
    -uuid switch, or via the container
    interface).
  16. All of the systemd-specific /etc/fstab mount options are
    now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now
    implicitly prefix all LSB and SysV init script descriptions with the
    strings “LSB:” resp. “SYSV:“.
  18. We introduced /run and made it a hard dependency of
    systemd. This directory is now widely accepted and implemented on all
    relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn,
    a really powerful tool that can be used to start containers for
    debugging, building and testing, much like chroot(1). It is useful to
    just get a shell inside a build tree, but is good enough to boot up a
    full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit
    TAB to hide the asterisks we normally show for each key that is
    entered, for extra paranoia.
  22. We don’t enable udev-settle.service anymore, which is
    only required for certain legacy software that still hasn’t been
    updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to
    bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel’s binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot()
    environment and will work accordingly (i.e. apply changes to the tree
    it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There’s a new unit dependency type OnFailureIsolate= that
    allows entering a different target whenever a certain unit fails. For
    example, this is interesting to enter emergency mode if file system
    checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files
    from /proc and POSIX message queues, too.
  28. There’s a new IgnoreOnIsolate= flag which may be used to
    ensure certain units are left untouched by isolation requests. There’s
    a new IgnoreOnSnapshot= flag which may be used to exclude
    certain units from snapshot units when they are created.
  29. There are now small mechanism services for changing the local
    hostname and other host meta data, changing the system locale and
    console settings, and the system clock.
  30. We now limit the capability bounding set for a number of our
    internal services by default.
  31. Plymouth may now be disabled globally with
    plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finishes running (and
    optionally other tools run on VTs). This adds extra security since it
    clears up the scrollback buffer so that subsequent users cannot get
    access to a user’s session output.
  33. In socket units there are now options to control the
    IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED,
    SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger
    than the default system settings if needed by using
    SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID
    1, in order to avoid time jumps during normal userspace operation, and
    to guarantee sensible times on all generated logs. We also no longer
    save the system clock to the RTC on shutdown, assuming that this is
    done by the clock control tool when the user modifies the time, or
    automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to
    /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps track
    of logged in users and their sessions. It creates control groups for
    them, implements the XDG_RUNTIME_DIR
    specification
    for them, maintains seats and device node ACLs and
    implements shutdown/idle inhibiting for clients. It auto-spawns gettys
    on all local VTs when the user switches to them (instead of starting
    six of them unconditionally), thus reducing the resource foot print by
    default. It has a D-Bus interface as well as a
    simple synchronous library interface. This mechanism obsoletes
    ConsoleKit which is now deprecated and should no longer be used.
  38. There’s now full, automatic multi-seat support, and this is
    enabled in GNOME 3.4. Just by plugging in new seat hardware you get a
    new login screen on your seat’s screen.
  39. There is now an option ControlGroupModify= to allow
    services to change the properties of their control groups dynamically,
    and one to make control groups persistent in the tree
    (ControlGroupPersistent=) so that they can be created and
    maintained by external tools.
  40. We now jump back into the initrd at shutdown, so that it can
    detach the root file system and the storage devices backing it. This
    allows us (for the first time!) to reliably undo complex storage setups
    on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and
    administrators to define their own policies on whether services should
    be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking
    units. There’s also a new command (systemctl list-unit-files)
    for determining the list of all installed unit files and whether
    they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it
    appears. This makes /etc/sysctl.d compatible with hot-plug
    network devices.
  44. There’s limited profiling for SELinux start-up performance built
    into PID 1.
  45. There’s a new switch PrivateNetwork=
    to turn off any network access for a specific service.
  46. Service units may now include configuration for control group
    parameters. A few (such as MemoryLimit=) are exposed with
    high-level options, and all others are available via the generic
    ControlGroupAttribute= setting.
  47. There’s now the option to mount certain cgroup controllers
    jointly at boot. We do this now for cpu and
    cpuacct by default.
  48. We added the
    journal
    and turned it on by default.
  49. All service output is now written to the Journal by default,
    regardless whether it is sent via syslog or simply written to
    stdout/stderr. Both message streams end up in the same location and
    are interleaved the way they should. All log messages even from the
    kernel and from early boot end up in the journal. Now, no service
    output gets unnoticed and is saved and indexed at the same
    location.
  50. systemctl status will now show the last 10 log lines for
    each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console,
    again. We also show the much loved colorful [ OK ] status
    messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container
    managers
    and initrds
    for passing execution data to systemd. We also implemented and
    documented an
    interface for storage daemons that are required to back the root file
    system
    .
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won’t show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource
    usage of whole services in a top(1) like fashion.
  57. systemd may now supervise services in watchdog style. If enabled
    for a service, the daemon has to ping PID 1 in regular intervals
    or is otherwise considered failed (which might then result in
    restarting it, or even rebooting the machine, as configured). Also,
    PID 1 is capable of pinging a hardware watchdog. Putting this
    together, the hardware watchdog supervises PID 1, and PID 1 then
    watchdogs specific services. This is highly useful for
    high-availability servers as well as embedded machines. Since
    watchdog hardware is nowadays built into all modern chipsets
    (including desktop chipsets), this should hopefully help to make
    this a more widely used functionality.
  58. We added support for a new kernel command line option
    systemd.setenv= to set an environment variable
    system-wide.
  59. By default services which are started by systemd will have SIGPIPE
    set to ignored. The Unix SIGPIPE logic is used to reliably implement
    shell pipelines and when left enabled in services is usually just a
    source of bugs and problems.
  60. You may now configure the rate limiting that is applied to
    restarts of specific services. Previously the rate limiting parameters
    were hard-coded (similar to SysV).
  61. There’s now support for loading the IMA integrity policy into the
    kernel early in PID 1, similar to how we already did it with the
    SELinux policy.
  62. There’s now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt
    an official tool in the tool set. Since we already had code to detect
    certain VM and container environments we now added an official tool
    for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous
    interfaces
    systemd introduced.

Much of the stuff above is already available in Fedora 15 and 16,
or will be made available in the upcoming Fedora 17.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you the details.

I’d like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

/etc/os-release

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/os-release.html

One of the new configuration files systemd introduced is
/etc/os-release. It replaces the multitude of per-distribution
release files[1] with a single one. Yesterday we decided to drop
support for systems lacking /etc/os-release in systemd since
recently the majority of the big distributions adopted
/etc/os-release and many small ones did, too[2]. It’s our
hope that by dropping support for non-compliant distributions we gently put
some pressure on the remaining hold-outs to adopt this scheme as well.

I’d like to take the opportunity to explain a bit what the new file offers,
why application developers should care, and why the distributions should adopt
it. Of course, this file is pretty much a triviality in many ways,
but I guess it’s still one that deserves explanation.

So, you ask why this all?

  • It relieves application developers who just want to know the
    distribution they are running on to check for a multitude of individual release files.
  • It provides both a “pretty” name (i.e. one to show to the user), and
    machine parsable version/OS identifiers (i.e. for use in build systems).
  • It is extensible and can easily learn new fields if needed. For example,
    since we want to print a welcome message in the color of your distribution
    at boot we make it possible to configure the ANSI color for that in the
    file, as shown in the example below.
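
For illustration, here’s roughly what the file looks like on Fedora 17 (the field values shown are just examples):

NAME=Fedora
VERSION="17 (Beefy Miracle)"
ID=fedora
VERSION_ID=17
PRETTY_NAME="Fedora 17 (Beefy Miracle)"
ANSI_COLOR="0;34"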

FAQs

There’s already the lsb_release tool for this, why don’t you
just use that?
Well, it’s a very strange interface: a shell script you have
to invoke (and hence spawn asynchronously from your C code), and it’s not
written to be extensible. It’s an optional package in many distributions, and
nothing we’d be happy to invoke as part of early boot in order to show a
welcome message. (In times with sub-second userspace boot times we really don’t
want to invoke a huge shell script for a triviality like showing the welcome
message). The lsb_release tool to us appears to be an attempt at
abstracting distribution checks, where standardization of distribution checks
is needed. It’s simply a badly designed interface. In our opinion, it
has its use as an interface to determine the LSB version itself, but not for
checking the distribution or version.

Why haven’t you adopted one of the generic release files, such as
Fedora’s /etc/system-release?
Well, they are much nicer than
lsb_release, so much is true. However, they are not extensible and
are not really parsable, if the distribution needs to be identified
programmatically or a specific version needs to be verified.

Why didn’t you call this file /etc/bikeshed instead? The name
/etc/os-release sucks!
In a way, I think you kind of answered your
own question there already.

Does this mean my distribution can now drop our equivalent of
/etc/fedora-release?
Unlikely, too much code exists that still
checks for the individual release files, and you probably shouldn’t break that.
This new file makes things easy for applications, not for distributions:
applications can now rely on a single file only, and use it in a nice way.
Distributions will have to continue to ship the old files unless they are
willing to break compatibility here.

This is so useless! My application needs to be compatible with distros
from 1998, so how could I ever make use of the new file? I will have to
continue using the old ones!
True, if you need compatibility with really
old distributions you do. But for new code this might not be an issue, and in
general new APIs are new APIs. So if you decide to depend on it, you add a
dependency on it. However, even if you need to stay compatible it might make
sense to check /etc/os-release first and just fall back to the old
files if it doesn’t exist. The least it does for you is that you don’t need 25+
open() attempts on modern distributions, but just one.
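
Here’s a sketch of that fallback strategy in shell (the legacy file checked here is just one example):

if [ -r /etc/os-release ]; then
    # The file is a simple list of shell-compatible variable assignments
    . /etc/os-release
    echo "Running on $PRETTY_NAME"
elif [ -r /etc/fedora-release ]; then
    echo "Running on $(cat /etc/fedora-release)"
fi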

You evil people are forcing my beloved distro $XYZ to adopt your awful
systemd schemes. I hate you!
You hate too much, my friend. Also, I am
pretty sure it’s not difficult to see the benefit of this new file
independently of systemd, and it’s truly useful on systems without systemd,
too.

I hate what you people do, can I just ignore this? Well, you really
need to work on your constant feelings of hate, my friend. But, to a certain
degree yes, you can ignore this for a while longer. But already, there are a
number of applications making use of this file. You lose compatibility with
those. Also, you are kinda working towards the further balkanization of the
Linux landscape, but maybe that’s your intention?

You guys add a new file because you think there are already too many? You
guys are so confused!
None of the existing files is generic and extensible
enough to do what we want it to do. Hence we had to introduce a new one. We
acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure,
it’s extensible, and we are happy if distributions extend it. Please prefix
your keys with your distribution’s name, however. Or even better: talk to us and
we might be able to update the documentation and make your field standard, if you
convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to
identify the OS it is being built on or is being run on, please consider making
use of this new file, we created it for you. If you work on a distribution, and
your distribution doesn’t support this file yet, please consider adopting this
file, too.

If you are working on a small/embedded distribution, or a legacy-free
distribution we encourage you to adopt only this file and not establish any
other per-distro release file.

Read the documentation for /etc/os-release.

Footnotes

[1] Yes, multitude, there’s at least: /etc/redhat-release,
/etc/SuSE-release, /etc/debian_version,
/etc/arch-release, /etc/gentoo-release,
/etc/slackware-version, /etc/frugalware-release,
/etc/altlinux-release, /etc/mandriva-release,
/etc/meego-release, /etc/angstrom-version,
/etc/mageia-release. And some distributions even have multiple, for
example Fedora has already four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom,
Frugalware have adopted this. (This list is not comprehensive, there are
probably more.)

What You Need to Know When Becoming a Free Software Hacker

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/hinter-den-kulissen.html

Earlier today I gave a presentation at the Technical University Berlin about
things you need to know, things you should expect and things you shouldn’t
expect when you are aspiring to become a successful Free Software Hacker.

I have put my slides up on Google Docs in case you are interested, either
because you are the target audience (i.e. a university student) or because you
need inspiration for a similar talk about the same topic.

The first two slides are in German language, so skip over them. The
interesting bits are all in English. I hope it’s quite comprehensive (though of
course terse). Enjoy:

In case your feed reader/planet messes this up, here’s the non-embedded version.

Oh, and thanks to everybody who reviewed and suggested additions to the slides on +.

Plumbers Conference 2011

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/lpc2011.html

The Linux Plumbers
Conference 2011 in Santa Rosa, CA, USA
is coming nearer (Sep. 7-9).
Together with Kay Sievers I am running the Boot&Init track, and together with
Mark Brown the Audio track.

For both tracks we still need proposals. So if you haven’t submitted
anything yet, please consider doing so, and quickly: if you can
arrange for it, last Sunday would be best, since that was actually the final
deadline. However, the submission form is still open, so if you submit
something really, really quickly we’ll ignore the absence of time travel and the calendar for a bit. So, go,
submit something. Now.

What are we looking for? Well, here’s what I just posted on the
audio-related mailing lists:

So, please consider submitting something if you haven't done so yet. We
are looking for all kinds of technical talks covering everything audio
plumbing related: audio drivers, audio APIs, sound servers, pro audio,
consumer audio. If you can propose something audio related -- like talks
on media controller routing, on audio for ASOC/Embedded, submit
something! If you care for low-latency audio, submit something. If you
care about the Linux audio stack in general, submit something.

LPC is probably the most relevant technical conference on the general
Linux platform, so be sure that if you want your project, your work,
your ideas to be heard then this is the right forum for everything
related to the Linux stack. And the Audio track covers everything in our
Audio Stack, regardless whether it is pro or consumer audio.

And here’s what I posted to the init-related lists:

So, please consider submitting something if you haven't done so yet. We
are looking for all kinds of technical talks covering everything from
the BIOS (i.e. CoreBoot and friends), over boot loaders (i.e. GRUB and
friends), to initramfs (i.e. Dracut and friends) and init systems
(i.e. systemd and friends). If you have something smart to say about any
of these areas or maybe about related tools (i.e. you wrote a fancy new
tool to measure boot performance) or fancy boot schemes in your
favourite Linux based OS (i.e. the new Meego zero second boot ;-)) then
don't hesitate to submit something on the LPC web site, in the Boot&Init
track!

And now, quickly, go to the
LPC website
and post your session proposal in the Audio or the Boot&Init track! Thank you!

systemd for Developers I

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/socket-activation.html

systemd
not only brings improvements for administrators and users, it also
brings a (small) number of new APIs with it. In this blog story (which might
become the first of a series) I hope to shed some light on one of the
most important new APIs in systemd:

Socket Activation

In the original blog
story about systemd
I tried to explain why socket activation is a
wonderful technology to spawn services. Let’s reiterate the background
here a bit.

The basic idea of socket activation is not new. The inetd
superserver was a standard component of most Linux and Unix systems
since time began: instead of spawning all local Internet services
already at boot, the superserver would listen on behalf of the
services and whenever a connection would come in an instance of the
respective service would be spawned. This allowed relatively weak
machines with few resources to offer a big variety of services at the
same time. However it quickly got a reputation for being somewhat
slow: since daemons would be spawned for each incoming connection a
lot of time was spent on forking and initialization of the services
— once for each connection, instead of once for them all.

Spawning one instance per connection was how inetd was primarily
used, even though inetd actually understood another mode: on the first
incoming connection it would notice this via poll() (or
select()) and spawn a single instance for all future
connections. (This was controllable with the
wait/nowait options.) That way the first connection
would be slow to set up, but subsequent ones would be as fast as with
a standalone service. In this mode inetd would work in a true
on-demand mode: a service would be made available lazily when it was
required.

inetd’s focus was clearly on AF_INET (i.e. Internet) sockets. As
time progressed and Linux/Unix left the server niche and became
increasingly relevant on desktops, mobile and embedded environments
inetd was somehow lost in the troubles of time. Its reputation for
being slow, and the fact that Linux’ focus shifted away from only
Internet servers made a Linux machine running inetd (or one of its newer
implementations, like xinetd) the exception, not the rule.

When Apple engineers worked on optimizing the MacOS boot time they
found a new way to make use of the idea of socket activation: they
shifted the focus away from AF_INET sockets towards AF_UNIX
sockets. And they noticed that on-demand socket activation was only
part of the story: much more powerful is socket activation when used
for all local services including those which need to be started
anyway on boot. They implemented these ideas in launchd, a central building
block of modern MacOS X systems, and probably the main reason why
MacOS is so fast booting up.

But, before we continue, let’s have a closer look at what the benefits
of socket activation for non-on-demand, non-Internet services are in
detail. Consider the four services Syslog, D-Bus, Avahi and the
Bluetooth daemon. D-Bus logs to Syslog, hence on traditional Linux
systems it would get started after Syslog. Similarly, Avahi requires
Syslog and D-Bus, hence would get started after both. Finally
Bluetooth is similar to Avahi and also requires Syslog and D-Bus but
does not interface at all with Avahi. Since in a traditional
SysV-based system only one service can be in the process of getting
started at a time, the following serialization of startup would take
place: Syslog → D-Bus → Avahi → Bluetooth (Of course, Avahi and
Bluetooth could be started in the opposite order too, but we have to
pick one here, so let’s simply go alphabetically.). To illustrate
this, here’s a plot showing the order of startup beginning with system
startup (at the top).

Parallelization plot

Certain distributions tried to improve this strictly serialized
start-up: since Avahi and Bluetooth are independent from each other,
they can be started simultaneously. The parallelization is increased,
the overall startup time slightly smaller. (This is visualized in the
middle part of the plot.)

Socket activation makes it possible to start all four services
completely simultaneously, without any kind of ordering. Since the
creation of the listening sockets is moved outside of the daemons
themselves we can start them all at the same time, and they are able
to connect to each other’s sockets right-away. I.e. in a single step
the /dev/log and /run/dbus/system_bus_socket sockets
are created, and in the next step all four services are spawned
simultaneously. When D-Bus then wants to log to syslog, it just writes
its messages to /dev/log. As long as the socket buffer does
not run full it can go on immediately with what else it wants to do
for initialization. As soon as the syslog service catches up it will
process the queued messages. And if the socket buffer runs full then
the client logging will temporarily block until the socket is writable
again, and continue the moment it can write its log messages. That
means the scheduling of our services is entirely done by the kernel:
from the userspace perspective all services are run at the same time,
and when one service cannot keep up the others needing it will
temporarily block on their request but go on as soon as these
requests are dispatched. All of this is completely automatic and
invisible to userspace. Socket activation hence allows us to
drastically parallelize start-up, enabling simultaneous start-up of
services which previously were thought to strictly require
serialization. Most Linux services use sockets as communication
channel. Socket activation allows starting of clients and servers of
these channels at the same time.

But it’s not just about parallelization. It offers a number of
other benefits:

  • We no longer need to configure dependencies explicitly. Since the
    sockets are initialized before all services they are simply available,
    and no userspace ordering of service start-up needs to take place
    anymore. Socket activation hence drastically simplifies configuration
    and development of services.
  • If a service dies its listening socket stays around, not losing a
    single message. After a restart of the crashed service it can continue
    right where it left off.
  • If a service is upgraded we can restart the service while keeping
    around its sockets, thus ensuring the service is continuously
    responsive. Not a single connection is lost during the upgrade.
  • We can even replace a service during runtime in a way that is
    invisible to the client. For example, all systems running systemd
    start up with a tiny syslog daemon at boot which passes all log
    messages written to /dev/log on to the kernel message
    buffer. That way we provide reliable userspace logging starting from
    the first instant of boot-up. Then, when the actual rsyslog daemon is
    ready to start we terminate the mini daemon and replace it with the
    real daemon. And all that while keeping around the original logging
    socket and sharing it between the two daemons and not losing a single
    message. Since rsyslog flushes the kernel log buffer to disk after
    start-up all log messages from the kernel, from early-boot and from
    runtime end up on disk.

For another explanation of this idea consult the original blog
story about systemd.

Socket activation has been available in systemd since its
inception. On Fedora 15 a number of services have been modified to
implement socket activation, including Avahi, D-Bus and rsyslog (to continue with the example above).

systemd’s socket activation is quite comprehensive. Not only are
classic sockets supported, but related technologies as well:

  • AF_UNIX sockets, in the flavours SOCK_DGRAM, SOCK_STREAM and SOCK_SEQPACKET; both in the filesystem and in the abstract namespace
  • AF_INET sockets, i.e. TCP/IP and UDP/IP; both IPv4 and IPv6
  • Unix named pipes/FIFOs in the filesystem
  • AF_NETLINK sockets, to subscribe to certain kernel features. This
    is currently used by udev, but could be useful for other
    netlink-related services too, such as audit.
  • Certain special files like /proc/kmsg or device nodes like /dev/input/*.
  • POSIX Message Queues
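
To illustrate the breadth of this list, here’s a sketch of what a
socket unit covering these technologies could look like. The paths and
names are made up for illustration; the directives are the ones socket
units provide for the respective technologies:

[Socket]
# AF_UNIX stream and sequential-packet sockets in the file system
ListenStream=/run/foobar.sk
ListenSequentialPacket=/run/foobar.sq
# TCP/IP and UDP/IP, covering both IPv4 and IPv6
ListenStream=5555
ListenDatagram=5556
# named pipe/FIFO in the file system
ListenFIFO=/run/foobar.fifo
# AF_NETLINK socket, subscribing to kernel uevents
ListenNetlink=kobject-uevent 1
# special file
ListenSpecial=/proc/kmsg
# POSIX message queue
ListenMessageQueue=/foobar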

A service capable of socket activation must be able to receive its
preinitialized sockets from systemd, instead of creating them
internally. For most services this requires (minimal)
patching. However, since systemd actually provides inetd
compatibility, a service working with inetd will also work with
systemd — which is
quite useful for services like sshd for example.
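
As a sketch of how this inetd compatibility can be put to use for
sshd (the unit names here are illustrative, not taken from any
particular distribution): setting Accept=yes makes systemd spawn one
service instance per connection and pass the connection socket as
standard input and output, just like inetd. An sshd.socket could look
like this:

[Socket]
ListenStream=22
Accept=yes

[Install]
WantedBy=sockets.target

And the matching template unit sshd@.service, running sshd in its
inetd mode:

[Service]
ExecStart=/usr/sbin/sshd -i
StandardInput=socket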

So much for the background of socket activation; let’s now have a
look at how to patch a service to make it socket-activatable. Let’s
start with a theoretical service foobard. (In a later blog post we’ll
focus on a real-life example.)

Our little (theoretical) service includes code like the following for
creating sockets (most services include code like this in one way or
another):

/* Source Code Example #1: ORIGINAL, NOT SOCKET-ACTIVATABLE SERVICE */
...
union {
        struct sockaddr sa;
        struct sockaddr_un un;
} sa;
int fd;

fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
        fprintf(stderr, "socket(): %m\n");
        exit(1);
}

memset(&sa, 0, sizeof(sa));
sa.un.sun_family = AF_UNIX;
strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
        fprintf(stderr, "bind(): %m\n");
        exit(1);
}

if (listen(fd, SOMAXCONN) < 0) {
        fprintf(stderr, "listen(): %m\n");
        exit(1);
}
...

A socket activatable service may use the following code instead:

/* Source Code Example #2: UPDATED, SOCKET-ACTIVATABLE SERVICE */
...
#include "sd-daemon.h"
...
int fd;

if (sd_listen_fds(0) != 1) {
        fprintf(stderr, "No or too many file descriptors received.\n");
        exit(1);
}

fd = SD_LISTEN_FDS_START + 0;
...

systemd might pass you more than one socket (based on
configuration, see below). In this example we are interested in one
only. sd_listen_fds()
returns how many file descriptors are passed. We simply compare that
with 1, and fail if we got more or less. The file descriptors systemd
passes to us are inherited one after the other beginning with fd
#3. (SD_LISTEN_FDS_START is a macro defined to 3). Our code hence just
takes possession of fd #3.

As you can see this code is actually much shorter than the
original. This of course comes at the price that our little service
with this change will no longer work in a non-socket-activation
environment. With minimal changes we can adapt our example to work nicely
both with and without socket activation:

/* Source Code Example #3: UPDATED, SOCKET-ACTIVATABLE SERVICE WITH COMPATIBILITY */
...
#include "sd-daemon.h"
...
int fd, n;

n = sd_listen_fds(0);
if (n > 1) {
        fprintf(stderr, "Too many file descriptors received.\n");
        exit(1);
} else if (n == 1)
        fd = SD_LISTEN_FDS_START + 0;
else {
        union {
                struct sockaddr sa;
                struct sockaddr_un un;
        } sa;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
                fprintf(stderr, "socket(): %m\n");
                exit(1);
        }

        memset(&sa, 0, sizeof(sa));
        sa.un.sun_family = AF_UNIX;
        strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                fprintf(stderr, "bind(): %m\n");
                exit(1);
        }

        if (listen(fd, SOMAXCONN) < 0) {
                fprintf(stderr, "listen(): %m\n");
                exit(1);
        }
}
...

With this simple change our service can now make use of socket
activation but still works unmodified in classic environments. Now,
let’s see how we can enable this service in systemd. For this we have
to write two systemd unit files: one describing the socket, the other
describing the service. First, here’s foobar.socket:

[Socket]
ListenStream=/run/foobar.sk

[Install]
WantedBy=sockets.target

And here’s the matching service file foobar.service:

[Service]
ExecStart=/usr/bin/foobard

If we place these two files in /etc/systemd/system we can
enable and start them:

# systemctl enable foobar.socket
# systemctl start foobar.socket

Now our little socket is listening, but our service is not running
yet. If we now connect to /run/foobar.sk the service will be
automatically spawned, for on-demand service start-up. With a
modification of foobar.service we can start our service
already at startup, thus using socket activation only for
parallelization purposes, not for on-demand auto-spawning anymore:

[Service]
ExecStart=/usr/bin/foobard

[Install]
WantedBy=multi-user.target

And now let’s enable this too:

# systemctl enable foobar.service
# systemctl start foobar.service

Now our little daemon will be started at boot and on-demand,
whichever comes first. It can be started fully in parallel with its
clients, and when it dies it will be automatically restarted when it
is used the next time.

A single .socket file can include multiple ListenXXX stanzas, which
is useful for services that listen on more than one socket. In this
case all configured sockets will be passed to the service in the exact
order they are configured in the socket unit file. Also,
you may configure various socket settings in the .socket
files.
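
As a purely illustrative example (the paths are made up), a socket
unit with two listening sockets and a couple of settings could look
like this; the service would then receive /run/foobar.sk as fd #3 and
the TCP socket as fd #4:

[Socket]
ListenStream=/run/foobar.sk
ListenStream=5555
# access mode of the file system socket
SocketMode=0600
# backlog for listen(2)
Backlog=128

[Install]
WantedBy=sockets.target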

In real life it’s a good idea to include description strings in
these unit files; to keep things simple we’ll leave them out of our
example. Speaking of real life: our next installment will cover an
actual real-life example. We’ll add socket activation to the CUPS
printing server.

The sd_listen_fds() function call is defined in sd-daemon.h
and sd-daemon.c. These
two files are currently drop-in .c sources which projects should
simply copy into their source tree. Eventually we plan to turn this
into a proper shared library, however using the drop-in files allows
you to compile your project in a way that is compatible with socket
activation even without any compile time dependencies on
systemd. sd-daemon.c is liberally licensed, should compile
fine on the most exotic Unixes and the algorithms are trivial enough
to be reimplemented with very little code if the license should
nonetheless be a problem for your project. sd-daemon.c
contains a couple of other API functions besides
sd_listen_fds() that are useful when implementing socket
activation in a project. For example, there’s sd_is_socket()
which can be used to distinguish and identify particular sockets when
a service gets passed more than one.
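
To illustrate just how trivial: here’s a minimal sketch of the core
logic of sd_listen_fds(), written from the documented $LISTEN_PID and
$LISTEN_FDS protocol. This is not the actual sd-daemon.c code, which
does more careful validation and can also unset the variables so they
aren’t inherited by child processes:

/* Minimal sketch of the sd_listen_fds() logic */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define MY_LISTEN_FDS_START 3

static int my_listen_fds(void) {
        const char *e;
        long n, i;

        /* The fds are only meant for us if $LISTEN_PID matches our PID */
        e = getenv("LISTEN_PID");
        if (!e || strtol(e, NULL, 10) != (long) getpid())
                return 0;

        /* $LISTEN_FDS carries the number of fds, starting at fd #3 */
        e = getenv("LISTEN_FDS");
        if (!e)
                return 0;
        n = strtol(e, NULL, 10);

        /* Set FD_CLOEXEC so the sockets don't leak to child processes */
        for (i = 0; i < n; i++)
                fcntl(MY_LISTEN_FDS_START + i, F_SETFD, FD_CLOEXEC);

        return (int) n;
}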

Let me point out that the interfaces used here are in no way bound
directly to systemd. They are generic enough to be implemented in
other systems as well. We deliberately designed them to be as simple
and minimal as possible, to make it easy for others to adopt similar
schemes.

Stay tuned for the next installment. As mentioned, it will cover a
real-life example of turning an existing daemon into a
socket-activatable one: the CUPS printing service. However, I hope
this blog story might already be enough to get you started if you plan
to convert an existing service into a socket-activatable one. We
invite everybody to convert upstream projects to this scheme. If you
have any questions join us on #systemd on freenode.

Why systemd?

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/why.html

systemd is
still a young project, but it is not a baby anymore. I posted the
initial announcement
precisely a year ago. Since then most of the
big distributions have decided to adopt it in one way or another, many
smaller distributions have already switched. The first big
distribution with systemd by default will be Fedora 15, due end of
May. It is expected that the others will follow the lead a bit later
(with one exception). Many
embedded developers have already adopted it too, and there’s even a company specializing in engineering and
consulting services for systemd
. In short: within one year
systemd became a really successful project.

However, there are still folks who we haven’t won over yet. If you
fall into one of the following categories, then please have a look at
the comparison of init systems below:

  • You are working on an embedded project and are wondering whether
    it should be based on systemd.
  • You are a user or administrator and wondering which distribution
    to pick, and are pondering whether it should be based on systemd or
    not.
  • You are a user or administrator and wondering why your favourite
    distribution has switched to systemd, if everything already worked so
    well before.
  • You are developing a distribution that hasn’t switched yet, and
    you are wondering whether to invest the work and go systemd.

And even if you don’t fall into any of these categories, you might still
find the comparison interesting.

We’ll be comparing the three most relevant init systems for Linux:
sysvinit, Upstart and systemd. Of course there are other init systems
in existence, but they play virtually no role in the big
picture. Unless you run Android (which is a completely different beast
anyway), you’ll almost definitely run one of these three init systems
on your Linux kernel. (OK, or busybox, but then you are basically not
running any init system at all.) Unless you have a soft spot for
exotic init systems there’s little need to look further. Also, I am
kinda lazy, and don’t want to spend the time on analyzing those other
systems in enough detail to be completely fair to them.

Speaking of fairness: I am of course one of the creators of
systemd. I will try my best to be fair to the other two contenders,
but in the end, take it with a grain of salt. I am sure though that
should I be grossly unfair or otherwise incorrect somebody will point
it out in the comments of this story, so consider having a look at
those, before you put too much trust in what I say.

We’ll look at the currently implemented features in a released
version. Grand plans don’t count.

General Features

Feature sysvinit Upstart systemd
Interfacing via D-Bus no yes yes
Shell-free bootup no no yes
Modular C coded early boot services included no no yes
Read-Ahead no no[1] yes
Socket-based Activation no no[2] yes
Socket-based Activation: inetd compatibility no no[2] yes
Bus-based Activation no no[3] yes
Device-based Activation no no[4] yes
Configuration of device dependencies with udev rules no no yes
Path-based Activation (inotify) no no yes
Timer-based Activation no no yes
Mount handling no no[5] yes
fsck handling no no[5] yes
Quota handling no no yes
Automount handling no no yes
Swap handling no no yes
Snapshotting of system state no no yes
XDG_RUNTIME_DIR Support no no yes
Optionally kills remaining processes of users logging out no no yes
Linux Control Groups Integration no no yes
Audit record generation for started services no no yes
SELinux integration no no yes
PAM integration no no yes
Encrypted hard disk handling (LUKS) no no yes
SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agents no no yes
Network Loopback device handling no no yes
binfmt_misc handling no no yes
System-wide locale handling no no yes
Console and keyboard setup no no yes
Infrastructure for creating, removing, cleaning up of temporary and volatile files no no yes
Handling for /proc/sys sysctl no no yes
Plymouth integration no yes yes
Save/restore random seed no no yes
Static loading of kernel modules no no yes
Automatic serial console handling no no yes
Unique Machine ID handling no no yes
Dynamic host name and machine meta data handling no no yes
Reliable termination of services no no yes
Early boot /dev/log logging no no yes
Minimal kmsg-based syslog daemon for embedded use no no yes
Respawning on service crash without losing connectivity no no yes
Gapless service upgrades no no yes
Graphical UI no no yes
Built-In Profiling and Tools no no yes
Instantiated services no yes yes
PolicyKit integration no no yes
Remote access/Cluster support built into client tools no no yes
Can list all processes of a service no no yes
Can identify service of a process no no yes
Automatic per-service CPU cgroups to even out CPU usage between them no no yes
Automatic per-user cgroups no no yes
SysV compatibility yes yes yes
SysV services controllable like native services yes no yes
SysV-compatible /dev/initctl yes no yes
Reexecution with full serialization of state yes no yes
Interactive boot-up no[6] no[6] yes
Container support (as advanced chroot() replacement) no no yes
Dependency-based bootup no[7] no yes
Disabling of services without editing files yes no yes
Masking of services without editing files no no yes
Robust system shutdown within PID 1 no no yes
Built-in kexec support no no yes
Dynamic service generation no no yes
Upstream support in various other OS components yes no yes
Service files compatible between distributions no no yes
Signal delivery to services no no yes
Reliable termination of user sessions before shutdown no no yes
utmp/wtmp support yes yes yes
Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management tools no no yes

[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.

[2] Socket activation implementation for Upstart available as preview, lacks parallelization support hence entirely misses the point of socket activation.

[3] Bus activation implementation for Upstart posted as patch, not merged.

[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.

[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.

[6] Some distributions offer this implemented in shell.

[7] LSB init scripts support this, if they are used.

Available Native Service Settings

Setting sysvinit Upstart systemd
OOM Adjustment no yes[1] yes
Working Directory no yes yes
Root Directory (chroot()) no yes yes
Environment Variables no yes yes
Environment Variables from external file no no yes
Resource Limits no some[2] yes
umask no yes yes
User/Group/Supplementary Groups no no yes
IO Scheduling Class/Priority no no yes
CPU Scheduling Nice Value no yes yes
CPU Scheduling Policy/Priority no no yes
CPU Scheduling Reset on fork() control no no yes
CPU affinity no no yes
Timer Slack no no yes
Capabilities Control no no yes
Secure Bits Control no no yes
Control Group Control no no yes
High-level file system namespace control: making directories inaccessible no no yes
High-level file system namespace control: making directories read-only no no yes
High-level file system namespace control: private /tmp no no yes
High-level file system namespace control: mount inheritance no no yes
Input on Console yes yes yes
Output on Syslog no no yes
Output on kmsg/dmesg no no yes
Output on arbitrary TTY no no yes
Kill signal control no no yes
Conditional execution: by identified CPU virtualization/container no no yes
Conditional execution: by file existence no no yes
Conditional execution: by security framework no no yes
Conditional execution: by kernel command line no no yes

[1] Upstart supports only the deprecated oom_adj mechanism, not the current oom_score_adj logic.

[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.

Note that some of these options are relatively easily added to SysV
init scripts, by editing the shell sources. The table above focusses
on easily accessible options that do not require source code
editing.
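
To make this table more concrete, here’s what a few of these settings
look like in a (made-up) service file; the directive names are real,
the values arbitrary:

[Service]
ExecStart=/usr/bin/foobard
User=foobar
Group=foobar
UMask=0077
Environment=FOOBAR_DEBUG=1
LimitNOFILE=16384
Nice=5
IOSchedulingClass=idle
CPUAffinity=0 1
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
PrivateTmp=yes
ReadOnlyDirectories=/var
OOMScoreAdjust=-500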

Miscellaneous

Property sysvinit Upstart systemd
Maturity > 15 years 6 years 1 year
Specialized professional consulting and engineering services available no no yes
SCM Subversion Bazaar git
Copyright-assignment-free contributing yes no yes

Summary

As the tables above hopefully show in all clarity, systemd
has left behind both sysvinit and Upstart in almost every
aspect. With the exception of the project’s age/maturity systemd wins
in every category. At this point in time it will be very hard for
sysvinit and Upstart to catch up with the features systemd provides
today. In one year we managed to push systemd forward much further
than Upstart has been pushed in six.

It is our intention to drive forward the development of the Linux
platform with systemd. In the next release cycle we will focus more
strongly on bringing the same features and speed improvements we
already offer for the system to the user login session. This will
bring much closer integration with the other parts of the OS and
applications, making the most of the features the service manager
provides, and making it available to login sessions. Certain
components such as ConsoleKit will be made redundant by these
upgrades, and services relying on them will be updated. The
burden for maintaining these then obsolete components
will be passed on to the vendors who plan to continue to rely on
them.

If you are wondering whether or not to adopt systemd, then systemd
obviously wins when it comes to mere features. Of course that should
not be the only aspect to keep in mind. In the long run, sticking with
the existing infrastructure (such as ConsoleKit) comes at a price:
porting work needs to take place, and additional maintenance work for
bitrotting code needs to be done. Going it alone means an increased
workload.

That said, adopting systemd is also not free. Especially if you
made investments in the other two solutions adopting systemd means
work. The basic work to adopt systemd is relatively minimal for
porting over SysV systems (since compatibility is provided), but can
mean substantial work when coming from Upstart. If you plan to go for
a 100% systemd system without any SysV compatibility (recommended for
embedded, long run goal for the big distributions) you need to be
willing to invest some work to rewrite init scripts as simple systemd
unit files.
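
As a rough illustration of what such a rewrite involves, a typical
init script of a few hundred lines of shell for a hypothetical daemon
boils down to a handful of declarative lines:

[Unit]
Description=Foobar Daemon
After=network.target

[Service]
ExecStart=/usr/sbin/foobard
Restart=on-failure

[Install]
WantedBy=multi-user.target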

systemd is in the process of becoming a comprehensive, integrated
and modular platform providing everything needed to bootstrap and
maintain an operating system’s userspace. It includes C rewrites of
all basic early boot init scripts that are shipped with the various
distributions. Especially for the embedded case adopting systemd
provides you in one step with almost everything you need, and you can
pick the modules you want. The other two init systems are singular
individual components, which to be useful need a great number of
additional components with differing interfaces. The emphasis of
systemd to provide a platform instead of just a component allows for
closer integration, and cleaner APIs. Sooner or later this will
trickle up to the applications. Already, there are accepted XDG
specifications (e.g. XDG basedir spec, more specifically
XDG_RUNTIME_DIR) that are not supported on the other init systems.

systemd is also a big opportunity for Linux standardization. Since
it standardizes many interfaces of the system that previously have
been differing on every distribution, on every implementation,
adopting it helps to work against the balkanization of the Linux
interfaces. Choosing systemd means redefining more closely
what the Linux platform is about. This improves the lives of
programmers, users and administrators alike.

I believe that momentum is clearly with systemd. We invite you to
join our community and be part of that momentum.