
systemd for Developers III

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journal-submit.html

Here’s the third episode of my
systemd for Developers series.

Logging to the Journal

In a recent blog story intended for administrators I shed some light
on how to use the journalctl(1) tool to browse and search the systemd
journal. In this blog story for developers I want to explain a little
how to get log data into the systemd Journal in the first place.

The good thing is that getting log data into the Journal is not
particularly hard, since there’s a good chance the Journal already
collects it anyway and writes it to disk. The journal collects:

  1. All data logged via libc syslog()
  2. The data from the kernel logged with printk()
  3. Everything written to STDOUT/STDERR of any system service

This covers pretty much all of the traditional log output of a
Linux system, including messages from the kernel initialization phase,
the initial RAM disk, the early boot logic, and the main system
runtime.

syslog()

Let’s have a quick look at how syslog() is used again. Let’s
write a journal message using this call:

#include <syslog.h>

int main(int argc, char *argv[]) {
        syslog(LOG_NOTICE, "Hello World");
        return 0;
}

This is C code, of course. Many higher level languages provide APIs
that allow writing local syslog messages. Regardless of which language
you choose, all data written like this ends up in the Journal.

Let’s see how this looks after it has been written into the
journal (this is the JSON output that journalctl -o json-pretty generates):

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "_TRANSPORT" : "syslog",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "SYSLOG_FACILITY" : "1",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "_PID" : "3068",
        "SYSLOG_IDENTIFIER" : "test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351126905014938"
}

This nicely shows how the Journal implicitly augmented our little
log message with various meta data fields which describe in more
detail the context our message was generated from. For an explanation
of the various fields, please refer to systemd.journal-fields(7).

printf()

If you are writing code that is run as a systemd service, generating journal
messages is even easier:

#include <stdio.h>

int main(int argc, char *argv[]) {
        printf("Hello World\n");
        return 0;
}

Yupp, that’s easy, indeed.

The printed string in this example is logged at a default log
priority of LOG_INFO[1]. Sometimes it is useful to change
the log priority for such a printed string. When systemd parses
STDOUT/STDERR of a service it will look for priority values enclosed
in < > at the beginning of each line[2], following the scheme
used by the kernel’s printk() which in turn took
inspiration from the BSD syslog network serialization of messages. We
can make use of this systemd feature like this:

#include <stdio.h>

#define PREFIX_NOTICE "<5>"

int main(int argc, char *argv[]) {
        printf(PREFIX_NOTICE "Hello World\n");
        return 0;
}

Nice! Logging with nothing but printf() but we still get
log priorities!

This scheme works with any programming language, including, of course, shell:

#!/bin/bash

echo "<5>Hello World"
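
To make the prefix scheme concrete, here is a minimal C sketch of how a line carrying a < > prefix could be split into a priority and the message text. It is only an illustration, not systemd’s actual parser: split_priority_prefix() is a hypothetical helper, and it assumes that only single-digit levels 0 to 7 appear in the prefix.

```c
#include <stddef.h>

/* Hypothetical helper, not part of systemd: splits a "<N>" prefix off
 * the start of a line. Only single-digit levels 0..7 are recognized
 * (an assumption of this sketch); anything else is treated as plain
 * message text logged at default_priority. */
int split_priority_prefix(const char *line, int default_priority,
                          const char **message) {
        if (line[0] == '<' && line[1] >= '0' && line[1] <= '7' && line[2] == '>') {
                *message = line + 3;      /* text after the prefix */
                return line[1] - '0';     /* numeric priority */
        }

        *message = line;                  /* no prefix: keep the line as is */
        return default_priority;
}
```

For the line printed by the examples above this yields priority 5 (LOG_NOTICE) and the bare message text, while a line without a prefix keeps whatever default is passed in.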

Native Messages

Now, what I explained above is not particularly exciting: the
take-away is pretty much only that things end up in the journal if
they are output using the traditional message printing APIs. Yaaawn!

Let’s make this more interesting: let’s look at what the Journal
provides as native APIs for logging, and let’s see what their benefits
are. Let’s translate our little example into the 1:1 counterpart
using the Journal’s logging API sd_journal_print(3):

#include <systemd/sd-journal.h>

int main(int argc, char *argv[]) {
        sd_journal_print(LOG_NOTICE, "Hello World");
        return 0;
}

This doesn’t look much more interesting than the two examples
above, right? After compiling this with `pkg-config --cflags --libs
libsystemd-journal` appended to the compiler parameters,
let’s have a closer look at the JSON representation of the journal
entry this generates:

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World",
        "CODE_LINE" : "4",
        "_PID" : "3516",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351128226954170"
}

This looks pretty much the same, right? Almost! There are three new
fields compared to the earlier output: CODE_FUNC, CODE_FILE and
CODE_LINE. Yes, you guessed it, by using sd_journal_print() meta
information about the generating source code location is implicitly
appended to each message[3], which is helpful for a developer to
identify the source of a problem.

The primary reason for using the Journal’s native logging APIs is
not just the source code location however: it is to allow
passing additional structured log messages from the program into the
journal. This additional log data may then be used to search the
journal, is available for consumption by other programs, and
might help the administrator to track down issues beyond what is
expressed in the human readable message text. Here’s an example of how
to do that with sd_journal_send():

#include <systemd/sd-journal.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        sd_journal_send("MESSAGE=Hello World!",
                        "MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555",
                        "PRIORITY=5",
                        "HOME=%s", getenv("HOME"),
                        "TERM=%s", getenv("TERM"),
                        "PAGE_SIZE=%li", sysconf(_SC_PAGESIZE),
                        "N_CPUS=%li", sysconf(_SC_NPROCESSORS_ONLN),
                        NULL);
        return 0;
}

This will write a log message to the journal much like the earlier
examples. However, this time a few additional, structured fields are
attached:

{
        "__CURSOR" : "s=ac9e9c423355411d87bf0ba1a9b424e8;i=5930;b=5335e9cf5d954633bb99aefc0ec38c25;m=16544f875b;t=4ccd863cdc4f0;x=896defe53cc1a96a",
        "__REALTIME_TIMESTAMP" : "1351129666274544",
        "__MONOTONIC_TIMESTAMP" : "95903778651",
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_PID" : "4049",
        "CODE_LINE" : "6",
        "MESSAGE_ID" : "52fb62f99e2c49d89cfbf9d6de5e3555",
        "HOME" : "/home/lennart",
        "TERM" : "xterm-256color",
        "PAGE_SIZE" : "4096",
        "N_CPUS" : "4",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351129666241467"
}

Awesome! Our simple example worked! The five additional fields we
attached to our message (MESSAGE_ID, HOME, TERM, PAGE_SIZE and N_CPUS)
appeared in the journal. We used sd_journal_send()
for this which works much like sd_journal_print() but takes a
NULL-terminated list of format strings each followed by its
arguments. The format strings must include the field name and a ‘=’
before the values.

Our little structured message included seven fields. The first three we passed are well-known fields:

  1. MESSAGE= is the actual human readable message part of the structured message.
  2. PRIORITY= is the numeric message priority value as known from BSD syslog formatted as an integer string.
  3. MESSAGE_ID= is a 128-bit ID that identifies our specific
    message call, formatted as hexadecimal string. We randomly generated
    this string with journalctl --new-id128. This can be used by
    applications to track down all occurrences of this specific
    message. The 128-bit ID can be a UUID, but this is neither required nor enforced.

Applications may relatively freely define additional fields as they
see fit (we defined four pretty arbitrary ones in our example). A
complete list of the currently well-known fields is available in systemd.journal-fields(7).
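
When picking names for your own fields it helps to check them against the naming convention described in sd_journal_print(3): upper-case letters, digits and underscores only, and no leading underscore, since such names are used for the trusted fields the Journal adds itself. The following sketch encodes that convention. journal_field_name_ok() is a hypothetical helper, and the additional restrictions on a leading digit and on a length of at most 64 characters are assumptions of this sketch about what journald accepts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: checks a custom field name such as "N_CPUS".
 * Assumed rules: A-Z, 0-9 and '_' only, no leading '_' or digit,
 * between 1 and 64 characters long. */
bool journal_field_name_ok(const char *name) {
        size_t len = 0;

        if (!name || name[0] == '_' || (name[0] >= '0' && name[0] <= '9'))
                return false;

        for (const char *p = name; *p; p++, len++) {
                bool upper = *p >= 'A' && *p <= 'Z';
                bool digit = *p >= '0' && *p <= '9';

                if (!upper && !digit && *p != '_')
                        return false;
        }

        return len >= 1 && len <= 64;
}
```

The four custom names used in the example (HOME, TERM, PAGE_SIZE and N_CPUS) pass this check, while a name like _UID or n_cpus does not.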

Let’s see how the message ID helps us find this message and all
its occurrences in the journal:

$ journalctl MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555
-- Logs begin at Thu, 2012-10-18 04:07:03 CEST, end at Thu, 2012-10-25 04:48:21 CEST. --
Oct 25 03:47:46 epsilon test-journal-se[4049]: Hello World!
Oct 25 04:40:36 epsilon test-journal-se[4480]: Hello World!

Seems I already invoked this example tool twice!

Many messages systemd itself generates have message IDs. This is
useful, for example, to find all occurrences where a program dumped
core (journalctl MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1), or when
a user logged in (journalctl
MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66). If your application
generates a message that might be interesting to recognize in the
journal stream later on, we recommend attaching such a message ID to
it. You can easily allocate a new one for your message with journalctl
--new-id128.

This example shows how we can use the Journal’s native APIs to
generate structured, recognizable messages. You can do much more than
this with the C API. For example, you may store binary data in journal
fields as well, which is useful to attach coredumps or hard disk SMART
states to events where this applies. In order to make this blog story
not longer than it already is we’ll not go into detail about how to do
this, and I ask you to check out sd_journal_send(3)
for further information on this.

Python

The examples above focus on C. Structured logging to the Journal is
also available from other languages. Along with systemd itself we ship
bindings for Python. Here’s an example of how to use them:

from systemd import journal
journal.send('Hello world')
journal.send('Hello, again, world', FIELD2='Greetings!', FIELD3='Guten tag')

Other bindings exist for Node.js, PHP and Lua.

Portability

Generating structured data is a very useful feature for services to
make their logs more accessible both for administrators and other
programs. In addition to the implicit structure the Journal
adds to all logged messages it is highly beneficial if the various
components of our stack also provide explicit structure
in their messages, coming from within the processes themselves.

Porting an existing program to the Journal’s logging APIs comes
with one pitfall though: the Journal is Linux-only. If non-Linux
portability matters for your project it’s a good idea to provide an
alternative log output, and make it selectable at compile-time.
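
One way to make the log output selectable at compile-time is a thin macro layer over both APIs. The sketch below is only an illustration: HAVE_JOURNAL, my_log() and MY_LOG_BACKEND are hypothetical names, and the build system is assumed to define HAVE_JOURNAL when the Journal library is available.

```c
#include <syslog.h>

/* HAVE_JOURNAL is a hypothetical macro that the build system would
 * define (e.g. with -DHAVE_JOURNAL) when the Journal API is available. */
#ifdef HAVE_JOURNAL
#include <systemd/sd-journal.h>
/* A macro rather than a function, so that CODE_FILE= and CODE_LINE=
 * keep pointing at the caller instead of at a wrapper function. */
#define my_log(priority, ...) sd_journal_print((priority), __VA_ARGS__)
#define MY_LOG_BACKEND "journal"
#else
/* Portable fallback: plain syslog(), which other Unix systems provide. */
#define my_log(priority, ...) syslog((priority), __VA_ARGS__)
#define MY_LOG_BACKEND "syslog"
#endif
```

Call sites then look the same on every platform, for example my_log(LOG_NOTICE, "Hello World"), and only the build flags decide which backend is used.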

Regardless of which way to log you choose, in all cases we’ll forward
the message to a classic syslog daemon running side-by-side with the
Journal, if there is one. However, much of the structured meta data of
the message is not forwarded since the classic syslog protocol simply
has no generally accepted way to encode this and we shouldn’t attempt
to serialize meta data into classic syslog messages which might turn
/var/log/messages into an unreadable dump of machine
data. Anyway, to summarize this: regardless of whether you log with
syslog(), printf(), sd_journal_print() or
sd_journal_send(), the message will be stored and indexed by
the journal and it will also be forwarded to classic syslog.

And that’s it for today. In a follow-up episode we’ll focus on
retrieving messages from the Journal using the C API, possibly
filtering for a specific subset of messages. Later on, I hope to give
a real-life example of how to port an existing service to the Journal’s
logging APIs. Stay tuned!

Footnotes

[1] This can be changed with the SyslogLevel= service
setting. See systemd.exec(5)
for details.

[2] Interpretation of the < > prefixes of logged lines
may be disabled with the SyslogLevelPrefix= service setting. See systemd.exec(5)
for details.

[3] Appending the code location to the log messages can be
turned off at compile time by defining
-DSD_JOURNAL_SUPPRESS_LOCATION.

systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journalctl.html

It’s that time again, here’s now the seventeenth installment of my
ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some
functionality of the journal, and how it is exposed in
systemctl. In this episode I want to explain a few more uses
of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in
a few words to get you up to speed: the journal is a component of systemd
that captures Syslog messages, Kernel log messages, initial RAM disk
and early boot messages as well as messages written to STDOUT/STDERR
of all services, indexes them and makes them available to the user. It
can be used in parallel with, or in place of, a traditional syslog daemon,
such as rsyslog or syslog-ng. For more information, see the initial
announcement.

The journal has been part of Fedora since F17. With Fedora 18 it
now has grown into a reliable, powerful tool to handle your logs. Note
however, that on F17 and F18 the journal is configured by default to
store logs only in a small ring-buffer in /run/log/journal,
i.e. not persistent. This of course limits its usefulness quite
drastically but is sufficient to show a bit of recent log history in
systemctl status. For Fedora 19, we plan to change this, and
enable persistent logging by default. Then, journal files will be
stored in /var/log/journal and can grow much larger, thus
making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful
structured data into your journal to play with. Oh, and since you have
the journal now, you don’t need syslog anymore (unless having
/var/log/messages as a text file is a necessity for you), so
you can choose to uninstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features
of systemd 195 as it will be included in Fedora 18[1], so
if your F17 can’t do the tricks you see, please wait for F18. First,
let’s start with some basics. To access the logs of the journal use
the journalctl(1)
tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the
system, from system components as well as from logged-in
users. The output you will get looks like a pixel-perfect copy of the
traditional /var/log/messages format, but actually has a
couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of
the output this generates, I cut that out for brevity — and to give
you a reason to try it out yourself with a current image for F18’s
development version with systemd 195. But I do hope you get the idea
anyway.

Access Control

Browsing logs this way is already pretty nice. But having to be
root sucks of course; even administrators tend to do most of their
work as unprivileged users these days. By default, Journal users can
only watch their own logs, unless they are root or in the adm
group. To make watching system logs more fun, let’s add ourselves to
adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access
to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current
log database. Sometimes one needs to watch logs as they grow, where
one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you
the last ten log lines and then wait for changes and show them as
they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the
whole set of logs, beginning with the oldest message stored. That of
course, can be a lot of data. Much more useful is just viewing the
logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the
aforementioned gimmicks. But sometimes even this is way too
much data to process. So what about just listing all the real issues
to care about: all messages of priority levels ERROR and worse, from
the current boot:

$ journalctl -b -p err

If you reboot only seldom, -b makes little sense;
filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this, this seems to be some weird interaction with
dhclient, let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a
notch. Internally systemd stores each log entry with a set of
implicit meta data. This meta data looks a lot like an
environment block, but actually is a bit more powerful: values can
take binary, large values (though this is the exception, and usually
they just contain UTF-8), and fields can have multiple values assigned
(an exception too, usually they only have one value). This implicit
meta data is collected for each and every log message, without user
intervention. The data will be there, and wait to be used by
you. Let’s see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don’t want to make this story
overly long. -n without parameter shows you the last 10 log
entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose
output. Instead of showing a pixel-perfect copy of classic
/var/log/messages that only includes a minimal subset of
what is available we now see all the gory details the journal has
about each entry. But it’s highly interesting: there is user credential
information, SELinux bits, machine information and more. For a full
list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux
user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field
names, they will be combined with a logical AND. All entries matching
both will be shown now, i.e. all messages from processes named
avahi-daemon on host epsilon.

But of course, that’s not fancy enough for us. We are computer nerds
after all, we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when
you match the same field twice. The line above hence means: show me
everything from host theta with UID 70, or of host
epsilon with a process name of avahi-daemon.
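
The combination rules described above can be modelled in a few lines of C. This is purely an illustration of the logic, not how journalctl implements it: the struct and function names are hypothetical, and each entry is assumed to carry at most one value per field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct field { const char *name, *value; };              /* one NAME=value pair */
struct term  { const struct field *matches; size_t n; }; /* matches between two '+' */

/* Look up a field in a log entry; NULL if the entry does not have it.
 * Simplification: each field has at most one value here. */
const char *entry_get(const struct field *entry, size_t n, const char *name) {
        for (size_t i = 0; i < n; i++)
                if (strcmp(entry[i].name, name) == 0)
                        return entry[i].value;
        return NULL;
}

/* A term matches if, for every field name it mentions, the entry's value
 * equals at least one of the values given for that name:
 * OR within the same field, AND across different fields. */
bool term_matches(const struct field *entry, size_t n_entry, const struct term *t) {
        for (size_t i = 0; i < t->n; i++) {
                const char *have = entry_get(entry, n_entry, t->matches[i].name);
                bool any = false;

                for (size_t j = 0; j < t->n && have && !any; j++)
                        any = strcmp(t->matches[j].name, t->matches[i].name) == 0 &&
                              strcmp(t->matches[j].value, have) == 0;
                if (!any)
                        return false;
        }
        return true;
}

/* The '+' operator: the expression matches if any of its terms does. */
bool expression_matches(const struct field *entry, size_t n_entry,
                        const struct term *terms, size_t n_terms) {
        for (size_t i = 0; i < n_terms; i++)
                if (term_matches(entry, n_entry, &terms[i]))
                        return true;
        return false;
}
```

In this model the last command line above becomes an expression with two terms, one holding the two theta matches and one holding the two epsilon matches. An entry from avahi-daemon on epsilon fails the first term but satisfies the second, so it is shown.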

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can
remember all those values a field can take in the journal, I mean,
seriously, who has thaaaat kind of photographic memory? Well, the
journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the
database, or in other words: the names of all systemd services which
ever logged into the journal. This makes it super-easy to build nice
matches. But wait, turns out this all is actually hooked up with shell
completion on bash! This gets even more awesome: as you type your
match expression you will get a list of well-known field names, and of
the values they can take! Let’s figure out how to filter for SELinux
labels again. We remember the field name was something with SELINUX in
it, let’s try that:

$ journalctl _SE<TAB>

And yupp, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in
the kernel yet, but on its way due to Hannes’ fantastic work, and I
hope it will make an appearance in F18.

systemd for Administrators, Part XV

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/watchdog.html

Quickly following the previous iteration, here’s now the fifteenth
installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd:
the embedded/mobile folks, the desktop people and the server
folks. While the systems used by embedded/mobile tend to be
underpowered and have few resources available, desktops tend to be
much more powerful machines, but still with far fewer resources than
servers. Nonetheless there are surprisingly many features that matter
to both extremes of this axis (embedded and servers), but not the
center (desktops). One of them is support for watchdogs in
hardware and software.

Embedded devices frequently rely on watchdog hardware that resets
them automatically if software stops responding (more specifically,
stops signalling the hardware in fixed intervals that it is still
alive). This is required to increase reliability and make sure that
regardless of what happens the best is attempted to get the system
working again. Functionality like this makes little sense on the
desktop[1]. However, on
high-availability servers watchdogs are frequently used, again.

Starting with version 183 systemd provides full support for
hardware watchdogs (as exposed in /dev/watchdog to
userspace), as well as supervisor (software) watchdog support for
individual system services. The basic idea is the following: if enabled,
systemd will regularly ping the watchdog hardware. If systemd or the
kernel hang this ping will not happen anymore and the hardware will
automatically reset the system. This way systemd and the kernel are
protected from boundless hangs — by the hardware. To make the chain
complete, systemd then exposes a software watchdog interface for
individual services so that they can also be restarted (or some other
action taken) if they begin to hang. This software watchdog logic can
be configured individually for each service in the ping frequency and
the action to take. Putting both parts together (i.e. hardware
watchdogs supervising systemd and the kernel, as well as systemd
supervising all other services) we have a reliable way to watchdog
every single component of the system.

To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in
/etc/systemd/system.conf. It defaults to 0 (i.e. no hardware
watchdog use). Set it to a value like 20s and the watchdog is
enabled. After 20s of no keep-alive pings the hardware will reset
itself. Note that systemd will send a ping to the hardware at half the
specified interval, i.e. every 10s. And that’s already all there is to
it. By enabling this single, simple option you have turned on
supervision by the hardware of systemd and the kernel beneath
it.[2]

Note that the hardware watchdog device (/dev/watchdog) is
single-user only. That means that you can either enable this
functionality in systemd, or use a separate external watchdog daemon,
such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be
configured in /etc/systemd/system.conf. It controls the
watchdog interval to use during reboots. It defaults to 10min, and
adds extra reliability to the system reboot logic: if a clean reboot
is not possible and shutdown hangs, we rely on the watchdog hardware
to reset the system abruptly, as extra safety net.

So much about the hardware watchdog logic. These two options are
really everything that is necessary to make use of the hardware
watchdogs. Now, let’s have a look how to add watchdog logic to
individual services.

First of all, to make software watchdog-supervisable it needs to be
patched to send out “I am alive” signals in regular intervals in its
event loop. Patching this is relatively easy. First, a daemon needs to
read the WATCHDOG_USEC= environment variable. If it is set,
it will contain the watchdog interval in usec formatted as ASCII text
string, as it is configured for the service. The daemon should then
issue sd_notify(0, "WATCHDOG=1")
calls every half of that interval. A daemon patched this way should
transparently support watchdog functionality by checking whether the
environment variable is set and honouring the value it is set to.
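
The environment handling described above could look roughly like this. It is a minimal sketch: watchdog_ping_interval_usec() is a hypothetical helper, and the event loop integration is only hinted at in the trailing comment.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: takes the value of WATCHDOG_USEC (as returned by
 * getenv(), so possibly NULL) and returns the interval in microseconds
 * at which the daemon should send keep-alive pings, i.e. half of the
 * configured watchdog interval. Returns 0 if the variable is unset or
 * malformed, meaning that no pings need to be sent. */
uint64_t watchdog_ping_interval_usec(const char *value) {
        unsigned long long usec;
        char *end = NULL;

        /* Reject NULL, empty strings, signs and leading whitespace. */
        if (!value || value[0] < '0' || value[0] > '9')
                return 0;

        errno = 0;
        usec = strtoull(value, &end, 10);
        if (errno != 0 || *end != '\0')
                return 0;

        return (uint64_t) (usec / 2);
}

/* Sketch of the call site inside the daemon's event loop
 * (time_since_last_ping is a placeholder for the daemon's own timer):
 *
 *     uint64_t every = watchdog_ping_interval_usec(getenv("WATCHDOG_USEC"));
 *     ...
 *     if (every > 0 && time_since_last_ping >= every)
 *             sd_notify(0, "WATCHDOG=1");    from <systemd/sd-daemon.h>
 */
```

With a watchdog interval of 30s the variable contains 30000000, so the daemon would ping every 15000000 usec, i.e. every 15s.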

To enable the software watchdog logic for a service (which has been
patched to support the logic pointed out above) it is sufficient to
set the WatchdogSec= to the desired failure latency. See systemd.service(5)
for details on this setting. This causes WATCHDOG_USEC= to be
set for the service’s processes and will cause the service to enter a
failure state as soon as no keep-alive ping is received within the
configured interval.

If a service enters a failure state as soon as the watchdog logic
detects a hang, then this is hardly sufficient to build a reliable
system. The next step is to configure whether the service shall be
restarted and how often, and what to do if it then still fails. To
enable automatic service restarts on failure set
Restart=on-failure for the service. To configure how many
times systemd attempts to restart a service use the combination
of StartLimitBurst= and StartLimitInterval= which
allow you to configure how often a service may restart within a time
interval. If that limit is reached, a special action can be
taken. This action is configured with StartLimitAction=. The
default is none, i.e. no further action is taken and
the service simply remains in the failure state without any further
attempted restarts. The other three possible values are
reboot, reboot-force and
reboot-immediate. reboot attempts a clean reboot,
going through the usual, clean shutdown logic. reboot-force
is more abrupt: it will not actually try to cleanly shut down any
services, but immediately kills all remaining services and unmounts
all file systems and then forcibly reboots (this way all file systems
will be clean but reboot will still be very fast). Finally,
reboot-immediate does not attempt to kill any process or
unmount any file systems. Instead it just hard reboots the machine
without delay. reboot-immediate hence comes closest to a
reboot triggered by a hardware watchdog. All these settings are
documented in systemd.service(5).

Putting this all together we now have pretty flexible options to
watchdog-supervise a specific service and configure automatic restarts
of the service if it hangs, plus take ultimate action if that doesn’t
help.

Here’s an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn’t pinged
the system manager for longer than 30s or if it fails otherwise. If it
is restarted this way more often than 4 times in 5min, action is taken
and the system is quickly rebooted, with all file systems being clean
when it comes up again.
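
The interplay of StartLimitInterval= and StartLimitBurst= can be modelled in a few lines. The following Python sketch is a simplified model of such a rate limit and makes no claim about how systemd implements it internally; the class name StartLimit and the choice of a fixed (rather than sliding) time window are assumptions made purely for illustration.

```python
class StartLimit:
    """Allow at most `burst` starts within `interval` seconds (fixed window)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None  # time of the first start in the current window
        self.count = 0            # starts seen in the current window

    def allow(self, now):
        """Record a start attempt at time `now`; return False once the limit is hit."""
        # Open a new window on the first start, or once the old window has expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        # Limit reached: this is the point where StartLimitAction= would apply.
        return False
```

With the values from the unit file above, StartLimit(interval=300, burst=4) permits four starts, and a fifth one inside the same five minutes is refused.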

And that’s already all I wanted to tell you about! With hardware
watchdog support right in PID 1, as well as supervisor watchdog
support for individual services we should provide everything you need
for most watchdog usecases. Regardless of whether you are building an embedded
or mobile appliance, or are working with high-availability
servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with
/dev/watchdog, and why this shouldn’t be kept in a separate
daemon, then please read this again and try to understand that this is
all about the supervisor chain we are building here, where the hardware watchdog
supervises systemd, and systemd supervises the individual
services. Also, we believe that a service not responding should be
treated in a similar way as any other service error. Finally, pinging
/dev/watchdog is one of the most trivial operations in the OS
(basically little more than an ioctl() call), so the support for this
is not more than a handful of lines of code. Maintaining this externally
with complex IPC between PID 1 (and the daemons) and this watchdog
daemon would be drastically more complex, error-prone and resource
intensive.)

Note that the built-in hardware watchdog support of systemd does
not conflict with other watchdog software by default. systemd does not
make use of /dev/watchdog by default, and you are welcome to
use external watchdog daemons in conjunction with systemd, if this
better suits your needs.

And one last thing: if you wonder whether your hardware has a
watchdog, then the answer is: almost definitely yes — if it is anything more
recent than a few years. If you want to verify this, try the wdctl
tool from recent util-linux, which shows you everything you need to
know about your watchdog hardware.

I’d like to thank the great folks from Pengutronix for contributing
most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog
hardware these days too, as this is cheap to build and available in
most modern PC chipsets.

[2] So, here’s a free tip for you if you hack on the core
OS: don’t enable this feature while you hack. Otherwise your system
might suddenly reboot if you are in the middle of tracing through PID
1 with gdb and cause it to be stopped for a moment, so that no
hardware ping can be done…

systemd for Administrators, Part XIII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemctl-journal.html

Here’s the thirteenth installment of my ongoing series on systemd for Administrators:

Log and Service Status

This one is a short episode. One of the most commonly used commands
on a systemd system is systemctl status which may be used to determine the
status of a service (or other unit). It has always been a valuable
tool to figure out the processes, runtime information and other meta
data of a daemon running on the system.

With Fedora 17 we introduced the journal, our new logging scheme that provides structured, indexed
and reliable logging on systemd systems, while providing a certain
degree of compatibility with classic syslog implementations. The
original reason we started to work on the journal was one specific
feature idea, that to the outsider might appear simple but without the
journal is difficult and inefficient to implement: along with the
output of systemctl status we wanted to show the last 10 log
messages of the daemon. Log data is some of the most essential bits of
information we have on the status of a service. Hence it is an
obvious choice to show it next to the general status of the
service.

And now to make it short: at the same time as we integrated the
journal into systemd and Fedora we also hooked up
systemctl with it. Here’s an example output:

$ systemctl status avahi-daemon.service
avahi-daemon.service - Avahi mDNS/DNS-SD Stack
	  Loaded: loaded (/usr/lib/systemd/system/avahi-daemon.service; enabled)
	  Active: active (running) since Fri, 18 May 2012 12:27:37 +0200; 14s ago
	Main PID: 8216 (avahi-daemon)
	  Status: "avahi-daemon 0.6.30 starting up."
	  CGroup: name=systemd:/system/avahi-daemon.service
		  ├ 8216 avahi-daemon: running [omega.local]
		  └ 8217 avahi-daemon: chroot helper

May 18 12:27:37 omega avahi-daemon[8216]: Joining mDNS multicast group on interface eth1.IPv4 with address 172.31.0.52.
May 18 12:27:37 omega avahi-daemon[8216]: New relevant interface eth1.IPv4 for mDNS.
May 18 12:27:37 omega avahi-daemon[8216]: Network interface enumeration completed.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 192.168.122.1 on virbr0.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for fd00::e269:95ff:fe87:e282 on eth1.*.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 172.31.0.52 on eth1.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering HINFO record with values 'X86_64'/'LINUX'.
May 18 12:27:38 omega avahi-daemon[8216]: Server startup complete. Host name is omega.local. Local service cookie is 3555095952.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/ssh.service) successfully established.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/sftp-ssh.service) successfully established.

This, of course, shows the status of everybody’s favourite
mDNS/DNS-SD daemon with a list of its processes, along with — as
promised — the 10 most recent log lines. Mission accomplished!

There are a couple of switches available to alter the output
slightly and adjust it to your needs. The two most interesting
switches are -f to enable follow mode (as in tail -f) and -n to change
the number of lines to show (you guessed it, as in tail -n).

The log data shown comes from three sources: everything any of the
daemon’s processes logged with libc’s syslog() call,
everything submitted using the native Journal API, plus everything any
of the daemon’s processes logged to STDOUT or STDERR. In short:
everything the daemon generates as log data is collected, properly
interleaved and shown in the same format.

And that’s it already for today. It’s a very simple feature, but an
immensely useful one for every administrator. One of the kind “Why didn’t
we already do this 15 years ago?”.

Stay tuned for the next installment!

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-3.html

It has been way too long since my last status update on systemd.
Here’s another short, incomprehensive status update on
what we worked on for systemd since then.

We have been working hard to turn systemd into the most viable set
of components to build operating systems, appliances and devices from,
and make it the best choice for servers, for desktops and for embedded
environments alike. I think we have a really convincing set of
features now, but we are actively working on making it even
better.

Here’s a list of some more and some less interesting features, in
no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar
    to how git has it.
  2. systemctl learnt a new switch --failed, to show only
    failed services.
  3. You may now start services immediately, overriding all dependency
    logic by passing --ignore-dependencies to
    systemctl. This is mostly a debugging tool and nothing people
    should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown
    logic of services is now optional and may be configured with the
    SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into their own project systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO
    special files as well as character and block device nodes, and
    symlinks. It also is capable of relabelling certain directories at
    boot now (in the SELinux sense).
  7. Immediately before shutting down we will now invoke all binaries
    found in /lib/systemd/system-shutdown/, which is useful for
    debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes
    (unless individual service configuration overrides it).
  9. There’s a new ConditionVirtualization= option, that makes
    systemd skip a specific service if a certain virtualization technology
    is found or not found. Similarly, we now have a new option to detect
    whether a certain security technology (such as SELinux) is available,
    called ConditionSecurity=. There’s also
    ConditionCapability= to check whether a certain process
    capability is in the capability bounding set of the system. There are
    also the new ConditionFileIsExecutable=,
    ConditionPathIsMountPoint=,
    ConditionPathIsReadWrite= and
    ConditionPathIsSymbolicLink= options.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be “triggering” and “mandatory”, meaning that
    they can be a necessary requirement to hold for a service to start, or
    simply one trigger among many.
  12. At boot time we now print warnings if /usr
    is on a split-off partition but not already mounted by an initrd,
    if /etc/mtab is not a symlink to /proc/mounts, or if CONFIG_CGROUPS
    is not enabled in the kernel. We’ll also expose this as a
    tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in
    Linux namespace containers and will get a clean boot in both
    cases. This is more complicated than it sounds since device management
    with udev or write access to /sys, /proc/sys or
    things like /dev/kmsg is not available in a container. This
    makes systemd a first-class choice for managing thin container
    setups. This is all tested with systemd’s own systemd-nspawn
    tool but should work fine in LXC setups, too. Basically this means
    that you do not have to adjust your OS manually to make it work in a
    container environment; it will just work out of the box. It also
    makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of
    D-Bus machine ID logic. See this
    blog story for more information. On stateless/read-only systems
    the machine ID is initialized randomly at boot. In virtualized
    environments it may be passed in from the machine manager (with qemu’s
    -uuid switch, or via the container
    interface).
  16. All of the systemd-specific /etc/fstab mount options are
    now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now
    implicitly prefix all LSB and SysV init script descriptions with the
    strings “LSB:” resp. “SYSV:”.
  18. We introduced /run and made it a hard dependency of
    systemd. This directory is now widely accepted and implemented on all
    relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn,
    a really powerful tool that can be used to start containers for
    debugging, building and testing, much like chroot(1). It is useful to
    just get a shell inside a build tree, but is good enough to boot up a
    full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit
    TAB to hide the asterisks we normally show for each key that is
    entered, for extra paranoia.
  22. We don’t enable udev-settle.service anymore, which is
    only required for certain legacy software that still hasn’t been
    updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to
    bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel’s binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot()
    environment and will work accordingly (i.e. apply changes to the tree
    it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There’s a new unit dependency type OnFailureIsolate= that
    allows entering a different target whenever a certain unit fails. For
    example, this is interesting to enter emergency mode if file system
    checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files
    from /proc and POSIX message queues, too.
  28. There’s a new IgnoreOnIsolate= flag which may be used to
    ensure certain units are left untouched by isolation requests. There’s
    a new IgnoreOnSnapshot= flag which may be used to exclude
    certain units from snapshot units when they are created.
  29. There are now small mechanism services for
    changing the local hostname and other host meta data, changing
    the system locale and console settings, and changing the system
    clock.
  30. We now limit the capability bounding set for a number of our
    internal services by default.
  31. Plymouth may now be disabled globally with
    plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finished running (and
    optionally other tools run on VTs). This adds extra security since it
    clears up the scrollback buffer so that subsequent users cannot get
    access to a user’s session output.
  33. In socket units there are now options to control the
    IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED,
    SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger
    than the default system settings if needed by using
    SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID
    1, in order to avoid time jumps during normal userspace operation, and
    to guarantee sensible times on all generated logs. We also no longer
    save the system clock to the RTC on shutdown, assuming that this is
    done by the clock control tool when the user modifies the time, or
    automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to
    /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps track
    of logged in users and their sessions. It creates control groups for
    them, implements the XDG_RUNTIME_DIR
    specification for them, maintains seats and device node ACLs and
    implements shutdown/idle inhibiting for clients. It auto-spawns gettys
    on all local VTs when the user switches to them (instead of starting
    six of them unconditionally), thus reducing the resource footprint by
    default. It has a D-Bus interface as well as a
    simple synchronous library interface. This mechanism obsoletes
    ConsoleKit which is now deprecated and should no longer be used.
  38. There’s now full, automatic multi-seat support, and this is
    enabled in GNOME 3.4. Just by plugging in new seat hardware you get a
    new login screen on your seat’s screen.
  39. There is now an option ControlGroupModify= to allow
    services to change the properties of their control groups dynamically,
    and one to make control groups persistent in the tree
    (ControlGroupPersistent=) so that they can be created and
    maintained by external tools.
  40. We now jump back into the initrd in shutdown, so that it can
    detach the root file system and the storage devices backing it. This
    allows (for the first time!) to reliably undo complex storage setups
    on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and
    administrators to define their own policies on whether services should
    be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking
    units. There’s also a new command (systemctl list-unit-files)
    for determining the list of all installed unit files and whether
    they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it
    appears. This makes /etc/sysctl.d compatible with hot-plug
    network devices.
  44. There’s limited profiling for SELinux start-up performance built
    into PID 1.
  45. There’s a new switch PrivateNetwork=
    to turn off any network access for a specific service.
  46. Service units may now include configuration for control group
    parameters. A few (such as MemoryLimit=) are exposed with
    high-level options, and all others are available via the generic
    ControlGroupAttribute= setting.
  47. There’s now the option to mount certain cgroup controllers
    jointly at boot. We do this now for cpu and
    cpuacct by default.
  48. We added the journal and turned it on by default.
  49. All service output is now written to the Journal by default,
    regardless of whether it is sent via syslog or simply written to
    stdout/stderr. Both message streams end up in the same location and
    are interleaved the way they should be. All log messages, even from the
    kernel and from early boot, end up in the journal. Now no service
    output goes unnoticed; it is all saved and indexed at the same
    location.
  50. systemctl status will now show the last 10 log lines for
    each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console,
    again. We also show the much loved colorful [ OK ] status
    messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container
    managers and initrds
    for passing execution data to systemd. We also implemented and
    documented an
    interface for storage daemons that are required to back the root file
    system.
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won’t show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource
    usage of whole services in a top(1) like fashion.
  57. systemd may now supervise services in watchdog style. If enabled
    for a service the daemon has to ping PID 1 at regular intervals
    or is otherwise considered failed (which might then result in
    restarting it, or even rebooting the machine, as configured). Also,
    PID 1 is capable of pinging a hardware watchdog. Putting this
    together, the hardware watchdogs PID 1 and PID 1 then watchdogs
    specific services. This is highly useful for high-availability servers
    as well as embedded machines. Since watchdog hardware is nowadays
    built into all modern chipsets (including desktop chipsets), this
    should hopefully help to make this a more widely used
    functionality.
  58. We added support for a new kernel command line option
    systemd.setenv= to set an environment variable
    system-wide.
  59. By default services which are started by systemd will have SIGPIPE
    set to ignored. The Unix SIGPIPE logic is used to reliably implement
    shell pipelines and when left enabled in services is usually just a
    source of bugs and problems.
  60. You may now configure the rate limiting that is applied to
    restarts of specific services. Previously the rate limiting parameters
    were hard-coded (similar to SysV).
  61. There’s now support for loading the IMA integrity policy into the
    kernel early in PID 1, similar to how we already did it with the
    SELinux policy.
  62. There’s now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt
    an official tool in the tool set. Since we already had code to detect
    certain VM and container environments we now added an official tool
    for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous interfaces systemd introduced.

Much of the stuff above is already available in Fedora 15 and 16,
or will be made available in the upcoming Fedora 17.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you.

I’d like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

systemd for Administrators, Part XI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/inetd.html

Here’s the eleventh installment of my ongoing series on systemd for Administrators:

Converting inetd Services

In a previous episode of this series I covered how to convert a SysV
init script to a systemd unit file. In this story I hope to explain
how to convert inetd services into systemd units.

Let’s start with a bit of background. inetd has a long tradition as
one of the classic Unix services. As a superserver it listens on
an Internet socket on behalf of another service and then activates that
service on an incoming connection, thus implementing an on-demand
socket activation system. This allowed Unix machines with limited
resources to provide a large variety of services, without the need to
run processes and invest resources for all of them all of the
time. Over the years a number of independent implementations of inetd
have been shipped on Linux distributions, the most prominent being the
ones based on BSD inetd and xinetd. While inetd used to be installed
on most distributions by default, it nowadays is used only for very
few selected services and the common services are all run
unconditionally at boot, primarily for (perceived) performance
reasons.

One of the core features of systemd (and Apple’s launchd for that
matter) is socket activation, a scheme pioneered by inetd, however
back then with a different focus. Systemd-style socket activation focusses on
local sockets (AF_UNIX), not so much Internet sockets (AF_INET), even
though both are supported. Even more importantly, socket
activation in systemd is not primarily about the on-demand aspect that
was key in inetd, but more about increasing parallelization (socket
activation allows starting clients and servers of the socket at the
same time), simplicity (since the need to configure explicit
dependencies between services is removed) and robustness (since
services can be restarted or may crash without loss of connectivity of the
socket). However, systemd can also activate services on-demand when
connections are incoming, if configured that way.

Socket activation of any kind requires support in the services
themselves. systemd provides a very simple interface that services may
implement to provide socket activation, built around sd_listen_fds(). As such
it is already a very minimal, simple scheme. However, the
traditional inetd interface is even simpler. It allows passing only a
single socket to the activated service: the socket fd is simply
duplicated to STDIN and STDOUT of the process spawned, and that’s
already it. In order to provide compatibility systemd optionally
offers the same interface to processes, thus taking advantage of the
many services that already support inetd-style socket activation, but not yet
systemd’s native activation.
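
Because the inetd interface is nothing more than a connection socket on STDIN and STDOUT, a compatible service can be remarkably small. Here is a hypothetical Python sketch of one; the serve() function and its upper-casing “protocol” are invented for this example and merely stand in for real request handling.

```python
from typing import TextIO


def serve(inp: TextIO, out: TextIO) -> None:
    """Answer every line received from the peer with its upper-cased copy."""
    for line in inp:
        out.write(line.upper())
        # Flush after each reply so the peer is not kept waiting on buffering.
        out.flush()
```

A real service would simply call serve(sys.stdin, sys.stdout): when spawned inetd-style both streams refer to the connection socket, so the function never needs to know that a network is involved. That also makes such services easy to test with plain in-memory streams.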

Before we continue with a concrete example, let’s have a look at
three different schemes to make use of socket activation:

  1. Socket activation for parallelization, simplicity,
    robustness: sockets are bound during early boot and a singleton
    service instance to serve all client requests is immediately started
    at boot. This is useful for all services that are very likely used
    frequently and continuously, and hence starting them early and in
    parallel with the rest of the system is advisable. Examples: D-Bus,
    Syslog.
  2. On-demand socket activation for singleton services: sockets
    are bound during early boot and a singleton service instance is
    executed on incoming traffic. This is useful for services that are
    seldom used, where it is advisable to save the resources and time at
    boot and delay activation until they are actually needed. Example: CUPS.
  3. On-demand socket activation for per-connection service
    instances: sockets are bound during early boot and for each
    incoming connection a new service instance is instantiated and the
    connection socket (and not the listening one) is passed to it. This is
    useful for services that are seldom used, and where performance is not
    critical, i.e. where the cost of spawning a new service process for
    each incoming connection is limited. Example: SSH.

The three schemes provide different performance characteristics. After
the service finishes starting up the performance provided by the first two
schemes is identical to a stand-alone service (i.e. one that is
started without a super-server, without socket activation), since the
listening socket is passed to the actual service, and code paths from
then on are identical to those of a stand-alone service and all
connections are processed exactly the same way as they are in a
stand-alone service. On the other hand, performance of the third scheme
is usually not as good: since for each connection a new service needs
to be started the resource cost is much higher. However, it also has a
number of advantages: for example client connections are better
isolated and it is easier to develop services activated this way.
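
In the first two schemes the service receives the listening socket itself. As a rough sketch of what a call like sd_listen_fds() has to check, and assuming the environment variable protocol documented in sd_listen_fds(3) ($LISTEN_PID and $LISTEN_FDS, with passed descriptors starting at fd 3), the logic looks roughly like this in Python. The function name listen_fds() is made up for this example; real services should prefer the library call.

```python
import os

# First passed file descriptor, as documented in sd_listen_fds(3).
SD_LISTEN_FDS_START = 3


def listen_fds(environ=os.environ, pid=None):
    """Return the list of file descriptors passed in by the service manager."""
    if pid is None:
        pid = os.getpid()
    try:
        # The variables are only meant for us if LISTEN_PID names our own PID.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        # Variables missing or malformed: we were not socket-activated.
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))
```

A service that gets an empty list back would fall back to binding its sockets itself, which is what keeps the same binary usable as a stand-alone service.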

For systemd primarily the first scheme is in focus, however the
other two schemes are supported as well. (In fact, the blog story in which I
covered the necessary code changes for systemd-style socket activation
was about a service of the second type, i.e. CUPS). inetd
primarily focusses on the third scheme, however the second scheme is
supported too. (The first one isn’t. Presumably due to the focus on the
third scheme inetd got its somewhat unfair reputation for being
“slow”.)

So much about the background, let’s cut to the beef now and show how an
inetd service can be integrated into systemd’s socket
activation. We’ll focus on SSH, a very common service that is widely
installed and used but on the vast majority of machines probably not
started more often than once an hour on average (and usually even much
less). SSH has supported inetd-style activation for a long time,
following the third scheme mentioned above. Since it is started only
every now and then and only with a limited number of connections at
the same time it is a very good candidate for this scheme as the extra
resource cost is negligible: if made socket-activatable SSH is
basically free as long as nobody uses it. And as soon as somebody logs
in via SSH it will be started and the moment he or she disconnects all
its resources are freed again. Let’s find out how to make SSH
socket-activatable in systemd taking advantage of the provided inetd
compatibility!

Here’s the configuration line used to hook up SSH with classic inetd:

ssh stream tcp nowait root /usr/sbin/sshd sshd -i

And the same as xinetd configuration fragment:

service ssh {
        socket_type = stream
        protocol = tcp
        wait = no
        user = root
        server = /usr/sbin/sshd
        server_args = -i
}

Most of this should be fairly easy to understand, as these two
fragments express very much the same information. The non-obvious
parts: the port number (22) is not configured in inetd configuration,
but indirectly via the service database in /etc/services: the
service name is used as lookup key in that database and translated to
a port number. This indirection via /etc/services has been
part of Unix tradition though has been getting more and more out of
fashion, and the newer xinetd hence optionally allows configuration
with explicit port numbers. The most interesting setting here is the
not very intuitively named nowait (resp. wait=no)
option. It configures whether a service is of the second
(wait) resp. third (nowait) scheme mentioned
above. Finally the -i switch is used to enable inetd mode in
SSH.

The systemd translation of these configuration fragments is the
following two units. First: sshd.socket is a unit encapsulating
information about a socket to listen on:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=22
Accept=yes

[Install]
WantedBy=sockets.target

Most of this should be self-explanatory. A few notes:
Accept=yes corresponds to nowait. It’s hopefully
better named, referring to the fact that for nowait the
superserver calls accept() on the listening socket, whereas for
wait this is the job of the executed
service process. WantedBy=sockets.target is used to ensure that when
enabled this unit is activated at boot at the right time.
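
What a superserver does for a nowait (or Accept=yes) service can be sketched in a few lines. This is a hypothetical Python illustration of the general technique, not a description of how systemd or inetd are implemented; the helper names spawn_for_connection() and accept_and_spawn() are invented here.

```python
import socket
import subprocess


def spawn_for_connection(conn: socket.socket, argv: list[str]) -> subprocess.Popen:
    """Start argv with the connection socket as both its STDIN and STDOUT."""
    return subprocess.Popen(argv, stdin=conn.fileno(), stdout=conn.fileno())


def accept_and_spawn(listener: socket.socket, argv: list[str]) -> subprocess.Popen:
    """Accept one connection on behalf of the service and hand it to a new process."""
    conn, _peer = listener.accept()
    try:
        return spawn_for_connection(conn, argv)
    finally:
        # The child holds its own copies of the descriptor; drop ours.
        conn.close()
```

For a wait-style service the superserver would skip the accept() call and pass the listening socket instead, leaving accept() to the service process.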

And here’s the matching service file sshd@.service:

[Unit]
Description=SSH Per-Connection Server

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

This too should be mostly self-explanatory. Interesting is
StandardInput=socket, the option that enables inetd
compatibility for this service. StandardInput= may be used to
configure what STDIN of the service should be connected to
(see the man page for details). By setting it to socket we make sure
to pass the connection socket here, as expected in the simple inetd
interface. Note that we do not need to explicitly configure
StandardOutput= here, since by default the setting from
StandardInput= is inherited if nothing else is
configured. Important is the “-” in front of the binary name. This
ensures that the exit status of the per-connection sshd process is
forgotten by systemd. Normally, systemd will store the exit status of
all service instances that die abnormally. SSH will sometimes die
abnormally with an exit code of 1 or similar, and we want to make sure
that this doesn’t cause systemd to keep around information for
numerous previous connections that died this way (until this
information is forgotten with systemctl reset-failed).

sshd@.service is an instantiated service, as described in the preceding
installment of this series. For each incoming connection systemd
will instantiate a new instance of sshd@.service, with the
instance identifier named after the connection credentials.

You may wonder why in systemd configuration of an inetd service
requires two unit files instead of one. The reason for this is that to
simplify things we want to make sure that the relation between live
units and unit files is obvious, while at the same time we can order
the socket unit and the service units independently in the dependency
graph and control the units as independently as possible. (Think: this
allows you to shut down the socket independently of the instances,
and each instance individually.)

Now, let’s see how this works in real life. If we drop these files
into /etc/systemd/system we are ready to enable the socket and
start it:

# systemctl enable sshd.socket
ln -s '/etc/systemd/system/sshd.socket' '/etc/systemd/system/sockets.target.wants/sshd.socket'
# systemctl start sshd.socket
# systemctl status sshd.socket
sshd.socket - SSH Socket for Per-Connection Servers
	  Loaded: loaded (/etc/systemd/system/sshd.socket; enabled)
	  Active: active (listening) since Mon, 26 Sep 2011 20:24:31 +0200; 14s ago
	Accepted: 0; Connected: 0
	  CGroup: name=systemd:/system/sshd.socket

This shows that the socket is listening, and so far no connections
have been made (Accepted: will show you how many connections
have been made in total since the socket was started,
Connected: how many connections are currently active.)

Now, let’s connect to this from two different hosts, and see which services are now active:

$ systemctl --full | grep ssh
sshd@172.31.0.52:22-172.31.0.4:47779.service  loaded active running       SSH Per-Connection Server
sshd@172.31.0.52:22-172.31.0.54:52985.service loaded active running       SSH Per-Connection Server
sshd.socket                                   loaded active listening     SSH Socket for Per-Connection Servers

As expected, there are now two service instances running, for the
two connections, and they are named after the source and destination
address of the TCP connection as well as the port numbers. (For
AF_UNIX sockets the instance identifier will carry the PID and UID of
the connecting client.) This allows us to individually introspect or
kill specific sshd instances, for example to terminate the
session of a specific client:

# systemctl kill sshd@172.31.0.52:22-172.31.0.4:47779.service
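
If you want to pick out the instance belonging to a particular client in a script, the instance name can be taken apart with plain shell parameter expansion. The following is only a sketch: parse_instance is a made-up helper, the example name is illustrative, and it assumes the IPv4 TCP naming scheme of local address and port, a dash, then remote address and port. Other socket types or other systemd versions may format the identifier differently.

```shell
# Split an instance name of the assumed form
#   NAME@LOCALADDR:LOCALPORT-REMOTEADDR:REMOTEPORT.service
# into its local and remote endpoints.
parse_instance() {
    unit=${1%.service}      # drop the unit type suffix
    id=${unit#*@}           # instance identifier after the "@"
    local_end=${id%%-*}     # everything before the first "-"
    remote_end=${id#*-}     # everything after the first "-"
    printf 'local=%s remote=%s\n' "$local_end" "$remote_end"
}

parse_instance 'sshd@172.31.0.52:22-172.31.0.4:47779.service'
# prints: local=172.31.0.52:22 remote=172.31.0.4:47779
```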

And that’s probably already most of what you need to know for
hooking up inetd services with systemd and how to use them afterwards.

In the case of SSH it is probably a good idea for most
distributions to default to this kind of inetd-style socket
activation in order to save resources, but to provide a stand-alone
unit file for sshd as well, which can be enabled optionally. I’ll soon file a
wishlist bug about this against our SSH package in Fedora.

A few final notes on how xinetd and systemd compare feature-wise,
and whether xinetd is fully obsoleted by systemd. The short answer
here is that systemd does not provide the full xinetd feature set and
that it does not fully obsolete xinetd. The longer answer is a bit
more complex: if you look at the multitude of options
xinetd provides you’ll notice that systemd does not compare. For
example, systemd does not come with built-in echo,
time, daytime or discard servers, and never
will include those. TCPMUX is not supported, and neither are RPC
services. However, you will also find that most of these are either
irrelevant on today’s Internet or have otherwise gone out of fashion. The
vast majority of inetd services do not directly take advantage of
these additional features. In fact, none of the xinetd services
shipped on Fedora make use of these options. That said, there are a
couple of useful features that systemd does not support, for example
IP ACL management. However, most administrators will probably agree
that firewalls are the better solution for these kinds of problems and
on top of that, systemd supports ACL management via tcpwrap for those
who indulge in retro technologies like this. On the other hand systemd
also provides numerous features xinetd does not provide,
starting with the individual control of instances shown above, or the
more expressive configurability of the execution context for the
instances. I believe that what systemd provides is
quite comprehensive, comes with little legacy cruft but should provide
you with everything you need. And if there’s something systemd does
not cover, xinetd will always be there to fill the void as
you can easily run it in conjunction with systemd. For the
majority of uses systemd should cover what is necessary, and allows
you to cut down on the required components to build your system from. In
a way, systemd brings back the functionality of classic Unix inetd and
turns it again into a centerpiece of a Linux system.

And that’s all for now. Thanks for reading this long piece. And
now, get going and convert your services over! Even better, do this
work in the individual packages upstream or in your distribution!

How to Write syslog Daemons Which Cooperate Nicely With systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/syslog.html

I just finished putting together a text on the systemd wiki explaining what
to do to write a syslog service that is nicely integrated with systemd, and
does all the right things. It’s supposed to be a checklist for all syslog
hackers:

Read it now.

rsyslog already implements everything on this list afaics, and that’s
pretty cool. If other implementations want to catch up, please consider
following these recommendations, too.

I put this together since I have changed systemd 35 to set
StandardOutput=syslog as default, so that all stdout/stderr of all
services automatically ends up in syslog. And since that change requires some
(minimal) changes to all syslog implementations I decided to document this all
properly (if you are curious: they need to set StandardOutput=null to
opt out of this default in order to avoid logging loops).

Anyway, please have a peek and comment if you spot a mistake or
something I forgot. Or if you have questions, just ask.

systemd for Administrators, Part IX

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/on-etc-sysinit.html

Here’s the ninth installment of my ongoing series on systemd for
Administrators:

On /etc/sysconfig and /etc/default

So, here’s a bit of an opinion piece on the /etc/sysconfig/ and
/etc/default directories that exist on the various distributions in
one form or another, and why I believe their use should be faded out. Like
everything I say on this blog what follows is just my personal opinion, and not
the gospel and has nothing to do with the position of the Fedora project or my
employer. The topic of /etc/sysconfig has been coming up in
discussions over and over again. I hope with this blog story I can explain a
bit what we as systemd upstream think about these files.

A few lines about the historical context: I wasn’t around when
/etc/sysconfig was introduced; suffice it to say it has been around on Red Hat
and SUSE distributions for a long, long time. Eventually /etc/default was
introduced on Debian with very similar semantics. Many other distributions know
a directory with similar semantics too, and most of them use one of these two
names for it. In fact, even other Unix OSes sported a directory like this. (Such
as SCO. If you are interested in the details, I am sure a Unix greybeard of
your trust can fill in what I am leaving vague here.) So, even though a
directory like this has been widely known on Linuxes and Unixes, it has never
been standardized, neither in POSIX nor in LSB/FHS. These directories very much
are something where distributions distinguish themselves from each other.

The semantics of /etc/default and /etc/sysconfig are only very
loosely defined. What almost all files stored in these directories have in common
though is that they are sourceable shell scripts which primarily consist of
environment variable assignments. Most of the files in these directories are
sourced by the SysV init scripts of the same name. The Debian Policy Manual
(9.3.2) and the Fedora Packaging Guidelines
suggest this use of the directories, however both distributions
also have files in them that do not follow this scheme, i.e. that do not have a
matching SysV init script, or are not even shell scripts at all.

Why have these files been introduced? On SysV systems services are started
via init scripts in /etc/rc.d/init.d (or a similar directory).
/etc/ is (these days) considered the place where system configuration
is stored. Originally these init scripts were subject to customization by the
administrator. But as they grew and became complex most distributions no longer
considered them true configuration files, but rather a special kind of program.
To make customization easy and guarantee a safe upgrade path the customizable
bits hence have been moved to separate configuration files, which the init
scripts then source.

Let’s have a quick look at what kind of configuration you can do with these
files. Here’s a short, non-exhaustive list of various things that can be
configured via environment settings in these source files, which I found browsing
through the directories on a Fedora and a Debian machine:

  • Additional command line parameters for the daemon binaries
  • Locale settings for a daemon
  • Shutdown time-out for a daemon
  • Shutdown mode for a daemon
  • System configuration like system locale, time zone information, console keyboard
  • Redundant system configuration, like whether the RTC is in local timezone
  • Firewall configuration data, not in shell format (!)
  • CPU affinity for a daemon
  • Settings unrelated to boot, for example including information how to install a new kernel package, how to configure nspluginwrap or whether to do library prelinking
  • Whether a specific service should be started or not
  • Networking configuration
  • Which kernel modules to statically load
  • Whether to halt or power-off on shutdown
  • Access modes for device nodes (!)
  • A description string for the SysV service (!)
  • The user/group ID, umask to run specific daemons as
  • Resource limits to set for a specific daemon
  • OOM adjustment to set for a specific daemon

Now, let’s go where the beef is: what’s wrong with /etc/sysconfig
(resp. /etc/default)? Why might it make sense to fade out use of these
files in a systemd world?

  • For the majority of these files the reason for having them simply does not
    exist anymore: systemd unit files are not programs like SysV init scripts
    were. Unit files are simple, declarative descriptions, that usually do not consist of more
    than 6 lines or so. They can easily be generated, parsed without a Bourne
    interpreter and understood by the reader. Also, they are very easy to modify:
    just copy them from /lib/systemd/system to
    /etc/systemd/system and edit them there, where they will not be
    modified by the package manager. The need to separate code and configuration
    that was the original reason to introduce these files does not exist anymore,
    as systemd unit files do not include code. These files hence now are a solution
    looking for a problem that no longer exists.
  • They are inherently distribution-specific. With systemd we hope to encourage
    standardization between distributions. Part of this is that we want unit files to be
    supplied by upstream, and not just added by the packager, as has usually
    been done in the SysV world. Since the location of the directory and the
    available variables in the files are very different on each distribution,
    supporting /etc/sysconfig files in upstream unit files is not
    feasible. Configuration stored in these files works against de-balkanization of
    the Linux platform.
  • Many settings are fully redundant in a systemd world. For example, various
    services support configuration of the process credentials like the user/group
    ID, resource limits, CPU affinity or the OOM adjustment settings. However, these settings are
    supported only by some SysV init scripts, and often have different names if
    supported in multiple of them. OTOH in systemd, all these settings are
    available equally and uniformly for all services, with the same configuration
    option in unit files.
  • Unit files know a large number of easy-to-use process context settings,
    that are more comprehensive than what most /etc/sysconfig files offer.
  • A number of these settings are entirely questionable. For example, the
    aforementioned configuration option for the user/group ID a service runs as is
    primarily something the distributor has to take care of. There is little to gain
    for administrators in changing these settings, and only the distributor has the
    broad overview to make sure that UID/GID and name collisions do not
    happen.
  • The file format is not ideal. Since the files are usually sourced as shell
    scripts, parse errors are very hard to decipher and are not logged alongside the
    other configuration problems of the services. Generally, unknown variable
    assignments simply have no effect, and no warning is issued about them. This makes
    these files harder to debug than necessary.
  • Configuration files sourced from shell scripts are subject to the execution
    parameters of the interpreter, and it has many: settings like IFS or LANG tend
    to drastically modify how shell scripts are parsed and understood. This makes
    them fragile.
  • Interpretation of these files is slow, since it requires spawning of a
    shell, which adds at least one process for each service to be spawned at boot.
  • Often, files in /etc/sysconfig are used to “fake” configuration
    files for daemons which do not support configuration files natively. This is
    done by gluing together command line arguments from these variable assignments
    that are then passed to the daemon. In general, however, proper native configuration
    files in these daemons are the much prettier solution. Command line
    options like “-k”, “-a” or “-f” are not self-explanatory and have a very
    cryptic syntax. Moreover the same switches in many daemons have (due to the
    limited vocabulary) often very much contradicting effects. (On one daemon
    -f might cause the daemon to daemonize, while on another one this
    option turns exactly this behaviour off.) Command lines generally cannot include
    sensible comments, which most configuration files however can.
  • A number of configuration settings in /etc/sysconfig are entirely
    redundant: for example, on many distributions it can be controlled via
    /etc/sysconfig files whether the RTC is in UTC or local time. Such an
    option already exists however in the third line of /etc/adjtime
    (which is known on all distributions). Adding a second, redundant,
    distribution-specific option overriding this is hence needless and complicates
    things for no benefit.
  • Many of the configuration settings in /etc/sysconfig allow
    disabling services. By this they basically become a second level of
    enabling/disabling over what the init system already offers: when a service is
    enabled with systemctl enable or chkconfig on these settings
    override this, and turn the daemon off even though the init system was
    configured to start it. This of course is very confusing to the
    user/administrator, and brings virtually no benefit.
  • For options like the configuration of static kernel modules to load: there
    are nowadays usually much better ways to load kernel modules at boot. For
    example, most modules may now be autoloaded by udev when the right hardware is
    found. This goes very far, and even includes ACPI and other high-level
    technologies. One of the very few exceptions where we currently do not do
    kernel module autoloading is CPU feature and model based autoloading which
    however will be supported soon too. And even if your specific module cannot be
    auto-loaded there’s usually a better way to statically load it, for example by
    sticking it in /etc/modules-load.d so that the administrator can check
    a standardized place for all statically loaded modules.
  • Last but not least, /etc already is intended to be the place for system
    configuration (“Host-specific system configuration” according to FHS). A
    subdirectory beneath it called sysconfig to place system configuration
    in is hence entirely redundant, already on the language level.
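
As a small illustration of the /etc/adjtime point in the list above: the UTC/local time setting can be read straight from the third line of that file, without any distribution-specific variable. The following is only a sketch; the function name rtc_mode is made up, and falling back to UTC when the file or the line is missing is an assumption modelled on the default hwclock documents.

```shell
# Print "UTC" or "LOCAL" depending on the third line of an adjtime file.
# The path defaults to /etc/adjtime but can be overridden for testing.
rtc_mode() {
    file=${1:-/etc/adjtime}
    mode=
    if [ -r "$file" ]; then
        mode=$(sed -n '3p' "$file")
    fi
    case "$mode" in
        LOCAL) echo LOCAL ;;
        *)     echo UTC ;;   # missing file or line: assume UTC
    esac
}
```

Calling rtc_mode without arguments inspects the system-wide file; passing a path lets you try it against a copy.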

What to use instead? Here are a few recommendations of what to do with these
files in the long run in a systemd world:

  • Just drop them without replacement. If they are fully redundant (like the
    local/UTC RTC setting) this should be a relatively easy way out (well,
    ignoring the need for compatibility). If systemd natively supports an
    equivalent option in the unit files there is no need to duplicate these
    settings in sysconfig files. For a list of execution options you may
    set for a service check out the respective man pages: systemd.exec(5)
    and systemd.service(5).
    If your setting simply adds another layer where a service can be disabled,
    remove it to keep things simple. There’s no need to have multiple ways to
    disable a service.
  • Find a better place for them. For configuration of the system locale or
    system timezone we hope to gently push distributions in the right direction;
    for more details see the previous episode of this series.
  • Turn these settings into native settings of the daemon. If necessary add
    support for reading native configuration files to the daemon. Thankfully, most
    of the stuff we run on Linux is Free Software, so this can relatively easily be
    done.
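
To give an idea of what dropping such settings in favour of native unit file options can look like, here is a hedged sketch that writes a unit file for a made-up daemon. The names foobar, foobard and all values are hypothetical; the directives themselves (User=, Group=, UMask=, LimitNOFILE=, CPUAffinity=, OOMScoreAdjust=) are documented in systemd.exec(5). The sketch writes into a scratch directory so it can be tried without touching the system; on a real machine the file would go to /etc/systemd/system.

```shell
# Scratch directory for the sketch; override UNIT_DIR to write elsewhere.
unit_dir=${UNIT_DIR:-$(mktemp -d)}

cat > "$unit_dir/foobar.service" <<'EOF'
[Unit]
Description=Foobar Daemon (hypothetical example)

[Service]
ExecStart=/usr/sbin/foobard
# Settings that sysconfig files often carried, expressed natively:
User=foobar
Group=foobar
UMask=0027
LimitNOFILE=4096
CPUAffinity=0 1
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $unit_dir/foobar.service"
```

The same option names apply to every service, which is the uniformity argued for above.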

Of course, there’s one very good reason for supporting these files for a bit
longer: compatibility for upgrades. But that’s really the only one I could
come up with. It’s reason enough to keep compatibility for a while, but I think
it is a good idea to phase out usage of these files at least in new packages.

If compatibility is important, then systemd will still allow you to read
these configuration files even if you otherwise use native systemd unit files.
If your sysconfig file only knows simple options,
EnvironmentFile=-/etc/sysconfig/foobar may be used to import the
settings into the environment and use them to put together command lines. (See
systemd.exec(5) for more information about this option.) If
you need a programming language to make sense of these settings, then use a
programming language like shell. For example, place a short shell script in
/usr/lib/<your package>/ which reads these files for
compatibility, and then execs the actual daemon binary. Then spawn
this script instead of the actual daemon binary with ExecStart= in the
unit file.
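
A compatibility wrapper of the kind just described might look roughly like this. It is only a sketch under stated assumptions: the daemon foobard, its --port and --verbose switches, the variables FOOBAR_PORT and FOOBAR_VERBOSE, and the script name foobar-wrapper are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical compatibility wrapper, e.g. installed as
# /usr/lib/foobar/foobar-wrapper. All FOOBAR_* names are invented.

build_args() {
    # $1: path to the legacy sysconfig file (may be absent)
    conf=$1
    FOOBAR_PORT=
    FOOBAR_VERBOSE=no
    # Source the legacy file if present; a missing file is not an error.
    if [ -r "$conf" ]; then
        . "$conf"
    fi
    args=
    if [ -n "$FOOBAR_PORT" ]; then
        args="$args --port=$FOOBAR_PORT"
    fi
    if [ "$FOOBAR_VERBOSE" = "yes" ]; then
        args="$args --verbose"
    fi
    printf '%s\n' "${args# }"
}

# Only exec the daemon when run under the wrapper's own name, so the
# function above can also be sourced and inspected on its own.
case "${0##*/}" in
    foobar-wrapper)
        # Unquoted on purpose: the argument string is split into words.
        exec /usr/sbin/foobard $(build_args /etc/sysconfig/foobar)
        ;;
esac
```

The unit file would then point ExecStart= at the wrapper instead of the daemon binary. Because the wrapper execs the daemon, no extra shell process stays around once the service is running.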

And this is all for now. Thank you very much
for your interest.

systemd Documentation

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-docs.html

Fedora 15 is out. Get it
while it is hot! It is probably the biggest distribution release of all time,
being the first to ship both GNOME 3 and systemd.

Since this is the first distribution release based on systemd, it might be interesting to
read up on what it is all about. Here’s a little compilation of the available
documentation for systemd.

The Manual Pages

Here’s the full list of all man pages.

The Blog Stories

Some of the systemd for Administrators blog posts are available in Russian language, too.

Other Documentation

Fedora Documentation

In The Press

Other Distributions’ Documentation

And, if you still have questions after all of this, please join
our mailing list, or our IRC channel #systemd on
irc.freenode.org. Alternatively, if you are looking for paid
consulting services for systemd contact our
friends at ProFUSION.

Why systemd?

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/why.html

systemd is still a young project, but it is not a baby anymore. I posted
the initial announcement precisely a year ago. Since then most of the
big distributions have decided to adopt it in one way or another, and many
smaller distributions have already switched. The first big
distribution with systemd by default will be Fedora 15, due end of
May. It is expected that the others will follow the lead a bit later
(with one exception). Many
embedded developers have already adopted it too, and there’s even a company
specializing in engineering and consulting services for systemd. In short:
within one year systemd became a really successful project.

However, there are still folks who we haven’t won over yet. If you
fall into one of the following categories, then please have a look at
the comparison of init systems below:

  • You are working on an embedded project and are wondering whether
    it should be based on systemd.
  • You are a user or administrator and wondering which distribution
    to pick, and are pondering whether it should be based on systemd or
    not.
  • You are a user or administrator and wondering why your favourite
    distribution has switched to systemd, if everything already worked so
    well before.
  • You are developing a distribution that hasn’t switched yet, and
    you are wondering whether to invest the work and go systemd.

And even if you don’t fall into any of these categories, you might still
find the comparison interesting.

We’ll be comparing the three most relevant init systems for Linux:
sysvinit, Upstart and systemd. Of course there are other init systems
in existence, but they play virtually no role in the big
picture. Unless you run Android (which is a completely different beast
anyway), you’ll almost definitely run one of these three init systems
on your Linux kernel. (OK, or busybox, but then you are basically not
running any init system at all.) Unless you have a soft spot for
exotic init systems there’s little need to look further. Also, I am
kinda lazy, and don’t want to spend the time on analyzing those other
systems in enough detail to be completely fair to them.

Speaking of fairness: I am of course one of the creators of
systemd. I will try my best to be fair to the other two contenders,
but in the end, take it with a grain of salt. I am sure though that
should I be grossly unfair or otherwise incorrect somebody will point
it out in the comments of this story, so consider having a look at
those before you put too much trust in what I say.

We’ll look at the currently implemented features in a released
version. Grand plans don’t count.

General Features

Feature | sysvinit | Upstart | systemd
Interfacing via D-Bus | no | yes | yes
Shell-free bootup | no | no | yes
Modular C coded early boot services included | no | no | yes
Read-Ahead | no | no[1] | yes
Socket-based Activation | no | no[2] | yes
Socket-based Activation: inetd compatibility | no | no[2] | yes
Bus-based Activation | no | no[3] | yes
Device-based Activation | no | no[4] | yes
Configuration of device dependencies with udev rules | no | no | yes
Path-based Activation (inotify) | no | no | yes
Timer-based Activation | no | no | yes
Mount handling | no | no[5] | yes
fsck handling | no | no[5] | yes
Quota handling | no | no | yes
Automount handling | no | no | yes
Swap handling | no | no | yes
Snapshotting of system state | no | no | yes
XDG_RUNTIME_DIR Support | no | no | yes
Optionally kills remaining processes of users logging out | no | no | yes
Linux Control Groups Integration | no | no | yes
Audit record generation for started services | no | no | yes
SELinux integration | no | no | yes
PAM integration | no | no | yes
Encrypted hard disk handling (LUKS) | no | no | yes
SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agents | no | no | yes
Network Loopback device handling | no | no | yes
binfmt_misc handling | no | no | yes
System-wide locale handling | no | no | yes
Console and keyboard setup | no | no | yes
Infrastructure for creating, removing, cleaning up of temporary and volatile files | no | no | yes
Handling for /proc/sys sysctl | no | no | yes
Plymouth integration | no | yes | yes
Save/restore random seed | no | no | yes
Static loading of kernel modules | no | no | yes
Automatic serial console handling | no | no | yes
Unique Machine ID handling | no | no | yes
Dynamic host name and machine meta data handling | no | no | yes
Reliable termination of services | no | no | yes
Early boot /dev/log logging | no | no | yes
Minimal kmsg-based syslog daemon for embedded use | no | no | yes
Respawning on service crash without losing connectivity | no | no | yes
Gapless service upgrades | no | no | yes
Graphical UI | no | no | yes
Built-In Profiling and Tools | no | no | yes
Instantiated services | no | yes | yes
PolicyKit integration | no | no | yes
Remote access/Cluster support built into client tools | no | no | yes
Can list all processes of a service | no | no | yes
Can identify service of a process | no | no | yes
Automatic per-service CPU cgroups to even out CPU usage between them | no | no | yes
Automatic per-user cgroups | no | no | yes
SysV compatibility | yes | yes | yes
SysV services controllable like native services | yes | no | yes
SysV-compatible /dev/initctl | yes | no | yes
Reexecution with full serialization of state | yes | no | yes
Interactive boot-up | no[6] | no[6] | yes
Container support (as advanced chroot() replacement) | no | no | yes
Dependency-based bootup | no[7] | no | yes
Disabling of services without editing files | yes | no | yes
Masking of services without editing files | no | no | yes
Robust system shutdown within PID 1 | no | no | yes
Built-in kexec support | no | no | yes
Dynamic service generation | no | no | yes
Upstream support in various other OS components | yes | no | yes
Service files compatible between distributions | no | no | yes
Signal delivery to services | no | no | yes
Reliable termination of user sessions before shutdown | no | no | yes
utmp/wtmp support | yes | yes | yes
Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management tools | no | no | yes

[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.

[2] Socket activation implementation for Upstart available as preview, lacks parallelization support hence entirely misses the point of socket activation.

[3] Bus activation implementation for Upstart posted as patch, not merged.

[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.

[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.

[6] Some distributions offer this implemented in shell.

[7] LSB init scripts support this, if they are used.

Available Native Service Settings

Setting | sysvinit | Upstart | systemd
OOM Adjustment | no | yes[1] | yes
Working Directory | no | yes | yes
Root Directory (chroot()) | no | yes | yes
Environment Variables | no | yes | yes
Environment Variables from external file | no | no | yes
Resource Limits | no | some[2] | yes
umask | no | yes | yes
User/Group/Supplementary Groups | no | no | yes
IO Scheduling Class/Priority | no | no | yes
CPU Scheduling Nice Value | no | yes | yes
CPU Scheduling Policy/Priority | no | no | yes
CPU Scheduling Reset on fork() control | no | no | yes
CPU affinity | no | no | yes
Timer Slack | no | no | yes
Capabilities Control | no | no | yes
Secure Bits Control | no | no | yes
Control Group Control | no | no | yes
High-level file system namespace control: making directories inaccessible | no | no | yes
High-level file system namespace control: making directories read-only | no | no | yes
High-level file system namespace control: private /tmp | no | no | yes
High-level file system namespace control: mount inheritance | no | no | yes
Input on Console | yes | yes | yes
Output on Syslog | no | no | yes
Output on kmsg/dmesg | no | no | yes
Output on arbitrary TTY | no | no | yes
Kill signal control | no | no | yes
Conditional execution: by identified CPU virtualization/container | no | no | yes
Conditional execution: by file existence | no | no | yes
Conditional execution: by security framework | no | no | yes
Conditional execution: by kernel command line | no | no | yes

[1] Upstart supports only the deprecated oom_adj mechanism, not the current oom_score_adj logic.

[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.

Note that some of these options are relatively easily added to SysV
init scripts, by editing the shell sources. The table above focusses
on easily accessible options that do not require source code
editing.

Miscellaneous

Aspect | sysvinit | Upstart | systemd
Maturity | > 15 years | 6 years | 1 year
Specialized professional consulting and engineering services available | no | no | yes
SCM | Subversion | Bazaar | git
Copyright-assignment-free contributing | yes | no | yes

Summary

As the tables above hopefully show in all clarity systemd
has left behind both sysvinit and Upstart in almost every
aspect. With the exception of the project’s age/maturity systemd wins
in every category. At this point in time it will be very hard for
sysvinit and Upstart to catch up with the features systemd provides
today. In one year we managed to push systemd forward much further
than Upstart has been pushed in six.

It is our intention to drive forward the development of the Linux
platform with systemd. In the next release cycle we will focus more
strongly on providing the same features and speed improvement we
already offer for the system to the user login session. This will
bring much closer integration with the other parts of the OS and
applications, making the most of the features the service manager
provides, and making it available to login sessions. Certain
components such as ConsoleKit will be made redundant by these
upgrades, and services relying on them will be updated. The
burden of maintaining these then-obsolete components
will be passed on to the vendors who plan to continue to rely on
them.

If you are wondering whether or not to adopt systemd, then systemd
obviously wins when it comes to mere features. Of course that should
not be the only aspect to keep in mind. In the long run, sticking with
the existing infrastructure (such as ConsoleKit) comes at a price:
porting work needs to take place, and additional maintenance work for
bitrotting code needs to be done. Going it on your own means increased
workload.

That said, adopting systemd is also not free. Especially if you
made investments in the other two solutions adopting systemd means
work. The basic work to adopt systemd is relatively minimal for
porting over SysV systems (since compatibility is provided), but can
mean substantial work when coming from Upstart. If you plan to go for
a 100% systemd system without any SysV compatibility (recommended for
embedded, long run goal for the big distributions) you need to be
willing to invest some work to rewrite init scripts as simple systemd
unit files.

systemd is in the process of becoming a comprehensive, integrated
and modular platform providing everything needed to bootstrap and
maintain an operating system’s userspace. It includes C rewrites of
all basic early boot init scripts that are shipped with the various
distributions. Especially for the embedded case adopting systemd
provides you in one step with almost everything you need, and you can
pick the modules you want. The other two init systems are singular
individual components, which to be useful need a great number of
additional components with differing interfaces. The emphasis of
systemd to provide a platform instead of just a component allows for
closer integration, and cleaner APIs. Sooner or later this will
trickle up to the applications. Already, there are accepted XDG
specifications (e.g. XDG basedir spec, more specifically
XDG_RUNTIME_DIR) that are not supported on the other init systems.

systemd is also a big opportunity for Linux standardization. Since
it standardizes many interfaces of the system that previously have
been differing on every distribution, on every implementation,
adopting it helps to work against the balkanization of the Linux
interfaces. Choosing systemd means redefining more closely
what the Linux platform is about. This improves the lives of
programmers, users and administrators alike.

I believe that momentum is clearly with systemd. We invite you to
join our community and be part of that momentum.

systemd for Administrators, Part VIII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/the-new-configuration-files.html

Another episode of my ongoing series on systemd for Administrators:

The New Configuration Files

One of the formidable new features of systemd is
that it comes with a complete set of modular early-boot services that are
written in simple, fast, parallelizable and robust C, replacing the
shell “novels” the various distributions featured before. Our little
Project Zero Shell[1] has been a full success. We currently
cover pretty much everything most desktop and embedded
distributions should need, plus a big part of the server needs:

  • Checking and mounting of all file systems
  • Updating and enabling quota on all file systems
  • Setting the host name
  • Configuring the loopback network device
  • Loading the SELinux policy and relabelling /run and /dev as necessary on boot
  • Registering additional binary formats in the kernel, such as Java, Mono and WINE binaries
  • Setting the system locale
  • Setting up the console font and keyboard map
  • Creating, removing and cleaning up of temporary and volatile files and directories
  • Applying mount options from /etc/fstab to pre-mounted API VFS
  • Applying sysctl kernel settings
  • Collecting and replaying readahead information
  • Updating utmp boot and shutdown records
  • Loading and saving the random seed
  • Statically loading specific kernel modules
  • Setting up encrypted hard disks and partitions
  • Spawning automatic gettys on serial kernel consoles
  • Maintenance of Plymouth
  • Machine ID maintenance
  • Setting of the UTC distance for the system clock

On a standard Fedora 15 install, only a few legacy and storage
services still require shell scripts during early boot. If you don’t
need those, you can easily disable them and enjoy your shell-free boot
(like I do every day). The shell-less boot systemd offers you is a
unique feature on Linux.

Many of these small components are configured via configuration
files in /etc. Some of these are fairly standardized among
distributions and hence supporting them in the C implementations was
easy and obvious. Examples include: /etc/fstab,
/etc/crypttab or /etc/sysctl.conf. However, for
others no standardized file or directory existed, which forced us to add
#ifdef orgies to our sources to deal with the different
places where the distributions we want to support store these things. All
these configuration files have in common that they are dead-simple and
there is simply no good reason for distributions to distinguish
themselves with them: they all do the very same thing, just
a bit differently.

To improve the situation and benefit from the unifying force that
systemd is, we decided to read the per-distribution configuration
files only as fallbacks, and to introduce new configuration
files as the primary source of configuration wherever applicable. Of
course, where possible these standardized configuration files should
not be new inventions but rather just standardizations of the best
distribution-specific configuration files previously used. Here’s a
little overview of the new common configuration files systemd
supports on all distributions:

  • /etc/hostname:
    the host name for the system. One of the most basic and trivial
    system settings. Nonetheless previously all distributions used
    different files for this. Fedora used /etc/sysconfig/network,
    OpenSUSE /etc/HOSTNAME. We chose to standardize on the
    Debian configuration file /etc/hostname.
  • /etc/vconsole.conf:
    configuration of the default keyboard mapping and console font.
  • /etc/locale.conf:
    configuration of the system-wide locale.
  • /etc/modules-load.d/*.conf:
    a drop-in directory for kernel modules to statically load at
    boot (for the very few that still need this).
  • /etc/sysctl.d/*.conf:
    a drop-in directory for kernel sysctl parameters, extending what you
    can already do with /etc/sysctl.conf.
  • /etc/tmpfiles.d/*.conf:
    a drop-in directory for configuration of runtime files that need to be
    removed/created/cleaned up at boot and during uptime.
  • /etc/binfmt.d/*.conf:
    a drop-in directory for registration of additional binary formats for
    systems like Java, Mono and WINE.
  • /etc/os-release:
    a standardization of the various distribution ID files like
    /etc/fedora-release and similar. Really every distribution
    introduced their own file here; writing a simple tool that just prints
    out the name of the local distribution usually means including a
    database of release files to check. The LSB tried to standardize
    something like this with the lsb_release
    tool, but quite frankly the idea of employing a shell script in this
    is not the best choice the LSB folks ever made. To rectify this we
    just decided to generalize this, so that everybody can use the same
    file here.
  • /etc/machine-id:
    a machine ID file, superseding D-Bus’ machine ID file. This file is
    guaranteed to exist and be valid on a systemd system, including on
    stateless boots. By moving this out of the D-Bus logic we hope it
    becomes interesting for a lot of additional uses as a unique and stable
    machine identifier.
  • /etc/machine-info:
    a new information file encoding metadata about a host, like a pretty
    host name and an icon name, replacing things like
    /etc/favicon.png. This is maintained by systemd-hostnamed.
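
Most of these files are plain text and trivial to parse. As a rough
illustration (not code from systemd itself), here is a minimal C sketch
that looks up one field in a file of shell-style KEY=VALUE lines, which
is the format /etc/os-release uses. The function name conf_get() is made
up for this example, and the field name PRETTY_NAME in the usage note
comes from the os-release documentation rather than from this post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up key in a file of shell-style KEY=VALUE lines and copy the
 * value, without surrounding quotes, into out. Returns 0 on success,
 * -1 if the file cannot be read or the key is missing. Escape
 * sequences inside quoted values are not interpreted in this sketch. */
int conf_get(const char *path, const char *key, char *out, size_t outlen)
{
        char line[512];
        size_t keylen;
        FILE *f;

        assert(path && key && out && outlen > 0);

        f = fopen(path, "r");
        if (!f)
                return -1;

        keylen = strlen(key);
        while (fgets(line, sizeof(line), f)) {
                char *value;
                size_t n;

                /* Skip comments and lines that belong to other keys */
                if (line[0] == '#' ||
                    strncmp(line, key, keylen) != 0 ||
                    line[keylen] != '=')
                        continue;

                value = line + keylen + 1;
                value[strcspn(value, "\r\n")] = '\0';   /* drop the newline */

                /* Strip one pair of matching single or double quotes */
                n = strlen(value);
                if (n >= 2 && (value[0] == '"' || value[0] == '\'') &&
                    value[n - 1] == value[0]) {
                        value[n - 1] = '\0';
                        value++;
                }

                snprintf(out, outlen, "%s", value);
                fclose(f);
                return 0;
        }

        fclose(f);
        return -1;
}
```

A caller could then obtain the distribution name with
conf_get("/etc/os-release", "PRETTY_NAME", buf, sizeof(buf)), without
carrying a database of per-distribution release files.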

It is our definite intention to convince you to use these new
configuration files in your configuration tools: if your
configuration frontend writes these files instead of the old ones, it
automatically becomes more portable between Linux distributions, and
you are helping to standardize Linux. This makes things simpler to
understand and more obvious for users and administrators. Of course,
right now, only systemd-based distributions read these files, but that
already covers all important distributions in one way or another,
except for one. And it’s a bit of a
chicken-and-egg problem: a standard becomes a standard by being
used. In order to gently push everybody to standardize on these files
we also want to make clear that sooner or later we plan to drop the
fallback support for the old configuration files from
systemd. That means adoption of this new scheme can happen slowly and piece
by piece. But the final goal of only having one set of configuration
files must be clear.

Many of these configuration files are relevant not only for
configuration tools but also (and sometimes even primarily) in
upstream projects. For example, we invite projects like Mono, Java, or
WINE to install a drop-in file in /etc/binfmt.d/ from their
upstream build systems. Per-distribution downstream support for binary
formats would then no longer be necessary and your platform would work
the same on all distributions. Something similar applies to all
software which needs creation/cleaning of certain runtime files and
directories at boot, for example beneath the /run hierarchy
(i.e. /var/run as it used to be known). These
projects should just drop in configuration files in
/etc/tmpfiles.d, also from the upstream build systems. This
also helps speed up the boot process, as separate per-project SysV
shell scripts which implement trivial things like registering a binary
format or removing/creating temporary/volatile files at boot are no
longer necessary. Or another example where upstream support would be
fantastic: projects like X11 could probably benefit from reading the
default keyboard mapping for their displays from
/etc/vconsole.conf.

Of course, I have no doubt that not everybody is happy with our
choice of names (and formats) for these configuration files. In the
end we had to pick something, and from all the choices these appeared
to be the most convincing. The file formats are as simple as they can
be, and usually easily written and read even from shell scripts. That
said, /etc/bikeshed.conf could of course also have been a
fantastic configuration file name!

So, help us standardize Linux! Use the new configuration files!
Adopt them upstream, adopt them downstream, adopt them all across the
distributions!

Oh, and in case you are wondering: yes, all of these files were
discussed in one way or another with various folks from the various
distributions. And there has even been some push towards supporting
some of these files even outside of systemd systems.

Footnotes

[1] Our slogan: “The only shell that should get started
during boot is gnome-shell!” Yes, the slogan needs a bit of
work, but you get the idea.

systemd for Administrators, Part IV

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-for-admins-4.html

Here’s the fourth installment of my ongoing series about systemd for administrators.

Killing Services

Killing a system daemon is easy, right? Or is it?

Sure, as long as your daemon consists of only a single process this might
actually be somewhat true. You type killall rsyslogd and the syslog
daemon is gone. However, it is a bit dirty to do it like that, given that this
will kill all processes which happen to be called like this, including those an
unlucky user might have named that way by accident. A slightly more correct
version would be to read the .pid file, i.e. kill `cat
/var/run/syslogd.pid`. That already gets us much further, but still, is
this really what we want?

More often than not it actually isn’t. Consider a service like Apache, or
crond, or atd, which as part of their usual operation spawn child processes.
Arbitrary, user configurable child processes, such as cron or at jobs, or CGI
scripts, even full application servers. If you kill the main apache/crond/atd
process this might or might not pull down the child processes too, and it’s up
to those processes whether they want to stay around or go down as well.
Basically that means that terminating Apache might very well cause its CGI
scripts to stay around, reassigned to be children of init, and difficult to
track down.

systemd to the rescue: With systemctl kill you can easily send a signal to all
processes of a service. Example:

# systemctl kill crond.service

This will ensure that SIGTERM is delivered to all processes of the crond
service, not just the main process. Of course, you can also send a different
signal if you wish. For example, if you are bad-ass you might want to go for
SIGKILL right away:

# systemctl kill -s SIGKILL crond.service

And there you go, the service will be brutally slaughtered in its entirety,
regardless of how many times it forked, or whether it tried to escape supervision by
double forking or fork bombing.
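
To make the idea concrete, here is a minimal C sketch of “signal every
process of a service”. It assumes the PIDs of the service are available
one per line in a text file, for example the process list of the
service’s cgroup. The path is up to the caller, and signal_all_listed()
is a hypothetical helper, not a systemd API; systemctl kill does the
equivalent work for you and knows where to look.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Send sig to every PID listed, one per line, in the file at
 * procs_path. Returns the number of processes signalled, or -1 if
 * the file cannot be opened. */
int signal_all_listed(const char *procs_path, int sig)
{
        FILE *f;
        long pid;
        int count = 0;

        assert(procs_path);

        f = fopen(procs_path, "r");
        if (!f)
                return -1;

        while (fscanf(f, "%ld", &pid) == 1) {
                /* Skip PID 1 and any value kill() would interpret as
                 * "a whole process group" (0 and negative numbers) */
                if (pid <= 1)
                        continue;
                if (kill((pid_t) pid, sig) == 0)
                        count++;
        }

        fclose(f);
        return count;
}
```

A single pass like this can miss processes that fork while the loop is
running, so treat it as an illustration of the principle only.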

Sometimes all you need is to send a specific signal to the main process of a
service, maybe because you want to trigger a reload via SIGHUP. Instead of going via the
PID file, here’s an easier way to do this:

# systemctl kill -s HUP --kill-who=main crond.service

So again, what is so new and fancy about killing services in systemd? Well,
for the first time on Linux we can actually do that properly. Previous
solutions always depended on the daemons actually cooperating to bring
down everything they spawned when they themselves terminated. However, usually if
you want to use SIGTERM or SIGKILL you are doing that because they do
not cooperate properly with you.

How does this relate to systemctl stop? kill goes directly
and sends a signal to every process in the group, whereas stop goes
through the officially configured way to shut down a service, i.e. invokes the
stop command configured with ExecStop= in the service file. Usually
stop should be sufficient. kill is the tougher version, for
cases where you either don’t want the official shutdown command of a service to
run, or where the service is hosed and hung in other ways.

(It’s up to you BTW to specify signal names with or without the SIG prefix
on the -s switch. Both work.)

It’s a bit surprising that we have come so far on Linux without even being
able to properly kill services. systemd for the first time enables you to do
this properly.

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-2.html

It has been a while since my last status update on systemd. Here’s another short,
non-comprehensive status update on what we worked on for systemd since then.

  • Fedora F15 (Rawhide) now includes a split up
    /etc/init.d/rc.sysinit (Bill Nottingham). This allows us to keep only
    a minimal compatibility set of shell scripts around, and otherwise boot a
    system without any shell scripts at all. In fact, shell scripts during early
    boot are only used in exceptional cases, i.e. when you enabled autoswapping
    (bad idea anyway), when a full SELinux relabel is necessary, during the first
    boot after initialization, if you have static kernel modules to load (which are
    not configured via the systemd-native way to do that), if you boot from a
    read-only NFS server, or when you rely on LVM/RAID/Multipath. If none of
    this applies to you, you can easily disable these parts of early boot and
    save several seconds on boot. How to do this I will describe in a later blog
    story.
  • We have a fully C coded shutdown logic that kills all remaining processes,
    unmounts all remaining file systems, detaches all loop devices and DM volumes
    and does that in the right way to ensure that all these things are properly
    torn down even if they depend on each other in arbitrary ways. This is not
    only considerably faster than the traditional shell hackery for this, but also
    a lot safer, since we try to unmount/remount the remaining file systems with a
    little bit of brains. This feature is available to the administrator via
    systemctl --force poweroff. The --force controls whether the
    usual shutdown of all services is run or whether this is skipped and we
    immediately enter this final C shutdown logic. Using --force
    hence is a much safer replacement for the old /sbin/reboot -f and does
    not leave dirty file systems behind. (Thanks to Fabiano Fidencio and his
    colleagues from ProFUSION for this).
  • systemd now includes a minimalistic readahead implementation, based on
    fanotify(), fadvise() and mincore(). It supports btrfs defragmentation and both
    SSD and HDD disks. While the effect on boots that are anyway fast (such as most
    stuff involving SSD) is minimal, slower and older machines benefit from this
    more substantially.
  • We now control fsck and quota during early boot with a C tool that ensures
    maximum parallelization but properly implements the necessary high-level
    administration logic.
  • Every service, every user and every user session now gets its own cgroup in
    the ‘cpu’ hierarchy thus creating better fairness between the logged in users
    and their sessions.
  • We now provide /dev/log logging from early boot to late shutdown.
    If no syslog daemon is running the output is passed on to kmsg. As soon as a
    proper syslog daemon starts up the kmsg buffer is flushed to syslog, and hence
    we will have complete log coverage in syslog even for early boot.
  • systemctl kill was introduced, an easy command to send a signal to
    all processes of a service. Expect a blog story with more details about this
    shortly.
  • systemd gained the ability to load the SELinux policy if necessary, thus
    supporting non-initrd boots and initrd boots from the same binary with no
    duplicate work. This is in fact (and surprisingly) a first among Linux init
    systems.
  • We now initialize and set the system locale inside PID 1 to be inherited by
    all services and users.
  • systemd has native support for /etc/crypttab and can activate
    encrypted LUKS/dm-crypt disks both at boot-up and during runtime. A minimal
    password querying infrastructure is available, where multiple agents can be
    used to present the password to the user. During boot the password is queried
    either via Plymouth or directly on the console. If a system crypto disk is
    plugged in after boot you are queried for the password via a GNOME agent, or a
    wall(1) agent. Finally, while you run systemctl start (or a similar
    command) a minimal TTY password agent is available which asks you for passwords
    right away if this is necessary. The password querying logic is very simple;
    additional agents can be implemented in a trivial amount of code (Yupp, KDE folks, you
    can add an agent for this, too). Note that the password querying logic in
    systemd is only for non-user passwords, i.e. passwords that have no relation to
    a specific user, but rather to specific hardware or system software. In future
    we hope to extend this so that this can be used to query the password of SSL
    certificates when Apache or other servers start.
  • We offer a minimal interface that external projects can use to extend the
    dependency graph systemd manages. In fact, the cryptsetup logic mentioned above
    is implemented via this ‘plugin’-like system. Since we did not want to add code
    that deals with cryptographic disks into the systemd process itself we
    introduced this interface (after all, cryptographic volumes are not an essential
    feature of a minimal OS, and unnecessary on most embedded uses; also the future
    might bring us STC which might make this at least partially obsolete). Simply
    by dropping a generator binary into
    /lib/systemd/system-generators, which should write out systemd unit
    files into a temporary directory, third-party packages may extend the systemd
    dependency tree dynamically. This could be useful for example to automatically
    create a systemd service for each KVM machine or LXC container. With that in
    place those containers/machines could be managed and supervised with the same
    tools as the usual system services.
  • We integrated automatic clean-up of directories such as /tmp into
    the tmpfiles logic we already had in place that recreates files and
    directories on volatile file systems such as /var/run,
    /var/lock or /tmp.
  • We now always measure the system startup time and write it to the log
    files, broken up into how much time was spent on the kernel, the initrd and
    the initialization of userspace.
  • We now safely destroy all user sessions before going down. This is a feature
    long missing on Linux: since user processes were not killed until the very last
    moment, the unhealthy situation of user code running at a time when no
    other daemon remained was a normal part of shutdown.
  • systemd now understands an ‘extreme’ form of disabling a service: if you
    symlink a service name in /etc/systemd/system to /dev/null
    then systemd will mark it as masked and completely refuse to start it,
    regardless of whether this is requested manually or automatically. Normally it should
    be sufficient to simply call systemctl disable to disable a service,
    which still allows manual activation but no automatic activation. Masking a
    service goes one step further.
  • There’s now a simple condition syntax in place which allows
    skipping or enabling units depending on the existence of a file, whether a
    directory is empty or whether a kernel command line option is set.
  • In addition to normal shutdowns for reboot, halt or poweroff we now
    similarly support a kexec reboot, that reboots the machine without going through
    the BIOS code again.
  • We have bash completion support for systemctl. (Ran Benita)
  • Andrew Edmunds contributed basic support to boot Ubuntu with systemd.
  • Michael Biebl and Tollef Fog Heen have worked on the systemd integration
    into Debian to a level that it is now possible to boot a system without having
    the old initscripts package installed. For more details see the Debian Wiki. Michael even
    tested this integration on an Ubuntu Natty system and as it turns out this
    works almost equally well on Ubuntu already. If you are interested in playing
    around with this, ping Michael.
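
The masking convention from the list above is easy to check from other
tools. The following C sketch tests whether a unit file path is a
symlink to /dev/null. unit_is_masked() is a hypothetical helper written
for this example, and systemd’s own check may well be more involved.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if unit_path is a symlink pointing to /dev/null, 0 if it
 * is anything else, and -1 if it does not exist or cannot be read. */
int unit_is_masked(const char *unit_path)
{
        char target[256];
        ssize_t n;

        assert(unit_path);

        n = readlink(unit_path, target, sizeof(target) - 1);
        if (n < 0)
                return errno == EINVAL ? 0 : -1;   /* EINVAL: not a symlink */

        target[n] = '\0';   /* readlink() does not terminate the string */
        return strcmp(target, "/dev/null") == 0;
}
```

For example, unit_is_masked("/etc/systemd/system/foo.service") would
return 1 after ln -s /dev/null /etc/systemd/system/foo.service, with
foo.service being a placeholder name.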

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you the details.

We have come quite far in the last year. systemd is about a year old now,
and we are now able to boot a system without legacy shell scripts remaining,
something that appeared to be a task for the distant future.

All of this is available in systemd 13 and in F15/Rawhide as I type
this. If you want to play around with this then consider installing Rawhide
(it’s fun!).

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update.html

It has been a while since my original announcement of systemd.
Here’s a little status update on what
happened since then. For simplicity’s sake I’ll just list here what we
worked on in a bulleted list, with no particular order and without
trying to cover this comprehensively:

  • systemd has been accepted as Feature for Fedora 14, and as it
    looks right now everything worked out nicely and we’ll ship F14 with
    systemd as init system.
  • We added a number of additional unit types: .timer for
    cron-style timer-based activation of services, .swap exposes
    swap files and partitions the same way we handle mount points, and
    .path can be used to activate units depending on the
    existence/creation of files or fill status of spool directories.
  • We hooked systemd up to SELinux: systemd is now capable of
    properly labelling directories, sockets and FIFOs it creates according
    to the SELinux policy for the services we maintain.
  • We hooked systemd up to the Linux auditing subsystem: as the first
    init system ever, systemd now generates auditing records for all
    services it starts/stops, including their failure status.
  • We hooked systemd up to TCP wrappers, for all socket connections
    it accepts.
  • We hooked systemd up to PAM, so that optionally, when systemd runs
    a service as a different user it initializes the usual PAM session
    setup and teardown hooks.
  • We hooked systemd up to D-Bus, so that D-Bus passes activation
    requests to systemd and systemd becomes the central point for all
    kinds of activation, thus greatly extending the control of the
    execution environment of bus activated services, and making them
    accessible through the same utilities as SysV services. Also, this
    enables us to do race-free parallelized start-up for D-Bus services
    and their clients, thus speeding up things even further.
  • systemd is now able to handle various Debian and OpenSUSE-specific
    extensions to the classic SysV init script formats natively, on top of
    the Fedora extensions we already parse.
  • The D-Bus coverage of the systemd interface is now complete,
    allowing both introspection of runtime data and of parsed
    configuration data. It’s fun now to introspect systemd with gdbus
    or d-feet.
  • We added a systemd PAM module, which assigns the processes of each user session to
    its own cgroup in the systemd cgroup tree. This also enables reliable
    killing of all processes associated with a session when the user logs
    out. The module also manages a secure per-user /var/run-style directory
    which is supposed to be used for sockets and similar files that shall
    be cleaned up when the user logs out.
  • There’s a new tool systemd-cgls,
    which plots a pretty process tree based on the systemd cgroup
    hierarchy. It’s really pretty. Try it!
  • We now have our own cgroup hierarchy beneath
    /cgroup/systemd (though it will move to /sys/fs/
    before the F14 release).
  • We have pretty code that automatically spawns a getty on a serial
    port when the kernel console is redirected to a serial TTY.
  • systemctl got beefed up substantially (it can even draw
    dependency graphs now, via dot!), and the SysV compatibility
    tools were extended to more completely and correctly support what was
    historically provided by SysV. For example, we’ll now warn the user
    when systemd service files have changed but systemd was not asked to
    reload its configuration. Also, you can now use systemd’s native
    client tools to reboot or shut-down an Upstart or sysvinit system, to
    facilitate upgrades.
  • We provide a reference implementation for the socket activation and
    other APIs for nicer interaction with systemd.
  • We have a pretty complete set of documentation now, some of it
    even extending to areas not directly related to systemd
    itself.
  • Quite a number of upstream packages now ship with systemd service
    files out-of-the-box that work across all distributions that have
    adopted systemd. It is our intention to unify the boot and service
    management between distributions with systemd, and this is already
    bearing fruit. Furthermore a number of upstream packages now ship our
    patches for socket-based activation.
  • Even more options that control the process execution environment
    or the sockets we create are now supported.
  • Earlier today I began my series of blog stories on systemd
    for administrators.
  • We reimplemented almost all boot-up and shutdown scripts of the
    standard Fedora install in much smaller, simpler and faster C
    utilities, or in systemd itself. Most of this will not be enabled in
    F14 however, even though it is shipped with systemd upstream. With
    this enabled the entire Linux system gains a completely new feeling as
    the number of shells we spawn approaches zero, and the PID of the
    first user terminal is way < 500 now, and the early boot-up is
    fully parallelized. We looked at the boot scripts of Fedora, OpenSUSE
    and Debian and distilled from this a list of functionality that makes
    up the early boot process and reimplemented this in C, if possible
    following the behaviour of one of the existing implementations from
    these three distributions. This turned out to be much less effort than
    anticipated, and we are actually quite excited about this. Look
    forward to the fruits of this work in F15, when we might be able to
    present you a shell-less boot at least for standard desktop/laptop
    systems.
  • We spent some time reinvestigating the current syslog logic, and
    came up with an elegant and simple scheme to provide /dev/log
    compatible logging right from the time systemd is first initialized
    right until the time the kernel halts the machine. Through the wonders
    of socket based activation we first connect the /dev/log
    socket with a minimal bridge to the kernel log buffer (kmsg)
    and then, as soon as the real syslog is started up as part of the
    later bootup phase, we dynamically replace this minimal bridge by the
    real syslog daemon — without losing a single log message. Since one
    of the first things the real syslog daemon does is flushing the kernel
    log buffer into log files, all logged messages will sooner or later be
    stored on disk, regardless whether they have been generated during
    early boot, late boot or system runtime. On top of that if the syslog
    daemon terminates or is shut down during runtime, the bridge becomes
    active again and log output is written to kmsg again. The same applies
    when the system goes down. This provides a simple and robust way to
    ensure that no logs will ever be lost again, and logging is
    available from the beginning of boot-up to the end of
    shut-down. Plymouth will most likely adopt a similar scheme for initrd
    logging, thus ensuring that everything ever logged on the system will
    properly end up in the log files, whether it comes from the kernel,
    from the initrd, from early-boot, from runtime or shutdown. And if
    syslogd is not around, dmesg will provide you with access to
    the log messages. While this bridge is part of systemd upstream, we’ll
    most likely enable this bridge in Fedora only starting with F15. Also
    note that embedded systems that have no interest in shipping a full
    syslogd solution can simply use this syslog bridge during the entire
    runtime, thus making the kernel log buffer the centralized log
    storage, with all the advantages this offers: zero disk IO at runtime,
    access to serial and netconsole logging, and remote debug access to
    the kernel log buffer.
  • We now install autofs units for many “API” kernel virtual file
    systems by default, such as binfmt_misc or
    hugetlbfs. That means that the file system access is readily
    available, client code no longer has to manually load the respective
    kernel modules, as they are autoloaded on first access of the file
    system. This has many advantages: it is not only faster to set up
    during boot, but also simpler for applications, as they can just
    assume the functionality is available. On top of that permission
    problems for the initialization go away, since manual module loading
    requires root privileges.
  • Many smaller fixes and enhancements, all across the board, which
    if mentioned here would make this blog story another blog
    novel. Suffice to say, we did a lot of polishing to ready systemd for
    F14.
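
The syslog bridge described in the list above boils down to a small
forwarding loop. The following C sketch shows one iteration of such a
loop: it receives a single datagram from one file descriptor and writes
it to another. In a real bridge in_fd would presumably be the /dev/log
socket handed over via socket activation and out_fd an open /dev/kmsg.
That wiring, and the name forward_one_message(), are assumptions for
illustration and are not taken from the systemd sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Receive one datagram from in_fd and write it, newline-terminated,
 * to out_fd. Returns the number of bytes written, or -1 on error. */
ssize_t forward_one_message(int in_fd, int out_fd)
{
        char buf[1024];
        ssize_t n;

        assert(in_fd >= 0 && out_fd >= 0);

        /* Leave room for the newline that may be appended below */
        n = recv(in_fd, buf, sizeof(buf) - 1, 0);
        if (n < 0)
                return -1;

        /* Make sure the record ends in a newline, so that a
         * line-based sink sees exactly one record per message */
        if (n == 0 || buf[n - 1] != '\n')
                buf[n++] = '\n';

        return write(out_fd, buf, (size_t) n);
}
```

Because the function only deals with file descriptors, the same loop
body works whether the sink is the kernel log buffer or, later in boot,
something else entirely; swapping the sink is the caller’s job.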

All in all, systemd is progressing nicely. The features we have
been working on in the last months are without exception features that
do not exist in any of the other init systems available on Linux, and our
feature set was already far ahead of what the older init
implementations provide. And we have quite a bit planned for the
future. So, stay tuned!

Also note that I’ll speak about systemd at LinuxKongress 2010
in Nuremberg, Germany. Later this year I’ll also be speaking
at the Linux Plumbers Conference in Boston, MA. Make sure to drop by if you
want to learn about systemd or discuss exciting new ideas or features
with us.

systemd for Administrators, Part 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-for-admins-1.html

As many of you know, systemd is the new
Fedora init system, starting with F14, and it is also on its way to being adopted in
a number of other distributions (for example, OpenSUSE). For administrators
systemd provides a variety of new features and changes and enhances the
administrative process substantially. This blog story is the first part of a
series of articles I plan to post roughly every week for the next months. In
every post I will try to explain one new feature of systemd. Many of these features
are small and simple, so these stories should be interesting to a broader audience.
However, from time to time we’ll dive a little bit deeper into the great new
features systemd provides you with.

Verifying Bootup

Traditionally, when booting up a Linux system, you see a lot of
little messages passing by on your screen. As we work on speeding up
and parallelizing the boot process these messages are visible
for a shorter and shorter time and become less and less readable,
if they are shown at all, given we use graphical boot splash
technology like Plymouth these days. Nonetheless the information of
the boot screens was and still is very relevant, because it shows you
for each service that is being started as part of bootup whether it
managed to start up successfully or failed (with those green or red
[ OK ] or [ FAILED ] indicators). To improve the
situation for machines that boot up fast and parallelized and to make
this information more nicely available during runtime, we added a
feature to systemd that tracks and remembers for each service whether
it started up successfully, whether it exited with a non-zero exit
code, whether it timed out, or whether it terminated abnormally (by
segfaulting or similar), both during start-up and runtime. By simply
typing systemctl in your shell you can query the state of all
services, both systemd native and SysV/LSB services:

[[email protected]] ~# systemctl
UNIT                                          LOAD   ACTIVE       SUB          JOB             DESCRIPTION
dev-hugepages.automount                       loaded active       running                      Huge Pages File System Automount Point
dev-mqueue.automount                          loaded active       running                      POSIX Message Queue File System Automount Point
proc-sys-fs-binfmt_misc.automount             loaded active       waiting                      Arbitrary Executable File Formats File System Automount Point
sys-kernel-debug.automount                    loaded active       waiting                      Debug File System Automount Point
sys-kernel-security.automount                 loaded active       waiting                      Security File System Automount Point
sys-devices-pc...0000:02:00.0-net-eth0.device loaded active       plugged                      82573L Gigabit Ethernet Controller
[...]
sys-devices-virtual-tty-tty9.device           loaded active       plugged                      /sys/devices/virtual/tty/tty9
-.mount                                       loaded active       mounted                      /
boot.mount                                    loaded active       mounted                      /boot
dev-hugepages.mount                           loaded active       mounted                      Huge Pages File System
dev-mqueue.mount                              loaded active       mounted                      POSIX Message Queue File System
home.mount                                    loaded active       mounted                      /home
proc-sys-fs-binfmt_misc.mount                 loaded active       mounted                      Arbitrary Executable File Formats File System
abrtd.service                                 loaded active       running                      ABRT Automated Bug Reporting Tool
accounts-daemon.service                       loaded active       running                      Accounts Service
acpid.service                                 loaded active       running                      ACPI Event Daemon
atd.service                                   loaded active       running                      Execution Queue Daemon
auditd.service                                loaded active       running                      Security Auditing Service
avahi-daemon.service                          loaded active       running                      Avahi mDNS/DNS-SD Stack
bluetooth.service                             loaded active       running                      Bluetooth Manager
console-kit-daemon.service                    loaded active       running                      Console Manager
cpuspeed.service                              loaded active       exited                       LSB: processor frequency scaling support
crond.service                                 loaded active       running                      Command Scheduler
cups.service                                  loaded active       running                      CUPS Printing Service
dbus.service                                  loaded active       running                      D-Bus System Message Bus
getty@tty2.service                            loaded active       running                      Getty on tty2
getty@tty3.service                            loaded active       running                      Getty on tty3
getty@tty4.service                            loaded active       running                      Getty on tty4
getty@tty5.service                            loaded active       running                      Getty on tty5
getty@tty6.service                            loaded active       running                      Getty on tty6
haldaemon.service                             loaded active       running                      Hardware Manager
[email protected]                            loaded active       running                      sda shock protection daemon
irqbalance.service                            loaded active       running                      LSB: start and stop irqbalance daemon
iscsi.service                                 loaded active       exited                       LSB: Starts and stops login and scanning of iSCSI devices.
iscsid.service                                loaded active       exited                       LSB: Starts and stops login iSCSI daemon.
livesys-late.service                          loaded active       exited                       LSB: Late init script for live image.
livesys.service                               loaded active       exited                       LSB: Init script for live image.
lvm2-monitor.service                          loaded active       exited                       LSB: Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
mdmonitor.service                             loaded active       running                      LSB: Start and stop the MD software RAID monitor
modem-manager.service                         loaded active       running                      Modem Manager
netfs.service                                 loaded active       exited                       LSB: Mount and unmount network filesystems.
NetworkManager.service                        loaded active       running                      Network Manager
ntpd.service                                  loaded maintenance  maintenance                  Network Time Service
polkitd.service                               loaded active       running                      Policy Manager
prefdm.service                                loaded active       running                      Display Manager
rc-local.service                              loaded active       exited                       /etc/rc.local Compatibility
rpcbind.service                               loaded active       running                      RPC Portmapper Service
rsyslog.service                               loaded active       running                      System Logging Service
rtkit-daemon.service                          loaded active       running                      RealtimeKit Scheduling Policy Service
sendmail.service                              loaded active       running                      LSB: start and stop sendmail
[email protected]:22-172.31.0.4:36368.service  loaded active       running                      SSH Per-Connection Server
sysinit.service                               loaded active       running                      System Initialization
systemd-logger.service                        loaded active       running                      systemd Logging Daemon
udev-post.service                             loaded active       exited                       LSB: Moves the generated persistent udev rules to /etc/udev/rules.d
udisks.service                                loaded active       running                      Disk Manager
upowerd.service                               loaded active       running                      Power Manager
wpa_supplicant.service                        loaded active       running                      Wi-Fi Security Service
avahi-daemon.socket                           loaded active       listening                    Avahi mDNS/DNS-SD Stack Activation Socket
cups.socket                                   loaded active       listening                    CUPS Printing Service Sockets
dbus.socket                                   loaded active       running                      dbus.socket
rpcbind.socket                                loaded active       listening                    RPC Portmapper Socket
sshd.socket                                   loaded active       listening                    sshd.socket
systemd-initctl.socket                        loaded active       listening                    systemd /dev/initctl Compatibility Socket
systemd-logger.socket                         loaded active       running                      systemd Logging Socket
systemd-shutdownd.socket                      loaded active       listening                    systemd Delayed Shutdown Socket
dev-disk-by\x1...x1db22a\x1d870f1adf2732.swap loaded active       active                       /dev/disk/by-uuid/fd626ef7-34a4-4958-b22a-870f1adf2732
basic.target                                  loaded active       active                       Basic System
bluetooth.target                              loaded active       active                       Bluetooth
dbus.target                                   loaded active       active                       D-Bus
getty.target                                  loaded active       active                       Login Prompts
graphical.target                              loaded active       active                       Graphical Interface
local-fs.target                               loaded active       active                       Local File Systems
multi-user.target                             loaded active       active                       Multi-User
network.target                                loaded active       active                       Network
remote-fs.target                              loaded active       active                       Remote File Systems
sockets.target                                loaded active       active                       Sockets
swap.target                                   loaded active       active                       Swap
sysinit.target                                loaded active       active                       System Initialization

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
JOB    = Pending job for the unit.

221 units listed. Pass --all to see inactive units, too.
[[email protected]] ~#

(I have shortened the output above a little, and removed a few lines not relevant for this blog post.)

Look at the ACTIVE column, which shows you the high-level state of
a service (or in fact of any kind of unit systemd maintains, which can
be more than just services, but we’ll have a look at this in a later
blog post), whether it is active (i.e. running),
inactive (i.e. not running) or in any other state. If you look
closely you’ll see one item in the list that is marked maintenance
and highlighted in red. This informs you about a service that failed
to run or otherwise encountered a problem. In this case this is
ntpd. Now, let’s find out what actually
happened to ntpd, with the systemctl status
command:

[[email protected]] ~# systemctl status ntpd.service
ntpd.service - Network Time Service
	  Loaded: loaded (/etc/systemd/system/ntpd.service)
	  Active: maintenance
	    Main: 953 (code=exited, status=255)
	  CGroup: name=systemd:/systemd-1/ntpd.service
[[email protected]] ~#

This shows us that NTP terminated during runtime (when it ran as
PID 953), and tells us exactly the error condition: the process exited
with an exit status of 255.

In a later systemd version, we plan to hook this up to ABRT, as soon as
this enhancement request is fixed. Then, if systemctl status
shows you information about a service that crashed, it will
direct you right away to the appropriate crash dump in ABRT.

Summary: use systemctl and systemctl status
as modern, more complete replacements for the traditional
boot-up status messages of SysV services. systemctl status
not only captures the error condition in more detail but also shows
runtime errors in addition to start-up errors.
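
If you would rather check the listing from a program than by eye, the columns are easy to pick apart. What follows is only a minimal sketch in C, assuming the whitespace-separated UNIT / LOAD / ACTIVE / SUB layout of the listing shown above; the function name unit_has_state() is made up for this illustration and is not part of systemd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look at one line of the unit listing and compare its ACTIVE column
 * with `state` (for example "maintenance").
 *
 * Returns  1 if the line describes a unit in that state,
 *          0 if it describes a unit in some other state,
 *         -1 if the line is not a unit line (header, legend, footer).
 * On 0 or 1 the unit name is copied into `unit`.
 */
int unit_has_state(const char *line, const char *state,
                   char *unit, size_t unit_len) {
    char name[256], load[32], active[32], sub[32];

    assert(line != NULL && state != NULL && unit != NULL);

    /* The first four columns never contain spaces. */
    if (sscanf(line, "%255s %31s %31s %31s", name, load, active, sub) != 4)
        return -1;

    /* Unit names carry a type suffix such as ".service"; the header,
     * the legend and the "N units listed" footer do not. */
    if (strchr(name, '.') == NULL)
        return -1;

    if (strlen(name) >= unit_len)
        return -1;
    snprintf(unit, unit_len, "%s", name);

    return strcmp(active, state) == 0;
}
```

Fed the ntpd.service line from the listing above together with the state "maintenance", the function reports a match, while the crond.service line does not match. Newer systemd versions may name states differently, so the state string is left to the caller.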

That’s it for this week. Make sure to come back next week for the
next posting about systemd for administrators!

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, but only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket to it during
exec(). That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
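
To make the two steps a bit more concrete, here is a minimal sketch in C of creating a listening AF_UNIX socket first and only then starting the daemon with the socket already open. The function names listen_unix() and spawn_with_socket() are invented for this illustration, and handing the socket over as file descriptor 3 is simply an assumption made here; the text above does not say how the daemon learns which descriptor to use.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Step 1: create a listening AF_UNIX stream socket bound to `path`.
 * Returns the file descriptor, or -1 on error. */
int listen_unix(const char *path) {
    struct sockaddr_un sa;
    int fd;

    assert(path != NULL);
    if (strlen(path) >= sizeof(sa.sun_path))
        return -1;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sun_family = AF_UNIX;
    strcpy(sa.sun_path, path);

    unlink(path); /* remove a stale socket node from an earlier run */
    if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Step 2: start the daemon. The listening socket survives exec() and
 * shows up in the new program as descriptor 3 (an assumed convention).
 * Returns the child's PID, or -1 if fork() failed. */
pid_t spawn_with_socket(int listen_fd, char *const argv[]) {
    pid_t pid = fork();

    if (pid == 0) {
        if (listen_fd != 3) {
            dup2(listen_fd, 3); /* the duplicate is not close-on-exec */
            close(listen_fd);
        }
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec() failed */
    }
    return pid;
}
```

An init system built along these lines would call listen_unix() for every service first and spawn_with_socket() for all of them afterwards. Clients can connect as soon as the first step is done, no matter how long the daemons take to come up.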

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself has finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus has managed to catch up and process it.
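
The queuing behaviour is easy to try out. The following is a small sketch in C, not code from any init system: a client connects to a listening AF_UNIX stream socket and writes a message before anybody has called accept(), and only afterwards does the same process play the late daemon and pick the message up. The function name send_before_accept() is made up for the example, and a stream socket is used for simplicity even though /dev/log is traditionally a datagram socket.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Send `msg` to a listening socket before the server side accepts the
 * connection, then accept and read it. Returns the number of bytes the
 * server side received (the text is stored NUL-terminated in `out`),
 * or -1 on error.
 */
ssize_t send_before_accept(const char *path, const char *msg,
                           char *out, size_t out_len) {
    struct sockaddr_un sa;
    int lfd = -1, cfd = -1, sfd = -1;
    ssize_t n = -1;

    assert(path != NULL && msg != NULL && out != NULL && out_len > 0);
    if (strlen(path) >= sizeof(sa.sun_path))
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sun_family = AF_UNIX;
    strcpy(sa.sun_path, path);

    /* The socket exists and listens, but no daemon is serving it yet. */
    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    unlink(path);
    if (bind(lfd, (struct sockaddr *) &sa, sizeof(sa)) < 0 ||
        listen(lfd, 16) < 0)
        goto finish;

    /* The client connects and writes right away; the kernel queues
     * both the connection and the data. */
    cfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *) &sa, sizeof(sa)) < 0)
        goto finish;
    if (write(cfd, msg, strlen(msg)) != (ssize_t) strlen(msg))
        goto finish;

    /* The "daemon" finally catches up and dequeues the request. */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto finish;
    n = read(sfd, out, out_len - 1);
    if (n >= 0)
        out[n] = '\0';

finish:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    close(lfd);
    unlink(path);
    return n;
}
```

Neither connect() nor write() has to wait for the server here, as long as the listen backlog and the socket buffer have room. A client would block only once it waits for a reply.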

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening on the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this video
where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi, then via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process
of current distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system has finished fsck and quota
in the course of normal boot-up we replace it with the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.
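
As a small illustration of how little is involved in most of these string operations, here is a sketch in C of pulling one delimiter-separated field out of a line, roughly what a script would otherwise do by piping the line into cut. The helper name get_field() is made up for this example; it does not try to cover every corner case of cut.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the n-th field (counting from 1) of `line`, split on `delim`,
 * into `out`. No process is spawned and no library is searched for.
 *
 * Returns 0 on success, -1 if the field does not exist or does not fit.
 */
int get_field(const char *line, char delim, unsigned n,
              char *out, size_t out_len) {
    const char *start = line, *end;
    size_t len;

    assert(line != NULL && out != NULL);
    if (n == 0)
        return -1;

    /* Skip the n-1 fields in front of the one we want. */
    while (--n > 0) {
        start = strchr(start, delim);
        if (start == NULL)
            return -1;
        start++;
    }

    /* The field ends at the next delimiter or at the end of the line. */
    end = strchr(start, delim);
    len = end ? (size_t) (end - start) : strlen(start);
    if (len >= out_len)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Calling get_field() on a passwd-style line with ':' as delimiter and n set to 7 yields the login shell, all within the calling process.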

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.
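
The very core of such babysitting is small. Below is a minimal sketch in C, under the assumption that the babysitter is the direct parent of the service process: it starts a process, waits for it to end and records how it ended. The names spawn_service() and wait_and_describe() are invented for this example, and the output strings are only loosely modelled on the code=exited, status=255 notation used by systemctl status.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Start argv[0] as a child process. Returns its PID, or -1 on error. */
pid_t spawn_service(char *const argv[]) {
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec() failed */
    }
    return pid;
}

/*
 * Wait until the child `pid` terminates and describe in `out` how it
 * ended: with an exit code, or killed by a signal.
 * Returns 0 on success, -1 if waiting failed.
 */
int wait_and_describe(pid_t pid, char *out, size_t out_len) {
    int status;

    assert(out != NULL && out_len > 0);

    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;

    if (WIFEXITED(status))
        snprintf(out, out_len, "code=exited, status=%d",
                 WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(out, out_len, "code=killed, signal=%s",
                 strsignal(WTERMSIG(status)));
    else
        snprintf(out, out_len, "code=unknown");
    return 0;
}
```

A real babysitter would keep this record around for the administrator and decide, based on it, whether to call spawn_service() again.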

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.
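
For those who have not seen double-forking at work, here is a short sketch in C. The function parent_after_double_fork() is made up for the illustration: it forks twice, lets the intermediate process exit at once, and has the grandchild report over a pipe which process it ends up with as its parent.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Double-fork. Stores the PID of the intermediate process in
 * `*middle_out` and returns the PID the grandchild sees as its parent
 * after the intermediate process has exited, or -1 on error.
 */
pid_t parent_after_double_fork(pid_t *middle_out) {
    int fds[2];
    pid_t child, result = -1;

    assert(middle_out != NULL);
    if (pipe(fds) < 0)
        return -1;

    child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        pid_t middle = getpid();
        pid_t grandchild = fork();

        if (grandchild == 0) {
            pid_t ppid;
            int tries = 0;

            close(fds[0]);
            /* Wait (at most about a second) until the intermediate
             * process is gone and we have been re-parented. */
            while ((ppid = getppid()) == middle && tries++ < 1000)
                usleep(1000);
            if (write(fds[1], &ppid, sizeof(ppid)) < 0)
                _exit(1);
            _exit(0);
        }
        _exit(grandchild < 0 ? 1 : 0); /* the intermediate process leaves */
    }

    close(fds[1]);
    *middle_out = child;
    waitpid(child, NULL, 0); /* the only process we ever get to reap */
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -1;
    close(fds[0]);
    return result;
}
```

The returned PID is not that of the intermediate process. On a typical system it is PID 1 rather than the process that started it all, so the original process gets no SIGCHLD for the grandchild and cannot waitpid() on it.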

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known Control Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups were introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
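
Reading that file from a program takes only a few lines. Here is a minimal sketch in C, assuming the usual line format of hierarchy ID, controller list and cgroup path separated by colons; the function names parse_cgroup_line() and print_cgroups() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split one line of /proc/$PID/cgroup, which has the form
 * "hierarchy-ID:controller-list:path", into its last two parts.
 * Returns 0 on success, -1 if the line is malformed or does not fit.
 */
int parse_cgroup_line(const char *line,
                      char *controllers, size_t clen,
                      char *path, size_t plen) {
    const char *c1, *c2;
    size_t n;

    assert(line != NULL && controllers != NULL && path != NULL);

    c1 = strchr(line, ':');
    c2 = c1 ? strchr(c1 + 1, ':') : NULL;
    if (c2 == NULL)
        return -1;

    /* Everything between the first and second colon. */
    n = (size_t) (c2 - (c1 + 1));
    if (n >= clen)
        return -1;
    memcpy(controllers, c1 + 1, n);
    controllers[n] = '\0';

    /* Everything after the second colon, minus the newline. */
    n = strcspn(c2 + 1, "\n");
    if (n >= plen)
        return -1;
    memcpy(path, c2 + 1, n);
    path[n] = '\0';
    return 0;
}

/* Print the cgroup memberships of process `pid`.
 * Returns the number of lines printed, or -1 if the file is missing. */
int print_cgroups(long pid) {
    char file[64], line[4096], ctl[256], path[4096];
    int count = 0;
    FILE *f;

    snprintf(file, sizeof(file), "/proc/%ld/cgroup", pid);
    f = fopen(file, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof(line), f))
        if (parse_cgroup_line(line, ctl, sizeof(ctl),
                              path, sizeof(path)) == 0) {
            printf("%s -> %s\n", ctl[0] ? ctl : "(none)", path);
            count++;
        }

    fclose(f);
    return count;
}
```

A babysitter can map any PID it comes across back to a service this way, as long as it placed each service in a cgroup of its own when starting it.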

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.
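A rough sketch of what that looks like in code follows. glibc does not ship a wrapper for ioprio_set(), so the call goes through syscall(); the three constants are copied from the kernel's ioprio header, and the helper names ioprio_value() and set_idle_io_priority() are assumptions made for this example.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides neither a wrapper nor constants for ioprio_set();
 * these values mirror the kernel's include/linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_IDLE  3
#define IOPRIO_WHO_PROCESS 1

/* Pack a scheduling class and its class data into one ioprio value.
 * (Hypothetical helper name.) */
int ioprio_value(int io_class, int data) {
        return (io_class << IOPRIO_CLASS_SHIFT) | data;
}

/* Move the calling process into the idle IO class.
 * Returns 0 on success, -1 with errno set on failure. */
int set_idle_io_priority(void) {
        return (int) syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                             ioprio_value(IOPRIO_CLASS_IDLE, 0));
}
```

A babysitter would typically call something like set_idle_io_priority() in the child between fork() and exec(), so that the setting applies to the spawned job and not to the supervisor itself.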

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and it provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.
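To illustrate the difference, here is a small sketch of the dependency-driven direction: given a requested unit, collect only what it transitively requires. The struct unit layout and the functions find_unit() and collect_needed() are invented for this example and say nothing about how any real init system stores its units.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 4

/* A hypothetical, much simplified unit: a name plus the names of the
 * units it requires. */
struct unit {
        const char *name;
        const char *deps[MAX_DEPS];   /* unused slots are NULL */
};

const struct unit *find_unit(const struct unit *units, size_t n, const char *name) {
        for (size_t i = 0; i < n; i++)
                if (strcmp(units[i].name, name) == 0)
                        return &units[i];
        return NULL;
}

/* Mark the requested unit and everything it transitively requires.
 * needed[i] is set to 1 for each unit that has to be started; all other
 * entries are left untouched. */
void collect_needed(const struct unit *units, size_t n, const char *name, int *needed) {
        const struct unit *u = find_unit(units, n, name);

        if (!u || needed[u - units])
                return;               /* unknown unit, or already visited */
        needed[u - units] = 1;
        for (size_t i = 0; i < MAX_DEPS && u->deps[i]; i++)
                collect_needed(units, n, u->deps[i], needed);
}
```

With D-Bus requiring syslog and NetworkManager requiring D-Bus, asking for D-Bus marks syslog and D-Bus only. NetworkManager is pulled in only when somebody actually asks for it.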

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little substantially more than Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
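The prefix rule for mounts is simple enough to sketch in a few lines. The function name mount_is_below() is made up for this illustration, and the sketch assumes both paths are absolute and already normalized (no trailing slashes, no ".." components).

```c
#include <stdbool.h>
#include <string.h>

/* Return true if mount point `parent` is a proper path prefix of `child`,
 * i.e. `child` lives somewhere below `parent`. Both paths are assumed to
 * be absolute and normalized. (Hypothetical helper, for illustration.) */
bool mount_is_below(const char *parent, const char *child) {
        size_t n = strlen(parent);

        if (strcmp(parent, "/") == 0)           /* everything but "/" is below "/" */
                return child[0] == '/' && child[1] != '\0';
        if (strncmp(parent, child, n) != 0)
                return false;
        return child[n] == '/';                 /* "/home" must not match "/homework" */
}
```

A mount for /home/lennart would get a dependency on every other mount for which mount_is_below(other, "/home/lennart") holds, which in the example from the text are /home and /.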

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset on
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does not much more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty2.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service dhcpcd@eth0.service pulls in
    avahi-autoipd@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the order in which units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    removes non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways how such a service can
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness to init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.
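The consistency check from the transaction item above boils down to finding cycles in the After/Before ordering graph. The following is only a sketch of that one step, using a plain depth-first search over a small adjacency matrix. The struct transaction layout and all names are invented here, and the part where non-essential jobs are dropped to break a loop is left out.

```c
#include <stdbool.h>

#define MAX_JOBS 8

/* Hypothetical transaction: after[a][b] is true if job a is ordered
 * after job b. */
struct transaction {
        int n_jobs;
        bool after[MAX_JOBS][MAX_JOBS];
};

enum { UNSEEN, IN_PROGRESS, DONE };

/* Depth-first walk along the ordering edges starting at `job`. */
static bool visit(const struct transaction *t, int job, int *state) {
        if (state[job] == IN_PROGRESS)
                return true;            /* walked back onto our own path: a cycle */
        if (state[job] == DONE)
                return false;           /* already known to be cycle-free */
        state[job] = IN_PROGRESS;
        for (int other = 0; other < t->n_jobs; other++)
                if (t->after[job][other] && visit(t, other, state))
                        return true;
        state[job] = DONE;
        return false;
}

/* Return true if the ordering constraints of the transaction contain a cycle. */
bool transaction_has_ordering_cycle(const struct transaction *t) {
        int state[MAX_JOBS] = { UNSEEN };

        for (int job = 0; job < t->n_jobs; job++)
                if (visit(t, job, state))
                        return true;
        return false;
}
```

If this returns true, the next step would be to look for a non-essential job on the offending path and remove it, then check again.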

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services yet. Thankfully most distributions don’t
carry too many native Upstart services yet.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every Monday at 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem sets of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, already, you can run systemd just fine
as a normal user, and it will detect that it is run that way.
Support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here’s
a tarball with unit configuration files that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files
  • Grab a name on the bus
  • You may rely on systemd for logging, you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above.
  • Use SIGTERM for requesting shut downs from your daemon.
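The $LISTEN_FDS and $LISTEN_PID handling mentioned in the list can be sketched as follows. This is not the reference implementation, just an illustration of the protocol as described in this post; the names listen_fds_from_env() and describe_listen_fds() and the upper bound of 1024 descriptors are assumptions made here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define LISTEN_FDS_START 3   /* the first passed descriptor, as described above */

/* Return how many sockets were passed to this process, or 0 if none were
 * (or if the variables were meant for a different process). */
int listen_fds_from_env(void) {
        const char *pid_str = getenv("LISTEN_PID");
        const char *fds_str = getenv("LISTEN_FDS");
        char *end;

        if (!pid_str || !fds_str)
                return 0;

        /* LISTEN_PID must name us, otherwise we merely inherited the variables. */
        long pid = strtol(pid_str, &end, 10);
        if (*pid_str == '\0' || *end != '\0' || pid != (long) getpid())
                return 0;

        /* The upper bound is an arbitrary sanity limit chosen for this sketch. */
        long n = strtol(fds_str, &end, 10);
        if (*fds_str == '\0' || *end != '\0' || n <= 0 || n > 1024)
                return 0;

        return (int) n;
}

/* Example use: list the descriptors a daemon would start serving on. */
void describe_listen_fds(void) {
        int n = listen_fds_from_env();

        for (int fd = LISTEN_FDS_START; fd < LISTEN_FDS_START + n; fd++)
                printf("inherited socket on fd %d\n", fd);
}
```

A daemon that gets 0 back simply falls back to creating its sockets itself, which is what keeps it usable on systems that do not pass any.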

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd supports daemons not written in this style
perfectly as well, already for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so much the same as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), so a port to other operating systems does not appear to
us to make a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and working with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
in writing documentation, or contributing a logo.

Community

At this time we only have a source code repository
and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Introducing nss-myhostname

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/nss-myhostname.html

I am doing a lot of embedded Linux work lately. The machines we use configure their hostname depending on some external configuration options. They boot from a CF card, which is mostly mounted read-only. Since the hostname changes often but we wanted to use sudo we had a problem: sudo requires the local host name to be resolvable using gethostbyname(). On Debian this is usually done by patching /etc/hosts correctly. Unfortunately that file resides on a read-only partition. Instead of hacking some ugly symlink based solution I decided to fix it the right way and wrote a tiny NSS module which does nothing more than mapping the hostname to the IP address 127.0.0.2 (and back). (That IP address is on the loopback device, but is not identical to localhost.)
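
To make the problem a bit more concrete, here is a small sketch of the kind of lookup that has to succeed, plus a check showing that 127.0.0.2 sits in the loopback range just like 127.0.0.1 does. The function names are invented for this illustration; this is not code from nss-myhostname or sudo, only an approximation of the check described above.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the dotted-quad address lies in
 * 127.0.0.0/8 (the loopback range), 0 if it does not, and -1 if the
 * string is not a valid IPv4 address. */
int is_loopback_ipv4(const char *dotted) {
        struct in_addr a;

        if (inet_pton(AF_INET, dotted, &a) != 1)
                return -1;

        /* s_addr is in network byte order; the top octet decides. */
        return (ntohl(a.s_addr) >> 24) == 127 ? 1 : 0;
}

/* Hypothetical helper: returns 1 if the local host name resolves via
 * gethostbyname(), 0 if it does not, and -1 if the host name could
 * not be determined. The lookup goes through whatever NSS modules
 * are configured on the system. */
int local_hostname_resolves(void) {
        char name[256];

        if (gethostname(name, sizeof(name)) != 0)
                return -1;

        /* gethostname() may not NUL-terminate on truncation. */
        name[sizeof(name) - 1] = '\0';

        return gethostbyname(name) != NULL ? 1 : 0;
}
```

On a machine where neither /etc/hosts nor DNS knows the current hostname, local_hostname_resolves() would be expected to return 0; with an NSS module like the one described here in place, it should return 1.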

Get nss-myhostname while it is hot!

BTW: This tool I wrote is pretty useful on embedded machines too, and certainly easier to use than setterm -dump 1 -file /dev/stdout | fold -w 80. And it does color too. And looping. And is much cooler anyway.