Tag Archives: NetworkManager

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/731678/rss

Security updates have been issued by Debian (extplorer and libraw), Fedora (mingw-libsoup, python-tablib, ruby, and subversion), Mageia (avidemux, clamav, nasm, php-pear-CAS, and shutter), Oracle (xmlsec1), Red Hat (openssl and tomcat), Scientific Linux (authconfig, bash, curl, evince, firefox, freeradius, gdm and gnome-session, ghostscript, git, glibc, gnutls, groovy, GStreamer, gtk-vnc, httpd, java-1.7.0-openjdk, kernel, libreoffice, libsoup, libtasn1, log4j, mariadb, mercurial, NetworkManager, openldap, openssh, pidgin, pki-core, postgresql, python, qemu-kvm, samba, spice, subversion, tcpdump, tigervnc and fltk, tomcat, X.org, and xmlsec1), SUSE (git), and Ubuntu (augeas, cvs, and texlive-base).

Security updates for Thursday

Post Syndicated from corbet original https://lwn.net/Articles/730474/rss

Security updates have been issued by Debian (firefox-esr), Fedora (cacti, community-mysql, and pspp), Mageia (varnish), openSUSE (mariadb, nasm, pspp, and rubygem-rubyzip), Oracle (evince, freeradius, golang, java-1.7.0-openjdk, log4j, NetworkManager and libnl3, pki-core, qemu-kvm, and X.org), Red Hat (flash-plugin), and Slackware (curl and mozilla).

Security updates for Tuesday

Post Syndicated from ris original https://lwn.net/Articles/729456/rss

Security updates have been issued by Debian (freerdp and ghostscript), Fedora (freerdp, jackson-databind, moodle, remmina, and runc), Red Hat (authconfig, devtoolset-4-jackson-databind, gnutls, libreoffice, NetworkManager and libnl3, pki-core, rh-eclipse46-jackson-databind, samba, and tcpdump), and Ubuntu (apache2, bash, imagemagick, openjdk-8, and rabbitmq-server).

Security advisories for Thursday

Post Syndicated from jake original http://lwn.net/Articles/709336/rss

Debian has updated game-music-emu
(code execution).

Fedora has updated tomcat (F25; F24; F23: three vulnerabilities).

openSUSE has updated flash-player
(13.2: multiple vulnerabilities), gstreamer-plugins-bad (42.1,
13.2: two code execution flaws), and python-Twisted (42.1: HTTP proxy redirect).

Oracle has updated firefox (OL7; OL6; OL5: multiple vulnerabilities).

Scientific Linux has updated 389-ds-base (SL7: three vulnerabilities), bind (SL7: denial of service), curl (SL7: three vulnerabilities), dhcp (SL7: denial of service), expat (SL7&6: code execution), firefox (multiple vulnerabilities), firefox (code execution), firewalld (SL7: authentication bypass), fontconfig (SL7: privilege escalation), gimp (SL7: code execution), glibc (SL7: code execution), ipsilon (SL7: information leak/denial of service), kernel (SL7: multiple vulnerabilities, some from 2015, one from 2013), krb5 (SL7: two vulnerabilities), libguestfs and virt-p2v (SL7: information leak
from 2015), libreoffice (SL7: two vulnerabilities), libreswan (SL7: denial of service), libvirt (SL7: three vulnerabilities, two from 2015), mariadb (SL7: multiple vulnerabilities), memcached (SL7: three vulnerabilities), mod_nss (SL7: encryption botch), nettle (SL7: multiple vulnerabilities, three from 2015), NetworkManager (SL7: information leak), ntp (SL7: multiple vulnerabilities from 2014 and 2015), openafs (information leak), openssh (SL7: privilege escalation from 2015),
pacemaker (SL7: denial of service), pacemaker (SL7: privilege escalation), pcs (SL7: two vulnerabilities), php
(SL7: multiple vulnerabilities), poppler (SL7: code execution
from 2015), postgresql (SL7: two vulnerabilities), python (SL7: code execution), qemu-kvm (SL7: two vulnerabilities), resteasy-base (SL7: code execution), squid (SL7: multiple vulnerabilities), sudo (SL7&6: two vulnerabilities), sudo (SL7: information disclosure), systemd (SL7: denial of service), thunderbird (code execution), thunderbird (code execution), tomcat (SL7: multiple vulnerabilities, one from 2015), util-linux (SL7: denial of service), and wget (SL7: code execution).

SUSE has updated xen (SLE12: multiple vulnerabilities).

Ubuntu has updated apport (three vulnerabilities).

Security updates for Friday

Post Syndicated from jake original http://lwn.net/Articles/706264/rss

Debian has updated pillow (two vulnerabilities).

Fedora has updated jasper (F23:
multiple vulnerabilities), kdepimlibs (F23: three
vulnerabilities), libXi (F23: two
vulnerabilities), and xen (F23: multiple vulnerabilities).

Mageia has updated freeimage (two
vulnerabilities, one from 2015).

openSUSE has updated curl (42.1:
multiple vulnerabilities), flash-player (13.2: multiple vulnerabilities), gd (42.1: three vulnerabilities), ImageMagick (42.1: multiple vulnerabilities, some from 2014 and
2015), and mysql-community-server (42.1,
13.2: multiple vulnerabilities, many unspecified).

Oracle has updated 389-ds-base
(OL7: unspecified), bind (OL7: denial of
service), curl (OL7: TLS botch), dhcp (OL7: unspecified), firewalld (OL7: authentication bypass), fontconfig (OL7: privilege escalation), gimp (OL7: code execution), glibc (OL7: code execution), java-1.7.0-openjdk (OL7: unspecified), kernel (OL7: multiple vulnerabilities, some from 2013 and 2015), krb5 (OL7: two vulnerabilities), libgcrypt (OL7: bad random numbers), libguestfs (OL7: information leak from 2015), libreoffice (OL7: code execution), libreswan (OL7: denial of service), libvirt (OL7: three vulnerabilities, two from 2015), mariadb (OL7: privilege escalation), mod_nss (OL7: cipher choosing botch), nettle (OL7: multiple vulnerabilities, three from 2015), NetworkManager (OL7: information leak), ntp (OL7: multiple vulnerabilities from 2015), openssh (OL7: privilege escalation from 2015), php (OL7: multiple vulnerabilities), poppler (OL7: code execution from 2015), postgresql (OL7: two vulnerabilities), python (OL7: code execution), qemu-kvm (OL7: two vulnerabilities), resteasy-base (OL7: code execution), squid (OL7: multiple vulnerabilities), sudo (OL7: information disclosure), systemd (OL7: denial of service), tomcat (OL7: multiple vulnerabilities, three from 2015), util-linux (OL7: denial of service), and wget (OL7: code execution).

Ubuntu has updated kernel (16.10; 16.04:
denial of service), kernel (14.04: multiple vulnerabilities, one
from 2014 and 2015), kernel (12.04: two
vulnerabilities), linux-lts-trusty (12.04:
multiple vulnerabilities, one from 2014 and 2015), linux-lts-xenial (14.04: denial of service),
linux-raspi2 (16.10: denial of service), linux-snapdragon (16.04: denial of service),
and linux-ti-omap4 (12.04: two vulnerabilities).

Thursday’s security updates

Post Syndicated from ris original http://lwn.net/Articles/705557/rss

Arch Linux has updated curl (multiple vulnerabilities), lib32-curl (multiple vulnerabilities), lib32-libcurl-compat (multiple vulnerabilities), lib32-libcurl-gnutls (multiple vulnerabilities), libcurl-compat (multiple vulnerabilities), libcurl-gnutls (multiple vulnerabilities), tar (file overwrite), and tomcat6 (redirect HTTP traffic).

CentOS has updated bind (C6; C5: denial
of service) and bind97 (C5: denial of service).

Debian-LTS has updated bind9 (denial of service), bsdiff (denial of service), qemu (multiple vulnerabilities), spip (multiple vulnerabilities), and xen (information leak/corruption).

Mageia has updated openjpeg2 (multiple vulnerabilities).

openSUSE has updated bash (13.2:
code execution), ghostscript (Leap42.1:
insufficient parameter check), libxml2
(Leap42.1: code execution), and openslp
(Leap42.1: two vulnerabilities).

Oracle has updated bind (OL6; OL5:
denial of service) and bind97 (OL5: denial of service).

Red Hat has updated 389-ds-base
(RHEL7: three vulnerabilities), bind (RHEL7; RHEL5,6: denial of service), bind97 (RHEL5: denial of service), curl (RHEL7: three vulnerabilities), dhcp (RHEL7: denial of service), firewalld (RHEL7: authentication bypass), fontconfig (RHEL7: privilege escalation), gimp (RHEL7: use-after-free), glibc (RHEL7: three vulnerabilities), kernel (RHEL7: multiple vulnerabilities), kernel-rt (RHEL7: multiple vulnerabilities),
krb5 (RHEL7: two vulnerabilities), libguestfs and virt-p2v (RHEL7: information
leak), libreoffice (RHEL7: code execution),
libreswan (RHEL7: denial of service), libvirt (RHEL7: three vulnerabilities), mariadb (RHEL7: multiple vulnerabilities), mod_nss (RHEL7: invalid handling of +CIPHER
operator), nettle (RHEL7: multiple
vulnerabilities), NetworkManager (RHEL7:
information leak), ntp (RHEL7: multiple
vulnerabilities), openssh (RHEL7: privilege
escalation), pacemaker (RHEL7: denial of
service), pacemaker (RHEL7: privilege
escalation), pcs (RHEL7: two
vulnerabilities), php (RHEL7: multiple
vulnerabilities), poppler (RHEL7: code
execution), postgresql (RHEL7: two
vulnerabilities), powerpc-utils-python
(RHEL7: code execution), python (RHEL7:
code execution), qemu-kvm (RHEL7: two
vulnerabilities), resteasy-base (RHEL7:
code execution), squid (RHEL7: multiple
denial of service flaws), subscription-manager (RHEL7: information
disclosure), sudo (RHEL7: information
disclosure), systemd (RHEL7: denial of
service), tomcat (RHEL7: multiple
vulnerabilities), util-linux (RHEL7: denial
of service), and wget (RHEL7: code execution).

SUSE has updated bind (SLES-Pi-12-SP2; SOSC5, SMP2.1, SM2.1, SLE11-SP2,3,4: denial of
service) and curl (SLE11-SP4: multiple vulnerabilities).

Ubuntu has updated memcached
(code execution), nvidia-graphics-drivers-367 (16.04, 14.04,
12.04: privilege escalation), and openjdk-8
(16.10, 16.04: multiple vulnerabilities).

MAC Address Spoofing in NetworkManager 1.4.0

Post Syndicated from ris original http://lwn.net/Articles/698683/rss

We recently pointed to Lubomir Rintel’s coverage of NetworkManager
1.4. Thomas Haller follows up with a more detailed look at the MAC
spoofing capabilities of
NetworkManager. “1.2.0 relies on support from wpa_supplicant to configure a random MAC address. The problem is that it requires API which will only be part of the next major release 2.6 of the supplicant. Such a release does not yet exist to this date and thus virtually nobody is using this feature.

With NetworkManager 1.4.0, changing of the MAC address is done by NetworkManager itself, requiring no support from the supplicant. This allows also for more flexibility to generate “stable” addresses and the “generate-mac-address-mask”. Also, the same options are now available not only for Wi-Fi, but also Ethernet devices.”

Rintel: NetworkManager 1.4: with better privacy and easier to use

Post Syndicated from ris original http://lwn.net/Articles/698287/rss

Lubomir Rintel takes a look at new features in NetworkManager 1.4.
“It is now possible to randomize the MAC address of Ethernet devices
to mitigate possibility of tracking. The users can choose between
different policies; use a completely random address, or just use
different addresses in different networks. For Wi-Fi devices, the same
randomization modes are now supported and does no longer require
support from wpa-supplicant.” Also covered: a newly added API for
configuration snapshots that automatically roll back after a timeout,
configurable IPv6 tokenized interface identifiers, new features in
nmcli, and more. (Thanks to Paul Wise.)

Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/688445/rss

Debian has updated wireshark (multiple vulnerabilities).

Debian-LTS has updated extplorer (cross-site request forgery), graphicsmagick (multiple vulnerabilities), and imagemagick (multiple vulnerabilities).

Fedora has updated cacti (F23; F22: SQL
injection), dosfstools (F23: two
vulnerabilities), libksba (F22: denial of
service), libndp (F23; F22: man-in-the-middle attacks), mingw-openssl (F23: multiple vulnerabilities),
moodle (F23: multiple vulnerabilities), openvpn (F22: multiple vulnerabilities),
pgpdump (F23; F22: denial of service), php-symfony
(F23; F22:
buffer overflow), qemu (F22: multiple
vulnerabilities), rpm (F22: two
vulnerabilities), thunderbird (F23: multiple vulnerabilities), and wordpress (F23; F22: two cross-site scripting vulnerabilities).

Mageia has updated apache-mod_nss (invalid handling of +CIPHER operator), bugzilla (cross-site scripting), jansson (denial of service), libgd (denial of service), libreoffice (code execution), networkmanager (information leak), openvpn (multiple vulnerabilities), p7zip (code execution), php-ZendFramework2 (insecure ciphertexts), and wpa_supplicant (two vulnerabilities).

openSUSE has updated kernel
(Leap42.1: multiple vulnerabilities).

Oracle has updated docker-engine (OL7; OL6:
privilege escalation), kernel 3.8.13 (OL7; OL6:
multiple vulnerabilities), kernel 2.6.39 (OL6; OL5:
multiple vulnerabilities), and kernel 2.6.32 (OL6; OL5: multiple vulnerabilities).

Red Hat has updated kernel
(RHEL6.4: two remote denial of service vulnerabilities).

Scientific Linux has updated libndp (SL7: man-in-the-middle attacks).

Slackware has updated curl (server spoofing).

SUSE has updated firefox
(SLE11-SP4,SP3: multiple vulnerabilities), java-1_6_0-ibm (SOSC5, SMP2.1, SM2.1,
SLES11SP3,SP2: multiple vulnerabilities), and java-1_7_0-ibm (SOSC5, SMP2.1, SM2.1,
SLES11SP3,SP2: multiple vulnerabilities).

Rintel: Network Manager 1.2 is here

Post Syndicated from n8willis original http://lwn.net/Articles/684796/rss

At his blog, Lubomir Rintel highlights some of the changes found in the new 1.2 release of Network Manager, the network-configuration utility suite shipped by many Linux distributions. High on the list are privacy improvements; the post notes that “the identity of a mobile host can also leak via Wi-Fi hardware addresses. A common way to solve this is to use random addresses when scanning for available access points, which is what NetworkManager now does (with a recent enough version of wpa_supplicant). The actual hardware address is used only after the device is associated to an access point.” Network Manager can also now be used to manage tun, tap, macvlan, vxlan and IP tunnel software devices, and can run multiple VPN modules simultaneously. In addition, support for several hardware device classes was split into loadable modules, which will reduce memory overhead.

Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/682382/rss

Arch Linux has updated squid (denial of service).

Debian has updated lhasa (code execution) and srtp (denial of service).

Fedora has updated apache-commons-collections (F23; F22: code
execution), bind (F22: multiple
vulnerabilities), bind99 (F22: multiple
vulnerabilities), and NetworkManager (F23: multiple vulnerabilities).

Gentoo has updated qemu (multiple
vulnerabilities) and xalan (code execution
from 2014).

openSUSE has updated krb5 (13.2: null pointer dereference).

Oracle has updated openssh (OL5:
two vulnerabilities).

Scientific Linux has updated krb5
(SL7: three vulnerabilities) and mariadb
(SL7: multiple vulnerabilities).

Slackware has updated mercurial (three vulnerabilities) and php (multiple vulnerabilities).

The new sd-bus API of systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/the-new-sd-bus-api-of-systemd.html

With the new v221 release of systemd we are declaring the sd-bus API
shipped with systemd stable. sd-bus is our minimal D-Bus IPC C
library, supporting as back-ends both classic socket-based D-Bus and
kdbus. The library has been part of systemd for a while, but has only
been used internally, since we wanted to have the liberty to still
make API changes without affecting external consumers of the
library. However, now we are confident to commit to a stable API for
it, starting with v221.
In this blog story I hope to provide you with a quick overview on
sd-bus, a short reiteration on D-Bus and its concepts, as well as a
few simple examples of how to write D-Bus clients and services with it.
What is D-Bus again?
Let’s start with a quick reminder of what
D-Bus actually is: it’s a
powerful, generic IPC system for Linux and other operating systems. It
knows concepts like buses, objects, interfaces, methods, signals,
properties. It provides you with fine-grained access control, a rich
type system, discoverability, introspection, monitoring, reliable
multicasting, service activation, file descriptor passing, and
more. There are bindings for numerous programming languages that are
used on Linux.
D-Bus has been a core component of Linux systems for more than 10
years. It is certainly the most widely established high-level local
IPC system on Linux. Since systemd’s inception it has been the IPC
system it exposes its interfaces on. And even before systemd, it was
the IPC system Upstart used to expose its interfaces. It is used by
GNOME, by KDE and by a variety of system components.
D-Bus refers to both a specification, and a
reference implementation. The
reference implementation provides both a bus server component, as well
as a client library. While there are multiple other, popular
reimplementations of the client library – for both C and other
programming languages –, the only commonly used server side is the
one from the reference implementation. (However, the kdbus project is
working on providing an alternative to this server implementation as a
kernel component.)
D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However,
the protocol may be used on top of TCP/IP as well. It does not
natively support encryption, hence using D-Bus directly on TCP is
usually not a good idea. It is possible to combine D-Bus with a
transport like ssh in order to secure it. systemd uses this to make
many of its APIs accessible remotely.
A frequently asked question about D-Bus is why it exists at all,
given that AF_UNIX sockets and FIFOs already exist on UNIX and have
been used for a long time successfully. To answer this question let’s
make a comparison with popular web technology of today: what
AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX
sockets/FIFOs only shovel raw bytes between processes, D-Bus defines
actual message encoding and adds concepts like method call
transactions, an object system, security mechanisms, multicasting and
more.
From our 10+ years of experience with D-Bus we know today that while there
are some areas where we can improve things (and we are working on
that, both with kdbus and sd-bus), it generally appears to be a very
well designed system that stood the test of time, aged well, and is
widely established. Today, if we’d sit down and design a completely
new IPC system incorporating all the experience and knowledge we
gained with D-Bus, I am sure the result would be very close to what
D-Bus already is.
Or in short: D-Bus is great. If you hack on a Linux project and need a
local IPC, it should be your first choice. Not only because D-Bus is
well designed, but also because there aren’t many alternatives that
can cover similar functionality.
Where does sd-bus fit in?
Let’s discuss why sd-bus exists, how it compares with the other
existing C D-Bus libraries and why it might be a library to consider
for your project.
For C, there are two established, popular D-Bus libraries: libdbus, as
it is shipped in the reference implementation of D-Bus, as well as
GDBus, a component of GLib, the low-level tool library of GNOME.
Of the two, libdbus is the much older one, as it was written at the
time the specification was put together. The library was written with
a focus on being portable and to be useful as back-end for higher-level
language bindings. Both of these goals required the API to be very
generic, resulting in a relatively baroque, hard-to-use API that lacks
the bits that make it easy and fun to use from C. It provides the
building blocks, but few tools to actually make it straightforward to
build a house from them. On the other hand, the library is suitable
for most use-cases (for example, it is OOM-safe making it suitable for
writing lowest level system software), and is portable to operating
systems like Windows or more exotic UNIXes.
GDBus
is a much newer implementation. It has been written after considerable
experience with using a GLib/GObject wrapper around libdbus. GDBus is
implemented from scratch and shares no code with libdbus. Its design
differs substantially from libdbus: it contains code generators to
make it specifically easy to expose GObject objects on the bus, or to
talk to D-Bus objects as GObject objects. It translates D-Bus data
types to GVariant, which is GLib’s powerful data serialization
format. If you are used to GLib-style programming then you’ll feel
right at home, hacking D-Bus services and clients with it is a lot
simpler than using libdbus.
With sd-bus we now provide a third implementation, sharing no code
with either libdbus or GDBus. For us, the focus was on providing kind
of a middle ground between libdbus and GDBus: a low-level C library
that actually is fun to work with, that has enough syntactic sugar to
make it easy to write clients and services with, but on the other hand
is more low-level than GDBus/GLib/GObject/GVariant. To be able to use
it in systemd’s various system-level components it needed to be
OOM-safe and minimal. Another major point we wanted to focus on was
supporting a kdbus back-end right from the beginning, in addition to
the socket transport of the original D-Bus specification (“dbus1”). In
fact, we wanted to design the library closer to kdbus’ semantics than
to dbus1’s, wherever they are different, but still cover both
transports nicely. In contrast to libdbus or GDBus, portability is not
a priority for sd-bus; instead we try to make the best of the Linux
platform and expose specific Linux concepts wherever that is
beneficial. Finally, performance was also an issue (though a secondary
one): neither libdbus nor GDBus will win any speed records. We wanted
to improve on performance (throughput and latency) — but simplicity
and correctness are more important to us. We believe the result of our
work delivers our goals quite nicely: the library is fun to use,
supports kdbus and sockets as back-end, is relatively minimal, and the
performance is substantially better than both libdbus and GDBus.
To decide which of the three APIs to use for your C project, here are
short guidelines:

If you hack on a GLib/GObject project, GDBus is definitely your
first choice.

If portability to non-Linux kernels — including Windows, Mac OS and
other UNIXes — is important to you, use either GDBus (which more or
less means buying into GLib/GObject) or libdbus (which requires a
lot of manual work).

Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here, this is all about plain C
only. But do note: if you use Qt, then QtDBus is the D-Bus API of
choice, being a wrapper around libdbus.)
Introduction to D-Bus Concepts
To the uninitiated D-Bus usually appears to be a relatively opaque
technology. It uses lots of concepts that appear unnecessarily complex
and redundant at first sight. But actually, they make a lot of
sense. Let’s have a look:

A bus is where you look for IPC services. There are usually two
kinds of buses: a system bus, of which there’s exactly one per
system, and which is where you’d look for system services; and a
user bus, of which there’s one per user, and which is where you’d
look for user services, like the address book service or the mail
program. (Originally, the user bus was actually a session bus — so
that you get multiple of them if you log in many times as the same
user –, and on most setups it still is, but we are working on
moving things to a true user bus, of which there is only one per
user on a system, regardless how many times that user happens to
log in.)

A service is a program that offers some IPC API on a bus. A
service is identified by a name in reverse domain name
notation. Thus, the org.freedesktop.NetworkManager service on the
system bus is where NetworkManager’s APIs are available and
org.freedesktop.login1 on the system bus is where
systemd-logind’s APIs are exposed.

A client is a program that makes use of some IPC API on a bus. It
talks to a service, monitors it and generally doesn’t provide any
services on its own. That said, lines are blurry and many services
are also clients to other services. Frequently the term peer is
used as a generalization to refer to either a service or a client.

An object path is an identifier for an object on a specific
service. In a way this is comparable to a C pointer, since that’s
how you generally reference a C object, if you hack object-oriented
programs in C. However, C pointers are just memory addresses, and
passing memory addresses around to other processes would make
little sense, since they of course refer to the address space of
the service; the client couldn’t make sense of them. Thus, the D-Bus
designers came up with the object path concept, which is just a
string that looks like a file system path. Example:
/org/freedesktop/login1 is the object path of the ‘manager’
object of the org.freedesktop.login1 service (which, as we
remember from above, is still the service systemd-logind
exposes). Because object paths are structured like file system
paths they can be neatly arranged in a tree, so that you end up
with a veritable tree of objects. For example, you’ll find all user
sessions systemd-logind manages below the
/org/freedesktop/login1/session sub-tree, for example called
/org/freedesktop/login1/session/_7,
/org/freedesktop/login1/session/_55 and so on. How services
precisely label their objects and arrange them in a tree is
completely up to the developers of the services.

Each object that is identified by an object path has one or more
interfaces. An interface is a collection of signals, methods, and
properties (collectively called members), that belong
together. The concept of a D-Bus interface is actually pretty
much identical to what you know from programming languages such as
Java, which also know an interface concept. Which interfaces an
object implements is up to the developers of the service. Interface
names are in reverse domain name notation, much like service
names. (Yes, that’s admittedly confusing, in particular since it’s
pretty common for simpler services to reuse the service name string
also as an interface name.) A couple of interfaces are standardized
though and you’ll find them available on many of the objects
offered by the various services. Specifically, those are
org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer
and org.freedesktop.DBus.Properties.

An interface can contain methods. The word “method” is more or
less just a fancy word for “function”, and is a term used pretty
much the same way in object-oriented languages such as Java. The
most common interaction between D-Bus peers is that one peer
invokes one of these methods on another peer and gets a reply. A
D-Bus method takes a couple of parameters, and returns others. The
parameters are transmitted in a type-safe way, and the type
information is included in the introspection data you can query
from each object. Usually, method names (and the other member
types) follow a CamelCase syntax. For example, systemd-logind
exposes an ActivateSession method on the
org.freedesktop.login1.Manager interface that is available on the
/org/freedesktop/login1 object of the org.freedesktop.login1
service.

A signature describes a set of parameters a function (or signal,
property, see below) takes or returns. It’s a series of characters
that each encode one parameter by its type. The set of types
available is pretty powerful. For example, there are simpler types
like s for string, or u for 32bit integer, but also complex
types such as as for an array of strings or a(sb) for an array
of structures consisting of one string and one boolean each. See
the D-Bus specification
for the full explanation of the type system. The
ActivateSession method mentioned above takes a single string as
parameter (the parameter signature is hence s), and returns
nothing (the return signature is hence the empty string). Of
course, the signature can get a lot more complex, see below for
more examples.

A signal is another member type that the D-Bus object system
knows. Much like a method it has a signature. However, they serve
different purposes. While in a method call a single client issues a
request on a single service, and that service sends back a response
to the client, signals are for general notification of
peers. Services send them out when they want to tell one or more
peers on the bus that something happened or changed. In contrast to
method calls and their replies they are hence usually broadcast
over a bus. While method calls/replies are used for duplex
one-to-one communication, signals are usually used for simplex
one-to-many communication (note however that that’s not a
requirement, they can also be used one-to-one). Example:
systemd-logind broadcasts a SessionNew signal from its manager
object each time a user logs in, and a SessionRemoved signal
every time a user logs out.

A property is the third member type that the D-Bus object system
knows. It’s similar to the property concept known by languages like
C#. Properties also have a signature, and are more or less just
variables that an object exposes, that can be read or altered by
clients. Example: systemd-logind exposes a property Docked of
the signature b (a boolean). It reflects whether systemd-logind
thinks the system is currently in a docking station of some form
(only applies to laptops …).
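
To make the property concept concrete, here is a minimal C sketch (not from the original article) that reads the Docked property just mentioned, using the sd-bus API that the rest of this article introduces. It assumes the property lives on the org.freedesktop.login1.Manager interface of the /org/freedesktop/login1 object, as it does on current systemd-logind:

#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus *bus = NULL;
        int docked = 0, r;

        /* Connect to the system bus, where systemd-logind lives */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect: %s\n", strerror(-r));
                return 1;
        }

        /* Read a property of the trivial signature "b" in a single call */
        r = sd_bus_get_property_trivial(bus,
                        "org.freedesktop.login1",         /* service */
                        "/org/freedesktop/login1",        /* object path */
                        "org.freedesktop.login1.Manager", /* interface */
                        "Docked",                         /* property */
                        &error, 'b', &docked);
        if (r < 0)
                fprintf(stderr, "Failed to read property: %s\n", error.message);
        else
                printf("Docked: %s\n", docked ? "yes" : "no");

        sd_bus_error_free(&error);
        sd_bus_unref(bus);
        return r < 0 ? 1 : 0;
}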

So much for the various concepts D-Bus knows. Of course, all these new
concepts might be overwhelming. Let’s look at them from a different
perspective. I assume many of the readers have an understanding of
today’s web technology, specifically HTTP and REST. Let’s try to
compare the concept of an HTTP request with the concept of a D-Bus
method call:

An HTTP request you issue on a specific network. It could be the
Internet, or it could be your local LAN, or a company
VPN. Depending on which network you issue the request on, you’ll be
able to talk to a different set of servers. This is not unlike the
“bus” concept of D-Bus.

On the network you then pick a specific HTTP server to talk
to. That’s roughly comparable to picking a service on a specific bus.

On the HTTP server you then ask for a specific URL. The “path” part
of the URL (by which I mean everything after the host name of the
server, up to the last “/”) is pretty similar to a D-Bus object path.

The “file” part of the URL (by which I mean everything after the
last slash, following the path, as described above), then defines
the actual call to make. In D-Bus this could be mapped to an
interface and method name.

Finally, the parameters of an HTTP call follow the path after the
“?”, they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit
like comparing apples and oranges. However, I think it’s still useful to
get a bit of a feeling of what maps to what.
From the shell
So much about the concepts and the gray theory behind them. Let’s make
this exciting, let’s actually see how this feels on a real system.
For a while now, systemd has included a tool, busctl, that is useful to
explore and interact with the D-Bus object system. When invoked
without parameters, it will show you a list of all peers connected to
the system bus. (Use --user to see the peers of your user bus
instead):
$ busctl
NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
:1.1 1 systemd root :1.1 - - -
:1.11 705 NetworkManager root :1.11 NetworkManager.service - -
:1.14 744 gdm root :1.14 gdm.service - -
:1.4 708 systemd-logind root :1.4 systemd-logind.service - -
:1.7200 17563 busctl lennart :1.7200 session-1.scope 1 -
[…]
org.freedesktop.NetworkManager 705 NetworkManager root :1.11 NetworkManager.service - -
org.freedesktop.login1 708 systemd-logind root :1.4 systemd-logind.service - -
org.freedesktop.systemd1 1 systemd root :1.1 - - -
org.gnome.DisplayManager 744 gdm root :1.14 gdm.service - -
[…]

(I have shortened the output a bit to keep things brief.)
The list begins with a list of all peers currently connected to the
bus. They are identified by peer names like “:1.11”. These are called
unique names in D-Bus nomenclature. Basically, every peer has a
unique name, and they are assigned automatically when a peer connects
to the bus. They are much like an IP address, if you will. You’ll
notice that a couple of peers are already connected, including our
little busctl tool itself as well as a number of system services. The
list then shows all actual services on the bus, identified by their
service names (as discussed above; to discern them from the unique
names these are also called well-known names). In many ways
well-known names are similar to DNS host names, i.e. they are a
friendlier way to reference a peer, but on the lower level they just
map to an IP address, or in this comparison the unique name. Much like
you can connect to a host on the Internet by either its host name or
its IP address, you can also connect to a bus peer either by its
unique or its well-known name. (Note that each peer can have as many
well-known names as it likes, much like an IP address can have
multiple host names referring to it).
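
Both kinds of names are also directly accessible from C. Here is a hedged preview, using the sd-bus API that the second half of this article introduces: it prints the connection’s automatically assigned unique name, then claims a well-known one (net.example.Demo is a made-up name; note that claiming names on the system bus is subject to bus policy, so trying this with sd_bus_open_user() on the user bus may be easier):

#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus *bus = NULL;
        const char *unique = NULL;
        int r;

        /* Connect to the system bus; a unique name is assigned automatically */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect: %s\n", strerror(-r));
                return 1;
        }

        /* Print our unique name, something like ":1.42" */
        r = sd_bus_get_unique_name(bus, &unique);
        if (r >= 0)
                printf("Unique name: %s\n", unique);

        /* Additionally claim a well-known name (made up for this example) */
        r = sd_bus_request_name(bus, "net.example.Demo", 0);
        if (r < 0)
                fprintf(stderr, "Failed to acquire name: %s\n", strerror(-r));

        sd_bus_unref(bus);
        return 0;
}
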
OK, that’s already kinda cool. Try it for yourself, on your local
machine (all you need is a recent, systemd-based distribution).
Let’s now go the next step. Let’s see which objects the
org.freedesktop.login1 service actually offers:
$ busctl tree org.freedesktop.login1
└─/org/freedesktop/login1
├─/org/freedesktop/login1/seat
│ ├─/org/freedesktop/login1/seat/seat0
│ └─/org/freedesktop/login1/seat/self
├─/org/freedesktop/login1/session
│ ├─/org/freedesktop/login1/session/_31
│ └─/org/freedesktop/login1/session/self
└─/org/freedesktop/login1/user
├─/org/freedesktop/login1/user/_1000
└─/org/freedesktop/login1/user/self

Pretty, isn’t it? What’s even nicer, though the output does not show
it, is that there’s full command line completion
available: as you press TAB the shell will auto-complete the service
names for you. It’s a real pleasure to explore your D-Bus objects that
way!
The output shows some objects that you might recognize from the
explanations above. Now, let’s go further. Let’s see what interfaces,
methods, signals and properties one of these objects actually exposes:
$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME TYPE SIGNATURE RESULT/VALUE FLAGS
org.freedesktop.DBus.Introspectable interface - - -
.Introspect method - s -
org.freedesktop.DBus.Peer interface - - -
.GetMachineId method - s -
.Ping method - - -
org.freedesktop.DBus.Properties interface - - -
.Get method ss v -
.GetAll method s a{sv} -
.Set method ssv - -
.PropertiesChanged signal sa{sv}as - -
org.freedesktop.login1.Session interface - - -
.Activate method - - -
.Kill method si - -
.Lock method - - -
.PauseDeviceComplete method uu - -
.ReleaseControl method - - -
.ReleaseDevice method uu - -
.SetIdleHint method b - -
.TakeControl method b - -
.TakeDevice method uu hb -
.Terminate method - - -
.Unlock method - - -
.Active property b true emits-change
.Audit property u 1 const
.Class property s "user" const
.Desktop property s "" const
.Display property s "" const
.Id property s "1" const
.IdleHint property b true emits-change
.IdleSinceHint property t 1434494624206001 emits-change
.IdleSinceHintMonotonic property t 0 emits-change
.Leader property u 762 const
.Name property s "lennart" const
.Remote property b false const
.RemoteHost property s "" const
.RemoteUser property s "" const
.Scope property s "session-1.scope" const
.Seat property (so) "seat0" "/org/freedesktop/login1/seat… const
.Service property s "gdm-autologin" const
.State property s "active" -
.TTY property s "/dev/tty1" const
.Timestamp property t 1434494630344367 const
.TimestampMonotonic property t 34814579 const
.Type property s "x11" const
.User property (uo) 1000 "/org/freedesktop/login1/user/_1… const
.VTNr property u 1 const
.Lock signal - - -
.PauseDevice signal uus - -
.ResumeDevice signal uuh - -
.Unlock signal - - -

As before, the busctl command supports command line completion, hence
both the service name and the object path used are easily put together
on the shell simply by pressing TAB. The output shows the methods,
properties, signals of one of the session objects that are currently
made available by systemd-logind. There’s a section for each
interface the object knows. The second column tells you what kind of
member is shown in the line. The third column shows the signature of
the member. In the case of method calls that’s the input parameters;
the fourth column shows what is returned. For properties, the fourth
column encodes their current value.
So far, we just explored. Let’s take the next step now: let’s become
active – let’s call a method:
# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don’t think I need to mention this anymore, but anyway: again
there’s full command line completion available. The third argument is
the interface name, the fourth the method name, both can be easily
completed by pressing TAB. In this case we picked the Lock method,
which activates the screen lock for the specific session. And yupp,
the instant I pressed enter on this line my screen lock turned on
(this only works on DEs that correctly hook into systemd-logind;
GNOME works fine, and KDE should work too).
The Lock method call we picked is very simple, as it takes no
parameters and returns none. Of course, it can get more complicated
for some calls. Here’s another example, this time using one of
systemd’s own bus calls, to start an arbitrary system unit:
# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the
signature string that follows the method name (as usual, command line
completion helps you get this right). Following the signature the
next two parameters are simply the two strings to pass. The specified
signature string hence indicates what comes next. systemd’s StartUnit
method call takes the unit name to start as first parameter, and the
mode in which to start it as second. The call returned a single object
path value. It is encoded the same way as the input parameter: a
signature (just o for the object path) followed by the actual value.
Of course, some method call parameters can get a ton more complex, but
with busctl it’s relatively easy to encode them all. See the man page
for details.
busctl knows a number of other operations. For example, you can use
it to monitor D-Bus traffic as it happens (including generating a
.cap file for use with Wireshark!) or you can set or get specific
properties. However, this blog story was supposed to be about sd-bus,
not busctl, hence let’s cut this short here, and let me direct you
to the man page in case you want to know more about the tool.
busctl (like the rest of systemd) is implemented using the sd-bus
API. Thus it exposes many of the features of sd-bus itself. For
example, you can use it to connect to remote or container buses. It
understands both kdbus and classic D-Bus, and more!
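
busctl exposes the remote and container cases via its --host= and
--machine= switches, and the same is available programmatically. A
minimal sketch, assuming the sd_bus_open_system_remote() and
sd_bus_open_system_machine() calls described in the sd-bus man pages
(the host and machine names below are placeholders):

#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus *bus = NULL;
        int r;

        /* System bus of a remote host, tunneled over ssh
         * ("root@somehost" is a placeholder) */
        r = sd_bus_open_system_remote(&bus, "root@somehost");
        if (r < 0) {
                fprintf(stderr, "Failed to connect to remote bus: %s\n", strerror(-r));
                return 1;
        }
        sd_bus_unref(bus);

        /* System bus of a local OS container
         * ("mycontainer" is a placeholder machine name) */
        r = sd_bus_open_system_machine(&bus, "mycontainer");
        if (r < 0) {
                fprintf(stderr, "Failed to connect to container bus: %s\n", strerror(-r));
                return 1;
        }
        sd_bus_unref(bus);
        return 0;
}
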
sd-bus
But enough! Let’s get back on topic, let’s talk about sd-bus itself.
The sd-bus set of APIs is mostly contained in the header file
sd-bus.h.
Here’s a random selection of features of the library that make it
compare well with the other implementations available.

Supports both kdbus and dbus1 as back-end.

Has high-level support for connecting to remote buses via ssh, and
to buses of local OS containers.

Powerful credential model, to implement authentication of clients
in services. Currently 34 individual fields are supported, from the
PID of the client to the cgroup or capability sets.

Support for tracking the life-cycle of peers in order to release
local objects automatically when all peers referencing them have
disconnected.

The client builds an efficient decision tree to determine which
handlers to deliver an incoming bus message to.

Automatically translates D-Bus errors into UNIX style errors and
back (this is lossy though), to ensure best integration of D-Bus
into low-level Linux programs. (A short sketch of this mapping
follows this list.)

Powerful but lightweight object model for exposing local objects on
the bus. Automatically generates introspection as necessary.
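
As a small illustration of the error-translation bullet above, here is
a hedged sketch of the round trip between UNIX errno values and
sd_bus_error (the exact D-Bus error name chosen for a given errno is an
implementation detail):

#include <errno.h>
#include <stdio.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus_error error = SD_BUS_ERROR_NULL;

        /* Fill in an sd_bus_error from a plain errno value ... */
        sd_bus_error_set_errno(&error, ENOENT);
        printf("D-Bus error name: %s\n", error.name);

        /* ... and map it back to an errno (lossy in the general case) */
        printf("errno again: %d\n", sd_bus_error_get_errno(&error));

        sd_bus_error_free(&error);
        return 0;
}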

The API is currently not fully documented, but we are working on
completing the set of manual pages. For details
see all pages starting with sd_bus_.
Invoking a Method, from C, with sd-bus
So much about the library in general. Here’s an example for connecting
to the bus and issuing a method call:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the response message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",         /* service to contact */
                               "/org/freedesktop/systemd1",        /* object path */
                               "org.freedesktop.systemd1.Manager", /* interface name */
                               "StartUnit",                        /* method name */
                               &error,                             /* object to return error in */
                               &m,                                 /* return message on success */
                               "ss",                               /* input signature */
                               "cups.service",                     /* first argument */
                               "replace");                         /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.\n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:
$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to
run it as root though, since access to the StartUnit method is
privileged:
# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that’s it already, our first example. It showed how we invoked a
method call on the bus. The actual function call of the method is very
close to the busctl command line we used before. I hope the code
excerpt needs little further explanation. It’s supposed to give you a
taste of how to write D-Bus clients with sd-bus. For more
information please have a look at the header file, the man page or
even the sd-bus sources.
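
Note that sd_bus_call_method() blocks until the reply arrives. Where
that is undesirable, the same request can be issued asynchronously; here
is a hedged sketch using sd_bus_call_async(), with the reply delivered
to a callback from a dispatch loop like the one in the service example
below. As before, run it as root, since StartUnit is privileged:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

static int on_reply(sd_bus_message *reply, void *userdata, sd_bus_error *ret_error) {
        const char *path;
        int r;

        /* A failed call is delivered as a method-error message */
        if (sd_bus_message_is_method_error(reply, NULL)) {
                fprintf(stderr, "Call failed: %s\n", sd_bus_message_get_error(reply)->message);
                return 0;
        }

        r = sd_bus_message_read(reply, "o", &path);
        if (r >= 0)
                printf("Queued service job as %s.\n", path);
        return 0;
}

int main(void) {
        sd_bus_message *m = NULL;
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        r = sd_bus_open_system(&bus);
        if (r < 0)
                goto finish;

        /* Build the same StartUnit call as in the synchronous example */
        r = sd_bus_message_new_method_call(bus, &m,
                        "org.freedesktop.systemd1",
                        "/org/freedesktop/systemd1",
                        "org.freedesktop.systemd1.Manager",
                        "StartUnit");
        if (r < 0)
                goto finish;

        r = sd_bus_message_append(m, "ss", "cups.service", "replace");
        if (r < 0)
                goto finish;

        /* Fire off the call; on_reply() runs once the answer comes in */
        r = sd_bus_call_async(bus, &slot, m, on_reply, NULL, 0);
        if (r < 0)
                goto finish;

        /* Simple dispatch loop, as in the service example below */
        for (;;) {
                r = sd_bus_process(bus, NULL);
                if (r < 0)
                        goto finish;
                if (r > 0)
                        continue;
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0)
                        goto finish;
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);
        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
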
Implementing a Service, in C, with sd-bus
Of course, just calling a single method is a rather simplistic
example. Let’s have a look at how to write a bus service. We’ll write
a small calculator service, that exposes a single object, which
implements an interface that exposes two methods: one to multiply two
64bit signed integers, and one to divide one 64bit signed integer by
another.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide", "xx", "x", method_divide, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to user bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator", /* object path */
                                     "net.poettering.Calculator",  /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to install object vtable: %s\n", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:
$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let’s run it:
$ ./bus-service

In another terminal, let’s try to talk to it. Note that this service
is now on the user bus, not on the system bus as before. We do this
for simplicity reasons: on the system bus access to services is
tightly controlled so unprivileged clients cannot request privileged
operations. On the user bus however things are simpler: as only
processes of the user owning the bus can connect, no further policy
enforcement will complicate this example. Because the service is on
the user bus, we have to pass the --user switch on the busctl
command line. Let’s start by looking at the service’s object tree.
$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there’s only a single object on the service, which is
not surprising, given that our code above only registered one. Let’s
see the interfaces and the members this object exposes:
$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME TYPE SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator interface - - -
.Divide method xx x -
.Multiply method xx x -
org.freedesktop.DBus.Introspectable interface - - -
.Introspect method - s -
org.freedesktop.DBus.Peer interface - - -
.GetMachineId method - s -
.Ping method - - -
org.freedesktop.DBus.Properties interface - - -
.Get method ss v -
.GetAll method s a{sv} -
.Set method ssv - -
.PropertiesChanged signal sa{sv}as - -

The sd-bus library automatically added a couple of generic interfaces,
as mentioned above. But the first interface we see is actually the one
we added! It shows our two methods, and both take “xx” (two 64bit
signed integers) as input parameters, and return one “x”. Great! But
does it work?
$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually
multiplied them for us and returned a single integer 35! Let’s try the
other method:
$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let’s trick
it into dividing by zero:
$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this nicely and returned a clean error about it. If
you look in the source code example above you’ll see how precisely we
generated the error.
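
Incidentally, the vtable approach extends naturally to the other member
types. Here is a hedged sketch, not part of the original example, of
how a constant read-only "Version" property (a made-up name) could be
added to the calculator’s vtable; once installed, busctl --user
get-property can read it:

/* A getter for a hypothetical read-only "Version" property of signature "s" */
static int property_get_version(sd_bus *bus, const char *path, const char *interface,
                                const char *property, sd_bus_message *reply,
                                void *userdata, sd_bus_error *ret_error) {
        /* Append the property's value to the reply message */
        return sd_bus_message_append(reply, "s", "1.0");
}

/* The vtable from the example above, extended with the new property */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide", "xx", "x", method_divide, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_PROPERTY("Version", "s", property_get_version, 0, SD_BUS_VTABLE_PROPERTY_CONST),
        SD_BUS_VTABLE_END
};
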
And that’s really all I have for today. Of course, the examples I
showed are short, and I don’t get into detail here on what precisely
each line does. However, this is supposed to be a short introduction
into D-Bus and sd-bus, and it’s already way too long for that …
I hope this blog story was useful to you. If you are interested in
using sd-bus for your own programs, I hope this gets you started. If
you have further questions, check the (incomplete) man pages, and
ask us on IRC or the systemd mailing list. If you need more
examples, have a look at the systemd source tree; all of systemd’s
many bus services use sd-bus extensively.

The new sd-bus API of systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/the-new-sd-bus-api-of-systemd.html

With the new v221 release of
systemd

we are declaring the
sd-bus
API shipped with
systemd
stable. sd-bus is our minimal D-Bus
IPC
C library, supporting as
back-ends both classic socket-based D-Bus and
kdbus. The library has been been
part of systemd for a while, but has only been used internally, since
we wanted to have the liberty to still make API changes without
affecting external consumers of the library. However, now we are
confident to commit to a stable API for it, starting with v221.

In this blog story I hope to provide you with a quick overview on
sd-bus, a short reiteration on D-Bus and its concepts, as well as a
few simple examples how to write D-Bus clients and services with it.

What is D-Bus again?

Let’s start with a quick reminder what
D-Bus actually is: it’s a
powerful, generic IPC system for Linux and other operating systems. It
knows concepts like buses, objects, interfaces, methods, signals,
properties. It provides you with fine-grained access control, a rich
type system, discoverability, introspection, monitoring, reliable
multicasting, service activation, file descriptor passing, and
more. There are bindings for numerous programming languages that are
used on Linux.

D-Bus has been a core component of Linux systems since more than 10
years. It is certainly the most widely established high-level local
IPC system on Linux. Since systemd’s inception it has been the IPC
system it exposes its interfaces on. And even before systemd, it was
the IPC system Upstart used to expose its interfaces. It is used by
GNOME, by KDE and by a variety of system components.

D-Bus refers to both a
specification
,
and a reference
implementation
. The
reference implementation provides both a bus server component, as well
as a client library. While there are multiple other, popular
reimplementations of the client library – for both C and other
programming languages –, the only commonly used server side is the
one from the reference implementation. (However, the kdbus project is
working on providing an alternative to this server implementation as a
kernel component.)

D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However,
the protocol may be used on top of TCP/IP as well. It does not
natively support encryption, hence using D-Bus directly on TCP is
usually not a good idea. It is possible to combine D-Bus with a
transport like ssh in order to secure it. systemd uses this to make
many of its APIs accessible remotely.

A frequently asked question about D-Bus is why it exists at all,
given that AF_UNIX sockets and FIFOs already exist on UNIX and have
been used for a long time successfully. To answer this question let’s
make a comparison with popular web technology of today: what
AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX
sockets/FIFOs only shovel raw bytes between processes, D-Bus defines
actual message encoding and adds concepts like method call
transactions, an object system, security mechanisms, multicasting and
more.

From our 10year+ experience with D-Bus we know today that while there
are some areas where we can improve things (and we are working on
that, both with kdbus and sd-bus), it generally appears to be a very
well designed system, that stood the test of time, aged well and is
widely established. Today, if we’d sit down and design a completely
new IPC system incorporating all the experience and knowledge we
gained with D-Bus, I am sure the result would be very close to what
D-Bus already is.

Or in short: D-Bus is great. If you hack on a Linux project and need a
local IPC, it should be your first choice. Not only because D-Bus is
well designed, but also because there aren’t many alternatives that
can cover similar functionality.

Where does sd-bus fit in?

Let’s discuss why sd-bus exists, how it compares with the other
existing C D-Bus libraries and why it might be a library to consider
for your project.

For C, there are two established, popular D-Bus libraries: libdbus, as
it is shipped in the reference implementation of D-Bus, as well as
GDBus, a component of GLib, the low-level tool library of GNOME.

Of the two libdbus is the much older one, as it was written at the
time the specification was put together. The library was written with
a focus on being portable and to be useful as back-end for higher-level
language bindings. Both of these goals required the API to be very
generic, resulting in a relatively baroque, hard-to-use API that lacks
the bits that make it easy and fun to use from C. It provides the
building blocks, but few tools to actually make it straightforward to
build a house from them. On the other hand, the library is suitable
for most use-cases (for example, it is OOM-safe making it suitable for
writing lowest level system software), and is portable to operating
systems like Windows or more exotic UNIXes.

GDBus
is a much newer implementation. It has been written after considerable
experience with using a GLib/GObject wrapper around libdbus. GDBus is
implemented from scratch, shares no code with libdbus. Its design
differs substantially from libdbus, it contains code generators to
make it specifically easy to expose GObject objects on the bus, or
talking to D-Bus objects as GObject objects. It translates D-Bus data
types to GVariant, which is GLib’s powerful data serialization
format. If you are used to GLib-style programming then you’ll feel
right at home, hacking D-Bus services and clients with it is a lot
simpler than using libdbus.

With sd-bus we now provide a third implementation, sharing no code
with either libdbus or GDBus. For us, the focus was on providing kind
of a middle ground between libdbus and GDBus: a low-level C library
that actually is fun to work with, that has enough syntactic sugar to
make it easy to write clients and services with, but on the other hand
is more low-level than GDBus/GLib/GObject/GVariant. To be able to use
it in systemd’s various system-level components it needed to be
OOM-safe and minimal. Another major point we wanted to focus on was
supporting a kdbus back-end right from the beginning, in addition to
the socket transport of the original D-Bus specification (“dbus1”). In
fact, we wanted to design the library closer to kdbus’ semantics than
to dbus1’s, wherever they are different, but still cover both
transports nicely. In contrast to libdbus or GDBus portability is not
a priority for sd-bus, instead we try to make the best of the Linux
platform and expose specific Linux concepts wherever that is
beneficial. Finally, performance was also an issue (though a secondary
one): neither libdbus nor GDBus will win any speed records. We wanted
to improve on performance (throughput and latency) — but simplicity
and correctness are more important to us. We believe the result of our
work delivers our goals quite nicely: the library is fun to use,
supports kdbus and sockets as back-end, is relatively minimal, and the
performance is substantially better than both libdbus and GDBus.

To decide which of the three APIs to use for your C project, here are
some short guidelines:

  • If you hack on a GLib/GObject project, GDBus is definitely your
    first choice.

  • If portability to non-Linux kernels — including Windows, Mac OS and
    other UNIXes — is important to you, use either GDBus (which more or
    less means buying into GLib/GObject) or libdbus (which requires a
    lot of manual work).

  • Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here; this is all about plain C
only. But do note: if you use Qt, then QtDBus is the D-Bus API of
choice, being a wrapper around libdbus.)

Introduction to D-Bus Concepts

To the uninitiated, D-Bus usually appears to be a relatively opaque
technology. It uses lots of concepts that appear unnecessarily complex
and redundant at first sight. But actually, they make a lot of
sense. Let’s have a look:

  • A bus is where you look for IPC services. There are usually two
    kinds of buses: a system bus, of which there’s exactly one per
    system, and which is where you’d look for system services; and a
    user bus, of which there’s one per user, and which is where you’d
    look for user services, like the address book service or the mail
    program. (Originally, the user bus was actually a session bus — so
    that you get multiple of them if you log in many times as the same
    user — and on most setups it still is, but we are working on
    moving things to a true user bus, of which there is only one per
    user on a system, regardless of how many times that user happens to
    log in.)

  • A service is a program that offers some IPC API on a bus. A
    service is identified by a name in reverse domain name
    notation. Thus, the org.freedesktop.NetworkManager service on the
    system bus is where NetworkManager’s APIs are available and
    org.freedesktop.login1 on the system bus is where
    systemd-logind’s APIs are exposed.

  • A client is a program that makes use of some IPC API on a bus. It
    talks to a service, monitors it and generally doesn’t provide any
    services on its own. That said, lines are blurry and many services
    are also clients to other services. Frequently the term peer is
    used as a generalization to refer to either a service or a client.

  • An object path is an identifier for an object on a specific
    service. In a way this is comparable to a C pointer, since that’s
    how you generally reference a C object, if you hack object-oriented
    programs in C. However, C pointers are just memory addresses, and
    passing memory addresses around to other processes would make
    little sense, since they of course refer to the address space of
    the service, and the client couldn’t make sense of them. Thus, the D-Bus
    designers came up with the object path concept, which is just a
    string that looks like a file system path. Example:
    /org/freedesktop/login1 is the object path of the ‘manager’
    object of the org.freedesktop.login1 service (which, as we
    remember from above, is still the service systemd-logind
    exposes). Because object paths are structured like file system
    paths they can be neatly arranged in a tree, so that you end up
    with a veritable tree of objects. For example, you’ll find all user
    sessions systemd-logind manages below the
    /org/freedesktop/login1/session sub-tree, for example called
    /org/freedesktop/login1/session/_7,
    /org/freedesktop/login1/session/_55 and so on. How services
    precisely label their objects and arrange them in a tree is
    completely up to the developers of the services.

  • Each object that is identified by an object path has one or more
    interfaces. An interface is a collection of signals, methods, and
    properties (collectively called members), that belong
    together. The concept of a D-Bus interface is actually pretty
    much identical to what you know from programming languages such as
    Java, which also have an interface concept. Which interfaces an
    object implements is up to the developers of the service. Interface
    names are in reverse domain name notation, much like service
    names. (Yes, that’s admittedly confusing, in particular since it’s
    pretty common for simpler services to reuse the service name string
    also as an interface name.) A couple of interfaces are standardized
    though and you’ll find them available on many of the objects
    offered by the various services. Specifically, those are
    org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer
    and org.freedesktop.DBus.Properties.

  • An interface can contain methods. The word “method” is more or
    less just a fancy word for “function”, and is a term used pretty
    much the same way in object-oriented languages such as Java. The
    most common interaction between D-Bus peers is that one peer
    invokes one of these methods on another peer and gets a reply. A
    D-Bus method takes a couple of parameters, and returns others. The
    parameters are transmitted in a type-safe way, and the type
    information is included in the introspection data you can query
    from each object. Usually, method names (and the other member
    types) follow a CamelCase syntax. For example, systemd-logind
    exposes an ActivateSession method on the
    org.freedesktop.login1.Manager interface that is available on the
    /org/freedesktop/login1 object of the org.freedesktop.login1
    service.

  • A signature describes a set of parameters a function (or signal,
    property, see below) takes or returns. It’s a series of characters
    that each encode one parameter by its type. The set of types
    available is pretty powerful. For example, there are simpler types
    like s for string, or u for 32bit integer, but also complex
    types such as as for an array of strings or a(sb) for an array
    of structures consisting of one string and one boolean each. See
    the D-Bus specification
    for the full explanation of the type system. The
    ActivateSession method mentioned above takes a single string as
    parameter (the parameter signature is hence s), and returns
    nothing (the return signature is hence the empty string). Of
    course, signatures can get a lot more complex; see below for more
    examples, and see the short C sketch after this list for how such
    complex types are read in code.

  • A signal is another member type that the D-Bus object system
    knows. Much like a method it has a signature. However, they serve
    different purposes. While in a method call a single client issues a
    request on a single service, and that service sends back a response
    to the client, signals are for general notification of
    peers. Services send them out when they want to tell one or more
    peers on the bus that something happened or changed. In contrast to
    method calls and their replies they are hence usually broadcast
    over a bus. While method calls/replies are used for duplex
    one-to-one communication, signals are usually used for simplex
    one-to-many communication (note however that that’s not a
    requirement, they can also be used one-to-one). Example:
    systemd-logind broadcasts a SessionNew signal from its manager
    object each time a user logs in, and a SessionRemoved signal
    every time a user logs out.

  • A property is the third member type that the D-Bus object system
    knows. It’s similar to the property concept known by languages like
    C#. Properties also have a signature, and are more or less just
    variables that an object exposes, that can be read or altered by
    clients. Example: systemd-logind exposes a property Docked of
    the signature b (a boolean). It reflects whether systemd-logind
    thinks the system is currently in a docking station of some form
    (only applies to laptops …).
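
To make the type system a bit more concrete, here is a minimal C
sketch of how a parameter with the complex signature a(sb) (an array
of structures of one string and one boolean each) would be read, using
the sd-bus message API that is introduced later in this story. The
message m and its contents are hypothetical; the three
sd_bus_message_* calls are real sd-bus functions:

#include <stdio.h>
#include <systemd/sd-bus.h>

/* Sketch: reading an "a(sb)" parameter. Assumes "m" is a received
 * sd_bus_message positioned at such a parameter. */
static int read_array_of_sb(sd_bus_message *m) {
        const char *s;
        int b, r;

        /* Enter the array container; its element signature is "(sb)" */
        r = sd_bus_message_enter_container(m, 'a', "(sb)");
        if (r < 0)
                return r;

        /* Read one (sb) structure at a time; the read call returns 0
         * when the end of the container is reached */
        while ((r = sd_bus_message_read(m, "(sb)", &s, &b)) > 0)
                printf("%s: %s\n", s, b ? "true" : "false");
        if (r < 0)
                return r;

        /* Leave the array container again */
        return sd_bus_message_exit_container(m);
}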

So much for the various concepts D-Bus knows. Of course, all these new
concepts might be overwhelming. Let’s look at them from a different
perspective. I assume many of the readers have an understanding of
today’s web technology, specifically HTTP and REST. Let’s try to
compare the concept of an HTTP request with the concept of a D-Bus
method call:

  • An HTTP request is issued on a specific network. It could be the
    Internet, or it could be your local LAN, or a company
    VPN. Depending on which network you issue the request on, you’ll be
    able to talk to a different set of servers. This is not unlike the
    “bus” concept of D-Bus.

  • On the network you then pick a specific HTTP server to talk
    to. That’s roughly comparable to picking a service on a specific bus.

  • On the HTTP server you then ask for a specific URL. The “path” part
    of the URL (by which I mean everything after the host name of the
    server, up to the last “/”) is pretty similar to a D-Bus object path.

  • The “file” part of the URL (by which I mean everything after the
    last slash, following the path, as described above), then defines
    the actual call to make. In D-Bus this could be mapped to an
    interface and method name.

  • Finally, the parameters of an HTTP call follow the path after the
    “?”; they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit
like comparing apples and oranges. However, I think it’s still useful to
get a bit of a feeling of what maps to what.

From the shell

So much about the concepts and the gray theory behind them. Let’s make
this exciting; let’s actually see how this feels on a real system.

For a while now systemd has included a tool busctl that is useful to
explore and interact with the D-Bus object system. When invoked
without parameters, it will show you a list of all peers connected to
the system bus. (Use --user to see the peers of your user bus
instead):

$ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.4                                       708 systemd-logind  root             :1.4          systemd-logind.service    -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
[…]
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.login1                     708 systemd-logind  root             :1.4          systemd-logind.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -
[…]

(I have shortened the output a bit, to keep things brief).

The list begins with a list of all peers currently connected to the
bus. They are identified by peer names like “:1.11”. These are called
unique names in D-Bus nomenclature. Basically, every peer has a
unique name, and they are assigned automatically when a peer connects
to the bus. They are much like an IP address, if you will. You’ll
notice that a couple of peers are already connected, including our
little busctl tool itself as well as a number of system services. The
list then shows all actual services on the bus, identified by their
service names (as discussed above; to discern them from the unique
names these are also called well-known names). In many ways
well-known names are similar to DNS host names, i.e. they are a
friendlier way to reference a peer, but on the lower level they just
map to an IP address, or in this comparison the unique name. Much like
you can connect to a host on the Internet by either its host name or
its IP address, you can also connect to a bus peer either by its
unique or its well-known name. (Note that each peer can have as many
well-known names as it likes, much like an IP address can have
multiple host names referring to it).
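
You can observe this equivalence directly with busctl’s status verb,
which accepts either kind of name. As a hedged example (the unique
name :1.4 is taken from the listing above and will differ on your
system), both of these report on the same peer, systemd-logind:

$ busctl status org.freedesktop.login1
$ busctl status :1.4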

OK, that’s already kinda cool. Try it for yourself, on your local
machine (all you need is a recent, systemd-based distribution).

Let’s now take the next step. Let’s see which objects the
org.freedesktop.login1 service actually offers:

$ busctl tree org.freedesktop.login1
└─/org/freedesktop/login1
  ├─/org/freedesktop/login1/seat
  │ ├─/org/freedesktop/login1/seat/seat0
  │ └─/org/freedesktop/login1/seat/self
  ├─/org/freedesktop/login1/session
  │ ├─/org/freedesktop/login1/session/_31
  │ └─/org/freedesktop/login1/session/self
  └─/org/freedesktop/login1/user
    ├─/org/freedesktop/login1/user/_1000
    └─/org/freedesktop/login1/user/self

Pretty, isn’t it? What’s actually even nicer, but which the output
does not show, is that there’s full command line completion
available: as you press TAB the shell will auto-complete the service
names for you. It’s a real pleasure to explore your D-Bus objects that
way!

The output shows some objects that you might recognize from the
explanations above. Now, let’s go further. Let’s see what interfaces,
methods, signals and properties one of these objects actually exposes:

$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
org.freedesktop.login1.Session      interface -         -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Leader                             property  u         762                                      const
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.Service                            property  s         "gdm-autologin"                          const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, hence
both the service name and the object path used are easily put together
on the shell simply by pressing TAB. The output shows the methods,
properties and signals of one of the session objects that are currently
made available by systemd-logind. There’s a section for each
interface the object knows. The second column tells you what kind of
member is shown in the line. The third column shows the signature of
the member. In the case of method calls, that’s the input parameters;
the fourth column shows what is returned. For properties, the fourth
column shows their current value.
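
Individual properties can also be read directly with the get-property
verb, which prints the signature followed by the current value. A
hedged example (the session path is the one from above and will differ
on your system):

$ busctl get-property org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Active
b true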

So far, we just explored. Let’s take the next step now: let’s become
active – let’s call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don’t think I need to mention this anymore, but anyway: again
there’s full command line completion available. The third argument is
the interface name, the fourth the method name; both can be easily
completed by pressing TAB. In this case we picked the Lock method,
which activates the screen lock for the specific session. And yup,
the instant I pressed enter on this line my screen lock turned on
(this only works on DEs that correctly hook into systemd-logind;
GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no
parameters and returns none. Of course, it can get more complicated
for some calls. Here’s another example, this time using one of
systemd’s own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the
signature string that follows the method name (as usual, command line
completion helps you get this right). Following the signature, the
next two parameters are simply the two strings to pass. The specified
signature string hence indicates what comes next. systemd’s StartUnit
method call takes the unit name to start as first parameter, and the
mode in which to start it as second. The call returned a single object
path value. It is encoded the same way as the input parameter: a
signature (just o for the object path) followed by the actual value.

Of course, some method call parameters can get a ton more complex, but
with busctl it’s relatively easy to encode them all. See the man page
for details.

busctl knows a number of other operations. For example, you can use
it to monitor D-Bus traffic as it happens (including generating a
.cap file for use with Wireshark!) or you can set or get specific
properties. However, this blog story was supposed to be about sd-bus,
not busctl, hence let’s cut this short here, and let me direct you
to the man page in case you want to know more about the tool.
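
To give you a taste anyway, here are two hedged one-liners: the first
watches matching traffic live, the second writes a Wireshark-compatible
capture (the output file name is of course arbitrary):

$ busctl monitor org.freedesktop.login1
$ busctl capture > dbus.cap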

busctl (like the rest of systemd) is implemented using the sd-bus
API. Thus it exposes many of the features of sd-bus itself. For
example, you can use it to connect to remote or container buses. It
understands both kdbus and classic D-Bus, and more!
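
For example, here are two sketched invocations (the host and container
names are made up; -H/--host and -M/--machine are the actual switches,
and with no verb given the list operation is implied):

$ busctl -H root@somehost
$ busctl -M mycontainer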

sd-bus

But enough! Let’s get back on topic, let’s talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file
sd-bus.h.

Here’s a random selection of features of the library that make it
compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and
    to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients
    in services. Currently 34 individual fields are supported, from the
    PID of the client to its cgroup or capability sets (see the short
    sketch below).

  • Support for tracking the life-cycle of peers in order to release
    local objects automatically when all peers referencing them have
    disconnected.

  • The client builds an efficient decision tree to determine which
    handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and
    back (this is lossy though), to ensure best integration of D-Bus
    into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on
    the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on
completing the set of manual pages. For details
see all pages starting with sd_bus_.
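
As a hedged sketch of the credential model in use (the handler shape
matches the service example below; the method itself is made up, but
sd_bus_query_sender_creds() and the sd_bus_creds_get_*() accessors are
real sd-bus calls):

#include <inttypes.h>
#include <sys/types.h>
#include <systemd/sd-bus.h>

/* Hypothetical method handler that inspects who is calling it */
static int method_whoami(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        sd_bus_creds *creds = NULL;
        pid_t pid = 0;
        uid_t uid = 0;
        int r;

        /* Ask the library for the sender's PID and UID credential fields */
        r = sd_bus_query_sender_creds(m, SD_BUS_CREDS_PID|SD_BUS_CREDS_UID, &creds);
        if (r < 0)
                return r;

        (void) sd_bus_creds_get_pid(creds, &pid);
        (void) sd_bus_creds_get_uid(creds, &uid);

        /* Reply with the UID as "u" (32bit) and the PID as "t" (64bit) */
        r = sd_bus_reply_method_return(m, "ut", (uint32_t) uid, (uint64_t) pid);
        sd_bus_creds_unref(creds);
        return r;
}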

Invoking a Method, from C, with sd-bus

So much about the library in general. Here’s an example for connecting
to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the response message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.\n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to
run it as root though, since access to the StartUnit method is
privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that’s it already, our first example. It showed how we invoked a
method call on the bus. The actual function call of the method is very
close to the busctl command line we used before. I hope the code
excerpt needs little further explanation. It’s supposed to give you a
taste of how to write D-Bus clients with sd-bus. For more
information please have a look at the header file, the man page or
even the sd-bus sources.

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic
example. Let’s have a look at how to write a bus service. We’ll write
a small calculator service that exposes a single object, which
implements an interface with two methods: one to multiply two
64bit signed integers, and one to divide one 64bit signed integer by
another.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right-away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let’s run it:

$ ./bus-service

In another terminal, let’s try to talk to it. Note that this service
is now on the user bus, not on the system bus as before. We do this
for simplicity reasons: on the system bus access to services is
tightly controlled so unprivileged clients cannot request privileged
operations. On the user bus, however, things are simpler: as only
processes of the user owning the bus can connect, no further policy
enforcement will complicate this example. Because the service is on
the user bus, we have to pass the --user switch on the busctl
command line. Let’s start with looking at the service’s object tree.

$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there’s only a single object on the service, which is
not surprising, given that our code above only registered one. Let’s
see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces,
as mentioned above. But the first interface we see is actually the one
we added! It shows our two methods, and both take “xx” (two 64bit
signed integers) as input parameters, and return one “x”. Great! But
does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually
multiplied them for us and returned a single integer 35! Let’s try the
other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let’s trick
it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It caught this and returned a clean error about it. If
you look in the source code example above you’ll see how precisely we
generated the error.
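
For completeness, here is a hedged sketch of what the same interaction
looks like from a C client: if the service returns an error, sd-bus
hands it to us in the sd_bus_error structure, carrying both the error
name and the message our service set:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *reply = NULL;
        sd_bus *bus = NULL;
        int64_t result;
        int r;

        /* The calculator service lives on the user bus */
        r = sd_bus_open_user(&bus);
        if (r < 0)
                goto finish;

        /* Deliberately divide by zero to trigger the service's error */
        r = sd_bus_call_method(bus,
                               "net.poettering.Calculator",
                               "/net/poettering/Calculator",
                               "net.poettering.Calculator",
                               "Divide",
                               &error,
                               &reply,
                               "xx", (int64_t) 43, (int64_t) 0);
        if (r < 0) {
                /* error.name is "net.poettering.DivisionByZero" here */
                fprintf(stderr, "%s: %s\n", error.name, error.message);
                goto finish;
        }

        r = sd_bus_message_read(reply, "x", &result);
        if (r >= 0)
                printf("Result: %" PRIi64 "\n", result);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}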

And that’s really all I have for today. Of course, the examples I
showed are short, and I don’t get into detail here on what precisely
each line does. However, this is supposed to be a short introduction
into D-Bus and sd-bus, and it’s already way too long for that …

I hope this blog story was useful to you. If you are interested in
using sd-bus for your own programs, I hope this gets you started. If
you have further questions, check the (incomplete) man pages, and
ask us on IRC or on the systemd mailing list. If you need more
examples, have a look at the systemd source tree; all of systemd’s
many bus services use sd-bus extensively.

.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, so
both the service name and the object path are easily put together on
the shell simply by pressing TAB. The output shows the methods,
properties, and signals of one of the session objects currently made
available by systemd-logind. There’s a section for each interface the
object implements. The second column tells you what kind of member is
shown in the line. The third column shows the signature of the
member; for method calls that’s the input parameters, and the fourth
column shows what is returned. For properties, the fourth column
shows their current value.

So far, we just explored. Let’s take the next step now: let’s become
active – let’s call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don’t think I need to mention this anymore, but anyway: again
there’s full command line completion available. The third argument is
the interface name, the fourth the method name; both can be easily
completed by pressing TAB. In this case we picked the Lock method,
which activates the screen lock for the specific session. And yupp,
the instant I pressed enter on this line my screen lock turned on
(this only works on DEs that correctly hook into systemd-logind;
GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no
parameters and returns none. Of course, it can get more complicated
for some calls. Here’s another example, this time using one of
systemd’s own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the
signature string that follows the method name (as usual, command line
completion helps you get this right). Following the signature, the
next two parameters are simply the two strings to pass. The specified
signature string hence indicates what comes next. systemd’s StartUnit
method call takes the unit name to start as its first parameter, and
the mode in which to start it as its second. The call returned a
single object path value. It is encoded the same way as the input
parameters: a signature (just o for the object path) followed by the
actual value.

Of course, some method call parameters can get a ton more complex,
but with busctl it’s relatively easy to encode them all. See the man
page for details.
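
For instance, arrays are encoded by specifying the number of entries
first, followed by the entries themselves. As a hedged illustration
(treat the exact call as a sketch): systemd’s SetEnvironment method
takes an array of strings (signature as), so a call could look like
this:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager SetEnvironment as 2 FOO=bar WALDO=waldo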

busctl knows a number of other operations. For example, you can use
it to monitor D-Bus traffic as it happens (including generating a
.cap file for use with Wireshark!) or you can set or get specific
properties. However, this blog story was supposed to be about sd-bus,
not busctl, hence let’s cut this short here, and let me direct you
to the man page in case you want to know more about the tool.

busctl (like the rest of systemd) is implemented using the sd-bus
API. Thus it exposes many of the features of sd-bus itself. For
example, you can use it to connect to remote or container buses. It
understands both kdbus and classic D-Bus, and more!

sd-bus

But enough! Let’s get back on topic, let’s talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file
sd-bus.h.

Here’s a random selection of features of the library that make it
compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and
    to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients
    in services. Currently 34 individual fields are supported, from the
    PID of the client to the cgroup or capability sets.

  • Support for tracking the life-cycle of peers in order to release
    local objects automatically when all peers referencing them have
    disconnected.

  • The client builds an efficient decision tree to determine which
    handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and
    back (this is lossy though), to ensure best integration of D-Bus
    into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on
    the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on
completing the set of manual pages. For details
see all pages starting with sd_bus_.

Invoking a Method, from C, with sd-bus

So much for the library in general. Here’s an example of connecting
to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %sn", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the response message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %sn", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %sn", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to
run it as root though, since access to the StartUnit method is
privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that’s it already, our first example. It showed how we invoked a
method call on the bus. The actual function call of the method is very
close to the busctl command line we used before. I hope the code
excerpt needs little further explanation. It’s supposed to give you a
taste of how to write D-Bus clients with sd-bus. For more information
please have a look at the header file, the man page, or even the
sd-bus sources.
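
As a rough sketch (this is my own example, not part of the original
program above), here’s how one might read a more complex, array-valued
reply, using systemd’s ListUnitFiles method, which (to my knowledge)
returns an array of (path, state) string pairs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *reply = NULL;
        sd_bus *bus = NULL;
        const char *unit_path, *state;
        int r;

        /* Connect to the system bus, as before */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* ListUnitFiles takes no arguments, hence the empty signature */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",
                               "/org/freedesktop/systemd1",
                               "org.freedesktop.systemd1.Manager",
                               "ListUnitFiles",
                               &error,
                               &reply,
                               "");
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Enter the outer array, then read (ss) pairs until none are left */
        r = sd_bus_message_enter_container(reply, 'a', "(ss)");
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        while ((r = sd_bus_message_read(reply, "(ss)", &unit_path, &state)) > 0)
                printf("%s: %s\n", unit_path, state);

        sd_bus_message_exit_container(reply);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

The pattern is always the same: enter a container, read entries until
sd_bus_message_read() returns 0, then exit the container again.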

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic
example. Let’s have a look at how to write a bus service. We’ll write
a small calculator service that exposes a single object, which
implements an interface exposing two methods: one to multiply two
64bit signed integers, and one to divide one 64bit signed integer by
another.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %sn", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %sn", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %sn", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %sn", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %sn", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %sn", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right-away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %sn", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let’s run it:

$ ./bus-service

In another terminal, let’s try to talk to it. Note that this service
is now on the user bus, not on the system bus as before. We do this
for simplicity’s sake: on the system bus, access to services is
tightly controlled so that unprivileged clients cannot request
privileged operations. On the user bus, however, things are simpler:
as only processes of the user owning the bus can connect, no further
policy enforcement will complicate this example. Because the service
is on the user bus, we have to pass the --user switch on the busctl
command line. Let’s start by looking at the service’s object tree.

$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there’s only a single object on the service, which is
not surprising, given that our code above only registered one. Let’s
see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces,
as mentioned above. But the first interface we see is actually the one
we added! It shows our two methods, and both take “xx” (two 64bit
signed integers) as input parameters, and return one “x”. Great! But
does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually
multiplied them for us and returned a single integer 35! Let’s try the
other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let’s trick
it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this nicely and returned a clean error about it. If
you look in the source code example above you’ll see how precisely we
generated the error.
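
As an aside, sd-bus also permits replying with the error directly,
instead of filling in the ret_error parameter and returning a
negative errno. Here’s a minimal sketch of an alternative Divide
handler doing just that (assuming the sd_bus_reply_method_errorf()
call; treat this as a sketch rather than the canonical form):

#include <systemd/sd-bus.h>

/* Drop-in alternative to method_divide() above: reply with the
 * error message directly instead of filling in ret_error */
static int method_divide_alt(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0)
                return r;

        if (y == 0)
                return sd_bus_reply_method_errorf(m, "net.poettering.DivisionByZero",
                                                  "Sorry, can't allow division by zero.");

        return sd_bus_reply_method_return(m, "x", x / y);
}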

And that’s really all I have for today. Of course, the examples I
showed are short, and I don’t go into detail here on what precisely
each line does. However, this is supposed to be a short introduction
to D-Bus and sd-bus, and it’s already way too long for that …

I hope this blog story was useful to you. If you are interested in
using sd-bus for your own programs, I hope this gets you started. If
you have further questions, check the (incomplete) man pages, and ask
us on IRC or the systemd mailing list. If you need more examples,
have a look at the systemd source tree; all of systemd’s many bus
services use sd-bus extensively.

systemd For Administrators, Part XXI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemd-for-administrators-part-xxi.html

Container Integration

For a while now, containers have been one of the hot topics on
Linux. Container managers such as libvirt-lxc, LXC or Docker are
widely known and used these days. In this blog story I want to shed
some light on systemd‘s integration points with container managers, to
allow seamless management of services across container boundaries.

We’ll focus on OS containers here, i.e. the case where an init system
runs inside the container, and the container hence in most ways
appears like an independent system of its own. Much of what I
describe here is available on pretty much any container manager that
implements the logic described here, including libvirt-lxc. However,
to make things easy we’ll focus on systemd-nspawn, the mini-container
manager that is shipped with systemd itself. systemd-nspawn uses the
same kernel interfaces as the other container managers, but is less
flexible, as it is designed to be a container manager that is as
simple to use as possible and “just works”, rather than a generic
tool you can configure in every low-level detail. We use
systemd-nspawn extensively when developing systemd.

Anyway, so let’s get started with our run-through. Let’s start by
creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in
/srv/mycontainer. This command line is Fedora-specific, but most
distributions provide similar functionality in one way or another.
The examples section in the systemd-nspawn(1) man page contains a
list of the various command lines for other distributions.

We now have the new container installed; let’s set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then
use passwd to set the root password. After that the initial setup is done,
hence let’s boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container
integration of systemd. Let’s have a look at the first tool,
machinectl. When run without parameters it shows a list of all
locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The “status” subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container,
including its control group tree (with processes), IP addresses and
root directory.

The “login” subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The “reboot” subcommand reboots the container:

# machinectl reboot mycontainer

The “poweroff” subcommand powers the container off:

# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple more
commands; please check the man page for details. Note again that even
though we use systemd-nspawn as container manager here, the concepts
apply to any container manager that implements the logic described
here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with
containers. Many of systemd’s own tools have been updated to
explicitly support containers too! Let’s try this (after starting the
container up again first, repeating the systemd-nspawn command from
above):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similarly, many other tools have been updated for connecting to local
containers. Here’s systemctl(1)’s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified
container, not the host. (Output is shortened here; the blog story is
already getting too long.)

Let’s use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support than just the -M switch,
though. With the -r switch it shows the units running on the host,
plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0x2dLVDSx2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, followed by the units of
the one container we currently have running. The units of the
containers are prefixed with the container name and a colon
(“:”). (The output is shortened again for brevity’s sake.)

The list-machines subcommand of systemctl shows a list of all running
containers, inquiring of the system managers within the containers
about system state and health. More specifically, it shows whether
containers are properly booted up and whether any services have
failed:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers
in parallel. One of them has a failed service, which results in the
machine state being degraded.

Let’s have a look at journalctl(1)’s container support. It too
supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the
host and all local containers:

# journalctl -m -e

(Let’s skip the output here completely, I figure you can extrapolate
how this looks.)

But it’s not only systemd’s own tools that understand containers
these days; procps supports them, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the
container a process belongs to. All processes shown with “-” belong to
the host itself.

But it doesn’t stop there. The new “sd-bus” D-Bus client library we
have been preparing in the systemd/kdbus context knows containers
too. While you use sd_bus_open_system() to connect to your local
host’s system bus, sd_bus_open_system_container() may be used to
connect to the system bus of any local container, so that you can
execute bus methods on it.
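
Here’s a minimal sketch of what that might look like (note that in
more recent systemd versions this call is, as far as I know, named
sd_bus_open_system_machine()):

#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus *bus = NULL;
        int r;

        /* Connect to the system bus inside the "mycontainer" container */
        r = sd_bus_open_system_container(&bus, "mycontainer");
        if (r < 0) {
                fprintf(stderr, "Failed to connect to container bus: %s\n", strerror(-r));
                return 1;
        }

        /* From here on the handle is used like any other sd_bus
         * connection, e.g. with sd_bus_call_method() as shown above */
        sd_bus_unref(bus);
        return 0;
}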

sd-login.h and machined’s bus interface provide a number of APIs to
add container support to other programs, too. They support
enumeration of containers as well as retrieving the machine name from
a PID, and similar.
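
For example, here’s a small sketch using sd_pid_get_machine_name()
from sd-login.h to map a PID to the container it runs in (the call
fails for processes running on the host itself):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-login.h>

int main(int argc, char *argv[]) {
        char *machine = NULL;
        int r;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s PID\n", argv[0]);
                return EXIT_FAILURE;
        }

        /* Look up which machine (container) the given PID belongs to */
        r = sd_pid_get_machine_name((pid_t) atoi(argv[1]), &machine);
        if (r < 0) {
                fprintf(stderr, "Failed to get machine name: %s\n", strerror(-r));
                return EXIT_FAILURE;
        }

        printf("PID %s runs in machine %s.\n", argv[1], machine);
        free(machine);
        return EXIT_SUCCESS;
}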

systemd-networkd also has support for containers. When run inside a
container it will by default run a DHCP client and IPv4LL on any veth
network interface named host0 (this interface is special under the
logic described here). When run on the host, networkd will by default
provide a DHCP server and IPv4LL on any veth network interface named
ve- followed by the container name.

Let’s have a look at one last facet of systemd’s container
integration: the hook-up with the name service switch. Recent systemd
versions contain a new NSS module, nss-mymachines, that makes the
names of all local containers resolvable via gethostbyname() and
getaddrinfo(). This only applies to containers that run within their
own network namespace. With the systemd-nspawn command shown above,
however, the container shares the network configuration with the
host; hence let’s restart the container, this time with a virtual
veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now (assuming that networkd is used both in the container and on the
host), we can already ping the container using its name, due to the
simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping; it works with
all other tools that use libc gethostbyname() or getaddrinfo(), too,
among them the venerable ssh.
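
To illustrate, here’s a small sketch showing that any plain
getaddrinfo() user picks this up (assuming the mycontainer container
from above is running with --network-veth):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
        struct addrinfo hints, *res = NULL;
        char buf[INET_ADDRSTRLEN];
        int r;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        /* nss-mymachines resolves this just like any other host name */
        r = getaddrinfo("mycontainer", NULL, &hints, &res);
        if (r != 0) {
                fprintf(stderr, "Failed to resolve: %s\n", gai_strerror(r));
                return 1;
        }

        inet_ntop(AF_INET, &((struct sockaddr_in *) res->ai_addr)->sin_addr, buf, sizeof(buf));
        printf("mycontainer has address %s\n", buf);

        freeaddrinfo(res);
        return 0;
}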

And this is pretty much all I want to cover for now. We briefly
touched on a variety of integration points, and there’s a lot more
still if you look closely. We are working on even more container
integration all the time, so expect more new features in this area
with every systemd release.

Note that the whole machine concept is actually not limited to
containers; it covers VMs too, to a certain degree. However, the
integration is not as close, as access to a VM’s internals is not as
easy as for containers, since it usually requires a network transport
instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look
at the linked man pages and other documentation.

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, but only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
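
Here's a minimal sketch of that idea (this is not systemd's actual code; the socket path, the daemon binary and the choice of fd 3 are merely illustrative, the real fd-passing protocol is described further below):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void) {
        struct sockaddr_un sa;
        pid_t pid;
        int fd;

        /* Step one: create and bind the listening socket, long before
         * the daemon that will serve it is around. */
        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, "/run/example.sock", sizeof(sa.sun_path) - 1);
        unlink(sa.sun_path); /* remove a stale socket from a previous run */

        if (bind(fd, (struct sockaddr*) &sa, sizeof(sa)) < 0 ||
            listen(fd, SOMAXCONN) < 0) { perror("bind/listen"); return 1; }

        /* From this moment on clients may connect(); the kernel queues
         * them until somebody calls accept(). */

        /* Step two: start the daemon, handing the socket across exec()
         * on a well-known file descriptor. */
        pid = fork();
        if (pid == 0) {
                dup2(fd, 3);
                execl("/usr/sbin/example-daemon", "example-daemon", (char*) NULL);
                _exit(1);
        }

        /* The "init" side would now go on starting further daemons,
         * without waiting for this one to finish initializing. */
        return 0;
}
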

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video
where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.
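
For reference, bus activation is configured with a small D-Bus service description file. A minimal sketch (the service name and binary are made up; for the system bus such files are installed under /usr/share/dbus-1/system-services/):

[D-BUS Service]
Name=org.example.Daemon
Exec=/usr/bin/example-daemon
User=root

If a client sends a request to org.example.Daemon while nobody owns that name yet, the bus spawns the daemon and queues the request until the name is established.
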

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at
the serialization graphs of the boot process
of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then, when our file-system has finished fsck and quota
checking during normal boot-up, we replace it with the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, since quite a while the
kernel knows Control
Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups have been introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use it for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
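
Here's a minimal sketch of the mechanism (not systemd's actual code; the cgroup mount point and the flat layout are illustrative and depend on how the hierarchy was set up):

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void) {
        FILE *f;

        /* Create a named group in the cgroup virtual file system... */
        mkdir("/sys/fs/cgroup/example-group", 0755);

        /* ...and move ourselves into it. Every child we fork() from
         * now on inherits the membership and cannot escape it without
         * privileges. */
        f = fopen("/sys/fs/cgroup/example-group/tasks", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "%d\n", (int) getpid());
        fclose(f);

        /* A supervisor can later enumerate the group's members, or read
         * /proc/$PID/cgroup to map any process back to its service. */
        return 0;
}
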

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate‘s updatedb on system interactivity.
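
For illustration, here is a small sketch of what that looks like in code; ioprio_set() traditionally has no glibc wrapper, so it is invoked via syscall(), and the constants below mirror the kernel ABI:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_WHO_PROCESS 1
#define IOPRIO_CLASS_IDLE  3
#define IOPRIO_CLASS_SHIFT 13

int main(void) {
        /* Mark the calling process (pid 0) idle for IO: it only gets
         * disk time when nobody else wants it. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) < 0) {
                perror("ioprio_set");
                return 1;
        }

        /* ...heavy background IO, such as walking the whole file tree,
         * would go here... */
        return 0;
}
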

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well-known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns all dependencies on
their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it more simply: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo)
in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that, Upstart does OK at babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little of substance beyond Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the
code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
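
To make the socket/service pairing from the list above concrete, here's a minimal sketch of such a unit pair in the native configuration format, using the nscd example mentioned earlier (the options shown are the bare minimum for illustration; a real unit needs a few more):

nscd.socket:

[Unit]
Description=Name Service Cache Daemon Socket

[Socket]
ListenStream=/var/run/nscd/socket

nscd.service:

[Unit]
Description=Name Service Cache Daemon

[Service]
ExecStart=/usr/sbin/nscd

When the first connection comes in on /var/run/nscd/socket, systemd starts nscd.service and passes the listening socket to it.
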

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset of
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does little more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized; a similar system running systemd,
    however, will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty2.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service avahi-autoipd@eth0.service pulls in
    network@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the order in which units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up by
    removing non-essential jobs from the transaction that might
    break the cycle. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev.) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways for such a service to
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness to init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren't too many
native Upstart services yet; thankfully most distributions don't
carry many of those so far.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot our normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starting services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every monday 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, already, you can run systemd just fine
as a normal user, and it will detect that it is run that way;
support for this mode has been available since the very beginning
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here's
a tarball with unit configuration files
that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup, showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files
  • Grab a name on the bus
  • You may rely on systemd for logging: you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above (see the sketch after this list).
  • Use SIGTERM for requesting shut downs from your daemon.
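
Here's a minimal sketch of the daemon side of this (the fd numbering starting at 3 follows the description above; a real daemon would of course serve connections from its event loop instead of this toy accept() loop):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>

#define LISTEN_FDS_START 3 /* first passed socket, as described above */

int main(void) {
        const char *pid_s = getenv("LISTEN_PID");
        const char *fds_s = getenv("LISTEN_FDS");
        int n, i;

        /* Only honour the variables if they are addressed at us, so
         * that values inherited by our own children cannot confuse
         * them. */
        if (!pid_s || !fds_s || atol(pid_s) != (long) getpid()) {
                fprintf(stderr, "not socket activated\n");
                return 1;
        }

        n = atoi(fds_s);
        for (i = 0; i < n; i++) {
                int fd = LISTEN_FDS_START + i;
                int c = accept(fd, NULL, NULL);
                if (c >= 0)
                        close(c);
        }

        return 0;
}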

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd also supports daemons not written in this
style, for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (Formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so something similar as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), so a port to other operating systems does not appear to us
to make a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (Very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
to write documentation, or contribute a logo.

Community

At this time we only have a source code
repository
and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

To start less.

And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general-purpose OSes) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
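
Mechanically, the trick looks roughly like this sketch (the daemon path
is hypothetical and error handling is abbreviated; this illustrates the
idea, not systemd’s actual code):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Sketch: the supervisor creates the listening socket first, then
     * hands it to the daemon as fd 3 during exec(). Clients may connect
     * the moment listen() returns, long before the daemon is ready. */
    int main(void) {
            struct sockaddr_un sa;
            int fd = socket(AF_UNIX, SOCK_STREAM, 0);

            memset(&sa, 0, sizeof(sa));
            sa.sun_family = AF_UNIX;
            strncpy(sa.sun_path, "/var/run/mydaemon.sock", sizeof(sa.sun_path) - 1);
            unlink(sa.sun_path);

            if (fd < 0 || bind(fd, (struct sockaddr*) &sa, sizeof(sa)) < 0 ||
                listen(fd, SOMAXCONN) < 0) {
                    perror("socket/bind/listen");
                    return 1;
            }

            if (fork() == 0) {
                    dup2(fd, 3);  /* daemon finds its listening socket on fd 3 */
                    execl("/usr/sbin/mydaemon", "mydaemon", (char*) NULL);
                    _exit(1);     /* only reached if exec() failed */
            }
            return 0;
    }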

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself finishes start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus manages to catch up and process it.

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video
where the launchd folks explain what they are
doing. Unfortunately this idea never really took off outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a mode where
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process had to be spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can: D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at
the serialization graphs of the boot process
of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, and quota checked (if enabled). Only after all of
that has fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then, when our file-system has finished fsck and quota
checking during normal boot-up, we replace it with the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, which are usually bigger, maybe even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (even one written to make
shell scripts run faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 times and sed 74 times. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.
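
To make that concrete, here’s a minimal sketch of the classic
double-fork manoeuvre by which a process escapes its parent’s
supervision (the job path is hypothetical):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Sketch: after the double fork the grandchild is reparented to
     * PID 1, and the original parent has no way left to track it. */
    int main(void) {
            pid_t pid = fork();

            if (pid > 0) {
                    waitpid(pid, NULL, 0);  /* reap the middle child, carry on */
                    return 0;
            }
            setsid();                       /* detach from controlling terminal */
            if (fork() > 0)
                    _exit(0);               /* middle child exits right away */

            /* Grandchild: runs on, completely unsupervised. */
            execl("/usr/bin/runaway-job", "runaway-job", (char*) NULL);
            _exit(1);
    }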

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known about Control
Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups have been introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them, for example, to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
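
As a toy illustration (assuming a cgroup hierarchy is mounted at
/cgroup, which may differ on your system), creating a group and joining
it is nothing but a couple of file-system operations:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* Sketch: create a cgroup and move ourselves into it. Everything
     * we fork() from now on inherits the membership and cannot leave
     * it without privileges. */
    int main(void) {
            FILE *f;

            mkdir("/cgroup/mydaemon", 0755);
            f = fopen("/cgroup/mydaemon/tasks", "w");
            if (!f) {
                    perror("fopen");
                    return 1;
            }
            fprintf(f, "%d\n", (int) getpid());
            fclose(f);

            /* /proc/self/cgroup now reports the membership, and a
             * supervisor can enumerate everything spawned below us. */
            return 0;
    }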

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.
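
Since glibc offers no wrapper for ioprio_set(), a sketch has to go
through syscall() directly; the constants below are those from the
kernel’s ioprio.h:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define IOPRIO_WHO_PROCESS 1
    #define IOPRIO_CLASS_IDLE  3
    #define IOPRIO_CLASS_SHIFT 13

    /* Sketch: demote ourselves to idle IO priority, so that bulk disk
     * work only gets bandwidth nobody else is asking for. */
    int main(void) {
            if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                        IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) < 0) {
                    perror("ioprio_set");
                    return 1;
            }
            /* ...now do the bulk work, e.g. an updatedb-style scan... */
            return 0;
    }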

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do, in a fashion similar to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).
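
For illustration, such a read-only overlay boils down to a bind mount
followed by a read-only remount; here’s a sketch (which a supervisor
would do inside the service’s private mount name-space, so the rest of
the system is unaffected):

    #include <stdio.h>
    #include <sys/mount.h>

    /* Sketch: make /etc appear read-only: bind it onto itself, then
     * remount the bind read-only. Writes below /etc then fail with
     * EROFS. */
    int main(void) {
            if (mount("/etc", "/etc", NULL, MS_BIND, NULL) < 0 ||
                mount(NULL, "/etc", NULL,
                      MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) < 0) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }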

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well-known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little of substance beyond Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the
code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

service: these are the most obvious kind of unit:
daemons that can be started, stopped, restarted, reloaded. For
compatibility with SysV we not only support our own
configuration files for services, but also are able to read
classic SysV init scripts, in particular we parse the LSB
header, if it exists. /etc/init.d is hence not much
more than just another source of configuration.

socket: this unit encapsulates a socket in the
file-system or on the Internet. We currently support AF_INET,
AF_INET6, AF_UNIX sockets of the types stream, datagram, and
sequential packet. We also support classic FIFOs as
transport. Each socket unit has a matching
service unit, that is started if the first connection
comes in on the socket or FIFO. Example: nscd.socket
starts nscd.service on an incoming connection.

device: this unit encapsulates a device in the
Linux device tree. If a device is marked for this via udev
rules, it will be exposed as a device unit in
systemd. Properties set with udev can be used as
configuration source to set dependencies for device units.

mount: this unit encapsulates a mount point in the
file system hierarchy. systemd monitors all mount points as
they come and go, and can also be used to mount or
unmount mount-points. /etc/fstab is used here as an
additional configuration source for these mount points, similar to
how SysV init scripts can be used as additional configuration
source for service units.

automount: this unit type encapsulates an automount
point in the file system hierarchy. Each automount
unit has a matching mount unit, which is started
(i.e. mounted) as soon as the automount directory is
accessed.

target: this unit type is used for logical
grouping of units: instead of actually doing anything by itself
it simply references other units, which thereby can be controlled
together. Examples for this are: multi-user.target,
which is a target that basically plays the role of run-level 5 on
a classic SysV system, or bluetooth.target, which is
requested as soon as a bluetooth dongle becomes available and
which simply pulls in bluetooth related services that otherwise
would not need to be started: bluetoothd and
obexd and suchlike.

snapshot: similar to target units
snapshots do not actually do anything themselves and their only
purpose is to reference other units. Snapshots can be used to
save/rollback the state of all services and units of the init
system. Primarily it has two intended use cases: to allow the
user to temporarily enter a specific state such as “Emergency
Shell”, terminating current services, and provide an easy way to
return to the state before, pulling up all services again that
got temporarily pulled down. And to ease support for system
suspending: still many services cannot correctly deal with
system suspend, and it is often a better idea to shut them down
before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.

A short list of other features:

For each process that is spawned, you may control: the
environment, resource limits, working and root directory, umask,
OOM killer adjustment, nice level, IO class and priority, CPU policy
and priority, CPU affinity, timer slack, user id, group id,
supplementary group ids, readable/writable/inaccessible
directories, shared/private/slave mount flags,
capabilities/bounding set, secure bits, CPU scheduler reset on
fork, private /tmp name-space, cgroup control for
various subsystems. Also, you can easily connect
stdin/stdout/stderr of services to syslog, /dev/kmsg,
arbitrary TTYs. If connected to a TTY for input systemd will make
sure a process gets exclusive access, optionally waiting or enforcing
it.

Every executed process gets its own cgroup (currently by
default in the debug subsystem, since that subsystem is not
otherwise used and does little more than the most basic
process grouping), and it is very easy to configure systemd to
place services in cgroups that have been configured externally,
for example via the libcgroups utilities.

The native configuration files use a syntax that closely
follows the well-known .desktop files. It is a simple syntax for
which parsers exist already in many software frameworks. Also, this
allows us to rely on existing tools for i18n for service
descriptions, and similar. Administrators and developers don’t
need to learn a new syntax.
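
To give a rough idea, a native service description might look like the
following mock-up. It is a sketch in the spirit of .desktop files, not
a verbatim copy from the systemd tree, so the actual section and key
names may differ:

    [Unit]
    Description=Avahi mDNS/DNS-SD Stack

    [Service]
    ExecStart=/usr/sbin/avahi-daemon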

As mentioned, we provide compatibility with SysV init
scripts. We take advantage of LSB and Red Hat chkconfig headers
if they are available. If they aren’t we try to make the best of
the otherwise available information, such as the start
priorities in /etc/rc.d. These init scripts are simply
considered a different source of configuration, hence an easy
upgrade path to proper systemd services is available. Optionally
we can read classic PID files for services to identify the main
pid of a daemon. Note that we make use of the dependency
information from the LSB init script headers, and translate
those into native systemd dependencies. Side note: Upstart is
unable to harvest and make use of that information. Boot-up on a
plain Upstart system with mostly LSB SysV init scripts will
hence not be parallelized, a similar system running systemd
however will. In fact, for Upstart all SysV scripts together
make one job that is executed, they are not treated
individually, again in contrast to systemd where SysV init
scripts are just another source of configuration and are all
treated and controlled individually, much like any other native
systemd service.

Similarly, we read the existing /etc/fstab
configuration file, and consider it just another source of
configuration. Using the comment= fstab option you can
even mark /etc/fstab entries to become systemd
controlled automount points.
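
Hypothetically, such an entry might look like this (the exact option
syntax is illustrative, not gospel):

    /dev/sda5  /home  ext4  defaults,comment=systemd.automount  0  2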

If the same unit is configured in multiple configuration
sources (e.g. /etc/systemd/system/avahi.service exists,
and /etc/init.d/avahi too), then the native
configuration will always take precedence, the legacy format is
ignored, allowing an easy upgrade path, and allowing packages to
carry both a SysV init script and a systemd service file for a
while.

We support a simple templating/instance mechanism. Example:
instead of having six configuration files for six gettys, we
only have one getty@.service file which gets instantiated to
getty@tty2.service and suchlike. The interface part can
even be inherited by dependency expressions, i.e. it is easy to
encode that a service [email protected] pulls in
[email protected], while leaving the
eth0 string wild-carded.

For socket activation we support full compatibility with the
traditional inetd modes, as well as a very simple mode that
tries to mimic launchd socket activation and is recommended for
new services. The inetd mode only allows passing one socket to
the started daemon, while the native mode supports passing
arbitrary numbers of file descriptors. We also support one
instance per connection, as well as one instance for all
connections modes. In the former mode we name the cgroup the
daemon will be started in after the connection parameters, and
utilize the templating logic mentioned above for this. Example:
sshd.socket might spawn services
sshd@192.168.0.1-4711-192.168.0.2-22.service with a
cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
(i.e. the IP address and port numbers are used in the instance
names. For AF_UNIX sockets we use PID and user id of the
connecting client). This provides a nice way for the
administrator to identify the various instances of a daemon and
control their runtime individually. The native socket passing
mode is very easily implementable in applications: if
$LISTEN_FDS is set it contains the number of sockets
passed and the daemon will find them sorted as listed in the
.service file, starting from file descriptor 3 (a
nicely written daemon could also use fstat() and
getsockname() to identify the sockets in case it
receives more than one). In addition we set $LISTEN_PID
to the PID of the daemon that shall receive the fds, because
environment variables are normally inherited by sub-processes and
hence could confuse processes further down the chain. Even
though this socket passing logic is very simple to implement in
daemons, we will provide a BSD-licensed reference implementation
that shows how to do this. We have ported a couple of existing
daemons to this new scheme.
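
To show just how simple, here’s a minimal sketch of the daemon side of
the protocol described above (the promised reference implementation
does this more carefully):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define LISTEN_FDS_START 3

    /* Sketch: verify that $LISTEN_PID refers to us, then take over the
     * passed sockets starting at fd 3. */
    int main(void) {
            const char *pid = getenv("LISTEN_PID");
            const char *fds = getenv("LISTEN_FDS");
            int n, i;

            if (!pid || !fds || (pid_t) atoi(pid) != getpid()) {
                    /* Not socket-activated: fall back to creating our
                     * own listening sockets here. */
                    return 0;
            }

            n = atoi(fds);
            for (i = 0; i < n; i++) {
                    /* fstat() and getsockname() can tell the fds apart
                     * if more than one was passed. */
                    fprintf(stderr, "Will serve on passed fd %d\n",
                            LISTEN_FDS_START + i);
            }
            /* ...enter the event loop, accept()ing on those fds... */
            return 0;
    }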

We provide compatibility with /dev/initctl to a
certain extent. This compatibility is in fact implemented with a
FIFO-activated service, which simply translates these legacy
requests to D-Bus requests. Effectively this means the old
shutdown, poweroff and similar commands from
Upstart and sysvinit continue to work with
systemd.

We also provide compatibility with utmp and
wtmp. Possibly even to an extent that is far more
than healthy, given how crufty utmp and wtmp
are.

systemd supports several kinds of
dependencies between units. After/Before can be used to fix
the ordering how units are activated. It is completely
orthogonal to Requires and Wants, which
express a positive requirement dependency, either mandatory, or
optional. Then, there is Conflicts which
expresses a negative requirement dependency. Finally, there are
three further, less used dependency types.

systemd has a minimal transaction system. Meaning: if a unit
is requested to start up or shut down we will add it and all its
dependencies to a temporary transaction. Then, we will
verify if the transaction is consistent (i.e. whether the
ordering via After/Before of all units is
cycle-free). If it is not, systemd will try to fix it up, and
removes non-essential jobs from the transaction that might
remove the loop. Also, systemd tries to suppress non-essential
jobs in the transaction that would stop a running
service. Non-essential jobs are those which the original request
did not directly include but which were pulled in by
Wants type of dependencies. Finally we check whether
the jobs of the transaction contradict jobs that have already
been queued, and optionally the transaction is aborted then. If
all worked out and the transaction is consistent and minimized
in its impact it is merged with all already outstanding jobs and
added to the run queue. Effectively this means that before
executing a requested operation, we will verify that it makes
sense, fixing it if possible, and only failing if it really cannot
work.

We record start/exit time as well as the PID and exit status
of every process we spawn and supervise. This data can be used
to cross-link daemons with their data in abrtd, auditd and
syslog. Think of a UI that will highlight crashed daemons for
you, and allows you to easily navigate to the respective UIs for
syslog, abrt, and auditd that will show the data generated from
and for this daemon on a specific run.

We support reexecution of the init process itself at any
time. The daemon state is serialized before the reexecution and
deserialized afterwards. That way we provide a simple way to
facilitate init system upgrades as well as handover from an
initrd daemon to the final daemon. Open sockets and autofs
mounts are properly serialized away, so that they stay
connectible all the time, in a way that clients will not even
notice that the init system reexecuted itself. Also, the fact
that a big part of the service state is encoded anyway in the
cgroup virtual file system would even allow us to resume
execution without access to the serialization data. The
reexecution code paths are actually mostly the same as the init
system configuration reloading code paths, which
guarantees that reexecution (which is probably more seldom
triggered) gets similar testing as reloading (which is probably
more common).

Starting the work of removing shell scripts from the boot
process we have recoded part of the basic system setup in C and
moved it directly into systemd. Among that is mounting of the API
file systems (i.e. virtual file systems such as /proc,
/sys and /dev) and setting of the
host-name.

Server state is introspectable and controllable via
D-Bus. This is not complete yet, but quite extensive.

While we want to emphasize socket-based and bus-name-based
activation, and we hence support dependencies between sockets and
services, we also support traditional inter-service
dependencies. We support multiple ways for such a service to
signal its readiness: by forking and having the start process
exit (i.e. traditional daemonize() behaviour), as well
as by watching the bus until a configured service name appears.

There’s an interactive mode which asks for confirmation each
time a process is spawned by systemd. You may enable it by
passing systemd.confirm_spawn=1 on the kernel command
line.

With the systemd.default= kernel command line
parameter you can specify which unit systemd should start on
boot-up. Normally you’d specify something like
multi-user.target here, but another choice could even
be a single service instead of a target, for example
out-of-the-box we ship a service emergency.service that
is similar in its usefulness to init=/bin/bash, however
has the advantage of actually running the init system, hence
offering the option to boot up the full system from the
emergency shell.

There’s a minimal UI that allows you to
start/stop/introspect services. It’s far from complete but
useful as a debugging tool. It’s written in Vala (yay!) and goes
by the name of systemadm.

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services installed; thankfully, most distributions
don’t carry many of those yet).

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot our normal development systems with
systemd (not just VMs). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starting services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events (i.e. “start this 5h after it last ran” as well as “start
this every Monday at 5 am”).

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem sets of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, you can already run systemd just fine
as a normal user; it will detect that it is being run that way, and
support for this mode has been available since the very beginning
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here’s
a tarball with unit configuration files
that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is only minimally modified. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.
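
For instance, a hypothetical invocation might look like this (the image
file name is whatever you saved the download as):

    qemu -hda systemd-f13.img -serial stdio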

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

We ask daemon writers not to fork or even double fork
in their processes, but run their event loop from the initial process
systemd starts for you. Also, don’t call setsid().

Don’t drop user privileges in the daemon itself, leave this
to systemd and configure it in systemd service configuration
files. (There are exceptions here. For example, for some daemons
there are good reasons to drop privileges inside the daemon
code, after an initialization phase that requires elevated
privileges.)

Don’t write PID files.

Grab a name on the bus.

You may rely on systemd for logging, you are welcome to log
whatever you need to log to stderr.

Let systemd create and watch sockets for you, so that socket
activation works. Hence, interpret $LISTEN_FDS and
$LISTEN_PID as described above.

Use SIGTERM for requesting shutdown from your daemon.
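
Put together, the rules above suggest a skeleton roughly like this
sketch (a hypothetical toy service; a real daemon would use a proper
event loop):

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    static volatile sig_atomic_t quit = 0;

    static void on_term(int sig) {
            quit = 1;
    }

    /* Sketch of a new-style daemon: no fork(), no setsid(), no PID
     * file; logs go to stderr; the listening socket comes from the
     * init system; SIGTERM requests shutdown. */
    int main(void) {
            struct sigaction sa;
            const char *n = getenv("LISTEN_FDS");
            int fd = 3;  /* first passed socket, as described above */

            if (!n || atoi(n) < 1) {
                    fprintf(stderr, "Not socket-activated, exiting.\n");
                    return 1;
            }

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = on_term;  /* deliberately no SA_RESTART, so a
                                       * pending accept() fails with EINTR */
            sigaction(SIGTERM, &sa, NULL);

            while (!quit) {
                    int c = accept(fd, NULL, NULL);
                    if (c < 0)
                            continue;
                    write(c, "hello\n", 6);  /* serve the request */
                    close(c);
            }
            fprintf(stderr, "Got SIGTERM, shutting down.\n");
            return 0;
    }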

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd also supports daemons not written in this style
perfectly well, for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?

Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (Formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.

Is this a Red Hat project?

No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.

Will this come to Fedora?

If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.

Will this come to OpenSUSE?

Kay’s pursuing that, so something similar as for Fedora applies here, too.

Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?

That’s up to them. We’d certainly welcome their interest, and help with the integration.

Why didn’t you just add this to Upstart, why did you invent something new?

Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.

If you love Apple launchd so much, why not adopt that?

launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.

Is this an NIH project?

Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.

Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!

Will this run on [insert non-Linux OS here]?

Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.

Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.

If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.

I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?

Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (Very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers and packagers, as well as folks interested
in writing documentation or contributing a logo.

Community

At this time we only have a source code
repository
and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our git repository has moved.