Tag Archives: embedded

What You Need to Know When Becoming a Free Software Hacker

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/hinter-den-kulissen.html

Earlier today I gave a presentation at the Technical University Berlin about
things you need to know, things you should expect and things you shouldn’t
expect when you are aspiring to become a successful Free Software Hacker.

I have put my slides up on Google Docs in case you are interested, either
because you are the target audience (i.e. a university student) or because you
need inspiration for a similar talk about the same topic.

The first two slides are in German, so skip over them. The
interesting bits are all in English. I hope it’s quite comprehensive (though of
course terse). Enjoy:

In case your feed reader/planet messes this up, here’s the non-embedded version.

Oh, and thanks to everybody who reviewed and suggested additions to the slides on Google+.

Plumbers Conference 2011

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/lpc2011.html

The Linux Plumbers Conference 2011 in Santa Rosa, CA, USA is coming
nearer (Sep. 7-9). Together with Kay Sievers I am running the Boot&Init
track, and together with Mark Brown the Audio track.

For both tracks we still need proposals. So if you haven’t submitted
anything yet, please consider doing so. And quickly: if you can
arrange for it, last Sunday would be best, since that was actually the final
deadline. However, the submission form is still open, so if you submit
something really, really quickly we’ll ignore the absence of time travel and the calendar for a bit. So, go,
submit something. Now.

What are we looking for? Well, here’s what I just posted on the
audio-related mailing lists:

So, please consider submitting something if you haven't done so yet. We
are looking for all kinds of technical talks covering everything audio
plumbing related: audio drivers, audio APIs, sound servers, pro audio,
consumer audio. If you can propose something audio related -- like talks
on media controller routing or on audio for ASoC/embedded -- submit
something! If you care for low-latency audio, submit something. If you
care about the Linux audio stack in general, submit something.

LPC is probably the most relevant technical conference on the general
Linux platform, so be sure that if you want your project, your work,
your ideas to be heard then this is the right forum for everything
related to the Linux stack. And the Audio track covers everything in our
Audio Stack, regardless whether it is pro or consumer audio.

And here’s what I posted to the init-related lists:

So, please consider submitting something if you haven't done so yet. We
are looking for all kinds of technical talks covering everything from
the BIOS (i.e. CoreBoot and friends), over boot loaders (i.e. GRUB and
friends), to initramfs (i.e. Dracut and friends) and init systems
(i.e. systemd and friends). If you have something smart to say about any
of these areas or maybe about related tools (i.e. you wrote a fancy new
tool to measure boot performance) or fancy boot schemes in your
favourite Linux based OS (i.e. the new Meego zero second boot ;-)) then
don't hesitate to submit something on the LPC web site, in the Boot&Init
track!

And now, quickly, go to the LPC website and post your session proposal
in the Audio or Boot&Init track! Thank you!

systemd for Developers I

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/socket-activation.html

systemd
not only brings improvements for administrators and users, it also
brings a (small) number of new APIs with it. In this blog story (which might
become the first of a series) I hope to shed some light on one of the
most important new APIs in systemd:

Socket Activation

In the original blog
story about systemd
I tried to explain why socket activation is a
wonderful technology to spawn services. Let’s reiterate the background
here a bit.

The basic idea of socket activation is not new. The inetd
superserver has been a standard component of most Linux and Unix systems
since time began: instead of spawning all local Internet services
already at boot, the superserver would listen on behalf of the
services and whenever a connection would come in an instance of the
respective service would be spawned. This allowed relatively weak
machines with few resources to offer a big variety of services at the
same time. However, it quickly got a reputation for being somewhat
slow: since daemons would be spawned for each incoming connection a
lot of time was spent on forking and initialization of the services
— once for each connection, instead of once for them all.

Spawning one instance per connection was how inetd was primarily
used, even though inetd actually understood another mode: on the first
incoming connection it would notice this via poll() (or
select()) and spawn a single instance for all future
connections. (This was controllable with the
wait/nowait options.) That way the first connection
would be slow to set up, but subsequent ones would be as fast as with
a standalone service. In this mode inetd would work in a true
on-demand mode: a service would be made available lazily when it was
required.
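
To make the two modes concrete, here is roughly what such entries looked like in a classic /etc/inetd.conf; the daemon paths and service choices here are illustrative, not taken from any particular system:

# service  type    proto  wait/nowait  user  server              arguments
# nowait: spawn a fresh instance for every incoming connection
ftp        stream  tcp    nowait       root  /usr/sbin/in.ftpd   in.ftpd
# wait: spawn a single instance and hand it the socket itself
talk       dgram   udp    wait         root  /usr/sbin/in.talkd  in.talkd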

inetd’s focus was clearly on AF_INET (i.e. Internet) sockets. As
time progressed and Linux/Unix left the server niche and became
increasingly relevant on desktops, mobile and embedded environments
inetd was somehow lost in the troubles of time. Its reputation for
being slow, and the fact that Linux’ focus shifted away from only
Internet servers made a Linux machine running inetd (or one of its newer
implementations, like xinetd) the exception, not the rule.

When Apple engineers worked on optimizing the MacOS boot time they
found a new way to make use of the idea of socket activation: they
shifted the focus away from AF_INET sockets towards AF_UNIX
sockets. And they noticed that on-demand socket activation was only
part of the story: much more powerful is socket activation when used
for all local services including those which need to be started
anyway on boot. They implemented these ideas in launchd, a central building
block of modern MacOS X systems, and probably the main reason why
MacOS boots up so fast.

But, before we continue, let’s have a closer look at what the benefits
of socket activation are, in detail, for non-on-demand, non-Internet
services. Consider the four services Syslog, D-Bus, Avahi and the
Bluetooth daemon. D-Bus logs to Syslog, hence on traditional Linux
systems it would get started after Syslog. Similarly, Avahi requires
Syslog and D-Bus, hence would get started after both. Finally
Bluetooth is similar to Avahi and also requires Syslog and D-Bus but
does not interface at all with Avahi. Since in a traditional
SysV-based system only one service can be in the process of getting
started at a time, the following serialization of startup would take
place: Syslog → D-Bus → Avahi → Bluetooth. (Of course, Avahi and
Bluetooth could be started in the opposite order too, but we have to
pick one here, so let’s simply go alphabetically.) To illustrate
this, here’s a plot showing the order of startup beginning with system
startup (at the top).

[Plot: start-up of Syslog, D-Bus, Avahi and Bluetooth: fully serialized (top), partially parallelized (middle), fully parallel with socket activation (bottom)]

Certain distributions tried to improve this strictly serialized
start-up: since Avahi and Bluetooth are independent from each other,
they can be started simultaneously. The parallelization is increased,
the overall startup time slightly smaller. (This is visualized in the
middle part of the plot.)

Socket activation makes it possible to start all four services
completely simultaneously, without any kind of ordering. Since the
creation of the listening sockets is moved outside of the daemons
themselves we can start them all at the same time, and they are able
to connect to each other’s sockets right away. I.e. in a single step
the /dev/log and /run/dbus/system_bus_socket sockets
are created, and in the next step all four services are spawned
simultaneously. When D-Bus then wants to log to syslog, it just writes
its messages to /dev/log. As long as the socket buffer does
not fill up, it can immediately go on with whatever else it wants to do
for initialization. As soon as the syslog service catches up it will
process the queued messages. And if the socket buffer does fill up, the
logging client will temporarily block until the socket is writable
again, and continue the moment it can write its log messages. That
means the scheduling of our services is entirely done by the kernel:
from the userspace perspective all services are run at the same time,
and when one service cannot keep up the others needing it will
temporarily block on their request but go on as soon as these
requests are dispatched. All of this is completely automatic and
invisible to userspace. Socket activation hence allows us to
drastically parallelize start-up, enabling simultaneous start-up of
services which previously were thought to strictly require
serialization. Most Linux services use sockets as communication
channel. Socket activation allows starting of clients and servers of
these channels at the same time.

But it’s not just about parallelization. It offers a number of
other benefits:

  • We no longer need to configure dependencies explicitly. Since the
    sockets are initialized before all services they are simply available,
    and no userspace ordering of service start-up needs to take place
    anymore. Socket activation hence drastically simplifies configuration
    and development of services.
  • If a service dies, its listening socket stays around and not a
    single message is lost. After a restart of the crashed service it can
    continue right where it left off.
  • If a service is upgraded we can restart the service while keeping
    around its sockets, thus ensuring the service is continuously
    responsive. Not a single connection is lost during the upgrade.
  • We can even replace a service during runtime in a way that is
    invisible to the client. For example, all systems running systemd
    start up with a tiny syslog daemon at boot which passes all log
    messages written to /dev/log on to the kernel message
    buffer. That way we provide reliable userspace logging starting from
    the first instant of boot-up. Then, when the actual rsyslog daemon is
    ready to start we terminate the mini daemon and replace it with the
    real daemon. And all that while keeping around the original logging
    socket and sharing it between the two daemons and not losing a single
    message. Since rsyslog flushes the kernel log buffer to disk after
    start-up all log messages from the kernel, from early-boot and from
    runtime end up on disk.

For another explanation of this idea consult the original blog
story about systemd
.

Socket activation has been available in systemd since its
inception. On Fedora 15 a number of services have been modified to
implement socket activation, including Avahi, D-Bus and rsyslog (to continue with the example above).

systemd’s socket activation is quite comprehensive. Not only classic
sockets are supported, but related technologies as well (a configuration
example follows the list):

  • AF_UNIX sockets, in the flavours SOCK_DGRAM, SOCK_STREAM and SOCK_SEQPACKET; both in the filesystem and in the abstract namespace
  • AF_INET sockets, i.e. TCP/IP and UDP/IP; both IPv4 and IPv6
  • Unix named pipes/FIFOs in the filesystem
  • AF_NETLINK sockets, to subscribe to certain kernel features. This
    is currently used by udev, but could be useful for other
    netlink-related services too, such as audit.
  • Certain special files like /proc/kmsg or device nodes like /dev/input/*.
  • POSIX Message Queues
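
To give an idea of how these map to socket unit directives, here is a hypothetical .socket file exercising a few of the supported types; the names and paths are made up for this example, and systemd.socket(5) has the full list of directives:

[Socket]
# AF_UNIX stream socket in the file system
ListenStream=/run/example.sk
# TCP socket (a plain port number means an AF_INET/AF_INET6 stream socket)
ListenStream=2222
# UDP socket
ListenDatagram=6666
# named pipe/FIFO
ListenFIFO=/run/example.fifo
# POSIX message queue
ListenMessageQueue=/example-queue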

A service capable of socket activation must be able to receive its
preinitialized sockets from systemd, instead of creating them
internally. For most services this requires (minimal)
patching. However, since systemd actually provides inetd compatibility
a service working with inetd will also work with systemd — which is
quite useful for services like sshd for example.
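
The inetd-style per-connection mode is requested with Accept=yes in the socket unit: systemd then spawns one instance of a service template per connection and passes the connection socket as standard input, just like inetd did. Here is a hypothetical pair of units for sshd in this mode (whether your distribution ships units like these is a separate question):

# sshd.socket
[Socket]
ListenStream=22
Accept=yes

[Install]
WantedBy=sockets.target

# sshd@.service
[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket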

So much for the background of socket activation; let’s now have a
look at how to patch a service to make it socket-activatable. Let’s start
with a theoretical service foobard. (In a later blog post we’ll focus on a
real-life example.)

Our little (theoretic) service includes code like the following for
creating sockets (most services include code like this in one way or
another):

/* Source Code Example #1: ORIGINAL, NOT SOCKET-ACTIVATABLE SERVICE */
...
union {
        struct sockaddr sa;
        struct sockaddr_un un;
} sa;
int fd;

fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
        fprintf(stderr, "socket(): %m\n");
        exit(1);
}

memset(&sa, 0, sizeof(sa));
sa.un.sun_family = AF_UNIX;
strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
        fprintf(stderr, "bind(): %m\n");
        exit(1);
}

if (listen(fd, SOMAXCONN) < 0) {
        fprintf(stderr, "listen(): %m\n");
        exit(1);
}
...

A socket activatable service may use the following code instead:

/* Source Code Example #2: UPDATED, SOCKET-ACTIVATABLE SERVICE */
...
#include "sd-daemon.h"
...
int fd;

if (sd_listen_fds(0) != 1) {
        fprintf(stderr, "No or too many file descriptors received.\n");
        exit(1);
}

fd = SD_LISTEN_FDS_START + 0;
...

systemd might pass you more than one socket (based on
configuration, see below). In this example we are interested in one
only. sd_listen_fds()
returns how many file descriptors are passed. We simply compare that
with 1, and fail if we got more or less. The file descriptors systemd
passes to us are inherited one after the other beginning with fd
#3. (SD_LISTEN_FDS_START is a macro defined to 3.) Our code hence just
takes possession of fd #3.

As you can see, this code is actually much shorter than the
original. This of course comes at a price: with this change our little
service will no longer work in a non-socket-activation environment. With
minimal changes we can adapt our example to work nicely
both with and without socket activation:

/* Source Code Example #3: UPDATED, SOCKET-ACTIVATABLE SERVICE WITH COMPATIBILITY */
...
#include "sd-daemon.h"
...
int fd, n;

n = sd_listen_fds(0);
if (n > 1) {
        fprintf(stderr, "Too many file descriptors received.\n");
        exit(1);
} else if (n == 1)
        fd = SD_LISTEN_FDS_START + 0;
else {
        union {
                struct sockaddr sa;
                struct sockaddr_un un;
        } sa;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
                fprintf(stderr, "socket(): %m\n");
                exit(1);
        }

        memset(&sa, 0, sizeof(sa));
        sa.un.sun_family = AF_UNIX;
        strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                fprintf(stderr, "bind(): %m\n");
                exit(1);
        }

        if (listen(fd, SOMAXCONN) < 0) {
                fprintf(stderr, "listen(): %m\n");
                exit(1);
        }
}
...

With this simple change our service can now make use of socket
activation but still works unmodified in classic environments. Now,
let’s see how we can enable this service in systemd. For this we have
to write two systemd unit files: one describing the socket, the other
describing the service. First, here’s foobar.socket:

[Socket]
ListenStream=/run/foobar.sk

[Install]
WantedBy=sockets.target

And here’s the matching service file foobar.service:

[Service]
ExecStart=/usr/bin/foobard

If we place these two files in /etc/systemd/system we can
enable and start them:

# systemctl enable foobar.socket
# systemctl start foobar.socket

Now our little socket is listening, but our service is not running
yet. If we now connect to /run/foobar.sk the service will be
automatically spawned, for on-demand service start-up. With a
modification of foobar.service we can start our service
already at startup, thus using socket activation only for
parallelization purposes, not for on-demand auto-spawning anymore:

[Service]
ExecStart=/usr/bin/foobard

[Install]
WantedBy=multi-user.target

And now let’s enable this too:

# systemctl enable foobar.service
# systemctl start foobar.service

Now our little daemon will be started at boot or on demand,
whichever comes first. It can be started fully in parallel with its
clients, and when it dies it will be automatically restarted when it
is used the next time.

A single .socket file can include multiple ListenXXX stanzas, which
is useful for services that listen on more than one socket. In this
case all configured sockets will be passed to the service in the exact
order they are configured in the socket unit file. Also,
you may configure various socket settings in the .socket
files.
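
For example, here is a hypothetical socket unit passing two sockets, which the service would then receive as fd #3 and fd #4 in exactly this order (the paths are invented for illustration):

[Socket]
ListenStream=/run/foobar.sk
ListenDatagram=/run/foobar-notify.sk

[Install]
WantedBy=sockets.target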

In real life it’s a good idea to include description strings in
these unit files; to keep things simple we’ll leave them out of our
example. Speaking of real life: our next installment will cover an
actual real-life example. We’ll add socket activation to the CUPS
printing server.

The sd_listen_fds() function call is defined in sd-daemon.h
and sd-daemon.c. These
two files are currently drop-in .c sources which projects should
simply copy into their source tree. Eventually we plan to turn this
into a proper shared library, however using the drop-in files allows
you to compile your project in a way that is compatible with socket
activation even without any compile time dependencies on
systemd. sd-daemon.c is liberally licensed, should compile
fine on the most exotic Unixes and the algorithms are trivial enough
to be reimplemented with very little code if the license should
nonetheless be a problem for your project. sd-daemon.c
contains a couple of other API functions besides
sd_listen_fds() that are useful when implementing socket
activation in a project. For example, there’s sd_is_socket()
which can be used to distinguish and identify particular sockets when
a service gets passed more than one.
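
Here is a minimal sketch of how sd_is_socket() could be used to tell apart the two sockets from the hypothetical two-socket unit shown earlier (a fragment only, error handling omitted):

/* Sketch: identifying multiple passed sockets */
...
#include <sys/socket.h>
#include "sd-daemon.h"
...
int i, n, stream_fd = -1, dgram_fd = -1;

n = sd_listen_fds(0);
for (i = 0; i < n; i++) {
        int fd = SD_LISTEN_FDS_START + i;

        /* a listening AF_UNIX stream socket? (last argument: 1 = must be listening) */
        if (sd_is_socket(fd, AF_UNIX, SOCK_STREAM, 1) > 0)
                stream_fd = fd;
        /* an AF_UNIX datagram socket? (last argument: -1 = don't care whether listening) */
        else if (sd_is_socket(fd, AF_UNIX, SOCK_DGRAM, -1) > 0)
                dgram_fd = fd;
}
...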

Let me point out that the interfaces used here are in no way bound
directly to systemd. They are generic enough to be implemented in
other systems as well. We deliberately designed them to be as simple
and minimal as possible to make it easy for others to adopt similar
schemes.

Stay tuned for the next installment. As mentioned, it will cover a
real-life example of turning an existing daemon into a
socket-activatable one: the CUPS printing service. However, I hope
this blog story might already be enough to get you started if you plan
to convert an existing service into a socket activatable one. We
invite everybody to convert upstream projects to this scheme. If you
have any questions join us on #systemd on freenode.

Why systemd?

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/why.html

systemd is
still a young project, but it is not a baby anymore. I posted the
initial announcement precisely a year ago. Since then most of the
big distributions have decided to adopt it in one way or another, many
smaller distributions have already switched. The first big
distribution with systemd by default will be Fedora 15, due end of
May. It is expected that the others will follow the lead a bit later
(with one exception). Many embedded developers have already adopted it
too, and there’s even a company specializing in engineering and
consulting services for systemd. In short: within one year
systemd became a really successful project.

However, there are still folks we haven’t won over yet. If you
fall into one of the following categories, then please have a look at
the comparison of init systems below:

  • You are working on an embedded project and are wondering whether
    it should be based on systemd.
  • You are a user or administrator and wondering which distribution
    to pick, and are pondering whether it should be based on systemd or
    not.
  • You are a user or administrator and wondering why your favourite
    distribution has switched to systemd, if everything already worked so
    well before.
  • You are developing a distribution that hasn’t switched yet, and
    you are wondering whether to invest the work and go systemd.

And even if you don’t fall into any of these categories, you might still
find the comparison interesting.

We’ll be comparing the three most relevant init systems for Linux:
sysvinit, Upstart and systemd. Of course there are other init systems
in existence, but they play virtually no role in the big
picture. Unless you run Android (which is a completely different beast
anyway), you’ll almost definitely run one of these three init systems
on your Linux kernel. (OK, or busybox, but then you are basically not
running any init system at all.) Unless you have a soft spot for
exotic init systems there’s little need to look further. Also, I am
kinda lazy, and don’t want to spend the time on analyzing those other
systems in enough detail to be completely fair to them.

Speaking of fairness: I am of course one of the creators of
systemd. I will try my best to be fair to the other two contenders,
but in the end, take it with a grain of salt. I am sure though that
should I be grossly unfair or otherwise incorrect somebody will point
it out in the comments of this story, so consider having a look at
those before you put too much trust in what I say.

We’ll look at the currently implemented features in a released
version. Grand plans don’t count.

General Features

Feature | sysvinit | Upstart | systemd
Interfacing via D-Bus | no | yes | yes
Shell-free bootup | no | no | yes
Modular C coded early boot services included | no | no | yes
Read-Ahead | no | no[1] | yes
Socket-based Activation | no | no[2] | yes
Socket-based Activation: inetd compatibility | no | no[2] | yes
Bus-based Activation | no | no[3] | yes
Device-based Activation | no | no[4] | yes
Configuration of device dependencies with udev rules | no | no | yes
Path-based Activation (inotify) | no | no | yes
Timer-based Activation | no | no | yes
Mount handling | no | no[5] | yes
fsck handling | no | no[5] | yes
Quota handling | no | no | yes
Automount handling | no | no | yes
Swap handling | no | no | yes
Snapshotting of system state | no | no | yes
XDG_RUNTIME_DIR Support | no | no | yes
Optionally kills remaining processes of users logging out | no | no | yes
Linux Control Groups Integration | no | no | yes
Audit record generation for started services | no | no | yes
SELinux integration | no | no | yes
PAM integration | no | no | yes
Encrypted hard disk handling (LUKS) | no | no | yes
SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agents | no | no | yes
Network Loopback device handling | no | no | yes
binfmt_misc handling | no | no | yes
System-wide locale handling | no | no | yes
Console and keyboard setup | no | no | yes
Infrastructure for creating, removing, cleaning up of temporary and volatile files | no | no | yes
Handling for /proc/sys sysctl | no | no | yes
Plymouth integration | no | yes | yes
Save/restore random seed | no | no | yes
Static loading of kernel modules | no | no | yes
Automatic serial console handling | no | no | yes
Unique Machine ID handling | no | no | yes
Dynamic host name and machine meta data handling | no | no | yes
Reliable termination of services | no | no | yes
Early boot /dev/log logging | no | no | yes
Minimal kmsg-based syslog daemon for embedded use | no | no | yes
Respawning on service crash without losing connectivity | no | no | yes
Gapless service upgrades | no | no | yes
Graphical UI | no | no | yes
Built-In Profiling and Tools | no | no | yes
Instantiated services | no | yes | yes
PolicyKit integration | no | no | yes
Remote access/Cluster support built into client tools | no | no | yes
Can list all processes of a service | no | no | yes
Can identify service of a process | no | no | yes
Automatic per-service CPU cgroups to even out CPU usage between them | no | no | yes
Automatic per-user cgroups | no | no | yes
SysV compatibility | yes | yes | yes
SysV services controllable like native services | yes | no | yes
SysV-compatible /dev/initctl | yes | no | yes
Reexecution with full serialization of state | yes | no | yes
Interactive boot-up | no[6] | no[6] | yes
Container support (as advanced chroot() replacement) | no | no | yes
Dependency-based bootup | no[7] | no | yes
Disabling of services without editing files | yes | no | yes
Masking of services without editing files | no | no | yes
Robust system shutdown within PID 1 | no | no | yes
Built-in kexec support | no | no | yes
Dynamic service generation | no | no | yes
Upstream support in various other OS components | yes | no | yes
Service files compatible between distributions | no | no | yes
Signal delivery to services | no | no | yes
Reliable termination of user sessions before shutdown | no | no | yes
utmp/wtmp support | yes | yes | yes
Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management tools | no | no | yes

[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.

[2] Socket activation implementation for Upstart available as preview, lacks parallelization support and hence entirely misses the point of socket activation.

[3] Bus activation implementation for Upstart posted as patch, not merged.

[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.

[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.

[6] Some distributions offer this implemented in shell.

[7] LSB init scripts support this, if they are used.

Available Native Service Settings

Setting | sysvinit | Upstart | systemd
OOM Adjustment | no | yes[1] | yes
Working Directory | no | yes | yes
Root Directory (chroot()) | no | yes | yes
Environment Variables | no | yes | yes
Environment Variables from external file | no | no | yes
Resource Limits | no | some[2] | yes
umask | no | yes | yes
User/Group/Supplementary Groups | no | no | yes
IO Scheduling Class/Priority | no | no | yes
CPU Scheduling Nice Value | no | yes | yes
CPU Scheduling Policy/Priority | no | no | yes
CPU Scheduling Reset on fork() control | no | no | yes
CPU affinity | no | no | yes
Timer Slack | no | no | yes
Capabilities Control | no | no | yes
Secure Bits Control | no | no | yes
Control Group Control | no | no | yes
High-level file system namespace control: making directories inaccessible | no | no | yes
High-level file system namespace control: making directories read-only | no | no | yes
High-level file system namespace control: private /tmp | no | no | yes
High-level file system namespace control: mount inheritance | no | no | yes
Input on Console | yes | yes | yes
Output on Syslog | no | no | yes
Output on kmsg/dmesg | no | no | yes
Output on arbitrary TTY | no | no | yes
Kill signal control | no | no | yes
Conditional execution: by identified CPU virtualization/container | no | no | yes
Conditional execution: by file existence | no | no | yes
Conditional execution: by security framework | no | no | yes
Conditional execution: by kernel command line | no | no | yes

[1] Upstart supports only the deprecated oom_adj mechanism, not the current oom_score_adj logic.

[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.

Note that some of these options are relatively easily added to SysV
init scripts, by editing the shell sources. The table above focusses
on easily accessible options that do not require source code
editing.

Miscellaneous

Item | sysvinit | Upstart | systemd
Maturity | > 15 years | 6 years | 1 year
Specialized professional consulting and engineering services available | no | no | yes
SCM | Subversion | Bazaar | git
Copyright-assignment-free contributing | yes | no | yes

Summary

As the tables above hopefully show in all clarity, systemd
has left behind both sysvinit and Upstart in almost every
aspect. With the exception of the project’s age/maturity, systemd wins
in every category. At this point in time it will be very hard for
sysvinit and Upstart to catch up with the features systemd provides
today. In one year we managed to push systemd forward much further
than Upstart has been pushed in six.

It is our intention to drive forward the development of the Linux
platform with systemd. In the next release cycle we will focus more
strongly on bringing the same features and speed improvements we
already offer for the system to the user login session. This will
bring much closer integration with the other parts of the OS and
applications, making the most of the features the service manager
provides, and making it available to login sessions. Certain
components such as ConsoleKit will be made redundant by these
upgrades, and services relying on them will be updated. The
burden of maintaining these then-obsolete components
will be passed on to the vendors who plan to continue to rely on
them.

If you are wondering whether or not to adopt systemd, then systemd
obviously wins when it comes to mere features. Of course that should
not be the only aspect to keep in mind. In the long run, sticking with
the existing infrastructure (such as ConsoleKit) comes at a price:
porting work needs to take place, and additional maintenance work for
bitrotting code needs to be done. Going it alone means an increased
workload.

That said, adopting systemd is also not free. Especially if you
made investments in the other two solutions adopting systemd means
work. The basic work to adopt systemd is relatively minimal for
porting over SysV systems (since compatibility is provided), but can
mean substantial work when coming from Upstart. If you plan to go for
a 100% systemd system without any SysV compatibility (recommended for
embedded, long run goal for the big distributions) you need to be
willing to invest some work to rewrite init scripts as simple systemd
unit files.

systemd is in the process of becoming a comprehensive, integrated
and modular platform providing everything needed to bootstrap and
maintain an operating system’s userspace. It includes C rewrites of
all basic early boot init scripts that are shipped with the various
distributions. Especially for the embedded case adopting systemd
provides you in one step with almost everything you need, and you can
pick the modules you want. The other two init systems are singular
individual components, which to be useful need a great number of
additional components with differing interfaces. The emphasis of
systemd on providing a platform instead of just a component allows for
closer integration and cleaner APIs. Sooner or later this will
trickle up to the applications. Already, there are accepted XDG
specifications (e.g. XDG basedir spec, more specifically
XDG_RUNTIME_DIR) that are not supported on the other init systems.

systemd is also a big opportunity for Linux standardization. Since
it standardizes many interfaces of the system that previously have
been differing on every distribution, on every implementation,
adopting it helps to work against the balkanization of the Linux
interfaces. Choosing systemd means redefining more closely
what the Linux platform is about. This improves the lives of
programmers, users and administrators alike.

I believe that momentum is clearly with systemd. We invite you to
join our community and be part of that momentum.

systemd for Administrators, Part VIII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/the-new-configuration-files.html

Another episode of my ongoing series on systemd for Administrators:

The New Configuration Files

One of the formidable new features of systemd is
that it comes with a complete set of modular early-boot services that are
written in simple, fast, parallelizable and robust C, replacing the
shell “novels” the various distributions featured before. Our little
Project Zero Shell[1] has been a full success. We currently
cover pretty much everything most desktop and embedded
distributions should need, plus a big part of the server needs:

  • Checking and mounting of all file systems
  • Updating and enabling quota on all file systems
  • Setting the host name
  • Configuring the loopback network device
  • Loading the SELinux policy and relabelling /run and /dev as necessary on boot
  • Registering additional binary formats in the kernel, such as Java, Mono and WINE binaries
  • Setting the system locale
  • Setting up the console font and keyboard map
  • Creating, removing and cleaning up of temporary and volatile files and directories
  • Applying mount options from /etc/fstab to pre-mounted API VFS
  • Applying sysctl kernel settings
  • Collecting and replaying readahead information
  • Updating utmp boot and shutdown records
  • Loading and saving the random seed
  • Statically loading specific kernel modules
  • Setting up encrypted hard disks and partitions
  • Spawning automatic gettys on serial kernel consoles
  • Maintenance of Plymouth
  • Machine ID maintenance
  • Setting of the UTC distance for the system clock

On a standard Fedora 15 install, only a few legacy and storage
services still require shell scripts during early boot. If you don’t
need those, you can easily disable them and enjoy your shell-free boot
(like I do every day). The shell-less boot systemd offers you is a
unique feature on Linux.

Many of these small components are configured via configuration
files in /etc. Some of these are fairly standardized among
distributions and hence supporting them in the C implementations was
easy and obvious. Examples include: /etc/fstab,
/etc/crypttab or /etc/sysctl.conf. However, for
others no standardized file or directory existed which forced us to add
#ifdef orgies to our sources to deal with the different
places the distributions we want to support store these things. All
these configuration files have in common that they are dead-simple and
there is simply no good reason for distributions to distinguish
themselves with them: they all do the very same thing, just
a bit differently.

To improve the situation and benefit from the unifying force that
systemd is, we decided to read the per-distribution configuration
files only as fallbacks — and to introduce new configuration
files as the primary source of configuration wherever applicable. Of
course, where possible these standardized configuration files should
not be new inventions but rather just standardizations of the best
distribution-specific configuration files previously used. Here’s a
little overview of these new common configuration files systemd
supports on all distributions:

  • /etc/hostname:
    the host name for the system. One of the most basic and trivial
    system settings. Nonetheless previously all distributions used
    different files for this. Fedora used /etc/sysconfig/network,
    OpenSUSE /etc/HOSTNAME. We chose to standardize on the
    Debian configuration file /etc/hostname.
  • /etc/vconsole.conf:
    configuration of the default keyboard mapping and console font.
  • /etc/locale.conf:
    configuration of the system-wide locale.
  • /etc/modules-load.d/*.conf:
    a drop-in directory for kernel modules to statically load at
    boot (for the very few that still need this).
  • /etc/sysctl.d/*.conf:
    a drop-in directory for kernel sysctl parameters, extending what you
    can already do with /etc/sysctl.conf.
  • /etc/tmpfiles.d/*.conf:
    a drop-in directory for configuration of runtime files that need to be
    removed/created/cleaned up at boot and during uptime.
  • /etc/binfmt.d/*.conf:
    a drop-in directory for registration of additional binary formats for
    systems like Java, Mono and WINE.
  • /etc/os-release:
    a standardization of the various distribution ID files like
    /etc/fedora-release and similar. Really every distribution
    introduced their own file here; writing a simple tool that just prints
    out the name of the local distribution usually means including a
    database of release files to check. The LSB tried to standardize
    something like this with the lsb_release
    tool, but quite frankly the idea of employing a shell script in this
    is not the best choice the LSB folks ever made. To rectify this we
    just decided to generalize this, so that everybody can use the same
    file here (see the example after this list).
  • /etc/machine-id:
    a machine ID file, superseding D-Bus’ machine ID file. This file is
    guaranteed to exist and be valid on a systemd system, also covering
    stateless boots. By moving this out of the D-Bus logic it is hopefully
    interesting for a lot of additional uses as a unique and stable
    machine identifier.
  • /etc/machine-info:
    a new information file encoding meta data about a host, like a pretty
    host name and an icon name, replacing stuff like
    /etc/favicon.png and suchlike. This is maintained by systemd-hostnamed.
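
To make this more concrete, here is what /etc/os-release might plausibly look like on a Fedora 15 system; the exact set of fields a distribution ships may differ:

NAME=Fedora
VERSION="15 (Lovelock)"
ID=fedora
VERSION_ID=15
PRETTY_NAME="Fedora 15 (Lovelock)"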

It is our definite intention to convince you to use these new
configuration files in your configuration tools: if your
configuration frontend writes these files instead of the old ones, it
automatically becomes more portable between Linux distributions, and
you are helping to standardize Linux. This makes things simpler to
understand and more obvious for users and administrators. Of course,
right now, only systemd-based distributions read these files, but that
already covers all important distributions in one way or another, except for one. And it’s a bit of a
chicken-and-egg problem: a standard becomes a standard by being
used. In order to gently push everybody to standardize on these files
we also want to make clear that sooner or later we plan to drop the
fallback support for the old configuration files from
systemd. That means adoption of this new scheme can happen slowly and piece
by piece. But the final goal of only having one set of configuration
files must be clear.

Many of these configuration files are relevant not only for
configuration tools but also (and sometimes even primarily) in
upstream projects. For example, we invite projects like Mono, Java, or
WINE to install a drop-in file in /etc/binfmt.d/ from their
upstream build systems. Per-distribution downstream support for binary
formats would then no longer be necessary and your platform would work
the same on all distributions. Something similar applies to all
software which needs creation/cleaning of certain runtime files and
directories at boot, for example beneath the /run hierarchy
(i.e. /var/run as it used to be known). These
projects should just drop in configuration files in
/etc/tmpfiles.d, also from the upstream build systems. This
also helps speed up the boot process, as separate per-project SysV
shell scripts which implement trivial things like registering a binary
format or removing/creating temporary/volatile files at boot are no
longer necessary. Or another example, where upstream support would be
fantastic: projects like X11 could probably benefit from reading the
default keyboard mapping for its displays from
/etc/vconsole.conf.
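
As an illustration, a hypothetical upstream project "myapp" could ship drop-ins along these lines from its build system; the names and the magic string are invented for this example:

# /etc/tmpfiles.d/myapp.conf: create a runtime directory at boot and
# clean up files in it that are older than 10 days
d /run/myapp 0755 root root 10d

# /etc/binfmt.d/myapp.conf: register an interpreter for files
# starting with the magic bytes "MYAP"
:myapp:M::MYAP::/usr/bin/myapp-interpreter: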

Of course, I have no doubt that not everybody is happy with our
choice of names (and formats) for these configuration files. In the
end we had to pick something, and from all the choices these appeared
to be the most convincing. The file formats are as simple as they can
be, and usually easily written and read even from shell scripts. That
said, /etc/bikeshed.conf could of course also have been a
fantastic configuration file name!

So, help us standardize Linux! Use the new configuration files!
Adopt them upstream, adopt them downstream, adopt them all across the
distributions!

Oh, and in case you are wondering: yes, all of these files were
discussed in one way or another with various folks from the various
distributions. And there has even been some push towards supporting
some of these files even outside of systemd systems.

Footnotes

[1] Our slogan: “The only shell that should get started
during boot is gnome-shell!
” — Yes, the slogan needs a bit of
work, but you get the idea.

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-2.html

It has been a while since my last status update on
systemd. Here’s another short,
non-comprehensive status update on what we worked on for systemd since then.

  • Fedora F15 (Rawhide) now includes a split up
    /etc/init.d/rc.sysinit (Bill Nottingham). This allows us to keep only
    a minimal compatibility set of shell scripts around, and boot otherwise a
    system without any shell scripts at all. In fact, shell scripts during early
    boot are only used in exceptional cases, i.e. when you enabled autoswapping
    (bad idea anyway), when a full SELinux relabel is necessary, during the first
    boot after initialization, if you have static kernel modules to load (which are
    not configured via the systemd-native way to do that), if you boot from a
    read-only NFS server, or when you rely on LVM/RAID/Multipath. If none of
    this applies to you, you can easily disable these parts of early boot and
    save several seconds on boot. How to do this I will describe in a later blog
    story.
  • We have a fully C coded shutdown logic that kills all remaining processes,
    unmounts all remaining file systems, detaches all loop devices and DM volumes
    and does that in the right way to ensure that all these things are properly
    torn down even if they depend on each other in arbitrary ways. This is not
    only considerably faster than the traditional shell hackery for this, but also
    a lot safer, since we try to unmount/remount the remaining file systems with a
    little bit of brains. This feature is available to the administrator via
    systemctl --force poweroff. The --force controls whether the
    usual shutdown of all services is run or whether it is skipped and we
    immediately enter this final C shutdown logic. Using --force
    hence is a much safer replacement for the old /sbin/reboot -f and does
    not leave dirty file systems behind. (Thanks to Fabiano Fidencio and his
    colleagues from ProFUSION for this.)
  • systemd now includes a minimalistic readahead implementation, based on
    fanotify(), fadvise() and mincore(). It supports btrfs defragmentation and both
    SSD and HDD disks. While the effect on boots that are anyway fast (such as most
    stuff involving SSD) is minimal, slower and older machines benefit from this
    more substantially.
  • We now control fsck and quota during early boot with a C tool that ensures
    maximum parallelization but properly implements the necessary high-level
    administration logic.
  • Every service, every user and every user session now gets its own cgroup in
    the ‘cpu’ hierarchy thus creating better fairness between the logged in users
    and their sessions.
  • We now provide /dev/log logging from early boot to late shutdown.
    If no syslog daemon is running the output is passed on to kmsg. As soon as a
    proper syslog daemon starts up the kmsg buffer is flushed to syslog, and hence
    we will have complete log coverage in syslog even for early boot.
  • systemctl kill was introduced, an easy command to send a signal to
    all processes of a service. Expect a blog story with more details about this
    shortly.
  • systemd gained the ability to load the SELinux policy if necessary, thus
    supporting non-initrd boots and initrd boots from the same binary with no
    duplicate work. This is in fact (and surprisingly) a first among Linux init
    systems.
  • We now initialize and set the system locale inside PID 1 to be inherited by
    all services and users.
  • systemd has native support for /etc/crypttab and can activate
    encrypted LUKS/dm-crypt disks both at boot-up and during runtime. A minimal
    password querying infrastructure is available, where multiple agents can be
    used to present the password to the user. During boot the password is queried
    either via Plymouth or directly on the console. If a system crypto disk is
    plugged in after boot you are queried for the password via a GNOME agent, or a
    wall(1) agent. Finally, while you run systemctl start (or a similar
    command) a minimal TTY password agent is available which asks you for passwords
    right away if this is necessary. The password querying logic is very simple;
    additional agents can be implemented in a trivial amount of code (Yup, KDE folks, you
    can add an agent for this, too). Note that the password querying logic in
    systemd is only for non-user passwords, i.e. passwords that have no relation to
    a specific user, but rather to specific hardware or system software. In future
    we hope to extend this so that this can be used to query the password of SSL
    certificates when Apache or other servers start.
  • We offer a minimal interface that external projects can use to extend the
    dependency graph systemd manages. In fact, the cryptsetup logic mentioned above
    is implemented via this ‘plugin’-like system. Since we did not want to add code
    that deals with cryptographic disks into the systemd process itself we
    introduced this interface (after all cryptographic volumes are not an essential
    feature of a minimal OS, and unnecessary on most embedded uses; also the future
    might bring us STC which might make this at least partially obsolete). Simply
    by dropping a generator binary into
    /lib/systemd/system-generators which should write out systemd unit
    files into a temporary directory third-party packages may extend the systemd
    dependency tree dynamically. This could be useful for example to automatically
    create a systemd service for each KVM machine or LXC container. With that in
    place those containers/machines could be managed and supervised with the same
    tools as the usual system services. (A minimal generator sketch follows
    after this list.)
  • We integrated automatic clean-up of directories such as /tmp into
    the tmpfiles logic we already had in place that recreates files and
    directories on volatile file systems such as /var/run,
    /var/lock or /tmp.
  • We now always measure the system startup time and write it to the log
    files, broken up into how much time was spent in the kernel, in the initrd
    and in the initialization of userspace.
  • We now safely destroy all user sessions before going down. This is a feature
    long missing on Linux: since user processes were not killed until the very last
    moment, the unhealthy situation that user code was still running at a time when
    no other daemon remained was a normal part of shutdown.
  • systemd now understands an ‘extreme’ form of disabling a service: if you
    symlink a service name in /etc/systemd/system to /dev/null
    then systemd will mark it as masked and completely refuse starting it,
    regardless of whether this is requested manually or automatically. Normally it should
    be sufficient to simply call systemctl disable to disable a service
    which still allows manual activation but no automatic activation. Masking a
    service goes one step further.
  • There’s now a simple condition syntax in place which allows
    skipping or enabling units depending on the existence of a file, whether a
    directory is empty or whether a kernel command line option is set.
  • In addition to normal shutdowns for reboot, halt or poweroff we now
    similarly support a kexec reboot, that reboots the machine without going through
    the BIOS code again.
  • We have bash completion support for systemctl. (Ran Benita)
  • Andrew Edmunds contributed basic support to boot Ubuntu with systemd.
  • Michael Biebl and Tollef Fog Heen have worked on the systemd integration
    into Debian to a level that it is now possible to boot a system without having
    the old initscripts packaged installed. For more details see the Debian Wiki. Michael even
    tested this integration on an Ubuntu Natty system and as it turns out this
    works almost equally well on Ubuntu already. If you are interested in playing
    around with this, ping Michael.
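
Here is the minimal generator sketch promised above. Generators dropped into /lib/systemd/system-generators are passed the directory to write unit files into as their first argument (newer systemd versions pass several directories); everything specific below — the my-kvm name, paths, and config layout — is invented purely for illustration:

#!/bin/sh
# /lib/systemd/system-generators/my-kvm-generator (hypothetical)
# $1 is the directory systemd wants generated unit files written into.
dest="$1"

# Generate one service per hypothetical VM configuration file.
for f in /etc/my-kvm/*.conf; do
        [ -e "$f" ] || continue
        name=$(basename "$f" .conf)
        cat > "$dest/my-kvm-$name.service" <<EOF
[Unit]
Description=KVM machine $name (generated)

[Service]
ExecStart=/usr/bin/my-kvm-run $f
EOF
done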

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller, so I will spare you the details.

We have come quite far in the last year. systemd is about a year old now,
and we are now able to boot a system without legacy shell scripts remaining,
something that appeared to be a task for the distant future.

All of this is available in systemd 13 and in F15/Rawhide as I type
this. If you want to play around with this then consider installing Rawhide
(it’s fun!).

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update.html

It has been a while since my original announcement of
systemd. Here’s a little status update on what
happened since then. For simplicity’s sake I’ll just list here what we
worked on in a bulleted list, in no particular order and without
trying to cover this comprehensively:

  • systemd has been accepted as Feature for Fedora 14, and as it
    looks right now everything worked out nicely and we’ll ship F14 with
    systemd as init system.
  • We added a number of additional unit types: .timer for
    cron-style timer-based activation of services, .swap exposes
    swap files and partitions the same way we handle mount points, and
    .path can be used to activate units depending on the
    existence/creation of files or the fill status of spool directories.
  • We hooked systemd up to SELinux: systemd is now capable of
    properly labelling directories, sockets and FIFOs it creates according
    to the SELinux policy for the services we maintain.
  • We hooked systemd up to the Linux auditing subsystem: as the first
    init system ever, systemd now generates audit records for all
    services it starts/stops, including their failure status.
  • We hooked systemd up to TCP wrappers, for all socket connections
    it accepts.
  • We hooked systemd up to PAM, so that optionally, when systemd runs
    a service as a different user it initializes the usual PAM session
    setup and teardown hooks.
  • We hooked systemd up to D-Bus, so that D-Bus passes activation
    requests to systemd and systemd becomes the central point for all
    kinds of activation, thus greatly extending the control of the
    execution environment of bus activated services, and making them
    accessible through the same utilities as SysV services. Also, this
    enables us to do race-free parallelized start-up for D-Bus services
    and their clients, thus speeding up things even further.
  • systemd is now able to handle various Debian and OpenSUSE-specific
    extensions to the classic SysV init script formats natively, on top of
    the Fedora extensions we already parse.
  • The D-Bus coverage of the systemd interface is now complete,
    allowing both introspection of runtime data and of parsed
    configuration data. It’s fun now to introspect systemd with gdbus
    or d-feet.
  • We added a systemd PAM module, which assigns the processes of each
    user session to its own cgroup in the systemd cgroup tree. This also enables reliable
    killing of all processes associated with a session when the user logs
    out. This also manages a secure per-user /var/run-style directory
    which is supposed to be used for sockets and similar files that shall
    be cleaned up when the user logs out.
  • There’s a new tool systemd-cgls,
    which plots a pretty process tree based on the systemd cgroup
    hierarchy. It’s really pretty. Try it!
  • We now have our own cgroup hierarchy beneath
    /cgroup/systemd (though it will move to /sys/fs/
    before the F14 release).
  • We have pretty code that automatically spawns a getty on a serial
    port when the kernel console is redirected to a serial TTY.
  • systemctl got beefed up substantially (it can even draw
    dependency graphs now, via dot!), and the SysV compatibility
    tools were extended to more completely and correctly support what was
    historically provided by SysV. For example, we’ll now warn the user
    when systemd service files have changed but systemd was not asked to
    reload its configuration. Also, you can now use systemd’s native
    client tools to reboot or shut-down an Upstart or sysvinit system, to
    facilitate upgrades.
  • We provide a reference
    implementation
    for the socket activation and other APIs for nicer
    interaction with systemd.
  • We have a pretty complete set of documentation
    now, some
    of it
    even extending to areas not directly related to systemd
    itself.
  • Quite a number of upstream packages now ship with systemd service
    files out of the box that work across all distributions that have
    adopted systemd. It is our intention to unify the boot and service
    management between distributions with systemd, and this shows fruits
    already. Furthermore a number of upstream packages now ship our
    patches for socket-based activation.
  • Even more options that control the process execution environment
    or the sockets we create are now supported.
  • Earlier today I began my series of blog stories on systemd
    for administrators.
  • We reimplemented almost all boot-up and shutdown scripts of the
    standard Fedora install in much smaller, simpler and faster C
    utilities, or in systemd itself. Most of this will not be enabled in
    F14 however, even though it is shipped with systemd upstream. With
    this enabled the entire Linux system gains a completely new feeling as
    the number of shells we spawn approaches zero, and the PID of the
    first user terminal is way < 500 now, and the early boot-up is
    fully parallelized. We looked at the boot scripts of Fedora, OpenSUSE
    and Debian and distilled from this a list of functionality that makes
    up the early boot process and reimplemented this in C, if possible
    following the behaviour of one of the existing implementations from
    these three distributions. This turned out to be much less effort than
    anticipated, and we are actually quite excited about this. Look
    forward to the fruits of this work in F15, when we might be able to
    present you a shell-less boot at least for standard desktop/laptop
    systems.
  • We spent some time reinvestigating the current syslog logic, and
    came up with an elegant and simple scheme to provide /dev/log
    compatible logging right from the time systemd is first initialized
    right until the time the kernel halts the machine. Through the wonders
    of socket based activation we first connect the /dev/log
    socket with a minimal bridge to the kernel log buffer (kmsg)
    and then, as soon as the real syslog is started up as part of the
    later bootup phase, we dynamically replace this minimal bridge by the
    real syslog daemon — without losing a single log message. Since one
    of the first things the real syslog daemon does is flushing the kernel
    log buffer into log files, all logged messages will sooner or later be
    stored on disk, regardless whether they have been generated during
    early boot, late boot or system runtime. On top of that if the syslog
    daemon terminates or is shut down during runtime, the bridge becomes
    active again and log output is written to kmsg again. The same applies
    when the system goes down. This provides a simple and robust way to
    ensure that no logs will ever be lost again, and that logging is
    available from the beginning of boot-up to the end of
    shut-down. Plymouth will most likely adopt a similar scheme for initrd
    logging, thus ensuring that everything ever logged on the system will
    properly end up in the log files, whether it comes from the kernel,
    from the initrd, from early-boot, from runtime or shutdown. And if
    syslogd is not around, dmesg will provide you with access to
    the log messages. While this bridge is part of systemd upstream, we’ll
    most likely enable this bridge in Fedora only starting with F15. Also
    note that embedded systems that have no interest in shipping a full
    syslogd solution can simply use this syslog bridge during the entire
    runtime, thus making the kernel log buffer the centralized log
    storage, with all the advantages this offers: zero disk IO at runtime,
    access to serial and netconsole logging, and remote debug access to
    the kernel log buffer. (A minimal sketch of such a bridge follows
    after this list.)
  • We now install autofs units for many “API” kernel virtual file
    systems by default, such as binfmt_misc or
    hugetlbfs. That means that the file system access is readily
    available, client code no longer has to manually load the respective
    kernel modules, as they are autoloaded on first access of the file
    system. This has many advantages: it is not only faster to set up
    during boot, but also simpler for applications, as they can just
    assume the functionality is available. On top of that permission
    problems for the initialization go away, since manual module loading
    requires root privileges.
  • Many smaller fixes and enhancements, all across the board, which
    if mentioned here would make this blog story another blog
    novel. Suffice to say, we did a lot of polishing to ready systemd for
    F14.

All in all, systemd is progressing nicely, and the features we have
been working on in the last months exist, without exception, in no
other init system available on Linux, and our
feature set was already far ahead of what the older init
implementations provide. And we have quite a bit planned for the
future. So, stay tuned!

Also note that I’ll speak about systemd at LinuxKongress
2010
in Nuremberg, Germany. Later this year I’ll also be speaking
at the Linux
Plumbers Conference
in Boston, MA. Make sure to drop by if you
want to learn about systemd or discuss exciting new ideas or features
with us.

On IDs

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/ids.html

When programming software that cooperates with software running on behalf of
other users, other sessions or other computers it is often necessary to work with
unique identifiers. These can be bound to various hardware and software objects
as well as lifetimes. Often, when people look for such an ID to use they pick
the wrong one because the semantics and lifetime of the IDs are not clear. Here’s a
little (and admittedly incomplete) list of IDs accessible on Linux and how you should or
should not use them.

Hardware IDs

  1. /sys/class/dmi/id/product_uuid: The main board product UUID, as
    set by the board manufacturer and encoded in the BIOS DMI information. It may
    be used to identify a mainboard and only the mainboard. It changes when the
    user replaces the main board. Also, often enough BIOS manufacturers write bogus
    serials into it. In addition, it is x86-specific. Access for unprivileged users
    is forbidden. Hence it is of little general use.
  2. CPUID/EAX=3 CPU serial number: A CPU UUID, as set by the
    CPU manufacturer and encoded on the CPU chip. It may be used to identify a CPU
    and only a CPU. It changes when the user replaces the CPU. Also, most modern
    CPUs don’t implement this feature anymore, and older computers tend to disable
    this option by default, controllable via a BIOS Setup option. In addition, it
    is x86-specific. Hence this too is of little general use.
  3. /sys/class/net/*/address: One or more network MAC addresses, as
    set by the network adapter manufacturer and encoded on some network card
    EEPROM. It changes when the user replaces the network card. Since network cards
    are optional and there may be more than one, the availability of this ID is not
    guaranteed and you might have more than one to choose from. On virtual machines
    the MAC addresses tend to be random. This too is hence of little general use.
  4. /sys/bus/usb/devices/*/serial: Serial numbers of various USB
    devices, as encoded in the USB device EEPROM. Most devices don’t have a serial
    number set, and if they have one it is often bogus. If the user replaces his USB
    hardware or plugs it into another machine these IDs may change or appear in
    other machines. This hence too is of little use.

There are various other hardware IDs available, many of which you may
discover via the ID_SERIAL udev property of various devices, such as hard disks
and similar. They all have in common that they are bound to specific
(replaceable) hardware, are not universally available, are often filled with bogus
data and are random in virtualized environments. In other words: don’t use them and don’t
rely on them for identification, unless you really know what you are doing;
in general they do not guarantee what you might hope they guarantee.

Software IDs

  1. /proc/sys/kernel/random/boot_id: A random ID that is regenerated
    on each boot. As such it can be used to identify the local machine’s current
    boot. It’s universally available on any recent Linux kernel. It’s a good and
    safe choice if you need to identify a specific boot on a specific booted
    kernel.
  2. gethostname(), /proc/sys/kernel/hostname: A non-random ID
    configured by the administrator to identify a machine in the network. Often
    this is not set at all or is set to some default value such as
    localhost and often not even unique in the local network. In addition it
    might change during runtime, for example because it changes based on updated
    DHCP information. As such it is almost entirely useless for anything but
    presentation to the user. It has very weak semantics and relies on correct
    configuration by the administrator. Don’t use this to identify machines in a
    distributed environment. It won’t work unless centrally administered, which
    makes it useless in a globalized, mobile world. It has no place in
    automatically generated filenames that shall be bound to specific hosts. Just
    don’t use it, please. It’s really not what many people think it is.
    gethostname() is standardized in POSIX and hence portable to other
    Unixes.
  3. IP Addresses returned by SIOCGIFCONF or the respective Netlink APIs: These
    tend to be dynamically assigned and often enough only valid on local networks
    or even only the local links (i.e. 192.168.x.x style addresses, or even
    169.254.x.x/IPv4LL). Unfortunately they hence have little use outside of
    networking.
  4. gethostid(): Returns a supposedly unique 32-bit identifier for the
    current machine. The semantics of this is not clear. On most machines this
    simply returns a value based on a local IPv4 address. On others it is
    administrator controlled via the /etc/hostid file. Since the semantics
    of this ID are not clear, and it most often is just a value based on the IP address, it
    is almost always the wrong choice to use. On top of that, 32 bits are not
    particularly many. On the other hand this is standardized in POSIX and hence
    portable to other Unixes. It’s probably best to ignore this value and if people
    don’t want to ignore it they should probably symlink /etc/hostid to
    /var/lib/dbus/machine-id or something similar.
  5. /var/lib/dbus/machine-id: An ID identifying a specific Linux/Unix
    installation. It does not change if hardware is replaced, and unlike the
    hardware IDs above it remains reliable in virtualized environments. This
    value has clear semantics and is considered
    part of the D-Bus API. It is supposedly globally unique and portable to all
    systems that have D-Bus. On Linux, it is universally available, given that
    almost all non-embedded and even a fair share of the embedded machines ship
    D-Bus now. This is the recommended way to identify a machine, possibly with a
    fallback to the host name to cover systems that still lack D-Bus. If your
    application links against libdbus, you may access this ID with
    dbus_get_local_machine_id(); if not, you can read it directly from the
    file system (see the sketch after this list).
  6. /proc/self/sessionid: An ID identifying a specific Linux login
    session. This ID is maintained by the kernel and part of the auditing logic. It
    is uniquely assigned to each login session during a specific system boot,
    shared by each process of a session, even across su/sudo and cannot be changed
    by userspace. Unfortunately some distributions have so far failed to set things
    up properly for this to work (Hey, you, Ubuntu!), and this ID is always
    (uint32_t) -1 for them. But there’s hope they get this fixed
    eventually. Nonetheless it is a good choice for a unique session identifier on
    the local machine and for the current boot. To make this ID globally unique it
    is best combined with /proc/sys/kernel/random/boot_id.
  7. getuid(): An ID identifying a specific Unix/Linux user. This ID is
    usually automatically assigned when a user is created. It is not unique across
    machines and may be reassigned to a different user if the original user was
    deleted. As such it should be used only locally and with the limited validity
    in time in mind. To make this ID globally unique it is not sufficient to
    combine it with /var/lib/dbus/machine-id, because the same ID might be
    used for a different user that is created later with the same UID. Nonetheless
    this combination is often good enough. It is available on all POSIX systems.
  8. ID_FS_UUID: an ID that identifies a specific file system in the
    udev tree. It is not always clear how these serials are generated but this
    tends to be available on almost all modern disk file systems. It is not
    available for NFS mounts or virtual file systems. Nonetheless this is often a
    good way to identify a file system, and in the case of the root directory even
    an installation. However, due to the weakly defined generation semantics the
    D-Bus machine ID is generally preferable.
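
As a small illustration of the recommendations above, here’s a minimal
C sketch that reads the machine ID and the boot ID directly from the
files mentioned in this list, without linking against libdbus. Error
handling is kept to the bare minimum:

    #include <stdio.h>
    #include <string.h>

    /* Read one line from path into buf, stripping the trailing
     * newline. Returns 0 on success, -1 on failure. */
    static int read_id(const char *path, char *buf, size_t len) {
            FILE *f = fopen(path, "r");
            if (!f)
                    return -1;
            if (!fgets(buf, len, f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);
            buf[strcspn(buf, "\n")] = 0;
            return 0;
    }

    int main(void) {
            char machine_id[64], boot_id[64];

            if (read_id("/var/lib/dbus/machine-id", machine_id, sizeof machine_id) == 0)
                    printf("machine: %s\n", machine_id);
            if (read_id("/proc/sys/kernel/random/boot_id", boot_id, sizeof boot_id) == 0)
                    printf("boot: %s\n", boot_id);
            return 0;
    }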

Generating IDs

Linux offers a kernel interface to generate UUIDs on demand, by reading from
/proc/sys/kernel/random/uuid. This is a very simple interface to
generate UUIDs. That said, the logic behind UUIDs is unnecessarily complex and
often it is a better choice to simply read 16 bytes or so from
/dev/urandom.
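
For illustration, here’s a minimal sketch of the latter approach:
read 16 bytes from /dev/urandom and format them as a hex string:

    #include <stdio.h>

    int main(void) {
            unsigned char id[16];
            FILE *f = fopen("/dev/urandom", "r");

            if (!f)
                    return 1;
            if (fread(id, 1, sizeof id, f) != sizeof id) {
                    fclose(f);
                    return 1;
            }
            fclose(f);

            for (int i = 0; i < (int) sizeof id; i++)
                    printf("%02x", id[i]);
            printf("\n");
            return 0;
    }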

Summary

And the gist of it all: Use /var/lib/dbus/machine-id! Use
/proc/self/sessionid! Use /proc/sys/kernel/random/boot_id!
Use getuid()! Use /dev/urandom!
And forget about the
rest, in particular the host name, or the hardware IDs such as DMI. And keep in
mind that you may combine the aforementioned IDs in various ways to get
different semantics and validity constraints.

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
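
Here’s a heavily simplified sketch of that idea in C. All error
handling is omitted, the fixed fd number and the daemon path are mere
placeholders, and the actual protocol systemd uses is described
further below; this just shows the mechanics of creating the socket
first and passing it along during exec():

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
            struct sockaddr_un sa = { .sun_family = AF_UNIX };
            strncpy(sa.sun_path, "/var/run/example.sock", sizeof sa.sun_path - 1);

            /* The manager creates, binds and listens on the socket... */
            int fd = socket(AF_UNIX, SOCK_STREAM, 0);
            bind(fd, (struct sockaddr*) &sa, sizeof sa);
            listen(fd, SOMAXCONN);

            if (fork() == 0) {
                    /* ...and the daemon merely inherits it across exec(),
                     * here at an agreed-on fd number. */
                    dup2(fd, 3);
                    execl("/usr/sbin/exampled", "exampled", (char*) NULL);
                    _exit(1);
            }

            /* The manager keeps its copy of the fd, so the socket stays
             * connectible even while the daemon is still starting up. */
            return 0;
    }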

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start-up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this with optimal parallelization and optionally with
on-demand loading. On top of this, it is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video
where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.
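
For the curious, here’s a sketch of what bus activation looks like
from the client side with libdbus. The service name is made up, and
in practice even this explicit call is rarely needed, since simply
sending a method call to a not-yet-running name will queue the
message and activate the service implicitly:

    #include <dbus/dbus.h>
    #include <stdio.h>

    int main(void) {
            DBusError error;
            dbus_error_init(&error);

            DBusConnection *bus = dbus_bus_get(DBUS_BUS_SYSTEM, &error);
            if (!bus) {
                    fprintf(stderr, "Failed to connect: %s\n", error.message);
                    return 1;
            }

            /* Ask the bus to start the (hypothetical) service if it is
             * not running yet. */
            dbus_uint32_t result;
            if (!dbus_bus_start_service_by_name(bus, "org.example.Daemon",
                                                0, &result, &error)) {
                    fprintf(stderr, "Activation failed: %s\n", error.message);
                    return 1;
            }
            return 0;
    }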

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at
the serialization graphs of the boot process
of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that has
fully finished do we go on to boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system has finished fsck and quota
checking in the course of normal boot-up, we replace it with the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has supported Control
Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups have been introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use it for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
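
As a small illustration, here’s a minimal sketch that prints the
cgroup memberships of a process via the /proc interface just
mentioned (pass a PID as argument, or none to inspect the tool
itself). Each line has the form "hierarchy-id:controller-list:group-path":

    #include <stdio.h>

    int main(int argc, char *argv[]) {
            char path[64], line[256];

            snprintf(path, sizeof path, "/proc/%s/cgroup",
                     argc > 1 ? argv[1] : "self");

            FILE *f = fopen(path, "r");
            if (!f) {
                    perror(path);
                    return 1;
            }
            while (fgets(line, sizeof line, f))
                    fputs(line, stdout);
            fclose(f);
            return 0;
    }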

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.
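
Since glibc offers no wrapper for ioprio_set(), a sketch has to go
through syscall(); the constants below are copied from the kernel’s
linux/ioprio.h. This is a minimal illustration, not hardened code:

    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    /* From linux/ioprio.h */
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_CLASS_IDLE  3
    #define IOPRIO_WHO_PROCESS 1

    int main(void) {
            /* Priority value: class in the top bits; the per-class data
             * in the low bits is unused for the idle class. */
            int prio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;

            /* who = 0 means the calling process. */
            if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
                    perror("ioprio_set");
                    return 1;
            }

            /* Anything executed from here on only gets disk time when
             * the system is otherwise idle. */
            return 0;
    }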

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do, in a fashion similar to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, so don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.
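
As a sketch of the mechanics: before exec()ing a daemon an init
system can simply connect the standard output streams to
/dev/kmsg, so that everything the service prints ends up in the
kernel log buffer (the daemon binary here is a placeholder):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
            int fd = open("/dev/kmsg", O_WRONLY);
            if (fd >= 0) {
                    dup2(fd, STDOUT_FILENO);
                    dup2(fd, STDERR_FILENO);
                    if (fd > STDERR_FILENO)
                            close(fd);
            }
            execl("/usr/sbin/exampled", "exampled", (char*) NULL);
            return 1;
    }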

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well-known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns all dependencies on their
head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo)
in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little of substance beyond Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the
code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, which is started when the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection (see the
    sketch after this list).
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    classic SysV systems, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.
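
To give you a rough idea what such unit configuration looks like,
here’s a hypothetical socket/service pair, written in the
.desktop-like syntax described below. Treat the option names as
illustrative rather than authoritative; the real vocabulary is
documented with systemd itself:

    # example.socket -- systemd listens on this socket...
    [Socket]
    ListenStream=/var/run/example.sock

    # example.service -- ...and starts this service on the first connection
    [Service]
    ExecStart=/usr/sbin/exampled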

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset on
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does little more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty2.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service avahi-autoipd@eth0.service pulls in
    network@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the order in which units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    removes non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of an UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably triggered less
    often) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev.) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways for such a service to
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness as init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services involved; thankfully most distributions
don’t carry many of those yet).

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every monday 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, already, you can run systemd just fine
as a normal user, and it will detect that it is run that way;
support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers
; we need a well-defined
place where we can put user sockets
. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository
. In addition, to have something to start with, here’s
a tarball with unit configuration files
that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup
showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup; the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than they were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files.
  • Grab a name on the bus.
  • You may rely on systemd for logging: you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above. (See the short sketch
    following this list.)
  • Use SIGTERM for requesting shutdowns from your daemon.
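
To make the socket activation and SIGTERM items above more concrete, here is a minimal sketch of a daemon written in this style, in C. It parses $LISTEN_PID/$LISTEN_FDS by hand, assumes the first passed fd is a listening stream socket, and implements a trivial echo service; those specifics are illustrative only:

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    #define SD_LISTEN_FDS_START 3 /* first fd passed by systemd */

    static volatile sig_atomic_t quit = 0;

    static void term_handler(int sig) {
            (void) sig;
            quit = 1;
    }

    int main(void) {
            struct sigaction sa;
            const char *e;

            /* Verify that $LISTEN_PID/$LISTEN_FDS are really meant for us. */
            e = getenv("LISTEN_PID");
            if (!e || (pid_t) atol(e) != getpid()) {
                    fprintf(stderr, "Not socket activated.\n");
                    return 1;
            }
            e = getenv("LISTEN_FDS");
            if (!e || atoi(e) < 1) {
                    fprintf(stderr, "No sockets passed.\n");
                    return 1;
            }

            /* Handle SIGTERM for clean shutdown; no SA_RESTART, so that
             * accept() gets interrupted when the signal arrives. */
            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = term_handler;
            sigaction(SIGTERM, &sa, NULL);

            /* No fork(), no setsid(), no PID file: we serve directly from
             * the process systemd started for us, and log to stderr. */
            while (!quit) {
                    char buf[256];
                    ssize_t k;
                    int c = accept(SD_LISTEN_FDS_START, NULL, NULL);

                    if (c < 0) {
                            if (errno == EINTR)
                                    continue;
                            fprintf(stderr, "accept() failed: %s\n", strerror(errno));
                            return 1;
                    }

                    /* Trivial echo, just so the daemon does something. */
                    k = read(c, buf, sizeof(buf));
                    if (k > 0)
                            (void) write(c, buf, (size_t) k);
                    close(c);
            }

            fprintf(stderr, "Got SIGTERM, exiting.\n");
            return 0;
    }

(The systemd source tree also ships a small reference implementation of these interfaces, sd-daemon.h; its sd_listen_fds() does this environment parsing, plus some extra validation, for you.)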

The list above is very similar to what Apple recommends for
daemons compatible with launchd. It should be easy to extend
daemons that already support launchd activation to support
systemd activation as well.

Note that systemd supports daemons not written in this style
perfectly well, too, if only for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd-capable daemons, which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have written proof-of-concept patches,
and the porting turned out to be very easy. Also, we can leverage
the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However, the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so something similar to the Fedora case applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux-specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), so a port to other operating systems appears to us to make
little sense. Also, we, the people involved, are unlikely to be
interested in merging possible ports to other platforms and
working with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers and packagers, as well as folks who are
interested in writing documentation or contributing a logo.

Community

At this time we only have a source code repository and an IRC
channel (#systemd on Freenode). There’s no mailing list, web site
or bug tracking system. We’ll probably set something up on
freedesktop.org soon. If you have any questions or want to contact
us otherwise, we invite you to join us on IRC!

Update: our GIT repository has moved.

LPC Audio BoF Notes

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/audio-bof-notes.html

Here are some very short notes from the Audio BoF
at the Linux Plumbers
Conference
in Portland two weeks ago. Sorry for the delay!

The biggest issue discussed was audio routing. On embedded devices this gets
more complex every day, and there are a lot of open questions on the desktop,
too. Different DSP scenarios; how do mixer controls match up with PCM streams
and jack sensing? How do we determine which of the volume control sliders in
the pipeline we are currently interested in? How does that relate to policy
decisions? Which format should audio routing information be stored in?

The ALSA scenario subsystem, currently being worked on by Liam Girdwood
and the folks at SlimLogic and on its way to being integrated into ALSA
proper, will hopefully help us here, so that we can strip a lot of
complexity related to the routing logic from PulseAudio and move it into a
lower level which naturally knows more about the hardware’s internal
routing.

Does it make sense for some apps to bypass the ALSA userspace layer and
talk to the kernel drivers via ioctl()s directly (i.e. not depending on
ALSA’s LISP interpreter and a lot of other complexities)? Probably yes, but
certainly not in the short-term future. Salsa? libsydney?

Should the timing deviation estimation/interpolation be moved from
PulseAudio into the kernel? Might be a good idea. Particularly interesting
when we try to monitor not only the system and audio clocks, but the video
output and particularly the video input (i.e. video4linux) clocks, too. A
unified kernel-based timing system has advantages in accuracy, allows better
handling of (pseudo-)atomic timing snapshots, and would centralize timing
handling not only between different applications (PA and JACK) but also
between different subsystems. Problem: the current timing code in PulseAudio
might be a bit too homegrown to move 1:1 into the kernel. Also, it depends
on floating point. Needs someone to push this. Apple does the clock handling
in the kernel. How does this relate to ALSA’s timer API?

Seems Ubuntu is going to kill OSS pretty soon too, following Fedora’s lead. Yay!

And that’s all I have. Should be the biggest points raised. Ping me if I
forgot something.

Linux Plumbers Conference 2009 CFP Ending Soon!

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/plumbersconf-2009.html

The Call for Papers for
the Linux Plumbers Conference (LPC)
in September in Portland, Oregon is ending soon, on June 15th 2009. It’s a conference
about the core infrastructure of Linux systems: the part of the system where
userspace and the kernel interface. It’s the first conference where the focus
is specifically on getting together the kernel people who work on the
userspace interfaces and the userspace people who have to deal with kernel
interfaces. It’s supposed to be a place where all the people doing
infrastructure work sit down and talk, so that each side understands better
what the requirements and needs of the other are, and where we can work
towards fixing the major problems we currently have with our lower-level
APIs.

Last year’s conference was hugely successful. If you want to read up what
happened then, LWN has good coverage.

Like last year, I will be running the Audio conference track of LPC. Audio
infrastructure on Linux is still heavily fragmented. The pro, desktop and
embedded worlds are very separate. While we have quite good driver support,
the user experience is far from perfect, mostly because our infrastructure is
so balkanized. Join us at the LPC and help to fix this! If you are doing audio infrastructure work on Linux, make sure to attend and submit a paper!

Sign up soon! Send in your paper quickly!

Plumbers Logo

See you in Portland!

A Guide Through The Linux Sound API Jungle

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/guide-to-sound-apis.html

At the Audio MC at the Linux Plumbers Conference one
thing became very clear: it is very difficult for programmers to
figure out which audio API to use for which purpose and which API not
to use when doing audio programming on Linux. So here’s my attempt to
guide you through this jungle:

What do you want to do?

I want to write a media-player-like application!
Use GStreamer! (Unless your focus is only KDE, in which case Phonon might be an alternative.)
I want to add event sounds to my application!
Use libcanberra, install your sound files according to the XDG Sound Theming/Naming Specifications! (Unless your focus is only KDE, in which case KNotify might be an alternative, although it has a different focus.)
I want to do professional audio programming, hard-disk recording, music synthesizing, MIDI interfacing!
Use JACK and/or the full ALSA interface.
I want to do basic PCM audio playback/capturing!
Use the safe ALSA subset.
I want to add sound to my game!
Use the audio API of SDL for full-screen games, libcanberra for simple games with standard UIs such as Gtk+.
I want to write a mixer application!
Use the layer you want to support directly: if you want to support enhanced desktop software mixers, use the PulseAudio volume control APIs. If you want to support hardware mixers, use the ALSA mixer APIs.
I want to write audio software for the plumbing layer!
Use the full ALSA stack.
I want to write audio software for embedded applications!
For technical appliances the safe ALSA subset is usually a good choice; this however depends highly on your use-case.

You want to know more about the different sound APIs?

GStreamer
GStreamer is the de-facto
standard media streaming system for Linux desktops. It supports decoding and
encoding of audio and video streams. You can use it for a wide range of
purposes from simple audio file playback to elaborate network
streaming setups. GStreamer supports a wide range of CODECs and audio
backends. GStreamer is not particularly suited for basic PCM playback
or low-latency/realtime applications. GStreamer is portable and not
limited in its use to Linux. Among the supported backends are ALSA, OSS, PulseAudio. (A short usage sketch follows after this overview.) [Programming Manuals and References]
libcanberra
libcanberra
is an abstract event sound API. It implements the XDG
Sound Theme and Naming Specifications. libcanberra is a blessed
GNOME dependency, but itself has no dependency on GNOME/Gtk/GLib and can be
used with other desktop environments as well. In addition to an easy
interface for playing sound files, libcanberra provides caching
(which is very useful for networked thin clients) and allows passing
of various meta data to the underlying audio system which then can be
used to enhance user experience (such as positional event sounds) and
for improving accessibility. libcanberra supports multiple backends
and is portable beyond Linux. Among the supported backends are ALSA, OSS, PulseAudio, GStreamer. (A short usage sketch follows after this overview.) [API Reference]
JACK
JACK is a sound system for
connecting professional audio production applications and hardware
output. Its focus is low latency and application interconnection. It
is not useful for normal desktop or embedded use. It is not an API
that is particularly useful if all you want to do is simple PCM
playback. JACK supports multiple backends, although ALSA is best
supported. JACK is portable beyond Linux. Among the supported backends are ALSA, OSS. [API Reference]
Full ALSA
ALSA is the Linux API
for doing PCM playback and recording. ALSA is very focused on
hardware devices, although other backends are supported as well (to a
limited degree, see below). ALSA as a name is used both for the Linux
audio kernel drivers and a user-space library that wraps these. ALSA — the library — is
comprehensive, and portable (to a limited degree). The full ALSA API
can appear very complex and is large. However it supports almost
everything modern sound hardware can provide. Some of the
functionality of the ALSA API is limited in its use to actual hardware
devices supported by the Linux kernel (in contrast to software sound
servers and sound drivers implemented in user-space such as those for
Bluetooth and FireWire audio — among others) and Linux specific
drivers. [API Reference]
Safe ALSA
Only a subset of the full ALSA API works on all backends ALSA
supports. It is highly recommended to stick to this safe subset
if you do ALSA programming to keep programs portable, future-proof and
compatible with sound servers, Bluetooth audio and FireWire audio. See
below for more details about which functions of ALSA are considered
safe. The safe ALSA API is a suitable abstraction for basic,
portable PCM playback and recording — not just for ALSA kernel driver
supported devices. Among the supported backends are ALSA kernel driver
devices, OSS, PulseAudio, JACK.
Phonon and KNotify
Phonon is high-level
abstraction for media streaming systems such as GStreamer, but goes a
bit further than that. It supports multiple backends. KNotify is a
system for “notifications”, which goes beyond mere event
sounds. However it does not support the XDG Sound Theming/Naming
Specifications at this point, and also doesn’t support caching or
passing of event meta-data to an underlying sound system. KNotify
supports multiple backends for audio playback via Phonon. Both APIs
are KDE/Qt specific and should not be used outside of KDE/Qt
applications. [Phonon API Reference] [KNotify API Reference]
SDL
SDL is a portable API
primarily used for full-screen game development. Among other stuff it
includes a portable audio interface. Among others, SDL supports OSS,
PulseAudio and ALSA as backends. [API Reference]
PulseAudio
PulseAudio is a sound system
for Linux desktops and embedded environments that runs in user-space
and (usually) on top of ALSA. PulseAudio supports network
transparency, per-application volumes, spatial event sounds, allows
switching of sound streams between devices on-the-fly, policy
decisions, and many other high-level operations. PulseAudio adds a glitch-free
audio playback model to the Linux audio stack. PulseAudio is not
useful in professional audio production environments. PulseAudio is
portable beyond Linux. PulseAudio has a native API and also supports
the safe subset of ALSA, in addition to limited,
LD_PRELOAD-based OSS compatibility. Among others PulseAudio supports
OSS and ALSA as backends and provides connectivity to JACK. [API Reference]
OSS
The Open Sound System is a
low-level PCM API supported by a variety of Unixes including Linux. It
started out as the standard Linux audio system and is supported on
current Linux kernels in the API version 3 as OSS3. OSS3 is considered
obsolete and has been fully replaced by ALSA. A successor to OSS3
called OSS4 is available but plays virtually no role on Linux and is
not supported in standard kernels or by any of the relevant
distributions. The OSS API is very low-level, based around direct
kernel interfacing using ioctl()s. It is hence awkward to use and
can practically not be virtualized for usage on non-kernel audio
systems like sound servers (such as PulseAudio) or user-space sound
drivers (such as Bluetooth or FireWire audio). OSS3’s timing model
cannot properly be mapped to software sound servers at all, and is
also problematic on non-PCI hardware such as USB audio. Also, OSS does
not do sample type conversion, remapping or resampling if
necessary. This means that clients that properly want to support OSS
need to include a complete set of converters/remappers/resamplers for
the case when the hardware does not natively support the requested
sampling parameters. With modern sound cards it is very common to
support only S32LE samples at 48KHz and nothing else. If an OSS client
assumes it can always play back S16LE samples at 44.1KHz it will thus
fail. OSS3 is portable to other Unix-like systems, various differences
however apply. OSS also doesn’t properly support surround sound and other
functionality of modern sound systems. OSS should be
considered obsolete and not be used in new applications.
ALSA and
PulseAudio have limited LD_PRELOAD-based compatibility with OSS. [Programming Guide]
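
As promised in the GStreamer and libcanberra entries above, here are two tiny sketches in C. First, media-player-style playback with GStreamer’s playbin element, which builds the whole demuxing/decoding/output plumbing internally; error handling is cut down to the bare minimum here:

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
            GstElement *play;
            GstBus *bus;
            GstMessage *msg;

            gst_init(&argc, &argv);
            if (argc != 2) {
                    g_printerr("Usage: %s <uri>\n", argv[0]);
                    return 1;
            }

            /* playbin builds the whole decoding/output pipeline for us. */
            play = gst_element_factory_make("playbin", "play");
            g_object_set(play, "uri", argv[1], NULL);
            gst_element_set_state(play, GST_STATE_PLAYING);

            /* Block until the stream finished or an error occurred. */
            bus = gst_element_get_bus(play);
            msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                                             GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
            if (msg)
                    gst_message_unref(msg);

            gst_object_unref(bus);
            gst_element_set_state(play, GST_STATE_NULL);
            gst_object_unref(play);
            return 0;
    }

And here is what playing an event sound with libcanberra looks like. The event ID “button-pressed” and the description string are purely illustrative, pick whatever matches your UI; note that playback is asynchronous, hence the sleep() at the end of this toy program:

    #include <canberra.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
            ca_context *c = NULL;
            int r;

            /* The context lazily connects to whatever backend (PulseAudio,
             * ALSA, ...) is available. */
            r = ca_context_create(&c);
            if (r < 0) {
                    fprintf(stderr, "Failed to create context: %s\n", ca_strerror(r));
                    return 1;
            }

            /* Play an event sound by its XDG sound theme name. */
            r = ca_context_play(c, 0,
                                CA_PROP_EVENT_ID, "button-pressed",
                                CA_PROP_EVENT_DESCRIPTION, "Button was pressed",
                                NULL);
            if (r < 0)
                    fprintf(stderr, "Failed to play sound: %s\n", ca_strerror(r));

            /* In a real application the context lives as long as the UI. */
            sleep(1);
            ca_context_destroy(c);
            return 0;
    }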

All sound systems and APIs listed above are supported in all
relevant current distributions. For libcanberra support the newest
development release of your distribution might be necessary.

All sound systems and APIs listed above are suitable for
development for commercial (read: closed source) applications, since
they are licensed under LGPL or more liberal licenses or no client
library is involved.

You want to know why and when you should use a specific sound API?

GStreamer
GStreamer is best used for very high-level needs: i.e. you want to
play an audio file or video stream and do not care about all the tiny
details down to the PCM or codec level.
libcanberra
libcanberra is best used when adding sound feedback to user input
in UIs. It can also be used to play simple sound files for
notification purposes.
JACK
JACK is best used in professional audio production and where interconnecting applications is required.
Full ALSA
The full ALSA interface is best used for software on the “plumbing layer” or when you want to make use of very specific hardware features, which might be needed for audio production purposes.
Safe ALSA
The safe ALSA interface is best used for software that wants to output/record basic PCM data from hardware devices or software sound systems.
Phonon and KNotify
Phonon and KNotify should only be used in KDE/Qt applications and only for high-level media playback, resp. simple audio notifications.
SDL
SDL is best used in full-screen games.
PulseAudio
For now, the PulseAudio API should be used only for applications
that want to expose sound-server-specific functionality (such as
mixers) or when a PCM output abstraction layer is already available in
your application and it thus makes sense to add an additional backend
to it for PulseAudio to keep the stack of audio layers minimal.
OSS
OSS should not be used for new programs.

You want to know more about the safe ALSA subset?

Here’s a list of DOS and DONTS in the ALSA API if you care
that your application stays future-proof and works fine with
non-hardware backends or backends for user-space sound drivers such as
Bluetooth and FireWire audio. Some of these recommendations apply for
people using the full ALSA API as well, since some functionality
should be considered obsolete in all cases.

If your application’s code does not follow these rules, you must have
a very good reason for that. Otherwise your code should simply be considered
broken!

DONTS:

  • Do not use “async handlers”, e.g. via
    snd_async_add_pcm_handler() and friends. Asynchronous
    handlers are implemented using POSIX signals, which is a very
    questionable use of them, especially from libraries and plugins. Even
    when you don’t want to limit yourself to the safe ALSA subset
    it is highly recommended not to use this functionality. Read
    this for a longer explanation why signals for audio IO are
    evil.
  • Do not parse the ALSA configuration file yourself or with
    any of the ALSA functions such as snd_config_xxx(). If you
    need to enumerate audio devices use snd_device_name_hint()
    (and related functions). That
    is the only API that also supports enumerating non-hardware audio
    devices and audio devices with drivers implemented in userspace.
  • Do not parse any of the files from
    /proc/asound/. Those files only include information about
    kernel sound drivers — user-space plugins are not listed there. Also,
    the set of kernel devices might differ from the way they are presented
    in user-space. (i.e. sub-devices are mapped in different ways to
    actual user-space devices such as surround51 and suchlike.)
  • Do not rely on stable device indexes from ALSA. Nowadays
    they depend on the initialization order of the drivers during boot-up
    time and are thus not stable.
  • Do not use the snd_card_xxx() APIs. For
    enumerating use snd_device_name_hint() (and related
    functions). snd_card_xxx() is obsolete. It will only list
    kernel hardware devices. User-space devices such as sound servers,
    Bluetooth audio are not included. snd_card_load() is
    completely obsolete these days.
  • Do not hard-code device strings, especially not
    hw:0 or plughw:0 or even dmix — these devices define no channel
    mapping and are mapped to raw kernel devices. It is highly recommended
    to use exclusively default as device string. If specific
    channel mappings are required the correct device strings should be
    front for stereo, surround40 for Surround 4.0,
    surround41, surround51, and so on. Unfortunately at
    this point ALSA does not define standard device names with channel
    mappings for non-kernel devices. This means default may only
    be used safely for mono and stereo streams. You should probably prefix
    your device string with plug: to make sure ALSA transparently
    reformats/remaps/resamples your PCM stream for you if the
    hardware/backend does not support your sampling parameters
    natively.
  • Do not assume that any particular sample type is supported
    except the following ones: U8, S16_LE, S16_BE, S32_LE, S32_BE,
    FLOAT_LE, FLOAT_BE, MU_LAW, A_LAW.
  • Do not use snd_pcm_avail_update() for
    synchronization purposes. It should be used exclusively to query the
    amount of bytes that may be written/read right now. Do not use
    snd_pcm_delay() to query the fill level of your playback
    buffer. It should be used exclusively for synchronisation
    purposes. Make sure you fully understand the difference, and note that
    the two functions return values that are not necessarily directly
    connected!
  • Do not assume that the mixer controls always know dB information.
  • Do not assume that all devices support MMAP style buffer access.
  • Do not assume that the hardware pointer inside the (possibly mmaped) playback buffer is the actual position of the sample in the DAC. There might be an extra latency involved.
  • Do not try to recover with your own code from ALSA error conditions such as buffer under-runs. Use snd_pcm_recover() instead.
  • Do not touch buffering/period metrics unless you have
    specific latency needs. Develop defensively, handling correctly the
    case when the backend cannot fulfill your buffering metrics
    requests. Be aware that the buffering metrics of the playback buffer
    only indirectly influence the overall latency in many
    cases. i.e. setting the buffer size to a fixed value might actually result in
    practical latencies that are much higher.
  • Do not assume that snd_pcm_rewind() is available, works, or to which degree it works.
  • Do not assume that the time when a PCM stream can receive
    new data is strictly dependent on the sampling and buffering
    parameters and the resulting average throughput. Always make sure to
    supply new audio data to the device when it asks for it by signalling
    “writability” on the fd. (And similarly for capturing)
  • Do not use the “simple” interface snd_spcm_xxx().
  • Do not use any of the functions marked as “obsolete”.
  • Do not use the timer, midi, rawmidi, hwdep subsystems.

DOS:

  • Use snd_device_name_hint() for enumerating audio devices.
  • Use snd_mixer_xxx() instead of raw snd_ctl_xxx().
  • For synchronization purposes use snd_pcm_delay().
  • For checking the buffer playback/capture fill level use snd_pcm_avail_update().
  • Use snd_pcm_recover() to recover from errors returned by any of the ALSA functions.
  • If possible use the largest buffer sizes the device supports to maximize power saving and drop-out safety. Use snd_pcm_rewind() if you need to react to user input quickly.
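
To tie several of these points together, here is a minimal sketch of “safe subset” playback in C: it enumerates devices with snd_device_name_hint(), opens default rather than hw:0, and recovers from errors with snd_pcm_recover(). The 500ms buffering request is an arbitrary, defensive choice, and I am using the convenience call snd_pcm_set_params() here; setting up the hw/sw params explicitly but defensively would work just as well:

    #include <alsa/asoundlib.h>

    int main(void) {
            void **hints, **h;
            snd_pcm_t *pcm;
            static short buf[44100 * 2]; /* one second of stereo S16 silence */
            int r, i;

            /* Enumerate PCM devices the recommended way; this also lists
             * user-space devices such as sound servers. */
            if (snd_device_name_hint(-1, "pcm", &hints) >= 0) {
                    for (h = hints; *h; h++) {
                            char *name = snd_device_name_get_hint(*h, "NAME");
                            if (name) {
                                    printf("Found device: %s\n", name);
                                    free(name);
                            }
                    }
                    snd_device_name_free_hint(hints);
            }

            /* Open "default" and let ALSA convert/resample as necessary. */
            r = snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
            if (r < 0) {
                    fprintf(stderr, "open failed: %s\n", snd_strerror(r));
                    return 1;
            }
            r = snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                                   SND_PCM_ACCESS_RW_INTERLEAVED,
                                   2, 44100, 1 /* allow resampling */,
                                   500000 /* 500ms, defensive */);
            if (r < 0) {
                    fprintf(stderr, "set_params failed: %s\n", snd_strerror(r));
                    return 1;
            }

            for (i = 0; i < 5; i++) { /* five seconds of silence */
                    snd_pcm_sframes_t n = snd_pcm_writei(pcm, buf, 44100);
                    if (n < 0) /* recover from underruns etc., don't DIY */
                            n = snd_pcm_recover(pcm, (int) n, 0);
                    if (n < 0) {
                            fprintf(stderr, "writei failed: %s\n", snd_strerror((int) n));
                            break;
                    }
            }

            snd_pcm_drain(pcm);
            snd_pcm_close(pcm);
            return 0;
    }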

FAQ

What about ESD and NAS?
ESD and NAS are obsolete, both as API and as sound daemon. Do not develop for it any further.
ALSA isn’t portable!
That’s not true! Actually the user-space library is relatively portable; it even includes a backend for OSS sound devices. There is no real reason that would disallow using the ALSA libraries on other Unixes as well.
Portability is key to me! What can I do?
Unfortunately no truly portable (i.e. to Win32) PCM API is
available right now that I could truly recommend. The systems shown
above are more or less portable at least to Unix-like operating
systems. That does not mean however that there are suitable backends
for all of them available. If you care about portability to Win32 and
MacOS you probably have to find a solution outside of the
recommendations above, or contribute the necessary
backends/portability fixes. None of the systems (with the exception of
OSS) is truly bound to Linux or Unix-like kernels.
What about PortAudio?
I don’t think that PortAudio is a very good API for Unix-like operating systems. I cannot recommend it, but it’s your choice.
Oh, why do you hate OSS4 so much?
I don’t hate anything or anyone. I just don’t think OSS4 is a
serious option, especially not on Linux. On Linux, it is also
completely redundant due to ALSA.
You idiot, you have no clue!
You are right, I totally don’t. But that doesn’t hinder me from recommending things. Ha!
Hey I wrote/know this tiny new project which is an awesome abstraction layer for audio/media!
Sorry, that’s not sufficient. I only list software here that is known to be sufficiently relevant and sufficiently well maintained.

Final Words

Of course these recommendations are very basic and are only intended to
point you in the right direction. For each use-case different necessities
apply, and hence options that I did not consider here might become
viable. It’s up to you to decide how much of what I wrote here
actually applies to your application.

This summary only includes software systems that are considered
stable and universally available at the time of writing. In the
future I hope to introduce a more suitable and portable replacement
for the safe ALSA subset of functions. I plan to update this text
from time to time to keep things up-to-date.

If you feel that I forgot a use case or an important API, then
please contact me or leave a comment. However, I think the summary
above is sufficiently comprehensive and if an entry is missing I most
likely deliberately left it out.

(Also note that I am upstream for both PulseAudio and libcanberra and did some minor contributions to ALSA, GStreamer and some other of the systems listed above. Yes, I am biased.)

Oh, and please syndicate this, digg it. I’d like to see this guide to be well-known all around the Linux community. Thank you!

Linux Plumbers Conference CFP Extended!

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/plumbersconf-2.html

The Call for Papers for
the Linux Plumbers Conference
in September in Portland, Oregon has been extended until July 31st 2008. It’s a conference
about the core infrastructure of Linux systems: the part of the system where
userspace and the kernel interface. It’s the first conference where the focus
is specifically on getting together the kernel people who work on the
userspace interfaces and the userspace people who have to deal with kernel
interfaces. It’s supposed to be a place where all the people doing
infrastructure work sit down and talk, so that each side understands better
what the requirements and needs of the other are, and where we can work
towards fixing the major problems we currently have with our lower-level
APIs.

I am running the Audio microconf of the Plumbers Conference. Audio
infrastructure on Linux is still heavily fragmented. The pro, desktop and
embedded worlds are almost completely separate. While we have quite good
driver support, the user experience is far from perfect, mostly because our
infrastructure is so balkanized. Join us at the Plumbers Conference and help to fix this! If you are doing audio infrastructure work on Linux, make sure to attend and submit a paper!

Sign up soon! Send in your paper early! The conference is expected to sell out pretty quickly!

Plumbers Logo

See you in Portland!

PulseAudio FUD

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/jeffrey-stedfast.html

Jeffrey Stedfast

Jeffrey Stedfast seems to have made it his new hobby to bash PulseAudio.
In a series of very negative blog postings he flamed my software, and hence me,
in best NotZed-like fashion. Particularly interesting in this case is the
fact that he apologized to me privately on IRC for this behaviour shortly
after his first posting, when he was criticized on #gnome-hackers,
only to continue flaming and bashing in more blog posts shortly after. Flaming
is very much part of the Free Software community, I guess. A lot of people do
it from time to time (including me). But maybe there are better places for
this than Planet Gnome. And maybe doing it for days is not particularly nice.
And maybe flaming sucks in the first place anyway.

Regardless what I think about Jeffrey and his behaviour on Planet Gnome,
let’s have a look at his trophies, the five “bugs” he posted:

  1. Not directly related to PulseAudio itself. Also, finding errors in code that is related to esd is not exactly the most difficult thing in the world.
  2. The same theme.
  3. Fixed 3 months ago. It is certainly not my fault that this isn’t available in Jeffrey’s distro.
  4. A real, valid bug report. Fixed in git a while back, but not available in any released version. May only be triggered under heavy load or with a bad high-latency scheduler.
  5. A valid bug, but not really in PulseAudio. Mostly caused because the ALSA API and PA API don’t really match 100%.

OK, Jeffrey found a real bug, but I wouldn’t say this is really enough to make all the fuss about. Or is it?

Why PulseAudio?

Jeffrey wrote something about ‘solution looking for a problem‘ when
speaking of PulseAudio. While that was certainly not a nice thing to say it
however tells me one thing: I apparently didn’t manage to communicate well
enough why I am doing PulseAudio in the first place. So, why am I doing it then?

  • There’s so much more a good audio system needs to provide than just the
    most basic mixing functionality. Per-application volumes, moving streams
    between devices during playback, positional event sounds (i.e. click on the
    left side of the screen, have the sound event come out through the left
    speakers), secure session-switching support, monitoring of sound playback
    levels, rescuing playback streams to other audio devices on hot unplug,
    automatic hotplug configuration, automatic up/downmixing stereo/surround,
    high-quality resampling, network transparency, sound effects, simultaneous
    output to multiple sound devices are all features PA provides right now, and
    what you don’t get without it. It also provides the infrastructure for
    upcoming features like volume-follows-focus, automatic attenuation of music
    when there is signal on a VoIP stream, UPnP media renderer support, Apple RAOP support,
    mixing/volume adjustments with dynamic range compression, adaptive volume of
    event sounds based on the volume of music streams, jack sensing, switching
    between stereo/surround/spdif during runtime, …
  • And even for the most basic mixing functionality plain ALSA/dmix is not
    really everlasting happiness. Due to the way it works all clients are forced
    to use the same buffering metrics all the time; that means all clients are
    limited in their wakeup/latency settings. You will burn more CPU than
    necessary this way, keep the risk of drop-outs unnecessarily high and still
    not be able to make clients with low-latency requirements happy. ‘Glitch-Free’
    PulseAudio fixes all this. Quite frankly I believe that ‘glitch-free’
    PulseAudio is the single most important killer feature that should be enough
    to convince everyone why PulseAudio is the right thing to do. Maybe people
    actually don’t know that they want this. But they absolutely do, especially
    the embedded people — if used properly it is a must for power-saving during
    audio playback. It’s a pity that you cannot see directly from the user
    interface how awesome this feature is.[1]
  • PulseAudio provides compatibility with a lot of sound systems/APIs that bare ALSA
    or bare OSS don’t provide.
  • And last but not least, I love breaking Jeffrey’s audio. It’s just soo much fun, you really have to try it! 😉

If you want to know more about why I think that PulseAudio is an important part of the modern Linux desktop audio stack, please read my slides from FOSS.in 2007.

Misconceptions

Many people (like Jeffrey) wonder why have software mixing at all if you
have hardware mixing? The thing is, hardware mixing is a thing of the past;
modern sound cards don’t do it anymore. Precisely for doing things like mixing
in software, SIMD CPU extensions like SSE have been invented. Modern sound
cards these days are kind of “dumbed down”, high-quality DACs. They don’t do
mixing anymore; many modern chips don’t even do volume control anymore.
Remember the days where having a Wavetable chip was a killer feature of a
sound card? Those days are gone; today wavetable synthesizing is done almost
exclusively in software — and that’s exactly what happened to hardware mixing
too. And it is good that way. In software it is much easier to do
fancier stuff like DRC, which will increase the quality of mixing. And modern CPUs provide
all the necessary SIMD command sets to implement this efficiently.

Other people believe that JACK would be a better solution for the problem.
This is nonsense. JACK has been designed for a very different purpose. It is
optimized for low latency inter-application communication. It requires
floating point samples, it knows nothing about channel mappings, it depends on
every client to behave correctly. And so on, and so on. It is a sound server
for audio production. For desktop applications it is however not well suited.
For a desktop, saving power is very important; one application misbehaving
shouldn’t have an effect on other applications’ playback; and converting from/to
FP all the time is not going to help battery life either. Please understand
that for the purpose of pro audio you can make completely different
compromises than you can do on the desktop. For example, while having
‘glitch-free’ is great for embedded and desktop use, it makes no sense at all
for pro audio, and would only have a drawback on performance. So, please stop
bringing up JACK again and again. It’s just not the right tool for desktop
audio, and this opinion is shared by the JACK developers themselves.

Jeffrey thinks that audio mixing is nothing for userspace. Which is
basically what OSS4 tries to do: mixing in kernel space. However, the future
of PCM audio is floating points. Mixing them in kernel space is problematic because (at least on Linux) FP in kernel space is a no-no.
Also, the kernel people made clear more than once that maths/decoding/encoding like this
should happen in userspace. Quite honestly, doing the mixing in kernel space
is probably one of the primary reasons why I think that OSS4 is a bad idea.
The fancier your mixing gets (i.e. including resampling, upmixing, downmixing,
DRC, …) the more difficulty you will have moving such complex,
time-intensive code into the kernel.

Not every time your audio breaks is it PulseAudio’s fault alone. For
example, the original flame of Jeffrey’s was about the low volume that he
experienced when running PA. This is mostly due to the suckish way we
initialize the default volumes of ALSA sound cards. Most distributions have
simple scripts that initialize ALSA sound card volumes to fixed values like
75% of the available range, without understanding what the range or the
controls actually mean. This is actually a very bad thing to do. Integrated
USB speakers for example tend to export the full amplification range via the
mixer controls. 75% for them is incredibly loud. For other hardware (like
apparently Jeffrey’s) it is too low in volume. How to fix this has been
discussed on the ALSA mailing list, but no final solution has been presented
yet. Nonetheless, the fact that the volume was too low is completely
unrelated to PulseAudio.

PulseAudio interfaces with lower-level technologies like ALSA on one hand,
and with high-level applications on the other hand. Those systems are not
perfect. Especially closed-source applications tend to do very evil things
with the audio APIs (Flash!) that are very hard to support on virtualized
sound systems such as PulseAudio [2]. However, things are getting better. My
list of issues I found in ALSA is getting shorter. Many applications have
already been fixed.

The reflex “my audio is broken it must be PulseAudio’s fault” is certainly
easy to come up with, but it certainly is not always right.

Also note that — like many areas in Free Software — development of the
desktop audio stack on Linux is a bit understaffed. AFAIK there are only two
people working on ALSA full-time and only me working on PulseAudio and other
userspace audio infrastructure, assisted by a few others who supply code and patches
from time to time, some more and some less.

More Breakage to Come

I now tried to explain why the audio experience on systems with PulseAudio
might not be as good as some people hoped, but what about the future? To be
frank: the next version of PulseAudio (0.9.11) will break even more things.
The ‘glitch-free’ stuff mentioned above uses quite a few features of the
underlying ALSA infrastructure that apparently no one has been using before —
and which just don’t work properly yet on all drivers. And there are quite a
few drivers around, and I only have a very limited set of hardware to test
with. Already I know that some of the most popular drivers (USB and HDA)
do not work entirely correctly with ‘glitch-free’.

So you ask why I plan to release this code knowing that it will break
things? Well, it works on some hardware/drivers properly, and for the others I
know work-arounds to get things to work. And 0.9.11 has been delayed for too
long already. Also I need testing from a bigger audience. And it is not so
much 0.9.11 that is buggy, it is the code it is based on. ‘Glitch-free’ PA
0.9.11 is going to be part of Fedora 10. Fedora has always been more bleeding
edge than other distributions. Picking 0.9.11 just like that for an
‘LTS’ release might however not be a good idea.

So, please bear with me when I release 0.9.11. Snapshots have already
been available in Rawhide for a while, and hell didn’t freeze over.

The Distributions’ Role in the Game

Some distributions did a better job adopting PulseAudio than others. On the
good side I certainly have to list Mandriva, Debian[3], and
Fedora[4]. OTOH Ubuntu didn’t exactly do a stellar job. They didn’t
do their homework. Adopting PA in a distribution is a fair amount of work,
given that it interfaces with so many different things at so many different
places. The integration with other systems is crucial. The information was all
out there, communicated on the wiki, the mailing lists and on the PA IRC
channel. But if you join and hang around on neither, then you won’t get the
memo. To my surprise, when Ubuntu adopted PulseAudio they moved it into one of their
‘LTS’ releases right away [5]. Which I guess can be called gutsy —
against the background that I work for Red Hat and PulseAudio is not part of RHEL
at this time. I get a lot of flak from Ubuntu users, and I am pretty sure the
vast amount of it is undeserved and not my fault.

Why Jeffrey’s distro of choice (SUSE?) didn’t package pavucontrol 0.9.6
although it was released months ago I don’t know. But there’s certainly no
reason to whine about that to me and bash me for it.

Having said all this — it’s easy to point to other software’s faults or
other people’s failures. So, admitting this, PulseAudio is certainly not
bug-free, far from that. It’s a relatively complex piece of software
(threading, real-time, lock-free, sensitive to timing, …), and every
software has its bugs. In some workloads they might be easier to find than in
others. And I am working on fixing those which are found. I won’t forget any
bug report, but the order and priority I work on them is still mostly up to me
I guess, right? There’s still a lot of work to do in desktop audio, it will
take some time to get things completely right and complete.

Calls for “audio should just work ™” are often heard. But if you don’t
want to stick with a sound system that was state of the art in the 90’s for
all times, then I fear things *will have* to break from time to time. And
Jeffrey, I have no idea what you are actually hacking on. Some people
mentioned something with Evolution. If that’s true, then quite honestly,
“email should just work”, too, shouldn’t it? Evolution is not exactly
famous for its legendary bug-freeness and stability, or did I miss something?
Maybe you should be the one to start with making things “just work”, especially since
Evolution has been around for much longer already.

Back to Work

Now that I responded to Jeffrey’s FUD I think we all can go back to work
and end this flamefest! I wish people would actually try to understand
things before writing an insulting rant — without the slightest clue — but
with words like “clusterfuck”. I’d like to thank all the people who commented
on Jeffrey’s blog and basically already said what I wrote here
now.

So, and now I am off to hack on PulseAudio a bit more — or should
I say, in Jeffrey’s words: on my clusterfuck that is an epic fail and that no desktop user needs?

Footnotes

[1] BTW ‘glitch-free’ is nothing I invented, other OS have been doing something
like this for quite a while (Vista, Mac OS). On Linux however, PulseAudio is
the first and only implementation (at least to my knowledge).

[2] In fact, Flash 9 cannot be made fully working on PulseAudio.
This is because the way Flash destructs its driver backends is racy.
Unfixably racy, from external code. Jeffrey complained about Flash instability
in his second post. This is unfair to PulseAudio, because I cannot fix this.
This is like complaining that X crashes when you use binary-only
fglrx.

[3] To Debian’s standards at least. Since development of Debian is
very distributed, the integration of a system such as PulseAudio is much more
difficult, since it touches so many different packages in the system that are
kind of the private property of a lot of different maintainers with different
views on things.

[4] I maintain the Fedora stuff myself, so I might be a bit biased on this one… 😉

[5] I guess Ubuntu sees that this was a bit too much too early, too.
At least that’s how I understood my invitation to UDS in Prague. Since that
summit I haven’t heard anything from them anymore, though.

Linux Plumbers Conference CFP

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/plumbersconf.html

The Call for Papers for
the Linux Plumbers Conference
in September in Portland is out now. It’s a conference about the core
infrastructure of Linux systems: the part of the system where userspace and the
kernel interface. It’s the first conference where the focus is specifically on
getting together the kernel people who work on the userspace interfaces and the
userspace people who have to deal with kernel interfaces. It’s supposed to be a
place where all the people doing infrastructure work sit down and talk, so that
each side understands better what the requirements and needs of the other are,
and where we can work towards fixing the major problems we currently have with
our lower-level APIs.

I am running the Audio microconf of the Plumbers Conference. Audio
infrastructure on Linux is still heavily fragmented. The pro, desktop and
embedded worlds are almost completely separate. While we have quite good
driver support, the user experience is far from perfect, mostly because our
infrastructure is so balkanized. Join us at the Plumbers Conference and help to fix this! If you are doing audio infrastructure work on Linux, make sure to attend or — even better — submit a paper!

Sign up soon! Send in your paper early! The conference is expected to sell out pretty quickly!

Plumbers Logo

See you in Portland!

What’s Cooking in PulseAudio’s glitch-free Branch

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/pulse-glitch-free.html

A while ago I started development of a special branch of PulseAudio which is called
glitch-free. In a few days I will merge it back into PulseAudio
trunk, and eventually release it as 0.9.11. I think it’s time to
explain a little what all this “glitch-freeness” is about, what made
it so tricky to implement, and why this is totally awesome
technology. So, here we go:

Traditional Playback Model

Traditionally on most operating systems audio is scheduled via
sound card interrupts
(IRQs)
. When an application opens a sound card for playback it
configures it for a fixed size playback buffer. Then it fills this
buffer with digital PCM
sample data. And after that it tells the hardware to start
playback. Then, the hardware reads the samples from the buffer, one at
a time, and passes them on to the DAC
so that eventually they reach the speakers.

After a certain number of samples played the sound hardware
generates an interrupt. This interrupt is forwarded to the
application. On Linux/Unix this is done via poll()/select(),
which the application uses to sleep on the sound card file
descriptor. When the application is notified via this interrupt it
overwrites the samples that were just played by the hardware with new
data and goes to sleep again. When the next interrupt arrives the next
block of samples is overwritten, and so on and so on. When the
hardware reaches the end of the hardware buffer it starts from its
beginning again, in a true ring buffer
fashion. This goes on and on and on.

The number of samples after which an interrupt is generated is
usually called a fragment (ALSA likes to call the same thing a
period for some reason). The number of fragments the entire
playback buffer is split into is usually integral and usually a power of
two, 2 and 4 being the most frequently used values.

Schematic overview
Image 1: Schematic overview of the playback buffer in the traditional playback model, in the best way the author can visualize this with his limited drawing abilities.
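
In code, the traditional loop looks roughly like the following sketch. It uses the old OSS-style interface merely because that keeps the example short; the device path and fragment size are illustrative, and real code would of course configure format, rate and channels first (or rather use ALSA). The structure is what matters: sleep until the next fragment interrupt, then refill:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define FRAG_BYTES 4096 /* illustrative fragment size */

    int main(void) {
            unsigned char fragment[FRAG_BYTES];
            int fd = open("/dev/dsp", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            memset(fragment, 0, sizeof(fragment)); /* silence */

            for (;;) {
                    struct pollfd p = { .fd = fd, .events = POLLOUT };

                    /* Sleep until the sound card interrupt signals that a
                     * fragment has been played and buffer space is free. */
                    if (poll(&p, 1, -1) < 0) {
                            perror("poll");
                            break;
                    }

                    /* Overwrite the samples that were just played. If we
                     * arrive here too late, we get an underrun. */
                    if (write(fd, fragment, sizeof(fragment)) < 0) {
                            perror("write");
                            break;
                    }
            }

            close(fd);
            return 0;
    }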

If the application is not quick enough to fill up the hardware
buffer again after an interrupt we get a buffer underrun
(“drop-out”). An underrun is clearly audible to the user as a
discontinuity in audio, which is something we clearly don’t want. We
thus have to carefully make sure that the buffer and fragment sizes
are chosen in a way that the software has enough time to calculate the
data that needs to be played, and the OS has enough time to forward
the interrupt from the hardware to the userspace software and the
write request back to the hardware.

Depending on the requirements of the application the size of the
playback buffer is chosen. It can be as small as 4ms for low-latency
applications (such as music synthesizers), or as long as 2s for
applications where latency doesn’t matter (such as music players). The
hardware buffer size directly translates to the latency that the
playback adds to the system. The smaller the fragment sizes the
application configures, the more time the application has to fill up
the playback buffer again.

Let’s formalize this a bit: Let BUF_SIZE be the size of the
hardware playback buffer in samples, FRAG_SIZE the size of one
fragment in samples, and NFRAGS the number of fragments the buffer is
split into (equivalent to BUF_SIZE divided by FRAG_SIZE), and RATE the
sampling rate in samples per second. Then, the overall latency is identical
to BUF_SIZE/RATE. An interrupt is generated every FRAG_SIZE/RATE. Every
time one of those interrupts is generated the application should fill
up one fragment again; if it missed an interrupt this might become
more than one. If it doesn’t miss any interrupt it has
(NFRAGS-1)*FRAG_SIZE/RATE time to fulfill the request. If it needs
more time than this we’ll get an underrun. The fill level of the
playback buffer should thus usually oscillate between BUF_SIZE and
(NFRAGS-1)*FRAG_SIZE. In case of missed interrupts it might however
fall considerably lower, in the worst case to 0 which is, again, an
underrun.
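
To put concrete numbers on this: with RATE=44100, a FRAG_SIZE equivalent to
25ms (about 1102 samples) and NFRAGS=4, BUF_SIZE corresponds to roughly 100ms
of audio. The overall latency is hence about 100ms, an interrupt fires every
25ms, and the application has (NFRAGS-1)*FRAG_SIZE/RATE = 75ms to refill the
buffer after each wakeup. Not coincidentally, these are exactly the default
metrics of PulseAudio up to 0.9.10 mentioned below.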

It is difficult to choose the buffer and fragment sizes in an
optimal way for an application:

  • The buffer size should be as large as possible to minimize the
    risk of drop-outs.
  • The buffer size should be as small as possible to guarantee
    minimal latencies.
  • The fragment size should be as large as possible to minimize the
    number of interrupts, and thus the required CPU time used, to maximize
    the time the CPU can sleep for between interrupts and thus the battery
    lifetime (i.e. the fewer interrupts are generated the lower your audio
    app will show up in powertop, and that’s what all is about,
    right?)
  • The fragment size should be as small as possible to give the
    application as much time as possible to fill up the playback buffer,
    to minimize drop-outs.

As you can easily see, it is impossible to choose buffering metrics
in a way that is optimal for all four requirements.

This traditional model has major drawbacks:

  • The buffering metrics are highly dependent on what the sound hardware
    can provide. Portable software needs to be able to deal with hardware
    that can only provide a very limited set of buffer and fragment
    sizes.
  • The buffer metrics are configured only once, when the device is
    opened; they usually cannot be reconfigured during playback without
    major discontinuities in audio. This is problematic if more than one
    application wants to output audio at the same time via a sound server
    (or dmix) and they have different requirements on
    latency. For these sound servers/dmix the fragment metrics are
    configured statically in a configuration file, and are the same during
    the whole lifetime. If a client connects that needs lower latencies,
    it has basically lost. If a client connects that doesn’t need as low
    latencies, we will continuously burn more CPU/battery than
    necessary.
  • It is practically impossible to choose the buffer metrics optimal
    for your application — there are too many variables in the equation:
    you can’t know anything about the IRQ/scheduling latencies of the
    OS/machine your software will be running on; you cannot know how much
    time it will actually take to produce the audio data that shall be
    pushed to the audio device (unless you start counting cycles, which is
    a good way to make your code unportable); the scheduling latencies are
    hugely dependent on the system load on most current OSes (unless you
    have an RT system, which we generally do not have). As said, for sound
    servers/dmix it is impossible to know in advance what the requirements
    on latency are that the applications that might eventually connect
    will have.
  • Since the number of fragments is integral and at least 2
    on almost all existing hardware we will generate at least two interrupts
    on each buffer iteration. If we fix the buffer size to 2s then we will
    generate an interrupt at least every 1s. We’d then have 1s to fill up
    the buffer again — on all modern systems this is far more than we’d
    ever need. It would be much better if we could fix the fragment size
    to 1.9s, which still gives us 100ms to fill up the playback buffer
    again, still more than necessary on most systems.

Due to the limitations of this model most current (Linux/Unix)
software uses buffer metrics that turned out to “work most of the
time”, very often they are chosen without much thinking, by copying
other people’s code, or totally at random.

PulseAudio <= 0.9.10 uses a fragment size of 25ms by default, with
four fragments. That means that right now, unless you reconfigure your
PulseAudio manually, clients will not get latencies lower than 100ms
whatever you try, and as long as music is playing you will
get 40 interrupts/s. (The relevant configuration options for PulseAudio are
default-fragments= and default-fragment-size-msec=
in daemon.conf)
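
For illustration, a hypothetical daemon.conf snippet that halves the
default latency could look like this; whether these exact values work
depends on what your hardware supports:

    ; excerpt from daemon.conf -- values chosen for illustration only
    default-fragments = 2
    default-fragment-size-msec = 25

That would yield a 50ms buffer while keeping the 40 interrupts/s.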

dmix uses 16 fragments by default, with a size of 21 ms each (on my
system at least; this varies depending on your hardware). You can’t
get fewer than 47 interrupts/s. (You can change the parameters in
.asoundrc.)
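
For reference, here is a sketch of how the dmix parameters can be
overridden in .asoundrc. The device name and ipc_key are assumptions,
and your hardware might not accept all values:

    pcm.!default {
        type plug
        slave.pcm "mixed"
    }

    pcm.mixed {
        type dmix
        ipc_key 1024          # any key unique on this system
        slave {
            pcm "hw:0,0"      # first device on first card
            period_size 1024  # fragment size, in frames
            periods 16        # number of fragments
        }
    }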

So much for the traditional model and its limitations. Now, we’ll
have a peek at how the new glitch-free branch of PulseAudio
does things. The technology is not really new: it’s inspired
by what Vista does these days and what Apple CoreAudio has already
been doing for quite a while. However, on Linux this technology is
new; we have been lagging behind quite a bit. I also claim that what
PA does now goes beyond what Vista/MacOS does in many ways, though of
course they provide much more than we do in many other ways. The
name glitch-free is inspired by the term Microsoft uses for
this model; however, I must admit that I am not sure that my
definition of this term and theirs are actually the same.

Glitch-Free Playback Model

The first basic idea of the glitch-free playback model (a
better, less marketingy name is probably timer-based audio
scheduling
, which is the term I use internally in the PA codebase)
is to no longer depend on sound card interrupts to schedule audio, but
to use system timers instead. System timers are far more flexible than
the fragment-based sound card timers. They can be reconfigured at any
time, and have a granularity that is independent of any buffer
metrics of the sound card. The second basic idea is to use playback
buffers that are as large as possible, up to a limit of 2s or 5s. The
third basic idea is to allow rewriting of the hardware buffer at any
time. This allows instant reaction to user input (i.e. pause/seek
requests in your music player, or instant event sounds) although the
huge latency imposed by the hardware playback buffer would suggest
otherwise.

PA configures the audio hardware to the largest playback buffer
size possible, up to 2s. The sound card interrupts are disabled as far
as possible (most of the time this simply means lowering NFRAGS to the
minimal value supported by the hardware; it would be great if ALSA
allowed us to disable sound card interrupts entirely). Then, PA
constantly determines what the minimal latency requirement of all
connected clients is. If no client specified any requirements, we fill
up the whole buffer all the time, i.e. have an actual latency of
2s. However, if some applications specified requirements, we take the
lowest one and only use as much of the configured hardware buffer as
this value allows us. In practice, this means we only partially fill the
buffer each time we wake up. Then, we configure a system timer
to wake us up 10ms before the buffer would run empty and fill it up
again then. If the overall latency is configured to less than 10ms, we
wake up after half the requested latency.
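
In very rough C, the core of this scheme might look like the following
sketch. None of these helper functions exist in PA or ALSA under these
names; this is only meant to illustrate the control flow:

    /* Hypothetical sketch of the timer-based scheduling loop; the
     * helpers are made-up stand-ins, not real PA or ALSA functions. */
    #include <stddef.h>

    extern size_t latency_to_bytes(unsigned usec);
    extern unsigned bytes_to_usec(size_t bytes);
    extern size_t hw_buffer_fill_level(void);
    extern void write_to_hw_buffer(size_t bytes); /* mixes clients + writes */
    extern void sleep_usec(unsigned usec);        /* arms a system timer */

    void playback_loop(unsigned lowest_latency_usec, unsigned margin_usec)
    {
        for (;;) {
            size_t target = latency_to_bytes(lowest_latency_usec);
            size_t fill = hw_buffer_fill_level();

            /* Top up only as much of the hardware buffer as the
             * lowest client latency requirement allows. */
            if (target > fill)
                write_to_hw_buffer(target - fill);

            /* Wake up shortly before the buffer would run empty,
             * driven by a system timer, not a card interrupt. */
            sleep_usec(bytes_to_usec(target) - margin_usec);
        }
    }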

If the sleep time turns out to be too long (i.e. it took more than
10ms to fill up the hardware buffer), we will get an underrun. If this
happens, we double the time we wake up before the buffer would run
empty, to 20ms, and so on. If we notice that we needed much less time
than we estimated, we can halve this value again. This
adaptive scheme makes sure that in the unlikely event of a buffer
underrun it will most likely happen only once and never again.
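
The adaptation itself boils down to something like this. Again a
hypothetical sketch, not the actual PA code; in particular the "far too
early" threshold is an assumption of mine:

    /* Adjust the wakeup margin after each buffer refill. */
    unsigned adapt_margin(unsigned margin_usec, int had_underrun,
                          unsigned fill_took_usec)
    {
        if (had_underrun)
            return margin_usec * 2;      /* woke up too late, be careful */
        if (fill_took_usec < margin_usec / 4)
            return margin_usec / 2;      /* woke up far too early, relax */
        return margin_usec;
    }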

When a new client connects or an existing client disconnects, when
a client wants to rewrite what it already wrote, or when the user
wants to change the volume of one of the streams, PA will
resample the data passed by the client, convert it to the proper
hardware sample type, and remix it with the data of the other
clients. This of course makes it necessary to keep a “history” of data
for all clients around, so that if one client requests a
rewrite we have the data needed to remix what was already
mixed before.
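
Conceptually, this means PA keeps per-stream state roughly like the
following; the names are made up for illustration, the real structures
in the PA codebase look quite different:

    /* Hypothetical per-stream state needed to service rewinds. */
    typedef struct ring_buffer ring_buffer;  /* recently mixed data */
    typedef struct resampler resampler;

    struct stream {
        ring_buffer *history;  /* kept around so rewinds can be remixed */
        resampler *conv;       /* client sample format -> hw format */
        double volume;         /* per-stream volume, applied at mix time */
    };

On a rewrite request, PA rewinds the hardware write pointer as far as it
can and remixes the overlapping region from every stream’s history.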

The benefits of this model are manifold:

  • We minimize the overall number of interrupts, down to what the
    latency requirements of the connected clients allow us, i.e. we save
    power and don’t show up in powertop anymore for normal music playback.
  • We maximize drop-out safety, because we buffer up to 2s in the
    usual cases. Only on operating systems with scheduling
    latencies > 2s could we still get drop-outs. Thankfully no
    operating system is that bad.
  • In the event of an underrun we don’t get stuck in it, but instead
    are able to recover quickly and can make sure it doesn’t happen again.
  • We provide “zero latency”. Each client can rewrite its playback
    buffer at any time, and this is forwarded to the hardware, even if
    this means that the sample currently being played needs to be
    rewritten. This means much quicker reaction to user input and a more
    responsive user experience.
  • We become much less dependent on what the sound hardware provides
    us with. We can configure wakeup times that are independent of the
    fragment settings that the hardware actually supports.
  • We can provide almost any latency a client might request,
    dynamically without reconfiguration, without discontinuities in
    audio.

Of course, this scheme also comes with major complications:

  • System timers and sound card timers deviate, on many sound cards
    by quite a bit. Also, not all sound cards allow the user to query the
    playback frame index at any time, but only shortly after each IRQ. To
    compensate for this deviation, PA contains a non-trivial algorithm
    which tries to estimate and follow the deviation over time (a toy
    version of such an estimator is sketched after this list). If this
    doesn’t work properly, an underrun might happen much earlier than we
    expected.
  • System timers on Unix are not very high precision. On traditional
    Linux with HZ=100, sleep times for timers are rounded up to multiples
    of 10ms. Only very recent Linux kernels with hrtimers can
    provide something better, and until now only on x86 and x86-64. This
    makes the whole scheme unusable for low-latency setups unless you run
    the very latest Linux. Also, hrtimers are not (yet) exposed in
    poll()/select(). It requires major jumping through hoops to
    work around this limitation.
  • We need to keep a history of sample data for each stream around,
    which increases the memory footprint and potentially the cache
    pressure. PA works against this by doing zero-copy memory
    management.
  • We’re still dependent on the maximum playback buffer size the
    sound hardware supports. Many sound cards don’t even support 2s, but only
    300ms or so.
  • The rewriting of the client buffers, causing rewriting of the
    hardware buffer, complicates the resampling/converting step
    immensely. In general the code to implement this model is more complex
    than for the traditional model. Also, ALSA has not really been
    designed with this model in mind, which makes some things very hard
    to get right, and suboptimal.
  • Generally, this works reliably only with the newest ALSA, the newest
    kernel, the newest everything. It has pretty steep requirements on
    software and sometimes even on hardware. To stay compatible with
    systems that don’t fulfill these requirements, we need to carry
    around code for the traditional playback model as well, considerably
    increasing the code base.
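
As a rough illustration of the first complication above, the deviation
between the two clocks can be tracked with something as simple as an
exponentially smoothed ratio. This is just a toy sketch; the estimator
actually used in PA is considerably smarter:

    /* Toy estimator for the drift between sound card and system clock. */
    static double ratio = 1.0;  /* card time units per system time unit */

    void update_ratio(double card_delta, double sys_delta)
    {
        ratio = 0.9 * ratio + 0.1 * (card_delta / sys_delta);
    }

    /* When arming the wakeup timer, scale sleep times by the estimated
     * ratio so that system timers follow the card's clock. */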

The advantages of the scheme clearly outweigh the complexities it
causes. Especially the power-saving features of glitch-free PA should
be enough reason for the embedded Linux people to adopt it
quickly. Make PA disappear from powertop even if you play music!

The code in the glitch-free branch is still rough and sometimes
incomplete. I will merge it into trunk shortly and then
upload a snapshot to Rawhide.

I hope this text also explains a little better to the few remaining PA
haters why PA is a good thing, and why everyone should have it
on his Linux desktop. Of course these changes are not visible on the
surface; my hope with this blog story is to explain a bit better why
infrastructure matters, and to counter misconceptions about what PA
actually is and what it gives you on top of ALSA.

BOSSA 2008

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/bossa-2008.html

Just three words: awesome awesome awesome.

And for those asking for it, here are my slides, in which I try to
explain the new “glitch-free” audio scheduling core of PulseAudio that
I recently committed to the glitch-free branch in PA SVN. I also try to
make clear why this functionality is practically a *MUST* for all
people who want low-latency audio, minimal power consumption and
maximum drop-out safety for their audio playback, and thus why all
those fancy embedded Linux devices had better adopt it sooner rather
than later. The slides might appear a bit terse if you don’t have that
awesome guy they usually come with presenting them to you.

Avahi on your N800

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/avahi-n800.html

I’d love to see proper Avahi support in the Nokia N800 (just think of proper
file manager integration of announced WebDAV shares!), but so far Nokia
doesn’t ship Avahi in Maemo. However, there’s now a simple way to install at
least basic Avahi support on the N800: the INdT includes Avahi in their Canola builds. Hence, just install
Canola and your N800 will register itself via mDNS on your network.

In related news: I am happy to see that Avahi has apparently been included in the just announced GNOME Embedded Platform.

avahi-autoipd Released and ‘State of the Lemur’

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/avahi-0.6.14.html

A few minutes ago I released Avahi 0.6.14,
which besides other minor fixes and cleanups includes a new component, avahi-autoipd.
This new daemon is an implementation of IPv4LL (aka RFC 3927, aka
APIPA), a method for acquiring link-local IP addresses (those from the range
169.254/16) without a central server such as DHCP.
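
The gist of the protocol: pick a pseudo-random address from the valid
part of that range (169.254.1.0 to 169.254.254.255), probe for it via
ARP, and pick another one on conflict; RFC 3927 suggests seeding the
generator from the MAC address so that a host tends to ask for the same
address on every boot. Here is a hypothetical sketch of the selection
step, not avahi-autoipd’s actual code (and a real implementation would
avoid the modulo bias):

    #include <stdint.h>
    #include <stdlib.h>

    /* Pick a candidate IPv4LL address out of
     * 169.254.1.0 .. 169.254.254.255 (0xFE00 possible host parts). */
    static uint32_t pick_ipv4ll(unsigned *seed)
    {
        uint32_t host = 0x0100 + (uint32_t) rand_r(seed) % 0xFE00;
        return (169u << 24) | (254u << 16) | host;
    }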

Yes, there are already plenty of Free implementations of this protocol
available. However, this one tries to do it right and integrates well with the
rest of Avahi. For a longer rationale for adding this tool to our distribution
instead of relying on external tools, please read this mailing list thread.

It is my hope that this tool is quickly adopted by the popular
distributions, which will allow Linux to finally catch up with technology that
has been available in Windows systems since Win98 times. If you’re a
distributor please follow these
notes
which describe how to integrate this new tool into your distribution
best.

Because avahi-autoipd acts as a dhclient plug-in by default,
and only activates itself as a last resort for acquiring an IP address, I hope
that it will get in the user’s way much less than previous implementations
of this technology for Linux.

State of the Lemur

Almost 22 months after my first SVN commit to the flexmdns source code
repository (flexmdns was the name I chose for my mDNS implementation
when I first started to work on it), 18 months after Trent and I decided
to join our two projects under the name “Avahi”, and 12 months after the
release of Avahi 0.1, it’s time for a little “State of the Lemur” post.

To make it short: Avahi is ubiquitous in the Free Software world. 😉

All major (Debian, Ubuntu, Fedora, Gentoo, Mandriva, OpenSUSE) and many
minor distributions have it. A quick Google-based poll I did a few weeks ago
shows that it is part of at least 19 different distributions,
including a range of embedded ones. The list of applications
making native use of the Avahi client API is growing, currently counting 31
items. That list does not include the legacy HOWL applications and the
applications that use our Bonjour compatibility API, both of which can run on
top of Avahi, hence the real number of applications that can make use of
Avahi is slightly higher. The first commercial hardware appliances which
include Avahi are slowly appearing on the market. I know of at least three
such products, one being Bubba.

If you package Avahi for a distribution, add Avahi support to an
application, or build a hardware appliance with Avahi, please make sure to add
an item to the respective lists linked above; they’re Wikis. Thank you!
(Registration is required, though it’s anonymous and needs no mail address.)

Avahi 0.6.13 released

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/avahi-0.6.13.html

Avahi Logo

I am happy to bring you yet another release of Avahi, everyone’s favourite Zeroconf stack.

  • Add a new D-Bus method for changing the mDNS host name during
    runtime. This functionality is only available to members of the
    UNIX group “netdev”, which is the same access group that is
    enforced by GNOME’s NetworkManager daemon. Since NM will probably
    be the most prominent user of this new method, we decided to limit
    access to the same group. The access group can be set by passing
    --with-avahi-priv-access-group= to “configure”. If you need more
    sophisticated access control you can freely edit
    /etc/dbus/system.d/avahi-dbus.conf.
  • Add a new utility “avahi-set-host-name” which is a command line
    wrapper around the aforementioned SetHostName() method (see the
    example after this list).
  • Bonjour API compatibility library:
    • Implement DNSServiceUpdateRecord()
    • Allow passing NULL as callback function for DNSServiceRegister()
    • Implement subtype registration in DNSServiceRegister() in a
      way that is compatible with Bonjour.
    • Update to newer copy of dns_sd.h
  • If the host name changes, update the names of static services which
    contain wildcards.
  • Don’t build documentation about embedding the Avahi mDNS stack into
    other programs by default. This is a feature used only by embedded
    developers. Pass --enable-core-docs to “configure” to enable
    building these docs, like in Avahi <= 0.6.12.
  • Build Qt documentation only when Qt support is enabled in
    the configuration. Same for GLib.
  • Change the algorithm used to find a new host name on conflict. In
    Avahi <= 0.6.12 a conflicting host name of “foobar” would be
    changed to the new name “foobar2”. With 0.6.13, “foobar-2” will be
    picked instead. This follows Bonjour’s behaviour and has the
    advantage of not confusing people with regular host names ending in
    digits.
  • Don’t disable all static services when SIGHUP is received.
  • Fix the build when Avahi is configured without Gtk+ but with Python
    support.
  • Fix the build on MacOS X.
  • Support using Solaris DBM instead of gdbm for the service type
    database. The latter is still recommended.
  • Minor other fixes and documentation updates.
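
For instance, the following two invocations should be equivalent ways of
changing the mDNS host name; the dbus-send line is a sketch from memory,
so double-check the interface against the installed introspection data:

    avahi-set-host-name mybox

    dbus-send --system --print-reply --dest=org.freedesktop.Avahi \
        / org.freedesktop.Avahi.Server.SetHostName string:mybox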

The relevant NetworkManager bug about SetHostName() is #352828.

And our bug tracker is back to only two open bugs for Avahi. That’s a good feeling, I can tell you!