
Why systemd?

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/why.html

systemd is still a young project, but it is not a baby anymore. I posted the initial announcement precisely a year ago. Since then most of the big distributions have decided to adopt it in one way or another, and many smaller distributions have already switched. The first big distribution with systemd by default will be Fedora 15, due at the end of May. It is expected that the others will follow its lead a bit later (with one exception). Many embedded developers have already adopted it too, and there is even a company specializing in engineering and consulting services for systemd. In short: within one year systemd has become a really successful project.

However, there are still folks whom we haven’t won over yet. If you fall into one of the following categories, please have a look at the comparison of init systems below:

  • You are working on an embedded project and are wondering whether
    it should be based on systemd.
  • You are a user or administrator and wondering which distribution
    to pick, and are pondering whether it should be based on systemd or
    not.
  • You are a user or administrator and wondering why your favourite
    distribution has switched to systemd, if everything already worked so
    well before.
  • You are developing a distribution that hasn’t switched yet, and
    you are wondering whether to invest the work and go systemd.

And even if you don’t fall into any of these categories, you might still
find the comparison interesting.

We’ll be comparing the three most relevant init systems for Linux:
sysvinit, Upstart and systemd. Of course there are other init systems
in existence, but they play virtually no role in the big
picture. Unless you run Android (which is a completely different beast
anyway), you’ll almost definitely run one of these three init systems
on your Linux kernel. (OK, or busybox, but then you are basically not
running any init system at all.) Unless you have a soft spot for
exotic init systems there’s little need to look further. Also, I am
kinda lazy, and don’t want to spend the time analyzing those other
systems in enough detail to be completely fair to them.

Speaking of fairness: I am of course one of the creators of
systemd. I will try my best to be fair to the other two contenders,
but in the end, take it with a grain of salt. I am sure, though, that should I be grossly unfair or otherwise incorrect somebody will point it out in the comments of this story, so consider having a look at those before you put too much trust in what I say.

We’ll look at the currently implemented features in a released
version. Grand plans don’t count.

General Features

Feature | sysvinit | Upstart | systemd
Interfacing via D-Bus | no | yes | yes
Shell-free bootup | no | no | yes
Modular C coded early boot services included | no | no | yes
Read-Ahead | no | no [1] | yes
Socket-based Activation | no | no [2] | yes
Socket-based Activation: inetd compatibility | no | no [2] | yes
Bus-based Activation | no | no [3] | yes
Device-based Activation | no | no [4] | yes
Configuration of device dependencies with udev rules | no | no | yes
Path-based Activation (inotify) | no | no | yes
Timer-based Activation | no | no | yes
Mount handling | no | no [5] | yes
fsck handling | no | no [5] | yes
Quota handling | no | no | yes
Automount handling | no | no | yes
Swap handling | no | no | yes
Snapshotting of system state | no | no | yes
XDG_RUNTIME_DIR Support | no | no | yes
Optionally kills remaining processes of users logging out | no | no | yes
Linux Control Groups Integration | no | no | yes
Audit record generation for started services | no | no | yes
SELinux integration | no | no | yes
PAM integration | no | no | yes
Encrypted hard disk handling (LUKS) | no | no | yes
SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agents | no | no | yes
Network Loopback device handling | no | no | yes
binfmt_misc handling | no | no | yes
System-wide locale handling | no | no | yes
Console and keyboard setup | no | no | yes
Infrastructure for creating, removing, cleaning up of temporary and volatile files | no | no | yes
Handling for /proc/sys sysctl | no | no | yes
Plymouth integration | no | yes | yes
Save/restore random seed | no | no | yes
Static loading of kernel modules | no | no | yes
Automatic serial console handling | no | no | yes
Unique Machine ID handling | no | no | yes
Dynamic host name and machine meta data handling | no | no | yes
Reliable termination of services | no | no | yes
Early boot /dev/log logging | no | no | yes
Minimal kmsg-based syslog daemon for embedded use | no | no | yes
Respawning on service crash without losing connectivity | no | no | yes
Gapless service upgrades | no | no | yes
Graphical UI | no | no | yes
Built-In Profiling and Tools | no | no | yes
Instantiated services | no | yes | yes
PolicyKit integration | no | no | yes
Remote access/Cluster support built into client tools | no | no | yes
Can list all processes of a service | no | no | yes
Can identify service of a process | no | no | yes
Automatic per-service CPU cgroups to even out CPU usage between them | no | no | yes
Automatic per-user cgroups | no | no | yes
SysV compatibility | yes | yes | yes
SysV services controllable like native services | yes | no | yes
SysV-compatible /dev/initctl | yes | no | yes
Reexecution with full serialization of state | yes | no | yes
Interactive boot-up | no [6] | no [6] | yes
Container support (as advanced chroot() replacement) | no | no | yes
Dependency-based bootup | no [7] | no | yes
Disabling of services without editing files | yes | no | yes
Masking of services without editing files | no | no | yes
Robust system shutdown within PID 1 | no | no | yes
Built-in kexec support | no | no | yes
Dynamic service generation | no | no | yes
Upstream support in various other OS components | yes | no | yes
Service files compatible between distributions | no | no | yes
Signal delivery to services | no | no | yes
Reliable termination of user sessions before shutdown | no | no | yes
utmp/wtmp support | yes | yes | yes
Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management tools | no | no | yes

[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.

[2] Socket activation implementation for Upstart available as preview, lacks parallelization support hence entirely misses the point of socket activation.

[3] Bus activation implementation for Upstart posted as patch, not merged.

[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.

[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.

[6] Some distributions offer this implemented in shell.

[7] LSB init scripts support this, if they are used.

Available Native Service Settings

Setting | sysvinit | Upstart | systemd
OOM Adjustment | no | yes [1] | yes
Working Directory | no | yes | yes
Root Directory (chroot()) | no | yes | yes
Environment Variables | no | yes | yes
Environment Variables from external file | no | no | yes
Resource Limits | no | some [2] | yes
umask | no | yes | yes
User/Group/Supplementary Groups | no | no | yes
IO Scheduling Class/Priority | no | no | yes
CPU Scheduling Nice Value | no | yes | yes
CPU Scheduling Policy/Priority | no | no | yes
CPU Scheduling Reset on fork() control | no | no | yes
CPU affinity | no | no | yes
Timer Slack | no | no | yes
Capabilities Control | no | no | yes
Secure Bits Control | no | no | yes
Control Group Control | no | no | yes
High-level file system namespace control: making directories inaccessible | no | no | yes
High-level file system namespace control: making directories read-only | no | no | yes
High-level file system namespace control: private /tmp | no | no | yes
High-level file system namespace control: mount inheritance | no | no | yes
Input on Console | yes | yes | yes
Output on Syslog | no | no | yes
Output on kmsg/dmesg | no | no | yes
Output on arbitrary TTY | no | no | yes
Kill signal control | no | no | yes
Conditional execution: by identified CPU virtualization/container | no | no | yes
Conditional execution: by file existence | no | no | yes
Conditional execution: by security framework | no | no | yes
Conditional execution: by kernel command line | no | no | yes

[1] Upstart supports only the deprecated oom_adj mechanism, not the current oom_score_adj logic.

[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.

Note that some of these options can be added to SysV init scripts relatively easily by editing the shell sources. The table above focuses on easily accessible options that do not require source code editing.
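
For illustration, here is a hedged sketch of a service file using a handful of the settings from the table above. The service name exampled and all values are invented for the example; the directive names are the ones systemd documents in the systemd.exec(5) man page:

[Unit]
Description=Example daemon with a few execution settings

[Service]
ExecStart=/usr/bin/exampled
User=exampled
Group=exampled
Nice=5
IOSchedulingClass=idle
CPUAffinity=0 1
UMask=0027
LimitNOFILE=16384
Environment=EXAMPLE_OPTS=--verbose
EnvironmentFile=-/etc/sysconfig/exampled
ReadOnlyDirectories=/var
InaccessibleDirectories=/home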

Miscellaneous

Criterion | sysvinit | Upstart | systemd
Maturity | > 15 years | 6 years | 1 year
Specialized professional consulting and engineering services available | no | no | yes
SCM | Subversion | Bazaar | git
Copyright-assignment-free contributing | yes | no | yes

Summary

As the tables above hopefully show in all clarity, systemd has left behind both sysvinit and Upstart in almost every aspect. With the exception of the project’s age/maturity, systemd wins in every category. At this point in time it will be very hard for sysvinit and Upstart to catch up with the features systemd provides today. In one year we managed to push systemd forward much further than Upstart has been pushed in six.

It is our intention to drive forward the development of the Linux platform with systemd. In the next release cycle we will focus more strongly on bringing the same features and speed improvements we already offer for the system to the user login session. This will bring much closer integration with the other parts of the OS and applications, making the most of the features the service manager provides and making them available to login sessions. Certain components such as ConsoleKit will be made redundant by these upgrades, and services relying on them will be updated. The burden of maintaining these then-obsolete components will be passed on to the vendors who plan to continue to rely on them.

If you are wondering whether or not to adopt systemd, then systemd
obviously wins when it comes to mere features. Of course that should
not be the only aspect to keep in mind. In the long run, sticking with
the existing infrastructure (such as ConsoleKit) comes at a price:
porting work needs to take place, and additional maintenance work for bitrotting code needs to be done. Going it alone means increased
workload.

That said, adopting systemd is also not free. Especially if you
have made investments in the other two solutions, adopting systemd means work. The basic work to adopt systemd is relatively minimal for
porting over SysV systems (since compatibility is provided), but can
mean substantial work when coming from Upstart. If you plan to go for
a 100% systemd system without any SysV compatibility (recommended for
embedded, long run goal for the big distributions) you need to be
willing to invest some work to rewrite init scripts as simple systemd
unit files.
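
To give a rough feel for that work, here is a hedged sketch of what a typical SysV init script might boil down to as a native unit file. The daemon name, path and command-line flag are invented for illustration; Type=simple, network.target and multi-user.target are standard systemd names:

[Unit]
Description=Example daemon formerly started by /etc/init.d/exampled
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/exampled --no-daemonize

[Install]
WantedBy=multi-user.target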

systemd is in the process of becoming a comprehensive, integrated
and modular platform providing everything needed to bootstrap and
maintain an operating system’s userspace. It includes C rewrites of
all basic early boot init scripts that are shipped with the various
distributions. Especially for the embedded case adopting systemd
provides you in one step with almost everything you need, and you can
pick the modules you want. The other two init systems are individual components which, to be useful, need a great number of additional components with differing interfaces. systemd’s emphasis on providing a platform instead of just a component allows for closer integration and cleaner APIs. Sooner or later this will
trickle up to the applications. Already, there are accepted XDG
specifications (e.g. XDG basedir spec, more specifically
XDG_RUNTIME_DIR) that are not supported on the other init systems.

systemd is also a big opportunity for Linux standardization. Since it standardizes many interfaces of the system that previously differed on every distribution and in every implementation, adopting it helps to work against the balkanization of the Linux interfaces. Choosing systemd means defining more closely what the Linux platform is about. This improves the lives of programmers, users and administrators alike.

I believe that momentum is clearly with systemd. We invite you to
join our community and be part of that momentum.

systemd for Administrators, Part VI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/changing-roots.html

Here’s another installment of my ongoing series on systemd for Administrators:

Changing Roots

As administrator or developer, sooner or later you’ll encounter chroot() environments. The chroot() system call simply shifts what a process and all its children consider the root directory /, thus limiting what the process can see of the file hierarchy to a subtree of it. Primarily, chroot() environments have two uses:

  1. For security purposes: In this use a specific isolated daemon is
    chroot()ed into a private subdirectory, so that when exploited the
    attacker can see only the subdirectory instead of the full OS
    hierarchy: he is trapped inside the chroot() jail.
  2. To set up and control a debugging, testing, building, installation
    or recovery image of an OS: For this a whole guest operating
    system hierarchy is mounted or bootstrapped into a subdirectory of the
    host OS, and then a shell (or some other application) is started
    inside it, with this subdirectory turned into its /. To the shell it
    appears as if it was running inside a system that can differ greatly
    from the host OS. For example, it might run a different distribution
    or even a different architecture (Example: host x86_64, guest
    i386). The full hierarchy of the host OS it cannot see.

On a classic System-V-based operating system it is relatively easy
to use chroot() environments. For example, to start a specific daemon
for test or other reasons inside a chroot()-based guest OS tree, mount
/proc, /sys and a few other API file systems into
the tree, and then use chroot(1) to enter the chroot, and
finally run the SysV init script via /sbin/service from
inside the chroot.
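
In other words, a hedged sketch along these lines (the chroot path and service name are invented):

# mount -t proc proc /srv/chroot/foobar/proc
# mount -t sysfs sysfs /srv/chroot/foobar/sys
# chroot /srv/chroot/foobar /sbin/service foobard start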

On a systemd-based OS things are not that easy anymore. One of the
big advantages of systemd is that all daemons are guaranteed to be
invoked in a completely clean and independent context which is in no
way related to the context of the user asking for the service to be
started. While in sysvinit-based systems a large part of the execution
context (like resource limits, environment variables and suchlike) is
inherited from the user shell invoking the init script, in systemd the
user just notifies the init daemon, and the init daemon will then fork
off the daemon in a sane, well-defined and pristine execution context
and no inheritance of the user context parameters takes place. While
this is a formidable feature it actually breaks traditional approaches
to invoke a service inside a chroot() environment: since the actual
daemon is always spawned off PID 1 and thus inherits the chroot()
settings from it, it is irrelevant whether the client which asked for
the daemon to start is chroot()ed or not. On top of that, since
systemd actually places its local communications sockets in
/run/systemd a process in a chroot() environment will not even
be able to talk to the init system (which however is probably a good thing, and the
daring can work around this of course by making use of bind
mounts.)

This of course raises the question of how to use chroot()s properly in
a systemd environment. And here’s what we came up with for you, which
hopefully answers this question thoroughly and comprehensively:

Let’s cover the first usecase first: locking a daemon into a
chroot() jail for security purposes. To begin with, chroot() as a
security tool is actually quite dubious, since chroot() is not a
one-way street. It is relatively easy to escape a chroot()
environment, as even the man page points out. Only in combination with a few other techniques can it be made somewhat secure. Due to that it usually
requires specific support in the applications to chroot() themselves
in a tamper-proof way. On top of that it usually requires a deep
understanding of the chroot()ed service to set up the chroot()
environment properly, for example to know which directories to bind mount from
the host tree, in order to make available all communication channels
in the chroot() the service actually needs. Putting this together,
chroot()ing software for security purposes is almost always done best
in the C code of the daemon itself. The developer knows best (or at least should know best) how to properly lock down the chroot(), and what the minimal set of files, file systems and directories is that the daemon will need inside the chroot(). These days a
number of daemons are capable of doing this, unfortunately however of
those running by default on a normal Fedora installation only two are
doing this: Avahi and
RealtimeKit. Both apparently written by the same really smart
dude. Chapeau! 😉 (Verify this easily by running ls -l /proc/*/root on your system.)

That all said, systemd of course does offer you a way to chroot()
specific daemons and manage them like any other with the usual
tools. This is supported via the RootDirectory= option in
systemd service files. Here’s an example:

[Unit]
Description=A chroot()ed Service

[Service]
RootDirectory=/srv/chroot/foobar
ExecStartPre=/usr/local/bin/setup-foobar-chroot.sh
ExecStart=/usr/bin/foobard
RootDirectoryStartOnly=yes

In this example, RootDirectory= configures where to
chroot() to before invoking the daemon binary specified with
ExecStart=. Note that the path specified in
ExecStart= needs to refer to the binary inside the chroot();
it is not a path to the binary in the host tree (i.e. in this example
the binary executed is seen as
/srv/chroot/foobar/usr/bin/foobard from the host OS). Before
the daemon is started a shell script setup-foobar-chroot.sh
is invoked, whose purpose it is to set up the chroot environment as
necessary, i.e. mount /proc and similar file systems into it,
depending on what the service might need. With the
RootDirectoryStartOnly= switch we ensure that only the daemon
as specified in ExecStart= is chrooted, but not the
ExecStartPre= script which needs to have access to the full
OS hierarchy so that it can bind mount directories from there. (For
more information on these switches see the respective man
pages.)
If you place a unit file like this in
/etc/systemd/system/foobar.service you can start your
chroot()ed service by typing systemctl start foobar.service. You may then introspect it with systemctl status foobar.service. It is accessible to the administrator like any other service; the fact that it is chroot()ed does not — unlike on SysV — alter how your monitoring and control tools interact with it.
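
For completeness, here is a minimal sketch of what a setup script such as setup-foobar-chroot.sh could do. What exactly needs to be mounted depends entirely on the service, so treat this as illustrative only:

#!/bin/sh
# Illustrative sketch of setup-foobar-chroot.sh; adjust to what foobard actually needs.
set -e
ROOT=/srv/chroot/foobar

mkdir -p "$ROOT/proc" "$ROOT/sys" "$ROOT/dev"
# Mount the API file systems and bind-mount /dev from the host, but only once.
mountpoint -q "$ROOT/proc" || mount -t proc proc "$ROOT/proc"
mountpoint -q "$ROOT/sys"  || mount -t sysfs sysfs "$ROOT/sys"
mountpoint -q "$ROOT/dev"  || mount --bind /dev "$ROOT/dev"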

Newer Linux kernels support file system namespaces. These are
similar to chroot() but a lot more powerful, and they do not suffer from the same security problems as chroot(). systemd exposes a subset of what you can do with file system namespaces right in the unit files themselves. Often these are a useful and simpler alternative to setting up a full chroot() environment in a subdirectory. With the switches ReadOnlyDirectories= and InaccessibleDirectories= you may set up a file system namespace jail for your service. Initially, it will be identical to
your host OS’ file system namespace. By listing directories in these
directives you may then mark certain directories or mount points of
the host OS as read-only or even completely inaccessible to the
daemon. Example:

[Unit]
Description=A Service With No Access to /home

[Service]
ExecStart=/usr/bin/foobard
InaccessibleDirectories=/home

This service will have access to the entire file system tree of the
host OS with one exception: /home will not be visible to it, thus
protecting the user’s data from potential attackers. (See the man page for details on these options.)
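
A hedged variant combining both directives might look like this (the directory choices are purely illustrative):

[Unit]
Description=A Service With a Read-Only /etc and No Access to /home

[Service]
ExecStart=/usr/bin/foobard
ReadOnlyDirectories=/etc
InaccessibleDirectories=/home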

File system namespaces are in fact a better replacement for
chroot()s in many many ways. Eventually Avahi and RealtimeKit
should probably be updated to make use of namespaces replacing
chroot()s.

So much for the security use case. Now, let’s look at the other
use case: setting up and controlling OS images for debugging, testing,
building, installing or recovering.

chroot() environments are relatively simple things: they only
virtualize the file system hierarchy. By chroot()ing into a
subdirectory a process still has complete access to all system calls,
can kill all processes and shares just about everything else with the host
it is running on. To run an OS (or a small part of an OS) inside a
chroot() is hence a dangerous affair: the isolation between host and
guest is limited to the file system, everything else can be freely
accessed from inside the chroot(). For example, if you upgrade a
distribution inside a chroot(), and the package scripts send a SIGTERM
to PID 1 to trigger a reexecution of the init system, this will
actually take place in the host OS! On top of that, SysV shared
memory, abstract namespace sockets and other IPC primitives are shared
between host and guest. While a completely secure isolation for
testing, debugging, building, installing or recovering an OS is
probably not necessary, a basic isolation to avoid accidental
modifications of the host OS from inside the chroot() environment is
desirable: you never know what code package scripts execute which
might interfere with the host OS.

To deal with chroot() setups for this use systemd offers you a
couple of features:

First of all, systemctl detects when it is run in a
chroot. If so, most of its operations will become NOPs, with the
exception of systemctl enable and systemctl disable. Hence, if a package installation script calls these two commands, services will be enabled in the guest OS. However, should a package installation script include a command like systemctl restart as part of the package upgrade process, it will have no effect at all when run in a chroot() environment.
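
Put differently, a hypothetical package post-install scriptlet along these lines behaves sanely both on a running system and inside a chroot() (the service name is invented):

#!/bin/sh
# Hypothetical post-install scriptlet, illustrative only.
systemctl enable foobar.service    # takes effect in the guest OS, even inside a chroot
systemctl restart foobar.service   # becomes a NOP when run inside a chroot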

More importantly however systemd comes out-of-the-box with the systemd-nspawn
tool which acts as chroot(1) on steroids: it makes use of file system
and PID namespaces to boot a simple lightweight container on a file
system tree. It can be used almost like chroot(1), except that the
isolation from the host OS is much more complete, a lot more secure
and even easier to use. In fact, systemd-nspawn is capable of booting a complete systemd or sysvinit OS in a container with a single command. Since it virtualizes PIDs, the init system in the container can act as PID 1 and thus do its job as normal. In contrast to chroot(1), this tool will implicitly mount /proc and /sys for you.

Here’s an example of how, in three commands, you can boot a Debian OS on your Fedora machine inside an nspawn container:

# yum install debootstrap
# debootstrap --arch=amd64 unstable debian-tree/
# systemd-nspawn -D debian-tree/

This will bootstrap the OS directory tree and then simply invoke a
shell in it. If you want to boot a full system in the container, use a
command like this:

# systemd-nspawn -D debian-tree/ /sbin/init

And after a quick bootup you should have a shell prompt, inside a
complete OS, booted in your container. The container will not be able
to see any of the processes outside of it. It will share the network
configuration, but not be able to modify it. (Expect a couple of
EPERMs during boot for that, which however should not be
fatal). Directories like /sys and /proc/sys are available in the container, but mounted read-only in order to prevent the container from modifying kernel or hardware configuration. Note however that this protects the host OS only from accidental changes of its parameters. A process in the container can manually remount these file systems read-writable and then change whatever it wants to change.
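
For example, nothing stops a process inside the container from simply undoing that protection:

# mount -o remount,rw /sys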

So, what’s so great about systemd-nspawn again?

  1. It’s really easy to use. No need to manually mount /proc
    and /sys into your chroot() environment. The tool will do it
    for you and the kernel automatically cleans it up when the container
    terminates.
  2. The isolation is much more complete, protecting the host OS from
    accidental changes from inside the container.
  3. It’s so good that you can actually boot a full OS in the
    container, not just a single lonesome shell.
  4. It’s actually tiny and installed everywhere systemd is installed. No complicated installation or setup.

systemd itself has been modified to work very well in such a
container. For example, when shutting down and detecting that it is
run in a container, it just calls exit() instead of reboot() as the last step.

Note that systemd-nspawn is not a full container
solution. If you need that, LXC is the better choice for
you. It uses the same underlying kernel technology but offers a lot
more, including network virtualization. If you so will,
systemd-nspawn is the GNOME 3 of container solutions:
slick and trivially easy to use — but with few configuration
options. LXC OTOH is more like KDE: more configuration options than lines of
code. I wrote systemd-nspawn specifically to cover testing,
debugging, building, installing, recovering. That’s what you should use
it for and what it is really good at, and where it is a much much nicer
alternative to chroot(1).

So, let’s get this finished; this was already long enough. Here’s what to take home from
this little blog story:

  1. Secure chroot()s are best done natively in the C sources of your program.
  2. ReadOnlyDirectories=, InaccessibleDirectories=
    might be suitable alternatives to a full chroot() environment.
  3. RootDirectory= is your friend if you want to chroot() a specific service.
  4. systemd-nspawn is made of awesome.
  5. chroot()s are lame, file system namespaces are totally l33t.

All of this is readily available on your Fedora 15 system.

And that’s it for today. See you again for the next installment.

Google Cache Prowling and Useful Firefox Security Plugins

Post Syndicated from David original http://feedproxy.google.com/~r/DevilsAdvocateSecurity/~3/fm5-ct_b1cI/google-cache-prowling-and-useful.html

I find that I often check Google’s cache of sites that have been taken down, either as part of an incident investigation or to verify that data has been removed. One of the nicer ways to do this is using a tool like Jeffrey To’s Google Cache Continue script. This joins my toolbox of existing Firefox plugins, such as:

  • NoScript – script blocking
  • URLParams – website get/post parameters
  • Firebug – editing of pages, including Javascript variables
  • FoxyProxy – a Firefox-based proxy switcher that works very well with web app testing tools
  • IETab – to pop an IE tab into a Firefox testing session
  • Leet Key – for Base64, Hex, BIN, and other transforms
  • ShowIP – shows the current site’s actual IP address, as well as enabling a number of other useful host lookup tools

You can find a whole relation-mapped list of Firefox plugins in the FireCAT listing – enjoy!


Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one-sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, but only one of them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait that
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start-up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.
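
To make this a bit more tangible, here is a tiny, hypothetical sketch (not systemd’s actual code; the socket path and daemon binary are made up for the example) of what the init side does for a single service: create the listening AF_UNIX socket first, then hand it to the daemon it spawns as file descriptor 3.

/* Hypothetical sketch: pre-create a listening socket, then hand it
 * to the daemon during exec(). Error handling kept minimal. */
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void) {
        struct sockaddr_un sa;
        int fd;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, "/var/run/example.sock", sizeof(sa.sun_path) - 1);

        unlink(sa.sun_path);
        if (bind(fd, (struct sockaddr*) &sa, sizeof(sa)) < 0 ||
            listen(fd, SOMAXCONN) < 0) { perror("bind/listen"); return 1; }

        /* From now on connect() succeeds and requests queue up in the
         * kernel, even though the daemon has not been started yet. */

        if (fork() == 0) {
                /* Child: make the socket available as fd 3, then exec
                 * the daemon (the binary path is just an example). */
                dup2(fd, 3);
                execl("/usr/sbin/example-daemon", "example-daemon", (char*) NULL);
                _exit(127);
        }

        /* A real init would of course keep supervising the child here. */
        close(fd);
        return 0;
}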

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not to be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video
where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.
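
For illustration only, here is a hedged sketch using plain libdbus of what triggers this on the client side: explicitly asking the bus to activate a name. An ordinary method call to an activatable name has the same effect; the service name below is just an example.

/* Sketch (assumes libdbus-1): explicitly ask the bus to activate a
 * service by name. A normal method call to an activatable name would
 * trigger the same logic implicitly.
 * Build with the flags from: pkg-config --cflags --libs dbus-1 */
#include <dbus/dbus.h>
#include <stdio.h>

int main(void) {
        DBusError error;
        DBusConnection *bus;
        dbus_uint32_t result = 0;

        dbus_error_init(&error);

        bus = dbus_bus_get(DBUS_BUS_SYSTEM, &error);
        if (!bus) {
                fprintf(stderr, "Failed to connect: %s\n", error.message);
                return 1;
        }

        if (!dbus_bus_start_service_by_name(bus, "org.freedesktop.Avahi", 0, &result, &error)) {
                fprintf(stderr, "Activation failed: %s\n", error.message);
                return 1;
        }

        /* result is DBUS_START_REPLY_SUCCESS or DBUS_START_REPLY_ALREADY_RUNNING */
        printf("Activation reply: %u\n", result);
        return 0;
}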

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at
the serialization graphs of the boot process
of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system finished fsck and quota
due to normal boot-up we replace it by the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few case does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known Control
Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups have been introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
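
As a hedged illustration (the mount point and group name are assumptions, and a cgroup hierarchy must already be mounted there), this is roughly all it takes to put a process, and everything it will ever fork, into a named group:

/* Sketch, not systemd code: place the current process (and hence all
 * its future children) into a named control group. Assumes a cgroup
 * hierarchy is already mounted at CG_ROOT; the real mount point
 * varies between distributions. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define CG_ROOT "/sys/fs/cgroup/debug"   /* hypothetical mount point */

int main(void) {
        char path[256];
        FILE *f;

        /* Create the group; the directory name is the group name. */
        snprintf(path, sizeof(path), "%s/example.service", CG_ROOT);
        if (mkdir(path, 0755) < 0)
                perror("mkdir");              /* may already exist */

        /* Add ourselves; every child we fork() stays in this group. */
        snprintf(path, sizeof(path), "%s/example.service/tasks", CG_ROOT);
        f = fopen(path, "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "%d\n", getpid());
        fclose(f);

        /* Membership of any process can later be read back from
         * /proc/$PID/cgroup, which is how a babysitter finds strays. */
        return 0;
}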

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate‘s updatedb on system interactivity.
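
Since glibc offers no wrapper for ioprio_set(), this is roughly what such a call looks like in practice. A sketch, with the few constants it needs copied in as defined in the kernel headers:

/* Sketch: move the calling process into the idle IO scheduling class,
 * so it only gets disk bandwidth nobody else wants. glibc has no
 * wrapper for ioprio_set(), so we go through syscall() and define the
 * few constants we need ourselves (values as in the kernel headers). */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_CLASS_SHIFT 13

int main(void) {
        int prio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;

        /* 0 as the "who" argument means "the calling process". */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
                perror("ioprio_set");
                return 1;
        }
        return 0;
}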

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do, in a fashion similar to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).
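
The trick behind such a read-only overlay is nothing more than a bind mount that is then remounted read-only. A hedged sketch (the path is only an example, this needs privileges, and on hosts with shared mount propagation the tree may first need to be marked private):

/* Sketch: make /var appear read-only to this process tree by bind
 * mounting it onto itself and remounting the bind mount read-only.
 * Needs root; combined with a private mount namespace the rest of
 * the system is unaffected. The path is only an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <stdio.h>

int main(void) {
        /* Private mount namespace, so our changes stay local. */
        if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

        if (mount("/var", "/var", NULL, MS_BIND, NULL) < 0) {
                perror("bind mount");
                return 1;
        }
        if (mount(NULL, "/var", NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) < 0) {
                perror("remount read-only");
                return 1;
        }

        /* Any write below /var now fails with EROFS for us and our children. */
        return 0;
}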

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.
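
Wiring that up is cheap. A sketch of what the babysitter can do right before exec()ing the daemon (the daemon path is made up, and writing to /dev/kmsg requires appropriate privileges):

/* Sketch: connect a daemon's stdout/stderr to the kernel log buffer
 * before exec(). Each line written then shows up in dmesg. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
        int fd = open("/dev/kmsg", O_WRONLY);
        if (fd < 0) { perror("open /dev/kmsg"); return 1; }

        dup2(fd, STDOUT_FILENO);
        dup2(fd, STDERR_FILENO);
        if (fd > STDERR_FILENO)
                close(fd);

        /* The binary path is hypothetical. */
        execl("/usr/sbin/example-daemon", "example-daemon", (char*) NULL);
        perror("execl");
        return 1;
}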

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns all dependencies on their
head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo)
in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little of substance beyond Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the
code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset of
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does not much more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantages of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty2.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service avahi-daemon@eth0.service pulls in
    dbus@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme. (A hedged sketch of the daemon side of
    this handshake follows right after this list.)
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the ordering how units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    removes non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev.) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways in which such a service can
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness to init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.
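
Picking up item 8 from the list above, here is a hedged sketch of the daemon side of that handshake, written against the description above rather than taken from the promised reference implementation: verify $LISTEN_PID, parse $LISTEN_FDS, and use the first passed socket, which arrives as file descriptor 3.

/* Sketch of the daemon side of socket activation as described in
 * item 8: verify $LISTEN_PID, parse $LISTEN_FDS, and use the first
 * passed socket, which arrives as file descriptor 3. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>

#define LISTEN_FDS_START 3

int main(void) {
        const char *pid_str = getenv("LISTEN_PID");
        const char *fds_str = getenv("LISTEN_FDS");
        int n, listen_fd;

        if (!pid_str || !fds_str || (pid_t) atoi(pid_str) != getpid()) {
                /* Not socket activated: a real daemon would fall back
                 * to creating the listening socket itself here. */
                fprintf(stderr, "not socket activated\n");
                return 1;
        }

        n = atoi(fds_str);
        if (n < 1)
                return 1;

        listen_fd = LISTEN_FDS_START;  /* fds are passed starting at 3 */

        for (;;) {
                int client = accept(listen_fd, NULL, NULL);
                if (client < 0)
                        continue;
                /* ... handle the connection ... */
                close(client);
        }
}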

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services in use yet; thankfully most distributions
don’t carry many of those so far.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every monday 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, you can already run systemd just fine
as a normal user, and it will detect that it is run that way;
support for this mode has been available since the very beginning
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here’s
a tarball with unit configuration files that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup
showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers (a minimal skeleton illustrating these points appears a
bit further below):

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files
  • Grab a name on the bus
  • You may rely on systemd for logging, you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above.
  • Use SIGTERM for requesting shut downs from your daemon.

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.
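
Here is the minimal skeleton promised above. It is a hypothetical example rather than code from systemd, and it leaves out the $LISTEN_FDS handling sketched earlier: no fork(), no setsid(), no PID file, logging to stderr, clean shutdown on SIGTERM.

/* Minimal "new style" daemon skeleton following the list above:
 * no fork(), no setsid(), no PID file, logging to stderr, clean
 * shutdown on SIGTERM. Socket handling via $LISTEN_FDS is omitted
 * here; see the earlier socket activation sketch. */
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t terminating = 0;

static void on_term(int sig) {
        (void) sig;
        terminating = 1;
}

int main(void) {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_term;
        sigaction(SIGTERM, &sa, NULL);

        fprintf(stderr, "starting up\n");   /* ends up in the log via the init system */

        while (!terminating) {
                /* ... run the event loop in the process systemd started ... */
                pause();
        }

        fprintf(stderr, "shutting down\n");
        return 0;
}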

Note that systemd supports daemons not written in this style
perfectly as well, already for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (Formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so something similar as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (Very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
to write documentation, or contribute a logo.

Community

At this time we only have a source code
repository and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Measuring Lock Contention

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/mutrace.html

When naively profiling multi-threaded applications the time spent waiting
for mutexes is not necessarily visible in the generated output. However lock
contention can have a big impact on the runtime behaviour of applications. On
Linux valgrind’s drd
can be used to track down mutex contention. Unfortunately running
applications under valgrind/drd slows them down massively, often having the
effect of itself generating many of the contentions one is trying to track
down. Also due to its slowness it is very time consuming work.

To improve the situation I have now written a mutex profiler called
mutrace. In contrast to valgrind/drd it does not virtualize the
CPU instruction set, making it a lot faster. In fact, the hooks mutrace
relies on to profile mutex operations should only minimally influence
application runtime. mutrace is not useful for finding
synchronization bugs, it is solely useful for profiling locks.

Now, enough of this introductory blabla. Let’s have a look at the data
mutrace can generate for you. As an example we’ll look at
gedit as a bit of a prototypical Gnome application. Gtk+ and the other
Gnome libraries are not really known for their heavy use of multi-threading,
and the APIs are generally not thread-safe (for a good reason). However,
And as it turns out, internally subsystems such as gio do use threading quite extensively.
And as it turns out there are a few hotspots that can be discovered with
mutrace:

$ LD_PRELOAD=/home/lennart/projects/mutrace/libmutrace.so gedit
mutrace: 0.1 sucessfully initialized.

gedit is now running and its mutex use is being profiled. For this example
I have now opened a file with it, typed a few letters and then quit the program
again without saving. As soon as gedit exits mutrace will print the
profiling data it gathered to stderr. The full output you can see
here.
The most interesting part is at the end of the generated output, a
breakdown of the most contended mutexes:

mutrace: 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]   Type
      35   368268      407      275      120,822        0,000        0,894 normal
       5   234645      100       21       86,855        0,000        0,494 normal
      26   177324       47        4       98,610        0,001        0,150 normal
      19    55758       53        2       23,931        0,000        0,092 normal
      53      106       73        1        0,769        0,007        0,160 normal
      25    15156       70        1        6,633        0,000        0,019 normal
       4      973       10        1        4,376        0,004        0,174 normal
      75       68       62        0        0,038        0,001        0,004 normal
       9     1663       52        0        1,068        0,001        0,412 normal
       3   136553       41        0       61,408        0,000        0,281 normal
     ...      ...      ...      ...          ...          ...          ...    ...

mutrace: Total runtime 9678,142 ms.

(Sorry, LC_NUMERIC was set to de_DE.UTF-8, so if you can’t make sense of
all the commas, think s/,/./g!)

For each mutex a line is printed. The ‘Locked’ column tells how often the
mutex was locked during the entire runtime of about 10s. The ‘Changed’ column
tells us how often the owning thread of the mutex changed. The ‘Cont.’ column
tells us how often the lock was already taken when we tried to take it and we
had to wait. The fifth column tell us for how long during the entire runtime
the lock was locked, the sixth tells us the average lock time, and the seventh
column tells us the longest time the lock was held. Finally, the last column
tells us what kind of mutex this is (recursive, normal or otherwise).

The most contended lock in the example above is #35. 275 times during the
runtime a thread had to wait until another thread released this mutex. All in
all more then 120ms of the entire runtime (about 10s) were spent with this
lock taken!

In the full output we can now look up which mutex #35 actually is:

Mutex #35 (0x0x7f48c7057d28) first referenced by:
/home/lennart/projects/mutrace/libmutrace.so(pthread_mutex_lock+0x70) [0x7f48c97dc900]
/lib64/libglib-2.0.so.0(g_static_rw_lock_writer_lock+0x6a) [0x7f48c674a03a]
/lib64/libgobject-2.0.so.0(g_type_init_with_debug_flags+0x4b) [0x7f48c6e38ddb]
/usr/lib64/libgdk-x11-2.0.so.0(gdk_pre_parse_libgtk_only+0x8c) [0x7f48c853171c]
/usr/lib64/libgtk-x11-2.0.so.0(+0x14b31f) [0x7f48c891831f]
/lib64/libglib-2.0.so.0(g_option_context_parse+0x90) [0x7f48c67308e0]
/usr/lib64/libgtk-x11-2.0.so.0(gtk_parse_args+0xa1) [0x7f48c8918021]
/usr/lib64/libgtk-x11-2.0.so.0(gtk_init_check+0x9) [0x7f48c8918079]
/usr/lib64/libgtk-x11-2.0.so.0(gtk_init+0x9) [0x7f48c89180a9]
/usr/bin/gedit(main+0x166) [0x427fc6]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x7f48c5b42b4d]
/usr/bin/gedit() [0x4276c9]

As it appears in this Gtk+ program the rwlock type_rw_lock
(defined in glib’s gobject/gtype.c) is a hotspot. GLib’s rwlocks are
implemented on top of mutexes, so an obvious attempt in improving this could
be to actually make them use the operating system’s rwlock primitives.

If a mutex is used often but only ever by the same thread it cannot starve
other threads. The ‘Changed.’ column lists how often a specific mutex changed
the owning thread. If the number is high this means the risk of contention is
also high. The ‘Cont.’ column tells you about contention that actually took
place.

Due to the way mutrace works we cannot profile mutexes that are
used internally in glibc, such as those used for synchronizing stdio
and suchlike. mutrace is implemented entirely in userspace. It
uses all kinds of exotic GCC, glibc and kernel features, so you might have a
hard time compiling and running it on anything but a very recent Linux
distribution. I have tested it on Rawhide but it should work on slightly older
distributions, too.

Make sure to build your application with -rdynamic to make the
backtraces mutrace generates useful.

As of now, mutrace only profiles mutexes. Adding support for
rwlocks should be easy to add though. Patches welcome.

The output mutrace generates can be influenced by various
MUTRACE_xxx environment variables. See the sources for more
information.

And now, please take mutrace and profile and speed up your application!

You may find the sources in my
git repository.

Measuring Lock Contention

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/mutrace.html

When naively profiling multi-threaded applications the time spent waiting
for mutexes is not necessarily visible in the generated output. However, lock
contention can have a big impact on the runtime behaviour of applications. On
Linux, valgrind’s
drd
can be used to track down mutex contention. Unfortunately, running
applications under valgrind/drd slows them down massively, often to the point
of itself generating many of the contentions one is trying to track
down. And because of that slowness, profiling this way is very time-consuming work.

To improve the situation I have now written a mutex profiler called
mutrace. In contrast to valgrind/drd it does not virtualize the
CPU instruction set, making it a lot faster. In fact, the hooks mutrace
relies on to profile mutex operations should only minimally influence
application runtime. mutrace is not useful for finding
synchronization bugs; it is solely useful for profiling locks.

Now, enough of this introductory blabla. Let’s have a look at the data
mutrace can generate for you. As an example we’ll look at
gedit as a bit of a prototypical Gnome application. Gtk+ and the other
Gnome libraries are not really known for their heavy use of multi-threading,
and the APIs are generally not thread-safe (for a good reason). However,
internally subsystems such as gio do use threading quite extensively.
And as it turns out there are a few hotspots that can be discovered with
mutrace:

$ LD_PRELOAD=/home/lennart/projects/mutrace/libmutrace.so gedit
mutrace: 0.1 sucessfully initialized.

gedit is now running and its mutex use is being profiled. For this example
I opened a file with it, typed a few letters and then quit the program
again without saving. As soon as gedit exits, mutrace will print the
profiling data it gathered to stderr. You can see the full output
here.
The most interesting part is at the end of the generated output, a
breakdown of the most contended mutexes:

mutrace: 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]       Type
      35   368268      407      275      120,822        0,000        0,894     normal
       5   234645      100       21       86,855        0,000        0,494     normal
      26   177324       47        4       98,610        0,001        0,150     normal
      19    55758       53        2       23,931        0,000        0,092     normal
      53      106       73        1        0,769        0,007        0,160     normal
      25    15156       70        1        6,633        0,000        0,019     normal
       4      973       10        1        4,376        0,004        0,174     normal
      75       68       62        0        0,038        0,001        0,004     normal
       9     1663       52        0        1,068        0,001        0,412     normal
       3   136553       41        0       61,408        0,000        0,281     normal
     ...      ...      ...      ...          ...          ...          ...        ...

mutrace: Total runtime 9678,142 ms.

(Sorry, LC_NUMERIC was set to de_DE.UTF-8, so if you can’t make sense of
all the commas, think s/,/./g!)

For each mutex a line is printed. The ‘Locked’ column tells how often the
mutex was locked during the entire runtime of about 10s. The ‘Changed’ column
tells us how often the owning thread of the mutex changed. The ‘Cont.’ column
tells us how often the lock was already taken when we tried to take it and we
had to wait. The fifth column tells us for how long during the entire runtime
the lock was locked, the sixth tells us the average lock time, and the seventh
column tells us the longest time the lock was held. Finally, the last column
tells us what kind of mutex this is (recursive, normal or otherwise).

The most contended lock in the example above is #35. 275 times during the
runtime a thread had to wait until another thread released this mutex. All in
all, more than 120ms of the entire runtime (about 10s) were spent with this
lock taken!

In the full output we can now look up which mutex #35 actually is:

Mutex #35 (0x0x7f48c7057d28) first referenced by:
	/home/lennart/projects/mutrace/libmutrace.so(pthread_mutex_lock+0x70) [0x7f48c97dc900]
	/lib64/libglib-2.0.so.0(g_static_rw_lock_writer_lock+0x6a) [0x7f48c674a03a]
	/lib64/libgobject-2.0.so.0(g_type_init_with_debug_flags+0x4b) [0x7f48c6e38ddb]
	/usr/lib64/libgdk-x11-2.0.so.0(gdk_pre_parse_libgtk_only+0x8c) [0x7f48c853171c]
	/usr/lib64/libgtk-x11-2.0.so.0(+0x14b31f) [0x7f48c891831f]
	/lib64/libglib-2.0.so.0(g_option_context_parse+0x90) [0x7f48c67308e0]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_parse_args+0xa1) [0x7f48c8918021]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_init_check+0x9) [0x7f48c8918079]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_init+0x9) [0x7f48c89180a9]
	/usr/bin/gedit(main+0x166) [0x427fc6]
	/lib64/libc.so.6(__libc_start_main+0xfd) [0x7f48c5b42b4d]
	/usr/bin/gedit() [0x4276c9]

As it appears, in this Gtk+ program the rwlock type_rw_lock
(defined in glib’s gobject/gtype.c) is a hotspot. GLib’s rwlocks are
implemented on top of mutexes, so an obvious attempt at improving this would
be to actually make them use the operating system’s rwlock primitives.

If a mutex is used often but only ever by the same thread it cannot starve
other threads. The ‘Changed’ column lists how often a specific mutex changed
the owning thread. If the number is high, this means the risk of contention is
also high. The ‘Cont.’ column tells you about contention that actually took
place.

Due to the way mutrace works we cannot profile mutexes that are
used internally in glibc, such as those used for synchronizing stdio
and suchlike.

mutrace is implemented entirely in userspace. It
uses all kinds of exotic GCC, glibc and kernel features, so you might have a
hard time compiling and running it on anything but a very recent Linux
distribution. I have tested it on Rawhide but it should work on slightly older
distributions, too.

Make sure to build your application with -rdynamic to make the
backtraces mutrace generates useful.
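
Just to illustrate the whole workflow end to end, here is a deliberately contended toy program you could build and run under mutrace. It is purely illustrative and not part of mutrace itself; the file name, binary name and library path are made up:

/* contend.c: a deliberately contended toy program to try mutrace on.
 * Purely illustrative, not part of mutrace. Build it with -rdynamic so
 * that the backtraces in the report are useful, for example:
 *
 *   gcc -O2 -g -rdynamic -pthread -o contend contend.c
 *   LD_PRELOAD=/path/to/libmutrace.so ./contend
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg) {
    (void) arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* all four threads fight over this one mutex */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

In the resulting report this mutex should show up at the top of the contention list, with a ‘Locked’ count of four million and a comparatively large ‘Cont.’ value.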

As of now, mutrace only profiles mutexes. Support for
rwlocks should be easy to add, though. Patches welcome.

The output mutrace generates can be influenced by various
MUTRACE_xxx environment variables. See the sources for more
information.

And now, please take mutrace and profile and speed up your application!

You may find the sources in my
git repository.

What’s Cooking in PulseAudio’s glitch-free Branch

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/pulse-glitch-free.html

A while ago I started development of a special branch of PulseAudio which is called
glitch-free. In a few days I will merge it back to PulseAudio
trunk, and eventually release it as 0.9.11. I think it’s time to
explain a little what all this “glitch-freeness” is about, what made
it so tricky to implement, and why this is totally awesome
technology. So, here we go:

Traditional Playback Model

Traditionally on most operating systems audio is scheduled via
sound card interrupts (IRQs). When an application opens a sound card for playback it
configures it for a fixed-size playback buffer. Then it fills this
buffer with digital PCM
sample data. And after that it tells the hardware to start
playback. Then, the hardware reads the samples from the buffer, one at
a time, and passes them on to the DAC
so that they eventually reach the speakers.

After a certain number of samples have been played, the sound hardware
generates an interrupt. This interrupt is forwarded to the
application. On Linux/Unix this is done via poll()/select(),
which the application uses to sleep on the sound card file
descriptor. When the application is notified via this interrupt it
overwrites the samples that were just played by the hardware with new
data and goes to sleep again. When the next interrupt arrives the next
block of samples is overwritten, and so on and so on. When the
hardware reaches the end of the hardware buffer it starts from its
beginning again, in a true ring buffer
fashion. This goes on and on and on.
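
To make the mechanics a bit more tangible, here is a minimal sketch of such a playback loop using ALSA’s convenience API (build with -lasound). It is illustrative only: the parameters are assumptions, error handling is mostly omitted, and the blocking snd_pcm_writei() call hides the poll()-on-the-pcm-fd/refill dance described above (it sleeps until the next chunk fits into the hardware buffer, then copies it in):

#include <alsa/asoundlib.h>
#include <string.h>

int main(void) {
    snd_pcm_t *pcm;
    short buf[2 * 1024];                        /* 1024 frames of stereo S16 silence */
    memset(buf, 0, sizeof(buf));

    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return 1;
    /* 2 channels, 44100 Hz, ~100 ms of buffer; the driver rounds this to
     * whatever buffer/fragment sizes the hardware actually supports. */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           2, 44100, 1, 100000) < 0)
        return 1;

    for (int i = 0; i < 500; i++) {             /* roughly 11s of silence */
        snd_pcm_sframes_t n = snd_pcm_writei(pcm, buf, 1024);
        if (n < 0)
            snd_pcm_recover(pcm, n, 0);         /* e.g. restart after an underrun */
    }

    snd_pcm_drain(pcm);
    snd_pcm_close(pcm);
    return 0;
}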

The number of samples after which an interrupt is generated is
usually called a fragment (ALSA likes to call the same thing a
period for some reason). The number of fragments the entire
playback buffer is split into is usually integral and usually a power of
two, 2 and 4 being the most frequently used values.

[Image 1: Schematic overview of the playback buffer in the traditional playback model, in the best way the author can visualize this with his limited drawing abilities.]

If the application is not quick enough to fill up the hardware
buffer again after an interrupt we get a buffer underrun
(“drop-out”). An underrun is clearly audible to the user as a
discontinuity in the audio, which is something we obviously don’t want. We
thus have to carefully make sure that the buffer and fragment sizes
are chosen in a way that the software has enough time to calculate the
data that needs to be played, and the OS has enough time to forward
the interrupt from the hardware to the userspace software and the
write request back to the hardware.

The size of the playback buffer is chosen depending on the requirements
of the application. It can be as small as 4ms for low-latency
applications (such as music synthesizers), or as long as 2s for
applications where latency doesn’t matter (such as music players). The
hardware buffer size directly translates to the latency that the
playback adds to the system. The smaller the fragment sizes the
application configures, the more time the application has to fill up
the playback buffer again.

Let’s formalize this a bit: Let BUF_SIZE be the size of the
hardware playback buffer in samples, FRAG_SIZE the size of one
fragment in samples, NFRAGS the number of fragments the buffer is
split into (i.e. BUF_SIZE divided by FRAG_SIZE), and RATE the sampling
rate in samples per second. Then, the overall latency is
BUF_SIZE/RATE, and an interrupt is generated every FRAG_SIZE/RATE. Every
time one of those interrupts is generated the application should fill
up one fragment again; if it missed an interrupt, it might have to fill
up more than one. If it doesn’t miss any interrupt it has
(NFRAGS-1)*FRAG_SIZE/RATE time to fulfill the request. If it needs
more time than this we’ll get an underrun. The fill level of the
playback buffer should thus usually oscillate between BUF_SIZE and
(NFRAGS-1)*FRAG_SIZE. In case of missed interrupts it might however
fall considerably lower, in the worst case to 0 which is, again, an
underrun.
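
To make these formulas concrete, here is a tiny worked example using hypothetical but representative numbers that happen to match the PulseAudio <= 0.9.10 defaults quoted further below (four fragments of 25ms each at 44100 Hz):

#include <stdio.h>

int main(void) {
    const double RATE      = 44100.0;            /* samples per second (assumption) */
    const double FRAG_SIZE = 0.025 * RATE;       /* one 25ms fragment, in samples   */
    const int    NFRAGS    = 4;
    const double BUF_SIZE  = NFRAGS * FRAG_SIZE; /* the whole buffer, in samples    */

    printf("overall latency: %.0f ms\n", 1000.0 * BUF_SIZE / RATE);                  /* 100 ms */
    printf("interrupt every: %.0f ms\n", 1000.0 * FRAG_SIZE / RATE);                 /*  25 ms */
    printf("refill deadline: %.0f ms\n", 1000.0 * (NFRAGS - 1) * FRAG_SIZE / RATE);  /*  75 ms */
    return 0;
}

That is, a latency of 100ms, 40 interrupts per second, and 75ms for the application to refill a fragment before the buffer runs dry.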

It is difficult to choose the buffer and fragment sizes in an
optimal way for an application:

  • The buffer size should be as large as possible to minimize the
    risk of drop-outs.
  • The buffer size should be as small as possible to guarantee
    minimal latencies.
  • The fragment size should be as large as possible to minimize the
    number of interrupts, and thus the required CPU time used, to maximize
    the time the CPU can sleep for between interrupts and thus the battery
    lifetime (i.e. the fewer interrupts are generated the lower your audio
    app will show up in powertop, and that’s what it’s all about,
    right?)
  • The fragment size should be as small as possible to give the
    application as much time as possible to fill up the playback buffer,
    to minimize drop-outs.

As you can easily see it is impossible to choose buffering metrics
in a way that they are optimal on all four requirements.

This traditional model has major drawbacks:

  • The buffering metrics are highly dependent on what the sound hardware
    can provide. Portable software needs to be able to deal with hardware
    that can only provide a very limited set of buffer and fragment
    sizes.
  • The buffer metrics are configured only once, when the device is
    opened; they usually cannot be reconfigured during playback without
    major discontinuities in audio. This is problematic if more than one
    application wants to output audio at the same time via a sound server
    (or dmix) and they have different requirements on
    latency. For these sound servers/dmix the fragment metrics are
    configured statically in a configuration file, and are the same during
    the whole lifetime. If a client connects that needs lower latencies,
    it has basically lost. If a client connects that doesn’t need such low
    latencies, we will continuously burn more CPU/battery than
    necessary.
  • It is practically impossible to choose buffer metrics that are optimal
    for your application — there are too many variables in the equation:
    you can’t know anything about the IRQ/scheduling latencies of the
    OS/machine your software will be running on; you cannot know how much
    time it will actually take to produce the audio data that shall be
    pushed to the audio device (unless you start counting cycles, which is
    a good way to make your code unportable); the scheduling latencies are
    hugely dependent on the system load on most current OSes (unless you
    have an RT system, which we generally do not have). As said, for sound
    servers/dmix it is impossible to know in advance what the requirements
    on latency are that the applications that might eventually connect
    will have.
  • Since the number of fragments is integral and at least 2
    on almost all existing hardware we will generate at least two interrupts
    on each buffer iteration. If we fix the buffer size to 2s then we will
    generate an interrupt at least every 1s. We’d then have 1s to fill up
    the buffer again — on all modern systems this is far more than we’d
    ever need. It would be much better if we could fix the fragment size
    to 1.9s, which still gives us 100ms to fill up the playback buffer
    again, still more than necessary on most systems.

Due to the limitations of this model most current (Linux/Unix)
software uses buffer metrics that turned out to “work most of the
time”, very often they are chosen without much thinking, by copying
other people’s code, or totally at random.

PulseAudio <= 0.9.10 uses a fragment size of 25ms by default, with
four fragments. That means that right now, unless you reconfigure your
PulseAudio manually, clients will not get latencies lower than 100ms
whatever you try, and as long as music is playing you will
get 40 interrupts/s. (The relevant configuration options for PulseAudio are
default-fragments= and default-fragment-size-msec=
in daemon.conf)
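
For illustration, this is roughly what those two options look like in daemon.conf; the values shown are merely the defaults described above, so you would pick smaller ones if you want lower latency at the price of more wakeups:

; /etc/pulse/daemon.conf (excerpt; values shown are the defaults)
default-fragments = 4
default-fragment-size-msec = 25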

dmix uses 16 fragments by default with a size of 21 ms each (on my
system at least — this varies, depending on your hardware). You can’t
get less than 47 interrupts/s. (You can change the parameters in
.asoundrc)

So much about the traditional model and its limitations. Now, we’ll
have a peek at how the new glitch-free branch of PulseAudio
does things. The technology is not really new. It’s inspired
by what Vista does these days and what Apple CoreAudio has already
been doing for quite a while. However, on Linux this technology is
new; we have been lagging behind quite a bit. Also, I claim that what
PA does now goes beyond what Vista/MacOS does in many ways, though of
course they provide much more than we provide in many other ways. The
name glitch-free is inspired by the term Microsoft uses for
this model; however, I must admit that I am not sure that my
definition of this term and theirs actually is the same.

Glitch-Free Playback Model

The first basic idea of the glitch-free playback model (a
better, less marketingy name is probably timer-based audio
scheduling, which is the term I internally use in the PA codebase)
is to no longer depend on sound card interrupts to schedule audio but
use system timers instead. System timers are far more flexible than
the fragment-based sound card timers. They can be reconfigured at any
time, and have a granularity that is independent of any buffer
metrics of the sound card. The second basic idea is to use playback
buffers that are as large as possible, up to a limit of 2s or 5s. The
third basic idea is to allow rewriting of the hardware buffer at any
time. This allows instant reaction to user input (e.g. pause/seek
requests in your music player, or instant event sounds) although the
huge latency imposed by the hardware playback buffer would suggest
otherwise.

PA configures the audio hardware to the largest playback buffer
size possible, up to 2s. The sound card interrupts are disabled as far
as possible (most of the time this means to simply lower NFRAGS to the
minimal value supported by the hardware. It would be great if ALSA
would allow us to disable sound card interrupts entirely). Then, PA
constantly determines what the minimal latency requirement of all
connected clients is. If no client specified any requirements we fill
up the whole buffer all the time, i.e. have an actual latency of
2s. However, if some applications specified requirements, we take the
lowest one and only use as much of the configured hardware buffer as
this value allows us. In practice, this means we only partially fill the
buffer each time we wake up. Then, we configure a system timer
to wake us up 10ms before the buffer would run empty and fill it up
again then. If the overall latency is configured to less than 10ms we
wake up after half the latency requested.

If the sleep time turns out to be too long (i.e. it took more than
10ms to fill up the hardware buffer) we will get an underrun. If this
happens we can double the time we wake up before the buffer would run
empty, to 20ms, and so on. If we notice that we only used much less
than the time we estimated, we can halve this value again. This
adaptive scheme makes sure that a buffer underrun, in the unlikely
event that one happens, will most likely happen only once and then never again.
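
A minimal sketch of that adaptive watermark logic, just to illustrate the idea (deliberately simplified; this is not PulseAudio’s actual code):

#include <stdio.h>

static double watermark_ms = 10.0;  /* wake up this long before the buffer would run empty */

/* Called after each refill: fill_took_ms is how long the refill took,
 * underrun says whether the buffer ran empty before we were done. */
static void adjust_watermark(double fill_took_ms, int underrun) {
    if (underrun)
        watermark_ms *= 2.0;                              /* we were too late: double the safety margin */
    else if (fill_took_ms < watermark_ms / 4.0 && watermark_ms > 10.0)
        watermark_ms /= 2.0;                              /* plenty of headroom: shrink it again */
}

int main(void) {
    adjust_watermark(12.0, 1);   /* refill took 12ms and we underran: 10ms becomes 20ms */
    adjust_watermark(3.0, 0);    /* quick refill, no underrun: back down to 10ms */
    printf("current watermark: %.0f ms\n", watermark_ms);
    return 0;
}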

When a new client connects or an existing client disconnects, or
when a client wants to rewrite what it already wrote, or the user
wants to change the volume of one of the streams, then PA will
resample the data passed by the client, convert it to the proper
hardware sample type, and remix it with the data of the other
clients. This of course makes it necessary to keep a “history” of data
of all clients around so that if one client requests a
rewrite we have the necessary data around to remix what already was
mixed before.

The benefits of this model are manifold:

  • We minimize the overall number of interrupts, down to what the
    latency requirements of the connected clients allow us, i.e. we save power
    and don't show up in powertop anymore for normal music playback.
  • We maximize drop-out safety, because we buffer up to 2s in the
    usual cases. Only on operating systems with scheduling
    latencies > 2s can we still get drop-outs. Thankfully no
    operating system is that bad.
  • In the event of an underrun we don’t get stuck in it, but instead
    are able to recover quickly and can make sure it doesn’t happen again.
  • We provide “zero-latency”. Each client can rewrite its playback
    buffer at any time, and this is forwarded to the hardware, even if
    this means that the sample currently being played needs to be
    rewritten. This means much quicker reaction to user input, a more
    responsive user experience.
  • We become much less dependent on what the sound hardware provides
    us with. We can configure wakeup times that are independent of the
    fragment settings that the hardware actually supports.
  • We can provide almost any latency a client might request,
    dynamically without reconfiguration, without discontinuities in
    audio.

Of course, this scheme also comes with major complications:

  • System timers and sound card timers deviate. On many sound cards
    by quite a bit. Also, not all sound cards allow the user to query the
    playback frame index at any time, but only shortly after each IRQ. To
    compensate for this deviation PA contains a non-trivial algorithm
    which tries to estimate and follow the deviation over time. If this
    doesn’t work properly an underrun might happen much
    earlier than we expected.
  • System timers on Unix are not very high precision. On traditional
    Linux with HZ=100 sleep times for timers are rounded up to multiples
    of 10ms. Only very recent Linux kernels with hrtimers can
    provide something better, but only on x86 and x86-64 until now. This
    makes the whole scheme unusable for low latency setups unless you run
    the very latest Linux. Also, hrtimers are not (yet) exposed in
    poll()/select(). It requires major jumping through hoops to
    work around this limitation.
  • We need to keep a history of sample data for each stream around, which
    increases the memory footprint and potentially the cache pressure. PA
    tries to counter this by doing zero-copy memory management.
  • We’re still dependent on the maximum playback buffer size the
    sound hardware supports. Many sound cards don’t even support 2s, but only
    300ms or suchlike.
  • The rewriting of the client buffers, which causes rewriting of the
    hardware buffer, complicates the resampling/converting step
    immensely. In general the code to implement this model is more complex
    than for the traditional model. Also, ALSA has not really been
    designed with this model in mind, which makes some things very hard
    to get right and leaves them suboptimal.
  • Generally, this works reliably only on newest ALSA, newest kernel,
    newest everything. It has pretty steep requirements on software and
    sometimes even on hardware. To stay compatible with systems that don’t
    fulfill these requirements we need to carry around code for the
    traditional playback model as well, increasing the code base considerably.

The advantages of the scheme clearly outweigh the complexities it
causes. Especially the power-saving features of glitch-free PA should
be enough reason for the embedded Linux people to adopt it
quickly. Make PA disappear from powertop even if you play music!

The code in the glitch-free branch is still rough and sometimes
incomplete. I will merge it shortly into trunk and then
upload a snapshot to Rawhide.

I hope this text also explains to the few remaining PA haters a
little better why PA is a good thing, and why everyone should have it
on his Linux desktop. Of course these changes are not visible on the
surface, my hope with this blog story is to explain a bit better why
infrastructure matters, and counter misconceptions about what PA actually is
and what it gives you on top of ALSA.

Emulated atomic operations and real-time scheduling

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/atomic-rt.html

Unfortunately not all CPU architectures have native support for atomic
operations, or only support a very limited subset. Most
prominently ARMv5 (and
older) hasn’t any support besides the most basic atomic swap
operation[1]. Now, more and more free code is starting to
use atomic operations and lock-free
algorithms, one being my own project, PulseAudio. If you have ever done real-time
programming
you probably know that you cannot really do it without
support for atomic operations. One question remains however: what to
do on CPUs which support only the most basic atomic operations
natively?

On the kernel side atomic ops are very easy to emulate: just disable
interrupts temporarily, then do your operation non-atomically, and afterwards
enable them again. That’s relatively cheap and works fine (unless you are on SMP — which
fortunately you usually are not for those CPUs). The Linux
kernel does it this way and it is good. But what to do in user-space, where you cannot just go and disable interrupts?

Let’s see how the different userspace libraries/frameworks do it
for ARMv5, a very relevant architecture that only knows an atomic swap (exchange)
but no CAS
or even atomic arithmetic. Let’s start with an excerpt from glibc’s
atomic operations implementation for ARM:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic
operations the more likely you’re going to hit this race. Let’s
hope glibc is not a heavy user of atomic operations. PulseAudio however is, and
PulseAudio happens to be my focus.

Let’s have a look at how Qt4
does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
	*ptr = newval;
	ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard
situations it actually works. But it sucks big time, too. Why? It
contains a spin lock: the variable q_atomic_lock is used for
locking the atomic operation. The code tries to set it to non-zero,
and if that fails it tries again, until it succeeds, in the hope that
the other thread — which currently holds the lock — gives it up. The
big problem here is: it might take a while until that happens, up to
1/HZ time on Linux. Usually you want to use atomic operations to
minimize the need for mutexes and thus speed things up. Now, here you
got a lock, and it’s the worst kind: the spinning lock. Not
good. Also, if used from a real-time thread the machine simply locks
up when we enter the loop in contended state, because preemption is
disabled for RT threads and thus the loop will spin forever. Evil. And
then, there’s another problem: it’s a big bottleneck, because all
atomic operations are synchronized via a single variable which is
q_atomic_lock. Not good either. And let’s not forget that
only code that has access to q_atomic_lock actually can
execute this code safely. If you want to use it for
lock-free IPC via shared memory this is going to break. And let’s not
forget that it is unusable from signal handlers (which probably
doesn’t matter much, though). So, in summary: this code sucks,
too.

Next try, let’s have a look at how glib
does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of
sched_yield() to ask the OS to reschedule. It’s a bit
better than the Qt version, since it doesn’t spin just burning CPU,
but instead tells the kernel to execute something else, increasing the
chance that the thread currently holding the lock is scheduled. It’s a
bit friendlier, but it’s not great either, because this might still delay
execution quite a bit. It’s probably
one of the very few legitimate occasions where using
sched_yield() is OK. It still doesn’t work for RT, because
sched_yield() in most cases is a NOP for RT threads, so
you still get a machine lockup. And it still has the
one-lock-to-rule-them-all bottleneck. And it still is not compatible
with shared memory.

Then, there’s libatomic_ops. It’s
the most complex code of the lot, so I won’t paste it here. Basically
it uses the same kind of spin loop, with three differences however (a rough
sketch follows after this list):

  1. Instead of a single lock variable, 16 are used. The variable
    to use is picked via simple hashing of the pointer to the atomic variable
    that shall be modified. This removes the one-lock-to-rule-them-all
    bottleneck.
  2. Instead of sched_yield() it uses select() with
    a small timeval parameter to give the current holder of the lock some
    time to give it up. To make sure that the select() is not
    optimized away by the kernel (leaving the thread never preempted),
    the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.
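
To make this more concrete, here is a rough sketch of that scheme. It is not
libatomic_ops’ actual code: atomic_swap(), N_LOCKS and emulated_fetch_and_add()
are made-up names, and the GCC builtin only stands in for ARM’s swp
instruction; it just condenses the three ideas above into one function:

#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <sys/select.h>

#define N_LOCKS 16                        /* made-up constant */

static volatile int locks[N_LOCKS];

/* Atomic exchange, i.e. the swp instruction on ARMv5. */
static int atomic_swap(volatile int *mem, int val)
{
    return __sync_lock_test_and_set(mem, val);
}

/* Pick one of the lock variables by hashing the address of the atomic
 * variable we want to modify. */
static volatile int *lock_for(const volatile void *addr)
{
    return &locks[((uintptr_t) addr >> 4) % N_LOCKS];
}

int emulated_fetch_and_add(volatile int *atomic, int val)
{
    volatile int *lock = lock_for(atomic);
    sigset_t all, old;
    long usec = 1;
    int result;

    /* Block signals so a signal handler running in this thread cannot
     * deadlock on the very lock we are about to take (at the cost of
     * two extra syscalls per operation). */
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &old);

    while (atomic_swap(lock, 1) != 0) {
        /* Contended: sleep a little so the current holder gets scheduled,
         * and increase the sleep time on every iteration. */
        struct timeval tv = { 0, usec };
        select(0, NULL, NULL, NULL, &tv);
        usec *= 2;
    }

    result = *atomic;
    *atomic += val;

    atomic_swap(lock, 0);                 /* release the lock */
    pthread_sigmask(SIG_SETMASK, &old, NULL);

    return result;
}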

It’s certainly the best implementation of the ones discussed here:
it doesn’t suffer from the one-lock-to-rule-them-all bottleneck. It’s
(supposedly) signal handler safe (which however comes at the cost of
doing two syscalls on every atomic operation, probably a very high
price). It actually works on RT, due to sleeping for an explicit
time. However it still doesn’t deal with priority inversion
problems, which are a big issue for real-time
programming. Also, the time slept in the select() call might
be relatively long, since at least on Linux the time passed to
select() is rounded up to 1/HZ, which is not good for RT either. And
then, it still doesn’t work for shared memory IPC.

So, what do we learn from this? At least one thing: better not do
real-time programming with ARMv5[2]. But more practically, what could a
good emulation for atomic ops, based solely on atomic swap,
look like? Here are a few ideas:

  • Use an implementation inspired by libatomic_ops. Right
    now it’s the best available. It’s probably a good idea, though, to
    replace select() with nanosleep(), since on recent
    kernels the latter doesn’t round up to 1/HZ anymore, at least when you
    have high-resolution timers[3]. Then, if you can live
    without signal handler safety, drop the signal mask changing.
  • If you use something based on libatomic_ops and want to
    use it for shared memory IPC, then you have the option to move the
    lock variables into shared memory too. Note however that this allows
    evil applications to lock up your process by taking the locks and
    never giving them up (which, however, is always a problem if not all
    atomic operations you need are available in hardware). So if you do
    this, make sure that only trusted processes can attach to your memory
    segment.
  • Alternatively, spend some time and investigate whether it is possible
    to use futexes to sleep on the lock variables. This is not trivial
    though, since futexes right now expect the availability of an atomic
    increment operation. But it might be possible to emulate this well
    enough with the swap operation. There’s now even a FUTEX_LOCK_PI
    operation which would allow priority inheritance.
  • Alternatively, find a way to allow user space to disable
    interrupts cheaply (requires kernel patching). Since enabling
    RT scheduling is a privileged operation already (since you may easily
    lock up your machine with it), it might not be too problematic to
    extend the ability to disable interrupts to user space: it’s just yet
    another way to lock up your machine.
  • For the libatomic_ops based algorithm: if you’re lucky
    and have defined a struct type for your atomic integer types, like the
    kernel does, or like I do in PulseAudio with pa_atomic_t,
    then you can stick the lock variable directly into your
    structure. This makes shared memory support transparent, and removes
    the one-lock-to-rule-them-all bottleneck completely. Of course, OTOH it
    increases the memory consumption a bit and increases cache pressure
    (though I’d assume that this is negligible).
  • For the libatomic_ops based algorithm: start sleeping for
    the time returned by clock_getres()
    (cache the result!). You cannot sleep shorter than that anyway. A small
    sketch combining this with the nanosleep() idea from the first item
    follows after this list.
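
As a small sketch of those last two ideas (again, made-up names, not code from
any of the projects discussed above), the back-off step of the spin loop could
look roughly like this:

#include <time.h>

/* Back off in multiples of the clock resolution, using nanosleep()
 * instead of select().  Call as backoff_sleep(i++) whenever taking the
 * lock fails. */
static void backoff_sleep(unsigned iteration)
{
    static struct timespec res;           /* cached result of clock_getres() */
    struct timespec ts;
    long long ns;

    if (res.tv_sec == 0 && res.tv_nsec == 0)
        clock_getres(CLOCK_MONOTONIC, &res);

    /* Sleep (iteration + 1) times the timer resolution; asking for less
     * than the resolution would be rounded up anyway. */
    ns = ((long long) res.tv_sec * 1000000000LL + res.tv_nsec) * (iteration + 1);
    ts.tv_sec  = ns / 1000000000LL;
    ts.tv_nsec = ns % 1000000000LL;

    nanosleep(&ts, NULL);
}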

Yepp, that’s as good as it gets. Unfortunately I cannot serve you
the optimal solution on a silver platter. I never actually did
development for ARMv5; this blog story just sums up my thoughts on all
the code I saw which emulates atomic ops on ARMv5. But maybe someone
who actually cares about atomic operations on ARM finds this
interesting and maybe invests some time to prepare patches for Qt,
glib, glibc, and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip, N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let’s not even think about CPUs which don’t even have an atomic swap!

[3] Which however you probably won’t, given that they’re only available on x86 on stable Linux kernels for now — but still, it’s cleaner.

What I miss in GNOME

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/what-i-miss-in-gnome.html

A while back there was a lot of noise about the GNOME
“platform” and what GNOME 3.0 should be. Personally, while I
certainly like the progress GNOME makes as a “platform”, I must say
that the platform is already quite good. In my opinion, what is
lacking right now are more the tools and utilities that are shipped
*with* the GNOME platform than the platform itself. More specifically,
there is a set of (rather small) tools I am really missing in the standard set of
GNOME tools. So, here’s my wishlist, in case anybody is interested:

<wishlist>

  • A simple, usable VNC/RFB client as counterpart to the VNC server
    vino that has been shipped since the early GNOME 2.0 days. Isn’t
    it kind of awkward that we have been shipping a VNC server for ages,
    but no VNC client? What I want is a client (maybe called
    vinagre as a pun on vino) that is more than just a simple frontend to
    xvncviewer, but not necessarily too fancy. Something that
    integrates well into GNOME, i.e. uses D-Bus, gnome-keyring,
    avahi-ui. There seems
    to be a libvncclient library
    that might make the implementation of
    this tool easy.
  • I am one of the (apparently not so few) people who run their GNOME
    session with LANG=de_DE and LC_MESSAGES=C, which
    enables German dates and everything else, but uses English
    messages. Right now it’s a PITA to configure GNOME that way. It’s not
    really documented how to do that, AFAIK. The best way to do this I
    found is to edit ~/.gnomerc and set the variables in there. A
    simple capplet which allows setting these environment variables from
    gnome-session would be a much better way to configure
    this. Nothing too fancy again. Just two drop-down lists to choose
    LANG and LC_MESSAGES, and maybe a subset of the other
    i18n variables, and possibly G_FILENAME_ENCODING (although I
    might be the only one who still hasn’t switched his $HOME to
    UTF-8).
  • There’s no world clock in GNOME. Sure, there are online tools for
    this, but I am not always online with my laptop.
  • There is no simple tool to take photo snapshots or record short videos
    from webcams. I want to see something like camorama in
    gnome-media. Nothing too fancy again. No filters, no TV
    functionality. Just a small but useful GStreamer frontend.
  • I’d like to see a simple BitTorrent client shipped with GNOME, which is
    integrated well into the rest of GNOME/Epiphany, so that downloading files from
    FTP or HTTP looks exactly like downloading them via BitTorrent.

</wishlist>