systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journalctl.html

It’s that time again, here’s now the seventeenth installment of
my ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some
functionality of the journal, and how it is exposed in
systemctl. In this episode I want to explain a few more uses
of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in
a few words to get you up to speed: the journal is a component of systemd,
that captures Syslog messages, Kernel log messages, initial RAM disk
and early boot messages as well as messages written to STDOUT/STDERR
of all services, indexes them and makes this available to the user. It
can be used in parallel, or in place of a traditional syslog daemon,
such as rsyslog or syslog-ng. For more information, see the initial
announcement.

The journal has been part of Fedora since F17. With Fedora 18 it
now has grown into a reliable, powerful tool to handle your logs. Note
however, that on F17 and F18 the journal is configured by default to
store logs only in a small ring-buffer in /run/log/journal,
i.e. not persistent. This of course limits its usefulness quite
drastically but is sufficient to show a bit of recent log history in
systemctl status. For Fedora 19, we plan to change this, and
enable persistent logging by default. Then, journal files will be
stored in /var/log/journal and can grow much larger, thus
making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful
structured data into your journal to play with. Oh, and since you have
the journal now, you don’t need syslog anymore (unless having
/var/log/messages as a text file is a necessity for you), so
you can choose to uninstall rsyslog:

# yum remove rsyslog
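
If you want to double-check which of the two locations is in use, a small sketch like the following might help. Note that journal_storage_dir and its optional root prefix argument are names I made up for this illustration; the function only looks for the two directories mentioned above and is not how journald itself makes the decision.

```shell
#!/bin/sh
# journal_storage_dir [ROOT]: report where journal files would be kept.
# ROOT is a hypothetical prefix (default: empty) so that the check can be
# tried against a scratch directory instead of the real file system.
journal_storage_dir() {
    root=${1:-}
    if [ -d "$root/var/log/journal" ]; then
        # The directory exists, so logs can be stored persistently.
        echo "persistent: $root/var/log/journal"
    elif [ -d "$root/run/log/journal" ]; then
        # Only the small ring buffer in /run is available.
        echo "volatile: $root/run/log/journal"
    else
        echo "none"
    fi
}

journal_storage_dir
```

Run without arguments it inspects the real file system, so on a default F17 or F18 install it should report the volatile location until you create /var/log/journal.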

Basics

Now we are ready to go. The following text shows a lot of features
of systemd 195 as it will be included in Fedora 18[1], so
if your F17 can’t do the tricks you see, please wait for F18. First,
let’s start with some basics. To access the logs of the journal use
the journalctl(1)
tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the
system, from system components the same way as for logged in
users. The output you will get looks like a pixel-perfect copy of the
traditional /var/log/messages format, but actually has a
couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of
the output this generates, I cut that out for brevity — and to give
you a reason to try it out yourself with a current image for F18’s
development version with systemd 195. But I do hope you get the idea
anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring to be
root sucks of course, even administrators tend to do most of their
work as unprivileged users these days. By default, Journal users can
only watch their own logs, unless they are root or in the adm
group. To make watching system logs more fun, let’s add ourselves to
adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access
to the full journal of the system and all users:

$ journalctl
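
To find out in advance whether a user falls under the rule described above (root or member of adm), something like this sketch could be used. The helper names list_has_group and can_read_full_journal are invented for this example, and the check only mirrors the rule as stated in this story; your distribution may grant access through other groups as well.

```shell
#!/bin/sh
# list_has_group GROUPS GROUP: succeed if GROUP appears in the
# space-separated list GROUPS (hypothetical helper for this sketch).
list_has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Succeed if the invoking user is root or a member of the adm group.
can_read_full_journal() {
    [ "$(id -u)" -eq 0 ] || list_has_group "$(id -nG)" adm
}

if can_read_full_journal; then
    echo "full journal access"
else
    echo "own logs only"
fi
```

Remember that group changes made with usermod only show up in id after logging out and back in.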

Live View

If invoked without parameters journalctl will show me the current
log database. Sometimes one needs to watch logs as they grow, where
one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you
the last ten log lines and then wait for changes and show them as
they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the
whole set of logs, beginning with the oldest message stored. That of
course, can be a lot of data. Much more useful is just viewing the
logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the
aforementioned gimmicks. But sometimes even this is way too
much data to process. So what about just listing all the real issues
to care about: all messages of priority levels ERROR and worse, from
the current boot:

$ journalctl -b -p err
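
In case you wonder what "err and worse" means: the priorities are the classic syslog levels, where a lower number is more severe. The following sketch spells out that ordering. The function names prio_num and passes_filter are made up for illustration, and the comparison is only my reading of how a -p threshold selects entries, not journalctl's actual code.

```shell
#!/bin/sh
# prio_num NAME: print the numeric syslog level for a priority name.
prio_num() {
    case $1 in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# passes_filter ENTRY_PRIORITY THRESHOLD_NAME: succeed if an entry with the
# given numeric priority is at least as severe as the named threshold.
passes_filter() {
    [ "$1" -le "$(prio_num "$2")" ]
}

if passes_filter 2 err; then
    echo "a crit message is shown by -p err"
fi
```

So an entry carrying PRIORITY=6 (info) would not pass a -p err filter, while one with PRIORITY=2 (crit) would.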

If you reboot only seldom, -b makes little sense;
filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"
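
Since it is easy to get such a range wrong when typing it by hand, here is a tiny sketch that prints the command line for a span of whole days. The function name day_range_cmd is invented for this example; it only formats text and deliberately does not run journalctl, so you can review the result before pasting it into your shell.

```shell
#!/bin/sh
# day_range_cmd FIRST [LAST]: print a journalctl command line covering
# FIRST 00:00:00 up to LAST 23:59:59 (LAST defaults to FIRST).
# Dates are expected in YYYY-MM-DD form and are not validated here.
day_range_cmd() {
    first=$1
    last=${2:-$1}
    printf 'journalctl --since=%s --until="%s 23:59:59"\n' "$first" "$last"
}

day_range_cmd 2012-10-15 2012-10-16
```

A date without a time is taken as 00:00:00 of that day, which is why only the end of the range needs an explicit time.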

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this, this seems to be some weird interaction with
dhclient, let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!
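
What happens behind the scenes with these path arguments? As far as I can tell from journalctl(1), a path to an executable is a shortcut for an _EXE= match on the canonicalized binary path. The sketch below shows that translation under this assumption. The function name path_to_match is made up, and device nodes such as /dev/sdc are handled differently and are left out here.

```shell
#!/bin/sh
# path_to_match PATH: print the field match that an executable path is
# assumed to stand for (an _EXE= match on the resolved path).
path_to_match() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        # Resolve symlinks so the match uses the canonical binary path.
        printf '_EXE=%s\n' "$(readlink -f -- "$1")"
    else
        echo "no shortcut known for: $1" >&2
        return 1
    fi
}

path_to_match /bin/sh
```

With that in mind, journalctl /usr/sbin/vpnc /usr/sbin/dhclient would amount to two matches on the same field, which fits the interleaved output we got.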

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a
notch. Internally systemd stores each log entry with a set of
implicit meta data. This meta data looks a lot like an
environment block, but actually is a bit more powerful: values can
take binary, large values (though this is the exception, and usually
they just contain UTF-8), and fields can have multiple values assigned
(an exception too, usually they only have one value). This implicit
meta data is collected for each and every log message, without user
intervention. The data will be there, and wait to be used by
you. Let’s see how this looks:

$ journalctl -o verbose -n
[…]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
PRIORITY=6
SYSLOG_FACILITY=3
_MACHINE_ID=a91663387a90b89f185d4e860000001a
_HOSTNAME=epsilon
_TRANSPORT=syslog
SYSLOG_IDENTIFIER=avahi-daemon
_COMM=avahi-daemon
_EXE=/usr/sbin/avahi-daemon
_SYSTEMD_CGROUP=/system/avahi-daemon.service
_SYSTEMD_UNIT=avahi-daemon.service
_SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
_UID=70
_GID=70
_CMDLINE=avahi-daemon: registering [epsilon.local]
MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
_BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
_PID=27937
SYSLOG_PID=27937
_SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don’t want to make this story
overly long. -n without parameter shows you the last 10 log
entries, but I cut out all but the last.)
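
Because the verbose output is just a list of FIELD=VALUE lines, it is easy to pick single fields out of it with standard tools. Here is a minimal sketch, assuming one field per line as in the listing above; the function name field_value is made up, and the sample data is a shortened copy of the entry shown above.

```shell
#!/bin/sh
# field_value NAME: read "-o verbose"-style lines on stdin and print the
# value of every line whose field name is exactly NAME. Leading
# whitespace is ignored, so indented output works as well.
field_value() {
    sed -n "s/^[[:space:]]*$1=//p"
}

sample='PRIORITY=6
_UID=70
_COMM=avahi-daemon
_SYSTEMD_UNIT=avahi-daemon.service
_PID=27937
SYSLOG_PID=27937'

printf '%s\n' "$sample" | field_value _SYSTEMD_UNIT
```

On a real system you would pipe journalctl -o verbose -n into field_value instead of the sample text. Binary or multi-line values are not handled by this sketch.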

With the -o verbose switch we enabled verbose
output. Instead of showing a pixel-perfect copy of classic
/var/log/messages that only includes a minimal subset of
what is available we now see all the gory details the journal has
about each entry. But it’s highly interesting: there is user credential
information, SELinux bits, machine information and more. For a full
list of common, well-known fields, see the
systemd.journal-fields(7) man page.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux
user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field
names, they will be combined with a logical AND. All entries matching
both will be shown now, i.e. all messages from processes named
avahi-daemon on host epsilon.

But of course, that’s not fancy enough for us. We are computer nerds
after all, we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when
you match the same field twice. The line above hence means: show me
everything from host theta with UID 70, or of host
epsilon with a process name of avahi-daemon.
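
If you want to convince yourself of how these rules combine, here is a toy model of the matching logic in plain shell. It is only my sketch of the semantics described above (same field: OR, different fields: AND, + separating alternatives), not how journalctl is implemented. All function names are made up, and an entry is simply a list of FIELD=VALUE lines.

```shell
#!/bin/sh
nl='
'

# entry_has ENTRY MATCH: succeed if ENTRY contains the exact line MATCH.
entry_has() {
    printf '%s\n' "$1" | grep -qxF -e "$2"
}

# group_matches ENTRY GROUP: GROUP is a newline-separated list of matches.
# Every field named in the group must match at least one of its values.
group_matches() {
    gm_entry=$1
    gm_group=$2
    gm_ifs=$IFS
    IFS=$nl
    set -f
    for gm_m in $gm_group; do
        gm_ok=1
        for gm_c in $gm_group; do
            if [ "${gm_c%%=*}" = "${gm_m%%=*}" ] && entry_has "$gm_entry" "$gm_c"; then
                gm_ok=0
                break
            fi
        done
        if [ "$gm_ok" -ne 0 ]; then
            IFS=$gm_ifs; set +f
            return 1
        fi
    done
    IFS=$gm_ifs; set +f
    return 0
}

# entry_matches ENTRY MATCH...: succeed if any "+"-separated group matches.
entry_matches() {
    em_entry=$1
    shift
    em_group=
    for em_arg in "$@" +; do
        if [ "$em_arg" = + ]; then
            if [ -n "$em_group" ] && group_matches "$em_entry" "$em_group"; then
                return 0
            fi
            em_group=
        else
            em_group="${em_group:+$em_group$nl}$em_arg"
        fi
    done
    return 1
}

entry='_HOSTNAME=epsilon
_UID=70
_COMM=avahi-daemon'

if entry_matches "$entry" _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon; then
    echo "entry selected"
fi
```

The sample entry is selected by the second alternative only: the first one fails because the host is epsilon and not theta, even though the UID matches.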

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can
remember all those values a field can take in the journal, I mean,
seriously, who has thaaaat kind of photographic memory? Well, the
journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the
database, or in other words: the names of all systemd services which
ever logged into the journal. This makes it super-easy to build nice
matches. But wait, turns out this all is actually hooked up with shell
completion on bash! This gets even more awesome: as you type your
match expression you will get a list of well-known field names, and of
the values they can take! Let’s figure out how to filter for SELinux
labels again. We remember the field name was something with SELINUX in
it, let’s try that:

$ journalctl _SE<TAB>

And yupp, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in
the kernel yet, but on its way due to Hannes’
fantastic work, and I hope it will make an appearance in
F18.

systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journalctl.html

It’s
that
time again,
here’s
now the seventeenth
installment
of

my ongoing series
on
systemd
for
Administrators:

Using the Journal

A
while back I already
posted a blog story introducing some
functionality of the journal, and how it is exposed in
systemctl. In this episode I want to explain a few more uses
of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in
a few words to get you up to speed: the journal is a component of systemd,
that captures Syslog messages, Kernel log messages, initial RAM disk
and early boot messages as well as messages written to STDOUT/STDERR
of all services, indexes them and makes this available to the user. It
can be used in parallel, or in place of a traditional syslog daemon,
such as rsyslog or syslog-ng. For more information, see the initial
announcement
.

The journal has been part of Fedora since F17. With Fedora 18 it
now has grown into a reliable, powerful tool to handle your logs. Note
however, that on F17 and F18 the journal is configured by default to
store logs only in a small ring-buffer in /run/log/journal,
i.e. not persistent. This of course limits its usefulness quite
drastically but is sufficient to show a bit of recent log history in
systemctl status. For Fedora 19, we plan to change this, and
enable persistent logging by default. Then, journal files will be
stored in /var/log/journal and can grow much larger, thus
making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful
structured data into your journal to play with. Oh, and since you have
the journal now, you don’t need syslog anymore (unless having
/var/log/messages as text file is a necessity for you.), so
you can choose to deinstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features
of systemd 195 as it will be included in Fedora 18[1], so
if your F17 can’t do the tricks you see, please wait for F18. First,
let’s start with some basics. To access the logs of the journal use
the journalctl(1)
tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the
system, from system components the same way as for logged in
users. The output you will get looks like a pixel-perfect copy of the
traditional /var/log/messages format, but actually has a
couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of
the output this generates, I cut that out for brevity — and to give
you a reason to try it out yourself with a current image for F18’s
development version with systemd 195. But I do hope you get the idea
anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring to be
root sucks of course, even administrators tend to do most of their
work as unprivileged users these days. By default, Journal users can
only watch their own logs, unless they are root or in the adm
group. To make watching system logs more fun, let’s add ourselves to
adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I know have access
to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current
log database. Sometimes one needs to watch logs as they grow, where
one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you
the last ten logs lines and then wait for changes and show them as
they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the
whole set of logs, beginning with the oldest message stored. That of
course, can be a lot of data. Much more useful is just viewing the
logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the
aforementioned gimmicks mentioned. But sometimes even this is way too
much data to process. So what about just listing all the real issues
to care about: all messages of priority levels ERROR and worse, from
the current boot:

$ journalctl -b -p err

If you reboot only seldom the -b makes little sense,
filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2011-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this, this seems to be some weird interaction with
dhclient, let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a
notch. Internally systemd stores each log entry with a set of
implicit meta data. This meta data looks a lot like an
environment block, but actually is a bit more powerful: values can
take binary, large values (though this is the exception, and usually
they just contain UTF-8), and fields can have multiple values assigned
(an exception too, usually they only have one value). This implicit
meta data is collected for each and every log message, without user
intervention. The data will be there, and wait to be used by
you. Let’s see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don’t want to make this story
overly long. -n without parameter shows you the last 10 log
entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose
output. Instead of showing a pixel-perfect copy of classic
/var/log/messages that only includes a minimimal subset of
what is available we now see all the gory details the journal has
about each entry. But it’s highly interesting: there is user credential
information, SELinux bits, machine information and more. For a full
list of common, well-known fields, see the
man page
.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux
user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field
names, they will be combined with a logical AND. All entries matching
both will be shown now, meaning that all messages from processes named
avahi-daemon and host epsilon.

But of course, that’s
not fancy enough for us. We are computer nerds after all, we live off
logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when
you match the same field twice. The line above hence means: show me
everything from host theta with UID 70, or of host
epsilon with a process name of avahi-daemon.

And now, it becomes magic!

That was already pretty cool, right? Righ! But heck, who can
remember all those values a field can take in the journal, I mean,
seriously, who has thaaaat kind of photographic memory? Well, the
journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the
database, or in other words: the names of all systemd services which
ever logged into the journal. This makes it super-easy to build nice
matches. But wait, turns out this all is actually hooked up with shell
completion on bash! This gets even more awesome: as you type your
match expression you will get a list of well-known field names, and of
the values they can take! Let’s figure out how to filter for SELinux
labels again. We remember the field name was something with SELINUX in
it, let’s try that:

$ journalctl _SE<TAB>

And yupp, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in
the kernel yet, but on its way due to Hannes’
fantastic work
, and I hope it will make appearence in
F18.

systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journalctl.html

It’s
that
time again,
here’s
now the seventeenth
installment
of

my ongoing series
on
systemd
for
Administrators:

Using the Journal

A
while back I already
posted a blog story introducing some
functionality of the journal, and how it is exposed in
systemctl. In this episode I want to explain a few more uses
of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in
a few words to get you up to speed: the journal is a component of systemd,
that captures Syslog messages, Kernel log messages, initial RAM disk
and early boot messages as well as messages written to STDOUT/STDERR
of all services, indexes them and makes this available to the user. It
can be used in parallel, or in place of a traditional syslog daemon,
such as rsyslog or syslog-ng. For more information, see the initial
announcement
.

The journal has been part of Fedora since F17. With Fedora 18 it
now has grown into a reliable, powerful tool to handle your logs. Note
however, that on F17 and F18 the journal is configured by default to
store logs only in a small ring-buffer in /run/log/journal,
i.e. not persistent. This of course limits its usefulness quite
drastically but is sufficient to show a bit of recent log history in
systemctl status. For Fedora 19, we plan to change this, and
enable persistent logging by default. Then, journal files will be
stored in /var/log/journal and can grow much larger, thus
making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful
structured data into your journal to play with. Oh, and since you have
the journal now, you don’t need syslog anymore (unless having
/var/log/messages as text file is a necessity for you.), so
you can choose to deinstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features
of systemd 195 as it will be included in Fedora 18[1], so
if your F17 can’t do the tricks you see, please wait for F18. First,
let’s start with some basics. To access the logs of the journal use
the journalctl(1)
tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the
system, from system components the same way as for logged in
users. The output you will get looks like a pixel-perfect copy of the
traditional /var/log/messages format, but actually has a
couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of
the output this generates, I cut that out for brevity — and to give
you a reason to try it out yourself with a current image for F18’s
development version with systemd 195. But I do hope you get the idea
anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring to be
root sucks of course, even administrators tend to do most of their
work as unprivileged users these days. By default, Journal users can
only watch their own logs, unless they are root or in the adm
group. To make watching system logs more fun, let’s add ourselves to
adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I know have access
to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current
log database. Sometimes one needs to watch logs as they grow, where
one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you
the last ten logs lines and then wait for changes and show them as
they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the
whole set of logs, beginning with the oldest message stored. That of
course, can be a lot of data. Much more useful is just viewing the
logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the
aforementioned gimmicks mentioned. But sometimes even this is way too
much data to process. So what about just listing all the real issues
to care about: all messages of priority levels ERROR and worse, from
the current boot:

$ journalctl -b -p err

If you reboot only seldom the -b makes little sense,
filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2011-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this, this seems to be some weird interaction with
dhclient, let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a
notch. Internally systemd stores each log entry with a set of
implicit meta data. This meta data looks a lot like an
environment block, but actually is a bit more powerful: values can
take binary, large values (though this is the exception, and usually
they just contain UTF-8), and fields can have multiple values assigned
(an exception too, usually they only have one value). This implicit
meta data is collected for each and every log message, without user
intervention. The data will be there, and wait to be used by
you. Let’s see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don’t want to make this story
overly long. -n without parameter shows you the last 10 log
entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose
output. Instead of showing a pixel-perfect copy of classic
/var/log/messages that only includes a minimal subset of
what is available we now see all the gory details the journal has
about each entry. But it’s highly interesting: there is user credential
information, SELinux bits, machine information and more. For a full
list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux
user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field
names, they will be combined with a logical AND. All entries matching
both will be shown now, meaning all messages from processes named
avahi-daemon on host epsilon.

But of course, that’s
not fancy enough for us. We are computer nerds after all, we live off
logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when
you match the same field twice. The line above hence means: show me
everything from host theta with UID 70, or from host
epsilon with a process name of avahi-daemon.
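
The combination rules above can be pictured with ordinary text tools. What follows is a minimal sketch and not how journalctl works internally: it assumes a hypothetical flat file with one entry per line, and emulates AND by chaining filters and OR by concatenating results. The helper name has and the sample entries are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: one log entry per line, fields separated by spaces.
entries=$(mktemp)
cat > "$entries" <<'EOF'
_HOSTNAME=theta _UID=70 _COMM=avahi-daemon
_HOSTNAME=theta _UID=0 _COMM=sshd
_HOSTNAME=epsilon _UID=70 _COMM=avahi-daemon
_HOSTNAME=epsilon _UID=71 _COMM=dnsmasq
EOF

# has FIELD=VALUE: keep only the lines carrying exactly this assignment.
has() {
    grep -E "(^| )$1( |\$)"
}

# Different fields: AND, emulated by chaining filters.
left=$(has _HOSTNAME=theta < "$entries" | has _UID=70)
right=$(has _HOSTNAME=epsilon < "$entries" | has _COMM=avahi-daemon)

# "+": OR of the two groups, emulated by concatenating both results.
plus_result=$(printf '%s\n%s\n' "$left" "$right")

# Same field twice: OR, emulated with an alternation on the value.
same_field=$(grep -E '(^| )_UID=(70|71)( |$)' "$entries")

printf '%s\n' "$plus_result"
rm -f "$entries"
```

Note that plain concatenation would list an entry twice if it matched both groups; the sample data is chosen so that this does not happen.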

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can
remember all those values a field can take in the journal, I mean,
seriously, who has thaaaat kind of photographic memory? Well, the
journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the
database, or in other words: the names of all systemd services which
ever logged into the journal. This makes it super-easy to build nice
matches. But wait, turns out this all is actually hooked up with shell
completion on bash! This gets even more awesome: as you type your
match expression you will get a list of well-known field names, and of
the values they can take! Let’s figure out how to filter for SELinux
labels again. We remember the field name was something with SELINUX in
it, let’s try that:

$ journalctl _SE<TAB>

And yupp, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in
the kernel yet, but on its way due to Hannes’ fantastic work, and I
hope it will make an appearance in F18.

Identi.ca Weekly Summary

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2011/06/26/identica-weekly.html

Identi.ca Summary, 2011-06-19 through 2011-06-26

The conversation that I mentioned last week about GPL for Javascript
libraries continued in a new thread this week. The thread was rather
long:

@fontana rather strangely argued that no one should use GPL for
Javascript, this seemed like a generally anti-copyleft position to me,
and @fontana went on further to say he’s now anti-copyleft in some
situations, when it relates to proprietary relicensing.

I pointed out, using OpenFOAM as an example, that being against
illegitimate use of otherwise good things doesn’t mean you need to be
universally against the thing.

There was a subthread discussing how GPL requirements work with
Javascript, but the subthread diverged into a discussion of CLAs and
Fedora, wherein @fontana strangely said that multiple copyright holders
won’t solve the proprietary relicensing problem.

@fontana asked for an example of a GPL’d Javascript library with
multiple copyright holders (i.e., one that isn’t using a proprietary
relicensing business model). I’d much appreciate if someone can look
for an example of a GPL’d Javascript library matching the criteria
@fontana describes; I haven’t had time to look. I offered @fontana a
prop bet on this, regardless.

Finally, in the same thread, @jasonriedy mentioned the so-called Lisp
LGPL, which I said seemed unnecessary now that we have LGPLv3.

I noted that I wrote a blog post on OpenFOAM.

I complained about the (lack of a) USA healthcare system.

@fontana and I had a discussion about crossposting on identi.ca.

I ack’ed that @fabsh had launched the oggcast, rantofabkuhn.

The biggest news this week was that @kaz is now Executive Director of
the GNOME Foundation, although the thread discussing it on identica was
rather short. OTOH, @fontana asked if @kaz would be required to use
GNOME 3.

The thread about @allisonrandal’s appearance on Linux Outlaws
continued:

@allisonrandal claimed to have not said that those who chose strong
copyleft were just as happy with weak copyleft relicensing. I found the
exact place where she said that in the LO 204 ogg file, wherein she
says at 36:15 and 37:30:

Part of that reason is that when a developer develops code they want
their code to be used. They may have a general philosophy that
they want used. Most developers who contribute under a copyleft license
— they’d be happy with any copyleft license — AGPL,
GPL, LGPL — they think — that’s my
“set”. …

You’re using GPL and we’re using LGPL, so we can’t use your code.
Hmmm, we can’t do that! … this just doesn’t fit the way
developers think! We want our code to be used — and we’re happy
to have — if I said GPL, it’s probably true that I’m happy to have
it under LGPL as well. It’s just too much work [without Harmony] to
make that happen.

@allisonrandal, @fontana and I debated the differences between strong
and weak copyleft in a subthread.

A subthread discussed who the leadership of Harmony is. I asked for a
definitive place where I can find who are the decision-makers of
Harmony and no one answered this, but @fontana made some speculations,
@allisonrandal claims that Harmony has no leadership (I wondered but
didn’t dent: should people really be adopting important documents from
a group with no leadership?). Also, @fabsh pointed out that he doubted
that it was without leaders. @fontana pointed out that SFLC was not
previously leader of Harmony; @allisonrandal says she thought they were
and yet SFLC claims they weren’t. I ended the subthread by asking again
how Harmony governing works and got no response.

In a subthread, @allisonrandal reiterated that FSF was wrong to change
the terms of GPL with GPLv3 (which she’d previously stated on the LO
interview). I still believe her position on this ironically contradicts
the plans of Harmony, which seeks to empower companies to change
licenses unilaterally. (Why should companies have a right to change a
license, but FSF shouldn’t?)

I pointed out to @allisonrandal that GPLv2 already specified inside the
license plans for GPLv3. @allisonrandal said in response that FSF
updating GPL wasn’t helpful to Free Software developers. She further
claimed that FSF’s update to GPLv3 constituted Manifest Destiny, which
I disputed.

The conversation on that sub-thread descended into a discussion of
@allisonrandal’s culturally relativistic attitude toward Free Software,
wherein @allisonrandal admitted she’s primarily a cultural relativist.

Finally, there was a subthread discussing how one can be pro-copyleft,
believe that proprietary software is morally wrong, but also not
believe permissive licensing is morally wrong. I would think such is
obvious and well established by, for example, RMS’ writings since 1984,
but we nevertheless rehashed that old debate. In this subthread, I did
point out that Harmony is biased against copyleft, and therefore is not
merely an amoral proposition of all options, as @allisonrandal has
claimed. (Oh, and this dent of mine in that thread was redented a bit.)
I favorited and nearly redented @mlinksva’s contribution to the
subthread.

@fontana linked to a Harmony list post wherein @allisonrandal attempts
to make an 11th-hour effort to remove anti-strong-copyleft parts of
Harmony.

There was a rather pointlessly lengthy thread about accents, mostly my
Balmur accent (or adjusted version thereof). That discussion bled over
onto another thread that started when I left @fontana a voicemail in a
thick Balmur accent.

@fontana doesn’t like it that I call Hitler a “dude”, even though I
said evil dude.

I was a guest on FLOSS Weekly on Wednesday. @joncruz mentioned he
enjoyed the show.

I mentioned again to @mcgrof my copyleft-by-guilt theory of OpenBSD,
which I’d previously mentioned publicly, which @chromatic found
amusing.

FSF intern @williamtheaker is working this summer on some historical
GPLv3 data-gathering.

@fontana started a thread on a Fedora list and on identi.ca about
Gilligan’s Island copyright of the Fedora website. This was previously
discussed in two threads about a month ago, wherein I coined the phrase
“Gilligan’s Island copyright”. @fontana gave me credit on the Fedora
thread for coining the phrase. I’m working on a more complete blog post
on Gilligan’s Island copyright.

dneary’s blog post made me think of an old boss.

There was a discussion of my reasons for phoning @fontana.

My beloved plastic $2 Pretty neat travel soap dish (tray / holder) that
I got in 1991 is now cracked.

@kraai is registered to donate bone marrow. I’m considering it.

I’m continuing to work on some patches for GNU Bash.

Some people apparently want an @bkuhn GPL enforcement action figure.

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-2.html

It has been a while since my last status update on systemd. Here’s another short,
incomprehensive status update on what we worked on for systemd since then.

Fedora F15 (Rawhide) now includes a split up
/etc/init.d/rc.sysinit (Bill Nottingham). This allows us to keep only
a minimal compatibility set of shell scripts around, and otherwise boot a
system without any shell scripts at all. In fact, shell scripts during early
boot are only used in exceptional cases, i.e. when you enabled autoswapping
(bad idea anyway), when a full SELinux relabel is necessary, during the first
boot after initialization, if you have static kernel modules to load (which are
not configured via the systemd-native way to do that), if you boot from a
read-only NFS server, or when you rely on LVM/RAID/Multipath. If none of
this applies to you, you can easily disable these parts of early boot and
save several seconds on boot. How to do this I will describe in a later blog
story.

We have a fully C coded shutdown logic that kills all remaining processes,
unmounts all remaining file systems, detaches all loop devices and DM volumes
and does that in the right way to ensure that all these things are properly
torn down even if they depend on each other in arbitrary ways. This is not
only considerably faster than the traditional shell hackery for this, but also
a lot safer, since we try to unmount/remount the remaining file systems with a
little bit of brains. This feature is available via systemctl --force
poweroff to the administrator. The --force controls whether the
usual shutdown of all services is run or whether this is skipped and we
immediately shall enter this final C shutdown logic. Using --force
hence is a much safer replacement for the old /sbin/reboot -f and does
not leave dirty file systems behind. (Thanks to Fabiano Fidencio and his
colleagues from ProFUSION for this).

systemd now includes a minimalistic readahead implementation, based on
fanotify(), fadvise() and mincore(). It supports btrfs defragmentation and both
SSD and HDD disks. While the effect on boots that are anyway fast (such as most
stuff involving SSD) is minimal, slower and older machines benefit from this
more substantially.

We now control fsck and quota during early boot with a C tool that ensures
maximum parallelization but properly implements the necessary high-level
administration logic.

Every service, every user and every user session now gets its own cgroup in
the ‘cpu’ hierarchy thus creating better fairness between the logged in users
and their sessions.

We now provide /dev/log logging from early boot to late shutdown.
If no syslog daemon is running the output is passed on to kmsg. As soon as a
proper syslog daemon starts up the kmsg buffer is flushed to syslog, and hence
we will have complete log coverage in syslog even for early boot.

systemctl kill was introduced, an easy command to send a signal to
all processes of a service. Expect a blog story with more details about this
shortly.

systemd gained the ability to load the SELinux policy if necessary, thus
supporting non-initrd boots and initrd boots from the same binary with no
duplicate work. This is in fact (and surprisingly) a first among Linux init
systems.

We now initialize and set the system locale inside PID 1 to be inherited by
all services and users.

systemd has native support for /etc/crypttab and can activate
encrypted LUKS/dm-crypt disks both at boot-up and during runtime. A minimal
password querying infrastructure is available, where multiple agents can be
used to present the password to the user. During boot the password is queried
either via Plymouth or directly on the console. If a system crypto disk is
plugged in after boot you are queried for the password via a GNOME agent, or a
wall(1) agent. Finally, while you run systemctl start (or a similar
command) a minimal TTY password agent is available which asks you for passwords
right away if this is necessary. The password querying logic is very simple,
additional agents can be implemented in a trivial amount of code (Yupp, KDE folks, you
can add an agent for this, too). Note that the password querying logic in
systemd is only for non-user passwords, i.e. passwords that have no relation to
a specific user, but rather to specific hardware or system software. In future
we hope to extend this so that this can be used to query the password of SSL
certificates when Apache or other servers start.

We offer a minimal interface that external projects can use to extend the
dependency graph systemd manages. In fact, the cryptsetup logic mentioned above
is implemented via this ‘plugin’-like system. Since we did not want to add code
that deals with cryptographic disks into the systemd process itself we
introduced this interface (after all cryptographic volumes are not an essential
feature of a minimal OS, and unnecessary on most embedded uses; also the future
might bring us STC which might make this at least partially obsolete). Simply
by dropping a generator binary into
/lib/systemd/system-generators, which should write out systemd unit
files into a temporary directory, third-party packages may extend the systemd
dependency tree dynamically. This could be useful for example to automatically
create a systemd service for each KVM machine or LXC container. With that in
place those containers/machines could be managed and supervised with the same
tools as the usual system services.
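
To make the generator idea more tangible, here is a minimal sketch in shell. It assumes that the generator is handed its output directory as the first argument, which the text above does not spell out, and the container names, the unit naming scheme and the /usr/bin/example-container command are purely hypothetical.

```shell
#!/bin/sh
# Sketch of a generator: writes one service unit per (hypothetical) container.

generate_units() {
    outdir=$1
    shift
    for name in "$@"; do
        # One unit file per container; name and ExecStart are placeholders.
        cat > "$outdir/container-$name.service" <<EOF
[Unit]
Description=Container $name

[Service]
ExecStart=/usr/bin/example-container run $name
EOF
    done
}

# Demonstration against a scratch directory instead of a real system path.
outdir=$(mktemp -d)
generate_units "$outdir" web db
ls "$outdir"
```

A real generator would discover the machines or containers itself rather than take them as arguments; the fixed list here only keeps the sketch self-contained.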

We integrated automatic clean-up of directories such as /tmp into
the tmpfiles logic we already had in place that recreates files and
directories on volatile file systems such as /var/run,
/var/lock or /tmp.
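
As an illustration, a tmpfiles configuration with clean-up could look roughly like the following sketch. The column layout and the age field follow the tmpfiles.d format documented for later systemd releases and are an assumption for the version discussed here; the file name and the concrete values are hypothetical. The script only writes the fragment into a scratch directory and lists the paths that carry an age.

```shell
#!/bin/sh
# Write a hypothetical tmpfiles configuration into a scratch directory.
confdir=$(mktemp -d)
conf="$confdir/example-tmp.conf"

cat > "$conf" <<'EOF'
# Type Path      Mode UID  GID  Age
d      /tmp      1777 root root 10d
d      /var/lock 0755 root root -
EOF

# Paths with an age set are the ones subject to automatic clean-up.
aged=$(awk '!/^#/ && $6 != "-" { print $2 }' "$conf")
echo "$aged"
```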

We now always measure the system startup time and write it to the log
files, broken up into how much time was spent on the kernel, the initrd
and the initialization of userspace.

We now safely destroy all user sessions before going down. This is a feature
long missing on Linux: since user processes were not killed until the very last
moment the unhealthy situation that user code was running at a time where no
other daemon was remaining was a normal part of shutdown.

systemd now understands an ‘extreme’ form of disabling a service: if you
symlink a service name in /etc/systemd/system to /dev/null
then systemd will mark it as masked and completely refuse starting it,
regardless of whether this is requested manually or automatically. Normally it should
be sufficient to simply call systemctl disable to disable a service,
which still allows manual activation but no automatic activation. Masking a
service goes one step further.
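
Masking as described here is nothing more than a symlink. The sketch below reproduces it in a scratch directory so that it can be tried without touching a real system; on a real machine the link would live in /etc/systemd/system, and example.service is a hypothetical unit name.

```shell
#!/bin/sh
# Stand-in for /etc/systemd/system, so nothing on the real system changes.
unitdir=$(mktemp -d)

# Masking: the unit name becomes a symlink pointing at /dev/null.
ln -s /dev/null "$unitdir/example.service"

# A masked unit is recognizable by its link target.
target=$(readlink "$unitdir/example.service")
echo "example.service -> $target"

# Unmasking is simply removing the symlink again.
rm "$unitdir/example.service"
```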

There’s now a simple condition syntax in place which allows
skipping or enabling units depending on the existence of a file, whether a
directory is empty or whether a kernel command line option is set.
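
A unit using such conditions might look like the sketch below. The option names ConditionPathExists=, ConditionDirectoryNotEmpty= and ConditionKernelCommandLine= are taken from the systemd.unit documentation of later releases and may not match the version discussed here exactly; the unit name, the paths and the example.debug option are hypothetical. The script merely writes the unit into a scratch directory and prints the condition lines.

```shell
#!/bin/sh
# Write a hypothetical unit file with conditions into a scratch directory.
unitdir=$(mktemp -d)
unit="$unitdir/example.service"

cat > "$unit" <<'EOF'
[Unit]
Description=Example service that only starts when its conditions hold
# Skip the unit unless this file exists.
ConditionPathExists=/etc/example.conf
# Skip the unit if this directory is empty.
ConditionDirectoryNotEmpty=/var/lib/example
# Skip the unit unless this option is on the kernel command line.
ConditionKernelCommandLine=example.debug

[Service]
ExecStart=/usr/bin/example-daemon
EOF

# Show which conditions were written.
conditions=$(grep '^Condition' "$unit")
echo "$conditions"
```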

In addition to normal shutdowns for reboot, halt or poweroff we now
similarly support a kexec reboot, that reboots the machine without going through
the BIOS code again.

We have bash completion support for systemctl. (Ran Benita)

Andrew Edmunds contributed basic support to boot Ubuntu with systemd.

Michael Biebl and Tollef Fog Heen have worked on the systemd integration
into Debian to a level that it is now possible to boot a system without having
the old initscripts package installed. For more details see the Debian Wiki. Michael even
tested this integration on an Ubuntu Natty system and as it turns out this
works almost equally well on Ubuntu already. If you are interested in playing
around with this, ping Michael.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you.

We have come quite far in the last year. systemd is about a year old now,
and we are now able to boot a system without legacy shell scripts remaining,
something that appeared to be a task for the distant future.

All of this is available in systemd 13 and in F15/Rawhide as I type
this. If you want to play around with this then consider installing Rawhide
(it’s fun!).

systemd for Administrators, Part II

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-for-admins-2.html

Here’s the second installment of my ongoing series about systemd for administrators.

Which Service Owns Which Processes?

On most Linux systems the number of processes that are running by
default is substantial. Knowing which process does what and where it
belongs to becomes increasingly difficult. Some services even maintain
a couple of worker processes which clutter the “ps” output with
many additional processes that are often not easy to recognize. This is
further complicated if daemons spawn arbitrary 3rd-party processes, as
Apache does with CGI processes, or cron does with user jobs.

A slight remedy for this is often the process inheritance tree, as
shown by “ps xaf”. However this is usually not reliable, as processes
whose parents die get reparented to PID 1, and hence all information
about inheritance gets lost. If a process “double forks” it hence loses
its relationships to the processes that started it. (This actually is
supposed to be a feature and is relied on for the traditional Unix
daemonizing logic.) Furthermore processes can freely change their names
with PR_SET_NAME or by patching argv[0], thus making
it harder to recognize them. In fact they can play hide-and-seek with
the administrator pretty nicely this way.

In systemd we place every process that is spawned in a control
group named after its service. Control groups (or cgroups)
at their most basic are simply groups of processes that can be
arranged in a hierarchy and labelled individually. When processes
spawn other processes these children are automatically made members of
the parents cgroup. Leaving a cgroup is not possible for unprivileged
processes. Thus, cgroups can be used as an effective way to label
processes after the service they belong to and be sure that the
service cannot escape from the label, regardless how often it forks or
renames itself. Furthermore this can be used to safely kill a service
and all processes it created, again with no chance of escaping.

In today’s installment I want to introduce you to two commands you
may use to relate systemd services and processes. The first one, is
the well known ps command which has been updated to show
cgroup information along the other process details. And this is how it
looks:

$ ps xawf -eo pid,user,cgroup,args
PID USER CGROUP COMMAND
2 root – [kthreadd]
3 root – _ [ksoftirqd/0]
[…]
4281 root – _ [flush-8:0]
1 root name=systemd:/systemd-1 /sbin/init
455 root name=systemd:/systemd-1/sysinit.service /sbin/udevd -d
28188 root name=systemd:/systemd-1/sysinit.service _ /sbin/udevd -d
28191 root name=systemd:/systemd-1/sysinit.service _ /sbin/udevd -d
1096 dbus name=systemd:/systemd-1/dbus.service /bin/dbus-daemon –system –address=systemd: –nofork –systemd-activation
1131 root name=systemd:/systemd-1/auditd.service auditd
1133 root name=systemd:/systemd-1/auditd.service _ /sbin/audispd
1135 root name=systemd:/systemd-1/auditd.service _ /usr/sbin/sedispatch
1171 root name=systemd:/systemd-1/NetworkManager.service /usr/sbin/NetworkManager –no-daemon
4028 root name=systemd:/systemd-1/NetworkManager.service _ /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-wlan0.pid -lf /var/lib/dhclient/dhclient-7d32a784-ede9-4cf6-9ee3-60edc0bce5ff-wlan0.lease –
1175 avahi name=systemd:/systemd-1/avahi-daemon.service avahi-daemon: running [epsilon.local]
1194 avahi name=systemd:/systemd-1/avahi-daemon.service _ avahi-daemon: chroot helper
1193 root name=systemd:/systemd-1/rsyslog.service /sbin/rsyslogd -c 4
1195 root name=systemd:/systemd-1/cups.service cupsd -C /etc/cups/cupsd.conf
1207 root name=systemd:/systemd-1/mdmonitor.service mdadm –monitor –scan -f –pid-file=/var/run/mdadm/mdadm.pid
1210 root name=systemd:/systemd-1/irqbalance.service irqbalance
1216 root name=systemd:/systemd-1/dbus.service /usr/sbin/modem-manager
1219 root name=systemd:/systemd-1/dbus.service /usr/libexec/polkit-1/polkitd
1242 root name=systemd:/systemd-1/dbus.service /usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf -B -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid
1249 68 name=systemd:/systemd-1/haldaemon.service hald
1250 root name=systemd:/systemd-1/haldaemon.service _ hald-runner
1273 root name=systemd:/systemd-1/haldaemon.service _ hald-addon-input: Listening on /dev/input/event3 /dev/input/event9 /dev/input/event1 /dev/input/event7 /dev/input/event2 /dev/input/event0 /dev/input/event8
1275 root name=systemd:/systemd-1/haldaemon.service _ /usr/libexec/hald-addon-rfkill-killswitch
1284 root name=systemd:/systemd-1/haldaemon.service _ /usr/libexec/hald-addon-leds
1285 root name=systemd:/systemd-1/haldaemon.service _ /usr/libexec/hald-addon-generic-backlight
1287 68 name=systemd:/systemd-1/haldaemon.service _ /usr/libexec/hald-addon-acpi
1317 root name=systemd:/systemd-1/abrtd.service /usr/sbin/abrtd -d -s
1332 root name=systemd:/systemd-1/[email protected]/tty2 /sbin/mingetty tty2
1339 root name=systemd:/systemd-1/[email protected]/tty3 /sbin/mingetty tty3
1342 root name=systemd:/systemd-1/[email protected]/tty5 /sbin/mingetty tty5
1343 root name=systemd:/systemd-1/[email protected]/tty4 /sbin/mingetty tty4
1344 root name=systemd:/systemd-1/crond.service crond
1346 root name=systemd:/systemd-1/[email protected]/tty6 /sbin/mingetty tty6
1362 root name=systemd:/systemd-1/sshd.service /usr/sbin/sshd
1376 root name=systemd:/systemd-1/prefdm.service /usr/sbin/gdm-binary -nodaemon
1391 root name=systemd:/systemd-1/prefdm.service _ /usr/libexec/gdm-simple-slave –display-id /org/gnome/DisplayManager/Display1 –force-active-vt
1394 root name=systemd:/systemd-1/prefdm.service _ /usr/bin/Xorg :0 -nr -verbose -auth /var/run/gdm/auth-for-gdm-f2KUOh/database -nolisten tcp vt1
1495 root name=systemd:/user/lennart/1 _ pam: gdm-password
1521 lennart name=systemd:/user/lennart/1 _ gnome-session
1621 lennart name=systemd:/user/lennart/1 _ metacity
1635 lennart name=systemd:/user/lennart/1 _ gnome-panel
1638 lennart name=systemd:/user/lennart/1 _ nautilus
1640 lennart name=systemd:/user/lennart/1 _ /usr/libexec/polkit-gnome-authentication-agent-1
1641 lennart name=systemd:/user/lennart/1 _ /usr/bin/seapplet
1644 lennart name=systemd:/user/lennart/1 _ gnome-volume-control-applet
1646 lennart name=systemd:/user/lennart/1 _ /usr/sbin/restorecond -u
1652 lennart name=systemd:/user/lennart/1 _ /usr/bin/devilspie
1662 lennart name=systemd:/user/lennart/1 _ nm-applet –sm-disable
1664 lennart name=systemd:/user/lennart/1 _ gnome-power-manager
1665 lennart name=systemd:/user/lennart/1 _ /usr/libexec/gdu-notification-daemon
1670 lennart name=systemd:/user/lennart/1 _ /usr/libexec/evolution/2.32/evolution-alarm-notify
1672 lennart name=systemd:/user/lennart/1 _ /usr/bin/python /usr/share/system-config-printer/applet.py
1674 lennart name=systemd:/user/lennart/1 _ /usr/lib64/deja-dup/deja-dup-monitor
1675 lennart name=systemd:/user/lennart/1 _ abrt-applet
1677 lennart name=systemd:/user/lennart/1 _ bluetooth-applet
1678 lennart name=systemd:/user/lennart/1 _ gpk-update-icon
1408 root name=systemd:/systemd-1/console-kit-daemon.service /usr/sbin/console-kit-daemon –no-daemon
1419 gdm name=systemd:/systemd-1/prefdm.service /usr/bin/dbus-launch –exit-with-session
1453 root name=systemd:/systemd-1/dbus.service /usr/libexec/upowerd
1473 rtkit name=systemd:/systemd-1/rtkit-daemon.service /usr/libexec/rtkit-daemon
1496 root name=systemd:/systemd-1/accounts-daemon.service /usr/libexec/accounts-daemon
1499 root name=systemd:/systemd-1/systemd-logger.service /lib/systemd/systemd-logger
1511 lennart name=systemd:/systemd-1/prefdm.service /usr/bin/gnome-keyring-daemon –daemonize –login
1534 lennart name=systemd:/user/lennart/1 dbus-launch –sh-syntax –exit-with-session
1535 lennart name=systemd:/user/lennart/1 /bin/dbus-daemon –fork –print-pid 5 –print-address 7 –session
1603 lennart name=systemd:/user/lennart/1 /usr/libexec/gconfd-2
1612 lennart name=systemd:/user/lennart/1 /usr/libexec/gnome-settings-daemon
1615 lennart name=systemd:/user/lennart/1 /usr/libexec/gvfsd
1626 lennart name=systemd:/user/lennart/1 /usr/libexec//gvfs-fuse-daemon /home/lennart/.gvfs
1634 lennart name=systemd:/user/lennart/1 /usr/bin/pulseaudio –start –log-target=syslog
1649 lennart name=systemd:/user/lennart/1 _ /usr/libexec/pulse/gconf-helper
1645 lennart name=systemd:/user/lennart/1 /usr/libexec/bonobo-activation-server –ac-activate –ior-output-fd=24
1668 lennart name=systemd:/user/lennart/1 /usr/libexec/im-settings-daemon
1701 lennart name=systemd:/user/lennart/1 /usr/libexec/gvfs-gdu-volume-monitor
1707 lennart name=systemd:/user/lennart/1 /usr/bin/gnote –panel-applet –oaf-activate-iid=OAFIID:GnoteApplet_Factory –oaf-ior-fd=22
1725 lennart name=systemd:/user/lennart/1 /usr/libexec/clock-applet
1727 lennart name=systemd:/user/lennart/1 /usr/libexec/wnck-applet
1729 lennart name=systemd:/user/lennart/1 /usr/libexec/notification-area-applet
1733 root name=systemd:/systemd-1/dbus.service /usr/libexec/udisks-daemon
1747 root name=systemd:/systemd-1/dbus.service _ udisks-daemon: polling /dev/sr0
1759 lennart name=systemd:/user/lennart/1 gnome-screensaver
1780 lennart name=systemd:/user/lennart/1 /usr/libexec/gvfsd-trash –spawner :1.9 /org/gtk/gvfs/exec_spaw/0
1864 lennart name=systemd:/user/lennart/1 /usr/libexec/gvfs-afc-volume-monitor
1874 lennart name=systemd:/user/lennart/1 /usr/libexec/gconf-im-settings-daemon
1903 lennart name=systemd:/user/lennart/1 /usr/libexec/gvfsd-burn –spawner :1.9 /org/gtk/gvfs/exec_spaw/1
1909 lennart name=systemd:/user/lennart/1 gnome-terminal
1913 lennart name=systemd:/user/lennart/1 _ gnome-pty-helper
1914 lennart name=systemd:/user/lennart/1 _ bash
29231 lennart name=systemd:/user/lennart/1 | _ ssh tango
2221 lennart name=systemd:/user/lennart/1 _ bash
4193 lennart name=systemd:/user/lennart/1 | _ ssh tango
2461 lennart name=systemd:/user/lennart/1 _ bash
29219 lennart name=systemd:/user/lennart/1 | _ emacs systemd-for-admins-1.txt
15113 lennart name=systemd:/user/lennart/1 _ bash
27251 lennart name=systemd:/user/lennart/1 _ empathy
29504 lennart name=systemd:/user/lennart/1 _ ps xawf -eo pid,user,cgroup,args
1968 lennart name=systemd:/user/lennart/1 ssh-agent
1994 lennart name=systemd:/user/lennart/1 gpg-agent –daemon –write-env-file
18679 lennart name=systemd:/user/lennart/1 /bin/sh /usr/lib64/firefox-3.6/run-mozilla.sh /usr/lib64/firefox-3.6/firefox
18741 lennart name=systemd:/user/lennart/1 _ /usr/lib64/firefox-3.6/firefox
28900 lennart name=systemd:/user/lennart/1 _ /usr/lib64/nspluginwrapper/npviewer.bin –plugin /usr/lib64/mozilla/plugins/libflashplayer.so –connection /org/wrapper/NSPlugins/libflashplayer.so/18741-6
4016 root name=systemd:/systemd-1/sysinit.service /usr/sbin/bluetoothd –udev
4094 smmsp name=systemd:/systemd-1/sendmail.service sendmail: Queue [email protected]:00:00 for /var/spool/clientmqueue
4096 root name=systemd:/systemd-1/sendmail.service sendmail: accepting connections
4112 ntp name=systemd:/systemd-1/ntpd.service /usr/sbin/ntpd -n -u ntp:ntp -g
27262 lennart name=systemd:/user/lennart/1 /usr/libexec/mission-control-5
27265 lennart name=systemd:/user/lennart/1 /usr/libexec/telepathy-haze
27268 lennart name=systemd:/user/lennart/1 /usr/libexec/telepathy-logger
27270 lennart name=systemd:/user/lennart/1 /usr/libexec/dconf-service
27280 lennart name=systemd:/user/lennart/1 /usr/libexec/notification-daemon
27284 lennart name=systemd:/user/lennart/1 /usr/libexec/telepathy-gabble
27285 lennart name=systemd:/user/lennart/1 /usr/libexec/telepathy-salut
27297 lennart name=systemd:/user/lennart/1 /usr/libexec/geoclue-yahoo

(Note that this output is shortened, I have removed most of the
kernel threads here, since they are not relevant in the context of
this blog story)

In the third column you see the cgroup systemd assigned to each
process. You’ll find that the udev processes are in the
name=systemd:/systemd-1/sysinit.service cgroup, which is
where systemd places all processes started by the
sysinit.service service, which covers early boot.

My personal recommendation is to set the shell alias psc
to the ps command line shown above:

alias psc=’ps xawf -eo pid,user,cgroup,args’

With this service information of processes is just four keypresses
away!

A different way to present the same information is the
systemd-cgls tool we ship with systemd. It shows the cgroup
hierarchy in a pretty tree. Its output looks like this:

$ systemd-cgls
+ 2 [kthreadd]
[…]
+ 4281 [flush-8:0]
+ user
| lennart
| 1
| + 1495 pam: gdm-password
| + 1521 gnome-session
| + 1534 dbus-launch –sh-syntax –exit-with-session
| + 1535 /bin/dbus-daemon –fork –print-pid 5 –print-address 7 –session
| + 1603 /usr/libexec/gconfd-2
| + 1612 /usr/libexec/gnome-settings-daemon
| + 1615 /ushr/libexec/gvfsd
| + 1621 metacity
| + 1626 /usr/libexec//gvfs-fuse-daemon /home/lennart/.gvfs
| + 1634 /usr/bin/pulseaudio –start –log-target=syslog
| + 1635 gnome-panel
| + 1638 nautilus
| + 1640 /usr/libexec/polkit-gnome-authentication-agent-1
| + 1641 /usr/bin/seapplet
| + 1644 gnome-volume-control-applet
| + 1645 /usr/libexec/bonobo-activation-server –ac-activate –ior-output-fd=24
| + 1646 /usr/sbin/restorecond -u
| + 1649 /usr/libexec/pulse/gconf-helper
| + 1652 /usr/bin/devilspie
| + 1662 nm-applet –sm-disable
| + 1664 gnome-power-manager
| + 1665 /usr/libexec/gdu-notification-daemon
| + 1668 /usr/libexec/im-settings-daemon
| + 1670 /usr/libexec/evolution/2.32/evolution-alarm-notify
| + 1672 /usr/bin/python /usr/share/system-config-printer/applet.py
| + 1674 /usr/lib64/deja-dup/deja-dup-monitor
| + 1675 abrt-applet
| + 1677 bluetooth-applet
| + 1678 gpk-update-icon
| + 1701 /usr/libexec/gvfs-gdu-volume-monitor
| + 1707 /usr/bin/gnote –panel-applet –oaf-activate-iid=OAFIID:GnoteApplet_Factory –oaf-ior-fd=22
| + 1725 /usr/libexec/clock-applet
| + 1727 /usr/libexec/wnck-applet
| + 1729 /usr/libexec/notification-area-applet
| + 1759 gnome-screensaver
| + 1780 /usr/libexec/gvfsd-trash –spawner :1.9 /org/gtk/gvfs/exec_spaw/0
| + 1864 /usr/libexec/gvfs-afc-volume-monitor
| + 1874 /usr/libexec/gconf-im-settings-daemon
| + 1882 /usr/libexec/gvfs-gphoto2-volume-monitor
| + 1903 /usr/libexec/gvfsd-burn –spawner :1.9 /org/gtk/gvfs/exec_spaw/1
| + 1909 gnome-terminal
| + 1913 gnome-pty-helper
| + 1914 bash
| + 1968 ssh-agent
| + 1994 gpg-agent –daemon –write-env-file
| + 2221 bash
| + 2461 bash
| + 4193 ssh tango
| + 15113 bash
| + 18679 /bin/sh /usr/lib64/firefox-3.6/run-mozilla.sh /usr/lib64/firefox-3.6/firefox
| + 18741 /usr/lib64/firefox-3.6/firefox
| + 27251 empathy
| + 27262 /usr/libexec/mission-control-5
| + 27265 /usr/libexec/telepathy-haze
| + 27268 /usr/libexec/telepathy-logger
| + 27270 /usr/libexec/dconf-service
| + 27280 /usr/libexec/notification-daemon
| + 27284 /usr/libexec/telepathy-gabble
| + 27285 /usr/libexec/telepathy-salut
| + 27297 /usr/libexec/geoclue-yahoo
| + 28900 /usr/lib64/nspluginwrapper/npviewer.bin –plugin /usr/lib64/mozilla/plugins/libflashplayer.so –connection /org/wrapper/NSPlugins/libflashplayer.so/18741-6
| + 29219 emacs systemd-for-admins-1.txt
| + 29231 ssh tango
| 29519 systemd-cgls
systemd-1
+ 1 /sbin/init
+ ntpd.service
| 4112 /usr/sbin/ntpd -n -u ntp:ntp -g
+ systemd-logger.service
| 1499 /lib/systemd/systemd-logger
+ accounts-daemon.service
| 1496 /usr/libexec/accounts-daemon
+ rtkit-daemon.service
| 1473 /usr/libexec/rtkit-daemon
+ console-kit-daemon.service
| 1408 /usr/sbin/console-kit-daemon –no-daemon
+ prefdm.service
| + 1376 /usr/sbin/gdm-binary -nodaemon
| + 1391 /usr/libexec/gdm-simple-slave –display-id /org/gnome/DisplayManager/Display1 –force-active-vt
| + 1394 /usr/bin/Xorg :0 -nr -verbose -auth /var/run/gdm/auth-for-gdm-f2KUOh/database -nolisten tcp vt1
| + 1419 /usr/bin/dbus-launch –exit-with-session
| 1511 /usr/bin/gnome-keyring-daemon –daemonize –login
+ [email protected]
| + tty6
| | 1346 /sbin/mingetty tty6
| + tty4
| | 1343 /sbin/mingetty tty4
| + tty5
| | 1342 /sbin/mingetty tty5
| + tty3
| | 1339 /sbin/mingetty tty3
| tty2
| 1332 /sbin/mingetty tty2
+ abrtd.service
| 1317 /usr/sbin/abrtd -d -s
+ crond.service
| 1344 crond
+ sshd.service
| 1362 /usr/sbin/sshd
+ sendmail.service
| + 4094 sendmail: Queue [email protected]:00:00 for /var/spool/clientmqueue
| 4096 sendmail: accepting connections
+ haldaemon.service
| + 1249 hald
| + 1250 hald-runner
| + 1273 hald-addon-input: Listening on /dev/input/event3 /dev/input/event9 /dev/input/event1 /dev/input/event7 /dev/input/event2 /dev/input/event0 /dev/input/event8
| + 1275 /usr/libexec/hald-addon-rfkill-killswitch
| + 1284 /usr/libexec/hald-addon-leds
| + 1285 /usr/libexec/hald-addon-generic-backlight
| 1287 /usr/libexec/hald-addon-acpi
+ irqbalance.service
| 1210 irqbalance
+ avahi-daemon.service
| + 1175 avahi-daemon: running [epsilon.local]
+ NetworkManager.service
| + 1171 /usr/sbin/NetworkManager –no-daemon
| 4028 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-wlan0.pid -lf /var/lib/dhclient/dhclient-7d32a784-ede9-4cf6-9ee3-60edc0bce5ff-wlan0.lease -cf /var/run/nm-dhclient-wlan0.conf wlan0
+ rsyslog.service
| 1193 /sbin/rsyslogd -c 4
+ mdmonitor.service
| 1207 mdadm –monitor –scan -f –pid-file=/var/run/mdadm/mdadm.pid
+ cups.service
| 1195 cupsd -C /etc/cups/cupsd.conf
+ auditd.service
| + 1131 auditd
| + 1133 /sbin/audispd
| 1135 /usr/sbin/sedispatch
+ dbus.service
| + 1096 /bin/dbus-daemon –system –address=systemd: –nofork –systemd-activation
| + 1216 /usr/sbin/modem-manager
| + 1219 /usr/libexec/polkit-1/polkitd
| + 1242 /usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf -B -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid
| + 1453 /usr/libexec/upowerd
| + 1733 /usr/libexec/udisks-daemon
| + 1747 udisks-daemon: polling /dev/sr0
| 29509 /usr/libexec/packagekitd
+ dev-mqueue.mount
+ dev-hugepages.mount
sysinit.service
+ 455 /sbin/udevd -d
+ 4016 /usr/sbin/bluetoothd –udev
+ 28188 /sbin/udevd -d
28191 /sbin/udevd -d

(This too is shortened, the same way)

As you can see, this command shows the processes by their cgroup
and hence service, as systemd labels the cgroups after the
services. For example, you can easily see that the auditing service
auditd.service spawns three individual processes,
auditd, audisp and sedispatch.

If you look closely you will notice that a number of processes have
been assigned to the cgroup /user/1. At this point let’s
simply leave it at that systemd not only maintains services in cgroups,
but user session processes as well. In a later installment we’ll discuss in
more detail what this about.

So much for now, come back soon for the next installment!

systemd for Administrators, Part II

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-for-admins-2.html

Here’s the second installment of my ongoing series about systemd for administrators.

Which Service Owns Which Processes?

On most Linux systems the number of processes that are running by
default is substantial. Knowing which process does what and where it
belongs to becomes increasingly difficult. Some services even maintain
a couple of worker processes which clutter the “ps” output with
many additional processes that are often not easy to recognize. This is
further complicated if daemons spawn arbitrary 3rd-party processes, as
Apache does with CGI processes, or cron does with user jobs.

A slight remedy for this is often the process inheritance tree, as
shown by “ps xaf”. However, this is usually not reliable, as processes
whose parents die get reparented to PID 1, and hence all information
about inheritance gets lost. If a process “double forks” it hence loses
its relationships to the processes that started it. (This actually is
supposed to be a feature and is relied on for the traditional Unix
daemonizing logic.) Furthermore processes can freely change their names
with PR_SET_NAME or by patching argv[0], thus making
it harder to recognize them. In fact they can play hide-and-seek with
the administrator pretty nicely this way.
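
To see the reparenting for yourself, here is a minimal sketch, assuming
a Linux system with /proc mounted. It only approximates a double fork: a
short-lived subshell starts the command in the background and exits
right away. The helper names parent_of and spawn_orphan are made up for
this illustration.

```shell
# Print the parent PID of process $1 by reading /proc (Linux-specific).
parent_of() {
    while read -r key value; do
        if [ "$key" = "PPid:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# Run "$@" as a background child of a short-lived subshell and print the
# PID of that child. Once the subshell exits, the child has lost the
# parent that actually started it.
spawn_orphan() {
    ( "$@" >/dev/null 2>&1 & echo "$!" )
}

orphan=$(spawn_orphan sleep 10)
echo "orphan $orphan now has parent $(parent_of "$orphan")"
```

On a classic setup the reported parent is PID 1; on newer systems it may
instead be a process that has registered as a child subreaper. Either
way, the shell that really started the command no longer shows up as its
parent.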

In systemd we place every process that is spawned in a control
group named after its service. Control groups (or cgroups)
at their most basic are simply groups of processes that can be
arranged in a hierarchy and labelled individually. When processes
spawn other processes, these children are automatically made members of
the parent’s cgroup. Leaving a cgroup is not possible for unprivileged
processes. Thus, cgroups can be used as an effective way to label
processes after the service they belong to and be sure that the
service cannot escape from the label, regardless of how often it forks or
renames itself. Furthermore, this can be used to safely kill a service
and all processes it created, again with no chance of escaping.
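
The kernel exposes this label for each process in /proc/PID/cgroup, one
line per hierarchy in the form ID:controllers:path. As a rough sketch,
the path can be pulled out as follows. The function name
systemd_cgroup_path is made up for this example, and it assumes either
the name=systemd hierarchy used in this post or a unified hierarchy
whose controller field is empty.

```shell
# Print the systemd cgroup path found in /proc/PID/cgroup-style input.
# Each input line has the form "hierarchy-ID:controller-list:path".
systemd_cgroup_path() {
    awk -F: '
        # Match the named systemd hierarchy, or a unified hierarchy
        # entry, which has an empty controller list.
        NF >= 3 && ($2 == "name=systemd" || $2 == "") {
            sub(/^[^:]*:[^:]*:/, "")   # drop the first two fields
            print
            exit
        }'
}

# Which cgroup is the current shell in? (Skipped where /proc is missing.)
if [ -r /proc/self/cgroup ]; then
    systemd_cgroup_path < /proc/self/cgroup
fi
```

Because children inherit the cgroup of their parent, running this from
within a service or any of its descendants prints the same path.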

In today’s installment I want to introduce you to two commands you
may use to relate systemd services and processes. The first one is
the well-known ps command, which has been updated to show
cgroup information alongside the other process details. This is how it
looks:

$ ps xawf -eo pid,user,cgroup,args
  PID USER     CGROUP                              COMMAND
    2 root     -                                   [kthreadd]
    3 root     -                                    \_ [ksoftirqd/0]
[...]
 4281 root     -                                    \_ [flush-8:0]
    1 root     name=systemd:/systemd-1             /sbin/init
  455 root     name=systemd:/systemd-1/sysinit.service /sbin/udevd -d
28188 root     name=systemd:/systemd-1/sysinit.service  \_ /sbin/udevd -d
28191 root     name=systemd:/systemd-1/sysinit.service  \_ /sbin/udevd -d
 1096 dbus     name=systemd:/systemd-1/dbus.service /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
 1131 root     name=systemd:/systemd-1/auditd.service auditd
 1133 root     name=systemd:/systemd-1/auditd.service  \_ /sbin/audispd
 1135 root     name=systemd:/systemd-1/auditd.service      \_ /usr/sbin/sedispatch
 1171 root     name=systemd:/systemd-1/NetworkManager.service /usr/sbin/NetworkManager --no-daemon
 4028 root     name=systemd:/systemd-1/NetworkManager.service  \_ /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-wlan0.pid -lf /var/lib/dhclient/dhclient-7d32a784-ede9-4cf6-9ee3-60edc0bce5ff-wlan0.lease -
 1175 avahi    name=systemd:/systemd-1/avahi-daemon.service avahi-daemon: running [epsilon.local]
 1194 avahi    name=systemd:/systemd-1/avahi-daemon.service  \_ avahi-daemon: chroot helper
 1193 root     name=systemd:/systemd-1/rsyslog.service /sbin/rsyslogd -c 4
 1195 root     name=systemd:/systemd-1/cups.service cupsd -C /etc/cups/cupsd.conf
 1207 root     name=systemd:/systemd-1/mdmonitor.service mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid
 1210 root     name=systemd:/systemd-1/irqbalance.service irqbalance
 1216 root     name=systemd:/systemd-1/dbus.service /usr/sbin/modem-manager
 1219 root     name=systemd:/systemd-1/dbus.service /usr/libexec/polkit-1/polkitd
 1242 root     name=systemd:/systemd-1/dbus.service /usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf -B -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid
 1249 68       name=systemd:/systemd-1/haldaemon.service hald
 1250 root     name=systemd:/systemd-1/haldaemon.service  \_ hald-runner
 1273 root     name=systemd:/systemd-1/haldaemon.service      \_ hald-addon-input: Listening on /dev/input/event3 /dev/input/event9 /dev/input/event1 /dev/input/event7 /dev/input/event2 /dev/input/event0 /dev/input/event8
 1275 root     name=systemd:/systemd-1/haldaemon.service      \_ /usr/libexec/hald-addon-rfkill-killswitch
 1284 root     name=systemd:/systemd-1/haldaemon.service      \_ /usr/libexec/hald-addon-leds
 1285 root     name=systemd:/systemd-1/haldaemon.service      \_ /usr/libexec/hald-addon-generic-backlight
 1287 68       name=systemd:/systemd-1/haldaemon.service      \_ /usr/libexec/hald-addon-acpi
 1317 root     name=systemd:/systemd-1/abrtd.service /usr/sbin/abrtd -d -s
 1332 root     name=systemd:/systemd-1/getty@.service/tty2 /sbin/mingetty tty2
 1339 root     name=systemd:/systemd-1/getty@.service/tty3 /sbin/mingetty tty3
 1342 root     name=systemd:/systemd-1/getty@.service/tty5 /sbin/mingetty tty5
 1343 root     name=systemd:/systemd-1/getty@.service/tty4 /sbin/mingetty tty4
 1344 root     name=systemd:/systemd-1/crond.service crond
 1346 root     name=systemd:/systemd-1/getty@.service/tty6 /sbin/mingetty tty6
 1362 root     name=systemd:/systemd-1/sshd.service /usr/sbin/sshd
 1376 root     name=systemd:/systemd-1/prefdm.service /usr/sbin/gdm-binary -nodaemon
 1391 root     name=systemd:/systemd-1/prefdm.service  \_ /usr/libexec/gdm-simple-slave --display-id /org/gnome/DisplayManager/Display1 --force-active-vt
 1394 root     name=systemd:/systemd-1/prefdm.service      \_ /usr/bin/Xorg :0 -nr -verbose -auth /var/run/gdm/auth-for-gdm-f2KUOh/database -nolisten tcp vt1
 1495 root     name=systemd:/user/lennart/1             \_ pam: gdm-password
 1521 lennart  name=systemd:/user/lennart/1                 \_ gnome-session
 1621 lennart  name=systemd:/user/lennart/1                     \_ metacity
 1635 lennart  name=systemd:/user/lennart/1                     \_ gnome-panel
 1638 lennart  name=systemd:/user/lennart/1                     \_ nautilus
 1640 lennart  name=systemd:/user/lennart/1                     \_ /usr/libexec/polkit-gnome-authentication-agent-1
 1641 lennart  name=systemd:/user/lennart/1                     \_ /usr/bin/seapplet
 1644 lennart  name=systemd:/user/lennart/1                     \_ gnome-volume-control-applet
 1646 lennart  name=systemd:/user/lennart/1                     \_ /usr/sbin/restorecond -u
 1652 lennart  name=systemd:/user/lennart/1                     \_ /usr/bin/devilspie
 1662 lennart  name=systemd:/user/lennart/1                     \_ nm-applet --sm-disable
 1664 lennart  name=systemd:/user/lennart/1                     \_ gnome-power-manager
 1665 lennart  name=systemd:/user/lennart/1                     \_ /usr/libexec/gdu-notification-daemon
 1670 lennart  name=systemd:/user/lennart/1                     \_ /usr/libexec/evolution/2.32/evolution-alarm-notify
 1672 lennart  name=systemd:/user/lennart/1                     \_ /usr/bin/python /usr/share/system-config-printer/applet.py
 1674 lennart  name=systemd:/user/lennart/1                     \_ /usr/lib64/deja-dup/deja-dup-monitor
 1675 lennart  name=systemd:/user/lennart/1                     \_ abrt-applet
 1677 lennart  name=systemd:/user/lennart/1                     \_ bluetooth-applet
 1678 lennart  name=systemd:/user/lennart/1                     \_ gpk-update-icon
 1408 root     name=systemd:/systemd-1/console-kit-daemon.service /usr/sbin/console-kit-daemon --no-daemon
 1419 gdm      name=systemd:/systemd-1/prefdm.service /usr/bin/dbus-launch --exit-with-session
 1453 root     name=systemd:/systemd-1/dbus.service /usr/libexec/upowerd
 1473 rtkit    name=systemd:/systemd-1/rtkit-daemon.service /usr/libexec/rtkit-daemon
 1496 root     name=systemd:/systemd-1/accounts-daemon.service /usr/libexec/accounts-daemon
 1499 root     name=systemd:/systemd-1/systemd-logger.service /lib/systemd/systemd-logger
 1511 lennart  name=systemd:/systemd-1/prefdm.service /usr/bin/gnome-keyring-daemon --daemonize --login
 1534 lennart  name=systemd:/user/lennart/1        dbus-launch --sh-syntax --exit-with-session
 1535 lennart  name=systemd:/user/lennart/1        /bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
 1603 lennart  name=systemd:/user/lennart/1        /usr/libexec/gconfd-2
 1612 lennart  name=systemd:/user/lennart/1        /usr/libexec/gnome-settings-daemon
 1615 lennart  name=systemd:/user/lennart/1        /usr/libexec/gvfsd
 1626 lennart  name=systemd:/user/lennart/1        /usr/libexec//gvfs-fuse-daemon /home/lennart/.gvfs
 1634 lennart  name=systemd:/user/lennart/1        /usr/bin/pulseaudio --start --log-target=syslog
 1649 lennart  name=systemd:/user/lennart/1         \_ /usr/libexec/pulse/gconf-helper
 1645 lennart  name=systemd:/user/lennart/1        /usr/libexec/bonobo-activation-server --ac-activate --ior-output-fd=24
 1668 lennart  name=systemd:/user/lennart/1        /usr/libexec/im-settings-daemon
 1701 lennart  name=systemd:/user/lennart/1        /usr/libexec/gvfs-gdu-volume-monitor
 1707 lennart  name=systemd:/user/lennart/1        /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
 1725 lennart  name=systemd:/user/lennart/1        /usr/libexec/clock-applet
 1727 lennart  name=systemd:/user/lennart/1        /usr/libexec/wnck-applet
 1729 lennart  name=systemd:/user/lennart/1        /usr/libexec/notification-area-applet
 1733 root     name=systemd:/systemd-1/dbus.service /usr/libexec/udisks-daemon
 1747 root     name=systemd:/systemd-1/dbus.service  \_ udisks-daemon: polling /dev/sr0
 1759 lennart  name=systemd:/user/lennart/1        gnome-screensaver
 1780 lennart  name=systemd:/user/lennart/1        /usr/libexec/gvfsd-trash --spawner :1.9 /org/gtk/gvfs/exec_spaw/0
 1864 lennart  name=systemd:/user/lennart/1        /usr/libexec/gvfs-afc-volume-monitor
 1874 lennart  name=systemd:/user/lennart/1        /usr/libexec/gconf-im-settings-daemon
 1903 lennart  name=systemd:/user/lennart/1        /usr/libexec/gvfsd-burn --spawner :1.9 /org/gtk/gvfs/exec_spaw/1
 1909 lennart  name=systemd:/user/lennart/1        gnome-terminal
 1913 lennart  name=systemd:/user/lennart/1         \_ gnome-pty-helper
 1914 lennart  name=systemd:/user/lennart/1         \_ bash
29231 lennart  name=systemd:/user/lennart/1         |   \_ ssh tango
 2221 lennart  name=systemd:/user/lennart/1         \_ bash
 4193 lennart  name=systemd:/user/lennart/1         |   \_ ssh tango
 2461 lennart  name=systemd:/user/lennart/1         \_ bash
29219 lennart  name=systemd:/user/lennart/1         |   \_ emacs systemd-for-admins-1.txt
15113 lennart  name=systemd:/user/lennart/1         \_ bash
27251 lennart  name=systemd:/user/lennart/1             \_ empathy
29504 lennart  name=systemd:/user/lennart/1             \_ ps xawf -eo pid,user,cgroup,args
 1968 lennart  name=systemd:/user/lennart/1        ssh-agent
 1994 lennart  name=systemd:/user/lennart/1        gpg-agent --daemon --write-env-file
18679 lennart  name=systemd:/user/lennart/1        /bin/sh /usr/lib64/firefox-3.6/run-mozilla.sh /usr/lib64/firefox-3.6/firefox
18741 lennart  name=systemd:/user/lennart/1         \_ /usr/lib64/firefox-3.6/firefox
28900 lennart  name=systemd:/user/lennart/1             \_ /usr/lib64/nspluginwrapper/npviewer.bin --plugin /usr/lib64/mozilla/plugins/libflashplayer.so --connection /org/wrapper/NSPlugins/libflashplayer.so/18741-6
 4016 root     name=systemd:/systemd-1/sysinit.service /usr/sbin/bluetoothd --udev
 4094 smmsp    name=systemd:/systemd-1/sendmail.service sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
 4096 root     name=systemd:/systemd-1/sendmail.service sendmail: accepting connections
 4112 ntp      name=systemd:/systemd-1/ntpd.service /usr/sbin/ntpd -n -u ntp:ntp -g
27262 lennart  name=systemd:/user/lennart/1        /usr/libexec/mission-control-5
27265 lennart  name=systemd:/user/lennart/1        /usr/libexec/telepathy-haze
27268 lennart  name=systemd:/user/lennart/1        /usr/libexec/telepathy-logger
27270 lennart  name=systemd:/user/lennart/1        /usr/libexec/dconf-service
27280 lennart  name=systemd:/user/lennart/1        /usr/libexec/notification-daemon
27284 lennart  name=systemd:/user/lennart/1        /usr/libexec/telepathy-gabble
27285 lennart  name=systemd:/user/lennart/1        /usr/libexec/telepathy-salut
27297 lennart  name=systemd:/user/lennart/1        /usr/libexec/geoclue-yahoo

(Note that this output is shortened: I have removed most of the
kernel threads here, since they are not relevant in the context of
this blog story.)

In the third column you see the cgroup systemd assigned to each
process. You’ll find that the udev processes are in the
name=systemd:/systemd-1/sysinit.service cgroup, which is
where systemd places all processes started by the
sysinit.service service, which covers early boot.
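
Since the cgroup is just another column, the listing is easy to
summarize. The following is a sketch only: count_by_cgroup is a
hypothetical helper, and it assumes the four-column layout requested
above, with the cgroup in the third field and "-" for kernel threads.

```shell
# Count processes per cgroup from "pid,user,cgroup,args" rows on stdin.
# The header row and kernel threads (cgroup "-") are skipped.
count_by_cgroup() {
    awk 'NR > 1 && $3 != "-" { count[$3]++ }
         END { for (c in count) print count[c], c }' |
        sort -k1,1nr -k2,2
}
```

Feeding it the output of ps -eo pid,user,cgroup,args prints one line
per cgroup, busiest first, which makes services with many worker
processes stand out at a glance.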

My personal recommendation is to set the shell alias psc
to the ps command line shown above:

alias psc='ps xawf -eo pid,user,cgroup,args'

With this, service information for every process is just four keypresses
away!
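
Building on the same idea, a small filter can narrow the listing to a
single service. Again this is only a sketch under assumptions: the names
filter_unit and pss are invented here, and the match is on a whole path
component of the third column.

```shell
# Keep the header row plus every row whose cgroup path (third column)
# contains the given unit name as a complete path component.
filter_unit() {
    awk -v unit="$1" '
        NR == 1 { print; next }
        {
            n = split($3, parts, "/")
            for (i = 1; i <= n; i++)
                if (parts[i] == unit) { print; next }
        }'
}

# Convenience wrapper around the ps command line used for the alias.
pss() {
    ps xawf -eo pid,user,cgroup,args | filter_unit "$1"
}
```

For example, pss sshd.service would list just the processes that belong
to the SSH daemon. Unit names containing backslashes would need extra
care, since awk interprets escape sequences in -v assignments.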

A different way to present the same information is the
systemd-cgls tool we ship with systemd. It shows the cgroup
hierarchy in a pretty tree. Its output looks like this:

$ systemd-cgls
+    2 [kthreadd]
[...]
+ 4281 [flush-8:0]
+ user
| \ lennart
|   \ 1
|     +  1495 pam: gdm-password
|     +  1521 gnome-session
|     +  1534 dbus-launch --sh-syntax --exit-with-session
|     +  1535 /bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
|     +  1603 /usr/libexec/gconfd-2
|     +  1612 /usr/libexec/gnome-settings-daemon
|     +  1615 /usr/libexec/gvfsd
|     +  1621 metacity
|     +  1626 /usr/libexec//gvfs-fuse-daemon /home/lennart/.gvfs
|     +  1634 /usr/bin/pulseaudio --start --log-target=syslog
|     +  1635 gnome-panel
|     +  1638 nautilus
|     +  1640 /usr/libexec/polkit-gnome-authentication-agent-1
|     +  1641 /usr/bin/seapplet
|     +  1644 gnome-volume-control-applet
|     +  1645 /usr/libexec/bonobo-activation-server --ac-activate --ior-output-fd=24
|     +  1646 /usr/sbin/restorecond -u
|     +  1649 /usr/libexec/pulse/gconf-helper
|     +  1652 /usr/bin/devilspie
|     +  1662 nm-applet --sm-disable
|     +  1664 gnome-power-manager
|     +  1665 /usr/libexec/gdu-notification-daemon
|     +  1668 /usr/libexec/im-settings-daemon
|     +  1670 /usr/libexec/evolution/2.32/evolution-alarm-notify
|     +  1672 /usr/bin/python /usr/share/system-config-printer/applet.py
|     +  1674 /usr/lib64/deja-dup/deja-dup-monitor
|     +  1675 abrt-applet
|     +  1677 bluetooth-applet
|     +  1678 gpk-update-icon
|     +  1701 /usr/libexec/gvfs-gdu-volume-monitor
|     +  1707 /usr/bin/gnote --panel-applet --oaf-activate-iid=OAFIID:GnoteApplet_Factory --oaf-ior-fd=22
|     +  1725 /usr/libexec/clock-applet
|     +  1727 /usr/libexec/wnck-applet
|     +  1729 /usr/libexec/notification-area-applet
|     +  1759 gnome-screensaver
|     +  1780 /usr/libexec/gvfsd-trash --spawner :1.9 /org/gtk/gvfs/exec_spaw/0
|     +  1864 /usr/libexec/gvfs-afc-volume-monitor
|     +  1874 /usr/libexec/gconf-im-settings-daemon
|     +  1882 /usr/libexec/gvfs-gphoto2-volume-monitor
|     +  1903 /usr/libexec/gvfsd-burn --spawner :1.9 /org/gtk/gvfs/exec_spaw/1
|     +  1909 gnome-terminal
|     +  1913 gnome-pty-helper
|     +  1914 bash
|     +  1968 ssh-agent
|     +  1994 gpg-agent --daemon --write-env-file
|     +  2221 bash
|     +  2461 bash
|     +  4193 ssh tango
|     + 15113 bash
|     + 18679 /bin/sh /usr/lib64/firefox-3.6/run-mozilla.sh /usr/lib64/firefox-3.6/firefox
|     + 18741 /usr/lib64/firefox-3.6/firefox
|     + 27251 empathy
|     + 27262 /usr/libexec/mission-control-5
|     + 27265 /usr/libexec/telepathy-haze
|     + 27268 /usr/libexec/telepathy-logger
|     + 27270 /usr/libexec/dconf-service
|     + 27280 /usr/libexec/notification-daemon
|     + 27284 /usr/libexec/telepathy-gabble
|     + 27285 /usr/libexec/telepathy-salut
|     + 27297 /usr/libexec/geoclue-yahoo
|     + 28900 /usr/lib64/nspluginwrapper/npviewer.bin --plugin /usr/lib64/mozilla/plugins/libflashplayer.so --connection /org/wrapper/NSPlugins/libflashplayer.so/18741-6
|     + 29219 emacs systemd-for-admins-1.txt
|     + 29231 ssh tango
|     \ 29519 systemd-cgls
\ systemd-1
  + 1 /sbin/init
  + ntpd.service
  | \ 4112 /usr/sbin/ntpd -n -u ntp:ntp -g
  + systemd-logger.service
  | \ 1499 /lib/systemd/systemd-logger
  + accounts-daemon.service
  | \ 1496 /usr/libexec/accounts-daemon
  + rtkit-daemon.service
  | \ 1473 /usr/libexec/rtkit-daemon
  + console-kit-daemon.service
  | \ 1408 /usr/sbin/console-kit-daemon --no-daemon
  + prefdm.service
  | + 1376 /usr/sbin/gdm-binary -nodaemon
  | + 1391 /usr/libexec/gdm-simple-slave --display-id /org/gnome/DisplayManager/Display1 --force-active-vt
  | + 1394 /usr/bin/Xorg :0 -nr -verbose -auth /var/run/gdm/auth-for-gdm-f2KUOh/database -nolisten tcp vt1
  | + 1419 /usr/bin/dbus-launch --exit-with-session
  | \ 1511 /usr/bin/gnome-keyring-daemon --daemonize --login
  + getty@.service
  | + tty6
  | | \ 1346 /sbin/mingetty tty6
  | + tty4
  | | \ 1343 /sbin/mingetty tty4
  | + tty5
  | | \ 1342 /sbin/mingetty tty5
  | + tty3
  | | \ 1339 /sbin/mingetty tty3
  | \ tty2
  |   \ 1332 /sbin/mingetty tty2
  + abrtd.service
  | \ 1317 /usr/sbin/abrtd -d -s
  + crond.service
  | \ 1344 crond
  + sshd.service
  | \ 1362 /usr/sbin/sshd
  + sendmail.service
  | + 4094 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
  | \ 4096 sendmail: accepting connections
  + haldaemon.service
  | + 1249 hald
  | + 1250 hald-runner
  | + 1273 hald-addon-input: Listening on /dev/input/event3 /dev/input/event9 /dev/input/event1 /dev/input/event7 /dev/input/event2 /dev/input/event0 /dev/input/event8
  | + 1275 /usr/libexec/hald-addon-rfkill-killswitch
  | + 1284 /usr/libexec/hald-addon-leds
  | + 1285 /usr/libexec/hald-addon-generic-backlight
  | \ 1287 /usr/libexec/hald-addon-acpi
  + irqbalance.service
  | \ 1210 irqbalance
  + avahi-daemon.service
  | + 1175 avahi-daemon: running [epsilon.local]
  + NetworkManager.service
  | + 1171 /usr/sbin/NetworkManager --no-daemon
  | \ 4028 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-wlan0.pid -lf /var/lib/dhclient/dhclient-7d32a784-ede9-4cf6-9ee3-60edc0bce5ff-wlan0.lease -cf /var/run/nm-dhclient-wlan0.conf wlan0
  + rsyslog.service
  | \ 1193 /sbin/rsyslogd -c 4
  + mdmonitor.service
  | \ 1207 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid
  + cups.service
  | \ 1195 cupsd -C /etc/cups/cupsd.conf
  + auditd.service
  | + 1131 auditd
  | + 1133 /sbin/audispd
  | \ 1135 /usr/sbin/sedispatch
  + dbus.service
  | +  1096 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
  | +  1216 /usr/sbin/modem-manager
  | +  1219 /usr/libexec/polkit-1/polkitd
  | +  1242 /usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf -B -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid
  | +  1453 /usr/libexec/upowerd
  | +  1733 /usr/libexec/udisks-daemon
  | +  1747 udisks-daemon: polling /dev/sr0
  | \ 29509 /usr/libexec/packagekitd
  + dev-mqueue.mount
  + dev-hugepages.mount
  \ sysinit.service
    +   455 /sbin/udevd -d
    +  4016 /usr/sbin/bluetoothd --udev
    + 28188 /sbin/udevd -d
    \ 28191 /sbin/udevd -d

(This too is shortened, the same way)

As you can see, this command shows the processes by their cgroup
and hence service, as systemd labels the cgroups after the
services. For example, you can easily see that the auditing service
auditd.service spawns three individual processes,
auditd, audispd and sedispatch.
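
The grouping that systemd-cgls performs can be approximated by hand from the output of the psc alias shown earlier. The following Python sketch is only an illustration: the function name group_by_cgroup is made up here, the sample lines (including the root user column for auditd) are invented to mirror the tree above, and it assumes the four columns pid, user, cgroup and args in exactly that order.

```python
from collections import defaultdict


def group_by_cgroup(ps_output):
    """Group 'pid user cgroup args' lines by their cgroup column."""
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        # Split into at most four fields so that args keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, _user, cgroup, args = fields
        groups[cgroup].append((int(pid), args.strip()))
    return dict(groups)


# Hypothetical sample in the format printed by
# ps xawf -eo pid,user,cgroup,args
SAMPLE = """\
  PID USER     CGROUP                              COMMAND
 1131 root     name=systemd:/systemd-1/auditd.service auditd
 1133 root     name=systemd:/systemd-1/auditd.service /sbin/audispd
27270 lennart  name=systemd:/user/lennart/1        /usr/libexec/dconf-service
"""

if __name__ == "__main__":
    for cgroup, procs in group_by_cgroup(SAMPLE).items():
        print(cgroup)
        for pid, args in procs:
            print("   ", pid, args)
```

Because the cgroup name ends in the service name, the dictionary keys double as service labels, which is the property the paragraph above relies on.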

If you look closely you will notice that a number of processes have
been assigned to the cgroup /user/lennart/1. For now, let’s
simply note that systemd maintains not only services in cgroups,
but user session processes as well. In a later installment we’ll discuss in
more detail what this is about.

So much for now, come back soon for the next installment!

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have no other parent. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
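
To make the two steps concrete, here is a minimal Python sketch of the idea, not of how any real init system implements it. A parent process plays the init system: it creates an AF_UNIX listening socket, lets a client connect and send a request while no daemon exists yet, and only then starts a toy daemon that inherits the socket across exec(). The socket path, the message contents and the convention of passing the file descriptor number as a command line argument are all assumptions made up for this illustration.

```python
import os
import shutil
import socket
import subprocess
import sys
import tempfile

# Code of the toy "daemon". It never creates a socket itself; it only
# adopts the listening socket whose fd number it gets on the command line.
DAEMON_CODE = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
request = conn.recv(1024)
conn.sendall(b"reply to " + request)
conn.close()
"""


def demo():
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")  # hypothetical socket path
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # Step 1: the "init system" creates the listening socket.
        listener.bind(path)
        listener.listen(16)

        # A client connects and sends a request although no daemon runs
        # yet; the kernel queues both the connection and the data.
        client.connect(path)
        client.sendall(b"hello")

        # Step 2: only now start the daemon, handing the socket over
        # across fork()/exec() via pass_fds.
        daemon = subprocess.Popen(
            [sys.executable, "-c", DAEMON_CODE, str(listener.fileno())],
            pass_fds=[listener.fileno()],
        )
        reply = client.recv(1024)  # blocks only until the daemon answers
        daemon.wait()
        return reply
    finally:
        client.close()
        listener.close()
        shutil.rmtree(tmpdir)


if __name__ == "__main__":
    print(demo())
```

The client never needs to know whether the daemon was already running when it connected; it merely blocks on the one reply it is waiting for.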

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself has finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.
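
The syslog case can be tried out in a few lines. The following sketch assumes a datagram AF_UNIX socket in a temporary directory as a stand-in for /dev/log; the path and the message texts are invented for the example. The clients write before anybody reads, and the late reader still receives everything in order.

```python
import os
import shutil
import socket
import tempfile


def demo():
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "log")  # hypothetical stand-in for /dev/log
    log_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The socket exists, but nobody is reading from it yet.
        log_socket.bind(path)

        # Clients log right away; each sendto() returns immediately as
        # long as the socket buffer is not full.
        for message in (b"service A starting",
                        b"service B starting",
                        b"service A ready"):
            client.sendto(message, path)

        # "syslog" finishes start-up later and dequeues everything.
        log_socket.setblocking(False)
        received = []
        while True:
            try:
                received.append(log_socket.recv(4096))
            except BlockingIOError:
                break  # queue is empty
        return received
    finally:
        client.close()
        log_socket.close()
        shutil.rmtree(tmpdir)


if __name__ == "__main__":
    for line in demo():
        print(line.decode())
```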

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS, listening on the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this video
where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi, then via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system has finished fsck and quota
as part of normal boot-up we replace it with the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.
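
Counts like the ones above are easy to reproduce roughly. The sketch below is a crude approximation, not the method used for the numbers in this post: it counts word-level occurrences of command names in script text and ignores everything after a '#', so it will miscount when a name appears as an argument or when '#' is used outside a comment. The function name and the sample script are made up for the illustration.

```python
import re


def count_command_calls(script_text, commands):
    """Count word-level occurrences of each command name outside comments."""
    counts = {name: 0 for name in commands}
    for line in script_text.splitlines():
        code = line.split("#", 1)[0]  # crude: drops everything after '#'
        for name in commands:
            # Match the name only as a whole word, so "egrep" or
            # "grep.conf" do not count as "grep".
            pattern = r"(?<![\w.-])%s(?![\w.-])" % re.escape(name)
            counts[name] += len(re.findall(pattern, code))
    return counts


# Hypothetical init script fragment
SAMPLE_SCRIPT = """\
#!/bin/sh
# uses grep to find the pid
pid=$(ps ax | grep mydaemon | awk '{print $1}')
name=$(echo "$1" | cut -d. -f1 | sed 's/^S[0-9]*//')
grep -q foo /etc/conf && echo ok
"""

if __name__ == "__main__":
    print(count_command_calls(SAMPLE_SCRIPT, ["grep", "awk", "cut", "sed"]))
```

Every one of those occurrences stands for at least one fork() and exec() at boot time, which is where the cost comes from.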

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.
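
The escape by double-forking is easy to observe. In this Unix-only Python sketch (the function name and the pipe-based hand-shake are assumptions of the example), the original process forks an intermediate child, which forks a grandchild and exits at once. After the intermediate child has been reaped, the grandchild reports its parent PID: it is no longer the intermediate child, and the original process was never told about the grandchild at all.

```python
import os


def double_fork_demo():
    """Return (pid of the intermediate child, parent pid seen by the grandchild)."""
    report_r, report_w = os.pipe()  # grandchild -> original parent
    go_r, go_w = os.pipe()          # original parent -> grandchild

    intermediate = os.fork()
    if intermediate == 0:
        # Intermediate child: fork once more, then exit right away.
        if os.fork() == 0:
            try:
                # Grandchild: wait until the intermediate child is gone,
                # then report who our parent is now.
                os.read(go_r, 1)
                os.write(report_w, str(os.getppid()).encode())
            finally:
                os._exit(0)
        os._exit(0)

    # Original parent: it only ever learns about the intermediate child.
    os.waitpid(intermediate, 0)
    os.write(go_w, b"x")
    new_parent = int(os.read(report_r, 32))
    for fd in (report_r, report_w, go_r, go_w):
        os.close(fd)
    return intermediate, new_parent


if __name__ == "__main__":
    intermediate, new_parent = double_fork_demo()
    print("intermediate child:", intermediate)
    print("grandchild's parent is now:", new_parent)
```

On a typical system the reported parent is PID 1 (or another process that has taken over orphans), which is exactly why the original parent cannot supervise the grandchild with wait() alone.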

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known Control Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups were introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
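
Reading that file takes only a few lines. Each line of /proc/$PID/cgroup has the form hierarchy-ID:controller-list:cgroup-path; the sketch below parses that format. The function names are invented for this example, the sample line mirrors the cgroup names used elsewhere in this post, and cgroups_of() assumes a Linux system with /proc mounted.

```python
def parse_proc_cgroup(text):
    """Parse /proc/$PID/cgroup contents into (hierarchy, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # The path itself may contain ':', so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append((int(hierarchy_id), controllers, path))
    return entries


def cgroups_of(pid="self"):
    """Read and parse the cgroup membership of a process (Linux only)."""
    with open("/proc/%s/cgroup" % pid) as f:
        return parse_proc_cgroup(f.read())


if __name__ == "__main__":
    # Hypothetical sample line in the style of the examples above
    print(parse_proc_cgroup("1:name=systemd:/systemd-1/sshd.service\n"))
```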

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.
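
As a small illustration of setting such parameters between fork() and exec(), here is a Python sketch that starts a child with core dumps disabled via setrlimit() and with one extra environment variable. It is only an assumption-laden toy: the function name spawn_service and the variable DEMO_SERVICE are invented, and it extends the inherited environment instead of building a truly minimal one so that the interpreter stays runnable everywhere.

```python
import os
import resource
import subprocess
import sys

# The toy "service" just prints what it sees.
CHILD_CODE = (
    "import os, resource; "
    "print(resource.getrlimit(resource.RLIMIT_CORE)[0], "
    "os.environ['DEMO_SERVICE'])"
)


def spawn_service(argv, service_name):
    def setup_child():
        # Runs in the child after fork() and before exec():
        # lower the soft core dump limit to 0, keep the hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

    # Hypothetical variable; a real babysitter would construct a
    # minimal environment block rather than extend the inherited one.
    env = dict(os.environ, DEMO_SERVICE=service_name)
    return subprocess.run(argv, env=env, preexec_fn=setup_child,
                          capture_output=True, text=True, check=True)


if __name__ == "__main__":
    result = spawn_service([sys.executable, "-c", CHILD_CODE], "avahi")
    print(result.stdout.strip())
```

The limit applies to the child only; the process doing the spawning keeps its own settings.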

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it more simply: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.
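
The goal-oriented approach argued for here can be sketched as a walk over "a needs b" relations, starting from what the user asked for. This is a toy model and not the algorithm of any particular init system; the function name and the dependency table are assumptions based on the examples in this post.

```python
def start_order(goal, requires):
    """Return the units to start for goal, dependencies first."""
    order = []
    visiting = set()

    def visit(unit):
        if unit in order:
            return  # already scheduled
        if unit in visiting:
            raise ValueError("dependency cycle at %s" % unit)
        visiting.add(unit)
        for dependency in requires.get(unit, ()):
            visit(dependency)
        visiting.discard(unit)
        order.append(unit)

    visit(goal)
    return order


# Hypothetical "a needs b" table
REQUIRES = {
    "NetworkManager": ["dbus"],
    "avahi": ["dbus"],
    "dbus": ["syslog"],
    "syslog": [],
}

if __name__ == "__main__":
    print(start_order("NetworkManager", REQUIRES))
    print(start_order("dbus", REQUIRES))
```

Asking for NetworkManager pulls in D-Bus and syslog, while asking for D-Bus starts neither NetworkManager nor Avahi. Because the table stays around, it can also answer the question of why something was started.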

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is unplugged again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that, Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little more of substance than Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the code. And
here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.
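
Since a unit name is just a file name whose suffix gives the type, taking one apart is simple. Here is a minimal Python sketch, assuming only the seven unit kinds listed above. The helper `split_unit_name` is hypothetical and not part of systemd:

```python
# The unit kinds listed in this post. The helper below is a hypothetical
# illustration, not something systemd itself ships.
UNIT_TYPES = {"service", "socket", "device", "mount",
              "automount", "target", "snapshot"}


def split_unit_name(filename):
    """Split a unit file name such as 'avahi.service' into (name, type)."""
    name, dot, unit_type = filename.rpartition(".")
    # Reject names without a suffix, with an empty name part,
    # or with a suffix that is not a known unit kind.
    if not dot or not name or unit_type not in UNIT_TYPES:
        raise ValueError(f"not a valid unit name: {filename!r}")
    return name, unit_type
```

Splitting on the last dot keeps names such as multi-user.target or avahi.service intact while still isolating the type.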

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
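
The prefix rule for mounts can be sketched in a few lines of Python. This is only an illustration of the idea, not systemd's code. The function name `implicit_mount_deps` and the set of known mount points are assumptions made for the example:

```python
import posixpath


def implicit_mount_deps(mount_point, known_mounts):
    """Return the known mount points that are proper prefixes of mount_point.

    The result is ordered from the closest parent up to the root.
    """
    if not mount_point.startswith("/"):
        raise ValueError("mount point must be an absolute path")
    # Normalise the path and collapse a leading '//' into a single '/'.
    path = "/" + posixpath.normpath(mount_point).lstrip("/")
    deps = []
    while path != "/":
        path = posixpath.dirname(path)  # step up one directory level
        if path in known_mounts:
            deps.append(path)
    return deps
```

For a mount /home/lennart and known mounts / and /home, this yields /home first and then /, which matches the order in which the dependencies would have to be satisfied in reverse.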

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset on
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does not do much more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroup utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty1.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service dhcpcd@eth0.service pulls in
    avahi-autoipd@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the order in which units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    remove non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev.) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways how such a service can
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness to init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.
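
The consistency check from the transaction item above (is the After/Before ordering cycle-free?) is essentially a topological sort. The following Python sketch shows only the detection step, using the standard graphlib module (Python 3.9 or newer). It is not how systemd implements it, it does not attempt the fix-up of dropping non-essential jobs, and the function name `activation_order` is made up for this example:

```python
from graphlib import CycleError, TopologicalSorter


def activation_order(after):
    """Return a start order that honours all 'After' edges, or None on a cycle.

    `after` maps each unit name to the set of units it must be started after.
    """
    try:
        # static_order() yields predecessors before the units that need them
        # and raises CycleError if the ordering graph contains a loop.
        return list(TopologicalSorter(after).static_order())
    except CycleError:
        return None
```

A caller would treat a `None` result as the point where non-essential jobs have to be removed and the check repeated.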

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services yet. Thankfully most distributions don’t
carry too many native Upstart services yet.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events (i.e. “start this 5h after it last ran” as well as “start
this every Monday at 5 am”).
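
To make the difference between the two kinds of time events concrete, here is a small Python sketch. It is an illustration only: the function names are invented, the monotonic case is modelled as plain seconds on a clock that never jumps, and the calendar case ignores time zones and daylight saving changes:

```python
from datetime import datetime, timedelta


def next_monotonic(last_run, delay=5 * 3600.0):
    """'Start this 5h after it last ran': seconds on a monotonic clock."""
    return last_run + delay


def next_weekly(now, weekday=0, hour=5):
    """'Start this every Monday at 5 am': next matching wall-clock time.

    weekday follows datetime.weekday(), i.e. 0 is Monday. The result is
    always strictly later than `now`.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # today's slot has already passed
    return candidate
```

The first kind is unaffected by somebody resetting the system clock, while the second kind has to be recomputed when the wall clock jumps.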

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, you can already run systemd just fine
as a normal user, and it will detect that it is run that way.
Support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined place where we
can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our repository. In
addition, to have something to start with, here’s a tarball with
unit configuration files that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.
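
For readers who have never looked at them: the LSB headers mentioned above are a comment block at the top of an init script. The following Python sketch shows roughly how such a block can be read. It is not systemd's parser, the sample script and the name `parse_lsb_header` are made up for illustration, and multi-line fields such as continued descriptions are not handled:

```python
# Illustrative script only; the daemon name and values are made up.
SAMPLE_SCRIPT = """\
#!/bin/sh
### BEGIN INIT INFO
# Provides:          exampled
# Required-Start:    $remote_fs dbus
# Required-Stop:     $remote_fs dbus
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
### END INIT INFO
echo "starting exampled"
"""


def parse_lsb_header(script_text):
    """Extract 'Key: values' pairs from the LSB init-info comment block."""
    fields = {}
    inside = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            inside = True
        elif line.startswith("### END INIT INFO"):
            break
        elif inside and line.startswith("#") and ":" in line:
            key, _, value = line[1:].partition(":")
            fields[key.strip()] = value.split()
    return fields
```

The Required-Start entries are the interesting part for parallelization, since they say which facilities have to be up before the script may run.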

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files
  • Grab a name on the bus
  • You may rely on systemd for logging, you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above.
  • Use SIGTERM for requesting shut downs from your daemon.
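
Interpreting $LISTEN_FDS and $LISTEN_PID really is only a few lines. The following Python sketch follows the description given earlier in this post (descriptors start at 3, and $LISTEN_PID must match our own PID). It is not the reference implementation mentioned above, and the names `listen_fds` and `LISTEN_FDS_START` are chosen for this example:

```python
import os

LISTEN_FDS_START = 3  # first passed descriptor, as described in this post


def listen_fds(environ=None, pid=None):
    """Return the passed socket descriptors, or [] if none are meant for us."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []  # variables were inherited, but address another process
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or not numeric
    if count < 0:
        return []
    return list(range(LISTEN_FDS_START, LISTEN_FDS_START + count))
```

A daemon would call `listen_fds()` once at start-up and, if the list is empty, fall back to creating its sockets itself. That way the same binary keeps working on systems without systemd.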

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd supports daemons not written in this style
perfectly as well, already for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have written proof-of-concept patches,
and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so something similar to what I said for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes a library called
NIH (which is kind of a reimplementation of glib), not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (Very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
in writing documentation or contributing a logo.

Community

At this time we only have a source code repository and an IRC
channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Having fun with bzr

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/bizarre-fun.html

So I wanted to hack proper channel mapping query support into libsndfile, something I have had
on my TODO list for years. The first step was to find the source code
repository for it. That was easy. Alas the VCS used is bzr. There are some
very vocal folks on the Internet who claim that the bzr user interface is
stupendously easy to use in contrast to git which apparently is the very
definition of complexity. And if it is stated on the Internet it must be true.
I think I mastered git quite well, so yeah, checking out the sources with bzr
can’t be that difficult for my limited brain capacity.

So let’s do what Erik suggests for checking out the sources:

$ bzr get http://www.mega-nerd.com/Bzr/libsndfile-pub/

Calling this I get a nice percentage counter that starts at 0% and ends at, … uh, 0%. That gives me a real feeling of progress. It takes a while, and then I get an error:

bzr: ERROR: Not a branch: “http://www.mega-nerd.com/Bzr/libsndfile-pub/”.

Now that’s a useful error message. They even include an all-caps word! I guess that error message is right — it’s not a branch, it is a repository. Or is it not?

So what do we do about this? Maybe get is not actually the right verb. Let’s try to play around a bit. Let’s use the verb I use to get sources with in git:

$ bzr clone http://www.mega-nerd.com/Bzr/libsndfile-pub/

Hmm, this results in exactly the same 0% to 0% progress counter, and the same useless error message.

Now I remember that bzr is actually more inspired by Subversion’s UI than by git’s, so let’s try it the SVN way.

$ bzr checkout http://www.mega-nerd.com/Bzr/libsndfile-pub/

Hmm, and of course, I get exactly the same results again. A counter that counts from 0% to 0% and the same useless error message.

Ok, maybe that error is bzr’s standard reply? Let’s check this out:

$ bzr waldo http://www.mega-nerd.com/Bzr/libsndfile-pub/
bzr: ERROR: unknown command “waldo”

Apparently not. bzr actually knows more than one error message.

Ok, I admit doing this by trial-and-error is a rather lame approach. RTFM! So let’s try this.

$ man bzr-get
No manual entry for bzr-get

Ouch. No man page? How awesome. Ah, wait, maybe they have only a single unreadable mega man page for everything. Let’s try this:

$ man bzr

Wow, this actually worked. Seems to list all commands. Now let’s look for the help on bzr get:

/bzr get
Pattern not found (press RETURN)

Hmm, no documentation for their most important command? That’s weird! Ok, let’s try it again with our git vocabulary:

/bzr clone
Pattern not found (press RETURN)

Ok, this not funny anymore. Apparently the verbs are listed in alphabetical order.
So let’s browse to the letter g as in get. However it doesn’t
exist. There’s bzr export, and then the next entry is bzr
help (Oh, irony!) — but no get in-between.

Ok, enough of this shit. Maybe the message wants to tell us that the repo
actually doesn’t exist (even though it confusingly calls it a “branch”). Let’s
go back to the original page at Erik’s site and read things again. Aha, the
“main archive archive can be found at (yes, the directory looks empty, but
it isn’t): http://www.mega-nerd.com/Bzr/libsndfile-pub/“.
Hmm, indeed — that URL looks very empty when it is accessed. How weird though
that in bzr a repo is an empty directory!

And at this point I gave up and downloaded the tarball to make my patches
against. I have still not managed to check out the sources from the repo.
Somehow I get the feeling the actual repo really isn’t available anymore under that address.

So why am I blogging about this? Not so much to start another flamefest, to
nourish the fanboys, nor because it is so much fun to bash other people’s work or
simply to piss people off. It’s more for two reasons:

Firstly, simply to make
the point that folks can claim a thousand times that git’s UI sucks and bzr’s
UI is awesome. It’s simply not true. From what I experienced it is not the
tiniest bit better. The error messages useless, the documentation incomplete,
the interfaces surprising and exactly as redundant as git’s. The only
effective difference I noticed is that it takes a bit longer to show those
error messages with bzr — the Python tax. To summarize this more positively: git excels as much as bzr does. Both’ documentation, their error messages and their user interface are the best in their class. And they have all the best chances for future improvement.

And the second reason of course is that I’d still like to know what the correct way to get the sources is. But for that I should probably ask Erik himself.

Having fun with bzr

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/bizarre-fun.html

So I wanted to hack proper channel mapping query support into libsndfile, something I have had
on my TODO list for years. The first step was to find the source code
repository for it. That was easy. Alas, the VCS used is bzr. There are some
very vocal folks on the Internet who claim that the bzr user interface is
stupendously easy to use, in contrast to git, which apparently is the very
definition of complexity. And if it is stated on the Internet it must be true.
I think I mastered git quite well, so yeah, checking out the sources with bzr
can’t be that difficult for my limited brain capacity.

So let’s do what Erik suggests for checking out the sources:

$ bzr get http://www.mega-nerd.com/Bzr/libsndfile-pub/

Calling this I get a nice percentage counter that starts at 0% and ends at, … uh, 0%. That gives me a real feeling of progress. It takes a while, and then I get an error:

bzr: ERROR: Not a branch: "http://www.mega-nerd.com/Bzr/libsndfile-pub/".

Now that’s a useful error message. They even include an all-caps word! I guess that error message is right — it’s not a branch, it is a repository. Or is it not?

So what do we do about this? Maybe get is not actually the right verb. Let’s try to play around a bit. Let’s use the verb I use to get sources with in git:

$ bzr clone http://www.mega-nerd.com/Bzr/libsndfile-pub/

Hmm, this results in exactly the same 0% to 0% progress counter, and the same useless error message.

Now I remember that bzr is actually more inspired by Subversion’s UI than by git’s, so let’s try it the SVN way.

$ bzr checkout http://www.mega-nerd.com/Bzr/libsndfile-pub/

Hmm, and of course, I get exactly the same results again. A counter that counts from 0% to 0% and the same useless error message.

Ok, maybe that error is bzr’s standard reply? Let’s check this out:

$ bzr waldo http://www.mega-nerd.com/Bzr/libsndfile-pub/
bzr: ERROR: unknown command "waldo"

Apparently not. bzr actually knows more than one error message.

Ok, I admit doing this by trial-and-error is a rather lame approach. RTFM! So let’s try this.

$ man bzr-get
No manual entry for bzr-get

Ouch. No man page? How awesome. Ah, wait, maybe they have only a single unreadable mega man page for everything. Let’s try this:

$ man bzr

Wow, this actually worked. Seems to list all commands. Now let’s look for the help on bzr get:

/bzr get
Pattern not found  (press RETURN)

Hmm, no documentation for their most important command? That’s weird! Ok, let’s try it again with our git vocabulary:

/bzr clone
Pattern not found  (press RETURN)

Ok, this is not funny anymore. Apparently the verbs are listed in alphabetical order.
So let’s browse to the letter g as in get. However it doesn’t
exist. There’s bzr export, and then the next entry is bzr help
(Oh, irony!), but no get in between.

Ok, enough of this shit. Maybe the message wants to tell us that the repo
actually doesn’t exist (even though it confusingly calls it a “branch”). Let’s
go back to the original page at Erik’s site and read things again. Aha, the
“main archive archive can be found at (yes, the directory looks empty, but
it isn’t): http://www.mega-nerd.com/Bzr/libsndfile-pub/”.

Hmm, indeed — that URL looks very empty when it is accessed. How weird though
that in bzr a repo is an empty directory!

And at this point I gave up and downloaded the tarball to make my patches
against. I have still not managed to check out the sources from the repo.
Somehow I get the feeling the actual repo really isn’t available anymore under that address.

So why am I blogging about this? Not so much to start another flamefest, to
nourish the fanboys, nor because it is so much fun to bash other people’s work or
simply to piss people off. It’s more for two reasons:

Firstly, simply to make
the point that folks can claim a thousand times that git’s UI sucks and bzr’s
UI is awesome. It’s simply not true. From what I experienced it is not the
tiniest bit better. The error messages are useless, the documentation is incomplete,
and the interfaces are surprising and exactly as redundant as git’s. The only
effective difference I noticed is that it takes a bit longer to show those
error messages with bzr: the Python tax. To summarize this more positively: git excels as much as bzr does. Their documentation, their error messages and their user interfaces are the best in their class. And they both have the best chances for future improvement.

And the second reason of course is that I’d still like to know what the correct way to get the sources is. But for that I should probably ask Erik himself.

Automatic Backtrace Generation

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/automatic-backtrace.html

Ubuntu has Apport. Fedora has nothing. That sucks big time.

Here’s the result of a few minutes of hacking up something similar to Apport based on the awesome (and much underused) Frysk debugging tool kit. It doesn’t post any backtraces on any Internet servers and has no fancy UI — but it automatically dumps a stacktrace of every crashing process on the system to syslog and stores all kinds of data in /tmp/core.*/ for later inspection.

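#!/bin/bash
# Core dump handler, invoked by the kernel via core_pattern.
# Arguments: $1=pid $2=timestamp $3=uid $4=gid $5=signal $6=hostname
set -e
export PATH=/sbin:/bin:/usr/sbin:/usr/bin
DIR="/tmp/core.$1.$2"
umask 077
mkdir "$DIR"
# The kernel pipes the core image to our stdin
cat > "$DIR/core"
# From here on, send all output of this script to a log file
exec &> "$DIR/dump.log"
set +e
echo "$1" > "$DIR/pid"
echo "$2" > "$DIR/timestamp"
echo "$3" > "$DIR/uid"
echo "$4" > "$DIR/gid"
echo "$5" > "$DIR/signal"
echo "$6" > "$DIR/hostname"
set -x
# Extract auxiliary vector, executable path and memory maps with Frysk
fauxv "$DIR/core" > "$DIR/auxv"
fexe "$DIR/core" > "$DIR/exe"
fmaps "$DIR/core" > "$DIR/maps"
# Map the files on fdebuginfo lines containing "---" to their RPM packages
PKGS=$(/usr/bin/fdebuginfo "$DIR/core" | grep -e '---' | cut -d ' ' -f 1 | sort -u | grep '^/' | xargs rpm -qf | sort -u)
# $PKGS is left unquoted on purpose so that each package becomes one argument
[ "x$PKGS" != x ] && debuginfo-install -y $PKGS
fstack -rich "$DIR/core" > "$DIR/fstack"
set +x
# Send a summary to syslog
(
	echo "Application $(cat "$DIR/exe") (pid=$1,uid=$3,gid=$4) crashed with signal $5."
	echo "Stack trace follows:"
	cat "$DIR/fstack"
	echo "Auxiliary vector:"
	cat "$DIR/auxv"
	echo "Maps:"
	cat "$DIR/maps"
	echo "For details check $DIR"
) | logger -p local6.info -t "frysk-core-dump-$1"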

Copy that into a file $SOMEWHERE/frysk-core-dump. Then do a chmod +x $SOMEWHERE/frysk-core-dump and a chown root:root $SOMEWHERE/frysk-core-dump. Now, tell the kernel that core dumps should be handed to this script:

# echo "|$SOMEWHERE/frysk-core-dump %p %t %u %g %s %h" > /proc/sys/kernel/core_pattern
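Before writing to /proc/sys/kernel/core_pattern it can help to sanity-check the line first. What follows is only a minimal sketch and not part of the script above: the function name make_core_pattern and the path /usr/local/bin/frysk-core-dump are made up for illustration. It relies on the rule from core(5) that a pipe handler must be given as an absolute path directly after the |, and it assumes a maximum pattern length of 127 characters, which is what the kernel's sysctl documentation states as far as I know.

```shell
#!/bin/bash
# Sketch: build a core_pattern line for a pipe handler and sanity-check it.
# make_core_pattern and the example path below are illustrative names only.
make_core_pattern() {
	local handler="$1"
	# The kernel requires an absolute path right after the "|"
	case "$handler" in
		/*) ;;
		*) echo "handler must be an absolute path: $handler" >&2; return 1 ;;
	esac
	# Same specifiers as above: pid, time, uid, gid, signal, hostname
	local pattern="|$handler %p %t %u %g %s %h"
	# Assumed length limit, see the note above
	if [ "${#pattern}" -gt 127 ]; then
		echo "pattern too long (${#pattern} characters)" >&2
		return 1
	fi
	printf '%s\n' "$pattern"
}

# Print the line; as root you would redirect it to /proc/sys/kernel/core_pattern
make_core_pattern /usr/local/bin/frysk-core-dump
```

Keeping the check separate from the write means you can try it as an ordinary user and only become root for the final redirect.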

Finally, increase RLIMIT_CORE to actually enable core dumps. ulimit -c unlimited
is a good idea. This will enable them only for your shell and
everything it spawns. In /etc/security/limits.conf you can enable
them for all users. I haven’t found out yet how to enable them globally
in Fedora though, i.e. for every single process that is started after boot including system daemons.
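As a rough sketch of the two options just mentioned: the limits.conf line in the comment uses the usual pam_limits syntax (domain, type, item, value), and core_dumps_enabled is a hypothetical helper name that merely reports whether the soft limit of the current shell allows core dumps at all.

```shell
#!/bin/bash
# /etc/security/limits.conf entry to allow core dumps for all users
# (applies to sessions that go through pam_limits, not to system daemons):
#
#   *    soft    core    unlimited

# Hypothetical helper: succeeds if the soft RLIMIT_CORE of this shell is non-zero.
core_dumps_enabled() {
	[ "$(ulimit -S -c)" != "0" ]
}

# Raise the soft limit for this shell and everything it spawns.
# This can fail if the hard limit is lower, hence the fallback message.
ulimit -S -c unlimited 2>/dev/null || echo "could not raise core limit" >&2

if core_dumps_enabled; then
	echo "core dumps enabled (limit: $(ulimit -S -c))"
else
	echo "core dumps disabled"
fi
```

Checking the limit from inside the shell that will start the test program is the point here, since the limit is inherited per process and not a system-wide switch.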

You can test this by running sleep 4711 and then dumping core with C-\. The stacktrace should appear right away in /var/log/messages.
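The same test can be scripted. This is a small sketch with one assumption worth spelling out: in a non-interactive shell without job control, background jobs are started with SIGQUIT ignored, so the script sends SIGSEGV instead of emulating C-\. The function name crash_test is made up for this example.

```shell
#!/bin/bash
# Sketch: start a throwaway process, crash it, and report how it died.
crash_test() {
	# Try to allow core dumps for this shell; ignore failure
	ulimit -S -c unlimited 2>/dev/null
	sleep 4711 &
	local pid=$!
	# The default action for SIGSEGV is to terminate with a core dump
	kill -SEGV "$pid"
	wait "$pid"
	local status=$?
	# The shell reports death by signal N as exit status 128+N
	echo "pid $pid exited with status $status (signal $((status - 128)))"
	return 0
}

crash_test
```

If the handler is installed in core_pattern, the stack trace for that sleep process should then show up in syslog just as with the manual test.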

This script will automatically try to install the debugging symbols for the crashing application via yum. In some cases it hence might take a while until the backtrace appears in syslog.

Don’t forget to install Frysk before trying this script!

You can’t believe how useful this script is. Something crashed and the backtrace is already waiting for you! It’s a bugfixer’s wet dream.

I am a bit surprised though that no one else came up with this before me. Or maybe I am just too dumb to use Google properly?

PulseAudio FUD

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/jeffrey-stedfast.html

Jeffrey Stedfast

Jeffrey Stedfast seems to have made it his new hobby to bash PulseAudio.
In a series of very negative blog postings he flamed my software and hence me
in best NotZed-like fashion. Particularly interesting in this case is the
fact that he apologized to me privately on IRC for this behaviour shortly
after his first posting when he was criticized on #gnome-hackers,
only to continue flaming and bashing in more blog posts shortly after. Flaming
is very much part of the Free Software community I guess. A lot of people do
it from time to time (including me). But maybe there are better places for
this than Planet Gnome. And maybe doing it for days is not particularly nice.
And maybe flaming sucks in the first place anyway.

Regardless of what I think about Jeffrey and his behaviour on Planet Gnome,
let’s have a look at his trophies, the five “bugs” he posted:

  1. Not directly related to PulseAudio itself. Also, finding errors in code that is related to esd is not exactly the most difficult thing in the world.
  2. The same theme.
  3. Fixed 3 months ago. It is certainly not my fault that this isn’t available in Jeffrey’s distro.
  4. A real, valid bug report. Fixed in git a while back, but not available in any released version. May only be triggered under heavy load or with a bad high-latency scheduler.
  5. A valid bug, but not really in PulseAudio. Mostly caused because the ALSA API and PA API don’t really match 100%.

OK, Jeffrey found a real bug, but I wouldn’t say this is really enough to make all the fuss about. Or is it?

Why PulseAudio?

Jeffrey wrote something about a ‘solution looking for a problem’ when
speaking of PulseAudio. While that was certainly not a nice thing to say, it
does tell me one thing: I apparently didn’t manage to communicate well
enough why I am doing PulseAudio in the first place. So, why am I doing it then?

  • There’s so much more a good audio system needs to provide than just the
    most basic mixing functionality. Per-application volumes, moving streams
    between devices during playback, positional event sounds (i.e. click on the
    left side of the screen, have the sound event come out through the left
    speakers), secure session-switching support, monitoring of sound playback
    levels, rescuing playback streams to other audio devices on hot unplug,
    automatic hotplug configuration, automatic up/downmixing stereo/surround,
    high-quality resampling, network transparency, sound effects, simultaneous
    output to multiple sound devices are all features PA provides right now, and
    which you don’t get without it. It also provides the infrastructure for
    upcoming features like volume-follows-focus, automatic attenuation of music on
    signal on VoIP stream, UPnP media renderer support, Apple RAOP support,
    mixing/volume adjustments with dynamic range compression, adaptive volume of
    event sounds based on the volume of music streams, jack sensing, switching
    between stereo/surround/spdif during runtime, …
  • And even for the most basic mixing functionality plain ALSA/dmix is not
    really everlasting happiness. Due to the way it works all clients are forced
    to use the same buffering metrics all the time, that means all clients are
    limited in their wakeup/latency settings. You will burn more CPU than
    necessary this way, keep the risk of drop-outs unnecessarily high and still
    not be able to make clients with low-latency requirements happy. ‘Glitch-Free’ PulseAudio
    fixes all this. Quite frankly I believe that ‘glitch-free’
    PulseAudio is the single most important killer feature that should be enough
    to convince everyone why PulseAudio is the right thing to do. Maybe people
    actually don’t know that they want this. But they absolutely do, especially
    the embedded people: if used properly it is a must for power-saving during
    audio playback. It’s a pity that you cannot directly see from the user
    interface how awesome this feature is.[1]
  • PulseAudio provides compatibility with a lot of sound systems/APIs that bare ALSA
    or bare OSS don’t provide.
  • And last but not least, I love breaking Jeffrey’s audio. It’s just soo much fun, you really have to try it! 😉

If you want to know more about why I think that PulseAudio is an important part of the modern Linux desktop audio stack, please read my slides from FOSS.in 2007.

Misconceptions

Many people (like Jeffrey) wonder why have software mixing at all if you
have hardware mixing? The thing is, hardware mixing is a thing of the past;
modern soundcards don’t do it anymore. SIMD CPU extensions like SSE were
invented precisely for doing things like mixing in software. Modern sound
cards these days are kind of “dumbed down”, high-quality DACs. They don’t do
mixing anymore, many modern chips don’t even do volume control anymore.
Remember the days when having a Wavetable chip was a killer feature of a
sound card? Those days are gone, today wavetable synthesis is done almost
exclusively in software, and that’s exactly what happened to hardware mixing
too. And it is good that way. In software mixing it is much easier to do
fancier stuff like DRC, which will increase the quality of mixing. And modern CPUs provide
all the necessary SIMD instruction sets to implement this efficiently.

Other people believe that JACK would be a better solution for the problem.
This is nonsense. JACK has been designed for a very different purpose. It is
optimized for low latency inter-application communication. It requires
floating point samples, it knows nothing about channel mappings, it depends on
every client to behave correctly. And so on, and so on. It is a sound server
for audio production. For desktop applications it is however not well suited.
For a desktop saving power is very important, one application misbehaving
shouldn’t have an effect on other application’s playback; converting from/to
FP all the time is not going to help battery life either. Please understand
that for the purpose of pro audio you can make completely different
compromises than you can do on the desktop. For example, while having
‘glitch-free’ is great for embedded and desktop use, it makes no sense at all
for pro audio, and would only have a drawback on performance. So, please stop
bringing up JACK again and again. It’s just not the right tool for desktop
audio, and this opinion is shared by the JACK developers themselves.

Jeffrey thinks that audio mixing is nothing for userspace. Which is
basically what OSS4 tries to do: mixing in kernel space. However, the future
of PCM audio is floating point. Mixing floating-point samples in kernel space is problematic because (at least on Linux) FP in kernel space is a no-no.
Also, the kernel people made clear more than once that maths/decoding/encoding like this
should happen in userspace. Quite honestly, doing the mixing in kernel space
is probably one of the primary reasons why I think that OSS4 is a bad idea.
The fancier your mixing gets (i.e. including resampling, upmixing, downmixing,
DRC, …) the more difficult it will be to move such complex,
time-intensive code into the kernel.

Not every time your audio breaks is it PulseAudio’s fault alone. For
example, the original flame of Jeffrey’s was about the low volume that he
experienced when running PA. This is mostly due to the suckish way we
initialize the default volumes of ALSA sound cards. Most distributions have
simple scripts that initialize ALSA sound card volumes to fixed values like
75% of the available range, without understanding what the range or the
controls actually mean. This is actually a very bad thing to do. Integrated
USB speakers for example tend to export the full amplification range via the
mixer controls. 75% for them is incredibly loud. For other hardware (like
apparently Jeffrey’s) it is too low in volume. How to fix this has been
discussed on the ALSA mailing list, but no final solution has been presented
yet. Nonetheless, the fact that the volume was too low, is completely
unrelated to PulseAudio.

PulseAudio interfaces with lower-level technologies like ALSA on one hand,
and with high-level applications on the other hand. Those systems are not
perfect. Especially closed-source applications tend to do very evil things
with the audio APIs (Flash!) that are very hard to support on virtualized
sound systems such as PulseAudio [2]. However, things are getting better. My list of issues I found in
ALSA is getting shorter. Many applications have already been fixed.

The reflex “my audio is broken it must be PulseAudio’s fault” is certainly
easy to come up with, but it certainly is not always right.

Also note that — like many areas in Free Software — development of the
desktop audio stack on Linux is a bit understaffed. AFAIK there are only two
people working on ALSA full-time and only me working on PulseAudio and other
userspace audio infrastructure, assisted by a few others who supply code and patches
from time to time, some more and some less.

More Breakage to Come

I now tried to explain why the audio experience on systems with PulseAudio
might not be as good as some people hoped, but what about the future? To be
frank: the next version of PulseAudio (0.9.11) will break even more things.
The ‘glitch-free’ stuff mentioned above uses quite a few features of the
underlying ALSA infrastructure that apparently no one has been using before,
and which just don’t work properly yet on all drivers. And there are quite a
few drivers around, and I only have a very limited set of hardware to test
with. Already I know that some of the most popular drivers (USB and HDA)
do not work entirely correctly with ‘glitch-free’.

So you ask why I plan to release this code knowing that it will break
things? Well, it works on some hardware/drivers properly, and for the others I
know work-arounds to get things to work. And 0.9.11 has been delayed for too
long already. Also I need testing from a bigger audience. And it is not so
much 0.9.11 that is buggy, it is the code it is based on. ‘Glitch-free’ PA
0.9.11 is going to be part of Fedora 10. Fedora has always been more bleeding
edge than other distributions. Picking 0.9.11 just like that for an
‘LTS’ release might however not be a good idea.

So, please bear with me when I release 0.9.11. Snapshots have already
been available in Rawhide for a while, and hell didn’t freeze over.

The Distributions’ Role in the Game

Some distributions did a better job adopting PulseAudio than others. On the
good side I certainly have to list Mandriva, Debian[3], and
Fedora[4]. OTOH Ubuntu didn’t exactly do a stellar job. They didn’t
do their homework. Adopting PA in a distribution is a fair amount of work,
given that it interfaces with so many different things at so many different
places. The integration with other systems is crucial. The information was all
out there, communicated on the wiki, the mailing lists and on the PA IRC
channel. But if you join and hang around on neither, then you won’t get the
memo. To my surprise, when Ubuntu adopted PulseAudio they moved it into one of their
‘LTS’ releases right away [5]. Which I guess can be called gutsy,
given that I work for Red Hat and PulseAudio is not part of RHEL
at this time. I get a lot of flak from Ubuntu users, and I am pretty sure the
vast majority of it is undeserved and not my fault.

Why Jeffrey’s distro of choice (SUSE?) didn’t package pavucontrol 0.9.6
although it was released months ago I don’t know. But there’s certainly no reason to whine about
that to me and bash me for it.

Having said all this — it’s easy to point to other software’s faults or
other people’s failures. So, admitting this, PulseAudio is certainly not
bug-free, far from that. It’s a relatively complex piece of software
(threading, real-time, lock-free, sensitive to timing, …), and every
software has its bugs. In some workloads they might be easier to find than in
others. And I am working on fixing those which are found. I won’t forget any
bug report, but the order and priority I work on them is still mostly up to me
I guess, right? There’s still a lot of work to do in desktop audio, it will
take some time to get things completely right and complete.

Calls for “audio should just work ™” are often heard. But if you don’t
want to stick with a sound system that was state of the art in the 90’s for
all times, then I fear things *will have* to break from time to time. And
Jeffrey, I have no idea what you are actually hacking on. Some people
mentioned something with Evolution. If that’s true, then quite honestly,
“email should just work”, too, shouldn’t it? Evolution is not exactly
famous for its legendary bug-freeness and stability, or did I miss something?
Maybe you should be the one to start with making things “just work”, especially since
Evolution has been around for much longer already.

Back to Work

Now that I responded to Jeffrey’s FUD I think we all can go back to work
and end this flamefest! I wish people would actually try to understand
things before writing an insulting rant — without the slightest clue — but
with words like “clusterfuck”. I’d like to thank all the people who commented
on Jeffrey’s blog and basically already said what I wrote here now.

So, now I am off hacking on PulseAudio a bit more. Or should
I say, in Jeffrey’s words: on my clusterfuck that is an epic fail and that no desktop user needs?

Footnotes

[1] BTW ‘glitch-free’ is nothing I invented, other OSes have been doing something
like this for quite a while (Vista, Mac OS). On Linux however, PulseAudio is
the first and only implementation (at least to my knowledge).

[2] In fact, Flash 9 can not be made fully working on PulseAudio.
This is because the way Flash destructs its driver backends is racy.
Unfixably racy, from external code. Jeffrey complained about Flash instability
in his second post. This is unfair to PulseAudio, because I cannot fix this.
This is like complaining that X crashes when you use binary-only fglrx.

[3] To Debian’s standards at least. Since development of Debian is
very distributed the integration of such a system as PulseAudio is much more
difficult since it touches so many different packages in the system that are
kind of private property by a lot of different maintainers with different
views on things.

[4] I maintain the Fedora stuff myself, so I might be a bit biased on this one… 😉

[5] I guess Ubuntu sees that this was a bit too much too early, too.
At least that’s how I understood my invitation to UDS in Prague. Since that
summit I haven’t heard anything from them anymore, though.

Avahi/Zeroconf patch for distcc updated

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/avahi-distcc.html

I finally found the time to sit down and update my venerable Avahi/Zeroconf patch for
distcc. A patched distcc
automatically discovers suitable compiler servers on the local network, without
the need to manually configure them. (Announcement).

Here’s a quick HOWTO for using a patched distcc like this:

  • Make sure to start distccd (the server) with the new
    --zeroconf switch. This will make it announce its services on the
    network.
  • Edit your $HOME/.distcc/hosts and add +zeroconf. This
    magic string will enable Zeroconf support in the client, i.e. will be expanded
    to the list of available suitable distcc servers on your LAN.
  • Now set $CC to distcc gcc globally for your login
    sessions. This will tell all well-behaving build systems to use distcc
    for compilation (this doesn’t work for the kernel, as one notable exception).
    Even better than setting $CC to distcc gcc is setting it to
    ccache distcc gcc which will enable ccache in addition to distcc, i.e. stick something like this in your ~/.bash_profile: export CC="ccache distcc gcc"
  • And finally use make -j `distcc -j` instead of plain make
    to enable parallel building with the right number of concurrent processes.
    Setting $MAKEFLAGS properly is an alternative option, but it is suboptimal if
    the evaluation is only done once at login time.
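
Put together, the client-side steps above might look roughly like the sketch below. It assumes the patched distcc is installed; `distcc -j` and `+zeroconf` come from the patch described here, while the `build_jobs` helper and its fallback to the local CPU count via `getconf` are additions made purely for illustration.

```bash
#!/bin/bash
# Sketch of the client-side setup; build_jobs is a hypothetical helper.

# One-time step (not executed here): enable Zeroconf discovery in the client.
#   mkdir -p "$HOME/.distcc" && echo "+zeroconf" >> "$HOME/.distcc/hosts"

# Tell well-behaved build systems to compile through ccache and distcc.
export CC="ccache distcc gcc"

# Print the number of parallel jobs to use. With the patched distcc this is
# whatever "distcc -j" reports; if that is missing or not a positive number,
# fall back to the number of local CPUs (assumption: getconf is available).
build_jobs() {
    local n=""
    if command -v distcc >/dev/null 2>&1; then
        n=$(distcc -j 2>/dev/null) || n=""
    fi
    case "$n" in
        ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    echo "$n"
}

# Usage, evaluated at build time rather than once at login:
#   make -j "$(build_jobs)"
```

Computing the job count at build time, instead of freezing it into $MAKEFLAGS at login, means servers that appear or disappear on the LAN are picked up on the next build.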

If this doesn’t work for you then it is a good idea to run distcc
--show-hosts to get a list of discovered distcc servers. If this list
isn’t complete then this is most likely due to mismatching GCC versions or
architectures. To check if that’s the case use avahi-browse -r
_distcc._tcp and compare the values of the cc_machine and
cc_version fields. Please note that different Linux distributions use
different GCC machine strings. Which is expected since GCC is usually patched quite
a bit on the different distributions. This means that a Fedora distcc
(the client) will not find a Debian distccd (the server) and vice
versa. But again: that’s a feature, not a bug.

The new -j and --show-hosts options for distcc are useful for non-zeroconf setups, too.

The patch will automatically discover the number of CPUs on remote machines
and make use of that information to better distribute jobs.

In short: Zeroconf support in distcc is totally hot, everyone should have it!

For more information have a look at the announcement of my original
patch from 2004 (at that time for the historic HOWL Zeroconf daemon), or read the new
announcement linked above.

Distribution packagers! Please merge this new patch into your packages! It
would be a pity to withhold Zeroconf support in distcc from your users any
longer!

Unfortunately, Fedora doesn’t include any distcc packages. Someone (who’s not me ;-))
should change that.

You like this patch? Then give me a kudo on ohloh.net. Now that I earned a golden 10 (after kicking Larry Ewing from position 64. Ha, take that Mr. Ewing!), I need to make sure I don’t fall into silver oblivion again. 😉

Enforcing a Whitespace Regime

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/whitespace-regime.html

So, you want to be as tough as the kernel guys and enforce a strict
whitespace regime on your project? But you lack the whitespace
fascists with too much free time lurking on your mailing list who
might do all the bitching about badly formatted patches for you?
Salvation is here:

Stick this pre-commit file in your SVN repository as
hooks/pre-commit and give it a chmod +x and your
SVN server will do all the bitching for you — for free:

#!/bin/bash -e

REPOS="$1"
TXN="$2"

SVNLOOK=/usr/bin/svnlook

# Require some text in the log
$SVNLOOK log -t "$TXN" "$REPOS" | grep -q '[a-zA-Z0-9]' || exit 1

# Block commits with tabs or trailing whitespace
$SVNLOOK diff -t "$TXN" "$REPOS" | python /dev/fd/3 3<<'EOF'
import sys
ignore = True
SUFFIXES = [ ".c", ".h", ".cc", ".C", ".cpp", ".hh", ".H", ".hpp", ".java" ]
filename = None

for ln in sys.stdin:

    if ignore and ln.startswith("+++ "):
        # Header of the next file in the diff; the path ends at the first tab
        filename = ln[4:ln.find("\t")].strip()
        ignore = not any(filename.endswith(s) for s in SUFFIXES)

    elif not ignore:
        if ln.startswith("+"):

            if ln.count("\t") > 0:
                sys.stderr.write("\n*** Transaction blocked, %s contains tab character:\n\n%s" % (filename, ln))
                sys.exit(1)

            if ln.endswith(" \n"):
                sys.stderr.write("\n*** Transaction blocked, %s contains lines with trailing whitespace:\n\n%s<EOL>\n" % (filename, ln.rstrip("\n")))
                sys.exit(1)

        # Anything that is not part of a hunk ends the current file's diff
        if not (ln.startswith("@") or
                ln.startswith("-") or
                ln.startswith("+") or
                ln.startswith(" ")):

            ignore = True

sys.exit(0)
EOF

exit "$?"

This will cause all commits to be blocked that don’t follow my personal taste of whitespace rules.

Of course, it is up to you to adjust this script to your personal
taste of fascism. If you hate tabs like I do, and fear trailing
whitespace like I do, then you can use this script without any
changes. Otherwise, learn Python and do some trivial patching.

Hmm, so you wonder why anyone would enforce a whitespace regime
like this? First of all, it’s a chance to be part of a regime —
where you are the dictator! Secondly, if people use tabs, source files
look like Kraut und Rüben, different in every
editor[1]. Thirdly, trailing whitespace makes clean diffs
difficult[2]. And think of the hard disk space savings!

I wonder how this might translate into GIT. I have a couple of GIT
repositories where I’d like to enforce a similar regime as in my SVN repositories. Suggestions welcome!
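
One possible starting point, offered as a sketch rather than a tested hook: the same two checks can be written as a small shell filter over a unified diff, which could then be fed from `git diff --cached` in a Git `pre-commit` hook. The function name is made up for illustration, and unlike the Python script above it does not restrict itself to particular file suffixes.

```bash
#!/bin/bash
# Sketch: succeed (exit status 0) if the unified diff on stdin adds a line
# that contains a tab or ends in a space. The function name is hypothetical.
diff_adds_bad_whitespace() {
    local tab=$'\t' line
    while IFS= read -r line; do
        case "$line" in
            # File header, not content. Caveat: an added line whose own
            # text starts with "++ " looks the same and is skipped too.
            "+++ "*) continue ;;
            # Added line containing a tab, or ending in a space.
            +*"$tab"*|+*" ") return 0 ;;
        esac
    done
    return 1
}

# Possible use inside .git/hooks/pre-commit (not executed here):
#   if git diff --cached | diff_adds_bad_whitespace; then
#       echo "*** Commit blocked: tab or trailing whitespace added" >&2
#       exit 1
#   fi
```

Git also ships `git diff --cached --check`, which reports trailing whitespace on its own; whether it also complains about tabs depends on the repository's `core.whitespace` setting, so the explicit filter keeps the rule independent of configuration.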

Oh, and to make it bearable to live under such a regime, configure
your $EDITOR properly, for example by hooking
nuke-trailing-whitespace.el to 'write-file-hooks in
Emacs.

Footnotes

[1] Yes, some people think this is a feature. I don’t. But talk to /dev/null if you want to discuss this with me.

[2] Yes, there is diff -b, but it is still a PITA.

More Xen Tricks

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2007/08/24/more-xen.html

In my previous post about Xen, I talked about how easy Xen is to configure and
set up, particularly on Ubuntu and Debian. I’m still grateful that
Xen remains easy; however, I’ve lately had a few Xen-related
challenges that needed attention. In particular, I’ve needed to
create some surprisingly messy solutions when using vif-route to
route multiple IP numbers on the same network through the dom0 to a
domU.

I tend to use vif-route rather than vif-bridge, as I like the control
it gives me in the dom0. The dom0 becomes a very traditional
packet-forwarding firewall that can decide whether or not to forward
packets to each domU host. However, I recently found some deep
weirdness in IP routing when I use this approach while needing
multiple Ethernet interfaces on the domU. Here’s an example:

Multiple IP numbers for Apache

Suppose the domU host, called webserv, hosts a number of
websites, each with a different IP number, so that I have Apache
doing something like[1]:

Listen 192.168.0.200:80
Listen 192.168.0.201:80
Listen 192.168.0.202:80

NameVirtualHost 192.168.0.200:80
<VirtualHost 192.168.0.200:80>

NameVirtualHost 192.168.0.201:80
<VirtualHost 192.168.0.201:80>

NameVirtualHost 192.168.0.202:80
<VirtualHost 192.168.0.202:80>

The Xen Configuration for the Interfaces

Since I’m serving all three of those sites from webserv, I
need all those IP numbers to be real, live IP numbers on the local
machine as far as the webserv is concerned. So, in
dom0:/etc/xen/webserv.cfg I list something like:

vif = [ 'mac=de:ad:be:ef:00:00, ip=192.168.0.200',
        'mac=de:ad:be:ef:00:01, ip=192.168.0.201',
        'mac=de:ad:be:ef:00:02, ip=192.168.0.202' ]

… And then make webserv:/etc/iftab look like:

eth0 mac de:ad:be:ef:00:00 arp 1
eth1 mac de:ad:be:ef:00:01 arp 1
eth2 mac de:ad:be:ef:00:02 arp 1

… And make webserv:/etc/network/interfaces (this is
probably Ubuntu/Debian-specific, BTW) look like:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
    address 192.168.0.200
    netmask 255.255.255.0
auto eth1
iface eth1 inet static
    address 192.168.0.201
    netmask 255.255.255.0
auto eth2
iface eth2 inet static
    address 192.168.0.202
    netmask 255.255.255.0

Packet Forwarding from the Dom0

But, this doesn’t get me the whole way there. My next step is to make
sure that the dom0 is routing the packets properly to
webserv. Since my dom0 is heavily locked down, all
packets are dropped by default, so I have to let through explicitly
anything I’d like webserv to be able to process. So, I
add some code to my firewall script on the dom0 that looks like[2]:

webIpAddresses="192.168.0.200 192.168.0.201 192.168.0.202"
UNPRIVPORTS="1024:65535"

for dport in 80 443;
do
  for sport in $UNPRIVPORTS 80 443 8080;
  do
    for ip in $webIpAddresses;
    do
      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --syn -m state --state NEW \
        --sport $sport --dport $dport -j ACCEPT

      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --sport $sport --dport $dport \
        -m state --state ESTABLISHED,RELATED -j ACCEPT

      /sbin/iptables -A FORWARD -o eth0 -s $ip \
        -p tcp --dport $sport --sport $dport \
        -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    done
  done
done
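
Since a slip in nested loops like these is easy to make and awkward to debug on a locked-down dom0, it may help to preview what they expand to before touching the live rule set. The sketch below only prints the commands; the `IPTABLES` variable and the `emit_rules` function are names added for illustration and are not part of the original firewall script.

```bash
#!/bin/bash
# Dry-run sketch: print the iptables commands instead of running them.
# IPTABLES and emit_rules are hypothetical names added for illustration.
IPTABLES="echo /sbin/iptables"

webIpAddresses="192.168.0.200 192.168.0.201 192.168.0.202"
UNPRIVPORTS="1024:65535"

emit_rules() {
    local dport sport ip
    for dport in 80 443; do
        for sport in $UNPRIVPORTS 80 443 8080; do
            for ip in $webIpAddresses; do
                $IPTABLES -A FORWARD -i eth0 -p tcp -d "$ip" \
                    --syn -m state --state NEW \
                    --sport "$sport" --dport "$dport" -j ACCEPT
                $IPTABLES -A FORWARD -i eth0 -p tcp -d "$ip" \
                    --sport "$sport" --dport "$dport" \
                    -m state --state ESTABLISHED,RELATED -j ACCEPT
                $IPTABLES -A FORWARD -o eth0 -s "$ip" \
                    -p tcp --dport "$sport" --sport "$dport" \
                    -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
            done
        done
    done
}

# 2 destination ports x 4 source-port specs x 3 IPs x 3 rules = 72 lines
emit_rules | wc -l
```

Once the printed rules look right, setting `IPTABLES=/sbin/iptables` would run them for real.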

Phew! So at this point, I thought I was done. The packets should find
their way forwarded through the dom0 to the Apache instance running on
the domU, webserv. While that much was true, I now had
the additional problem that packets got lost in a bit of a black hole
on webserv. When I discovered the black hole, I quickly
realized why. It was somewhat atypical, from webserv’s
point of view, to have three “real” and different Ethernet
devices with three different IP numbers, which all talk to the exact
same network. More intelligent routing was
needed.[3]

Routing in the domU

While most non-sysadmins still use the route command to
set up local IP routes on a GNU/Linux host, iproute2
(available via the ip command) has been a standard part
of GNU/Linux distributions and supported by Linux for nearly ten
years. To properly support the situation of multiple (from
webserv’s point of view, at least) physical interfaces on
the same network, some special iproute2 code is needed.
Specifically, I set up separate route tables for each device. I first
encoded their names in /etc/iproute2/rt_tables (the
numbers 16-18 are arbitrary, BTW):

16 eth0-200
17 eth1-201
18 eth2-202

And here are the ip commands that I thought would work
(but didn’t, as you’ll see next):

/sbin/ip route del default via 192.168.0.1

for table in eth0-200 eth1-201 eth2-202;
do
  iface=`echo $table | perl -pe 's/^(\S+)-.*$/$1/;'`
  ipEnding=`echo $table | perl -pe 's/^.*-(\S+)$/$1/;'`
  ip=192.168.0.$ipEnding
  /sbin/ip route add 192.168.0.0/24 dev $iface table $table

  /sbin/ip route add default via 192.168.0.1 table $table
  /sbin/ip rule add from $ip table $table
  /sbin/ip rule add to 0.0.0.0 dev $iface table $table
done

/sbin/ip route add default via 192.168.0.1

The idea is that each table will use rules to force all traffic coming
in on the given IP number and/or interface to always go back out on
the same, and vice versa. The key is these two lines:

/sbin/ip rule add from $ip table $table
/sbin/ip rule add to 0.0.0.0 dev $iface table $table

The first rule says that when traffic is coming from the given IP number,
$ip, the routing rules in table $table should
be used. The second says that traffic to anywhere, when bound for
interface $iface, should use table
$table.

The tables themselves are set up to always make sure the local network
traffic goes through the proper associated interface, and that the
network router (in this case, 192.168.0.1) is always
used for foreign networks, but that it is reached via the correct
interface.

This is all well and good, but it doesn’t work. Certain instructions
fail with the message, RTNETLINK answers: Network is
unreachable, because the 192.168.0.0 network cannot be found
while the instructions are running. Perhaps there is an
elegant solution; I couldn’t find one. Instead, I temporarily set
up “dummy” global routes in the main route table and
deleted them once the table-specific ones were created. Here’s the
new bash script that does that (lines that are added are marked
with a trailing “# added” comment):

/sbin/ip route del default via 192.168.0.1
for table in eth0-200 eth1-201 eth2-202;
do
  iface=`echo $table | perl -pe 's/^(\S+)-.*$/$1/;'`
  ipEnding=`echo $table | perl -pe 's/^.*-(\S+)$/$1/;'`
  ip=192.168.0.$ipEnding
  /sbin/ip route add 192.168.0.0/24 dev $iface table $table

  /sbin/ip route add 192.168.0.0/24 dev $iface src $ip         # added

  /sbin/ip route add default via 192.168.0.1 table $table
  /sbin/ip rule add from $ip table $table

  /sbin/ip rule add to 0.0.0.0 dev $iface table $table

  /sbin/ip route del 192.168.0.0/24 dev $iface src $ip         # added
done
/sbin/ip route add 192.168.0.0/24 dev eth0 src 192.168.0.200   # added
/sbin/ip route add default via 192.168.0.1
/sbin/ip route del 192.168.0.0/24 dev eth0 src 192.168.0.200   # added

I am pretty sure I’m missing something here — there must be a
better way to do this, but the above actually works, even if it’s
ugly.
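
As a side note, the two perl calls are only there to split a table name such as eth0-200 into its interface and final address octet, and bash parameter expansion can do the same without a subprocess. A minimal sketch, in which the `split_table` function name is hypothetical and the ip commands are only printed, not executed:

```bash
#!/bin/bash
# Sketch: split "<iface>-<last octet>" table names with parameter expansion.
# split_table is a hypothetical helper; nothing here changes any routes.
split_table() {
    local table=$1
    local iface=${table%-*}       # everything before the last "-", e.g. eth0
    local ipEnding=${table##*-}   # everything after the last "-", e.g. 200
    echo "$iface 192.168.0.$ipEnding"
}

for table in eth0-200 eth1-201 eth2-202; do
    read -r iface ip <<< "$(split_table "$table")"
    # Print what would be run, so the result can be inspected first.
    echo /sbin/ip route add 192.168.0.0/24 dev "$iface" table "$table"
    echo /sbin/ip rule add from "$ip" table "$table"
done
```

This does not address the ordering problem that made the dummy routes necessary; it only removes the dependency on perl for the string handling.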

Alas, Only Three

There was one additional confusion I put myself through while
implementing the solution. I was actually trying to route four
separate IP addresses into webserv, but discovered that
I got this error message (found via dmesg on the
domU):
netfront can't alloc rx grant refs. A quick google
around led me to the XenFaq, which says that Xen 3 cannot handle more than three network
interfaces per domU. Seems strangely arbitrary to me; I’d love
to hear why it cuts off at three. I can imagine limits at one and
two, but it seems that once you can do three, n should be
possible (perhaps still with linear slowdown or some such). I’ll
have to ask the Xen developers (or UTSL) some day to find out what
makes it possible to have three work but not four.

[1] Yes, I know I
could rely on client-provided Host: headers and do this with full
name-based virtual hosting, but I don’t
like to do that for good reason (as outlined in the Apache
docs).

[2] Note that the
above firewall code must run on dom0, which has one real
Ethernet device (its eth0) that is connected properly to
the wide 192.168.0.0/24 network, and should have some IP
number of its own there — say 192.168.0.100. And,
don’t forget that dom0 is configured for vif-route, not
vif-bridge. Finally, for brevity, I’ve left out some of the
firewall code that FORWARDs through key stuff like DNS. If you are
interested in it, email me or look it up in a firewall book.

[3] I was actually a
bit surprised at this, because I often have multiple IP numbers
serviced from the same computer and physical Ethernet interface.
However, in those cases, I use virtual interfaces
(eth0:0, eth0:1, etc.). On a normal system,
Linux does the work of properly routing the IP numbers when you attach
multiple IP numbers virtually to the same physical interface.
However, in Xen domUs, the physical interfaces are locked by Xen to
only permit specific IP numbers to come through, and while you can set
up all the virtual interfaces you want in the domU, it will only get
packets destined for the IP number specified in the vif
section of the configuration file. That’s why I added my three
different “actual” interfaces in the domU.