Tag Archives: wd

systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/journalctl.html

It’s that time again, here’s now the seventeenth installment of my ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some functionality of the journal, and how it is exposed in systemctl. In this episode I want to explain a few more uses of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in a few words to get you up to speed: the journal is a component of systemd that captures syslog messages, kernel log messages, initial RAM disk and early boot messages, as well as messages written to STDOUT/STDERR of all services; it indexes them and makes all of this available to the user. It can be used in parallel with, or in place of, a traditional syslog daemon such as rsyslog or syslog-ng. For more information, see the initial announcement.

The journal has been part of Fedora since F17. With Fedora 18 it has now grown into a reliable, powerful tool to handle your logs. Note, however, that on F17 and F18 the journal is configured by default to store logs only in a small ring buffer in /run/log/journal, i.e. not persistently. This of course limits its usefulness quite drastically, but it is sufficient to show a bit of recent log history in systemctl status. For Fedora 19 we plan to change this and enable persistent logging by default. Journal files will then be stored in /var/log/journal and can grow much larger, making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful structured data into your journal to play with. Oh, and since you have the journal now, you don’t need syslog anymore (unless having /var/log/messages as a text file is a necessity for you), so you can choose to uninstall rsyslog:

# yum remove rsyslog
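
By the way, if you’d rather make this explicit than rely on the existence of the directory: journald.conf knows a Storage= switch for this. Treat the following as a sketch and verify in journald.conf(5) that your version already has the option; in /etc/systemd/journald.conf:

[Journal]
Storage=persistent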

Basics

Now we are ready to go. The following text shows a lot of features of systemd 195 as it will be included in Fedora 18[1], so if your F17 can’t do the tricks you see, please wait for F18. First, let’s start with some basics. To access the logs of the journal, use the journalctl(1) tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the system, from system components as well as from logged-in users. The output you will get looks like a pixel-perfect copy of the traditional /var/log/messages format, but actually has a couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted in red.
  • Lines of notice/warning priority will be highlighted in bold.
  • The timestamps are converted into your local time zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of the output this generates; I cut that out for brevity — and to give you a reason to try it out yourself with a current image of F18’s development version with systemd 195. But I do hope you get the idea anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring root sucks, of course; even administrators tend to do most of their work as unprivileged users these days. By default, journal users can only watch their own logs, unless they are root or in the adm group. To make watching system logs more fun, let’s add ourselves to adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters, journalctl will show the current log database. Sometimes one needs to watch logs as they grow, where one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you the last ten log lines and then wait for changes and show them as they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the whole set of logs, beginning with the oldest message stored. That, of course, can be a lot of data. Much more useful is just viewing the logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the aforementioned gimmicks. But sometimes even this is way too much data to process. So what about just listing all the real issues to care about: all messages of priority level ERROR and worse, from the current boot:

$ journalctl -b -p err
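
By the way, priorities may be specified by name or by their syslog numeric value, so the following should be equivalent to the above (err is 3 in syslog terms):

$ journalctl -b -p 3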

If you reboot only seldom, -b makes little sense; filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this; it seems to be some weird interaction with dhclient. Let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!
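
(And of course, all these switches can be combined. Something like the following, with several filters at once, narrows things down even further:)

$ journalctl -u httpd -p err --since=yesterday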

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a notch. Internally, systemd stores each log entry with a set of implicit metadata. This metadata looks a lot like an environment block, but is actually a bit more powerful: values can be large and binary (though this is the exception; usually they just contain UTF-8), and fields can have multiple values assigned (an exception too; usually they have only one value). This implicit metadata is collected for each and every log message, without user intervention. The data will be there, waiting to be used by you. Let’s see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here; I don’t want to make this story overly long. -n without a parameter shows you the last 10 log entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose output. Instead of a pixel-perfect copy of classic /var/log/messages that includes only a minimal subset of what is available, we now see all the gory details the journal has about each entry. And it’s highly interesting: there is user credential information, SELinux bits, machine information and more. For a full list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go: this will show all log messages logged from Linux user ID 70. As it turns out, one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it: if you specify two matches for different field names, they will be combined with a logical AND. All entries matching both will be shown, i.e. all messages from processes named avahi-daemon on host epsilon.

But of course, that’s not fancy enough for us. We are computer nerds after all; we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when you match the same field twice. The line above hence means: show me everything from host theta with UID 70, or from host epsilon with a process name of avahi-daemon.

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can remember all those values a field can take in the journal? I mean, seriously, who has thaaaat kind of photographic memory? Well, the journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the database, or in other words: the names of all systemd services which ever logged into the journal. This makes it super-easy to build nice matches. But wait, it turns out this is all actually hooked up with shell completion on bash! And it gets even more awesome: as you type your match expression you will get a list of well-known field names, and of the values they can take! Let’s figure out how to filter for SELinux labels again. We remember the field name was something with SELINUX in it, so let’s try that:

$ journalctl _SE<TAB>

And yupp, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!
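
To make these teasers a bit more concrete, here are a few invocations to play with (the output mode names are taken from journalctl(1) as of systemd 195, so double-check them on your version):

$ journalctl -o json -n 10
$ journalctl -o short-monotonic -b
$ journalctl _TRANSPORT=kernel -b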

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here: indexing by block device is not in the kernel yet, but it is on its way due to Hannes’ fantastic work, and I hope it will make an appearance in F18.

WinConn

Post Syndicated from RealEnder original http://alex.stanev.org/blog/?p=302

For years I have been using Linux as my primary working environment.
As the sentence above, and especially the word “primary”, suggests, I sometimes still have to boot win/mac to run a number of applications, either because of poor compatibility or simply because they are specialized software written only for that particular operating system.
For years the solution was to attach to a remote machine over VNC/RDP/whatever and get the job done as quickly as possible, before getting annoyed by the differences in the interface and losing productivity. The options for seamless integration (using the remote applications side by side with the rest of your desktop environment) were limited and hard to configure and use: rdesktop’s SeamlessRDP, VirtualBox’s Seamless mode, VMware Fusion and so on.
Thanks to the development of the FreeRDP project, we now have a free implementation of RemoteApp. The missing piece of the puzzle is an application that makes configuring remote applications easy.
To that end, over the last few days I have been working on WinConn, an open-source graphical manager for RemoteApp applications. On its page you can see the obligatory screenshots and a video of it in action, and you can install it from my PPA on Ubuntu.
WinConn’s development coincided with the Ubuntu App Showdown, a three-week contest for writing Ubuntu applications. If you like it, after July 10th you will be able to vote for it through the rating system in the Ubuntu Software Centre. And of course, I welcome any suggestions for improvements, bug reports and so on in Launchpad.

systemd for Administrators, Part XV

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/watchdog.html

Quickly following the previous iteration, here’s now the fifteenth installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd: the embedded/mobile folks, the desktop people and the server folks. While the systems used by embedded/mobile tend to be underpowered and have few resources available, desktops tend to be much more powerful machines — but still with far fewer resources than servers. Nonetheless there are surprisingly many features that matter to both extremes of this axis (embedded and servers), but not the center (desktops). One of them is support for watchdogs in hardware and software.

Embedded devices frequently rely on watchdog hardware that resets them automatically if software stops responding (more specifically, stops signalling the hardware in fixed intervals that it is still alive). This is required to increase reliability and to make sure that, regardless of what happens, the best is attempted to get the system working again. Functionality like this makes little sense on the desktop[1]. However, on high-availability servers watchdogs are frequently used as well.

Starting with version 183, systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang, this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs — by the hardware. To make the chain complete, systemd then exposes a software watchdog interface for individual services, so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service, in both the ping frequency and the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, and systemd supervising all other services), we have a reliable way to watchdog every single component of the system.

To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in
/etc/systemd/system.conf. It defaults to 0 (i.e. no hardware
watchdog use). Set it to a value like 20s and the watchdog is
enabled. After 20s of no keep-alive pings the hardware will reset
itself. Note that systemd will send a ping to the hardware at half the
specified interval, i.e. every 10s. And that’s already all there is to
it. By enabling this single, simple option you have turned on
supervision by the hardware of systemd and the kernel beneath
it.[2]

Note that the hardware watchdog device (/dev/watchdog) is
single-user only. That means that you can either enable this
functionality in systemd, or use a separate external watchdog daemon,
such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be configured in /etc/systemd/system.conf. It controls the watchdog interval to use during reboots. It defaults to 10min and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as an extra safety net.
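
Spelled out, the relevant excerpt of /etc/systemd/system.conf would hence look something like this (using the example values from above):

[Manager]
RuntimeWatchdogSec=20s
ShutdownWatchdogSec=10min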

So much for the hardware watchdog logic. These two options are really everything that is necessary to make use of the hardware watchdogs. Now let’s have a look at how to add watchdog logic to individual services.

First of all, to make software watchdog-supervisable, it needs to be patched to send out “I am alive” signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec, formatted as an ASCII text string, as it is configured for the service. The daemon should then issue sd_notify("WATCHDOG=1") calls every half of that interval. A daemon patched this way should transparently support watchdog functionality, by checking whether the environment variable is set and honouring the value it is set to.
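
For illustration, here’s a minimal sketch in C of what such a patch boils down to, using sd_notify() from the sd-daemon API (link against libsystemd-daemon; the do_work() function is hypothetical, error handling is omitted, and a real daemon would ping from within its event loop instead of sleeping):

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

/* Hypothetical stand-in for one iteration of the daemon's event loop. */
static void do_work(void) {
}

int main(void) {
        uint64_t watchdog_usec = 0;
        const char *e = getenv("WATCHDOG_USEC");

        if (e)
                watchdog_usec = strtoull(e, NULL, 10);

        for (;;) {
                do_work();

                if (watchdog_usec > 0) {
                        /* Tell systemd we are still alive, and do so
                         * at half the configured interval. */
                        sd_notify(0, "WATCHDOG=1");
                        usleep((useconds_t) (watchdog_usec / 2));
                }
        }
}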

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above), it is sufficient to set WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service’s processes, and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

If a service enters a failure state as soon as the watchdog logic detects a hang, then this alone is hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure, set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted, use the combination of StartLimitBurst= and StartLimitInterval=, which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shut down any services, but immediately kills all remaining services and unmounts all file systems, then forcibly reboots (this way all file systems will be clean, but the reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard-reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).

Putting this all together we now have pretty flexible options to
watchdog-supervise a specific service and configure automatic restarts
of the service if it hangs, plus take ultimate action if that doesn’t
help.

Here’s an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn’t pinged
the system manager for longer than 30s or if it fails otherwise. If it
is restarted this way more than 4 times within 5min, action is taken
and the system is quickly rebooted, with all file systems being clean
when it comes up again.

And that’s already all I wanted to tell you about! With hardware
watchdog support right in PID 1, as well as supervisor watchdog
support for individual services, we should provide everything you need
for most watchdog use cases. Regardless of whether you are building an
embedded or mobile appliance, or working with high-availability
servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with
/dev/watchdog, and why this shouldn’t be kept in a separate
daemon, then please read this again and try to understand that this is
all about the supervisor chain we are building here, where the hardware watchdog
supervises systemd, and systemd supervises the individual
services. Also, we believe that a service not responding should be
treated in a similar way as any other service error. Finally, pinging
/dev/watchdog is one of the most trivial operations in the OS
(basically little more than an ioctl() call), so the support for this
is no more than a handful of lines of code. Maintaining this externally
with complex IPC between PID 1 (and the daemons) and this watchdog
daemon would be drastically more complex, error-prone and resource
intensive.)

Note that the built-in hardware watchdog support of systemd does
not conflict with other watchdog software by default. systemd does not
make use of /dev/watchdog by default, and you are welcome to
use external watchdog daemons in conjunction with systemd, if this
better suits your needs.

And one last thing: if you wonder whether your hardware has a
watchdog, then the answer is: almost definitely yes — if it is anything
more recent than a few years old. If you want to verify this, try the wdctl
tool from recent util-linux, which shows you everything you need to
know about your watchdog hardware.

I’d like to thank the great folks from Pengutronix for contributing
most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog
hardware these days too, as this is cheap to build and available in
most modern PC chipsets.

[2] So, here’s a free tip for you if you hack on the core
OS: don’t enable this feature while you hack. Otherwise your system
might suddenly reboot if you are in the middle of tracing through PID
1 with gdb and cause it to be stopped for a moment, so that no
hardware ping can be done…

Germany Trip: Samba XP Keynote and LinuxTag Keynote

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2011/05/18/germany.html

I just returned a few days ago to the USA after one week in Germany. I
visited Göttingen for my keynote at Samba XP (which I already blogged
about). Attending Samba XP was an excellent experience, and I thank
SerNet for sponsoring my trip there. Since going full-time at
Conservancy last year, I have been trying to visit the conferences of
each of Conservancy’s member projects. It will probably take me years
to do this, but given that Samba is one of Conservancy’s charter
members, it’s good that I have finally visited Samba’s annual
conference. It was even better that they asked me to give a keynote
talk at Samba XP.

I must admit that I didn’t follow the details of many of the talks other
than Tridge’s Samba 4 Status Report talk and
Jeremy’s The Death of File Protocols. This time I really mean
it! talk. The rest, unsurprisingly, were highly specific and
detailed about Samba, and since I haven’t been a regular Samba user
myself since 1996, I didn’t have the background information required to
grok the talks fully. But I did see a lot of excited developers, and it
was absolutely wonderful to meet the entire Samba Team for the first
time after exchanging email with them for so many years.

It’s funny to see how different communities tend to standardize around
the same kinds of practices with minor tweaks. Having visited a lot of
project-specific conferences for Conservancy’s members, I’m seeing how
each community does their conference, and one key thing all projects
have in common is the same final conference session: a panel discussion
with all the core developers.

The Samba Team has their own little tweak on this.
First, John Terpstra asks all
speakers at the conference (which included me this year) to join the
Samba Team and stand up in front of the audience. Then, the audience
can ask any final questions of all speakers (this year, the attendees
had none). Then, the Samba Team stands up in front of the crowd and
takes questions.

The Samba tweak on this model is that the Samba Team is not permitted
to sit down during the Q&A. This year, it didn’t last that long,
but it was still rather amusing. I’ve never seen a developers’ panel
before where the developers couldn’t sit down!

After Samba XP, I headed “back” to Berlin (my flight had landed there
on Saturday and I’d taken the Deutsche Bahn ICE train to Göttingen for
Samba XP), and arrived just in time to attend LinuxNacht, the LinuxTag
annual party. (WARNING: name dropping follows!) It was excellent to see
Vincent Untz, Lennart Poettering, Michael Meeks and Stefano Zacchiroli
at the party (listed in the order I saw them).

The next day I attended Vincent’s talk, which was about
cross-distribution collaboration. It was a good talk, although I think
Vincent glossed over the fact that many distributions (Fedora, Ubuntu,
and OpenSUSE, specifically) are controlled by companies and that
cross-distribution collaboration has certain complications because of
this corporate influence. I talked with Vincent in more detail about
this later, and he argued that the developers at the companies in
question have a lot of freedom to operate, but I maintain there are
subtle (and sometimes not so subtle) influences that cause problems for
cross-distribution collaboration. I also encouraged Vincent to listen
to Richard Fontana’s talk, Open Source Projects and Corporate
Entanglement, which Karen and I released as an episode of the FaiF
oggcast.

I also attended Martin Michlmayr’s talk on SPDX. I kibitzed more than I
should have from the audience, pointing out that while SPDX is a good
“first start”, it’s a bit of a “too little, too late” attempt to
address and prevent the flood of GPL violations that are now all too
common. I believe SPDX is a great tool for those who are already
generally in compliance, but it isn’t very likely to impact the more
common violations, wherein the companies just ignore their GPL
obligations. A lively debate ensued on this topic. I frankly hope to be
proved wrong on this; if SPDX actually ends or reduces GPL violations,
I’ll be happy to work on something else instead.

On Friday afternoon, I gave my second keynote of the week, an updated
version of my talk, 12 Years of GPL Compliance: A Historical
Perspective. It went well, although I misunderstood and thought I had a
full hour slot when I actually had only a 50-minute slot, so I had to
rush a bit at the end. I really do hate rushing at the end when
speaking primarily to a non-native-English-speaking audience, as I know
I’m capable of speaking English way too fast (a problem that I am
constantly vigilant about under normal public speaking circumstances).

The talk was nevertheless pretty well received, and afterward, I was
surrounded by a gaggle of interested copyleft enthusiasts, who, as
always, were asking what more can be done to enforce the GPL. My talks
on enforcement always tend to elicit this reaction, since my final
slides are a
bit depressing with regard to the volume of GPL enforcement that’s
currently occurring.

Meanwhile, I decided I should start putting up my slides from talks in
a more accessible fashion. Since I use S5 (although I hope to switch to
jQuery S5 RSN), my slides are trivially web-publishable anyway. While
I’ve generally published the source code to my slides, it makes sense
to also make compiled, quickly viewable versions of my slides available
on my website too. Finally, I realized I should also put my upcoming
public speaking events on my frontpage, and have done so.

After a late lunch on Friday, I saw only the very end of Lennart’s talk
on systemd, and then I visited for a while with Claudia Rauch, Business
Manager of KDE e.V., in the KDE booth. Claudia kindly helped me
practice my German a bit by speaking slowly enough that I could
actually parse the words.

I must admit I was pretty frustrated all week that my German is now so
poor. I studied German for two years in high school and one semester in
college. I even participated in a three-week student exchange trip to a
Gymnasium (the German term for a college-prep high school) in Munich in
1990. Yet, my German speaking skills are now just a degraded version of
what they once were.

Meanwhile, I did rather like Berlin’s Tegel airport (TXL). It’s a
pretty small airport, but I really like its layout. Because of its
small size, each check-in area is attached to a security checkpoint,
which is then directly connected to the gate. While this might seem a
bit tight, it makes it very easy to check in, go through security, and
then be right at your gate. I can understand why an airport this small
would have to be closed (it’s slated for closure in 2012), but I am glad
that I got a chance to travel to it (and probably again, for the Desktop
Summit) before it closes.

Software Freedom Is Elementary, My Dear Watson.

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2011/03/01/watson.html

I’ve watched the game show, Jeopardy!, regularly since its
Trebek-hosted relaunch on 1984-09-10. I even remember distinctly the
Final Jeopardy question that night: This date is the first day of the
new millennium. At the age of 11, I got the answer wrong, falling for
the incorrect What is 2000-01-01?, but I recalled this memory eleven
years ago during the debates regarding when the millennium turnover
happened.

I had periods of life where I watched Jeopardy! only
rarely, but in recent years (as I’ve become more of a student of games
(in part, because of poker)), I’ve watched Jeopardy! almost
nightly over dinner with my wife. I’ve learned that I’m unlikely to
excel as a Jeopardy! player myself because (a) I read slowly
and (b) my recall of facts, while reasonably strong, is not
instantaneous. I thus haven’t tried out for the show, but I’m
nevertheless a fan of strong players.

Jeopardy! isn’t my only spectator game. Right after
college, even though I’m a worse-than-mediocre chess player, I watched
with excitement
as Deep
Blue
played and defeated Kasparov. Kasparov has disputed the
results and how much humans were actually involved, but even so, such
interference was minimal (between matches) and the demonstration still
showed computer algorithmic mastery of chess.

Of course, the core algorithms that Deep Blue used were well known and
often implemented. I learned α-β pruning in my undergraduate
AI course, and it was clear that a sufficiently fast computer, given a
few strong heuristics, could master most any full-information game with
a reasonable branching factor. And, these days, computers typically do.

I suppose I never really thought about the issues of Deep Blue being
released as Free Software. First, because I was not as involved with
Free Software then as I am now, and also, as near as anyone could tell,
Deep Blue’s software was probably not useful for anything other than
playing chess, and its primary power was in its ability to go very deep
(hence the name, I guess) in the search tree. In short, Deep Blue was
primarily a hardware, not a software, success story.

It was nevertheless impressive, and last month I saw the next
installment in this IBM story. I watched with interest as IBM’s Watson
defeated two champion Jeopardy! players. Ken Jennings, for one, even
welcomed our new computer overlords.

Watson beating Jeopardy! is, frankly, a lot more
innovative than Deep Blue beating chess. Most don’t know this about me,
but I came very close to focusing my career on PhD work in Natural
Language Processing; I believe fundamentally it’s the area of AI most in
need of attention and research. Watson is a shining example of success
in modern NLP, and I actually believe some of the IBM hype about
how Watson’s
technology can be applied elsewhere, such as medical information
systems
. Indeed, IBM
has announced
a deal with Columbia University Medical Center to adapt the system for
medical diagnostics
. (Perhaps Watson’s next TV appearance will be
on House.)

This all sounds great to most people, but to me, my real concern is the
freedom of the software. We’ve shown in the software freedom community
that to advance software and improve it, sharing the software is
essential. Technology locked up in a vaulted cave doesn’t allow all the
great minds to collaborate. Just as we don’t lock up libraries so that
only the gilded overlords have access, we shouldn’t restrict the best
software technology in proprietariness.

Indeed, Eric Brown, at his Linux Foundation End User Linux Summit talk,
told us that Watson relied heavily on the publicly available software
freedom codebase, such as GNU/Linux, Hadoop, and other FLOSS
components. They clearly couldn’t do their work without building upon
the work we shared with IBM, yet IBM apparently ignores its moral
obligation to reciprocate.

So, I just point-blank asked Brown why Watson is proprietary. Of
course, I long ago learned never to ask a confrontational question from
the crowd at a technical talk without knowing what the answer is likely
to be. Brown answered in the way I expected: We’re working with
Universities to provide a framework for their research. I followed up,
asking when he would actually release the sources and what the license
would be. He dodged the question, and instead speculated about what
licenses IBM sometimes likes to use when it does choose to release
code; he did not indicate if Watson’s sources will ever be released. In
short, the answer from IBM is clear: Watson’s general ideas will be
shared with academics, but the source code won’t be.

This point is precisely one of the reasons I didn’t pursue a career in
academic Computer Science. Since most jobs — including professorships
at Universities — for PhDs in Computer Science require that any code
written be kept proprietary, most Computer Science researchers have
convinced themselves that code doesn’t matter; only publishing ideas
does. This belief is so pervasive that I knew something like this would
be Brown’s response to my query. (I was even so sure that I wrote
almost this entire blog post before I asked the question.)

I’d easily agree that publishing papers is better than the technology
being only a trade secret. At least we can learn a little bit about the
work. But in all but the purely theoretical areas of Computer Science,
code is written to exemplify, test, and exercise the ideas. Merely
publishing papers and not the code is akin to a chemist publishing
final results but nothing about the methodologies or raw data. Science,
in such cases, is unverifiable and unreproducible. If we accepted such
in fields other than CS, we’d have accepted the idea that cold fusion
was discovered in 1989.

I don’t think I’m going to convince IBM to release Watson’s sources as
Free Software. What I do hope is that perhaps this blog post convinces
a few more people that we just shouldn’t accept that Computer Science is
advanced by researchers who give us flashy demos and code-less
research papers. I, for one, welcome our computer overlords…but only
if I can study and modify their source code.

Handy binaries for Thecus NAS boxes

Post Syndicated from Laurie Denness original https://laur.ie/blog/2010/11/handy-binaries-for-thecus-nas-boxes/

I recently took delivery of the rather splendid Thecus N5500, which I love; it’s the perfect mix between “it just works” and “oh, let’s stick SSH on there and poke around”. With 5 hot-swap disk shelves and 2TB hard drives, you’ve got a serious amount of storage.

For your money you get a very nice little piece of hardware in a pretty nice shell (it strikes me as a touch tacky in places but then again it’s hardly going on show) with software that gets the job done. NFS, AFP, Samba, iSCSI, iTunes DAAP support, and plenty of modules to tickle your fancy (Logitech Squeezecenter, for instance).

But who am I kidding, I’m a sysadmin. 10 minutes after powering the thing on I was dying to log in using SSH so I could watch /proc/mdstat to see the RAID build. Luckily, the modules from the Thecus N5200 work fine, which means you’re a couple of clicks away from a root terminal.

  1. Grab the SSH and SYSUSER N5200 modules, and unzip them first (a mistake I made… how embarrassing).
  2. Upload them using the web interface, and enable them.
  3. SSH to the NAS box using the user “sys” and the password “sys”.
  4. Enjoy your shell, and remember to run `passwd sys` to change the password to something else.

Now you’ve got yourself a pretty handy, albeit BusyBox-ridden, install of Linux. The whole point of this post is to pimp a few statically compiled binaries that might come in useful to you; they did to me, anyway.

(You may wish to install the UTILITIES module, which gives you a proper version of top and ps, amongst other things, available here)

You can simply untar and drop the binaries into the /raid/data/modules/bin folder so that they’re in your path, and stored on your disks rather than the flash units, which are rather limited in space. By the way, these modules should also work fine on the Thecus N5200 NAS boxes.

The binaries are available here: http://denness.net/thecus/binaries/

The list includes (all the latest versions as of the date of this blog post):

  • ethtool, handy for network interface prodding
  • iftop, a very useful “GUI” app that shows incoming/outgoing network bandwidth (let’s face it, this is fun on a NAS. NOTE: you may need to execute this one using `TERM=vt100; iftop`)
  • iostat, for hard core disk stats porn. Run it with `iostat -mx 1` and watch the megabytes fly
  • rsync, particularly handy if you want to synchronise/backup data from one place to another, so particularly handy on a NAS.
  • vim, just in case you were planning on writing a lot of code on the Thecus 🙂
  • GNU screen, a nice place to store your terminals and detach and come back later. (NOTE: you may need to execute this one using `TERM=vt100; screen`)
  • The command line version of PHP, in case you were planning on writing any scripts in PHP to run on the Thecus.

Any suggestions/comments, let me know.

Post-Bilski Steps for Anti-Software-Patent Advocates

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2010/06/30/bilski.html

Lots of people are opining about the USA Supreme Court’s ruling in the
Bilski case. Yesterday, I participated in an oggcast with the folks at
SFLC. In that oggcast, Dan Ravicher explained most of the legal details
of Bilski; I could never cover them as well as he did, and I wouldn’t
even try.

Anyway, as a non-lawyer worried about the policy questions, I’m pretty
much only concerned about those forward-looking policy questions.
However, to briefly look back at how our community responded to this
Bilski situation over the last 18 months: it seems similar to what
happened
while the Eldred
case
was working its way to the Supreme Court. In the months
preceding both Eldred and Bilski, there seemed to be a mass hypnosis that
the Supreme Court would actually change copyright law (Eldred) or patent
law (Bilski) to make it better for freedom of computer users.

In both cases, that didn’t happen. There was admittedly less of that
giddy optimism before Bilski as there was before Eldred, but the ultimate
outcome for computer users is roughly no different in both cases: as we
were with Eldred, we’re left back with the same policy situation we had
before Bilski ever started making its way through the various courts. As
near as I can tell from what I’ve learned, the entire “Bilski
thing” appears to be a no-op. In short, as before, the Patent
Office sometimes can and will deny applications that it determines are
only abstract ideas, and the Supreme Court has now confirmed that the
Patent Office can reject such an application if the Patent Office knows
an abstract idea when it sees it. Nothing has changed regarding most
patents that are granted every day, including those that read on software.
Those of us that oppose software patents continue to believe that software
algorithms are indeed merely abstract ideas and pure mathematics and
shouldn’t be patentable subject matter. The governmental powers still
seem to disagree with us, or, at least, just won’t comment on that
question.

Looking forward, my largest concern, from a policy
perspective, is that the “patent reform” crowd,
who claim to be the allies of the anti-software-patent folks,
will use this decision to declare that the system works.
Bilski’s patent was ultimately denied, but on grounds that leave us no
closer to abolishing software patents. Patent reformists will
say: Well, invalid patents get denied, leaving space for the valid
ones. Those valid ones, they will say, do and should include
lots of patents that read on software. But only the really good
ideas should be patented, they will insist.

We must not yield to the patent reformists, particularly at a time like
this. (BTW, be sure to read RMS’ classic and still relevant essay,
Patent Reform Is Not Enough, if you haven’t already.)

Since Bilski has given us no new tools for abolishing software patents,
we must redouble efforts with the tools we already have to mitigate the
threat patents pose to software freedom. Here are a few suggestions,
which I think are all implementable by the average developer, that will
help keep up the fight against software patents, or at least mitigate
their impact:

License your software using the AGPLv3, GPLv3, LGPLv3, or Apache-2.0.
Among the copyleft licenses, AGPLv3 and GPLv3 offer the best patent
protections; LGPLv3 offers the best among the weak copyleft licenses;
the Apache License 2.0 offers the best patent protections among the
permissive licenses. These are the licenses we should gravitate toward,
particularly since multiple companies with software patents are
regularly attacking Free Software. At least when such companies
contribute code to projects under these licenses, we know those
particular codebases will be safe from that particular company’s
patents.

Demand real patent licenses from companies, not mere promises. Patent
promises are not enough[0]. The Free Software community deserves to
know it has real patent licenses from companies that hold patents. At
the very least, we should demand unilateral patent licenses for all
their patents, perpetually, for all possible copylefted code (i.e.,
companies should grant, ahead of time, the exact same license that the
community would get if the company had contributed to a yet-to-exist
GPLv3’d codebase)[1]. Note further that some companies that claim to be
part of the FLOSS community haven’t even given the
(inadequate-but-better-than-nothing) patent promises. For example,
BlackDuck holds a patent related to FLOSS, but despite saying they
would consider at least a patent promise, they have failed to do even
that minimal effort.

Support organizations/efforts that work to oppose and end software
patents. In particular, be sure that the efforts you support are not
merely “patent reform” efforts hidden behind anti-software-patent
rhetoric. Here are a few initiatives that I’ve recently seen doing work
regarding complete abolition of software patents. I suggest you support
them (with your time or dollars): End Software Patents (a project of
the FSF), FFII (European-specific), and APRIL (France-specific).

Write your legislators. This never hurts. In the
USA, it’s unlikely we can convince Congress to change patent law,
because there are just too many lobbying dollars from those big
patent-holding companies (e.g., the same ones that wrote
those nasty
amicus
briefs
in Bilski). But, writing your Senators and Congresspeople once a year
to remind them of your opposition to patents that read on software simply
can’t hurt, and may theoretically help a tiny bit. Now would be a good
time to do it, since you can mention how the Bilski decision convinced
you there’s a need for legislative abolition of software patents.
Meanwhile, remember, it’s even better if you show up at political
debates during election season and ask these candidates to oppose
software patents!

Explain to your colleagues why software patents should be
abolished, particularly if you work in computing. Software
patent abolition is actually a broad spectrum issue across the
computing industry. Only big and powerful companies benefit from
software patents. The little guy — even the little guy
proprietary developer — is hurt by software patents.
Even if you can’t convince your colleagues who write proprietary
software that they should switch to writing Free Software,
you can instead convince them that software patents
are bad for them personally and for their chances to succeed in
software. Share the film, Patent Absurdity, with them and then discuss
the issue with them after they’ve viewed it. Blog, tweet, dent, and the
like about the issue regularly.

(added 2010-07-01 on tmarble‘s
suggestion) Avoid products from pro-software-patent
companies. This is tough to do, and it’s why I didn’t call
for an all-out boycott. Most companies that make computers are
pro-software-patent, so it’s actually tough to buy a computer (or even
components for one) without buying from a pro-software-patent company.
However, avoiding the most aggressive patent aggressors is easy:
starting with avoiding Apple products is a good
first step (there are plenty of other reasons to avoid Apple anyway).
Microsoft would be next on the list, since they specifically use
software patents to attack FLOSS projects. Those are likely the big
two to avoid, but always remember that all large companies with
proprietary software products actively enforce patents, even if they
don’t file lawsuits. In other words, go with the little guy if you
can; it’s more likely to be a patent-free zone.

If you have a good idea, publish it and make sure the great
idea is well described in code comments and documentation, and that
everything is well archived by date. I put this one last on
my list, because it’s more of a help for the software patent
reformists than it is for the software patent abolitionists.
Nevertheless, sometimes, patents will get in the way of Free Software,
and it will be good if there is strong prior art showing that the idea
was already thought of, implemented, and put out into the world before
the patent was filed. But, fact is, the “valid” software patents with
no prior art are a bigger threat to software freedom. The stronger the
patent, the worse the threat, because it’s more likely to be
innovative, new technology that we want to implement in Free Software.

I sat and thought of what else I could add to this list that
individuals can do to help abolish software patents. I was sad that
these were the only six things that I could collect, but that’s all the
more reason to do these six things in earnest. The battle for software
freedom for all users is not one we’ll win in our lifetimes. It’s also
possible abolition of software patents will take a generation as well.
Those of us who seek this outcome must be prepared for patience and
lifelong, diligent work so that the right outcome happens, eventually.

[0] Update: I was asked for a longer write-up on software patent
licenses as compared to mere “promises”. Unfortunately, I don’t have
one, so the best I was able to offer was the interview I did on Linux
Outlaws, Episode 102, about Microsoft’s patent promise. I’ve also added
a TODO to write something up more completely on this particular issue.

[1] I am not leaving my permissively-license-preferring friends out of
this issue
without careful consideration. Specifically, I just don’t think it’s
practical or even fair to ask companies to license their patents for
all permissively-licensed code, since that would be the same as
licensing to everyone, including their proprietary software
competitors. An ahead-of-time perpetual license to practice the
teachings of all the company’s patents under AGPLv3 basically makes
sure that code that’s eternally Free Software will also eternally be
patent-licensed from that company, even if the company never
contributes to the AGPLv3’d codebase. Anyone trying to make
proprietary code that infringed the patent wouldn’t have benefit of
the license; only Free Software users, distributors and modifiers
would have the benefit. If a company supports copyleft generally,
then there is no legitimate reason for the company to refuse such a
broad license for copyleft distributions and deployments.

On the Brokenness of File Locking

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/locking.html

It’s amazing how far Linux has come without providing for proper file
locking that works and is usable from userspace. A little overview why file
locking is still in a very sad state:

To begin with, there’s a plethora of APIs, and all of them are awful:

  • POSIX File locking as available with fcntl(F_SETLK): the POSIX
    locking API is the most portable one and in theory works across NFS. It can do
    byte-range locking. So much on the good side. On the bad side there’s a lot
    more however: locks are bound to processes, not file descriptors. That means
    that this logic cannot be used in threaded environments unless combined with a
    process-local mutex. This is hard to get right, especially in libraries that do
    not know the environment they are run in, i.e. whether they are used in
    threaded environments or not. The worst part however is that POSIX locks are
    automatically released if a process calls close() on any (!) of
    its open file descriptors for that file. That means that when one part of a
    program locks a file and another by coincidence accesses it too for a short
    time, the first part’s lock will be broken and it won’t be notified about
    that (see the sketch after this list).
    Modern software tends to load big frameworks (such as Gtk+ or Qt) into memory
    as well as arbitrary modules via mechanisms such as NSS, PAM, gvfs,
    GTK_MODULES, Apache modules, GStreamer modules where one module seldom can
    control what another module in the same process does or accesses. The effect of
    this is that POSIX locks are unusable in any non-trivial program where it
    cannot be ensured that a file that is locked is never accessed by
    any other part of the process at the same time. Example: a user managing
    daemon wants to write /etc/passwd and locks the file for that. At
    the same time in another thread (or from a stack frame further down)
    something calls getpwuid() which internally accesses
    /etc/passwd and causes the lock to be released, the first thread
    (or stack frame) not knowing that. Furthermore should two threads use the
    locking fcntl()s on the same file they will interfere with each other’s locks
    and reset the locking ranges and flags of each other. On top of that locking
    cannot be used on any file that is publicly accessible (i.e. has the R bit set
    for groups/others, i.e. more access bits on than 0600), because that would
    otherwise effectively give arbitrary users a way to indefinitely block
    execution of any process (regardless of the UID it is running under) that wants
    to access and lock the file. This is generally not an acceptable security risk.
    Finally, while POSIX file locks are supposedly NFS-safe, in reality they
    often are not, as there are still many NFS implementations around where
    locking is not properly implemented, and NFS tends to be used in
    heterogeneous networks. The biggest
    problem about this is that there is no way to properly detect whether file
    locking works on a specific NFS mount (or any mount) or not.
  • The other API for POSIX file locks: lockf() is a second API for the
    same mechanism and suffers from the same problems. One wonders why there
    are two APIs for the same messed-up interface.
  • BSD locking based on flock(). The semantics of this kind of
    locking are much nicer than for POSIX locking: locks are bound to file
    descriptors, not processes. This kind of locking can hence be used safely
    between threads and can even be inherited across fork() and
    exec(). Locks are only automatically broken on the close()
    call for the one file descriptor they were created with (or the last duplicate
    of it). On the other hand this kind of locking does not offer byte-range
    locking and suffers from the same security problems as POSIX locking, and
    works in even fewer cases on NFS than POSIX locking (i.e. on BSD and
    Linux < 2.6.12 these calls were NOPs returning success). And since BSD
    locking is not as portable as POSIX locking this is sometimes an unsafe
    choice. Some OSes even find it funny to make flock() and fcntl(F_SETLK)
    control the same locks. Linux treats them independently — except for the
    cases where it doesn’t: on Linux NFS they are now transparently converted
    to POSIX locks, too. What a chaos!
  • Mandatory locking is available too. It’s based on the POSIX locking API but
    not portable in itself. It’s dangerous business and should generally be avoided
    in cleanly written software.
  • Traditional lock-file-based file locking. This is how things were done
    traditionally, based around known atomicity guarantees of certain basic file
    system operations. It’s a cumbersome thing, and requires polling of the file
    system to get notifications when a lock is released. Also, on Linux NFS <
    2.6.5 it doesn’t work properly, since O_EXCL isn’t atomic there. And of
    course the client cannot really know what the server is running, so again
    this brokenness is not detectable.

The Disappointing Summary

File locking on Linux is just broken. The broken semantics of POSIX locking
show that the designers of this API apparently never have tried to actually use
it in real software. It smells a lot like an interface that kernel people
thought makes sense but in reality doesn’t when you try to use it from
userspace.

Here’s a list of places where you shouldn’t use file locking due to the
problems shown above: If you want to lock a file in $HOME, forget about it as
$HOME might be NFS and locks generally are not reliable there. The same applies
to every other file system that might be shared across the network. If the file
you want to lock is accessible to more than your own user (i.e. an access mode
> 0700), forget about locking, it would allow others to block your
application indefinitely. If your program is non-trivial or threaded or uses a
framework such as Gtk+ or Qt or any of the module-based APIs such as NSS, PAM,
… forget about POSIX locking. If you care about portability, don’t use
file locking.

Or to turn this around: the only case where it is kind of safe to use file
locking is in trivial applications where portability is not key, using BSD
locking on a file system you can rely on being local, and on files
inaccessible to others. Of course, that doesn’t leave much, except for private
files in /tmp for trivial user applications.

Or in one sentence: in its current state Linux file locking is unusable.

And that is a shame.

Update: Check out the follow-up story on this topic.

A Guide Through The Linux Sound API Jungle

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/guide-to-sound-apis.html

At the Audio MC at the Linux Plumbers Conference one
thing became very clear: it is very difficult for programmers to
figure out which audio API to use for which purpose and which API not
to use when doing audio programming on Linux. So here’s my attempt to
guide you through this jungle:

What do you want to do?

I want to write a media-player-like application!
Use GStreamer! (Unless your focus is only KDE in which case Phonon might be an alternative.)

I want to add event sounds to my application!
Use libcanberra, install your sound files according to the XDG Sound Theming/Naming Specifications! (Unless your focus is only KDE in which case KNotify might be an alternative although it has a different focus.)

I want to do professional audio programming, hard-disk recording, music synthesizing, MIDI interfacing!
Use JACK and/or the full ALSA interface.

I want to do basic PCM audio playback/capturing!
Use the safe ALSA subset.

I want to add sound to my game!
Use the audio API of SDL for full-screen games, libcanberra for simple games with standard UIs such as Gtk+.

I want to write a mixer application!
Use the layer you want to support directly: if you want to support enhanced desktop software mixers, use the PulseAudio volume control APIs. If you want to support hardware mixers, use the ALSA mixer APIs.

I want to write audio software for the plumbing layer!
Use the full ALSA stack.

I want to write audio software for embedded applications!
For technical appliances the safe ALSA subset is usually a good choice; this however depends highly on your use case.

You want to know more about the different sound APIs?

GStreamer
GStreamer is the de-facto
standard media streaming system for Linux desktops. It supports decoding and
encoding of audio and video streams. You can use it for a wide range of
purposes from simple audio file playback to elaborate network
streaming setups. GStreamer supports a wide range of CODECs and audio
backends. GStreamer is not particularly suited for basic PCM playback
or low-latency/realtime applications. GStreamer is portable and not
limited in its use to Linux. Among the supported backends are ALSA, OSS, PulseAudio. [Programming Manuals and References]

libcanberra
libcanberra
is an abstract event sound API. It implements the XDG
Sound Theme and Naming Specifications
. libcanberra is a blessed
GNOME dependency, but itself has no dependency on GNOME/Gtk/GLib and can be
used with other desktop environments as well. In addition to an easy
interface for playing sound files, libcanberra provides caching
(which is very useful for networked thin clients) and allows passing
of various meta data to the underlying audio system which then can be
used to enhance user experience (such as positional event sounds) and
for improving accessibility. libcanberra supports multiple backends
and is portable beyond Linux. Among the supported backends are ALSA, OSS, PulseAudio, GStreamer. [API Reference]
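
Here’s a minimal sketch of triggering an event sound with libcanberra
(the event id comes from the XDG sound naming spec; error handling is
omitted and the sleep is just a crude stand-in for a real main loop):

/* Fire an event sound via libcanberra; build with -lcanberra. */
#include <canberra.h>
#include <unistd.h>

int main(void) {
        ca_context *c = NULL;
        ca_context_create(&c);

        ca_context_play(c, 0,
                        CA_PROP_EVENT_ID, "message-new-instant",
                        CA_PROP_EVENT_DESCRIPTION, "New message received",
                        NULL);

        sleep(1);               /* playback is asynchronous */
        ca_context_destroy(c);
        return 0;
}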

JACK

JACK is a sound system for
connecting professional audio production applications and hardware
output. It’s focus is low-latency and application interconnection. It
is not useful for normal desktop or embedded use. It is not an API
that is particularly useful if all you want to do is simple PCM
playback. JACK supports multiple backends, although ALSA is best
supported. JACK is portable beyond Linux. Among the supported backends are ALSA, OSS. [API Reference]

Full ALSA

ALSA is the Linux API
for doing PCM playback and recording. ALSA is very focused on
hardware devices, although other backends are supported as well (to a
limited degree, see below). ALSA as a name is used both for the Linux
audio kernel drivers and a user-space library that wraps these. ALSA — the library — is
comprehensive, and portable (to a limited degree). The full ALSA API
can appear very complex and is large. However it supports almost
everything modern sound hardware can provide. Some of the
functionality of the ALSA API is limited in its use to actual hardware
devices supported by the Linux kernel (in contrast to software sound
servers and sound drivers implemented in user-space such as those for
Bluetooth and FireWire audio — among others) and Linux specific
drivers. [API
Reference
]

Safe ALSA

Only a subset of the full ALSA API works on all backends ALSA
supports. It is highly recommended to stick to this safe subset
if you do ALSA programming to keep programs portable, future-proof and
compatible with sound servers, Bluetooth audio and FireWire audio. See
below for more details about which functions of ALSA are considered
safe. The safe ALSA API is a suitable abstraction for basic,
portable PCM playback and recording — not just for ALSA kernel driver
supported devices. Among the supported backends are ALSA kernel driver
devices, OSS, PulseAudio, JACK.
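
As a taste of what the safe subset looks like in practice, here’s a
minimal playback sketch using only the high-level
snd_pcm_open()/snd_pcm_set_params()/snd_pcm_writei() calls (error
handling omitted; build with -lasound):

/* Write one second of silence to the "default" ALSA device. */
#include <alsa/asoundlib.h>

int main(void) {
        static short buf[48000 * 2];   /* 1s of S16LE stereo silence */
        snd_pcm_t *pcm;

        snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
        snd_pcm_set_params(pcm,
                           SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           2,          /* channels */
                           48000,      /* sample rate */
                           1,          /* allow software resampling */
                           500000);    /* desired latency: 0.5s */

        snd_pcm_writei(pcm, buf, 48000);  /* count is in frames, not bytes */
        snd_pcm_drain(pcm);
        snd_pcm_close(pcm);
        return 0;
}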

Phonon and KNotify

Phonon is a high-level
abstraction for media streaming systems such as GStreamer, but goes a
bit further than that. It supports multiple backends. KNotify is a
system for “notifications”, which goes beyond mere event
sounds. However it does not support the XDG Sound Theming/Naming
Specifications at this point, and also doesn’t support caching or
passing of event meta-data to an underlying sound system. KNotify
supports multiple backends for audio playback via Phonon. Both APIs
are KDE/Qt specific and should not be used outside of KDE/Qt
applications. [Phonon API Reference] [KNotify API Reference]

SDL

SDL is a portable API
primarily used for full-screen game development. Among other stuff it
includes a portable audio interface. Among others SDL support OSS,
PulseAudio, ALSA as backends. [API Reference]

PulseAudio

PulseAudio is a sound system
for Linux desktops and embedded environments that runs in user-space
and (usually) on top of ALSA. PulseAudio supports network
transparency, per-application volumes, spatial events sounds, allows
switching of sound streams between devices on-the-fly, policy
decisions, and many other high-level operations. PulseAudio adds a glitch-free
audio playback model to the Linux audio stack. PulseAudio is not
useful in professional audio production environments. PulseAudio is
portable beyond Linux. PulseAudio has a native API and also supports
the safe subset of ALSA, in addition to limited,
LD_PRELOAD-based OSS compatibility. Among others PulseAudio supports
OSS and ALSA as backends and provides connectivity to JACK. [API
Reference
]

OSS

The Open Sound System is a
low-level PCM API supported by a variety of Unixes including Linux. It
started out as the standard Linux audio system and is supported on
current Linux kernels in the API version 3 as OSS3. OSS3 is considered
obsolete and has been fully replaced by ALSA. A successor to OSS3
called OSS4 is available but plays virtually no role on Linux and is
not supported in standard kernels or by any of the relevant
distributions. The OSS API is very low-level, based around direct
kernel interfacing using ioctl()s. It it is hence awkward to use and
can practically not be virtualized for usage on non-kernel audio
systems like sound servers (such as PulseAudio) or user-space sound
drivers (such as Bluetooth or FireWire audio). OSS3’s timing model
cannot properly be mapped to software sound servers at all, and is
also problematic on non-PCI hardware such as USB audio. Also, OSS does
not do sample type conversion, remapping or resampling if
necessary. This means that clients that properly want to support OSS
need to include a complete set of converters/remappers/resamplers for
the case when the hardware does not natively support the requested
sampling parameters. With modern sound cards it is very common to
support only S32LE samples at 48KHz and nothing else. If an OSS client
assumes it can always play back S16LE samples at 44.1KHz it will thus
fail. OSS3 is portable to other Unix-like systems, various differences
however apply. OSS also doesn’t support surround sound and other
functionality of modern sounds systems properly. OSS should be
considered obsolete and not be used in new applications. ALSA and
PulseAudio have limited LD_PRELOAD-based compatibility with OSS. [Programming Guide]

All sound systems and APIs listed above are supported in all
relevant current distributions. For libcanberra support the newest
development release of your distribution might be necessary.

All sound systems and APIs listed above are suitable for
development for commercial (read: closed source) applications, since
they are licensed under LGPL or more liberal licenses or no client
library is involved.

You want to know why and when you should use a specific sound API?

GStreamer

GStreamer is best used for very high-level needs: i.e. you want to
play an audio file or video stream and do not care about all the tiny
details down to the PCM or codec level.

libcanberra

libcanberra is best used when adding sound feedback to user input
in UIs. It can also be used to play simple sound files for
notification purposes.

JACK

JACK is best used in professional audio production and where interconnecting applications is required.

Full ALSA

The full ALSA interface is best used for software on “plumbing layer” or when you want to make use of very specific hardware features, which might be need for audio production purposes.

Safe ALSA

The safe ALSA interface is best used for software that wants to output/record basic PCM data from hardware devices or software sound systems.

Phonon and KNotify

Phonon and KNotify should only be used in KDE/Qt applications and only for high-level media playback, resp. simple audio notifications.

SDL

SDL is best used in full-screen games.

PulseAudio

For now, the PulseAudio API should be used only for applications
that want to expose sound-server-specific functionality (such as
mixers) or when a PCM output abstraction layer is already available in
your application and it thus makes sense to add an additional backend
to it for PulseAudio to keep the stack of audio layers minimal.

OSS

OSS should not be used for new programs.

You want to know more about the safe ALSA subset?

Here’s a list of DOS and DONTS in the ALSA API if you care about
that you application stays future-proof and works fine with
non-hardware backends or backends for user-space sound drivers such as
Bluetooth and FireWire audio. Some of these recommendations apply for
people using the full ALSA API as well, since some functionality
should be considered obsolete for all cases.

If your application’s code does not follow these rules, you must have
a very good reason for that. Otherwise your code should simply be considered
broken!

DONTS:

Do not use “async handlers”, e.g. via
snd_async_add_pcm_handler() and friends. Asynchronous
handlers are implemented using POSIX signals, which is a very
questionable use of them, especially from libraries and plugins. Even
when you don’t want to limit yourself to the safe ALSA subset
it is highly recommended not to use this functionality. Read
this for a longer explanation why signals for audio IO are
evil.

Do not parse the ALSA configuration file yourself or with
any of the ALSA functions such as snd_config_xxx(). If you
need to enumerate audio devices use snd_device_name_hint()
(and related functions). That
is the only API that also supports enumerating non-hardware audio
devices and audio devices with drivers implemented in userspace.

Do not parse any of the files from
/proc/asound/. Those files only include information about
kernel sound drivers — user-space plugins are not listed there. Also,
the set of kernel devices might differ from the way they are presented
in user-space. (i.e. sub-devices are mapped in different ways to
actual user-space devices such as surround51 an suchlike.

Do not rely on stable device indexes from ALSA. Nowadays
they depend on the initialization order of the drivers during boot-up
time and are thus not stable.

Do not use the snd_card_xxx() APIs. For
enumerating use snd_device_name_hint() (and related
functions). snd_card_xxx() is obsolete. It will only list
kernel hardware devices. User-space devices such as sound servers,
Bluetooth audio are not included. snd_card_load() is
completely obsolete in these days.

Do not hard-code device strings, especially not
hw:0 or plughw:0 or even dmix — these devices define no channel
mapping and are mapped to raw kernel devices. It is highly recommended
to use exclusively default as device string. If specific
channel mappings are required the correct device strings should be
front for stereo, surround40 for Surround 4.0,
surround41, surround51, and so on. Unfortunately at
this point ALSA does not define standard device names with channel
mappings for non-kernel devices. This means default may only
be used safely for mono and stereo streams. You should probably prefix
your device string with plug: to make sure ALSA transparently
reformats/remaps/resamples your PCM stream for you if the
hardware/backend does not support your sampling parameters
natively.

Do not assume that any particular sample type is supported
except the following ones: U8, S16_LE, S16_BE, S32_LE, S32_BE,
FLOAT_LE, FLOAT_BE, MU_LAW, A_LAW.

Do not use snd_pcm_avail_update() for
synchronization purposes. It should be used exclusively to query the
amount of bytes that may be written/read right now. Do not use
snd_pcm_delay() to query the fill level of your playback
buffer. It should be used exclusively for synchronisation
purposes. Make sure you fully understand the difference, and note that
the two functions return values that are not necessarily directly
connected!

Do not assume that the mixer controls always know dB information.

Do not assume that all devices support MMAP style buffer access.

Do not assume that the hardware pointer inside the (possibly mmaped) playback buffer is the actual position of the sample in the DAC. There might be an extra latency involved.

Do not try to recover with your own code from ALSA error conditions such as buffer under-runs. Use snd_pcm_recover() instead.

Do not touch buffering/period metrics unless you have
specific latency needs. Develop defensively, handling correctly the
case when the backend cannot fulfill your buffering metrics
requests. Be aware that the buffering metrics of the playback buffer
only indirectly influence the overall latency in many
cases. i.e. setting the buffer size to a fixed value might actually result in
practical latencies that are much higher.

Do not assume that snd_pcm_rewind() is available and works and to which degree.

Do not assume that the time when a PCM stream can receive
new data is strictly dependant on the sampling and buffering
parameters and the resulting average throughput. Always make sure to
supply new audio data to the device when it asks for it by signalling
“writability” on the fd. (And similarly for capturing)

Do not use the “simple” interface snd_spcm_xxx().

Do not use any of the functions marked as “obsolete”.

Do not use the timer, midi, rawmidi, hwdep subsystems.

DOS:

Use snd_device_name_hint() for enumerating audio devices.

Use snd_smixer_xx() instead of raw snd_ctl_xxx()

For synchronization purposes use snd_pcm_delay().

For checking buffer playback/capture fill level use snd_pcm_update_avail().

Use snd_pcm_recover() to recover from errors returned by any of the ALSA functions.

If possible use the largest buffer sizes the device supports to maximize power saving and drop-out safety. Use snd_pcm_rewind() if you need to react to user input quickly.

FAQ

What about ESD and NAS?

ESD and NAS are obsolete, both as API and as sound daemon. Do not develop for it any further.

ALSA isn’t portable!

That’s not true! Actually the user-space library is relatively portable, it even includes a backend for OSS sound devices. There is no real reason that would disallow using the ALSA libraries on other Unixes as well.

Portability is key to me! What can I do?

Unfortunately no truly portable (i.e. to Win32) PCM API is
available right now that I could truly recommend. The systems shown
above are more or less portable at least to Unix-like operating
systems. That does not mean however that there are suitable backends
for all of them available. If you care about portability to Win32 and
MacOS you probably have to find a solution outside of the
recommendations above, or contribute the necessary
backends/portability fixes. None of the systems (with the exception of
OSS) is truly bound to Linux or Unix-like kernels.

What about PortAudio?

I don’t think that PortAudio is very good API for Unix-like operating systems. I cannot recommend it, but it’s your choice.

Oh, why do you hate OSS4 so much?

I don’t hate anything or anyone. I just don’t think OSS4 is a
serious option, especially not on Linux. On Linux, it is also
completely redundant due to ALSA.

You idiot, you have no clue!

You are right, I totally don’t. But that doesn’t hinder me from recommending things. Ha!

Hey I wrote/know this tiny new project which is an awesome abstraction layer for audio/media!

Sorry, that’s not sufficient. I only list software here that is known to be sufficiently relevant and sufficiently well maintained.

Final Words

Of course these recommendations are very basic and are only intended to
lead into the right direction. For each use-case different necessities
apply and hence options that I did not consider here might become
viable. It’s up to you to decide how much of what I wrote here
actually applies to your application.

This summary only includes software systems that are considered
stable and universally available at the time of writing. In the
future I hope to introduce a more suitable and portable replacement
for the safe ALSA subset of functions. I plan to update this text
from time to time to keep things up-to-date.

If you feel that I forgot a use case or an important API, then
please contact me or leave a comment. However, I think the summary
above is sufficiently comprehensive and if an entry is missing I most
likely deliberately left it out.

(Also note that I am upstream for both PulseAudio and libcanberra and did some minor contributions to ALSA, GStreamer and some other of the systems listed above. Yes, I am biased.)

Oh, and please syndicate this, digg it. I’d like to see this guide to be well-known all around the Linux community. Thank you!

A Guide Through The Linux Sound API Jungle

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/guide-to-sound-apis.html

At the Audio MC at the Linux Plumbers Conference one
thing became very clear: it is very difficult for programmers to
figure out which audio API to use for which purpose and which API not
to use when doing audio programming on Linux. So here’s my try to
guide you through this jungle:

What do you want to do?

I want to write a media-player-like application!
Use GStreamer! (Unless your focus is only KDE in which cases Phonon might be an alternative.)
I want to add event sounds to my application!
Use libcanberra, install your sound files according to the XDG Sound Theming/Naming Specifications! (Unless your focus is only KDE in which case KNotify might be an alternative although it has a different focus.)
I want to do professional audio programming, hard-disk recording, music synthesizing, MIDI interfacing!
Use JACK and/or the full ALSA interface.
I want to do basic PCM audio playback/capturing!
Use the safe ALSA subset.
I want to add sound to my game!
Use the audio API of SDL for full-screen games, libcanberra for simple games with standard UIs such as Gtk+.
I want to write a mixer application!
Use the layer you want to support directly: if you want to support enhanced desktop software mixers, use the PulseAudio volume control APIs. If you want to support hardware mixers, use the ALSA mixer APIs.
I want to write audio software for the plumbing layer!
Use the full ALSA stack.
I want to write audio software for embedded applications!
For technical appliances usually the safe ALSA subset is a good choice, this however depends highly on your use-case.

You want to know more about the different sound APIs?

GStreamer
GStreamer is the de-facto
standard media streaming system for Linux desktops. It supports decoding and
encoding of audio and video streams. You can use it for a wide range of
purposes from simple audio file playback to elaborate network
streaming setups. GStreamer supports a wide range of CODECs and audio
backends. GStreamer is not particularly suited for basic PCM playback
or low-latency/realtime applications. GStreamer is portable and not
limited in its use to Linux. Among the supported backends are ALSA, OSS, PulseAudio. [Programming Manuals and References]
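
To give you an idea of the level of abstraction, here’s a rough, untested
sketch of file playback with GStreamer’s playbin element. The URI is a
placeholder and error checking is omitted:

#include <gst/gst.h>

int main(int argc, char *argv[]) {
    gst_init(&argc, &argv);

    /* playbin picks demuxer, decoder and audio sink automatically */
    GstElement *play = gst_element_factory_make("playbin", "player");
    if (!play)
        return 1;

    /* placeholder URI -- point this at a real file */
    g_object_set(play, "uri", "file:///tmp/example.ogg", NULL);
    gst_element_set_state(play, GST_STATE_PLAYING);

    /* block until playback fails or finishes */
    GstBus *bus = gst_element_get_bus(play);
    GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
            GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
    if (msg)
        gst_message_unref(msg);
    gst_object_unref(bus);

    gst_element_set_state(play, GST_STATE_NULL);
    gst_object_unref(play);
    return 0;
}
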
libcanberra
libcanberra
is an abstract event sound API. It implements the XDG
Sound Theme and Naming Specifications. libcanberra is a blessed
GNOME dependency, but itself has no dependency on GNOME/Gtk/GLib and can be
used with other desktop environments as well. In addition to an easy
interface for playing sound files, libcanberra provides caching
(which is very useful for networked thin clients) and allows passing
of various meta data to the underlying audio system which then can be
used to enhance user experience (such as positional event sounds) and
for improving accessibility. libcanberra supports multiple backends
and is portable beyond Linux. Among the supported backends are ALSA, OSS, PulseAudio, GStreamer. [API Reference]
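
In code, playing an event sound with libcanberra is a matter of a few
calls. A small, untested sketch (the event name comes from the XDG Sound
Naming Specification; error checking is omitted):

#include <canberra.h>
#include <unistd.h>

int main(void) {
    ca_context *c = NULL;

    if (ca_context_create(&c) < 0)
        return 1;

    /* the sound theme resolves "button-pressed" to an actual file;
     * the description doubles as accessibility metadata */
    ca_context_play(c, 0,
                    CA_PROP_EVENT_ID, "button-pressed",
                    CA_PROP_EVENT_DESCRIPTION, "Button has been pressed",
                    NULL);

    /* playback is asynchronous; a real application keeps its main
     * loop running instead of sleeping */
    sleep(1);

    ca_context_destroy(c);
    return 0;
}
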
JACK
JACK is a sound system for
connecting professional audio production applications and hardware
output. Its focus is low latency and application interconnection. It
is not useful for normal desktop or embedded use. It is not an API
that is particularly useful if all you want to do is simple PCM
playback. JACK supports multiple backends, although ALSA is best
supported. JACK is portable beyond Linux. Among the supported backends are ALSA, OSS. [API Reference]
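
For flavor, here’s a minimal, untested sketch of a JACK client that
simply copies its input port to its output port. It assumes a running
JACK server; error checking is omitted:

#include <jack/jack.h>
#include <string.h>
#include <unistd.h>

static jack_port_t *in_port, *out_port;

/* runs in the realtime thread: no locks, no malloc, no I/O here */
static int process(jack_nframes_t nframes, void *arg) {
    jack_default_audio_sample_t *in = jack_port_get_buffer(in_port, nframes);
    jack_default_audio_sample_t *out = jack_port_get_buffer(out_port, nframes);
    memcpy(out, in, sizeof(jack_default_audio_sample_t) * nframes);
    return 0;
}

int main(void) {
    jack_client_t *client = jack_client_open("passthrough", JackNullOption, NULL);
    if (!client)
        return 1;

    jack_set_process_callback(client, process, NULL);
    in_port = jack_port_register(client, "in", JACK_DEFAULT_AUDIO_TYPE,
                                 JackPortIsInput, 0);
    out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE,
                                  JackPortIsOutput, 0);

    jack_activate(client);
    sleep(30); /* a real client runs until told to quit */
    jack_client_close(client);
    return 0;
}
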
Full ALSA
ALSA is the Linux API
for doing PCM playback and recording. ALSA is very focused on
hardware devices, although other backends are supported as well (to a
limited degree, see below). ALSA as a name is used both for the Linux
audio kernel drivers and a user-space library that wraps these. ALSA — the library — is
comprehensive, and portable (to a limited degree). The full ALSA API
can appear very complex and is large. However it supports almost
everything modern sound hardware can provide. Some of the
functionality of the ALSA API is limited in its use to actual hardware
devices supported by the Linux kernel (in contrast to software sound
servers and sound drivers implemented in user-space such as those for
Bluetooth and FireWire audio — among others) and Linux-specific
drivers. [API Reference]
Safe ALSA
Only a subset of the full ALSA API works on all backends ALSA
supports. It is highly recommended to stick to this safe subset
if you do ALSA programming to keep programs portable, future-proof and
compatible with sound servers, Bluetooth audio and FireWire audio. See
below for more details about which functions of ALSA are considered
safe. The safe ALSA API is a suitable abstraction for basic,
portable PCM playback and recording — not just for ALSA kernel driver
supported devices. Among the supported backends are ALSA kernel driver
devices, OSS, PulseAudio, JACK.
Phonon and KNotify
Phonon is a high-level
abstraction for media streaming systems such as GStreamer, but goes a
bit further than that. It supports multiple backends. KNotify is a
system for “notifications”, which goes beyond mere event
sounds. However it does not support the XDG Sound Theming/Naming
Specifications at this point, and also doesn’t support caching or
passing of event meta-data to an underlying sound system. KNotify
supports multiple backends for audio playback via Phonon. Both APIs
are KDE/Qt specific and should not be used outside of KDE/Qt
applications. [Phonon API Reference] [KNotify API Reference]
SDL
SDL is a portable API
primarily used for full-screen game development. Among other stuff it
includes a portable audio interface. Among others, SDL supports OSS,
PulseAudio, ALSA as backends. [API Reference]
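
As a sketch of what SDL-style (1.2-era) audio output looks like: the
backend pulls data from a callback you register. Untested, error checking
abbreviated:

#include <SDL.h>
#include <string.h>

/* called by SDL whenever the backend needs more audio data;
 * a game would mix its next chunk here -- we just write silence */
static void fill_audio(void *userdata, Uint8 *stream, int len) {
    memset(stream, 0, len);
}

int main(int argc, char *argv[]) {
    SDL_AudioSpec want;

    if (SDL_Init(SDL_INIT_AUDIO) < 0)
        return 1;

    memset(&want, 0, sizeof(want));
    want.freq = 44100;
    want.format = AUDIO_S16SYS;
    want.channels = 2;
    want.samples = 1024;      /* callback granularity, in frames */
    want.callback = fill_audio;

    if (SDL_OpenAudio(&want, NULL) < 0)
        return 1;

    SDL_PauseAudio(0);        /* start the callback */
    SDL_Delay(2000);          /* "play" for two seconds */

    SDL_CloseAudio();
    SDL_Quit();
    return 0;
}
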
PulseAudio
PulseAudio is a sound system
for Linux desktops and embedded environments that runs in user-space
and (usually) on top of ALSA. PulseAudio supports network
transparency, per-application volumes, spatial event sounds, allows
switching of sound streams between devices on-the-fly, policy
decisions, and many other high-level operations. PulseAudio adds a glitch-free
audio playback model to the Linux audio stack. PulseAudio is not
useful in professional audio production environments. PulseAudio is
portable beyond Linux. PulseAudio has a native API and also supports
the safe subset of ALSA, in addition to limited,
LD_PRELOAD-based OSS compatibility. Among others PulseAudio supports
OSS and ALSA as backends and provides connectivity to JACK. [API Reference]
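
To give a taste of the native interface, here’s an untested sketch using
PulseAudio’s synchronous “simple” API to write raw PCM; the full
asynchronous API is what you’d use for anything non-trivial:

#include <pulse/simple.h>
#include <string.h>

int main(void) {
    static const pa_sample_spec ss = {
        .format = PA_SAMPLE_S16LE,
        .rate = 44100,
        .channels = 2
    };
    short buf[1024 * 2];
    int error;

    pa_simple *s = pa_simple_new(NULL, "example", PA_STREAM_PLAYBACK,
                                 NULL, "playback", &ss, NULL, NULL, &error);
    if (!s)
        return 1;

    memset(buf, 0, sizeof(buf)); /* a real program would fill in samples */
    for (int i = 0; i < 100; i++)
        if (pa_simple_write(s, buf, sizeof(buf), &error) < 0)
            break;

    pa_simple_drain(s, &error);
    pa_simple_free(s);
    return 0;
}
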
OSS
The Open Sound System is a
low-level PCM API supported by a variety of Unixes including Linux. It
started out as the standard Linux audio system and is supported on
current Linux kernels in the API version 3 as OSS3. OSS3 is considered
obsolete and has been fully replaced by ALSA. A successor to OSS3
called OSS4 is available but plays virtually no role on Linux and is
not supported in standard kernels or by any of the relevant
distributions. The OSS API is very low-level, based around direct
kernel interfacing using ioctl()s. It is hence awkward to use and
can practically not be virtualized for usage on non-kernel audio
systems like sound servers (such as PulseAudio) or user-space sound
drivers (such as Bluetooth or FireWire audio). OSS3’s timing model
cannot properly be mapped to software sound servers at all, and is
also problematic on non-PCI hardware such as USB audio. Also, OSS does
not do sample type conversion, remapping or resampling if
necessary. This means that clients that properly want to support OSS
need to include a complete set of converters/remappers/resamplers for
the case when the hardware does not natively support the requested
sampling parameters. With modern sound cards it is very common to
support only S32LE samples at 48 kHz and nothing else. If an OSS client
assumes it can always play back S16LE samples at 44.1 kHz it will thus
fail. OSS3 is portable to other Unix-like systems, various differences
however apply. OSS also doesn’t support surround sound and other
functionality of modern sound systems properly. OSS should be
considered obsolete and not be used in new applications.
ALSA and
PulseAudio have limited LD_PRELOAD-based compatibility with OSS. [Programming Guide]

All sound systems and APIs listed above are supported in all
relevant current distributions. For libcanberra support the newest
development release of your distribution might be necessary.

All sound systems and APIs listed above are suitable for
development for commercial (read: closed source) applications, since
they are licensed under LGPL or more liberal licenses or no client
library is involved.

You want to know why and when you should use a specific sound API?

GStreamer
GStreamer is best used for very high-level needs: i.e. you want to
play an audio file or video stream and do not care about all the tiny
details down to the PCM or codec level.
libcanberra
libcanberra is best used when adding sound feedback to user input
in UIs. It can also be used to play simple sound files for
notification purposes.
JACK
JACK is best used in professional audio production and where interconnecting applications is required.
Full ALSA
The full ALSA interface is best used for software on the “plumbing layer” or when you want to make use of very specific hardware features, which might be needed for audio production purposes.
Safe ALSA
The safe ALSA interface is best used for software that wants to output/record basic PCM data from hardware devices or software sound systems.
Phonon and KNotify
Phonon and KNotify should only be used in KDE/Qt applications, and only for high-level media playback and simple audio notifications, respectively.
SDL
SDL is best used in full-screen games.
PulseAudio
For now, the PulseAudio API should be used only for applications
that want to expose sound-server-specific functionality (such as
mixers) or when a PCM output abstraction layer is already available in
your application and it thus makes sense to add an additional backend
to it for PulseAudio to keep the stack of audio layers minimal.
OSS
OSS should not be used for new programs.

You want to know more about the safe ALSA subset?

Here’s a list of DOs and DON’Ts for the ALSA API if you care that
your application stays future-proof and works fine with
non-hardware backends or backends for user-space sound drivers such as
Bluetooth and FireWire audio. Some of these recommendations apply for
people using the full ALSA API as well, since some functionality
should be considered obsolete for all cases. (A short playback sketch
following the DOs list below illustrates the safe subset in practice.)

If your application’s code does not follow these rules, you must have
a very good reason for that. Otherwise your code should simply be considered
broken!

DON’Ts:

  • Do not use “async handlers”, e.g. via
    snd_async_add_pcm_handler() and friends. Asynchronous
    handlers are implemented using POSIX signals, which is a very
    questionable use of them, especially from libraries and plugins. Even
    when you don’t want to limit yourself to the safe ALSA subset
    it is highly recommended not to use this functionality. Read
    this for a longer explanation why signals for audio IO are
    evil.
  • Do not parse the ALSA configuration file yourself or with
    any of the ALSA functions such as snd_config_xxx(). If you
    need to enumerate audio devices use snd_device_name_hint()
    (and related functions). That
    is the only API that also supports enumerating non-hardware audio
    devices and audio devices with drivers implemented in userspace.
  • Do not parse any of the files from
    /proc/asound/. Those files only include information about
    kernel sound drivers — user-space plugins are not listed there. Also,
    the set of kernel devices might differ from the way they are presented
    in user-space (i.e. sub-devices are mapped in different ways to
    actual user-space devices such as surround51 and suchlike).
  • Do not rely on stable device indexes from ALSA. Nowadays
    they depend on the initialization order of the drivers during boot-up
    time and are thus not stable.
  • Do not use the snd_card_xxx() APIs. For
    enumerating use snd_device_name_hint() (and related
    functions). snd_card_xxx() is obsolete. It will only list
    kernel hardware devices. User-space devices such as sound servers,
    Bluetooth audio are not included. snd_card_load() is
    completely obsolete these days.
  • Do not hard-code device strings, especially not
    hw:0 or plughw:0 or even dmix — these devices define no channel
    mapping and are mapped to raw kernel devices. It is highly recommended
    to use exclusively default as device string. If specific
    channel mappings are required the correct device strings should be
    front for stereo, surround40 for Surround 4.0,
    surround41, surround51, and so on. Unfortunately at
    this point ALSA does not define standard device names with channel
    mappings for non-kernel devices. This means default may only
    be used safely for mono and stereo streams. You should probably prefix
    your device string with plug: to make sure ALSA transparently
    reformats/remaps/resamples your PCM stream for you if the
    hardware/backend does not support your sampling parameters
    natively.
  • Do not assume that any particular sample type is supported
    except the following ones: U8, S16_LE, S16_BE, S32_LE, S32_BE,
    FLOAT_LE, FLOAT_BE, MU_LAW, A_LAW.
  • Do not use snd_pcm_avail_update() for
    synchronization purposes. It should be used exclusively to query the
    number of bytes that may be written/read right now. Do not use
    snd_pcm_delay() to query the fill level of your playback
    buffer. It should be used exclusively for synchronisation
    purposes. Make sure you fully understand the difference, and note that
    the two functions return values that are not necessarily directly
    connected!
  • Do not assume that the mixer controls always know dB information.
  • Do not assume that all devices support MMAP style buffer access.
  • Do not assume that the hardware pointer inside the (possibly mmaped) playback buffer is the actual position of the sample in the DAC. There might be an extra latency involved.
  • Do not try to recover with your own code from ALSA error conditions such as buffer under-runs. Use snd_pcm_recover() instead.
  • Do not touch buffering/period metrics unless you have
    specific latency needs. Develop defensively, handling correctly the
    case when the backend cannot fulfill your buffering metrics
    requests. Be aware that the buffering metrics of the playback buffer
    only indirectly influence the overall latency in many
    cases. i.e. setting the buffer size to a fixed value might actually result in
    practical latencies that are much higher.
  • Do not assume that snd_pcm_rewind() is available and works and to which degree.
  • Do not assume that the time when a PCM stream can receive
    new data is strictly dependent on the sampling and buffering
    parameters and the resulting average throughput. Always make sure to
    supply new audio data to the device when it asks for it by signalling
    “writability” on the fd. (And similarly for capturing)
  • Do not use the “simple” interface snd_spcm_xxx().
  • Do not use any of the functions marked as “obsolete”.
  • Do not use the timer, midi, rawmidi, hwdep subsystems.

DOs:

  • Use snd_device_name_hint() for enumerating audio devices.
  • Use snd_mixer_xxx() instead of raw snd_ctl_xxx().
  • For synchronization purposes use snd_pcm_delay().
  • For checking the buffer playback/capture fill level use snd_pcm_avail_update().
  • Use snd_pcm_recover() to recover from errors returned by any of the ALSA functions.
  • If possible use the largest buffer sizes the device supports to maximize power saving and drop-out safety. Use snd_pcm_rewind() if you need to react to user input quickly.
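
To make this concrete, here’s a minimal, untested sketch of playback
restricted to the safe subset, using the front device string,
snd_pcm_set_params() for negotiation, snd_pcm_writei() for I/O and
snd_pcm_recover() for error handling. The 440 Hz tone is just for
illustration; link with -lasound -lm:

#include <alsa/asoundlib.h>
#include <math.h>

int main(void) {
    snd_pcm_t *pcm;
    short buf[1024 * 2]; /* 1024 interleaved stereo S16 frames */

    /* "front" has a defined stereo channel mapping; "default" would
     * be just as safe for mono/stereo streams */
    if (snd_pcm_open(&pcm, "front", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return 1;

    /* let ALSA convert/resample if the backend cannot do S16_LE at
     * 44100 Hz natively (soft_resample=1, 500 ms target latency) */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           2, 44100, 1, 500000) < 0)
        return 1;

    for (unsigned i = 0; i < 500; i++) {
        for (unsigned j = 0; j < 1024; j++) /* fill with a 440 Hz tone */
            buf[2*j] = buf[2*j+1] = (short)
                (0x4fff * sin(2 * M_PI * 440.0 * (i * 1024 + j) / 44100.0));

        snd_pcm_sframes_t n = snd_pcm_writei(pcm, buf, 1024);
        if (n < 0 && snd_pcm_recover(pcm, (int) n, 0) < 0)
            break; /* per the rules above: no home-grown underrun handling */
    }

    snd_pcm_drain(pcm);
    snd_pcm_close(pcm);
    return 0;
}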

FAQ

What about ESD and NAS?
ESD and NAS are obsolete, both as APIs and as sound daemons. Do not develop for them any further.
ALSA isn’t portable!
That’s not true! Actually the user-space library is relatively portable, it even includes a backend for OSS sound devices. There is no real reason that would disallow using the ALSA libraries on other Unixes as well.
Portability is key to me! What can I do?
Unfortunately no truly portable (i.e. to Win32) PCM API is
available right now that I could truly recommend. The systems shown
above are more or less portable at least to Unix-like operating
systems. That does not mean however that there are suitable backends
for all of them available. If you care about portability to Win32 and
MacOS you probably have to find a solution outside of the
recommendations above, or contribute the necessary
backends/portability fixes. None of the systems (with the exception of
OSS) is truly bound to Linux or Unix-like kernels.
What about PortAudio?
I don’t think that PortAudio is a very good API for Unix-like operating systems. I cannot recommend it, but it’s your choice.
Oh, why do you hate OSS4 so much?
I don’t hate anything or anyone. I just don’t think OSS4 is a
serious option, especially not on Linux. On Linux, it is also
completely redundant due to ALSA.
You idiot, you have no clue!
You are right, I totally don’t. But that doesn’t hinder me from recommending things. Ha!
Hey I wrote/know this tiny new project which is an awesome abstraction layer for audio/media!
Sorry, that’s not sufficient. I only list software here that is known to be sufficiently relevant and sufficiently well maintained.

Final Words

Of course these recommendations are very basic and are only intended to
point in the right direction. For each use-case different necessities
apply and hence options that I did not consider here might become
viable. It’s up to you to decide how much of what I wrote here
actually applies to your application.

This summary only includes software systems that are considered
stable and universally available at the time of writing. In the
future I hope to introduce a more suitable and portable replacement
for the safe ALSA subset of functions. I plan to update this text
from time to time to keep things up-to-date.

If you feel that I forgot a use case or an important API, then
please contact me or leave a comment. However, I think the summary
above is sufficiently comprehensive and if an entry is missing I most
likely deliberately left it out.

(Also note that I am upstream for both PulseAudio and libcanberra and did some minor contributions to ALSA, GStreamer and some other of the systems listed above. Yes, I am biased.)

Oh, and please syndicate this, digg it. I’d like this guide to be well-known all around the Linux community. Thank you!

More Xen Tricks

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2007/08/24/more-xen.html

In
my previous
post about Xen
, I talked about how easy Xen is to configure and
set up, particularly on Ubuntu and Debian. I’m still grateful that
Xen remains easy; however, I’ve lately had a few Xen-related
challenges that needed attention. In particular, I’ve needed to
create some surprisingly messy solutions when using vif-route to
route multiple IP numbers on the same network through the dom0 to a
domU.

I tend to use vif-route rather than vif-bridge, as I like the control
it gives me in the dom0. The dom0 becomes a very traditional
packet-forwarding firewall that can decide whether or not to forward
packets to each domU host. However, I recently found some deep
weirdness in IP routing when I use this approach while needing
multiple Ethernet interfaces on the domU. Here’s an example:

Multiple IP numbers for Apache

Suppose the domU host, called webserv, hosts a number of
websites, each with a different IP number, so that I have Apache
doing something like this[1]:

Listen 192.168.0.200:80
Listen 192.168.0.201:80
Listen 192.168.0.202:80

NameVirtualHost 192.168.0.200:80
<VirtualHost 192.168.0.200:80>

NameVirtualHost 192.168.0.201:80
<VirtualHost 192.168.0.201:80>

NameVirtualHost 192.168.0.202:80
<VirtualHost 192.168.0.202:80>

The Xen Configuration for the Interfaces

Since I’m serving all three of those sites from webserv, I
need all those IP numbers to be real, live IP numbers on the local
machine as far as the webserv is concerned. So, in
dom0:/etc/xen/webserv.cfg I list something like:

vif = [ 'mac=de:ad:be:ef:00:00, ip=192.168.0.200',
        'mac=de:ad:be:ef:00:01, ip=192.168.0.201',
        'mac=de:ad:be:ef:00:02, ip=192.168.0.202' ]

… And then make webserv:/etc/iftab look like:

eth0 mac de:ad:be:ef:00:00 arp 1
eth1 mac de:ad:be:ef:00:01 arp 1
eth2 mac de:ad:be:ef:00:02 arp 1

… And make webserv:/etc/network/interfaces (this is
probably Ubuntu/Debian-specific, BTW) look like:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.0.200
    netmask 255.255.255.0

auto eth1
iface eth1 inet static
    address 192.168.0.201
    netmask 255.255.255.0

auto eth2
iface eth2 inet static
    address 192.168.0.202
    netmask 255.255.255.0

Packet Forwarding from the Dom0

But, this doesn’t get me the whole way there. My next step is to make
sure that the dom0 is routing the packets properly to
webserv. Since my dom0 is heavily locked down, all
packets are dropped by default, so I have to let through explicitly
anything I’d like webserv to be able to process. So, I
add some code to my firewall script on the dom0 that looks like this[2]:

webIpAddresses="192.168.0.200 192.168.0.201 192.168.0.202"
UNPRIVPORTS="1024:65535"

for dport in 80 443;
do
  for sport in $UNPRIVPORTS 80 443 8080;
  do
    for ip in $webIpAddresses;
    do
      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --syn -m state --state NEW \
        --sport $sport --dport $dport -j ACCEPT

      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --sport $sport --dport $dport \
        -m state --state ESTABLISHED,RELATED -j ACCEPT

      /sbin/iptables -A FORWARD -o eth0 -s $ip \
        -p tcp --dport $sport --sport $dport \
        -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    done
  done
done

Phew! So at this point, I thought I was done. The packets should find
their way forwarded through the dom0 to the Apache instance running on
the domU, webserv. While that much was true, I now had
the additional problem that packets got lost in a bit of a black hole
on webserv. When I discovered the black hole, I quickly
realized why. It was somewhat atypical, from webserv’s
point of view, to have three “real” and different Ethernet
devices with three different IP numbers, which all talk to the exact
same network. More intelligent routing was
needed[3].

Routing in the domU

While most non-sysadmins still use the route command to
set up local IP routes on a GNU/Linux host, iproute2
(available via the ip command) has been a standard part
of GNU/Linux distributions and supported by Linux for nearly ten
years. To properly support the situation of multiple (from
webserv’s point of view, at least) physical interfaces on
the same network, some special iproute2 code is needed.
Specifically, I set up separate route tables for each device. I first
encoded their names in /etc/iproute2/rt_tables (the
numbers 16-18 are arbitrary, BTW):

16 eth0-200
17 eth1-201
18 eth2-202

And here are the ip commands that I thought would work
(but didn’t, as you’ll see next):

/sbin/ip route del default via 192.168.0.1

for table in eth0-200 eth1-201 eth2-202;
do
  iface=`echo $table | perl -pe 's/^(\S+)-.*$/$1/;'`
  ipEnding=`echo $table | perl -pe 's/^.*-(\S+)$/$1/;'`
  ip=192.168.0.$ipEnding
  /sbin/ip route add 192.168.0.0/24 dev $iface table $table

  /sbin/ip route add default via 192.168.0.1 table $table
  /sbin/ip rule add from $ip table $table
  /sbin/ip rule add to 0.0.0.0 dev $iface table $table
done

/sbin/ip route add default via 192.168.0.1

The idea is that each table will use rules to force all traffic coming
in on the given IP number and/or interface to always go back out on
the same, and vice versa. The key is these two lines:

/sbin/ip rule add from $ip table $table
/sbin/ip rule add to 0.0.0.0 dev $iface table $table

The first rule says that when traffic is coming from the given IP number,
$ip, the routing rules in table, $table should
be used. The second says that traffic to anywhere when bound for
interface, $iface should use table,
$table.

The tables themselves are set up to always make sure the local network
traffic goes through the proper associated interface, and that the
network router (in this case, 192.168.0.1) is always
used for foreign networks, but that it is reached via the correct
interface.

This is all well and good, but it doesn’t work. Certain instructions
fail with the message, RTNETLINK answers: Network is
unreachable, because the 192.168.0.0 network cannot be found
while the instructions are running. Perhaps there is an
elegant solution; I couldn’t find one. Instead, I temporarily set
up “dummy” global routes in the main route table and
deleted them once the table-specific ones were created. Here’s the
new bash script that does that (the added lines are marked with a
trailing “# added” comment):

/sbin/ip route del default via 192.168.0.1

for table in eth0-200 eth1-201 eth2-202;
do
  iface=`echo $table | perl -pe 's/^(\S+)-.*$/$1/;'`
  ipEnding=`echo $table | perl -pe 's/^.*-(\S+)$/$1/;'`
  ip=192.168.0.$ipEnding
  /sbin/ip route add 192.168.0.0/24 dev $iface table $table

  /sbin/ip route add 192.168.0.0/24 dev $iface src $ip      # added

  /sbin/ip route add default via 192.168.0.1 table $table
  /sbin/ip rule add from $ip table $table
  /sbin/ip rule add to 0.0.0.0 dev $iface table $table

  /sbin/ip route del 192.168.0.0/24 dev $iface src $ip      # added
done

/sbin/ip route add 192.168.0.0/24 dev eth0 src 192.168.0.200  # added
/sbin/ip route add default via 192.168.0.1
/sbin/ip route del 192.168.0.0/24 dev eth0 src 192.168.0.200  # added

I am pretty sure I’m missing something here — there must be a
better way to do this, but the above actually works, even if it’s
ugly.

Alas, Only Three

There was one additional confusion I put myself through while
implementing the solution. I was actually trying to route four
separate IP addresses into webserv, but discovered this
error message (found via dmesg on the
domU):
netfront can't alloc rx grant refs. A quick google
around led me to the
XenFaq, which says that Xen 3 cannot handle more than three network
interfaces per domU. Seems strangely arbitrary to me; I’d love
to hear why it cuts off at three. I can imagine limits at one and
two, but it seems that once you can do three, n should be
possible (perhaps still with linear slowdown or some such). I’ll
have to ask the Xen developers (or UTSL) some day to find out what
makes it possible to have three work but not four.

[1] Yes, I know I
could rely on client-provided Host: headers and do this with full
name-based virtual hosting, but I don’t
like to do that for good reason (as outlined in the Apache
docs).

[2] Note that the
above firewall code must run on dom0, which has one real
Ethernet device (its eth0) that is connected properly to
the wide 192.168.0.0/24 network, and should have some IP
number of its own there — say 192.168.0.100. And,
don’t forget that dom0 is configured for vif-route, not
vif-bridge. Finally, for brevity, I’ve left out some of the
firewall code that FORWARDs through key stuff like DNS. If you are
interested in it, email me or look it up in a firewall book.

[3] I was actually a
bit surprised at this, because I often have multiple IP numbers
serviced from the same computer and physical Ethernet interface.
However, in those cases, I use virtual interfaces
(eth0:0, eth0:1, etc.). On a normal system,
Linux does the work of properly routing the IP numbers when you attach
multiple IP numbers virtually to the same physical interface.
However, in Xen domUs, the physical interfaces are locked by Xen to
only permit specific IP numbers to come through, and while you can set
up all the virtual interfaces you want in the domU, it will only get
packets destined for the IP number specified in the vif
section of the configuration file. That’s why I added my three
different “actual” interfaces in the domU.

Remember the Verbosity (A Brief Note)

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2007/04/17/linux-verbose-build.html

I don’t remember when it happened, but sometime in the past four years,
the Makefiles for the kernel named Linux changed. I can’t remember
exactly, but I do recall sometime “recently” that the
kernel build output stopped looking like what I remember from 1991,
and started looking like this:

CC arch/i386/kernel/semaphore.o
CC arch/i386/kernel/signal.o

This is a heck of a lot easier to read, but there was something cool
about having make display the whole gcc
command lines, like this:

gcc -m32 -Wp,-MD,arch/i386/kernel/.semaphore.o.d -nostdinc -isystem /usr/lib/gcc/i486-linux-gnu/4.0.3/include -D__KERNEL__ -Iinclude -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -pipe -msoft-float -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium4 -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(semaphore)" -D"KBUILD_MODNAME=KBUILD_STR(semaphore)" -c -o arch/i386/kernel/semaphore.o arch/i386/kernel/semaphore.c
gcc -m32 -Wp,-MD,arch/i386/kernel/.signal.o.d -nostdinc -isystem /usr/lib/gcc/i486-linux-gnu/4.0.3/include -D__KERNEL__ -Iinclude -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -pipe -msoft-float -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium4 -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(signal)" -D"KBUILD_MODNAME=KBUILD_STR(signal)" -c -o arch/i386/kernel/signal.o arch/i386/kernel/signal.c

I never gave it much thought, since the new form was easier to read. I
figured that those folks who still eat kernel code for breakfast knew
about this change well ahead of time. Of course, they were the only
ones who needed to see the verbose output of the gcc
command lines. I could live with seeing the simpler CC
lines for my purposes, until today.

I was compiling kernel code and for the first time since this change in
the Makefiles, I was using a non-default gcc to build
Linux. I wanted to double-check that I’d given the right options to
make throughout the process. I therefore found myself
looking for a way to see the full output again (and for the first
time). It was easy enough to figure out: giving the variable setting
V=1 to make gives you the verbose version.
For you Debian folks like me, we’re using make-kpkg, so
the line we need looks like: MAKEFLAGS="V=1" make-kpkg
kernel_image.

It’s nice sometimes to pretend I’m compiling 0.99pl12 again and not
2.6.20.7. 🙂 No matter which options you give make, it is
still a whole lot easier to bootstrap Linux these days.