
foss.in Needs Your Funding!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/fossin2012-2.html

One of the most exciting conferences in the Free Software world, foss.in in Bangalore, India has trouble finding enough
sponsoring
for this year’s edition. Many speakers from
all around the Free Software world
(including yours truly) have signed up
to present at the event, and the conference would appreciate any corporate
funding they can get!

Please check if
your company can help
and contact the
organizers
for details!

See you in Bangalore!


systemd for Developers III

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/journal-submit.html

Here’s the third episode of my
systemd for Developers series.

Logging to the Journal

In a recent blog
story
intended for administrators I shed some light on how to use
the journalctl(1)
tool to browse and search the systemd journal. In this blog story for developers
I want to explain a little how to get log data into the systemd
Journal in the first place.

The good thing is that getting log data into the Journal is not
particularly hard, since there’s a good chance the Journal already
collects it anyway and writes it to disk. The journal collects:

  1. All data logged via libc syslog()
  2. The data from the kernel logged with printk()
  3. Everything written to STDOUT/STDERR of any system service

This covers pretty much all of the traditional log output of a
Linux system, including messages from the kernel initialization phase,
the initial RAM disk, the early boot logic, and the main system
runtime.

syslog()

Let’s have a quick look at how syslog() is used again. Let’s
write a journal message using this call:

#include <syslog.h>

int main(int argc, char *argv[]) {
        syslog(LOG_NOTICE, "Hello World!");
        return 0;
}

This is C code, of course. Many higher level languages provide APIs
that allow writing local syslog messages. Regardless of which language
you choose, all data written like this ends up in the Journal.
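For instance, Python’s standard library syslog module wraps the same libc call; a minimal sketch of the C example above:

```python
import syslog

# Equivalent of the C example above, via Python's standard
# library bindings for the libc syslog() call.
syslog.openlog("test-journal-submit")  # becomes SYSLOG_IDENTIFIER
syslog.syslog(syslog.LOG_NOTICE, "Hello World!")

# The priority constants mirror <syslog.h>: LOG_NOTICE is 5,
# which is the PRIORITY value we'll see in the journal below.
```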

Let’s have a look at how this appears after it has been written into
the journal (this is the JSON output journalctl -o json-pretty
generates):

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "_TRANSPORT" : "syslog",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "SYSLOG_FACILITY" : "1",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "_PID" : "3068",
        "SYSLOG_IDENTIFIER" : "test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351126905014938"
}

This nicely shows how the Journal implicitly augmented our little
log message with various meta data fields which describe in more
detail the context our message was generated from. For an explanation
of the various fields, please refer to systemd.journal-fields(7).

printf()

If you are writing code that is run as a systemd service, generating journal
messages is even easier:

#include <stdio.h>

int main(int argc, char *argv[]) {
        printf("Hello World\n");
        return 0;
}

Yupp, that’s easy, indeed.

The printed string in this example is logged at a default log
priority of LOG_INFO[1]. Sometimes it is useful to change
the log priority for such a printed string. When systemd parses
STDOUT/STDERR of a service it will look for priority values enclosed
in < > at the beginning of each line[2], following the scheme
used by the kernel’s printk() which in turn took
inspiration from the BSD syslog network serialization of messages. We
can make use of this systemd feature like this:

#include <stdio.h>

#define PREFIX_NOTICE "<5>"

int main(int argc, char *argv[]) {
        printf(PREFIX_NOTICE "Hello World\n");
        return 0;
}

Nice! Logging with nothing but printf(), and we still get
log priorities!

This scheme works with any programming language, including, of course, shell:

#!/bin/bash

echo "<5>Hello world"
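The same prefix can of course be generated programmatically; here’s a small Python sketch (the log_line helper is hypothetical, the constants come from the standard syslog module):

```python
import syslog

def log_line(priority: int, message: str) -> str:
    """Prefix a message with the kernel-style <N> priority marker
    that systemd parses on a service's STDOUT/STDERR."""
    return f"<{priority}>{message}"

# LOG_NOTICE is 5 and LOG_ERR is 3, matching the BSD syslog levels.
print(log_line(syslog.LOG_NOTICE, "Hello World"))    # <5>Hello World
print(log_line(syslog.LOG_ERR, "Something failed"))  # <3>Something failed
```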

Native Messages

Now, what I explained above is not particularly exciting: the
take-away is pretty much only that things end up in the journal if
they are output using the traditional message printing APIs. Yaaawn!

Let’s make this more interesting, let’s look at what the Journal
provides as native APIs for logging, and let’s see what its benefits
are. Let’s translate our little example into the 1:1 counterpart
using the Journal’s logging API sd_journal_print(3):

#include <systemd/sd-journal.h>

int main(int argc, char *argv[]) {
        sd_journal_print(LOG_NOTICE, "Hello World");
        return 0;
}

This doesn’t look much more interesting than the two examples
above, right? After compiling this with `pkg-config --cflags
--libs libsystemd-journal`
appended to the compiler parameters,
let’s have a closer look at the JSON representation of the journal
entry this generates:

 {
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World",
        "CODE_LINE" : "4",
        "_PID" : "3516",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351128226954170"
}

This looks pretty much the same, right? Almost! Three fields are new
compared to the earlier output: CODE_FUNC=, CODE_FILE= and
CODE_LINE=. Yes, you guessed it, by using
sd_journal_print() meta information about the generating
source code location is implicitly appended to each
message[3], which is helpful for a developer to identify
the source of a problem.

The primary reason for using the Journal’s native logging APIs is
not just the source code location, however: it is to allow
passing additional structured log messages from the program into the
journal. This additional log data may then be used to search the
journal, is available for consumption by other programs, and
might help the administrator to track down issues beyond what is
expressed in the human-readable message text. Here’s an example of how
to do that with sd_journal_send():

#include <systemd/sd-journal.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        sd_journal_send("MESSAGE=Hello World!",
                        "MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555",
                        "PRIORITY=5",
                        "HOME=%s", getenv("HOME"),
                        "TERM=%s", getenv("TERM"),
                        "PAGE_SIZE=%li", sysconf(_SC_PAGESIZE),
                        "N_CPUS=%li", sysconf(_SC_NPROCESSORS_ONLN),
                        NULL);
        return 0;
}

This will write a log message to the journal much like the earlier
examples. However, this time a few additional, structured fields are
attached:

{
        "__CURSOR" : "s=ac9e9c423355411d87bf0ba1a9b424e8;i=5930;b=5335e9cf5d954633bb99aefc0ec38c25;m=16544f875b;t=4ccd863cdc4f0;x=896defe53cc1a96a",
        "__REALTIME_TIMESTAMP" : "1351129666274544",
        "__MONOTONIC_TIMESTAMP" : "95903778651",
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_PID" : "4049",
        "CODE_LINE" : "6",
        "MESSAGE_ID" : "52fb62f99e2c49d89cfbf9d6de5e3555",
        "HOME" : "/home/lennart",
        "TERM" : "xterm-256color",
        "PAGE_SIZE" : "4096",
        "N_CPUS" : "4",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351129666241467"
}

Awesome! Our simple example worked! The five meta data fields we
attached to our message appeared in the journal. We used sd_journal_send()
for this which works much like sd_journal_print() but takes a
NULL terminated list of format strings each followed by its
arguments. The format strings must include the field name and an ‘=’
before the values.

Our little structured message included seven fields. The first three we passed are well-known fields:

  1. MESSAGE= is the actual human-readable message part of the structured message.
  2. PRIORITY= is the numeric message priority value as known from BSD syslog, formatted as an integer string.
  3. MESSAGE_ID= is a 128-bit ID that identifies our specific
    message call, formatted as a hexadecimal string. We randomly generated
    this string with journalctl --new-id128. This can be used by
    applications to track down all occurrences of this specific
    message. The 128-bit ID can be a UUID, but this is neither required nor enforced.

Applications may relatively freely define additional fields as they
see fit (we defined four pretty arbitrary ones in our example). A
complete list of the currently well-known fields is available in systemd.journal-fields(7).
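As an illustration of the naming convention from systemd.journal-fields(7): user-supplied field names are uppercase, and the underscore prefix is reserved for the trusted fields the journal adds itself. A rough Python sketch (the helper functions are hypothetical, the regular expression is an approximation of the documented rules):

```python
import re

# User fields: uppercase letters, digits and underscores, not
# starting with an underscore (roughly the documented convention).
USER_FIELD_RE = re.compile(r"^[A-Z][A-Z0-9_]*$")

def is_trusted_field(name: str) -> bool:
    """Trusted fields are implicitly added by the journal itself."""
    return name.startswith("_")

def is_valid_user_field(name: str) -> bool:
    return bool(USER_FIELD_RE.match(name))

assert is_trusted_field("_PID")
assert is_valid_user_field("MESSAGE_ID")
assert not is_valid_user_field("_UID")  # "_" prefix is reserved
```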

Let’s see how the message ID helps us find this message and all
its occurrences in the journal:

$ journalctl MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555
-- Logs begin at Thu, 2012-10-18 04:07:03 CEST, end at Thu, 2012-10-25 04:48:21 CEST. --
Oct 25 03:47:46 epsilon test-journal-se[4049]: Hello World!
Oct 25 04:40:36 epsilon test-journal-se[4480]: Hello World!

Seems I already invoked this example tool twice!

Many messages systemd itself generates have message IDs. This is
useful, for example, to find all occurrences where a program dumped
core (journalctl MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1), or
when a user logged in (journalctl
MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66). If your application
generates a message that might be interesting to recognize in the
journal stream later on, we recommend attaching such a message ID to
it. You can easily allocate a new one for your message with journalctl
--new-id128.
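journalctl --new-id128 simply prints a freshly generated random 128-bit ID. If you want the equivalent from code, something along these lines works (a sketch, not the official API; libsystemd itself offers sd_id128_randomize() for this):

```python
import uuid

# A random 128-bit ID rendered as 32 hex digits, comparable in
# shape to what journalctl --new-id128 prints.
message_id = uuid.uuid4().hex
print("MESSAGE_ID=" + message_id)

assert len(message_id) == 32
int(message_id, 16)  # raises if not valid hexadecimal
```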

This example shows how we can use the Journal’s native APIs to
generate structured, recognizable messages. You can do much more than
this with the C API. For example, you may store binary data in journal
fields as well, which is useful to attach coredumps or hard disk SMART
states to events where this applies. In order not to make this blog
story longer than it already is we’ll not go into detail about how to
do this, and I ask you to check out sd_journal_send(3)
for further information on this.

Python

The examples above focus on C. Structured logging to the Journal is
also available from other languages. Along with systemd itself we ship
bindings for Python. Here’s an example how to use this:

from systemd import journal
journal.send('Hello world')
journal.send('Hello, again, world', FIELD2='Greetings!', FIELD3='Guten tag')

Other bindings exist for Node.js,
PHP and Lua.

Portability

Generating structured data is a very useful feature for services to
make their logs more accessible both for administrators and other
programs. In addition to the implicit structure the Journal
adds to all logged messages it is highly beneficial if the various
components of our stack also provide explicit structure
in their messages, coming from within the processes themselves.

Porting an existing program to the Journal’s logging APIs comes
with one pitfall though: the Journal is Linux-only. If non-Linux
portability matters for your project it’s a good idea to provide an
alternative log output, and make it selectable at compile-time.

Regardless of which way to log you choose, in all cases we’ll forward
the message to a classic syslog daemon running side-by-side with the
Journal, if there is one. However, much of the structured meta data of
the message is not forwarded since the classic syslog protocol simply
has no generally accepted way to encode this and we shouldn’t attempt
to serialize meta data into classic syslog messages which might turn
/var/log/messages into an unreadable dump of machine
data. Anyway, to summarize: regardless of whether you log with
syslog(), printf(), sd_journal_print() or
sd_journal_send(), the message will be stored and indexed by
the journal and it will also be forwarded to classic syslog.

And that’s it for today. In a follow-up episode we’ll focus on
retrieving messages from the Journal using the C API, possibly
filtering for a specific subset of messages. Later on, I hope to give
a real-life example how to port an existing service to the Journal’s
logging APIs. Stay tuned!

Footnotes

[1] This can be changed with the SyslogLevel= service
setting. See systemd.exec(5)
for details.

[2] Interpretation of the < > prefixes of logged lines
may be disabled with the SyslogLevelPrefix= service setting. See systemd.exec(5)
for details.

[3] Appending the code location to the log messages can be
turned off at compile time by defining
-DSD_JOURNAL_SUPPRESS_LOCATION.

systemd for Administrators, Part XVIII

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/resources.html

Hot on the heels of the previous story, here’s now the eighteenth
installment of my ongoing series on systemd for Administrators:

Managing Resources

An important facet of modern computing is resource management: if
you run more than one program on a single machine you want to assign
the available resources to them, enforcing particular policies. This is
particularly crucial on smaller, embedded or mobile systems where the
scarce resources are the main constraint, but equally for large
installations such as cloud setups, where resources are plenty, but
the number of programs/services/containers on a single node is
drastically higher.

Traditionally, on Linux only one policy was really available: all
processes got about the same CPU time or IO bandwidth, modulated a bit
via the process nice value. This approach is very simple and
covered the various uses for Linux quite well for a long
time. However, it has drawbacks: not all processes deserve to be
treated equally, and services involving lots of processes (think:
Apache with a lot of CGI workers) would this way get more resources
than services with very few (think: syslog).

When thinking about service management for systemd, we quickly
realized that resource management must be core functionality of it. In
a modern world — regardless if server or embedded — controlling CPU,
Memory, and IO resources of the various services cannot be an
afterthought, but must be built-in as first-class service settings. And
it must be per-service and not per-process as the traditional nice
values or POSIX
Resource Limits
were.

In this story I want to shed some light on what you can do to
enforce resource policies on systemd services. Resource Management in
one way or another has been available in systemd for a while already,
so it’s really time we introduce this to the broader audience.

In an
earlier blog post
I highlighted the difference between Linux
Control Groups (cgroups) as a labelled, hierarchical grouping mechanism,
and Linux cgroups as a resource controlling subsystem. While systemd
requires the former, the latter is optional. And this optional latter
part is now what we can make use of to manage per-service
resources. (At this point, it’s probably a good idea to read up on
cgroups before reading on, to get at least a basic idea what they are
and what they accomplish. Even though the explanations below will be
pretty high-level, it all makes a lot more sense if you grok the
background a bit.)

The main Linux cgroup controllers for resource management are cpu,
memory
and blkio. To
make use of these, they need to be enabled in the kernel, which many
distributions (including Fedora) do. systemd exposes a couple of high-level service
settings to make use of these controllers without requiring too much
knowledge of the gory kernel details.

Managing CPU

As a nice default, if the cpu controller is enabled in the
kernel, systemd will create a cgroup for each service when starting
it. Without any further configuration this already has one nice
effect: on a systemd system every system service will get an even
amount of CPU, regardless of how many processes it consists of. Or in
other words: on your web server MySQL will get roughly the same amount
of CPU as Apache, even if the latter consists of a thousand CGI script
processes but the former only of a few worker tasks. (This behavior can
be turned off, see DefaultControllers=
in /etc/systemd/system.conf.)

On top of this default, it is possible to explicitly configure the
CPU shares a service gets with the CPUShares=
setting. The default value is 1024, if you increase this number you’ll
assign more CPU to a service than an unaltered one at 1024, if you decrease it, less.

Let’s see in more detail, how we can make use of this. Let’s say we
want to assign Apache 1500 CPU shares instead of the default of
1024. For that, let’s create a new administrator service file for
Apache in /etc/systemd/system/httpd.service, overriding the
vendor supplied one in /usr/lib/systemd/system/httpd.service,
but let’s change the CPUShares= parameter:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500

The first line will pull in the vendor service file. Now, let’s
reload systemd’s configuration and restart Apache so that the new
service file is taken into account:

systemctl daemon-reload
systemctl restart httpd.service

And yeah, that’s already it, you are done!

(Note that setting CPUShares= in a unit file will cause the
specific service to get its own cgroup in the cpu hierarchy,
even if cpu is not included in
DefaultControllers=.)

Analyzing Resource usage

Of course, changing resource assignments without actually
understanding the resource usage of the services in question is like
flying blind. To help you understand the resource usage of all
services, we created the tool systemd-cgtop,
that will enumerate all cgroups of the system, determine their
resource usage (CPU, Memory, and IO) and present them in a top-like fashion. Building
on the fact that systemd services are managed in cgroups this tool
hence can present to you for services what top shows you for
processes.

Unfortunately, by default systemd-cgtop will only be able to chart
CPU usage per-service for you; IO and memory are only tracked as
totals for the entire machine. The reason for this is simply that by default
there are no per-service cgroups in the blkio and
memory controller hierarchies but that’s what we need to
determine the resource usage. The best way to get this data for all
services is to simply add the memory and blkio
controllers to the aforementioned DefaultControllers= setting
in system.conf.
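For example, a sketch of what that could look like in /etc/systemd/system.conf (assuming systemd of this era, where DefaultControllers= lives in the [Manager] section and defaults to just cpu):

```ini
[Manager]
DefaultControllers=cpu memory blkio
```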

Managing Memory

To enforce limits on memory systemd provides the
MemoryLimit= and MemorySoftLimit= settings for
services, summing up the memory use of all their processes. These
settings take memory sizes in bytes as the total memory limit for the
service, and understand the usual K, M, G, T suffixes for
Kilobyte, Megabyte, Gigabyte, Terabyte (to the base of 1024).

.include /usr/lib/systemd/system/httpd.service

[Service]
MemoryLimit=1G

(Analogous to CPUShares= above, setting this option will cause
the service to get its own cgroup in the memory cgroup
hierarchy.)
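The suffix arithmetic is plain base-1024; here’s a hypothetical Python sketch of how such a value is interpreted (parse_size is not a systemd API, just an illustration):

```python
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(value: str) -> int:
    """Parse a memory size like '1G' using the base-1024 K/M/G/T
    suffixes described above; a plain number is taken as bytes."""
    if value and value[-1] in _SUFFIXES:
        return int(value[:-1]) * _SUFFIXES[value[-1]]
    return int(value)

print(parse_size("1G"))  # 1073741824
```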

Managing Block IO

To control block IO multiple settings are available. First of all
BlockIOWeight= may be used which assigns an IO weight
to a specific service. In behaviour the weight concept is not
unlike the shares concept of CPU resource control (see
above). However, the default weight is 1000, and the valid range is
from 10 to 1000:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500

Optionally, per-device weights can be specified:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750

Instead of specifying an actual device node you may also specify any
path in the file system:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/home/lennart 750

If the specified path does not refer to a device node systemd will
determine the block device /home/lennart is on, and assign
the bandwidth weight to it.

You can even add per-device and normal lines at the same time,
which will set the per-device weight for the device, and the other
value as default for everything else.
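A sketch of such a combined configuration, reusing the example device from above:

```ini
.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750
```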

Alternatively one may control explicit bandwidth limits with the
BlockIOReadBandwidth= and BlockIOWriteBandwidth=
settings. These settings take a pair of device node and bandwidth rate
(in bytes per second) or of a file path and bandwidth rate:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOReadBandwidth=/var/log 5M

This sets the maximum read bandwidth on the block device backing
/var/log to 5 MB/s.

(Analogous to CPUShares= and MemoryLimit=, using
any of these three settings will result in the service getting its own
cgroup in the blkio hierarchy.)
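Since all of these are ordinary [Service] settings they can of course be combined in a single override file; a sketch pulling together the examples above:

```ini
.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500
MemoryLimit=1G
BlockIOWeight=500
```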

Managing Other Resource Parameters

The options described above cover only a small subset of the
available controls the various Linux control group controllers
expose. We picked these and added high-level options for them since we
assumed that these are the most relevant for most folks, and that they
really needed a nice interface that can handle units properly and
resolve block device names.

In many cases the options explained above might not be sufficient
for your use case, but a low-level kernel cgroup setting might help. It
is easy to make use of these options from systemd unit files, without
having them covered with a high-level setting. For example, sometimes
it might be useful to set the swappiness of a service. The
kernel makes this controllable via the memory.swappiness
cgroup attribute, but systemd does not expose it as a high-level
option. Here’s how you use it nonetheless, using the low-level
ControlGroupAttribute= setting:

.include /usr/lib/systemd/system/httpd.service

[Service]
ControlGroupAttribute=memory.swappiness 70

(Analogous to the other cases, this too causes the service to be
added to the memory hierarchy.)

Later on we might add more high-level controls for the
various cgroup attributes. In fact, please ping us if you frequently
use one and believe it deserves more focus. We’ll consider adding a
high-level option for it then. (Even better: send us a patch!)

Disclaimer: note that making use of the various resource
controllers does have a runtime impact on the system. Enforcing
resource limits comes at a price. If you do use them, certain
operations do get slower. Especially the memory controller
has (used to have?) a bad reputation for coming at a performance
cost.

For more details on all of this, please have a look at the
documentation of the mentioned unit settings, and of the
cpu, memory and blkio controllers.

And that’s it for now. Of course, this blog story only focussed on
the per-service resource settings. On top of this, you can also
set the more traditional, well-known per-process resource
settings, which will then be inherited by the various subprocesses,
but always only be enforced per-process. More specifically that’s
IOSchedulingClass=, IOSchedulingPriority=,
CPUSchedulingPolicy=, CPUSchedulingPriority=,
CPUAffinity=, LimitCPU= and related. These do not
make use of cgroup controllers and have a much lower performance
cost. We might cover those in a later article in more detail.

systemd for Administrators, Part XVII

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/journalctl.html

It’s that time again, here’s now the seventeenth installment of my
ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some
functionality of the journal, and how it is exposed in
systemctl. In this episode I want to explain a few more uses
of the journal, and how you can make it work for you.

If you are wondering what the journal is, here’s an explanation in
a few words to get you up to speed: the journal is a component of systemd,
that captures Syslog messages, Kernel log messages, initial RAM disk
and early boot messages as well as messages written to STDOUT/STDERR
of all services, indexes them and makes this available to the user. It
can be used in parallel, or in place of a traditional syslog daemon,
such as rsyslog or syslog-ng. For more information, see the initial
announcement.

The journal has been part of Fedora since F17. With Fedora 18 it
now has grown into a reliable, powerful tool to handle your logs. Note
however, that on F17 and F18 the journal is configured by default to
store logs only in a small ring-buffer in /run/log/journal,
i.e. not persistent. This of course limits its usefulness quite
drastically but is sufficient to show a bit of recent log history in
systemctl status. For Fedora 19, we plan to change this, and
enable persistent logging by default. Then, journal files will be
stored in /var/log/journal and can grow much larger, thus
making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald’s persistent storage manually:

# mkdir -p /var/log/journal

After that, it’s a good idea to reboot, to get some useful
structured data into your journal to play with. Oh, and since you have
the journal now, you don’t need syslog anymore (unless having
/var/log/messages as a text file is a necessity for you), so
you can choose to uninstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features
of systemd 195 as it will be included in Fedora 18[1], so
if your F17 can’t do the tricks you see, please wait for F18. First,
let’s start with some basics. To access the logs of the journal use
the journalctl(1)
tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the
system, from system components as well as from logged-in
users. The output you will get looks like a pixel-perfect copy of the
traditional /var/log/messages format, but actually has a
couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we’ll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of
the output this generates, I cut that out for brevity — and to give
you a reason to try it out yourself with a current image for F18’s
development version with systemd 195. But I do hope you get the idea
anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring root
privileges sucks of course; even administrators tend to do most of
their work as unprivileged users these days. By default, Journal users can
only watch their own logs, unless they are root or in the adm
group. To make watching system logs more fun, let’s add ourselves to
adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access
to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current
log database. Sometimes one needs to watch logs as they grow, where
one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you
the last ten log lines and then wait for changes and show them as
they take place.

Basic Filtering

When invoking journalctl without parameters you’ll see the
whole set of logs, beginning with the oldest message stored. That, of
course, can be a lot of data. Much more useful is just viewing the
logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the
aforementioned gimmicks. But sometimes even this is way too
much data to process. So what about just listing all the real issues
to care about: all messages of priority levels ERROR and worse, from
the current boot:

$ journalctl -b -p err

If you reboot only seldom, the -b option makes little sense;
filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in
the morning until right now. Awesome! Of course, we can combine this with
-p err or a similar match. But humm, we are looking for
something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I
noticed that some CGI script in Apache was acting up earlier today,
let’s see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn’t there an issue with
that disk /dev/sdc? Let’s figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let’s quickly replace the
disk before we lose data. Done! Next! — Hmm, didn’t I see that the vpnc binary made a booboo? Let’s
check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don’t get this, this seems to be some weird interaction with
dhclient, let’s see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let’s turn this up a
notch. Internally systemd stores each log entry with a set of
implicit meta data. This meta data looks a lot like an
environment block, but actually is a bit more powerful: fields can
take large, binary values (though this is the exception, and usually
they just contain UTF-8), and fields can have multiple values assigned
(an exception too, usually they only have one value). This implicit
meta data is collected for each and every log message, without user
intervention. The data will be there, and wait to be used by
you. Let’s see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don’t want to make this story
overly long. -n without parameter shows you the last 10 log
entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose
output. Instead of showing a pixel-perfect copy of classic
/var/log/messages that only includes a minimal subset of
what is available we now see all the gory details the journal has
about each entry. But it’s highly interesting: there is user credential
information, SELinux bits, machine information and more. For a full
list of common, well-known fields, see the
man page
.

Now, as it turns out the journal database is indexed by all
of these fields, out-of-the-box! Let’s try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux
user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical
OR combination of the matches. All entries matching either will be
shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field
names, they will be combined with a logical AND. All entries matching
both will be shown, i.e. all messages from processes named
avahi-daemon on host epsilon.

But of course, that’s
not fancy enough for us. We are computer nerds after all, we live off
logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when
you match the same field twice. The line above hence means: show me
everything from host theta with UID 70, or of host
epsilon with a process name of avahi-daemon.

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can
remember all those values a field can take in the journal, I mean,
seriously, who has thaaaat kind of photographic memory? Well, the
journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the
database, or in other words: the names of all systemd services which
ever logged into the journal. This makes it super-easy to build nice
matches. But wait, turns out this all is actually hooked up with shell
completion on bash! This gets even more awesome: as you type your
match expression you will get a list of well-known field names, and of
the values they can take! Let’s figure out how to filter for SELinux
labels again. We remember the field name was something with SELINUX in
it, let’s try that:

$ journalctl _SE<TAB>

And yup, it’s immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what’s the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit’s security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn’t know anything related to SELinux could
be thaaat easy! 😉 Of course this kind of completion works with any
field, not just SELinux labels.

So much for now. There’s a lot more cool stuff in journalctl(1)
than this. For example, it generates JSON output for you! You can match
against kernel fields! You can get simple
/var/log/messages-like output but with relative timestamps!
And so much more!

Anyway, in the next weeks I hope to post more stories about all the
cool things the journal can do for you. This is just the beginning,
stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi
but hopefully will get into F18 proper soon, and definitely before the
release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in
the kernel yet, but on its way due to Hannes’
fantastic work
, and I hope it will make an appearance in
F18.

systemd for Administrators, Part XVI

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/serial-console.html

And, yes, here’s now the sixteenth installment of my ongoing series on systemd for Administrators:

Gettys on Serial Consoles (and Elsewhere)

TL;DR: To make use of a serial console, just use
console=ttyS0 on the kernel command line, and systemd will
automatically start a getty on it for you.

While physical RS232 serial ports
have become exotic in today’s PCs they play an important role in
modern servers and embedded hardware. They provide a relatively robust
and minimalistic way to access the console of your device, that works
even when the network is hosed, or the primary UI is unresponsive. VMs
frequently emulate a serial port as well.

Of course, Linux has always had good support for serial consoles,
but with systemd we
tried to make serial console support even simpler to use. In the
following text I’ll try to give an overview how serial console gettys on
systemd work, and how TTYs of any kind are handled.

Let’s start with the key take-away: in most cases, to get a login
prompt on your serial console you don’t need to do anything. systemd
checks the kernel configuration for the selected kernel console and
will simply spawn a serial getty on it. That way it is entirely
sufficient to configure your kernel console properly (for example, by
adding console=ttyS0 to the kernel command line) and that’s
it. But let’s have a look at the details:

In systemd, two template units are responsible for bringing up a
login prompt on text consoles:

  1. [email protected] is responsible for virtual
    terminal
    (VT) login prompts, i.e. those on your VGA screen as
    exposed in /dev/tty1 and similar devices.
  2. [email protected] is responsible for all other
    terminals, including serial ports such as /dev/ttyS0. It
    differs in a couple of ways from [email protected]: among other
    things the $TERM environment variable is set to
    vt102 (hopefully a good default for most serial terminals)
    rather than linux (which is the right choice for VTs only),
    and a special logic that clears the VT scrollback buffer (which only
    works on VTs) is skipped.
Virtual Terminals

Let’s have a closer look how [email protected] is started,
i.e. how login prompts on the virtual terminal (i.e. non-serial TTYs)
work. Traditionally, the init system on Linux machines was configured
to spawn a fixed number of login prompts at boot. In most cases six
instances of the getty program were spawned, on the first six VTs,
tty1 to tty6.

In a systemd world we made this more dynamic: in order to make
things more efficient login prompts are now started on demand only. As
you switch to the VTs the getty service is instantiated to
[email protected], [email protected] and so
on. Since we don’t have to unconditionally start the getty processes
anymore this allows us to save a bit of resources, and makes start-up
a bit faster. This behaviour is mostly transparent to the user: if the
user activates a VT the getty is started right-away, so that the user
will hardly notice that it wasn’t running all the time. If he then
logs in and types ps he’ll notice, however, that getty
instances are only running for the VTs he has switched to so far.

By default this automatic spawning is done for the VTs up to VT6
only (in order to be close to the traditional default configuration of
Linux systems)[1]. Note that the auto-spawning of gettys
is only attempted if no other subsystem took possession of the VTs
yet. More specifically, if a user makes frequent use of fast user
switching
via GNOME he’ll get his X sessions on the first six VTs,
too, since the lowest available VT is allocated for each session.

Two VTs are handled specially by the auto-spawning logic: firstly
tty1 gets special treatment: if we boot into graphical mode
the display manager takes possession of this VT. If we boot into
multi-user (text) mode a getty is started on it — unconditionally,
without any on-demand logic[2].

Secondly, tty6 is
especially reserved for auto-spawned gettys and unavailable to other
subsystems such as X[3]. This is done in order to ensure
that there’s always a way to get a text login, even if due to
fast user switching X took possession of more than 5 VTs.
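Both aspects of this auto-spawning behaviour are adjustable via logind.conf, as the footnotes below explain. Purely as an illustration (assuming the stock [Login] section of that file), a fragment restating the defaults described above would look like this:

```ini
# /etc/systemd/logind.conf — restating the defaults described in the text
[Login]
NAutoVTs=6
ReserveVT=6
```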

Serial Terminals

Handling of login prompts on serial terminals (and all other kinds
of non-VT terminals) is different from that of VTs. By default systemd
will instantiate one [email protected] on the main
kernel[4] console, if it is not a virtual terminal. The
kernel console is where the kernel outputs its own log messages and is
usually configured on the kernel command line in the boot loader via
an argument such as console=ttyS0[5]. This logic ensures that
when the user asks the kernel to redirect its output onto a certain
serial terminal, he will automatically also get a login prompt on it
as the boot completes[6]. systemd will also spawn a login
prompt on the first special VM console (that’s /dev/hvc0,
/dev/xvc0, /dev/hvsi0), if the system is run in a VM
that provides these devices. This logic is implemented in a generator
called systemd-getty-generator
that is run early at boot and pulls in the necessary services
depending on the execution environment.

In many cases, this automatic logic should already suffice to get
you a login prompt when you need one, without any specific
configuration of systemd. However, sometimes there’s the need to
manually configure a serial getty, for example, if more than one
serial login prompt is needed or the kernel console should be
redirected to a different terminal than the login prompt. To
facilitate this it is sufficient to instantiate
[email protected] once for each serial port you want it
to run on[7]:

# systemctl enable [email protected]
# systemctl start [email protected]

And that’s it. This will make sure you get the login prompt on the
chosen port on all subsequent boots, and starts it right-away
too.

Sometimes, there’s the need to configure the login prompt in even
more detail. For example, if the default baud rate configured by the
kernel is not correct or other agetty parameters need to
be changed. In such a case simply copy the default unit template to
/etc/systemd/system and edit it there:

# cp /usr/lib/systemd/system/[email protected] /etc/systemd/system/[email protected]
# vi /etc/systemd/system/[email protected]
 .... now make your changes to the agetty command line ...
# ln -s /etc/systemd/system/[email protected] /etc/systemd/system/getty.target.wants/
# systemctl daemon-reload
# systemctl start [email protected]

This creates a unit file that is specific to serial port
ttyS2, so that you can make specific changes to this port and
this port only.

And this is pretty much all there is to say about serial ports, VTs
and login prompts on them. I hope this was interesting, and please
come back soon for the next installment of this series!

Footnotes

[1] You can easily modify this by changing
NAutoVTs= in logind.conf.

[2] Note that whether the getty on VT1 is started on-demand
or not hardly makes a difference, since VT1 is the default active VT
anyway, so the demand is there anyway at boot.

[3] You can easily change this special reserved VT by
modifying ReserveVT= in logind.conf.

[4] If multiple kernel consoles are used simultaneously, the
main console is the one listed first in
/sys/class/tty/console/active, which is the last one
listed on the kernel command line.

[5] See kernel-parameters.txt
for more information on this kernel command line
option.

[6] Note that agetty -s is used here so that the
baud rate configured at the kernel command line is not altered and
continues to be used by the login prompt.

[7] Note that this systemctl enable syntax only
works with systemd 188 and newer (i.e. F18). On older versions use
ln -s /usr/lib/systemd/system/[email protected]
/etc/systemd/system/getty.target.wants/[email protected] ; systemctl
daemon-reload
instead.

Berlin Open Source Meetup

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/berlin-open-source-meetup.html

Prater

Chris Kühl and I are organizing a Berlin
Open Source Meetup
on Aug 19th at the Prater Biergarten in Prenzlauer Berg.
If you live in Berlin (or are passing by) and are involved in or interested in
Open Source then you are invited!

There’s also a Google+ event for the meetup.

It’s a public event, so everybody is welcome, and please feel free to invite others!

See you at the Prater!

systemd for Administrators, Part XV

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/watchdog.html

Quickly following the previous iteration, here’s now the fifteenth installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd:
the embedded/mobile folks, the desktop people and the server
folks. While the systems used by embedded/mobile tend to be
underpowered and have few resources available, desktops tend to be
much more powerful machines — but still much less resourceful than
servers. Nonetheless there are surprisingly many features that matter
to both extremes of this axis (embedded and servers), but not the
center (desktops). One of them is support for watchdogs in
hardware and software.

Embedded devices frequently rely on watchdog hardware that resets
the device automatically if software stops responding (more specifically,
stops signalling the hardware in fixed intervals that it is still
alive). This is required to increase reliability and to make sure that,
regardless of what happens, an attempt is made to get the system
working again. Functionality like this makes little sense on the
desktop[1]. However, on
high-availability servers watchdogs are frequently used, again.

Starting with version 183 systemd provides full support for
hardware watchdogs (as exposed in /dev/watchdog to
userspace), as well as supervisor (software) watchdog support for
individual system services. The basic idea is the following: if enabled,
systemd will regularly ping the watchdog hardware. If systemd or the
kernel hang this ping will not happen anymore and the hardware will
automatically reset the system. This way systemd and the kernel are
protected from boundless hangs — by the hardware. To make the chain
complete, systemd then exposes a software watchdog interface for
individual services so that they can also be restarted (or some other
action taken) if they begin to hang. This software watchdog logic can
be configured individually for each service, in both the ping frequency and
the action to take. Putting both parts together (i.e. hardware
watchdogs supervising systemd and the kernel, as well as systemd
supervising all other services) we have a reliable way to watchdog
every single component of the system.

To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in
/etc/systemd/system.conf. It defaults to 0 (i.e. no hardware
watchdog use). Set it to a value like 20s and the watchdog is
enabled. After 20s of no keep-alive pings the hardware will reset
itself. Note that systemd will send a ping to the hardware at half the
specified interval, i.e. every 10s. And that’s already all there is to
it. By enabling this single, simple option you have turned on
supervision by the hardware of systemd and the kernel beneath
it.[2]

Note that the hardware watchdog device (/dev/watchdog) is
single-user only. That means that you can either enable this
functionality in systemd, or use a separate external watchdog daemon,
such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be
configured in /etc/systemd/system.conf. It controls the
watchdog interval to use during reboots. It defaults to 10min, and
adds extra reliability to the system reboot logic: if a clean reboot
is not possible and shutdown hangs, we rely on the watchdog hardware
to reset the system abruptly, as extra safety net.
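Put together, the two options discussed above amount to just a few lines in /etc/systemd/system.conf. A sketch of the relevant fragment, using the example value from above for the runtime watchdog and the default for the shutdown watchdog:

```ini
# /etc/systemd/system.conf
[Manager]
RuntimeWatchdogSec=20s
ShutdownWatchdogSec=10min
```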

So much about the hardware watchdog logic. These two options are
really everything that is necessary to make use of the hardware
watchdogs. Now, let’s have a look how to add watchdog logic to
individual services.

First of all, to make software watchdog-supervisable it needs to be
patched to send out “I am alive” signals in regular intervals in its
event loop. Patching this is relatively easy. First, a daemon needs to
read the WATCHDOG_USEC= environment variable. If it is set,
it will contain the watchdog interval in usec, formatted as an ASCII text
string, as configured for the service. The daemon should then
issue sd_notify("WATCHDOG=1")
calls every half of that interval. A daemon patched this way should
transparently support watchdog functionality by checking whether the
environment variable is set and honouring the value it is set to.

To enable the software watchdog logic for a service (which has been
patched to support the logic pointed out above) it is sufficient to
set WatchdogSec= to the desired failure latency. See systemd.service(5)
for details on this setting. This causes WATCHDOG_USEC= to be
set for the service’s processes and will cause the service to enter a
failure state as soon as no keep-alive ping is received within the
configured interval.

If a service enters a failure state as soon as the watchdog logic
detects a hang, then this is hardly sufficient to build a reliable
system. The next step is to configure whether the service shall be
restarted and how often, and what to do if it then still fails. To
enable automatic service restarts on failure set
Restart=on-failure for the service. To configure how many
restart attempts shall be made, use the combination
of StartLimitBurst= and StartLimitInterval=, which
limit how often a service may be restarted within a given time
interval. If that limit is reached, a special action can be
taken. This action is configured with StartLimitAction=. The
default is none, i.e. no further action is taken and
the service simply remains in the failure state without any further
attempted restarts. The other three possible values are
reboot, reboot-force and
reboot-immediate. reboot attempts a clean reboot,
going through the usual, clean shutdown logic. reboot-force
is more abrupt: it will not actually try to cleanly shutdown any
services, but immediately kills all remaining services and unmounts
all file systems and then forcibly reboots (this way all file systems
will be clean but reboot will still be very fast). Finally,
reboot-immediate does not attempt to kill any process or
unmount any file systems. Instead it just hard reboots the machine
without delay. reboot-immediate hence comes closest to a
reboot triggered by a hardware watchdog. All these settings are
documented in systemd.service(5).

Putting this all together we now have pretty flexible options to
watchdog-supervise a specific service and configure automatic restarts
of the service if it hangs, plus take ultimate action if that doesn’t
help.

Here’s an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn’t pinged
the system manager for longer than 30s or if it fails otherwise. If it
is restarted this way more often than 4 times in 5min action is taken
and the system quickly rebooted, with all file systems being clean
when it comes up again.

And that’s already all I wanted to tell you about! With hardware
watchdog support right in PID 1, as well as supervisor watchdog
support for individual services we should provide everything you need
for most watchdog use cases. Regardless of whether you are building an embedded
or mobile appliance, or are working with high-availability
servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with
/dev/watchdog, and why this shouldn’t be kept in a separate
daemon, then please read this again and try to understand that this is
all about the supervisor chain we are building here, where the hardware watchdog
supervises systemd, and systemd supervises the individual
services. Also, we believe that a service not responding should be
treated in a similar way as any other service error. Finally, pinging
/dev/watchdog is one of the most trivial operations in the OS
(basically little more than an ioctl() call), so the support for this
is no more than a handful of lines of code. Maintaining this externally
with complex IPC between PID 1 (and the daemons) and this watchdog
daemon would be drastically more complex, error-prone and resource
intensive.)

Note that the built-in hardware watchdog support of systemd does
not conflict with other watchdog software by default. systemd does not
make use of /dev/watchdog by default, and you are welcome to
use external watchdog daemons in conjunction with systemd, if this
better suits your needs.

And one last thing: if you wonder whether your hardware has a
watchdog, then the answer is: almost definitely yes — if it is anything
less than a few years old. If you want to verify this, try the wdctl
tool from recent util-linux, which shows you everything you need to
know about your watchdog hardware.

I’d like to thank the great folks from Pengutronix for contributing
most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog
hardware these days too, as this is cheap to build and available in
most modern PC chipsets.

[2] So, here’s a free tip for you if you hack on the core
OS: don’t enable this feature while you hack. Otherwise your system
might suddenly reboot if you are in the middle of tracing through PID
1 with gdb and cause it to be stopped for a moment, so that no
hardware ping can be done…

systemd for Administrators, Part XIV

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/self-documented-boot.html

And here’s the fourteenth installment of my ongoing series on systemd for Administrators:

The Self-Explanatory Boot

One complaint we often hear about systemd is
that its boot process is hard to understand, even
incomprehensible. In general I can only disagree with this sentiment, I
even believe in quite the opposite: in comparison to what we had
before — where to even remotely understand what was going on you had
to have a decent comprehension of the programming language that is
Bourne Shell[1] — understanding systemd’s boot process is
substantially easier. However, like in many complaints there is some
truth in this frequently heard discomfort: for a seasoned Unix
administrator there indeed is a bit of learning to do when the switch
to systemd is
made. And as systemd developers it is our duty to make the learning
curve shallow, introduce as few surprises as we can, and provide
good documentation where that is not possible.

systemd has always had a huge body of documentation as manual
pages
(nearly 100 individual pages now!), in the Wiki and
the various blog stories I posted. However, any amount of
documentation alone is not enough to make software easily
understood. In fact, thick manuals sometimes appear intimidating and
make the reader wonder where to start reading, if all he was
interested in was this one simple concept of the whole system.

Acknowledging all this we have now added a new, neat, little
feature to systemd: the self-explanatory boot process. What do we mean
by that? Simply that each and every single component of our boot comes
with documentation and that this documentation is closely linked to
its component, so that it is easy to find.

More specifically, all units in systemd (which are what
encapsulate the components of the boot) now include references to
their documentation, the documentation of their configuration files
and further applicable manuals. A user who is trying to understand the
purpose of a unit, how it fits into the boot process and how to
configure it can now easily look up this documentation with the
well-known systemctl status command. Here’s an example of how
this looks for systemd-logind.service:

$ systemctl status systemd-logind.service
systemd-logind.service - Login Service
	  Loaded: loaded (/usr/lib/systemd/system/systemd-logind.service; static)
	  Active: active (running) since Mon, 25 Jun 2012 22:39:24 +0200; 1 day and 18h ago
	    Docs: man:systemd-logind.service(7)
	          man:logind.conf(5)
	          http://www.freedesktop.org/wiki/Software/systemd/multiseat
	Main PID: 562 (systemd-logind)
	  CGroup: name=systemd:/system/systemd-logind.service
		  └ 562 /usr/lib/systemd/systemd-logind

Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event2 (Power Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event6 (Video Bus)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event0 (Lid Switch)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event1 (Sleep Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event7 (ThinkPad Extra Buttons)
Jun 25 22:39:25 epsilon systemd-logind[562]: New session 1 of user gdm.
Jun 25 22:39:25 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/42/X11-display.
Jun 25 22:39:32 epsilon systemd-logind[562]: New session 2 of user lennart.
Jun 25 22:39:32 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/500/X11-display.
Jun 25 22:39:54 epsilon systemd-logind[562]: Removed session 1.

At first glance this output has changed very little. If you look
closer however you will find that it now includes one new field:
Docs lists references to the documentation of this
service. In this case there are two man page URIs and one web URL
specified. The man pages describe the purpose and configuration of
this service, the web URL includes an introduction to the basic
concepts of this service.

If the user uses a recent graphical terminal implementation it is
sufficient to click on the URIs shown to get the respective
documentation[2]. In other words: it has never been this
easy to figure out what a specific component of our boot is about:
just use systemctl status to get more information about it
and click on the links shown to find the documentation.

The past days I have written man pages and added these references
for every single unit we ship with systemd. This means, with
systemctl status you now have a very easy way to find out
more about every single service of the core OS.

If you are not using a graphical terminal (where you can just click
on URIs), a man page URI in the middle of the output of systemctl
status
is not the most useful thing to have. To make reading the
referenced man pages easier we have also added a new command:

systemctl help systemd-logind.service

Which will open the listed man pages right-away, without the need
to click anything or copy/paste a URI.

The URIs are in the formats documented by the uri(7)
man page. Units may reference http and https URLs, as well as man and
info pages.

Of course all this doesn’t make everything self-explanatory, simply
because the user still has to find out about systemctl status
(and systemctl in the first place, so that he even knows
what units there are); however with this basic knowledge further
help on specific units is in very easy reach.

We hope that this kind of interlinking of runtime behaviour and the
matching documentation is a big step forward to make our boot easier
to understand.

This functionality is partially already available in Fedora 17, and
will show up in complete form in Fedora 18.

That all said, credit where credit is due: this kind of references
to documentation within the service descriptions is not new, Solaris’
SMF had similar functionality for quite some time. However, we believe
this new systemd feature is certainly a novelty on Linux, and with
systemd we now offer you the best-documented and most self-explanatory
init system.

Of course, if you are writing unit files for your own packages,
please consider also including references to the documentation of your
services and its configuration. This is really easy to do, just list
the URIs in the new Documentation= field in the
[Unit] section of your unit files. For details see systemd.unit(5). The
more comprehensively we include links to documentation in our OS
services the easier the work of administrators becomes. (To make sure
Fedora makes comprehensive use of this functionality I filed a bug on
FPC
).
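For instance, picking up the hypothetical mylittled service from the watchdog example above, its unit file would merely gain one line (the web URL here is a placeholder):

```ini
[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8) http://example.com/mylittled

[Service]
ExecStart=/usr/bin/mylittled
```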

Oh, and BTW: if you are looking for a rough overview of systemd’s
boot process here’s
another new man page we recently added
, which includes a pretty
ASCII flow chart of the boot process and the units involved.

Footnotes

[1] Which TBH is a pretty crufty, strange one on top.

[2] Well, a terminal
where this bug is fixed
(used together with a help
browser where this one is fixed
).

Presentation in Warsaw

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/warsaw.html

I recently had the chance to speak about systemd
and other projects, as well as the politics behind them at a Bar Camp in Warsaw,
organized by the fine people of OSEC. The presentation has been recorded,
and has now been posted online. It’s a very long recording (1:43h),
but it’s quite interesting (as I’d like to believe) and contains a bit
of background on where we are coming from and where we are going. Anyway,
please have a look. Enjoy!

I’d like to thank the organizers for this great event and for
publishing the recording online.

systemd for Administrators, Part XIII

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/systemctl-journal.html

Here’s the thirteenth installment of my ongoing series on systemd for Administrators:

Log and Service Status

This one is a short episode. One of the most commonly used commands
on a systemd
system is systemctl status which may be used to determine the
status of a service (or other unit). It has always been a valuable
tool to figure out the processes, runtime information and other
metadata of a daemon running on the system.

With Fedora 17 we introduced the
journal
, our new logging scheme that provides structured, indexed
and reliable logging on systemd systems, while providing a certain
degree of compatibility with classic syslog implementations. The
original reason we started to work on the journal was one specific
feature idea that, to the outsider, might appear simple but that
without the journal is difficult and inefficient to implement: along
with the output of systemctl status we wanted to show the last
10 log messages of the daemon. Log data is some of the most essential
bits of information we have on the status of a service. Hence it is an
obvious choice to show next to the general status of the
service.

And now to make it short: at the same time as we integrated the
journal into systemd and Fedora we also hooked up
systemctl with it. Here’s an example output:

$ systemctl status avahi-daemon.service
avahi-daemon.service - Avahi mDNS/DNS-SD Stack
	  Loaded: loaded (/usr/lib/systemd/system/avahi-daemon.service; enabled)
	  Active: active (running) since Fri, 18 May 2012 12:27:37 +0200; 14s ago
	Main PID: 8216 (avahi-daemon)
	  Status: "avahi-daemon 0.6.30 starting up."
	  CGroup: name=systemd:/system/avahi-daemon.service
		  ├ 8216 avahi-daemon: running [omega.local]
		  └ 8217 avahi-daemon: chroot helper

May 18 12:27:37 omega avahi-daemon[8216]: Joining mDNS multicast group on interface eth1.IPv4 with address 172.31.0.52.
May 18 12:27:37 omega avahi-daemon[8216]: New relevant interface eth1.IPv4 for mDNS.
May 18 12:27:37 omega avahi-daemon[8216]: Network interface enumeration completed.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 192.168.122.1 on virbr0.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for fd00::e269:95ff:fe87:e282 on eth1.*.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 172.31.0.52 on eth1.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering HINFO record with values 'X86_64'/'LINUX'.
May 18 12:27:38 omega avahi-daemon[8216]: Server startup complete. Host name is omega.local. Local service cookie is 3555095952.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/ssh.service) successfully established.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/sftp-ssh.service) successfully established.

This, of course, shows the status of everybody’s favourite
mDNS/DNS-SD daemon with a list of its processes, along with — as
promised — the 10 most recent log lines. Mission accomplished!

There are a couple of switches available to alter the output
slightly and adjust it to your needs. The two most interesting
switches are -f to enable follow mode (as in tail -f)
and -n to change the number of lines to show (you guessed it,
as in tail -n).
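Putting the two switches together, here’s a quick sketch of how you might
watch a service’s recent and live log output (assuming a systemd-based
system and the avahi-daemon service from the example above):

```shell
# Show the status plus the last 20 log lines instead of the default 10
systemctl status -n 20 avahi-daemon.service

# Same, but keep following new log messages as they come in (as in tail -f)
systemctl status -n 20 -f avahi-daemon.service
```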

The log data shown comes from three sources: everything any of the
daemon’s processes logged with libc’s syslog() call,
everything submitted using the native Journal API, plus everything any
of the daemon’s processes logged to STDOUT or STDERR. In short:
everything the daemon generates as log data is collected, properly
interleaved and shown in the same format.
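As a quick illustration of how these streams end up in the same place,
here’s a small shell sketch (assuming a systemd system; the identifier
string mydaemon is an arbitrary example):

```shell
# 1. Log via the classic syslog() path
logger -t mydaemon "Hello via syslog"

# 2. Log via STDOUT, collected by the journal through systemd-cat
echo "Hello via stdout" | systemd-cat -t mydaemon

# Both messages now show up, interleaved, under the same identifier
journalctl -t mydaemon -n 10
```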

And that’s it already for today. It’s a very simple feature, but an
immensely useful one for every administrator. One of those “Why didn’t
we already do this 15 years ago?” features.

Stay tuned for the next installment!

Boot & Base OS Miniconf at Linux Plumbers Conference 2012, San Diego

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/lpc2012.html


We are working on putting together a miniconf on
the topic of Boot & Base OS
for the Linux Plumbers Conference 2012 in San
Diego (Aug 29-31). And we need your submission!

Are you working on some exciting project related to Boot and Base OS and
would like to present your work? Then please submit something following
these guidelines
, but please CC Kay Sievers and Lennart Poettering.

I hope that at this point the Linux Plumbers Conference
needs little introduction, so I will spare any further prose on how great and
useful and the best conference ever it is for everybody who works on the plumbing
layer of Linux. However, there’s one conference that will be co-located with
LPC that is still little known, because it happens for the first time: The C Conference, organized by Brandon Philips
and friends. It covers all things C, and they are still looking for more
topics, in a reverse CFP. Please
consider submitting a proposal and registering to the conference!


The Most Awesome, Least-Advertised Fedora 17 Feature

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/multi-seat.html

There’s one feature in the upcoming Fedora 17 release that is
immensely useful but very little known, since its feature page
‘ckremoval’
does not explicitly refer to it in its name: true
automatic multi-seat support for Linux.

A multi-seat computer is a system that offers not only one local
seat for a user, but multiple, at the same time. A seat refers to a
combination of a screen, a set of input devices (such as mice and
keyboards), and maybe an audio card or webcam, as an individual local
workplace for a user. A multi-seat computer can drive an entire class
room of seats with only a fraction of the cost in hardware, energy,
administration and space: you only have one PC, which usually has more
than enough CPU power to drive 10 or more workplaces. (In fact, even a
netbook is fast enough to drive a couple of seats!) Automatic
multi-seat refers to an entirely automatically managed seat setup:
whenever a new seat is plugged in a new login screen immediately
appears — without any manual configuration — and when the seat is
unplugged all user sessions on it are removed without delay.

In Fedora 17 we added this functionality to the low-level user and
device tracking of systemd, replacing the previous ConsoleKit logic
that lacked support for automatic multi-seat. With all the ground work
done in systemd, udev and the other components of our plumbing layer
the last remaining bits were surprisingly easy to add.

Currently, the automatic multi-seat logic works best with the USB
multi-seat hardware from Plugable
you can buy cheaply on Amazon
(US)
. These devices require exactly zero configuration with the
new scheme implemented in Fedora 17: just plug them in at any time,
login screens pop up on them, and you have your additional
seats. Alternatively you can also assemble your seat manually with a
few easy loginctl attach commands, from any kind of hardware
you might have lying
around. To get a full seat you need multiple graphics cards, keyboards
and mice: one set for each seat. (Later on we’ll probably have a graphical
setup utility for additional seats, but that’s not a pressing issue we
believe, as the plug-n-play multi-seat support with the Plugable
devices is so awesomely nice.)
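For the manual route, here’s a rough sketch of what such a seat assembly
could look like. The seat name and the sysfs device paths below are
hypothetical examples, not real paths; use loginctl seat-status and a look
around /sys to find the actual devices on your machine:

```shell
# Assign a spare graphics device, keyboard and mouse to a new seat "seat1".
# The device paths are made-up examples for illustration only.
loginctl attach seat1 /sys/devices/pci0000:00/0000:00:02.0/drm/card1
loginctl attach seat1 /sys/devices/platform/i8042/serio0/input/input2
loginctl attach seat1 /sys/devices/platform/i8042/serio1/input/input3

# Verify the result
loginctl seat-status seat1
```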

Plugable provided us for free with hardware for testing
multi-seat. They are also involved with the upstream development of
the USB DisplayLink driver for Linux. Due to their positive
involvement with Linux we can only recommend to buy their
hardware. They are good guys, and support Free Software the way all
hardware vendors should! (And besides that, their hardware is also
nicely put together. For example, in contrast to most similar vendors
they actually assign proper vendor/product IDs to their USB hardware
so that we can easily recognize their hardware when plugged in to set
up automatic seats.)

Currently, all this magic is only implemented in the GNOME stack
with the biggest component getting updated being the GNOME Display
Manager. On the Plugable USB hardware you get a full GNOME Shell
session with all the usual graphical gimmicks, the same way as on any
other hardware. (Yes, GNOME 3 works perfectly fine on simpler graphics
cards such as these USB devices!) If you are hacking on a different
desktop environment, or on a different display manager, please have a
look at the
multi-seat documentation
we put together, and particularly at
our short piece about writing
display managers
which are multi-seat capable.

If you work on a major desktop environment or display manager and
would like to implement multi-seat support for it, but lack the
aforementioned Plugable hardware, we might be able to provide you with
the hardware for free. Please contact us directly, and we might be
able to send you a device. Note that we don’t have unlimited devices
available, hence we’ll probably not be able to pass hardware to
everybody who asks, and we will pass the hardware preferably to people
who work on well-known software or otherwise have contributed good
code to the community already. Anyway, if in doubt, ping us, and
explain to us why you should get the hardware, and we’ll consider you!
(Oh, and this not only applies to display managers, if you hack on some other
software where multi-seat awareness would be truly useful, then don’t
hesitate and ping us!)

Phoronix has a story about this new multi-seat support which is
quite interesting and full of pictures. Please have a look.

Plugable started a Pledge
drive
to lower the price of the Plugable USB multi-seat terminals
further. It’s full of pictures (and a video showing all this in
action!), and uses the code we now make available in Fedora 17 as its
base. Please consider pledging a few bucks.

Recently David Zeuthen added
multi-seat support to udisks
as well. With this in place, a user
logged in on a specific seat can only see the USB storage plugged into
his individual seat, but does not see any USB storage plugged into any
other local seat. With that, we closed the last missing bit of
multi-seat support in our desktop stack.

With this code in Fedora 17 we cover the big use cases of
multi-seat already: internet cafes, class rooms and similar
installations can provide PC workplaces cheaply and easily without any
manual configuration. Later on we want to build on this and make it
useful in other scenarios too: for example, the ability to get a login
screen as easily as plugging in a USB connector makes this useful not
only for saving money in setups for many people, but also in embedded
environments (consider monitoring/debugging screens made available via
this hotplug logic) or servers (get trivially quick local access to
your otherwise head-less server). To be truly useful in these areas we
need one more thing though: the ability to run a simple getty
(i.e. text login) on the seat, without necessarily involving a
graphical UI.

The well-known X successor Wayland already comes out of the box with multi-seat
support based on this logic.

Oh, and BTW, as Ubuntu appears to be “focussing” on “clarity” in
the “cloud” now ;-), and chose Upstart instead of systemd, this feature
won’t be available in Ubuntu any time soon. That’s (one detail of) the
price Ubuntu has to pay for choosing to maintain its own (largely
legacy, such as ConsoleKit) plumbing stack.

Multi-seat has a long history on Unix. Since the earliest days Unix
systems could be accessed by multiple local terminals at the same
time. Since then local terminal support (and hence multi-seat)
gradually moved out of view in computing. Few machines these
days have more than one seat; the concept of terminals survived almost
exclusively in the context of PTYs (i.e. fully virtualized API
objects, disconnected from any real hardware seat) and VCs (i.e. a
single virtualized local seat), but hardly in any other way (well,
server setups still use serial terminals for emergency remote access,
but they almost never have more than one serial terminal). All that we
do in systemd is based on the ideas originally brought forward in
Unix; with systemd we now try to bring back a number of the good ideas
of Unix that were lost along the roadside over time. For
example, in true Unix style we already started to expose the concept
of a service in the file system (in
/sys/fs/cgroup/systemd/system/), something where on Linux the
(often misunderstood) “everything is a file” mantra previously
fell short. With automatic multi-seat support we bring back support
for terminals, but updated with all the features of today’s desktops:
plug and play, zero configuration, full graphics, and not limited to
input devices and screens, but extending to all kinds of devices, such
as audio, webcams or USB memory sticks.

Anyway, this is all for now; I’d like to thank everybody who was
involved with making multi-seat work so nicely and natively on the
Linux platform. You know who you are! Thanks a ton!

systemd Status Update

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/systemd-update-3.html

It has been way too long since my last status update on
systemd. Here’s another short, incomplete status update on what we
worked on for systemd since then.

We have been working hard to turn systemd into the most viable set
of components to build operating systems, appliances and devices from,
and make it the best choice for servers, for desktops and for embedded
environments alike. I think we have a really convincing set of
features now, but we are actively working on making it even
better.

Here’s a list of some more and some less interesting features, in
no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar
    to how git has it.
  2. systemctl learnt a new switch --failed, to show only
    failed services.
  3. You may now start services immediately, overriding all dependency
    logic by passing --ignore-dependencies to
    systemctl. This is mostly a debugging tool and nothing people
    should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown
    logic of services is now optional and may be configured with the
    SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into their own project systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO
    special files as well as character and block device nodes, and
    symlinks. It also is capable of relabelling certain directories at
    boot now (in the SELinux sense).
  7. Immediately before shutting down we will now invoke all binaries
    found in /lib/systemd/system-shutdown/, which is useful for
    debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes
    (unless individual service configuration overrides it).
  9. There’s a new ConditionVirtualization= option, that makes
    systemd skip a specific service if a certain virtualization technology
    is found or not found. Similarly, we now have a new option to detect
    whether a certain security technology (such as SELinux) is available,
    called ConditionSecurity=. There’s also
    ConditionCapability= to check whether a certain process
    capability is in the capability bounding set of the system. There’s
    also a new ConditionFileIsExecutable=,
    ConditionPathIsMountPoint=,
    ConditionPathIsReadWrite=,
    ConditionPathIsSymbolicLink=.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be “triggering” and “mandatory”, meaning that
    they can be a necessary requirement to hold for a service to start, or
    simply one trigger among many.
  12. At boot time we now print warnings if: /usr is on a
    split-off partition but not already mounted by an initrd; if
    /etc/mtab is not a symlink to /proc/mounts; or if
    CONFIG_CGROUPS is not enabled in the kernel. We’ll also
    expose this as a tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in
    Linux namespace containers and will get a clean boot in both
    cases. This is more complicated than it sounds since device management
    with udev or write access to /sys, /proc/sys or
    things like /dev/kmsg is not available in a container. This
    makes systemd a first-class choice for managing thin container
    setups. This is all tested with systemd’s own systemd-nspawn
    tool but should work fine in LXC setups, too. Basically this means
    that you do not have to adjust your OS manually to make it work in a
    container environment, but will just work out of the box. It also
    makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of the
    D-Bus machine ID logic. See this blog story for more
    information. On stateless/read-only systems the machine ID is
    initialized randomly at boot. In virtualized environments it may be
    passed in from the machine manager (with qemu’s -uuid
    switch, or via the container interface).
  16. All of the systemd-specific /etc/fstab mount options are
    now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now
    implicitly prefix all LSB and SysV init script descriptions with the
    strings “LSB:” and “SYSV:” respectively.
  18. We introduced /run and made it a hard dependency of
    systemd. This directory is now widely accepted and implemented on all
    relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn,
    a really powerful tool that can be used to start containers for
    debugging, building and testing, much like chroot(1). It is useful to
    just get a shell inside a build tree, but is good enough to boot up a
    full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit
    TAB to hide the asterisks we normally show for each key that is
    entered, for extra paranoia.
  22. We don’t enable udev-settle.service anymore, which is
    only required for certain legacy software that still hasn’t been
    updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to
    bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel’s binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot()
    environment and will work accordingly (i.e. apply changes to the tree
    it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There’s a new unit dependency type OnFailureIsolate= that
    allows entering a different target whenever a certain unit fails. For
    example, this is interesting to enter emergency mode if file system
    checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files
    from /proc and POSIX message queues, too.
  28. There’s a new IgnoreOnIsolate= flag which may be used to
    ensure certain units are left untouched by isolation requests. There’s
    a new IgnoreOnSnapshot= flag which may be used to exclude
    certain units from snapshot units when they are created.
  29. There are now small mechanism services for changing the local
    hostname and other host meta data, changing the system locale and
    console settings, and the system clock.
  30. We now limit the capability bounding set for a number of our
    internal services by default.
  31. Plymouth may now be disabled globally with
    plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finished running (and
    optionally other tools run on VTs). This adds extra security since it
    clears up the scrollback buffer so that subsequent users cannot get
    access to a user’s session output.
  33. In socket units there are now options to control the
    IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED,
    SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger
    than the default system settings if needed by using
    SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID
    1, in order to avoid time jumps during normal userspace operation, and
    to guarantee sensible times on all generated logs. We also no longer
    save the system clock to the RTC on shutdown, assuming that this is
    done by the clock control tool when the user modifies the time, or
    automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to
    /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps tracks
    of logged in users and their sessions. It creates control groups for
    them, implements the XDG_RUNTIME_DIR
    specification
    for them, maintains seats and device node ACLs and
    implements shutdown/idle inhibiting for clients. It auto-spawns gettys
    on all local VTs when the user switches to them (instead of starting
    six of them unconditionally), thus reducing the resource foot print by
    default. It has a D-Bus interface as well as a
    simple synchronous library interface
    . This mechanism obsoletes
    ConsoleKit which is now deprecated and should no longer be used.
  38. There’s now full, automatic multi-seat support, and this is
    enabled in GNOME 3.4. Just by plugging in new seat hardware you get a
    new login screen on your seat’s screen.
  39. There is now an option ControlGroupModify= to allow
    services to change the properties of their control groups dynamically,
    and one to make control groups persistent in the tree
    (ControlGroupPersistent=) so that they can be created and
    maintained by external tools.
  40. We now jump back into the initrd in shutdown, so that it can
    detach the root file system and the storage devices backing it. This
    allows (for the first time!) to reliably undo complex storage setups
    on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and
    administrators to define their own policies on whether services should
    be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking
    units. There’s also a new command (systemctl list-unit-files)
    for determining the list of all installed unit files and whether
    they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it
    appears. This makes /etc/sysctl.d compatible with hot-plug
    network devices.
  44. There’s limited profiling for SELinux start-up performance built
    into PID 1.
  45. There’s a new switch PrivateNetwork=
    to turn off any network access for a specific service.
  46. Service units may now include configuration for control group
    parameters. A few (such as MemoryLimit=) are exposed with
    high-level options, and all others are available via the generic
    ControlGroupAttribute= setting.
  47. There’s now the option to mount certain cgroup controllers
    jointly at boot. We do this now for cpu and
    cpuacct by default.
  48. We added the journal and turned it on by default.
  49. All service output is now written to the Journal by default,
    regardless whether it is sent via syslog or simply written to
    stdout/stderr. Both message streams end up in the same location and
    are interleaved the way they should. All log messages even from the
    kernel and from early boot end up in the journal. Now, no service
    output gets unnoticed and is saved and indexed at the same
    location.
  50. systemctl status will now show the last 10 log lines for
    each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console,
    again. We also show the much loved colorful [ OK ] status
    messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container managers
    and initrds for passing execution data to systemd. We also
    implemented and documented an interface for storage daemons that
    are required to back the root file system.
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won’t show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource
    usage of whole services in a top(1) like fashion.
  57. systemd may now supervise services in watchdog style. If enabled
    for a service, the daemon has to ping PID 1 in regular intervals
    or is otherwise considered failed (which might then result in
    restarting it, or even rebooting the machine, as configured). Also,
    PID 1 is capable of pinging a hardware watchdog. Putting this
    together, the hardware watchdog supervises PID 1, and PID 1 in turn
    watchdogs specific services. This is highly useful for
    high-availability servers as well as embedded machines. Since
    watchdog hardware is nowadays built into all modern chipsets
    (including desktop chipsets), this should hopefully help to make
    this a more widely used functionality.
  58. We added support for a new kernel command line option
    systemd.setenv= to set an environment variable
    system-wide.
  59. By default services which are started by systemd will have SIGPIPE
    set to ignored. The Unix SIGPIPE logic is used to reliably implement
    shell pipelines and when left enabled in services is usually just a
    source of bugs and problems.
  60. You may now configure the rate limiting that is applied to
    restarts of specific services. Previously the rate limiting parameters
    were hard-coded (similar to SysV).
  61. There’s now support for loading the IMA integrity policy into the
    kernel early in PID 1, similar to how we already did it with the
    SELinux policy.
  62. There’s now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt
    an official tool in the tool set. Since we already had code to detect
    certain VM and container environments we now added an official tool
    for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous interfaces systemd introduced.
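Several of the new unit-file options from the list above can be sketched
in a single hypothetical service file. The service name, paths and
values below are illustrative examples, not a real shipped unit:

```
# /etc/systemd/system/example.service (hypothetical)
[Unit]
Description=Example daemon
Documentation=man:example(8)
# Skip this service when running inside a container (item 9)
ConditionVirtualization=!container

[Service]
ExecStart=/usr/bin/example-daemon
# Don't follow up the implicit shutdown logic with SIGKILL (item 4)
SendSIGKILL=no
# Turn off all network access for this service (item 45)
PrivateNetwork=yes
# High-level control group resource limit (item 46)
MemoryLimit=512M
```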

Much of the stuff above is already available in Fedora 15 and 16,
or will be made available in the upcoming Fedora 17.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you the details.

I’d like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

Control Groups vs. Control Groups

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/cgroups-vs-cgroups.html

TL;DR: systemd does not
require the performance-sensitive bits of Linux control groups enabled in the kernel.
However, it does require some non-performance-sensitive bits of the control
group logic.

In some areas of the community there’s still some confusion about Linux
control groups and their performance impact, and what precisely it is that
systemd requires of them. In the hope to clear this up a bit, I’d like to point
out a few things:

Control Groups are two things: (A) a way to hierarchically group and
label processes, and (B) a way to then apply resource limits to
these groups. systemd only requires the former (A), and not the latter (B).
That means you can compile your kernel without any control group resource
controllers (B) and systemd will work perfectly on it. However, if you in
addition disable the grouping feature entirely (A) then systemd will loudly
complain at boot and proceed only reluctantly with a big warning and in a
limited functionality mode.

At compile time, the grouping/labelling feature in the kernel is enabled by
CONFIG_CGROUPS=y, the individual controllers by CONFIG_CGROUP_FREEZER=y,
CONFIG_CGROUP_DEVICE=y, CONFIG_CGROUP_CPUACCT=y, CONFIG_CGROUP_MEM_RES_CTLR=y,
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y,
CONFIG_CGROUP_PERF=y, CONFIG_CGROUP_SCHED=y, CONFIG_BLK_CGROUP=y,
CONFIG_NET_CLS_CGROUP=y, CONFIG_NETPRIO_CGROUP=y. And since (as mentioned) we
only need the former (A), not the latter (B) you may disable all of the latter
options while enabling CONFIG_CGROUPS=y, if you want to run systemd on your
system.
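Expressed as a kernel build configuration fragment, the minimal setup
systemd needs (grouping enabled, all controllers off) would look roughly
like this sketch — the controller list is as given above, not a complete
kernel config:

```
# Required by systemd: grouping/labelling (A)
CONFIG_CGROUPS=y

# Optional resource controllers (B) -- all of these may be disabled
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_CGROUP_MEM_RES_CTLR is not set
# CONFIG_CGROUP_SCHED is not set
# CONFIG_BLK_CGROUP is not set
```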

What about the performance impact of these options? Well, every bit of code
comes at some price, so none of these options come entirely for free. However,
the grouping feature (A) alters the general logic very little, it just sticks
hierarchical labels on processes, and its impact is minimal since that is
usually not in any hot path of the OS. This is different for the various
controllers (B) which have a much bigger impact since they influence the resource
management of the OS and are full of hot paths. This means that the kernel
feature that systemd mandatorily requires (A) has a minimal effect on system
performance, but the actually performance-sensitive features of control groups
(B) are entirely optional.

On boot, systemd will mount all controller hierarchies it finds enabled
in the kernel to individual directories below /sys/fs/cgroup/. This is
the official place where kernel controllers are mounted to these days. The
/sys/fs/cgroup/ mount point in the kernel was created precisely for
this purpose. Since the control group controllers are a shared facility
that might be used by a number of different subsystems, a few projects
have agreed on a set of rules to avoid the various bits of code
stepping on each other’s toes when using these directories.

systemd will also maintain its own, private, controller-less, named control
group hierarchy which is mounted to /sys/fs/cgroup/systemd/. This
hierarchy is private property of systemd, and other software should not try to
interfere with it. This hierarchy is how systemd makes use of the naming and
grouping feature of control groups (A) without actually requiring any kernel
controller enabled for that.
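On a running system using this scheme you can see the grouping in action
without any controller being involved. A small sketch, assuming the
layout described above and the avahi-daemon service from earlier posts:

```shell
# The named systemd hierarchy, with no resource controller attached
mount | grep /sys/fs/cgroup/systemd

# Each service gets its own group; list the PIDs of one of them
cat /sys/fs/cgroup/systemd/system/avahi-daemon.service/tasks
```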

Now, you might notice that by default systemd does create per-service
cgroups in the “cpu” controller if it finds it enabled in the kernel. This is
entirely optional, however. We chose to make use of it by default to even out
CPU usage between system services. Example: On a traditional web server machine
Apache might end up having 100 CGI worker processes around, while MySQL only
has 5 processes running. Without the use of the “cpu” controller this means
that Apache altogether ends up having 20x more CPU available than MySQL since
the kernel tries to provide every process with the same amount of CPU time. On
the other hand, if we add these two services to the “cpu” controller in
individual groups by default, Apache and MySQL get the same amount of CPU,
which we think is a good default.

Note that if the CPU controller is not enabled in the kernel systemd will not
attempt to make use of the “cpu” hierarchy as described above. Also, even if it is enabled in the kernel it
is trivial to tell systemd not to make use of it: Simply edit
/etc/systemd/system.conf and set DefaultControllers= to the
empty string.
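In other words, the relevant snippet of /etc/systemd/system.conf
would simply be:

```
[Manager]
# Don't place services in any controller hierarchy (the default is "cpu")
DefaultControllers=
```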

Let’s discuss a few frequently heard complaints regarding systemd’s use of control groups:

  • systemd mounts all controllers to /sys/fs/cgroup/ even though
    my software requires it at /dev/cgroup/ (or some other place)!
    The
    standardization of /sys/fs/cgroup/ as mount point of the hierarchies
    is a relatively recent change in the kernel. Some software has not been updated
    yet for it. If you cannot change the software in question you are welcome to
    unmount the hierarchies from /sys/fs/cgroup/ and mount them wherever
    you need them instead. However, make sure to leave
    /sys/fs/cgroup/systemd/ untouched.
  • systemd makes use of the “cpu” hierarchy, but it should leave its dirty
    fingers from it!
    As mentioned above, just set the
    DefaultControllers= option of systemd to the empty string.
  • I need my two controllers “foo” and “bar” mounted into one hierarchy,
    but systemd mounts them in two!
    Use the JoinControllers= setting
    in /etc/systemd/system.conf to mount several controllers into a single
    hierarchy.
  • Control groups are evil and they make everything slower! Well,
    please read the text above and understand the difference between
    “control-groups-as-in-naming-and-grouping” (A) and “cgroups-as-in-controllers”
(B). Then, please turn off all controllers in your kernel build (B) but leave
    CONFIG_CGROUPS=y (A) enabled.
  • I have heard some kernel developers really hate control groups
    and think systemd is evil because it requires them!
    Well, there are a
    couple of things behind the dislike of control groups by some folks.
    Primarily, this is probably caused because the hackers in question do not
distinguish between the naming-and-grouping bits of the control group logic (A) and the
    controllers that are based on it (B). Mainly, their beef is with the latter
    (which systemd does not require, which is the key point I am trying to make in
    the text above), but there are other issues as well: for example, the code of
    the grouping logic is not the most beautiful bit of code ever written by man
    (which is thankfully likely to get better now, since the control groups
    subsystem now has an active maintainer again). And then for some
    developers it is important that they can compare the runtime behaviour of many
    historic kernel versions in order to find bugs (git bisect). Since systemd
    requires kernels with basic control group support enabled, and this is a
    relatively recent feature addition to the kernel, this makes it difficult for
    them to use a newer distribution with all these old kernels
    that predate cgroups. Anyway, the summary is probably that what matters to
    developers is different from what matters to users and
    administrators.

I hope this explanation was useful for a reader or two! Thank you for your time!

/tmp or not /tmp?

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/tmp.html

A number of Linux distributions have recently switched (or started
switching) to /tmp on tmpfs by default (ArchLinux, Debian among
others). Other distributions have plans/are discussing doing the same (Ubuntu, OpenSUSE).
Since we believe this is a good idea and it’s good to keep the delta between
the distributions minimal we are proposing
the same for Fedora 18, too
. On Solaris a similar change has already been
implemented in 1994 (and other Unixes have made a similar change long ago,
too). Yet, not all of our software is written in a way that it works nicely
together with /tmp on tmpfs.

Another Fedora
feature (for Fedora 17)
changed the semantics of /tmp for many
system services to make them more secure, by isolating the /tmp namespaces of the
various services. Handling of temporary files in /tmp has always been
security sensitive: traditionally it is a world-writable, shared namespace, and
unless all user code safely uses randomized file names it is vulnerable to DoS
attacks and worse.

In this blog story I’d like to shed some light on proper usage of
/tmp and what your Linux application should use for what purpose. We’ll not
discuss why /tmp on tmpfs is a good idea, for that refer to the Fedora feature
page
. Here we’ll just discuss what /tmp should be used for and for
what it shouldn’t be, as well as what should be used instead. All that in order
to make sure your application remains compatible with these new features
introduced to many newer Linux distributions.

/tmp is (as the name suggests) an area where temporary files
applications require during operation may be placed. Of course, temporary files
differ very much in their properties:

  • They can be large, or very small
  • They might be used for sharing between users, or be private to users
  • They might need to be persistent across boots, or very volatile
  • They might need to be machine-local or shared on the network

Traditionally, /tmp has not only been the place where actual
temporary files are stored, but some software used to place (and often still
continues to place) communication primitives such as sockets, FIFOs, shared
memory there as well. Notably X11, but many others too. Usage of world-writable
shared namespaces for communication purposes has always been problematic, since
to establish communication you need stable names, but stable names open the
doors for DoS attacks. This can be corrected partially, by establishing
protected per-app directories for certain services during early boot (like we
do for X11), but this only fixes the problem partially, since this only works
correctly if every package installation is followed by a reboot.

Besides /tmp there are various other places where temporary files
(or other files that traditionally have been stored in /tmp) can be
stored. Here’s a quick overview of the candidates:

  • /tmp, POSIX suggests this is flushed at boot; the FHS says that files
    do not need to be persistent between two runs of the application. Old files are
    often cleaned up automatically after a time (“aging”). Usually it is
    recommended to use $TMPDIR if it is set before falling back to /tmp
    directly. As mentioned, this is a tmpfs on many Linuxes/Unixes (and most likely
    will be for most soon), and hence should be used only for small files. It’s
    generally a shared namespace, hence the only APIs for using it should be mkstemp(), mkdtemp() (and friends)
    to be entirely safe.[1] Recently, improvements have been made to
    turn this shared namespace into a private namespace (see above), but that doesn’t
    relieve developers from writing secure code that is also safe if /tmp is a shared
    namespace. Because /tmp is no longer necessarily a shared namespace it
    is generally unsuitable as a location for communication primitives. It is
    machine-private and local. It’s usually fully featured (locking, …). This
    directory is world writable and thus available for both privileged and
    unprivileged code.
  • /var/tmp, according to FHS “more persistent” than /tmp,
    and is less often cleaned up (it’s persistent across reboots, for example). It’s not on a tmpfs, but on a real disk, and
    hence can be used to store much larger files. The same namespace problems apply
    as with /tmp, hence also exclusively use
    mkstemp()/mkdtemp() for this directory. It is also
    automatically cleaned up over time (“aging”). It is machine-private. It’s not necessarily
    fully featured (no locking, …). This directory is world writable and thus
    available for both privileged and unprivileged code. We suggest to also check
    $TMPDIR before falling back to /var/tmp. That way if
    $TMPDIR is set this overrides usage of both /tmp and
    /var/tmp.
  • /run (traditionally /var/run) where privileged daemons
    can store runtime data, such as communication primitives. This is where your
    daemon should place its sockets. It’s guaranteed to be a shared namespace, but
    is only writable by privileged code and hence very safe to use. This file
    system is guaranteed to be a tmpfs and is hence automatically flushed at boots.
    No automatic clean-up is done beyond that. It is machine-private and local. It
    is fully-featured, and provides all functionality the local OS can provide
    (locking, sockets, …).
  • $XDG_RUNTIME_DIR
    where unprivileged user software can store runtime data, such as communication
    primitives. This is similar to /run but for user applications. It’s a
    user private namespace, and hence very safe to use. It’s cleaned up
    automatically at logout and also is cleaned up by time via “aging”. It is
    machine-private and fully featured. In GLib applications use
    g_get_user_runtime_dir() to query the path of this directory.
  • $XDG_CACHE_HOME
    where unprivileged user software can store non-essential data. It’s a private
    namespace of the user. It might be shared between machines. It is not
    automatically cleaned up, and not fully featured (no locking, and so on, due to
    NFS). In GLib applications use g_get_user_cache_dir() to query this
    directory.
  • $XDG_DOWNLOAD_DIR
    where unprivileged user software can store downloads and downloads in progress.
    It should only be used for downloads, and is a private namespace of the user,
    but might be shared between machines. It is not automatically cleaned up and
    not fully featured. In GLib applications use g_get_user_special_dir()
    to query the path of this directory.

Now that we have introduced the contestants, here’s a rough guide how we
suggest you (a Linux application developer) pick the right directory to use:

  1. You need a place to put your socket (or other communication primitive) and your code runs privileged: use a subdirectory beneath /run. (Or beneath /var/run for extra compatibility.)
  2. You need a place to put your socket (or other communication primitive) and your code runs unprivileged: use a subdirectory beneath $XDG_RUNTIME_DIR.
  3. You need a place to put your larger downloads and downloads in progress and run unprivileged: use $XDG_DOWNLOAD_DIR.
  4. You need a place to put cache files which should be persistent and run unprivileged: use $XDG_CACHE_HOME.
  5. Nothing of the above applies and you need to place a small file that needs no persistency: use $TMPDIR with a fallback on /tmp. And use mkstemp(), and mkdtemp() and nothing homegrown.
  6. Otherwise use $TMPDIR with a fallback on /var/tmp. Also use mkstemp()/mkdtemp().

Note that these rules above are only suggested by us. These rules
take into account everything we know about this topic and avoid problems with
current and future distributions, as far as we can see them. Please consider
updating your projects to follow these rules, and keep them in mind if you
write new code.

One thing we’d like to stress is that /tmp and /var/tmp
more often than not are actually not the right choice for your usecase. There
are valid uses of these directories, but quite often another directory might
actually be the better place. So, be careful, consider the other options, but
if you do go for /tmp or /var/tmp then at least make sure to
use mkstemp()/mkdtemp().

Thank you for your interest!

Oh, and if you now complain that we don’t understand Unix, and that we are
morons and worse, then please read this again, and you might notice that this
is just a best practice guide, not a specification we have written. Nothing that
introduces anything new, just something that explains how things are.

If you want to complain about the tmp-on-tmpfs or
ServicesPrivateTmp feature, then this is not the right place either,
because this blog post is not really about that. Please direct this to
fedora-devel instead. Thank you very much.

Footnotes

[1] Well, or to turn this around: unless you have a PhD in advanced
Unixology and are not using mkstemp()/mkdtemp() but use
/tmp nonetheless it’s very likely you are writing vulnerable
code.

/etc/os-release

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/os-release.html

One of
the new configuration files systemd introduced is /etc/os-release
.
It replaces the multitude of per-distribution release files[1] with
a single one. Yesterday we decided
to drop
support for systems lacking /etc/os-release
in systemd since recently the majority of the big distributions adopted
/etc/os-release and many small ones did, too[2]. It’s our
hope that by dropping support for non-compliant distributions we gently put
some pressure on the remaining hold-outs to adopt this scheme as well.

I’d like to take the opportunity to explain a bit what the new file offers,
why application developers should care, and why the distributions should adopt
it. Of course, this file is pretty much a triviality in many ways,
but I guess it’s still one that deserves explanation.

So, you ask why this all?

  • It relieves application developers who just want to know the
    distribution they are running on to check for a multitude of individual release files.
  • It provides both a “pretty” name (i.e. one to show to the user), and
    machine parsable version/OS identifiers (i.e. for use in build systems).
  • It is extensible, can easily learn new fields if needed. For example, since
    we want to print a welcome message in the color of your distribution at boot
    we make it possible to configure the ANSI color for that in the file.

FAQs

There’s already the lsb_release tool for this, why don’t you
just use that?
Well, it’s a very strange interface: a shell script you have
to invoke (and hence spawn asynchronously from your C code), and it’s not
written to be extensible. It’s an optional package in many distributions, and
nothing we’d be happy to invoke as part of early boot in order to show a
welcome message. (In times with sub-second userspace boot times we really don’t
want to invoke a huge shell script for a triviality like showing the welcome
message). The lsb_release tool to us appears to be an attempt of
abstracting distribution checks, where standardization of distribution checks
is needed. It’s simply a badly designed interface. In our opinion, it
has its use as an interface to determine the LSB version itself, but not for
checking the distribution or version.

Why haven’t you adopted one of the generic release files, such as
Fedora’s /etc/system-release?
Well, they are much nicer than
lsb_release, so much is true. However, they are not extensible and
are not really parsable, if the distribution needs to be identified
programmatically or a specific version needs to be verified.

Why didn’t you call this file /etc/bikeshed instead? The name
/etc/os-release sucks!
In a way, I think you kind of answered your
own question there already.

Does this mean my distribution can now drop our equivalent of
/etc/fedora-release?
Unlikely, too much code exists that still
checks for the individual release files, and you probably shouldn’t break that.
This new file makes things easy for applications, not for distributions:
applications can now rely on a single file only, and use it in a nice way.
Distributions will have to continue to ship the old files unless they are
willing to break compatibility here.

This is so useless! My application needs to be compatible with distros
from 1998, so how could I ever make use of the new file? I will have to
continue using the old ones!
True, if you need compatibility with really
old distributions you do. But for new code this might not be an issue, and in
general new APIs are new APIs. So if you decide to depend on it, you add a
dependency on it. However, even if you need to stay compatible it might make
sense to check /etc/os-release first and just fall back to the old
files if it doesn’t exist. The least it does for you is that you don’t need 25+
open() attempts on modern distributions, but just one.

You evil people are forcing my beloved distro $XYZ to adopt your awful
systemd schemes. I hate you!
You hate too much, my friend. Also, I am
pretty sure it’s not difficult to see the benefit of this new file
independently of systemd, and it’s truly useful on systems without systemd,
too.

I hate what you people do, can I just ignore this? Well, you really
need to work on your constant feelings of hate, my friend. But, to a certain
degree yes, you can ignore this for a while longer. But already, there are a
number of applications making use of this file. You lose compatibility with
those. Also, you are kinda working towards the further balkanization of the
Linux landscape, but maybe that’s your intention?

You guys add a new file because you think there are already too many? You
guys are so confused!
None of the existing files is generic and extensible
enough to do what we want it to do. Hence we had to introduce a new one. We
acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure,
it’s extensible, and we are happy if distributions extend it. Please prefix
your keys with your distribution’s name however. Or even better: talk to us and
we might be able to update the documentation and make your field standard, if you
convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to
identify the OS it is being built on or is being run on, please consider making
use of this new file, we created it for you. If you work on a distribution, and
your distribution doesn’t support this file yet, please consider adopting this
file, too.

If you are working on a small/embedded distribution, or a legacy-free
distribution we encourage you to adopt only this file and not establish any
other per-distro release file.

Read the documentation for /etc/os-release.

Footnotes

[1] Yes, multitude, there’s at least: /etc/redhat-release,
/etc/SuSE-release, /etc/debian_version,
/etc/arch-release, /etc/gentoo-release,
/etc/slackware-version, /etc/frugalware-release,
/etc/altlinux-release, /etc/mandriva-release,
/etc/meego-release, /etc/angstrom-version,
/etc/mageia-release. And some distributions even have multiple, for
example Fedora already has four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom,
Frugalware have adopted this. (This list is not comprehensive, there are
probably more.)

The Case for the /usr Merge

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/the-usr-merge.html

One of the features of Fedora 17 is the /usr merge, put
forward by Harald Hoyer and Kay Sievers[1]. Since this feature was
proposed, recurring discussions have taken place across the various Free
Software communities, usually asking the same questions: what the reasons
behind this feature are, and whether it makes sense to adopt the same scheme
for distribution XYZ, too.

Especially in the Non-Fedora world it appears to be socially unacceptable to
actually have a look at the Fedora feature page
(where many of the questions are already brought up and answered) which is very unfortunate. To
improve the situation I spent some time today to summarize the reasons for the
/usr merge independently. I’d hence like to direct you to this new page I put
up which tries to summarize the reasons for this, with an emphasis on the
compatibility point of view:

The Case for the /usr Merge

Note that even though this page is in the systemd wiki, what it covers is
mostly orthogonal to systemd. systemd supports both systems with a merged /usr
and with a split /usr, and the /usr merge should be interesting for non-systemd
distributions as well.

Primarily I put this together to have a nice place to point all those folks
who continue to write me annoyed emails, even though I am actually not even
working on all of this…

Enjoy the read!

Footnotes:

[1] And not actually by me, I am just a supportive spectator and am
not doing any work on it. Unfortunately some tech press folks created the false
impression I was behind this. But credit where credit is due, this is all
Harald’s and Kay’s work.