All posts by Lennart Poettering

Berlin Open Source Meetup

2012-08-06 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/berlin-open-source-meetup.html

Chris Kühl and I are organizing a Berlin
Open Source Meetup on Aug 19th at the Prater Biergarten in Prenzlauer Berg.
If you live in Berlin (or are passing by) and are involved in or interested in
Open Source then you are invited!

There’s also a Google+ event for the meetup.

It’s a public event, so everybody is welcome, and please feel free to invite others!

See you at the Prater!

Upcoming Hackfests/Sprints

2012-07-20 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/hackfests.html

The Linux Plumbers
Conference 2012 will take place August 29th to 31st in San Diego,
California. We, the systemd
developers, would like to invite you to two hackfests/sprints that will happen
around LPC:

San Diego: libvirt/LXC/systemd/SELinux Integration Hackfest

On the 28th of August we’ll have a hackfest on the topic of closer
integration of libvirt, LXC, systemd and SELinux, colocated with LPC in
San Diego, California. We’ll have a number of key people from these projects
participating, including Dan Walsh, Eric Paris, Daniel P. Berrange, Kay
Sievers and myself.

Topics we’ll cover: making Fedora/Linux boot entirely cleanly in
normal containers, teaching systemd’s control tools minimal
container-awareness (such as being able to list all services of all
containers in one go, in addition to those running on the host
system), unified journal logging across multiple containers, the systemd
container interface, auditing and containers, running multiple
instances from the same /usr tree, and a lot more…

Who should attend? Everybody hacking on the mentioned
projects who wants to help integrate them with the
goal of turning them into a secure, reliable, powerful container
solution for Linux.

Who should not attend? If you don’t hack on any of these
projects, or if you are not interested in closer integration of at
least two of these projects.

How to register? Just show up. You get extra points however
for letting us know in advance (just send us an email). Attendance is
free.

➥ See also: Google+ Event

San Francisco: systemd Journal Sprint

On September 3-7 we’ll have a sprint on the topic of the systemd
Journal. It’s going to take place at the Pantheon headquarters in San
Francisco, California. Among others, Kay Sievers, David Strauss and I will participate.

Who should attend? Everybody who wants to help improve the
systemd Journal, whether in its core, in client software
for it, hooking up other projects or writing library bindings for
it. Also, if you are using or planning to use the journal for a
project, we’d be very interested in high-bandwidth face-to-face
feedback regarding what you are missing, what you don’t like so much, and what
you find awesome in the Journal.

How to register? Please sign up at EventBrite. Attendance is
free. For more information see the invitation
mail.

➥ See also: Google+ Event

See you in California!

foss.in 2012 CFP Ends in a Few Hours

2012-07-08 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/fossin2012.html

foss.in 2012 in Bangalore takes place again after a
hiatus of some years. It has always been a fantastic conference, and a great opportunity to
visit Bangalore and India. I just submitted my talk proposals, so, hurry up, and submit yours!

systemd for Administrators, Part XV

2012-06-28 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/watchdog.html

Quickly following the previous iteration, here’s now the fifteenth
installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd:
the embedded/mobile folks, the desktop people and the server
folks. While the systems used by embedded/mobile tend to be
underpowered and have few resources available, desktops tend to be
much more powerful machines — but still much less resourceful than
servers. Nonetheless there are surprisingly many features that matter
to both extremes of this axis (embedded and servers), but not the
center (desktops). One of them is support for watchdogs in
hardware and software.

Embedded devices frequently rely on watchdog hardware that resets
them automatically if software stops responding (more specifically,
stops signalling the hardware in fixed intervals that it is still
alive). This is required to increase reliability and to make sure that,
regardless of what happens, every attempt is made to get the system
working again. Functionality like this makes little sense on the
desktop^[1]. However, on
high-availability servers watchdogs are frequently used, again.

Starting with version 183 systemd provides full support for
hardware watchdogs (as exposed in /dev/watchdog to
userspace), as well as supervisor (software) watchdog support for
individual system services. The basic idea is the following: if enabled,
systemd will regularly ping the watchdog hardware. If systemd or the
kernel hang this ping will not happen anymore and the hardware will
automatically reset the system. This way systemd and the kernel are
protected from boundless hangs — by the hardware. To make the chain
complete, systemd then exposes a software watchdog interface for
individual services so that they can also be restarted (or some other
action taken) if they begin to hang. This software watchdog logic can
be configured individually for each service in the ping frequency and
the action to take. Putting both parts together (i.e. hardware
watchdogs supervising systemd and the kernel, as well as systemd
supervising all other services) we have a reliable way to watchdog
every single component of the system.
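
To make the idea of a keep-alive ping a bit more concrete, here is a
minimal Python sketch of a process pinging a watchdog device. It is an
illustration only, not systemd's code: it assumes the Linux convention
that any write to the device counts as a ping, and that writing the
character V right before closing asks the driver to disarm the watchdog
(drivers may be configured to ignore that). The function name
keep_alive and its parameters are made up for this example.

```python
import time


def keep_alive(device, pings=3, interval=10.0):
    """Ping the watchdog `device` `pings` times, `interval` seconds apart."""
    # Opening the device is what arms the watchdog on typical drivers.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(pings):
            wd.write(b"\0")       # any write counts as a keep-alive ping
            time.sleep(interval)  # must stay below the hardware timeout
        # "Magic close": a final 'V' asks the driver to disarm the
        # watchdog instead of resetting the machine after we exit.
        wd.write(b"V")
```

Pointing this at the real /dev/watchdog needs root and will reset the
machine if the process gets stuck between pings, so try it against a
scratch file first.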

To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in
/etc/systemd/system.conf. It defaults to 0 (i.e. no hardware
watchdog use). Set it to a value like 20s and the watchdog is
enabled. After 20s of no keep-alive pings the hardware will reset
itself. Note that systemd will send a ping to the hardware at half the
specified interval, i.e. every 10s. And that’s already all there is to
it. By enabling this single, simple option you have turned on
supervision by the hardware of systemd and the kernel beneath
it.^[2]

Note that the hardware watchdog device (/dev/watchdog) is
single-user only. That means that you can either enable this
functionality in systemd, or use a separate external watchdog daemon,
such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be
configured in /etc/systemd/system.conf. It controls the
watchdog interval to use during reboots. It defaults to 10min, and
adds extra reliability to the system reboot logic: if a clean reboot
is not possible and shutdown hangs, we rely on the watchdog hardware
to reset the system abruptly, as an extra safety net.

So much for the hardware watchdog logic. These two options are
really everything that is necessary to make use of the hardware
watchdogs. Now, let’s have a look at how to add watchdog logic to
individual services.

First of all, to make software watchdog-supervisable it needs to be
patched to send out “I am alive” signals in regular intervals in its
event loop. Patching this is relatively easy. First, a daemon needs to
read the WATCHDOG_USEC= environment variable. If it is set,
it will contain the watchdog interval in usec formatted as ASCII text
string, as it is configured for the service. The daemon should then
issue sd_notify("WATCHDOG=1")
calls every half of that interval. A daemon patched this way should
transparently support watchdog functionality by checking whether the
environment variable is set and honouring the value it is set to.
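
As a rough sketch of the daemon side, here is what this could look like
in Python rather than C. It assumes the notification protocol described
in sd_notify(3), where the service manager passes the path of a datagram
socket in the NOTIFY_SOCKET environment variable (a detail not covered
in this post). The helper names watchdog_ping_interval and sd_notify are
merely chosen for this example.

```python
import os
import socket


def watchdog_ping_interval(environ=os.environ):
    """Return the seconds to wait between pings, or None if unsupervised."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    # The value is in microseconds; ping at half the configured interval.
    return int(usec) / 1_000_000 / 2


def sd_notify(state, environ=os.environ):
    """Send `state` to the service manager; return False if there is none."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a socket in the abstract namespace.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

A daemon would call watchdog_ping_interval() once at startup and, if it
returns a number, call sd_notify("WATCHDOG=1") from its event loop at
least that often. When the variables are unset both helpers do nothing,
so the same binary still runs fine outside of systemd.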

To enable the software watchdog logic for a service (which has been
patched to support the logic pointed out above) it is sufficient to
set WatchdogSec= to the desired failure latency. See systemd.service(5)
for details on this setting. This causes WATCHDOG_USEC= to be
set for the service’s processes and will cause the service to enter a
failure state as soon as no keep-alive ping is received within the
configured interval.

If a service enters a failure state as soon as the watchdog logic
detects a hang, then this is hardly sufficient to build a reliable
system. The next step is to configure whether the service shall be
restarted and how often, and what to do if it then still fails. To
enable automatic service restarts on failure set
Restart=on-failure for the service. To configure how many
restart attempts are made, use the combination
of StartLimitBurst= and StartLimitInterval=, which
allow you to configure how often a service may restart within a time
interval. If that limit is reached, a special action can be
taken. This action is configured with StartLimitAction=. The
default is none, i.e. no further action is taken and
the service simply remains in the failure state without any further
attempted restarts. The other three possible values are
reboot, reboot-force and
reboot-immediate. reboot attempts a clean reboot,
going through the usual, clean shutdown logic. reboot-force
is more abrupt: it will not actually try to cleanly shut down any
services, but immediately kills all remaining services and unmounts
all file systems and then forcibly reboots (this way all file systems
will be clean but reboot will still be very fast). Finally,
reboot-immediate does not attempt to kill any process or
unmount any file systems. Instead it just hard reboots the machine
without delay. reboot-immediate hence comes closest to a
reboot triggered by a hardware watchdog. All these settings are
documented in systemd.service(5).

Putting this all together we now have pretty flexible options to
watchdog-supervise a specific service and configure automatic restarts
of the service if it hangs, plus take ultimate action if that doesn’t
help.

Here’s an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn’t pinged
the system manager for longer than 30s or if it fails otherwise. If it
is restarted this way more often than 4 times in 5min action is taken
and the system quickly rebooted, with all file systems being clean
when it comes up again.
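
The restart limit in this example can be pictured as a simple counter
over a time window. The following Python sketch models that idea under
simplifying assumptions. The class StartLimit is hypothetical and is
not how systemd implements StartLimitInterval= and StartLimitBurst=
internally; it only mirrors the behaviour described above.

```python
class StartLimit:
    """Allow at most `burst` starts per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow_start(self, now):
        """Return True if a (re)start at time `now` is still permitted."""
        # Open a fresh window once the previous one has expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        # Limit hit: this is where StartLimitAction= would be carried out.
        return False
```

With interval=300 and burst=4 (the values from the unit file above),
the fifth start attempt within five minutes is refused, which is the
point where the configured reboot-force action would kick in.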

And that’s already all I wanted to tell you about! With hardware
watchdog support right in PID 1, as well as supervisor watchdog
support for individual services we should provide everything you need
for most watchdog use cases. Regardless of whether you are building an
embedded or mobile appliance, or working with high-availability
servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with
/dev/watchdog, and why this shouldn’t be kept in a separate
daemon, then please read this again and try to understand that this is
all about the supervisor chain we are building here, where the hardware watchdog
supervises systemd, and systemd supervises the individual
services. Also, we believe that a service not responding should be
treated in a similar way as any other service error. Finally, pinging
/dev/watchdog is one of the most trivial operations in the OS
(basically little more than an ioctl() call), so the support for this
is not more than a handful of lines of code. Maintaining this externally
with complex IPC between PID 1 (and the daemons) and this watchdog
daemon would be drastically more complex, error-prone and resource
intensive.)

Note that the built-in hardware watchdog support of systemd does
not conflict with other watchdog software by default. systemd does not
make use of /dev/watchdog by default, and you are welcome to
use external watchdog daemons in conjunction with systemd, if this
better suits your needs.

And one last thing: if you wonder whether your hardware has a
watchdog, then the answer is: almost definitely yes — if it is anything more
recent than a few years. If you want to verify this, try the wdctl
tool from recent util-linux, which shows you everything you need to
know about your watchdog hardware.

I’d like to thank the great folks from Pengutronix for contributing
most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog
hardware these days too, as this is cheap to build and available in
most modern PC chipsets.

[2] So, here’s a free tip for you if you hack on the core
OS: don’t enable this feature while you hack. Otherwise your system
might suddenly reboot if you are in the middle of tracing through PID
1 with gdb and cause it to be stopped for a moment, so that no
hardware ping can be done…

systemd for Administrators, Part XIV

2012-06-27 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/self-documented-boot.html

And here’s the fourteenth installment of my ongoing series on
systemd for Administrators:

The Self-Explanatory Boot

One complaint we often hear about systemd is
that its boot process is hard to understand, even
incomprehensible. In general I can only disagree with this sentiment; I
even believe in quite the opposite: in comparison to what we had
before — where to even remotely understand what was going on you had
to have a decent comprehension of the programming language that is
Bourne Shell^[1] — understanding systemd’s boot process is
substantially easier. However, like in many complaints there is some
truth in this frequently heard discomfort: for a seasoned Unix
administrator there indeed is a bit of learning to do when the switch
to systemd is
made. And as systemd developers it is our duty to make the learning
curve shallow, introduce as few surprises as we can, and provide
good documentation where that is not possible.

systemd has always had a huge body of documentation as manual
pages (nearly 100 individual pages now!), in the Wiki and
the various blog stories I posted. However, any amount of
documentation alone is not enough to make software easily
understood. In fact, thick manuals sometimes appear intimidating and
make the reader wonder where to start reading, if all he was
interested in was this one simple concept of the whole system.

Acknowledging all this we have now added a new, neat, little
feature to systemd: the self-explanatory boot process. What do we mean
by that? Simply that each and every single component of our boot comes
with documentation and that this documentation is closely linked to
its component, so that it is easy to find.

More specifically, all units in systemd (which are what
encapsulate the components of the boot) now include references to
their documentation, the documentation of their configuration files
and further applicable manuals. A user who is trying to understand the
purpose of a unit, how it fits into the boot process and how to
configure it can now easily look up this documentation with the
well-known systemctl status command. Here’s an example of how
this looks for systemd-logind.service:

$ systemctl status systemd-logind.service
systemd-logind.service - Login Service
	  Loaded: loaded (/usr/lib/systemd/system/systemd-logind.service; static)
	  Active: active (running) since Mon, 25 Jun 2012 22:39:24 +0200; 1 day and 18h ago
	    Docs: man:systemd-logind.service(7)
	          man:logind.conf(5)
	          http://www.freedesktop.org/wiki/Software/systemd/multiseat
	Main PID: 562 (systemd-logind)
	  CGroup: name=systemd:/system/systemd-logind.service
		  └ 562 /usr/lib/systemd/systemd-logind

Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event2 (Power Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event6 (Video Bus)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event0 (Lid Switch)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event1 (Sleep Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event7 (ThinkPad Extra Buttons)
Jun 25 22:39:25 epsilon systemd-logind[562]: New session 1 of user gdm.
Jun 25 22:39:25 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/42/X11-display.
Jun 25 22:39:32 epsilon systemd-logind[562]: New session 2 of user lennart.
Jun 25 22:39:32 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/500/X11-display.
Jun 25 22:39:54 epsilon systemd-logind[562]: Removed session 1.

At first glance this output changed very little. If you look
closer however you will find that it now includes one new field:
Docs lists references to the documentation of this
service. In this case there are two man page URIs and one web URL
specified. The man pages describe the purpose and configuration of
this service, the web URL includes an introduction to the basic
concepts of this service.

If the user uses a recent graphical terminal implementation it is
sufficient to click on the URIs shown to get the respective
documentation^[2]. In other words: it has never been this
easy to figure out what a specific component of our boot is about:
just use systemctl status to get more information about it
and click on the links shown to find the documentation.

Over the past few days I have written man pages and added these references
for every single unit we ship with systemd. This means, with
systemctl status you now have a very easy way to find out
more about every single service of the core OS.

If you are not using a graphical terminal (where you can just click
on URIs), a man page URI in the middle of the output of systemctl status is not the most useful thing to have. To make reading the
referenced man pages easier we have also added a new command:

systemctl help systemd-logind.service

This will open the listed man pages right away, without the need
to click anything or copy/paste a URI.

The URIs are in the formats documented by the uri(7)
man page. Units may reference http and https URLs, as well as man and
info pages.
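
To illustrate how such references can be processed, here is a small
Python sketch that classifies a documentation URI and turns a man page
URI into a man(1) command line. This is not what systemctl help does
internally; the function names classify and man_command are invented
for this example, and the pattern only covers the simple
man:name(section) form seen in the status output above.

```python
import re

SUPPORTED_SCHEMES = ("man", "info", "http", "https")
_MAN_RE = re.compile(r"^man:([^()\s]+)\((\w+)\)$")


def classify(uri):
    """Return the URI scheme if units may reference it, else None."""
    scheme, sep, rest = uri.partition(":")
    if sep and rest and scheme in SUPPORTED_SCHEMES:
        return scheme
    return None


def man_command(uri):
    """Turn 'man:name(section)' into an argv list for man(1), else None."""
    match = _MAN_RE.match(uri)
    if not match:
        return None
    page, section = match.groups()
    return ["man", section, page]
```

For man:systemd-logind.service(7) this yields the arguments for
"man 7 systemd-logind.service", which could then be handed to
subprocess.run().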

Of course all this doesn’t make everything self-explanatory, simply
because the user still has to find out about systemctl status
(and even systemctl in the first place so that he even knows
what units there are); however with this basic knowledge further
help on specific units is in very easy reach.

We hope that this kind of interlinking of runtime behaviour and the
matching documentation is a big step forward to make our boot easier
to understand.

This functionality is partially already available in Fedora 17, and
will show up in complete form in Fedora 18.

That all said, credit where credit is due: this kind of reference
to documentation within the service descriptions is not new; Solaris’
SMF had similar functionality for quite some time. However, we believe
this new systemd feature is certainly a novelty on Linux, and with
systemd we now offer you the best documented and best self-explaining
init system.

Of course, if you are writing unit files for your own packages,
please consider also including references to the documentation of your
services and its configuration. This is really easy to do, just list
the URIs in the new Documentation= field in the
[Unit] section of your unit files. For details see systemd.unit(5). The
more comprehensively we include links to documentation in our OS
services the easier the work of administrators becomes. (To make sure
Fedora makes comprehensive use of this functionality I filed a bug on
FPC).
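
If you want to check which references a unit file already carries, a
few lines of Python are enough for a rough look. The sketch below
assumes the syntax described in systemd.unit(5): Documentation= holds a
space-separated list of URIs, may appear more than once, and an empty
assignment resets the list. It ignores line continuations and drop-in
files, and documentation_uris is just a name picked for this example.

```python
def documentation_uris(unit_text):
    """Collect the Documentation= URIs from the [Unit] section."""
    section = None
    uris = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue                      # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]          # entering a new section
            continue
        key, sep, value = line.partition("=")
        if section == "Unit" and sep and key.strip() == "Documentation":
            if value.strip():
                uris.extend(value.split())
            else:
                uris.clear()              # empty assignment resets the list
    return uris
```

Feeding it the text of a unit file that contains
Documentation=man:mylittled(8) in its [Unit] section returns a list
with that single URI.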

Oh, and BTW: if you are looking for a rough overview of systemd’s
boot process here’s
another new man page we recently added, which includes a pretty
ASCII flow chart of the boot process and the units involved.

Footnotes

[1] Which TBH is a pretty crufty, strange one on top.

[2] Well, a terminal
where this bug is fixed (used together with a help
browser where this one is fixed).

Presentation in Warsaw

2012-05-24 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/warsaw.html

I recently had the chance to speak about systemd
and other projects, as well as the politics behind them at a Bar Camp in Warsaw,
organized by the fine people of OSEC. The presentation has been recorded,
and has now been posted online. It’s a very long recording (1:43h),
but it’s quite interesting (as I’d like to believe) and contains a bit
of background on where we are coming from and where we are going. Anyway,
please have a look. Enjoy!

I’d like to thank the organizers for this great event and for
publishing the recording online.

systemd for Administrators, Part XIII

2012-05-18 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/systemctl-journal.html

Here’s the thirteenth installment of my ongoing series on
systemd for Administrators:

Log and Service Status

This one is a short episode. One of the most commonly used commands
on a systemd
system is systemctl status which may be used to determine the
status of a service (or other unit). It always has been a valuable
tool to figure out the processes, runtime information and other meta
data of a daemon running on the system.

With Fedora 17 we introduced the
journal, our new logging scheme that provides structured, indexed
and reliable logging on systemd systems, while providing a certain
degree of compatibility with classic syslog implementations. The
original reason we started to work on the journal was one specific
feature idea, that to the outsider might appear simple but without the
journal is difficult and inefficient to implement: along with the
output of systemctl status we wanted to show the last 10 log
messages of the daemon. Log data is some of the most essential bits of
information we have on the status of a service. Hence it is an
obvious choice to show next to the general status of the
service.

And now to make it short: at the same time as we integrated the
journal into systemd and Fedora we also hooked up
systemctl with it. Here’s an example output:

$ systemctl status avahi-daemon.service
avahi-daemon.service - Avahi mDNS/DNS-SD Stack
	  Loaded: loaded (/usr/lib/systemd/system/avahi-daemon.service; enabled)
	  Active: active (running) since Fri, 18 May 2012 12:27:37 +0200; 14s ago
	Main PID: 8216 (avahi-daemon)
	  Status: "avahi-daemon 0.6.30 starting up."
	  CGroup: name=systemd:/system/avahi-daemon.service
		  ├ 8216 avahi-daemon: running [omega.local]
		  └ 8217 avahi-daemon: chroot helper

May 18 12:27:37 omega avahi-daemon[8216]: Joining mDNS multicast group on interface eth1.IPv4 with address 172.31.0.52.
May 18 12:27:37 omega avahi-daemon[8216]: New relevant interface eth1.IPv4 for mDNS.
May 18 12:27:37 omega avahi-daemon[8216]: Network interface enumeration completed.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 192.168.122.1 on virbr0.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for fd00::e269:95ff:fe87:e282 on eth1.*.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for 172.31.0.52 on eth1.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering HINFO record with values 'X86_64'/'LINUX'.
May 18 12:27:38 omega avahi-daemon[8216]: Server startup complete. Host name is omega.local. Local service cookie is 3555095952.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/ssh.service) successfully established.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/sftp-ssh.service) successfully established.

This, of course, shows the status of everybody’s favourite
mDNS/DNS-SD daemon with a list of its processes, along with — as
promised — the 10 most recent log lines. Mission accomplished!

There are a couple of switches available to alter the output
slightly and adjust it to your needs. The two most interesting
switches are -f to enable follow mode (as in tail -f) and -n to change the number of lines to show (you
guessed it, as in tail -n).

The log data shown comes from three sources: everything any of the
daemon’s processes logged with libc’s syslog() call,
everything submitted using the native Journal API, plus everything any
of the daemon’s processes logged to STDOUT or STDERR. In short:
everything the daemon generates as log data is collected, properly
interleaved and shown in the same format.

And that’s it already for today. It’s a very simple feature, but an
immensely useful one for every administrator. One of the “Why didn’t
we already do this 15 years ago?” kind.

Stay tuned for the next installment!

Boot & Base OS Miniconf at Linux Plumbers Conference 2012, San Diego

2012-05-03 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/lpc2012.html

We are working on putting together a miniconf on
the topic of Boot & Base OS for the Linux Plumbers Conference 2012 in San
Diego (Aug 29-31). And we need your submission!

Are you working on some exciting project related to Boot and Base OS and
would like to present your work? Then please submit something following
these guidelines, but please CC Kay Sievers and Lennart Poettering.

I hope that at this point the Linux Plumbers Conference
needs little introduction, so I will spare any further prose on how great and
useful and the best conference ever it is for everybody who works on the plumbing
layer of Linux. However, there’s one conference that will be co-located with
LPC that is still little known, because it happens for the first time: The C Conference, organized by Brandon Philips
and friends. It covers all things C, and they are still looking for more
topics, in a reverse CFP. Please
consider submitting a proposal and registering to the conference!

The Most Awesome, Least-Advertised Fedora 17 Feature

2012-05-02 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/multi-seat.html

There’s one feature in the upcoming Fedora 17 release that is
immensely useful but very little known, since its feature page
‘ckremoval’ does not explicitly refer to it in its name: true
automatic multi-seat support for Linux.

A multi-seat computer is a system that offers not only one local
seat for a user, but multiple, at the same time. A seat refers to a
combination of a screen, a set of input devices (such as mice and
keyboards), and maybe an audio card or webcam, as individual local
workplace for a user. A multi-seat computer can drive an entire class
room of seats with only a fraction of the cost in hardware, energy,
administration and space: you only have one PC, which usually has way
enough CPU power to drive 10 or more workplaces. (In fact, even a
Netbook is fast enough to drive a couple of seats!) Automatic
multi-seat refers to an entirely automatically managed seat setup:
whenever a new seat is plugged in a new login screen immediately
appears, without any manual configuration, and when the seat is
unplugged all user sessions on it are removed without delay.

In Fedora 17 we added this functionality to the low-level user and
device tracking of systemd, replacing the previous ConsoleKit logic
that lacked support for automatic multi-seat. With all the ground work
done in systemd, udev and the other components of our plumbing layer
the last remaining bits were surprisingly easy to add.

Currently, the automatic multi-seat logic works best with the USB
multi-seat hardware from Plugable
you can buy cheaply on Amazon
(US). These devices require exactly zero configuration with the
new scheme implemented in Fedora 17: just plug them in at any time,
login screens pop up on them, and you have your additional
seats. Alternatively you can also assemble your seat manually with a
few easy loginctl
attach commands, from any kind of hardware you might have lying
around. To get a full seat you need multiple graphics cards, keyboards
and mice: one set for each seat. (Later on we’ll probably have a graphical
setup utility for additional seats, but that’s not a pressing issue we
believe, as the plug-n-play multi-seat support with the Plugable
devices is so awesomely nice.)

Plugable provided us for free with hardware for testing
multi-seat. They are also involved with the upstream development of
the USB DisplayLink driver for Linux. Due to their positive
involvement with Linux we can only recommend buying their
hardware. They are good guys, and support Free Software the way all
hardware vendors should! (And besides that, their hardware is also
nicely put together. For example, in contrast to most similar vendors
they actually assign proper vendor/product IDs to their USB hardware
so that we can easily recognize their hardware when plugged in to set
up automatic seats.)

Currently, all this magic is only implemented in the GNOME stack
with the biggest component getting updated being the GNOME Display
Manager. On the Plugable USB hardware you get a full GNOME Shell
session with all the usual graphical gimmicks, the same way as on any
other hardware. (Yes, GNOME 3 works perfectly fine on simpler graphics
cards such as these USB devices!) If you are hacking on a different
desktop environment, or on a different display manager, please have a
look at the
multi-seat documentation we put together, and particularly at
our short piece about writing
display managers which are multi-seat capable.

If you work on a major desktop environment or display manager and
would like to implement multi-seat support for it, but lack the
aforementioned Plugable hardware, we might be able to provide you with
the hardware for free. Please contact us directly, and we might be
able to send you a device. Note that we don’t have unlimited devices
available, hence we’ll probably not be able to pass hardware to
everybody who asks, and we will pass the hardware preferably to people
who work on well-known software or otherwise have contributed good
code to the community already. Anyway, if in doubt, ping us, and
explain to us why you should get the hardware, and we’ll consider you!
(Oh, and this not only applies to display managers, if you hack on some other
software where multi-seat awareness would be truly useful, then don’t
hesitate and ping us!)

Phoronix has this
story about this new multi-seat support which is quite interesting and
full of pictures. Please have a look.

Plugable started a Pledge
drive to lower the price of the Plugable USB multi-seat terminals
further. It’s full of pictures (and a video showing all this in action!), and uses the code we now make
available in Fedora 17 as base. Please consider pledging a few
bucks.

Recently David Zeuthen added
multi-seat support to udisks as well. With this in place, a user
logged in on a specific seat can only see the USB storage plugged into
his individual seat, but does not see any USB storage plugged into any
other local seat. With this in place we closed the last missing bit of
multi-seat support in our desktop stack.

With this code in Fedora 17 we cover the big use cases of
multi-seat already: internet cafes, class rooms and similar
installations can provide PC workplaces cheaply and easily without any
manual configuration. Later on we want to build on this and make this
useful for different uses too: for example, the ability to get a login
screen as easily as plugging in a USB connector makes this useful
not only for saving money in setups for many people, but also in embedded
environments (consider monitoring/debugging screens made available via
this hotplug logic) or servers (get trivially quick local access to
your otherwise head-less server). To be truly useful in these areas we
need one more thing though: the ability to run a simple getty
(i.e. text login) on the seat, without necessarily involving a
graphical UI.

The well-known X successor Wayland already comes out of the box with multi-seat
support based on this logic.

Oh, and BTW, as Ubuntu appears to be “focussing” on “clarity” in the
“cloud” now ;-), and chose Upstart instead of systemd, this feature
won’t be available in Ubuntu any time soon. That’s (one detail of) the
price Ubuntu has to pay for choosing to maintain its own (largely
legacy, such as ConsoleKit) plumbing stack.

Multi-seat has a long history on Unix. Since the earliest days Unix
systems could be accessed by multiple local terminals at the same
time. Since then local terminal support (and hence multi-seat)
gradually moved out of view in computing. Very few machines these
days have more than one seat. The concept of terminals survived almost
exclusively in the context of PTYs (i.e. fully virtualized API
objects, disconnected from any real hardware seat) and VCs (i.e. a
single virtualized local seat), but hardly in any other way (well,
server setups still use serial terminals for emergency remote access,
but they almost never have more than one serial terminal). All that we
do in systemd is based on the ideas originally brought forward in
Unix; with systemd we now try to bring back a number of the good ideas
of Unix that have fallen by the wayside since the old days. For
example, in true Unix style we already started to expose the concept
of a service in the file system (in
/sys/fs/cgroup/systemd/system/), something where on Linux the
(often misunderstood) “everything is a file” mantra previously
fell short. With automatic multi-seat support we bring back support
for terminals, but updated with all the features of today’s desktops:
plug and play, zero configuration, full graphics, and not limited to
input devices and screens, but extending to all kinds of devices, such
as audio, webcams or USB memory sticks.

Anyway, this is all for now; I’d like to thank everybody who was
involved with making multi-seat work so nicely and natively on the
Linux platform. You know who you are! Thanks a ton!

systemd Status Update

2012-04-21 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/systemd-update-3.html

It
has been way too long since my last status update on
systemd. Here’s another short, incomprehensive status update on
what we worked on for systemd since
then.

We have been working hard to turn systemd into the most viable set
of components to build operating systems, appliances and devices from,
and make it the best choice for servers, for desktops and for embedded
environments alike. I think we have a really convincing set of
features now, but we are actively working on making it even
better.

Here’s a list of some more and some less interesting features, in
no particular order:

We added an automatic pager to systemctl (and related tools), similar
to how git has it.
systemctl learnt a new switch --failed, to show only
failed services.
You may now start services immediately, overriding all dependency
logic by passing --ignore-dependencies to
systemctl. This is mostly a debugging tool and nothing people
should use in real life.
Sending SIGKILL as final part of the implicit shutdown
logic of services is now optional and may be configured with the
SendSIGKILL= option individually for each service.
We split off the Vala/Gtk tools into their own project systemd-ui.
systemd-tmpfiles learnt file globbing and creating FIFO
special files as well as character and block device nodes, and
symlinks. It also is capable of relabelling certain directories at
boot now (in the SELinux sense).
Immediately before shutting down we will now invoke all binaries
found in /lib/systemd/system-shutdown/, which is useful for
debugging late shutdown.
You may now globally control where STDOUT/STDERR of services goes
(unless individual service configuration overrides it).
There’s a new ConditionVirtualization= option, that makes
systemd skip a specific service if a certain virtualization technology
is found or not found. Similarly, we now have a new option to detect
whether a certain security technology (such as SELinux) is available,
called ConditionSecurity=. There’s also
ConditionCapability= to check whether a certain process
capability is in the capability bounding set of the system. There
are also the new options ConditionFileIsExecutable=,
ConditionPathIsMountPoint=,
ConditionPathIsReadWrite= and
ConditionPathIsSymbolicLink=.
The file system condition directives now support globbing.
Service conditions may now be “triggering” and “mandatory”, meaning that
they can be a necessary requirement to hold for a service to start, or
simply one trigger among many.
At boot time we now print warnings if /usr
is on a split-off partition but not already mounted by an initrd,
if /etc/mtab is not a symlink to /proc/mounts, or if CONFIG_CGROUPS
is not enabled in the kernel. We’ll also expose this as a
tainted flag on the bus.
You may now boot the same OS image on a bare metal machine and in
Linux namespace containers and will get a clean boot in both
cases. This is more complicated than it sounds since device management
with udev or write access to /sys, /proc/sys or
things like /dev/kmsg is not available in a container. This
makes systemd a first-class choice for managing thin container
setups. This is all tested with systemd’s own systemd-nspawn
tool but should work fine in LXC setups, too. Basically this means
that you do not have to adjust your OS manually to make it work in a
container environment; it will just work out of the box. It also
makes it easier to convert real systems into containers.
We now automatically spawn gettys on HVC ttys when booting in VMs.
We introduced /etc/machine-id as a generalization of
D-Bus machine ID logic. See this
blog story for more information. On stateless/read-only systems
the machine ID is initialized randomly at boot. In virtualized
environments it may be passed in from the machine manager (with qemu’s
-uuid switch, or via the container
interface).
All of the systemd-specific /etc/fstab mount options are
now in the x-systemd-xyz format.
To make it easy to find non-converted services we will now
implicitly prefix all LSB and SysV init script descriptions with the
strings “LSB:” or “SYSV:”, respectively.
We introduced /run and made it a hard dependency of
systemd. This directory is now widely accepted and implemented on all
relevant Linux distributions.
systemctl can now execute all its operations remotely too (-H switch).
We now ship systemd-nspawn,
a really powerful tool that can be used to start containers for
debugging, building and testing, much like chroot(1). It is useful to
just get a shell inside a build tree, but is good enough to boot up a
full system in it, too.
If we query the user for a hard disk password at boot he may hit
TAB to hide the asterisks we normally show for each key that is
entered, for extra paranoia.
We don’t enable udev-settle.service anymore, which is
only required for certain legacy software that still hasn’t been
updated to follow devices coming and going cleanly.
We now include a tool that can plot boot speed graphs, similar to
bootchartd, called systemd-analyze.
At boot, we now initialize the kernel’s binfmt_misc logic with the data from /etc/binfmt.d.
systemctl now recognizes if it is run in a chroot()
environment and will work accordingly (i.e. apply changes to the tree
it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
There’s a new unit dependency type OnFailureIsolate= that
allows entering a different target whenever a certain unit fails. For
example, this is interesting to enter emergency mode if file system
checks of crucial file systems failed.
Socket units may now listen on Netlink sockets, special files
from /proc and POSIX message queues, too.
There’s a new IgnoreOnIsolate= flag which may be used to
ensure certain units are left untouched by isolation requests. There’s
a new IgnoreOnSnapshot= flag which may be used to exclude
certain units from snapshot units when they are created.
There are now small mechanism services for
changing the local hostname and other host meta data, changing
the system locale and console settings and the system
clock.
We now limit the capability bounding set for a number of our
internal services by default.
Plymouth may now be disabled globally with
plymouth.enable=0 on the kernel command line.
We now disallocate VTs when a getty finished running (and
optionally other tools run on VTs). This adds extra security since it
clears up the scrollback buffer so that subsequent users cannot get
access to a user’s session output.
In socket units there are now options to control the
IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED,
SO_PASSSEC socket options.
The receive and send buffers of socket units may now be set larger
than the default system settings if needed by using
SO_{RCV,SND}BUFFORCE.
We now set the hardware timezone as one of the first things in PID
1, in order to avoid time jumps during normal userspace operation, and
to guarantee sensible times on all generated logs. We also no longer
save the system clock to the RTC on shutdown, assuming that this is
done by the clock control tool when the user modifies the time, or
automatically by the kernel if NTP is enabled.
The SELinux directory got moved from /selinux to
/sys/fs/selinux.
We added a small service systemd-logind that keeps track
of logged in users and their sessions. It creates control groups for
them, implements the XDG_RUNTIME_DIR
specification for them, maintains seats and device node ACLs and
implements shutdown/idle inhibiting for clients. It auto-spawns gettys
on all local VTs when the user switches to them (instead of starting
six of them unconditionally), thus reducing the resource footprint by
default. It has a D-Bus interface as well as a
simple synchronous library interface. This mechanism obsoletes
ConsoleKit which is now deprecated and should no longer be used.
There’s now full, automatic multi-seat support, and this is
enabled in GNOME 3.4. Just by plugging in new seat hardware you get a
new login screen on your seat’s screen.
There is now an option ControlGroupModify= to allow
services to change the properties of their control groups dynamically,
and one to make control groups persistent in the tree
(ControlGroupPersistent=) so that they can be created and
maintained by external tools.
We now jump back into the initrd in shutdown, so that it can
detach the root file system and the storage devices backing it. This
allows (for the first time!) to reliably undo complex storage setups
on shutdown and leave them in a clean state.
systemctl now supports presets, a way for distributions and
administrators to define their own policies on whether services should
be enabled or disabled by default on package installation.
systemctl now has high-level verbs for masking/unmasking
units. There’s also a new command (systemctl list-unit-files)
for determining the list of all installed unit files and whether
they are enabled or not.
We now apply sysctl variables to each new network device, as it
appears. This makes /etc/sysctl.d compatible with hot-plug
network devices.
There’s limited profiling for SELinux start-up performance built
into PID 1.
There’s a new switch PrivateNetwork=
to turn off any network access for a specific service.
Service units may now include configuration for control group
parameters. A few (such as MemoryLimit=) are exposed with
high-level options, and all others are available via the generic
ControlGroupAttribute= setting.
There’s now the option to mount certain cgroup controllers
jointly at boot. We do this now for cpu and
cpuacct by default.
We added the
journal and turned it on by default.
All service output is now written to the Journal by default,
regardless of whether it is sent via syslog or simply written to
stdout/stderr. Both message streams end up in the same location and
are interleaved the way they should be. All log messages, even from the
kernel and from early boot, end up in the journal. Now, no service
output goes unnoticed, and all of it is saved and indexed at the same
location.
systemctl status will now show the last 10 log lines for
each service, directly from the journal.
We now show the progress of fsck at boot on the console,
again. We also show the much loved colorful [ OK ] status
messages at boot again, as known from most SysV implementations.
We merged udev into systemd.
We implemented and documented interfaces to container
managers and initrds
for passing execution data to systemd. We also implemented and
documented an
interface for storage daemons that are required to back the root file
system.
There are two new options in service files to propagate reload requests between several units.
systemd-cgls won’t show kernel threads by default anymore, or show empty control groups.
We added a new tool systemd-cgtop that shows resource
usage of whole services in a top(1) like fashion.
systemd may now supervise services in watchdog style. If enabled
for a service the daemon has to ping PID 1 in regular intervals
or is otherwise considered failed (which might then result in
restarting it, or even rebooting the machine, as configured). Also,
PID 1 is capable of pinging a hardware watchdog. Putting this
together, the hardware watchdogs PID 1 and PID 1 then watchdogs
specific services. This is highly useful for high-availability servers
as well as embedded machines. Since watchdog hardware is nowadays
built into all modern chipsets (including desktop chipsets), this
should hopefully help to make this a more widely used
functionality.
We added support for a new kernel command line option
systemd.setenv= to set an environment variable
system-wide.
By default services which are started by systemd will have SIGPIPE
set to ignored. The Unix SIGPIPE logic is used to reliably implement
shell pipelines and when left enabled in services is usually just a
source of bugs and problems.
You may now configure the rate limiting that is applied to
restarts of specific services. Previously the rate limiting parameters
were hard-coded (similar to SysV).
There’s now support for loading the IMA integrity policy into the
kernel early in PID 1, similar to how we already did it with the
SELinux policy.
There’s now an official API to schedule and query scheduled shutdowns.
We changed the license from GPL2+ to LGPL2.1+.
We made systemd-detect-virt
an official tool in the tool set. Since we already had code to detect
certain VM and container environments we now added an official tool
for administrators to make use of in shell scripts and suchlike.
We documented numerous
interfaces systemd introduced.
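
To make the watchdog item from the list above a little more concrete, here is a minimal Python sketch of how a supervised daemon might send its keep-alive ping to PID 1. It assumes the sd_notify() datagram protocol, which this post does not spell out: the service manager passes a socket address in $NOTIFY_SOCKET and the daemon sends the string WATCHDOG=1 to it. The function name notify() is made up for illustration, and a real daemon would normally just call sd_notify() from the systemd client library instead.

```python
import os
import socket


def notify(message, environ=None):
    """Send one notification datagram to the service manager.

    Returns False if $NOTIFY_SOCKET is not set (i.e. we are most likely
    not running under a supervisor that wants notifications).
    """
    env = os.environ if environ is None else environ
    address = env.get("NOTIFY_SOCKET")
    if not address:
        return False
    # A leading "@" denotes a Linux abstract namespace socket, which is
    # addressed with a leading NUL byte instead.
    if address.startswith("@"):
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True
```

A daemon would call notify("WATCHDOG=1") from its main loop at an interval comfortably shorter than the watchdog timeout configured for the service.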

Much of the stuff above is already available in Fedora 15 and 16,
or will be made available in the upcoming Fedora 17.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you the details.

I’d like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

Control Groups vs. Control Groups

2012-04-10 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/cgroups-vs-cgroups.html

TL;DR: systemd does not
require the performance-sensitive bits of Linux control groups enabled in the kernel.
However, it does require some non-performance-sensitive bits of the control
group logic.

In some areas of the community there’s still some confusion about Linux
control groups and their performance impact, and what precisely it is that
systemd requires of them. In the hope to clear this up a bit, I’d like to point
out a few things:

Control Groups are two things: (A) a way to hierarchically group and
label processes, and (B) a way to then apply resource limits
to these groups. systemd only requires the former (A), and not the latter (B).
That means you can compile your kernel without any control group resource
controllers (B) and systemd will work perfectly on it. However, if you in
addition disable the grouping feature entirely (A) then systemd will loudly
complain at boot and proceed only reluctantly with a big warning and in a
limited functionality mode.

At compile time, the grouping/labelling feature in the kernel is enabled by
CONFIG_CGROUPS=y, the individual controllers by CONFIG_CGROUP_FREEZER=y,
CONFIG_CGROUP_DEVICE=y, CONFIG_CGROUP_CPUACCT=y, CONFIG_CGROUP_MEM_RES_CTLR=y,
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y,
CONFIG_CGROUP_PERF=y, CONFIG_CGROUP_SCHED=y, CONFIG_BLK_CGROUP=y,
CONFIG_NET_CLS_CGROUP=y, CONFIG_NETPRIO_CGROUP=y. And since (as mentioned) we
only need the former (A), not the latter (B) you may disable all of the latter
options while enabling CONFIG_CGROUPS=y, if you want to run systemd on your
system.
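
As a rough illustration of the (A) versus (B) split, the following Python sketch scans kernel configuration text and reports whether the grouping feature is enabled and which of the controller options listed above are switched on. It assumes the usual .config syntax (OPTION=y for enabled options, comment lines for disabled ones); where that text comes from on a given system is distribution-specific and deliberately left out. The function names are made up for this example.

```python
# Option (A): the grouping/labelling feature systemd requires.
GROUPING_OPTION = "CONFIG_CGROUPS"

# Options (B): the individual controllers named above, all optional.
CONTROLLER_OPTIONS = [
    "CONFIG_CGROUP_FREEZER",
    "CONFIG_CGROUP_DEVICE",
    "CONFIG_CGROUP_CPUACCT",
    "CONFIG_CGROUP_MEM_RES_CTLR",
    "CONFIG_CGROUP_MEM_RES_CTLR_SWAP",
    "CONFIG_CGROUP_MEM_RES_CTLR_KMEM",
    "CONFIG_CGROUP_PERF",
    "CONFIG_CGROUP_SCHED",
    "CONFIG_BLK_CGROUP",
    "CONFIG_NET_CLS_CGROUP",
    "CONFIG_NETPRIO_CGROUP",
]


def enabled_options(config_text):
    """Return the set of option names that are set to 'y' in the text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep and value == "y":
            enabled.add(name)
    return enabled


def check_cgroup_config(config_text):
    """Summarize grouping (A) and controller (B) support in a kernel config."""
    enabled = enabled_options(config_text)
    return {
        "grouping": GROUPING_OPTION in enabled,
        "controllers": [o for o in CONTROLLER_OPTIONS if o in enabled],
    }
```

A result with "grouping" set to True and an empty "controllers" list corresponds to the minimal configuration described above: enough for systemd, with none of the performance-sensitive bits.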

What about the performance impact of these options? Well, every bit of code
comes at some price, so none of these options come entirely for free. However,
the grouping feature (A) alters the general logic very little, it just sticks
hierarchical labels on processes, and its impact is minimal since that is
usually not in any hot path of the OS. This is different for the various
controllers (B) which have a much bigger impact since they influence the resource
management of the OS and are full of hot paths. This means that the kernel
feature that systemd mandatorily requires (A) has a minimal effect on system
performance, but the actually performance-sensitive features of control groups
(B) are entirely optional.

On boot, systemd will mount all controller hierarchies it finds enabled
in the kernel to individual directories below /sys/fs/cgroup/. This is
the official place where kernel controllers are mounted to these days. The
/sys/fs/cgroup/ mount point in the kernel was created precisely for
this purpose. Since the control group controllers are a shared facility that
might be used by a number of different subsystems a few
projects have agreed on a set of rules in order to avoid that the various bits
of code step on each other’s toes when using these directories.

systemd will also maintain its own, private, controller-less, named control
group hierarchy which is mounted to /sys/fs/cgroup/systemd/. This
hierarchy is private property of systemd, and other software should not try to
interfere with it. This hierarchy is how systemd makes use of the naming and
grouping feature of control groups (A) without actually requiring any kernel
controller enabled for that.

Now, you might notice that by default systemd does create per-service
cgroups in the “cpu” controller if it finds it enabled in the kernel. This is
entirely optional, however. We chose to make use of it by default to even out
CPU usage between system services. Example: On a traditional web server machine
Apache might end up having 100 CGI worker processes around, while MySQL only
has 5 processes running. Without the use of the “cpu” controller this means
that Apache altogether ends up getting 20x as much CPU time as MySQL since
the kernel tries to provide every process with the same amount of CPU time. On
the other hand, if we add these two services to the “cpu” controller in
individual groups by default, Apache and MySQL get the same amount of CPU,
which we think is a good default.
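
The arithmetic behind this example can be written down in a few lines of Python. This is a deliberately idealized model, not a description of the real scheduler: it assumes every process is permanently runnable, that without grouping the CPU is split evenly per process, and that with grouping it is split evenly per service. The function names are invented for the illustration.

```python
def cpu_share_per_process(process_counts):
    """CPU fraction per service if every process gets an equal slice."""
    total = sum(process_counts.values())
    return {name: count / total for name, count in process_counts.items()}


def cpu_share_per_group(process_counts):
    """CPU fraction per service if every service (group) gets an equal slice."""
    groups = len(process_counts)
    return {name: 1 / groups for name in process_counts}


counts = {"apache": 100, "mysql": 5}
flat = cpu_share_per_process(counts)    # apache: 100/105, mysql: 5/105
grouped = cpu_share_per_group(counts)   # apache: 1/2, mysql: 1/2
```

Under these assumptions the per-process split gives Apache 100/105 of the CPU and MySQL 5/105, a ratio of 20 to 1, while the per-group split gives each service one half.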

Note that if the CPU controller is not enabled in the kernel systemd will not
attempt to make use of the “cpu” hierarchy as described above. Also, even if it is enabled in the kernel it
is trivial to tell systemd not to make use of it: Simply edit
/etc/systemd/system.conf and set DefaultControllers= to the
empty string.

Let’s discuss a few frequently heard complaints regarding systemd’s use of control groups:

systemd mounts all controllers to /sys/fs/cgroup/ even though
my software requires it at /dev/cgroup/ (or some other place)! The
standardization of /sys/fs/cgroup/ as mount point of the hierarchies
is a relatively recent change in the kernel. Some software has not been updated
yet for it. If you cannot change the software in question you are welcome to
unmount the hierarchies from /sys/fs/cgroup/ and mount them wherever
you need them instead. However, make sure to leave
/sys/fs/cgroup/systemd/ untouched.
systemd makes use of the “cpu” hierarchy, but it should keep its dirty
fingers off it! As mentioned above, just set the
DefaultControllers= option of systemd to the empty string.
I need my two controllers “foo” and “bar” mounted into one hierarchy,
but systemd mounts them in two! Use the JoinControllers= setting
in /etc/systemd/system.conf to mount several controllers into a single
hierarchy.
Control groups are evil and they make everything slower! Well,
please read the text above and understand the difference between
“control-groups-as-in-naming-and-grouping” (A) and “cgroups-as-in-controllers”
(B). Then, please turn off all controllers in your kernel build (B) but leave
CONFIG_CGROUPS=y (A) enabled.
I have heard some kernel developers really hate control groups
and think systemd is evil because it requires them! Well, there are a
couple of things behind the dislike of control groups by some folks.
Primarily, this is probably because the hackers in question do not
distinguish the naming-and-grouping bits of the control group logic (A) and the
controllers that are based on it (B). Mainly, their beef is with the latter
(which systemd does not require, which is the key point I am trying to make in
the text above), but there are other issues as well: for example, the code of
the grouping logic is not the most beautiful bit of code ever written by man
(which is thankfully likely to get better now, since the control groups
subsystem now has an active maintainer again). And then for some
developers it is important that they can compare the runtime behaviour of many
historic kernel versions in order to find bugs (git bisect). Since systemd
requires kernels with basic control group support enabled, and this is a
relatively recent feature addition to the kernel, this makes it difficult for
them to use a newer distribution with all these old kernels
that predate cgroups. Anyway, the summary is probably that what matters to
developers is different from what matters to users and
administrators.

I hope this explanation was useful for a reader or two! Thank you for your time!

GUADEC 2012 CFP Ending Soon!

2012-04-10 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/guadec-2012-cfp.html

In case you haven’t submitted your talk proposal for GUADEC 2012 in A
Coruña, Spain yet, hurry: the deadline is on April 14th, i.e. this
Saturday! Read the Call for
Participation! Submit a
proposal!

/tmp or not /tmp?

2012-03-28 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/tmp.html

A number of Linux distributions have recently switched (or started
switching) to /tmp on tmpfs by default (ArchLinux, Debian among
others). Other distributions have plans/are discussing doing the same (Ubuntu, OpenSUSE).
Since we believe this is a good idea and it’s good to keep the delta between
the distributions minimal we are proposing
the same for Fedora 18, too. On Solaris a similar change was already
implemented in 1994 (and other Unixes made a similar change long ago,
too). Yet, not all of our software is written in a way that it works nicely
together with /tmp on tmpfs.

Another Fedora
feature (for Fedora 17) changed the semantics of /tmp for many
system services to make them more secure, by isolating the /tmp namespaces of the
various services. Handling of temporary files in /tmp has been
security sensitive ever since it was introduced, because it traditionally has been
a world writable, shared namespace, and unless all user code safely uses randomized file names
it is vulnerable to DoS attacks and worse.

In this blog story I’d like to shed some light on proper usage of
/tmp and what your Linux application should use for what purpose. We’ll not
discuss why /tmp on tmpfs is a good idea, for that refer to the Fedora feature
page. Here we’ll just discuss what /tmp should be used for and for
what it shouldn’t be, as well as what should be used instead. All that in order
to make sure your application remains compatible with these new features
introduced to many newer Linux distributions.

/tmp is (as the name suggests) an area where temporary files
applications require during operation may be placed. Of course, temporary files
differ very much in their properties:

They can be large, or very small
They might be used for sharing between users, or be private to users
They might need to be persistent across boots, or very volatile
They might need to be machine-local or shared on the network

Traditionally, /tmp has not only been the place where actual
temporary files are stored, but some software used to place (and often still
continues to place) communication primitives such as sockets, FIFOs, shared
memory there as well. Notably X11, but many others too. Usage of world-writable
shared namespaces for communication purposes has always been problematic, since
to establish communication you need stable names, but stable names open the
doors for DoS attacks. This can be mitigated by establishing
protected per-app directories for certain services during early boot (like we
do for X11), but that only fixes the problem partially, since it only works
correctly if every package installation is followed by a reboot.

Besides /tmp there are various other places where temporary files
(or other files that traditionally have been stored in /tmp) can be
stored. Here’s a quick overview of the candidates:

/tmp, POSIX suggests this is flushed at boot, FHS says that files
do not need to be persistent between two runs of the application. Old files are
often cleaned up automatically after a time (“aging”). Usually it is
recommended to use $TMPDIR if it is set before falling back to /tmp
directly. As mentioned, this is a tmpfs on many Linuxes/Unixes (and most likely
will be for most soon), and hence should be used only for small files. It’s
generally a shared namespace, hence the only APIs for using it should be mkstemp(), mkdtemp() (and friends)
to be entirely safe.^[1] Recently, improvements have been made to
turn this shared namespace into a private namespace (see above), but that doesn’t
relieve developers from writing secure code that is also safe if /tmp is a shared
namespace. Because /tmp is no longer necessarily a shared namespace it
is generally unsuitable as a location for communication primitives. It is
machine-private and local. It’s usually fully featured (locking, …). This
directory is world writable and thus available for both privileged and
unprivileged code.
/var/tmp, according to FHS “more persistent” than /tmp,
and is less often cleaned up (it’s persistent across reboots, for example). It’s not on a tmpfs, but on a real disk, and
hence can be used to store much larger files. The same namespace problems apply
as with /tmp, hence also exclusively use
mkstemp()/mkdtemp() for this directory. It is also
automatically cleaned up by time. It is machine-private. It’s not necessarily
fully featured (no locking, …). This directory is world writable and thus
available for both privileged and unprivileged code. We suggest to also check
$TMPDIR before falling back to /var/tmp. That way if
$TMPDIR is set this overrides usage of both /tmp and
/var/tmp.
/run (traditionally /var/run) where privileged daemons
can store runtime data, such as communication primitives. This is where your
daemon should place its sockets. It’s guaranteed to be a shared namespace, but
is only writable by privileged code and hence very safe to use. This file
system is guaranteed to be a tmpfs and is hence automatically flushed at boot.
No automatic clean-up is done beyond that. It is machine-private and local. It
is fully-featured, and provides all functionality the local OS can provide
(locking, sockets, …).
$XDG_RUNTIME_DIR
where unprivileged user software can store runtime data, such as communication
primitives. This is similar to /run but for user applications. It’s a
user private namespace, and hence very safe to use. It’s cleaned up
automatically at logout and also is cleaned up by time via “aging”. It is
machine-private and fully featured. In GLib applications use
g_get_user_runtime_dir() to query the path of this directory.
$XDG_CACHE_HOME
where unprivileged user software can store non-essential data. It’s a private
namespace of the user. It might be shared between machines. It is not
automatically cleaned up, and not fully featured (no locking, and so on, due to
NFS). In GLib applications use g_get_user_cache_dir() to query this
directory.
$XDG_DOWNLOAD_DIR
where unprivileged user software can store downloads and downloads in progress.
It should only be used for downloads, and is a private namespace of the user,
but might be shared between machines. It is not automatically cleaned up and
not fully featured. In GLib applications use g_get_user_special_dir()
to query the path of this directory.

Now that we have introduced the contestants, here’s a rough guide how we
suggest you (a Linux application developer) pick the right directory to use:

You need a place to put your socket (or other communication primitive) and your code runs privileged: use a subdirectory beneath /run. (Or beneath /var/run for extra compatibility.)
You need a place to put your socket (or other communication primitive) and your code runs unprivileged: use a subdirectory beneath $XDG_RUNTIME_DIR.
You need a place to put your larger downloads and downloads in progress and run unprivileged: use $XDG_DOWNLOAD_DIR.
You need a place to put cache files which should be persistent and run unprivileged: use $XDG_CACHE_HOME.
None of the above applies and you need to place a small file that needs no persistency: use $TMPDIR with a fallback on /tmp. And use mkstemp() and mkdtemp(), and nothing homegrown.
Otherwise use $TMPDIR with a fallback on /var/tmp. Also use mkstemp()/mkdtemp().
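
The guide above can be sketched as a small Python decision function. The purpose labels ("socket", "download", "cache", "small-temp") and the function name are invented for this sketch, and it simply reads the variable names from a mapping. In particular it treats XDG_DOWNLOAD_DIR as if it were an ordinary environment variable, which is a simplification: real code would query it through a helper such as the g_get_user_special_dir() call mentioned earlier. Unset variables yield None rather than a guessed default.

```python
import os


def pick_directory(purpose, privileged=False, environ=None):
    """Return the base directory the guide suggests, or None if it is unset."""
    env = os.environ if environ is None else environ
    if purpose == "socket":
        # Communication primitives: /run for privileged code,
        # $XDG_RUNTIME_DIR for unprivileged code.
        return "/run" if privileged else env.get("XDG_RUNTIME_DIR")
    if purpose == "download" and not privileged:
        return env.get("XDG_DOWNLOAD_DIR")
    if purpose == "cache" and not privileged:
        return env.get("XDG_CACHE_HOME")
    if purpose == "small-temp":
        # Small, non-persistent file: $TMPDIR, falling back to /tmp.
        return env.get("TMPDIR") or "/tmp"
    # Everything else: $TMPDIR, falling back to /var/tmp.
    return env.get("TMPDIR") or "/var/tmp"
```

Whatever directory comes back, the advice to create a subdirectory for sockets and to use mkstemp()/mkdtemp() for temporary files still applies; the function only picks the location.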

Note that these rules above are only suggested by us. These rules
take into account everything we know about this topic and avoid problems with
current and future distributions, as far as we can see them. Please consider
updating your projects to follow these rules, and keep them in mind if you
write new code.

One thing we’d like to stress is that /tmp and /var/tmp
more often than not are actually not the right choice for your use case. There
are valid uses of these directories, but quite often another directory might
actually be the better place. So, be careful, consider the other options, but
if you do go for /tmp or /var/tmp then at least make sure to
use mkstemp()/mkdtemp().
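
For readers who do not write C, here is what that advice might look like in Python, whose tempfile.mkstemp() wraps the same idea: the file is created exclusively, under a randomized name, and readable and writable only by the creating user. The helper name write_scratch_file() and the "example-" prefix are made up for this sketch; the $TMPDIR handling follows the fallback rule suggested above.

```python
import os
import tempfile


def write_scratch_file(data, large=False):
    """Safely create a temporary file and return its path.

    Honors $TMPDIR, falling back to /var/tmp for large files and
    /tmp for small ones.
    """
    base = os.environ.get("TMPDIR") or ("/var/tmp" if large else "/tmp")
    # mkstemp() picks a random name and opens the file exclusively,
    # so a pre-created file or symlink of the same name cannot be hit.
    fd, path = tempfile.mkstemp(prefix="example-", dir=base)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The caller is responsible for removing the file again when it is no longer needed; for a whole scratch directory tempfile.mkdtemp() plays the same role.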

Thank you for your interest!

Oh, and if you now complain that we don’t understand Unix, and that we are
morons and worse, then please read this again, and you might notice that this
is just a best practice guide, not a specification we have written. Nothing that
introduces anything new, just something that explains how things are.

If you want to complain about the tmp-on-tmpfs or
ServicesPrivateTmp feature, then this is not the right place either,
because this blog post is not really about that. Please direct this to
fedora-devel instead. Thank you very much.

Footnotes

[1] Well, or to turn this around: unless you have a PhD in advanced
Unixology and are not using mkstemp()/mkdtemp() but use
/tmp nonetheless it’s very likely you are writing vulnerable
code.

/etc/os-release

2012-02-13 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/os-release.html

One of
the new configuration files systemd introduced is /etc/os-release.
It replaces the multitude of per-distribution release files^[1] with
a single one. Yesterday we decided
to drop support for systems lacking /etc/os-release
in systemd since recently the majority of the big distributions adopted
/etc/os-release and many small ones did, too^[2]. It’s our
hope that by dropping support for non-compliant distributions we gently put
some pressure on the remaining hold-outs to adopt this scheme as well.

I’d like to take the opportunity to explain a bit what the new file offers,
why application developers should care, and why the distributions should adopt
it. Of course, this file is pretty much a triviality in many ways,
but I guess it’s still one that deserves explanation.

So, you ask, why all this?

It relieves application developers who just want to know the
distribution they are running on of having to check for a multitude of individual release files.
It provides both a “pretty” name (i.e. one to show to the user), and
machine parsable version/OS identifiers (i.e. for use in build systems).
It is extensible and can easily learn new fields if needed. For example, since
we want to print a welcome message in the color of your distribution at boot
we make it possible to configure the ANSI color for that in the file.
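
To make the points above a bit more concrete, here is a minimal Python sketch of reading such a file. It assumes the shell-compatible KEY=value syntax described in the os-release documentation. The parse_os_release() helper and the sample values are made up for illustration and are not part of any official API.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex handles the shell-style quoting used for values with spaces.
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


# Made-up sample content, not taken from a real distribution.
SAMPLE = '''\
NAME="Example Linux"
ID=example
VERSION_ID=1
PRETTY_NAME="Example Linux 1 (Sketch)"
ANSI_COLOR="0;34"
'''

info = parse_os_release(SAMPLE)
print(info["PRETTY_NAME"])              # the name to show to the user
print(info["ID"], info["VERSION_ID"])   # identifiers for programmatic checks
```

To read the real file, the same function can be fed the contents of /etc/os-release.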

FAQs

There’s already the lsb_release tool for this, why don’t you
just use that? Well, it’s a very strange interface: a shell script you have
to invoke (and hence spawn asynchronously from your C code), and it’s not
written to be extensible. It’s an optional package in many distributions, and
nothing we’d be happy to invoke as part of early boot in order to show a
welcome message. (In times with sub-second userspace boot times we really don’t
want to invoke a huge shell script for a triviality like showing the welcome
message). To us, the lsb_release tool appears to be an attempt at
abstracting distribution checks, where what is actually needed is
standardization of distribution checks. It’s simply a badly designed interface. In our opinion, it
has its use as an interface to determine the LSB version itself, but not for
checking the distribution or version.

Why haven’t you adopted one of the generic release files, such as
Fedora’s /etc/system-release? Well, they are much nicer than
lsb_release, so much is true. However, they are not extensible and
are not really parsable, if the distribution needs to be identified
programmatically or a specific version needs to be verified.

Why didn’t you call this file /etc/bikeshed instead? The name
/etc/os-release sucks! In a way, I think you kind of answered your
own question there already.

Does this mean my distribution can now drop our equivalent of
/etc/fedora-release? Unlikely, too much code exists that still
checks for the individual release files, and you probably shouldn’t break that.
This new file makes things easy for applications, not for distributions:
applications can now rely on a single file only, and use it in a nice way.
Distributions will have to continue to ship the old files unless they are
willing to break compatibility here.

This is so useless! My application needs to be compatible with distros
from 1998, so how could I ever make use of the new file? I will have to
continue using the old ones! True, if you need compatibility with really
old distributions you do. But for new code this might not be an issue, and in
general new APIs are new APIs. So if you decide to depend on it, you add a
dependency on it. However, even if you need to stay compatible it might make
sense to check /etc/os-release first and just fall back to the old
files if it doesn’t exist. The least it does for you is that you don’t need 25+
open() attempts on modern distributions, but just one.
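
The "check the new file first, then fall back" approach might look roughly like the following Python sketch. The find_release_file() helper and its root parameter are hypothetical, and the legacy list is only a short excerpt of the files named in footnote [1].

```python
import os

# A few of the legacy files named in this post; a real list would be longer.
LEGACY_RELEASE_FILES = [
    "etc/redhat-release",
    "etc/SuSE-release",
    "etc/debian_version",
    "etc/arch-release",
]


def find_release_file(root="/"):
    """Return the path of the first release file found, or None."""
    candidates = ["etc/os-release"] + LEGACY_RELEASE_FILES
    for relative in candidates:
        path = os.path.join(root, relative)
        # On a modern distribution the very first check already succeeds.
        if os.path.isfile(path):
            return path
    return None
```

The root parameter exists only so the lookup can be tried against a scratch directory instead of the live system.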

You evil people are forcing my beloved distro $XYZ to adopt your awful
systemd schemes. I hate you! You hate too much, my friend. Also, I am
pretty sure it’s not difficult to see the benefit of this new file
independently of systemd, and it’s truly useful on systems without systemd,
too.

I hate what you people do, can I just ignore this? Well, you really
need to work on your constant feelings of hate, my friend. But, to a certain
degree yes, you can ignore this for a while longer. But already, there are a
number of applications making use of this file. You lose compatibility with
those. Also, you are kinda working towards the further balkanization of the
Linux landscape, but maybe that’s your intention?

You guys add a new file because you think there are already too many? You
guys are so confused! None of the existing files is generic and extensible
enough to do what we want it to do. Hence we had to introduce a new one. We
acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure,
it’s extensible, and we are happy if distributions extend it. Please prefix
your keys with your distribution’s name however. Or even better: talk to us and
we might be able to update the documentation and make your field standard, if you
convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to
identify the OS it is being built on or is being run on, please consider making
use of this new file, we created it for you. If you work on a distribution, and
your distribution doesn’t support this file yet, please consider adopting this
file, too.

If you are working on a small/embedded distribution, or a legacy-free
distribution we encourage you to adopt only this file and not establish any
other per-distro release file.

Read the documentation for /etc/os-release.

Footnotes

[1] Yes, multitude, there’s at least: /etc/redhat-release,
/etc/SuSE-release, /etc/debian_version,
/etc/arch-release, /etc/gentoo-release,
/etc/slackware-version, /etc/frugalware-release,
/etc/altlinux-release, /etc/mandriva-release,
/etc/meego-release, /etc/angstrom-version,
/etc/mageia-release. And some distributions even have multiple, for
example Fedora already has four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom,
Frugalware have adopted this. (This list is not comprehensive, there are
probably more.)

The Case for the /usr Merge

2012-01-27 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/the-usr-merge.html

One of the features of Fedora 17 is the /usr merge, put
forward by Harald Hoyer and Kay Sievers^[1]. Since this
feature was proposed, repetitive discussions have taken place all over the various
Free Software communities, and usually the same questions were asked: what the reasons
behind this feature were, and whether it makes sense to adopt the same scheme for
distribution XYZ, too.

Especially in the non-Fedora world it appears to be socially unacceptable to
actually have a look at the Fedora feature page
(where many of the questions are already brought up and answered) which is very unfortunate. To
improve the situation I spent some time today to summarize the reasons for the
/usr merge independently. I’d hence like to direct you to this new page I put
up which tries to summarize the reasons for this, with an emphasis on the
compatibility point of view:

The Case for the /usr Merge

Note that even though this page is in the systemd wiki, what it covers is
mostly orthogonal to systemd. systemd supports both systems with a merged /usr
and with a split /usr, and the /usr merge should be interesting for non-systemd
distributions as well.

Primarily I put this together to have a nice place to point all those folks
who continue to write me annoyed emails, even though I am actually not even
working on all of this…

Enjoy the read!

Footnotes:

[1] And not actually by me, I am just a supportive spectator and am
not doing any work on it. Unfortunately some tech press folks created the false
impression I was behind this. But credit where credit is due, this is all
Harald’s and Kay’s work.

Plumbers Wishlist, The Third Edition, a.k.a. "The Thank You Edition"

2012-01-20 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist-3.html

Last October we published a
wishlist for plumbing related features we’d like to see added to the Linux
kernel. Three months later it’s time to publish a short update, and explain
what has been implemented in the kernel, what people have started working on,
and what’s still missing.

The full, updated list is available
on Google Docs.

In general, I must say that the list turned out to be a great success. It
shows how awesome the Open Source community is: Just ask nicely and there’s a
good chance they’ll fulfill your wishes! Thank you very much, Linux
community!

We’d like to thank everybody who worked on any of the features on that list:
Lucas De Marchi, Andi Kleen, Dan Ballard, Li Zefan, Kirill A. Shutemov,
Davidlohr Bueso, Cong Wang, Lennart Poettering, Kay Sievers.

Of the items on the list, 5 have been fully implemented and are already part
of a released kernel, or have already been merged for inclusion in upcoming kernel
releases.

For 4 further items patches have been posted, and I am hoping they’ll get
merged eventually. Davidlohr, Wang, Zefan, Kirill, it would be great if you’d
continue working on your patches, as we think they are following the right
approach^[1] even if there was some opposition to them on LKML. So,
please keep pushing to solve the outstanding issues and thanks for your work so far!

Footnotes

[1] Yes, I still believe that tmpfs quota should be implemented via
resource limits, as everything else wouldn’t work, as we don’t want to
implement complex and fragile userspace infrastructure to racily upload complex
quota data for all current and future UIDs ever used on the system into each
tmpfs mount point at mount time.

systemd for Administrators, Part XII

2012-01-20 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/security.html

Here’s the twelfth installment of my ongoing series on systemd for Administrators:

Securing Your Services

One of the core features of Unix systems is the idea of privilege separation
between the different components of the OS. Many system services run under
their own user IDs thus limiting what they can do, and hence the impact they
may have on the OS in case they get exploited.

This kind of privilege separation only provides very basic protection
however, since in general system services run this way can still do at least as
much as a normal local user, though not as much as root. For security purposes
it is however very interesting to limit even further what services can do, and
cut them off from a couple of things that normal users are allowed to do.

A great way to limit the impact of services is by employing MAC technologies
such as SELinux. If you are interested in locking down your server, running
SELinux is a very good idea. systemd enables developers and administrators to
apply additional restrictions to local services independently of a MAC. Thus,
regardless of whether you are able to make use of SELinux you may still enforce
certain security limits on your services.

In this iteration of the series we want to focus on a couple of these
security features of systemd and how to make use of them in your services.
These features take advantage of a couple of Linux-specific technologies that have
been available in the kernel for a long time, but never have been exposed in a
widely usable fashion. These systemd features have been designed to be as easy to use
as possible, in order to make them attractive to administrators and upstream
developers:

Isolating services from the network
Service-private /tmp
Making directories appear read-only or inaccessible to services
Taking away capabilities from services
Disallowing forking, limiting file creation for services
Controlling device node access of services

All options described here are documented in systemd’s man pages, notably systemd.exec(5).
Please consult these man pages for further details.

All these options are available on all systemd systems, regardless of whether
SELinux or any other MAC is enabled.

All these options are relatively cheap, so if in doubt use them. Even if you
might think that your service doesn’t write to /tmp and hence enabling
PrivateTmp=yes (as described below) might not be necessary, due to
today’s complex software it’s still beneficial to enable this feature, simply
because libraries you link to (and plug-ins to those libraries) which you do
not control might need temporary files after all. Example: you never know what
kind of NSS module your local installation has enabled, and what that NSS module
does with /tmp.

These options are hopefully interesting both for administrators to secure
their local systems, and for upstream developers to ship their services secure
by default. We strongly encourage upstream developers to consider using these
options by default in their upstream service units. They are very easy to make
use of and have major benefits for security.

Isolating Services from the Network

A very simple but powerful configuration option you may use in systemd
service definitions is PrivateNetwork=:

...
[Service]
ExecStart=...
PrivateNetwork=yes
...

With this simple switch a service and all the processes it consists of are
entirely disconnected from any kind of networking. Network interfaces become
unavailable to the processes, the only one they’ll see is the loopback device
“lo”, but it is isolated from the real host loopback. This is a very powerful
protection from network attacks.

Caveat: Some services require the network to be operational. Of
course, nobody would consider using PrivateNetwork=yes on a
network-facing service such as Apache. However even for non-network-facing
services network support might be necessary and not always obvious. Example: if
the local system is configured for an LDAP-based user database, doing glibc name
lookups with calls such as getpwnam() might end up resulting in network access.
That said, even in those cases it is more often than not OK to use
PrivateNetwork=yes since user IDs of system service users are required to
be resolvable even without any network around. That means as long as the only
user IDs your service needs to resolve are below the magic 1000 boundary using
PrivateNetwork=yes should be OK.

Internally, this feature makes use of network namespaces of the kernel. If
enabled a new network namespace is opened and only the loopback device
configured in it.
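
One way to see the effect from inside a service is to list the interfaces the process can see. The following Python sketch does that with socket.if_nameindex(). The helper names are made up, and the check is only a heuristic: it merely tests whether "lo" is the sole visible interface.

```python
import socket


def visible_interfaces():
    """Return the names of the network interfaces this process can see."""
    return [name for _index, name in socket.if_nameindex()]


def looks_network_isolated(names):
    """True if loopback is the only interface in the given list."""
    return list(names) == ["lo"]


if __name__ == "__main__":
    names = visible_interfaces()
    print("interfaces:", ", ".join(names))
    print("isolated:", looks_network_isolated(names))
```

Run from a unit with PrivateNetwork=yes, the list should shrink to the loopback device only.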

Service-Private /tmp

Another very simple but powerful configuration switch is
PrivateTmp=:

...
[Service]
ExecStart=...
PrivateTmp=yes
...

If enabled this option will ensure that the /tmp directory the
service will see is private and isolated from the host system’s /tmp.
/tmp traditionally has been a shared space for all local services and
users. Over the years it has been a major source of security problems for a
multitude of services. Symlink attacks and DoS vulnerabilities due to guessable
/tmp temporary files are common. By isolating the service’s
/tmp from the rest of the host, such vulnerabilities become moot.

For Fedora 17 a feature has
been accepted in order to enable this option across a large number of
services.

Caveat: Some services actually misuse /tmp as a location
for IPC sockets and other communication primitives, even though this is almost
always a vulnerability (simply because if you use it for communication you need
guessable names, and guessable names make your code vulnerable to DoS and symlink
attacks) and /run is the much safer replacement for this, simply
because it is not a location writable to unprivileged processes. For example,
X11 places its communication sockets below /tmp (which is actually
secure, though still not ideal, in this exceptional case since it does so in a
safe subdirectory which is created at early boot). Services which need to
communicate via such communication primitives in /tmp are not
candidates for PrivateTmp=. Thankfully these days only very few
services misusing /tmp like this remain.

Internally, this feature makes use of file system namespaces of the kernel.
If enabled a new file system namespace is opened inheriting most of the host
hierarchy with the exception of /tmp.

Making Directories Appear Read-Only or Inaccessible to Services

With the ReadOnlyDirectories= and InaccessibleDirectories=
options it is possible to make the specified directories read-only, or
inaccessible for both reading and writing, respectively, to the service:

...
[Service]
ExecStart=...
InaccessibleDirectories=/home
ReadOnlyDirectories=/var
...

With these two configuration lines the whole tree below /home
becomes inaccessible to the service (i.e. the directory will appear empty and
with 000 access mode), and the tree below /var becomes read-only.

Caveat: Note that ReadOnlyDirectories= currently is not
recursively applied to submounts of the specified directories (i.e. mounts below
/var in the example above stay writable). This is likely to get fixed
soon.

Internally, this is also implemented based on file system namespaces.

Taking Away Capabilities From Services

Another very powerful security option in systemd is
CapabilityBoundingSet= which allows you to limit in a relatively fine
grained fashion which kernel capabilities a started service retains:

...
[Service]
ExecStart=...
CapabilityBoundingSet=CAP_CHOWN CAP_KILL
...

In the example above only the CAP_CHOWN and CAP_KILL capabilities are
retained by the service, and the service and any processes it might create have
no chance to ever acquire any other capabilities again, not even via setuid
binaries. The list of currently defined capabilities is available in capabilities(7).
Unfortunately some of the defined capabilities are overly generic (such as
CAP_SYS_ADMIN), however they are still a very useful tool, in particular for
services that otherwise run with full root privileges.

To identify precisely which capabilities are necessary for a service to run
cleanly is not always easy and requires a bit of testing. To simplify this
process a bit, it is possible to blacklist certain capabilities that are
definitely not needed instead of whitelisting all that might be needed. Example: the
CAP_SYS_PTRACE is a particularly powerful and security relevant capability
needed for the implementation of debuggers, since it allows introspecting and
manipulating any local process on the system. A service like Apache obviously
has no business in being a debugger for other processes, hence it is safe to
remove the capability from it:

...
[Service]
ExecStart=...
CapabilityBoundingSet=~CAP_SYS_PTRACE
...

The ~ character the value assignment here is prefixed with inverts
the meaning of the option: instead of listing all capabilities the service
will retain you may list the ones it will not retain.

Caveat: Some services might get confused if certain capabilities are
made unavailable to them. Thus when determining the right set of capabilities
to keep around you need to do this carefully, and it might be a good idea to talk
to the upstream maintainers since they should know best which operations a
service might need to run successfully.

Caveat 2: Capabilities are
not a magic wand. You probably want to combine them and use them in
conjunction with other security options in order to make them truly useful.

To easily check which processes on your system retain which capabilities use
the pscap tool from the libcap-ng-utils package.
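
If pscap is not at hand, the bounding set of a process can also be read from the CapBnd field in /proc/<pid>/status. The Python sketch below decodes that hex mask. The helper names are made up, and the table deliberately lists only the capabilities mentioned in this article, so it is incomplete.

```python
# Bit numbers from <linux/capability.h>; only the capabilities
# mentioned in this article are listed, so the table is incomplete.
CAPABILITY_BITS = {
    "CAP_CHOWN": 0,
    "CAP_KILL": 5,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_RESOURCE": 24,
}


def decode_bounding_set(hex_mask):
    """Return the known capability names set in a CapBnd hex mask."""
    mask = int(hex_mask, 16)
    return sorted(
        name for name, bit in CAPABILITY_BITS.items() if mask & (1 << bit)
    )


def read_bounding_set(pid="self"):
    """Read the CapBnd field from /proc/<pid>/status, or None if absent."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapBnd:"):
                return line.split()[1]
    return None
```

For a service started with CapabilityBoundingSet=CAP_CHOWN CAP_KILL the mask would have only bits 0 and 5 set, i.e. 0x21.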

Making use of systemd’s CapabilityBoundingSet= option is often a
simple, discoverable and cheap replacement for patching all system daemons
individually to control the capability bounding set on their own.

Disallowing Forking, Limiting File Creation for Services

Resource Limits may be used to apply certain security limits on services
being run. Primarily, resource limits are useful for resource control (as the
name suggests…) not so much access control. However, two of them can be
useful to disable certain OS features: RLIMIT_NPROC and RLIMIT_FSIZE may be
used to disable forking and disable writing of any files with a size >
0:

...
[Service]
ExecStart=...
LimitNPROC=1
LimitFSIZE=0
...

Note that this will work only if the service in question drops privileges
and runs under a (non-root) user ID of its own or drops the CAP_SYS_RESOURCE
capability, for example via CapabilityBoundingSet= as discussed above.
Without that a process could simply increase the resource limit again thus
voiding any effect.

Caveat: LimitFSIZE= is pretty brutal. If the service
attempts to write a file with a size > 0, it will immediately be killed with
SIGXFSZ, which, unless caught, terminates the process. Also, creating files
with size 0 is still allowed, even if this option is used.
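
The behaviour of a file size limit of 0 can be tried out without systemd. The following Python sketch starts a child process that applies RLIMIT_FSIZE=0 to itself via resource.setrlimit() and then writes one byte. The helper name and the child program are made up for illustration. One assumption worth calling out: CPython ignores SIGXFSZ at startup, so the child restores the default action first in order to behave like an ordinary daemon.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child program: apply the limit, then try to write a single byte.
CHILD = """
import resource, signal, sys
# CPython ignores SIGXFSZ at startup; restore the default action so the
# child behaves like an ordinary C daemon would.
signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
resource.setrlimit(resource.RLIMIT_FSIZE, (0, 0))
with open(sys.argv[1], "wb", buffering=0) as f:  # creating an empty file is fine
    f.write(b"x")                                # growing it is not
"""


def write_with_fsize_zero(path):
    """Run the child and return its exit status as seen by subprocess."""
    return subprocess.run([sys.executable, "-c", CHILD, path]).returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "out")
        status = write_with_fsize_zero(target)
        # A negative status means "killed by that signal number".
        print("killed by SIGXFSZ:", status == -signal.SIGXFSZ)
        print("file size:", os.path.getsize(target))
```

The empty file is still created, which matches the caveat above: only growing a file beyond size 0 triggers the signal.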

For more information on these and other resource limits, see setrlimit(2).

Controlling Device Node Access of Services

Device nodes are an important interface to the kernel and its drivers.
Since drivers tend to get much less testing and security checking than the core
kernel they often are a major entry point for security hacks. systemd allows
you to control access to devices individually for each service:

...
[Service]
ExecStart=...
DeviceAllow=/dev/null rw
...

This will limit access to /dev/null and only this device node,
disallowing access to any other device nodes.

The feature is implemented on top of the devices cgroup controller.

Other Options

Besides the easy to use options above there are a number of other security
relevant options available. However they usually require a bit of preparation
in the service itself and hence are probably primarily useful for upstream
developers. These options are RootDirectory= (to set up
chroot() environments for a service) as well as User= and
Group= to drop privileges to the specified user and group. These
options are particularly useful to greatly simplify writing daemons, where all
the complexities of securely dropping privileges can be left to systemd, and
kept out of the daemons themselves.

If you are wondering why these options are not enabled by default: some of
them simply break semantics of traditional Unix, and to maintain compatibility
we cannot enable them by default. For example, since traditional Unix enforced that
/tmp was a shared namespace, and processes could use it for IPC, we
cannot just go and turn that off globally, just because /tmp’s role in
IPC is now replaced by /run.

And that’s it for now. If you are working on unit files for upstream or in
your distribution, please consider using one or more of the options listed
above. If your service is secure by default by taking advantage of these options
this will help not only your users but also make the Internet a safer
place.

PulseAudio vs. AudioFlinger

2012-01-16 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/aruns-numbers.html

Arun
put an awesome article up, detailing how PulseAudio compares to Android’s
AudioFlinger in terms of power consumption and suchlike. Suffice to say,
PulseAudio rocks, but go and read the whole thing, it’s worth it.

Apparently, AudioFlinger is a great choice if you want to shorten your
battery life.

Introducing the Journal

2011-11-18 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/the-journal.html

In the past weeks we have been working on a major new addition to systemd
that will hopefully positively change the Linux ecosystem in a number of ways.
But see for yourself, check out the full explanation on what we have
implemented on the design
document we put up on Google Docs.

Kernel Hackers Panel

2011-11-07 Lennart Poettering

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/linuxcon-kernel-panel.html

At LinuxCon Europe/ELCE I had the chance to moderate the kernel hackers
panel with Linus Torvalds, Alan Cox, Paul McKenney and Thomas Gleixner on
stage. I like to believe it went quite well, but check it out for yourself, as
a video recording is now available online:

For me personally I think the most notable topic covered was Control Groups,
and the clarification that they are something that is needed even though their
implementation right now is in many ways less than perfect. But in the end there is no
reasonable way around them: much like SMP, they are technology that complicates things
substantially but is ultimately unavoidable.

Other videos from ELCE are online now, too.
