Tag Archives: Projects

Plumbers Wishlist, The Third Edition, a.k.a. "The Thank You Edition"

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist-3.html

Last October we published a
wishlist for plumbing related features
we’d like to see added to the Linux
kernel. Three months later it’s time to publish a short update, and explain
what has been implemented in the kernel, what people have started working on,
and what’s still missing.

The full, updated list is available on Google Docs.

In general, I must say that the list turned out to be a great success. It
shows how awesome the Open Source community is: Just ask nicely and there’s a
good chance they’ll fulfill your wishes! Thank you very much, Linux
community!

We’d like to thank everybody who worked on any of the features on that list:
Lucas De Marchi, Andi Kleen, Dan Ballard, Li Zefan, Kirill A. Shutemov,
Davidlohr Bueso, Cong Wang, Lennart Poettering, Kay Sievers.

Of the items on the list, 5 have been fully implemented and are already part
of a released kernel, or have already been merged for inclusion in upcoming
kernel releases.

For 4 further items patches have been posted, and I am hoping they’ll get
merged eventually. Davidlohr, Wang, Zefan, Kirill, it would be great if you’d
continue working on your patches, as we think they are following the right
approach[1] even if there was some opposition to them on LKML. So,
please keep pushing to solve the outstanding issues and thanks for your work so far!

Footnotes

[1] Yes, I still believe that tmpfs quota should be implemented via
resource limits, as everything else wouldn’t work, as we don’t want to
implement complex and fragile userspace infrastructure to racily upload complex
quota data for all current and future UIDs ever used on the system into each
tmpfs mount point at mount time.

systemd for Administrators, Part XII

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/security.html

Here’s the twelfth installment of my ongoing series on systemd for Administrators:

Securing Your Services

One of the core features of Unix systems is the idea of privilege separation
between the different components of the OS. Many system services run under
their own user IDs thus limiting what they can do, and hence the impact they
may have on the OS in case they get exploited.

This kind of privilege separation only provides very basic protection
however, since in general system services run this way can still do at least as
much as a normal local user, though not as much as root. For security purposes
it is hence very interesting to limit even further what services can do, and
cut them off from a couple of things that normal users are allowed to do.

A great way to limit the impact of services is by employing MAC technologies
such as SELinux. If you are interested in locking down your server, running
SELinux is a very good idea. systemd enables developers and administrators to
apply additional restrictions to local services independently of a MAC. Thus,
regardless of whether you are able to make use of SELinux you may still enforce
certain security limits on your services.

In this iteration of the series we want to focus on a couple of these
security features of systemd and how to make use of them in your services.
These features take advantage of a couple of Linux-specific technologies that have
been available in the kernel for a long time, but have never been exposed in a
widely usable fashion. These systemd features have been designed to be as easy to use
as possible, in order to make them attractive to administrators and upstream
developers:

  • Isolating services from the network
  • Service-private /tmp
  • Making directories appear read-only or inaccessible to services
  • Taking away capabilities from services
  • Disallowing forking, limiting file creation for services
  • Controlling device node access of services

All options described here are documented in systemd’s man pages, notably systemd.exec(5).
Please consult these man pages for further details.

All these options are available on all systemd systems, regardless of whether
SELinux or any other MAC is enabled or not.

All these options are relatively cheap, so if in doubt use them. Even if you
think your service doesn’t write to /tmp, and hence enabling
PrivateTmp=yes (as described below) might not seem necessary, today’s
software is complex enough that it’s still beneficial to enable this feature, simply
because libraries you link to (and plug-ins to those libraries) which you do
not control might need temporary files after all. Example: you never know what
kind of NSS module your local installation has enabled, and what that NSS module
does with /tmp.

These options are hopefully interesting both for administrators to secure
their local systems, and for upstream developers to ship their services secure
by default. We strongly encourage upstream developers to consider using these
options by default in their upstream service units. They are very easy to make
use of and have major benefits for security.

Isolating Services from the Network

A very simple but powerful configuration option you may use in systemd
service definitions is PrivateNetwork=:

...
[Service]
ExecStart=...
PrivateNetwork=yes
...

With this simple switch a service and all the processes it consists of are
entirely disconnected from any kind of networking. Network interfaces become
unavailable to the processes; the only one they’ll see is the loopback device
“lo”, and even that is isolated from the real host loopback. This is a very powerful
protection from network attacks.

Caveat: Some services require the network to be operational. Of
course, nobody would consider using PrivateNetwork=yes on a
network-facing service such as Apache. However even for non-network-facing
services network support might be necessary and not always obvious. Example: if
the local system is configured for an LDAP-based user database, glibc name
lookups with calls such as getpwnam() might end up resulting in network access.
That said, even in those cases it is more often than not OK to use
PrivateNetwork=yes since user IDs of system service users are required to
be resolvable even without any network around. That means as long as the only
user IDs your service needs to resolve are below the magic 1000 boundary using
PrivateNetwork=yes should be OK.

Internally, this feature makes use of network namespaces of the kernel. If
enabled, a new network namespace is opened and only the loopback device is
configured in it.
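
If you want to convince yourself that this is what happens, you can enter the
network namespace of a running service and list the network interfaces visible
from the inside. Here’s a rough sketch of how to do that; foobar.service and
the PID are made up, and the nsenter tool is part of newer util-linux releases,
so it may not be available everywhere:

# systemctl show -p MainPID foobar.service
MainPID=4711
# nsenter -t 4711 -n ip link

The last command should list nothing but the isolated loopback device “lo”.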

Service-Private /tmp

Another very simple but powerful configuration switch is
PrivateTmp=:

...
[Service]
ExecStart=...
PrivateTmp=yes
...

If enabled this option will ensure that the /tmp directory the
service will see is private and isolated from the host system’s /tmp.
/tmp traditionally has been a shared space for all local services and
users. Over the years it has been a major source of security problems for a
multitude of services. Symlink attacks and DoS vulnerabilities due to guessable
/tmp temporary files are common. By isolating the service’s
/tmp from the rest of the host, such vulnerabilities become moot.

For Fedora 17 a feature has been accepted in order to enable this option
across a large number of services.

Caveat: Some services actually misuse /tmp as a location
for IPC sockets and other communication primitives, even though this is almost
always a vulnerability (simply because if you use it for communication you need
guessable names, and guessable names make your code vulnerable to DoS and symlink
attacks); /run is the much safer replacement, simply
because it is not writable by unprivileged processes. For example,
X11 places its communication sockets below /tmp (which is actually
secure, though still not ideal, in this exceptional case, since it does so in a
safe subdirectory which is created at early boot.) Services which need to
communicate via such communication primitives in /tmp are not
candidates for PrivateTmp=. Thankfully these days only very few
services misusing /tmp like this remain.

Internally, this feature makes use of file system namespaces of the kernel.
If enabled, a new file system namespace is opened, inheriting most of the host
hierarchy with the exception of /tmp.
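
If you are curious what a service actually sees in its private /tmp you can (as
root) peek at it through the /proc file system, which resolves paths in the
context of the process’ own mount namespace. A quick sketch, again with a
made-up foobar.service and PID:

# systemctl show -p MainPID foobar.service
MainPID=4711
# ls -la /proc/4711/root/tmp

The directory listed this way is the service’s private /tmp, not the host’s.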

Making Directories Appear Read-Only or Inaccessible to Services

With the ReadOnlyDirectories= and InaccessibleDirectories=
options it is possible to make the specified directories read-only or entirely
inaccessible to the service, respectively:

...
[Service]
ExecStart=...
InaccessibleDirectories=/home
ReadOnlyDirectories=/var
...

With these two configuration lines the whole tree below /home
becomes inaccessible to the service (i.e. the directory will appear empty and
with 000 access mode), and the tree below /var becomes read-only.

Caveat: Note that ReadOnlyDirectories= currently is not
recursively applied to submounts of the specified directories (i.e. mounts below
/var in the example above stay writable). This is likely to get fixed
soon.

Internally, this is also implemented based on file system namespaces.

Taking Away Capabilities From Services

Another very powerful security option in systemd is
CapabilityBoundingSet=, which allows limiting in a relatively
fine-grained fashion which kernel capabilities a service retains when started:

...
[Service]
ExecStart=...
CapabilityBoundingSet=CAP_CHOWN CAP_KILL
...

In the example above only the CAP_CHOWN and CAP_KILL capabilities are
retained by the service, and the service and any processes it might create have
no chance to ever acquire any other capabilities again, not even via setuid
binaries. The list of currently defined capabilities is available in capabilities(7).
Unfortunately some of the defined capabilities are overly generic (such as
CAP_SYS_ADMIN), however they are still a very useful tool, in particular for
services that otherwise run with full root privileges.

Identifying precisely which capabilities are necessary for a service to run
cleanly is not always easy and requires a bit of testing. To simplify this
process a bit, it is possible to blacklist certain capabilities that are
definitely not needed instead of whitelisting all that might be needed. Example:
CAP_SYS_PTRACE is a particularly powerful and security-relevant capability
needed for the implementation of debuggers, since it allows introspecting and
manipulating any local process on the system. A service like Apache obviously
has no business being a debugger for other processes, hence it is safe to
remove the capability from it:

...
[Service]
ExecStart=...
CapabilityBoundingSet=~CAP_SYS_PTRACE
...

The ~ character prefixed to the value assignment here inverts
the meaning of the option: instead of listing all capabilities the service
will retain, you may list the ones it will not retain.

Caveat: Some services might get confused if certain capabilities are
made unavailable to them. Thus when determining the right set of capabilities
to keep around you need to proceed carefully, and it might be a good idea to talk
to the upstream maintainers, since they should know best which operations a
service needs to run successfully.

Caveat 2: Capabilities are
not a magic wand.
You probably want to combine them and use them in
conjunction with other security options in order to make them truly useful.

To easily check which processes on your system retain which capabilities use
the pscap tool from the libcap-ng-utils package.
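
For example, to quickly check what a specific daemon currently retains (sshd
here is just an arbitrary example):

# pscap | grep sshd

Services that show up with the full capability set are usually good candidates
for a CapabilityBoundingSet= line in their unit files.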

Making use of systemd’s CapabilityBoundingSet= option is often a
simple, discoverable and cheap replacement for patching all system daemons
individually to control the capability bounding set on their own.

Disallowing Forking, Limiting File Creation for Services

Resource limits may be used to apply certain security limits to services
being run. Primarily, resource limits are useful for resource control (as the
name suggests…), not so much for access control. However, two of them can be
useful to disable certain OS features: RLIMIT_NPROC and RLIMIT_FSIZE may be
used to disable forking and to disable writing of any files with a size >
0:

...
[Service]
ExecStart=...
LimitNPROC=1
LimitFSIZE=0
...

Note that this will work only if the service in question drops privileges
and runs under a (non-root) user ID of its own or drops the CAP_SYS_RESOURCE
capability, for example via CapabilityBoundingSet= as discussed above.
Without that, a process could simply increase the resource limit again, thus
voiding any effect.

Caveat: LimitFSIZE= is pretty brutal. If the service
attempts to write a file with a size > 0, it will immediately be killed with
the SIGXFSZ signal, which unless caught terminates the process. Also, creating files
with size 0 is still allowed, even if this option is used.

For more information on these and other resource limits, see setrlimit(2).
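
If you want to get a feeling for the RLIMIT_FSIZE semantics described above
outside of systemd, here’s a small stand-alone C sketch (file name made up,
error handling minimal) that sets the limit to zero, ignores SIGXFSZ and then
tries to write a byte; the write fails with EFBIG instead of killing the
process, while creating the empty file itself still succeeds:

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
        struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };
        int fd;

        /* If we didn't ignore SIGXFSZ the write below would kill us instead */
        signal(SIGXFSZ, SIG_IGN);

        if (setrlimit(RLIMIT_FSIZE, &rl) < 0) {
                perror("setrlimit");
                return 1;
        }

        /* Creating a file of size 0 is still allowed... */
        fd = open("/tmp/fsize-test", O_WRONLY|O_CREAT|O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* ...but extending it beyond 0 bytes is not */
        if (write(fd, "x", 1) < 0)
                printf("write failed as expected: %s\n", strerror(errno));

        close(fd);
        return 0;
}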

Controlling Device Node Access of Services

Device nodes are an important interface to the kernel and its drivers.
Since drivers tend to get much less testing and security checking than the core
kernel they often are a major entry point for security hacks. systemd allows
you to control access to devices individually for each service:

...
[Service]
ExecStart=...
DeviceAllow=/dev/null rw
...

This will limit access to /dev/null and only this device node,
disallowing access to any other device nodes.

The feature is implemented on top of the devices cgroup controller.
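
If you are curious you can verify the resulting device policy by looking at the
devices cgroup systemd created for the service. The exact path depends on where
the controllers are mounted on your system, but for a hypothetical
foobar.service it will look something like this:

# cat /sys/fs/cgroup/devices/system/foobar.service/devices.list
c 1:3 rw

Each line names a device (here: the character device 1:3, i.e. /dev/null) and
the access modes granted to it.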

Other Options

Besides the easy to use options above there are a number of other security
relevant options available. However they usually require a bit of preparation
in the service itself and hence are probably primarily useful for upstream
developers. These options are RootDirectory= (to set up
chroot() environments for a service) as well as User= and
Group= to drop privileges to the specified user and group. These
options are particularly useful to greatly simplify writing daemons, where all
the complexities of securely dropping privileges can be left to systemd, and
kept out of the daemons themselves.
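
For completeness, here’s a sketch of what such a unit might look like; the
user, group and directory names are of course made up, and note that with
RootDirectory= the binary listed in ExecStart= and everything it needs must
actually be available inside the chroot:

...
[Service]
ExecStart=...
User=foobar
Group=foobar
RootDirectory=/srv/foobar-chroot
...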

If you are wondering why these options are not enabled by default: some of
them simply break semantics of traditional Unix, and to maintain compatibility
we cannot enable them by default. E.g. since traditional Unix enforced that
/tmp was a shared namespace that processes could use for IPC, we
cannot just go and turn that off globally, just because /tmp’s role in
IPC is now replaced by /run.

And that’s it for now. If you are working on unit files for upstream or in
your distribution, please consider using one or more of the options listed
above. If your service is secure by default by taking advantage of these options
this will help not only your users but also make the Internet a safer
place.

Kernel Hackers Panel

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/linuxcon-kernel-panel.html

At LinuxCon Europe/ELCE I had the chance to moderate the kernel hackers
panel with Linus Torvalds, Alan Cox, Paul McKenney and Thomas Gleixner on
stage. I like to believe it went quite well, but check it out for yourself, as
a video recording is now available online:

For me personally I think the most notable topic covered was Control Groups,
and the clarification that they are something that is needed even though their
implementation right now is in many ways less than perfect. But in the end there is no
reasonable way around them; much like SMP, they are a technology that complicates things
substantially but is ultimately unavoidable.

Other videos from ELCE are online now, too.

libabc

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/libabc.html

At the Kernel Summit in Prague last week Kay Sievers and I led a session on
developing shared userspace libraries, for kernel hackers. More and more
userspace interfaces of the kernel (for example many which deal with storage,
audio, resource management, security, file systems or a number of other
subsystems) nowadays rely on a dedicated userspace component. As people who
work primarily in the plumbing layer of the Linux OS we noticed over and over
again that these libraries, written by people who usually are at home on the
kernel side of things, make the same mistakes repeatedly, thus making life for
the users of the libraries unnecessarily difficult. In our session we tried to
point out a number of these things, and in particular places where the usual
kernel hacking style translates badly into userspace shared library hacking.
Our hope is that maybe a few kernel developers will have a look at our list of
recommendations and consider the points we are raising.

To make things easy we have put together an example skeleton library we
dubbed libabc, whose README
file includes all our points in terse form. It’s available on kernel.org:

The git repository and the README.

This list of recommendations draws inspiration from David Zeuthen’s and
Ulrich Drepper’s well known papers on the topic of writing shared libraries. In
the README linked above we try to distill this wealth of information into a
terse list of recommendations, with a couple of additions and with a strict
focus on a kernel hacker background.

Please have a look, and even if you are not a kernel hacker there might be
something useful to know in it, especially if you work on the lower layers of
our stack.

If you have any questions or additions, just ping us, or comment below!

Prague

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/linuxcon-europe.html

If you make it to Prague the coming week for the LinuxCon/ELCE/GStreamer/Kernel Summit/… superconference, make sure not to miss:

All of that at the Clarion Hotel. See you in Prague!

Plumbers Wishlist, The Second Edition

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist-2.html

Two weeks ago we published a Plumber’s Wishlist for Linux. So far, this has
already created lively discussions in the community (as reported on LWN among
others), and patches for a few of the items listed have already been posted
(thanks a lot to those who worked on this, your contributions are much
appreciated!).

We have now prepared a second version of the wish list. It includes a number
of additions (tmpfs quota! hostname change notifications! and more!) and
updates to the previous items, including links to patches, and references to
other interesting material.

We hope to update this wishlist from time to time, so stay tuned!

And now, go and read the new wishlist!

Google doesn’t like my name

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/google-doesnt-like-my-name.html

Nice one, Google suspended my Google+ account because I created it under,
well, my name, which is “Lennart Poettering”, and Google+ thinks that wasn’t my
name, even though it says so in my passport, and almost every document I own
and I was never aware I had any other name. This is ridiculous. Google, give me
my name back! This is a really uncool move.

A Big Loss

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/a-big-loss.html

Google announced today that they’ll be shutting down Google Code Search in
January. I am quite sure that this will be a massive loss for the Free
Software community. The ability to learn from other people’s code is a key
idea of Free Software. There’s simply no better way to do that than with a
source code search engine. The day Google Code Search is shut down will
be a sad day for the Free Software community.

Of course, there are a couple of alternatives around, but they all have one
thing in common: they, uh, don’t even remotely compare to the completeness,
performance and simplicity of the Google Code Search interface, and have
serious usability issues. (For example: koders.com is really really slow, and
splits up identifiers you search for at underscores, which kinda makes it
useless for looking for almost any kind of code.)

I think it must be of genuine interest to the Free Software community to
have a capable replacement for Google Code Search, for the day it is turned
off. In fact, it probably should be something the various foundations which
promote Free Software should be looking into, like the FSF or the Linux
Foundation. There are very few better ways to get Free Software into the heads
and minds of engineers than by examples — examples consisting of real life
code they can find with a source code search engine. I believe a source code
search engine is probably among the best vehicles to promote Free Software
towards engineers. In particular if it itself was Free Software (in contrast to
Google Code Search).

Ideally, all software available on web sites like SourceForge, Freshmeat, or
github should be indexed. But there’s also a chance for distributions here:
indexing the sources of all packages a distribution like Debian or Fedora
include would be a great tool for developers. In fact, a distribution offering
this functionality might well benefit from it, as it attracts
developer interest in the distribution.

It’s sad that Google Code Search will be gone soon. But maybe there’s
something positive in the bad news here, and a chance to create something better,
more comprehensive, that is free, and promotes our ideals better than Google
ever could. Maybe there’s a chance here for the Open Source foundations, for
the distributions and for the communities to create a better replacement!

A Plumber’s Wish List for Linux

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist.html

Here’s a mail we just sent to LKML, for your consideration. Enjoy:

Subject: A Plumber’s Wish List for Linux

We’d like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope of finding some help.

If you happen to be interested in working on something from this list or
able to help out, we’d be delighted. Please ping us in case you need
clarifications or more information on specific items.

Thanks,
Kay, Lennart, Harald, in the name of all the other plumbers


And here’s the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT label is implemented as a hidden directory entry in the file
system which needs to be renamed when changing the file system label;
this is impossible to do from userspace without unmounting. Hence we’d
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andi Kleen has a patch to create the alias file itself. CPU
‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available. And
we’d like to see that fixed, to make the numerous ugly userspace work-arounds
that achieve the same go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren’t. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
“inverted” semantics of CapBnd in /proc/$PID/status.

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’
Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course it
is questionable if services like sendmail make use of this, but otoh for
services which fork but do not immediately exec() another binary being
able to rename these child processes in ps is of importance.

* module-init-tools: provide a proper libmodprobe.so from
module-init-tools:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster than a cgroup
supervisor process could kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an inefficient and
ugly hack. Tools would prefer anything more lightweight like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, a few ugly hacks are available to detect
this (for example a process wanting to know whether it is running in a
PID namespace could just look for a PID 2 being around and named
kthreadd which is a kernel thread only visible in the root namespace),
however all these solutions encode information and expectations that
better shouldn’t be encoded in a namespace test like this. This
functionality is needed in particular since the removal of the ns
cgroup controller which provided the namespace membership information to
user code.

* allow making use of the “cpu” cgroup controller by default without
breaking RT. Right now creating a cgroup in the “cpu” hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in “cpu” whose
processes get a non-RT weight applied, but for RT take advantage of the
parent’s RT budget. We want the separation of RT and non-RT budget
assignment in the “cpu” hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of “cpu” hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog shall be shared among various containers (or service cgroups),
and the syslog implementation needs to be able to distinguish the
sending cgroup in order to separate the logs on disk. Of course
SCM_CREDENTIALS can be used to look up the PID of the sender followed by
a check in /proc/$PID/cgroup, but that is necessarily racy, and actually
a very real race in real life.

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.

What You Need to Know When Becoming a Free Software Hacker

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/hinter-den-kulissen.html

Earlier today I gave a presentation at the Technical University Berlin about
things you need to know, things you should expect and things you shouldn’t
expect when you are aspiring to become a successful Free Software Hacker.

I have put my slides up on Google Docs in case you are interested, either
because you are the target audience (i.e. a university student) or because you
need inspiration for a similar talk about the same topic.

The first two slides are in German language, so skip over them. The
interesting bits are all in English. I hope it’s quite comprehensive (though of
course terse). Enjoy:

In case your feed reader/planet messes this up, here’s the non-embedded version.

Oh, and thanks to everybody who reviewed and suggested additions to the slides on Google+.

systemd for Administrators, Part XI

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/inetd.html

Here’s the eleventh installment of my ongoing series on systemd for Administrators:

Converting inetd Services

In a
previous episode of this series
I covered how to convert a SysV
init script to a systemd unit file. In this story I hope to explain
how to convert inetd services into systemd units.

Let’s start with a bit of background. inetd has a long tradition as
one of the classic Unix services. As a superserver it listens on
an Internet socket on behalf of another service and then activates that
service on an incoming connection, thus implementing an on-demand
socket activation system. This allowed Unix machines with limited
resources to provide a large variety of services, without the need to
run processes and invest resources for all of them all of the
time. Over the years a number of independent implementations of inetd
have been shipped on Linux distributions, the most prominent being the
ones based on BSD inetd and xinetd. While inetd used to be installed
on most distributions by default, it nowadays is used only for very
few selected services and the common services are all run
unconditionally at boot, primarily for (perceived) performance
reasons.

One of the core features of systemd (and Apple’s launchd for that
matter) is socket activation, a scheme pioneered by inetd, however
back then with a different focus. Systemd-style socket activation focusses on
local sockets (AF_UNIX), not so much Internet sockets (AF_INET), even
though both are supported. And more importantly even, socket
activation in systemd is not primarily about the on-demand aspect that
was key in inetd, but more about increasing parallelization (socket
activation allows starting clients and servers of the socket at the
same time), simplicity (since the need to configure explicit
dependencies between services is removed) and robustness (since
services can be restarted or may crash without loss of connectivity of the
socket). However, systemd can also activate services on-demand when
connections are incoming, if configured that way.

Socket activation of any kind requires support in the services
themselves. systemd provides a very simple interface that services may
implement to provide socket activation, built around sd_listen_fds(). As such
it is already a very minimal, simple scheme. However, the
traditional inetd interface is even simpler. It allows passing only a
single socket to the activated service: the socket fd is simply
duplicated to STDIN and STDOUT of the process spawned, and that’s
already it. In order to provide compatibility systemd optionally
offers the same interface to processes, thus taking advantage of the
many services that already support inetd-style socket activation, but not yet
systemd’s native activation.
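
For comparison, here’s roughly what the native interface looks like from the
service side. This is just a minimal sketch (a trivial greeting server, with
error handling mostly omitted), not a drop-in template; it needs to be linked
against the sd-daemon library, or you can simply copy sd-daemon.c and
sd-daemon.h into your sources as its documentation suggests:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

int main(void) {
        int n, listen_fd;

        /* Ask how many sockets systemd passed to us */
        n = sd_listen_fds(0);
        if (n != 1) {
                fprintf(stderr, "Expected exactly one socket, got %d.\n", n);
                return 1;
        }

        /* The first passed socket is always fd 3 */
        listen_fd = SD_LISTEN_FDS_START;

        for (;;) {
                int conn = accept(listen_fd, NULL, NULL);
                const char greeting[] = "Hello from a socket-activated service!\n";

                if (conn < 0)
                        continue;

                write(conn, greeting, strlen(greeting));
                close(conn);
        }
}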

Before we continue with a concrete example, let’s have a look at
three different schemes to make use of socket activation:

  1. Socket activation for parallelization, simplicity,
    robustness:
    sockets are bound during early boot and a singleton
    service instance to serve all client requests is immediately started
    at boot. This is useful for all services that are very likely used
frequently and continuously, and hence starting them early and in
    parallel with the rest of the system is advisable. Examples: D-Bus,
    Syslog.
  2. On-demand socket activation for singleton services: sockets
    are bound during early boot and a singleton service instance is
    executed on incoming traffic. This is useful for services that are
    seldom used, where it is advisable to save the resources and time at
    boot and delay activation until they are actually needed. Example: CUPS.
  3. On-demand socket activation for per-connection service
    instances:
    sockets are bound during early boot and for each
    incoming connection a new service instance is instantiated and the
    connection socket (and not the listening one) is passed to it. This is
    useful for services that are seldom used, and where performance is not
    critical, i.e. where the cost of spawning a new service process for
    each incoming connection is limited. Example: SSH.

The three schemes provide different performance characteristics. After
the service finishes starting up the performance provided by the first two
schemes is identical to a stand-alone service (i.e. one that is
started without a super-server, without socket activation), since the
listening socket is passed to the actual service, and code paths from
then on are identical to those of a stand-alone service and all
connections are processed exactly the same way as they are in a
stand-alone service. On the other hand, performance of the third scheme
is usually not as good: since for each connection a new service needs
to be started the resource cost is much higher. However, it also has a
number of advantages: for example client connections are better
isolated and it is easier to develop services activated this way.

For systemd primarily the first scheme is in focus, however the
other two schemes are supported as well. (In fact, the blog story in which I
covered the necessary code changes for systemd-style socket activation
was about a service of the second type, i.e. CUPS.) inetd
primarily focusses on the third scheme, however the second scheme is
supported too. (The first one isn’t. Presumably due to its focus on the
third scheme inetd got its somewhat unfair reputation for being
“slow”.)

So much for the background; let’s cut to the beef now and show how an
inetd service can be integrated into systemd’s socket
activation. We’ll focus on SSH, a very common service that is widely
installed and used but on the vast majority of machines probably not
started more often than once an hour on average (and usually even much
less). SSH has supported inetd-style activation for a long time,
following the third scheme mentioned above. Since it is started only
every now and then and only with a limited number of connections at
the same time it is a very good candidate for this scheme, as the extra
resource cost is negligible: if made socket-activatable SSH is
basically free as long as nobody uses it. And as soon as somebody logs
in via SSH it will be started, and the moment he or she disconnects all
its resources are freed again. Let’s find out how to make SSH
socket-activatable in systemd taking advantage of the provided inetd
compatibility!

Here’s the configuration line used to hook up SSH with classic inetd:

ssh stream tcp nowait root /usr/sbin/sshd sshd -i

And the same as xinetd configuration fragment:

service ssh {
        socket_type = stream
        protocol = tcp
        wait = no
        user = root
        server = /usr/sbin/sshd
        server_args = -i
}

Most of this should be fairly easy to understand, as these two
fragments express very much the same information. The non-obvious
parts: the port number (22) is not configured in the inetd configuration,
but indirectly via the service database in /etc/services: the
service name is used as lookup key in that database and translated to
a port number. This indirection via /etc/services has been
part of Unix tradition, though it has been getting more and more out of
fashion, and the newer xinetd hence optionally allows configuration
with explicit port numbers. The most interesting setting here is the
not very intuitively named nowait (resp. wait=no)
option. It configures whether a service is of the second
(wait) or the third (nowait) scheme mentioned
above. Finally the -i switch is used to enable inetd mode in
SSH.

The systemd translation of these configuration fragments is the
following two units. First, sshd.socket is a unit encapsulating
information about a socket to listen on:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=22
Accept=yes

[Install]
WantedBy=sockets.target

Most of this should be self-explanatory. A few notes:
Accept=yes corresponds to nowait. It’s hopefully
better named, referring to the fact that for nowait the
superserver calls accept() on the listening socket, whereas for
wait this is the job of the executed
service process. WantedBy=sockets.target is used to ensure that when
enabled this unit is activated at boot at the right time.

And here’s the matching service file [email protected]:

[Unit]
Description=SSH Per-Connection Server

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

This too should be mostly self-explanatory. Interesting is
StandardInput=socket, the option that enables inetd
compatibility for this service. StandardInput= may be used to
configure what STDIN of the service should be connected to for this
service (see the man page for details). By setting it to socket we make sure
to pass the connection socket here, as expected in the simple inetd
interface. Note that we do not need to explicitly configure
StandardOutput= here, since by default the setting from
StandardInput= is inherited if nothing else is
configured. Important is the “-” in front of the binary name. This
ensures that the exit status of the per-connection sshd process is
forgotten by systemd. Normally, systemd will store the exit status of
all service instances that die abnormally. SSH will sometimes die
abnormally with an exit code of 1 or similar, and we want to make sure
that this doesn’t cause systemd to keep around information for
numerous previous connections that died this way (until this
information is forgotten with systemctl reset-failed).

[email protected] is an instantiated service, as described in the preceding
installment of this series. For each incoming connection systemd
will instantiate a new instance of [email protected], with the
instance identifier named after the connection credentials.

You may wonder why in systemd configuration of an inetd service
requires two unit files instead of one. The reason for this is that to
simplify things we want to make sure that the relation between live
units and unit files is obvious, while at the same time we can order
the socket unit and the service units independently in the dependency
graph and control the units as independently as possible. (Think: this
allows you to shutdown the socket independently from the instances,
and each instance individually.)

Now, let’s see how this works in real life. If we drop these files
into /etc/systemd/system we are ready to enable the socket and
start it:

# systemctl enable sshd.socket
ln -s '/etc/systemd/system/sshd.socket' '/etc/systemd/system/sockets.target.wants/sshd.socket'
# systemctl start sshd.socket
# systemctl status sshd.socket
sshd.socket - SSH Socket for Per-Connection Servers
	  Loaded: loaded (/etc/systemd/system/sshd.socket; enabled)
	  Active: active (listening) since Mon, 26 Sep 2011 20:24:31 +0200; 14s ago
	Accepted: 0; Connected: 0
	  CGroup: name=systemd:/system/sshd.socket

This shows that the socket is listening, and so far no connections
have been made (Accepted: will show you how many connections
have been made in total since the socket was started,
Connected: how many connections are currently active.)

Now, let’s connect to this from two different hosts, and see which services are now active:

$ systemctl --full | grep ssh
[email protected]:22-172.31.0.4:47779.service  loaded active running       SSH Per-Connection Server
[email protected]:22-172.31.0.54:52985.service loaded active running       SSH Per-Connection Server
sshd.socket                                   loaded active listening     SSH Socket for Per-Connection Servers

As expected, there are now two service instances running, for the
two connections, and they are named after the source and destination
address of the TCP connection as well as the port numbers. (For
AF_UNIX sockets the instance identifier will carry the PID and UID of
the connecting client.) This allows us to individually introspect or
kill specific sshd instances, in case you want to terminate the
session of a specific client:

# systemctl kill [email protected]:22-172.31.0.4:47779.service

And that’s probably already most of what you need to know for
hooking up inetd services with systemd and how to use them afterwards.

In the case of SSH it is probably a good idea for most
distributions, in order to save resources, to default to this kind of
inetd-style socket activation, but to also provide a stand-alone unit file
for sshd which can be enabled optionally. I’ll soon file a
wishlist bug about this against our SSH package in Fedora.

A few final notes on how xinetd and systemd compare feature-wise,
and whether xinetd is fully obsoleted by systemd. The short answer
here is that systemd does not provide the full xinetd feature set and
that it does not fully obsolete xinetd. The longer answer is a bit
more complex: if you look at the multitude of options
xinetd provides you’ll notice that systemd does not compare. For
example, systemd does not come with built-in echo,
time, daytime or discard servers, and never
will include those. TCPMUX is not supported, and neither are RPC
services. However, you will also find that most of these are either
irrelevant on today’s Internet or have otherwise fallen out of fashion. The
vast majority of inetd services do not directly take advantage of
these additional features. In fact, none of the xinetd services
shipped on Fedora make use of these options. That said, there are a
couple of useful features that systemd does not support, for example
IP ACL management. However, most administrators will probably agree
that firewalls are the better solution for these kinds of problems and
on top of that, systemd supports ACL management via tcpwrap for those
who indulge in retro technologies like this. On the other hand systemd
also provides numerous features xinetd does not provide,
starting with the individual control of instances shown above, or the
more expressive configurability of the execution context for the
instances. I believe that what systemd provides is
quite comprehensive, comes with little legacy cruft but should provide
you with everything you need. And if there’s something systemd does
not cover, xinetd will always be there to fill the void, as
you can easily run it in conjunction with systemd. For the
majority of uses systemd should cover what is necessary, and allows
you to cut down on the required components to build your system from. In
a way, systemd brings back the functionality of classic Unix inetd and
turns it again into a centerpiece of a Linux system.

And that’s all for now. Thanks for reading this long piece. And
now, get going and convert your services over! Even better, do this
work in the individual packages upstream or in your distribution!

systemd for Administrators, Part X

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/instances.html

Here’s the tenth installment of my ongoing series on systemd for Administrators:

Instantiated Services

Most services on Linux/Unix are singleton services: there’s
usually only one instance of Syslog, Postfix, or Apache running on a
specific system at the same time. On the other hand some select
services may run in multiple instances on the same host. For example,
an Internet service like the Dovecot IMAP service could run in
multiple instances on different IP ports or different local IP
addresses. A more common example that exists on all installations is
getty, the mini service that runs once for each TTY and
presents a login prompt on it. On most systems this service is
instantiated once for each of the first six virtual consoles
tty1 to tty6. On some servers depending on
administrator configuration or boot-time parameters an additional
getty is instantiated for a serial or virtualizer console. Another
common instantiated service in the systemd world is fsck, the
file system checker that is instantiated once for each block device
that needs to be checked. Finally, in systemd socket activated
per-connection services (think classic inetd!) are also implemented
via instantiated services: a new instance is created for each incoming
connection. In this installment I hope to explain a bit how systemd
implements instantiated services and how to take advantage of them as
an administrator.

If you followed the previous episodes of this series you are
probably aware that services in systemd are named according to the
pattern foobar.service, where foobar is an
identification string for the service, and .service simply a
fixed suffix that is identical for all service units. The definition files
for these services are searched for in /etc/systemd/system
and /lib/systemd/system (and possibly other directories) under this name. For
instantiated services this pattern is extended a bit: the service name becomes
foobar@quux.service where foobar is the
common service identifier, and quux the instance
identifier. Example: [email protected] is the serial
getty service instantiated for ttyS2.

Service instances can be created dynamically as needed. Without
further configuration you may easily start a new getty on a serial
port simply by invoking a systemctl start command for the new
instance:

# systemctl start [email protected]

If a command like the above is run systemd will first look for a
unit configuration file by the exact name you requested. If this
service file is not found (and usually it isn’t if you use
instantiated services like this) then the instance id is removed from
the name and a unit configuration file by the resulting
template name is searched for. In other words, in the above example,
if the precise [email protected] unit file cannot
be found, [email protected] is loaded instead. This unit
template file will hence be common for all instances of this
service. For the serial getty we ship a template unit file in systemd
(/lib/systemd/system/[email protected]) that looks
something like this:

[Unit]
Description=Serial Getty on %I
BindTo=dev-%i.device
After=dev-%i.device systemd-user-sessions.service

[Service]
ExecStart=-/sbin/agetty -s %I 115200,38400,9600
Restart=always
RestartSec=0

(Note that the unit template file we actually ship along with
systemd for the serial gettys is a bit longer. If you are interested,
have a look at the actual
file
which includes additional directives for compatibility with
SysV, to clear the screen and remove previous users from the TTY
device. To keep things simple I have shortened the unit file to the
relevant lines here.)

This file looks mostly like any other unit file, with one
distinction: the specifiers %I and %i are used at
multiple locations. At unit load time %I and %i are
replaced by systemd with the instance identifier of the service. In
our example above, if a service is instantiated as
[email protected] the specifiers %I and
%i will be replaced by ttyUSB0. If you introspect
the instantiated unit with systemctl status
[email protected]
you will see these replacements
having taken place:

$ systemctl status [email protected]
[email protected] - Serial Getty on ttyUSB0
	  Loaded: loaded (/lib/systemd/system/[email protected]; static)
	  Active: active (running) since Mon, 26 Sep 2011 04:20:44 +0200; 2s ago
	Main PID: 5443 (agetty)
	  CGroup: name=systemd:/system/[email protected]/ttyUSB0
		  └ 5443 /sbin/agetty -s ttyUSB0 115200,38400,9600

And that is already the core idea of instantiated services in
systemd. As you can see systemd provides a very simple templating
system, which can be used to dynamically instantiate services as
needed. To make effective use of this, a few more notes:

You may instantiate these services on-the-fly via
.wants/ symbolic links in the file system. For example, to
make sure the serial getty on ttyUSB0 is started
automatically at every boot, create a symlink like this:

# ln -s /lib/systemd/system/[email protected] /etc/systemd/system/getty.target.wants/serial-getty@ttyUSB0.service

systemd will instantiate the symlinked unit file with the
instance name specified in the symlink name.

You cannot instantiate a unit template without specifying an
instance identifier. In other words systemctl start
[email protected]
will necessarily fail since the instance
name was left unspecified.

Sometimes it is useful to opt-out of the generic template
for one specific instance. For these cases make use of the fact that
systemd always searches first for the full instance file name before
falling back to the template file name: make sure to place a unit file
under the fully instantiated name in /etc/systemd/system and
it will override the generic templated version for this specific
instance.
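
As a quick sketch of how that looks in practice (using the serial getty
example from above; adjust the paths for your distribution):

# cp /lib/systemd/system/[email protected] /etc/systemd/system/[email protected]
# $EDITOR /etc/systemd/system/[email protected]
# systemctl daemon-reload

From then on systemd will use your edited copy for ttyUSB0, and the generic
template for all other instances.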

The unit file shown above uses %i at some places and
%I at others. You may wonder what the difference between
these specifiers is. %i is replaced by the exact characters
of the instance identifier. For %I on the other hand the
instance identifier is first passed through a simple unescaping
algorithm. In the case of a simple instance identifier like
ttyUSB0 there is no effective difference. However, if the
device name includes one or more slashes (“/“), these cannot be
part of a unit name (or Unix file name). Before such a device name can
be used as instance identifier it needs to be escaped so that “/”
becomes “-” and most other special characters (including “-“) are
replaced by “\xAB” where AB is the ASCII code of the character in
hexadecimal notation[1]. Example: to refer to a USB serial port by its
bus path we want to use a port name like
serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0. The
escaped version of this name is
serial-by\x2dpath-pci\x2d0000:00:1d.0\x2dusb\x2d0:1.4:1.1\x2dport0. %I
will then refer to the former, %i to the latter. Effectively this
means %i is useful wherever it is necessary to refer to other
units, for example to express additional dependencies. On the other
hand %I is useful for usage in command lines, or inclusion in
pretty description strings. Let’s check how this looks with the above unit file:

# systemctl start 'serial-getty@serial-by\x2dpath-pci\x2d0000:00:1d.0\x2dusb\x2d0:1.4:1.1\x2dport0.service'
# systemctl status 'serial-getty@serial-by\x2dpath-pci\x2d0000:00:1d.0\x2dusb\x2d0:1.4:1.1\x2dport0.service'
serial-getty@serial-by\x2dpath-pci\x2d0000:00:1d.0\x2dusb\x2d0:1.4:1.1\x2dport0.service - Serial Getty on serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0
	  Loaded: loaded (/lib/systemd/system/[email protected]; static)
	  Active: active (running) since Mon, 26 Sep 2011 05:08:52 +0200; 1s ago
	Main PID: 5788 (agetty)
	  CGroup: name=systemd:/system/[email protected]/serial-by\x2dpath-pci\x2d0000:00:1d.0\x2dusb\x2d0:1.4:1.1\x2dport0
		  └ 5788 /sbin/agetty -s serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0 115200 38400 9600

As we can see, while the instance identifier is the escaped
string, the command line and the description string actually use the
unescaped version, as expected.
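
In case the escaping rule as described in words above is hard to follow, here
is a small C sketch that implements it the way it is explained in this text.
This is an illustration only, not the actual systemd code; in particular the
exact set of characters passed through unescaped is an assumption on my part:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* "/" becomes "-", most other special characters (including "-")
 * become \xAB with AB being the hex code of the character. */
static char *escape_instance(const char *s) {
        char *r, *p;

        r = malloc(strlen(s) * 4 + 1);
        if (!r)
                return NULL;

        for (p = r; *s; s++) {
                if (*s == '/')
                        *p++ = '-';
                else if (isalnum((unsigned char) *s) || *s == ':' || *s == '_' || *s == '.')
                        *p++ = *s;
                else
                        p += sprintf(p, "\\x%02x", (unsigned char) *s);
        }

        *p = 0;
        return r;
}

int main(int argc, char *argv[]) {
        int i;

        for (i = 1; i < argc; i++) {
                char *e = escape_instance(argv[i]);
                printf("%s\n", e);
                free(e);
        }

        return 0;
}

Running this on serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0 yields the
escaped name shown above.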

(Side note: there are more specifiers available than just
%i and %I, and many of them are actually
available in all unit files, not just templates for service
instances. For more details see the man
page
which includes a full list and terse explanations.)

And at this point this shall be all for now. Stay tuned for a
follow-up article on how instantiated services are used for
inetd-style socket activation.

Footnotes

[1] Yupp, this escaping algorithm doesn’t really result in
particularly pretty escaped strings, but then again, most escaping
algorithms don’t help readability. The algorithm we used here is
inspired by what udev does in a similar case, with one change. In the
end, we had to pick something. If you plan to comment on the
escaping algorithm please also mention where you live so that I can
come around and paint your bike shed yellow with blue stripes. Thanks!

systemd US Tour Dates

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/us-tour-dates.html

Kay Sievers, Harald Hoyer and I will tour the US in the next weeks. If you
have any questions on systemd, udev or dracut (or any of the related
technologies), then please do get in touch with us on the following occasions:

Linux Plumbers Conference, Santa Rosa, CA, Sep 7-9th
Google, Googleplex, Mountain View, CA, Sep 12th
Red Hat, Westford, MA, Sep 13-14th

As usual LPC is going to rock, so make sure to be there!

How to Write syslog Daemons Which Cooperate Nicely With systemd

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/syslog.html

I just finished putting together a text on the systemd wiki explaining what
to do to write a syslog service that is nicely integrated with systemd, and
does all the right things. It’s supposed to be a checklist for all syslog
hackers:

Read it now.

rsyslog already implements everything on this list afaics, and that’s
pretty cool. If other implementations want to catch up, please consider
following these recommendations, too.

I put this together since I have changed systemd 35 to set
StandardOutput=syslog as default, so that all stdout/stderr of all
services automatically ends up in syslog. And since that change requires some
(minimal) changes to all syslog implementations I decided to document this all
properly (if you are curious: they need to set StandardOutput=null to
opt out of this default in order to avoid logging loops).
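
In unit file terms the opt-out mentioned above boils down to a single
additional line in the syslog implementation’s own service file; a sketch:

...
[Service]
ExecStart=...
StandardOutput=null
...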

Anyway, please have a peek and comment if you spot a mistake or
something I forgot. Or if you have questions, just ask.

How to Behave Nicely in the cgroup Trees

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/pax-cgroups.html

The Linux cgroup hierarchies of the various kernel controllers are a shared
resource. Recently many components of Linux userspace have started making use
of these hierarchies. In order to avoid the various programs stepping on each
other’s toes while manipulating this shared resource, we have put together a
list of recommendations. Programs following these guidelines should work
together nicely without interfering with other users of the hierarchies.

These guidelines are available in the systemd wiki. I’d be very interested in
feedback, and would like to ask you to ping me in case we forgot something or
left something too vague.

And please, if you are writing software that interfaces with the cgroup
tree consider following these recommendations. Thank you.