Tag Archives: Projects

systemd Status Update

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/systemd-update-3.html

It
has been way too long since my last status update on
systemd
. Here’s another short, incomprehensive status update on
what we worked on for systemd since
then.

We have been working hard to turn systemd into the most viable set
of components to build operating systems, appliances and devices from,
and make it the best choice for servers, for desktops and for embedded
environments alike. I think we have a really convincing set of
features now, but we are actively working on making it even
better.

Here’s a list of some more and some less interesting features, in
no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar
    to how git has it.
  2. systemctl learnt a new switch --failed, to show only
    failed services.
  3. You may now start services immediately, overrding all dependency
    logic by passing --ignore-dependencies to
    systemctl. This is mostly a debugging tool and nothing people
    should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown
    logic of services is now optional and may be configured with the
    SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into its own project systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO
    special files as well as character and block device nodes, and
    symlinks. It also is capable of relabelling certain directories at
    boot now (in the SELinux sense).
  7. Immediately before shuttding dow we will now invoke all binaries
    found in /lib/systemd/system-shutdown/, which is useful for
    debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes
    (unless individual service configuration overrides it).
  9. There’s a new ConditionVirtualization= option, that makes
    systemd skip a specific service if a certain virtualization technology
    is found or not found. Similar, we now have a new option to detect
    whether a certain security technology (such as SELinux) is available,
    called ConditionSecurity=. There’s also
    ConditionCapability= to check whether a certain process
    capability is in the capability bounding set of the system. There’s
    also a new ConditionFileIsExecutable=,
    ConditionPathIsMountPoint=,
    ConditionPathIsReadWrite=,
    ConditionPathIsSymbolicLink=.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be “triggering” and “mandatory”, meaning that
    they can be a necessary requirement to hold for a service to start, or
    simply one trigger among many.
  12. At boot time we now print warnings if: /usr
    is on a split-off partition but not already mounted by an initrd
    ;
    if /etc/mtab is not a symlink to /proc/mounts; CONFIG_CGROUPS
    is not enabled in the kernel
    . We’ll also expose this as
    tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in
    Linux namespace containers and will get a clean boot in both
    cases. This is more complicated than it sounds since device management
    with udev or write access to /sys, /proc/sys or
    things like /dev/kmsg is not available in a container. This
    makes systemd a first-class choice for managing thin container
    setups. This is all tested with systemd’s own systemd-nspawn
    tool but should work fine in LXC setups, too. Basically this means
    that you do not have to adjust your OS manually to make it work in a
    container environment, but will just work out of the box. It also
    makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of
    D-Bus machine ID logic. See this
    blog story for more information
    . On stateless/read-only systems
    the machine ID is initialized randomly at boot. In virtualized
    environments it may be passed in from the machine manager (with qemu’s
    -uuid switch, or via the container
    interface
    ).
  16. All of the systemd-specific /etc/fstab mount options are
    now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now
    implicitly prefix all LSB and SysV init script descriptions with the
    strings “LSB:” resp. “SYSV:“.
  18. We introduced /run and made it a hard dependency of
    systemd. This directory is now widely accepted and implemented on all
    relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn,
    a really powerful tool that can be used to start containers for
    debugging, building and testing, much like chroot(1). It is useful to
    just get a shell inside a build tree, but is good enough to boot up a
    full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit
    TAB to hide the asterisks we normally show for each key that is
    entered, for extra paranoia.
  22. We don’t enable udev-settle.service anymore, which is
    only required for certain legacy software that still hasn’t been
    updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to
    bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel’s binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot()
    environment and will work accordingly (i.e. apply changes to the tree
    it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There’s a new unit dependency type OnFailureIsolate= that
    allows entering a different target whenever a certain unit fails. For
    example, this is interesting to enter emergency mode if file system
    checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files
    from /proc and POSIX message queues, too.
  28. There’s a new IgnoreOnIsolate= flag which may be used to
    ensure certain units are left untouched by isolation requests. There’s
    a new IgnoreOnSnapshot= flag which may be used to exclude
    certain units from snapshot units when they are created.
  29. There’s now small mechanism services for
    changing the local hostname and other host meta data
    , changing
    the system locale and console settings
    and the system
    clock
    .
  30. We now limit the capability bounding set for a number of our
    internal services by default.
  31. Plymouth may now be disabled globally with
    plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finished running (and
    optionally other tools run on VTs). This adds extra security since it
    clears up the scrollback buffer so that subsequent users cannot get
    access to a user’s session output.
  33. In socket units there are now options to control the
    IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED,
    SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger
    than the default system settings if needed by using
    SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID
    1, in order to avoid time jumps during normal userspace operation, and
    to guarantee sensible times on all generated logs. We also no longer
    save the system clock to the RTC on shutdown, assuming that this is
    done by the clock control tool when the user modifies the time, or
    automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to
    /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps tracks
    of logged in users and their sessions. It creates control groups for
    them, implements the XDG_RUNTIME_DIR
    specification
    for them, maintains seats and device node ACLs and
    implements shutdown/idle inhibiting for clients. It auto-spawns gettys
    on all local VTs when the user switches to them (instead of starting
    six of them unconditionally), thus reducing the resource foot print by
    default. It has a D-Bus interface as well as a
    simple synchronous library interface
    . This mechanism obsoletes
    ConsoleKit which is now deprecated and should no longer be used.
  38. There’s now full, automatic multi-seat support, and this is
    enabled in GNOME 3.4. Just by pluging in new seat hardware you get a
    new login screen on your seat’s screen.
  39. There is now an option ControlGroupModify= to allow
    services to change the properties of their control groups dynamically,
    and one to make control groups persistent in the tree
    (ControlGroupPersistent=) so that they can be created and
    maintained by external tools.
  40. We now jump back into the initrd in shutdown, so that it can
    detach the root file system and the storage devices backing it. This
    allows (for the first time!) to reliably undo complex storage setups
    on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and
    administrators to define their own policies on whether services should
    be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking
    units. There’s also a new command (systemctl list-unit-files)
    for determining the list of all installed unit file files and whether
    they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it
    appears. This makes /etc/sysctl.d compatible with hot-plug
    network devices.
  44. There’s limited profiling for SELinux start-up perfomance built
    into PID 1.
  45. There’s a new switch PrivateNetwork=
    to turn of any network access for a specific service.
  46. Service units may now include configuration for control group
    parameters. A few (such as MemoryLimit=) are exposed with
    high-level options, and all others are available via the generic
    ControlGroupAttribute= setting.
  47. There’s now the option to mount certain cgroup controllers
    jointly at boot. We do this now for cpu and
    cpuacct by default.
  48. We added the
    journal
    and turned it on by default.
  49. All service output is now written to the Journal by default,
    regardless whether it is sent via syslog or simply written to
    stdout/stderr. Both message streams end up in the same location and
    are interleaved the way they should. All log messages even from the
    kernel and from early boot end up in the journal. Now, no service
    output gets unnoticed and is saved and indexed at the same
    location.
  50. systemctl status will now show the last 10 log lines for
    each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console,
    again. We also show the much loved colorful [ OK ] status
    messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container
    managers
    and initrds
    for passing execution data to systemd. We also implemented and
    documented an
    interface for storage daemons that are required to back the root file
    system
    .
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won’t show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource
    usage of whole services in a top(1) like fasion.
  57. systemd may now supervise services in watchdog style. If enabled
    for a service the daemon daemon has to ping PID 1 in regular intervals
    or is otherwise considered failed (which might then result in
    restarting it, or even rebooting the machine, as configured). Also,
    PID 1 is capable of pinging a hardware watchdog. Putting this
    together, the hardware watchdogs PID 1 and PID 1 then watchdogs
    specific services. This is highly useful for high-availability servers
    as well as embedded machines. Since watchdog hardware is noawadays
    built into all modern chipsets (including desktop chipsets), this
    should hopefully help to make this a more widely used
    functionality.
  58. We added support for a new kernel command line option
    systemd.setenv= to set an environment variable
    system-wide.
  59. By default services which are started by systemd will have SIGPIPE
    set to ignored. The Unix SIGPIPE logic is used to reliably implement
    shell pipelines and when left enabled in services is usually just a
    source of bugs and problems.
  60. You may now configure the rate limiting that is applied to
    restarts of specific services. Previously the rate limiting parameters
    were hard-coded (similar to SysV).
  61. There’s now support for loading the IMA integrity policy into the
    kernel early in PID 1, similar to how we already did it with the
    SELinux policy.
  62. There’s now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt
    an official tool in the tool set. Since we already had code to detect
    certain VM and container environments we now added an official tool
    for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous
    interfaces
    systemd introduced.

Much of the stuff above is already available in Fedora 15 and 16,
or will be made available in the upcoming Fedora 17.

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will it thus spare you.

I’d like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

Control Groups vs. Control Groups

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/cgroups-vs-cgroups.html

TL;DR: systemd does not
require the performance-sensitive bits of Linux control groups enabled in the kernel.
However, it does require some non-performance-sensitive bits of the control
group logic.

In some areas of the community there’s still some confusion about Linux
control groups and their performance impact, and what precisely it is that
systemd requires of them. In the hope to clear this up a bit, I’d like to point
out a few things:

Control Groups are two things: (A) a way to hierarchally group and
label processes
, and (B) a way to then apply resource limits
to these groups. systemd only requires the former (A), and not the latter (B).
That means you can compile your kernel without any control group resource
controllers (B) and systemd will work perfectly on it. However, if you in
addition disable the grouping feature entirely (A) then systemd will loudly
complain at boot and proceed only reluctantly with a big warning and in a
limited functionality mode.

At compile time, the grouping/labelling feature in the kernel is enabled by
CONFIG_CGROUPS=y, the individual controllers by CONFIG_CGROUP_FREEZER=y,
CONFIG_CGROUP_DEVICE=y, CONFIG_CGROUP_CPUACCT=y, CONFIG_CGROUP_MEM_RES_CTLR=y,
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y,
CONFIG_CGROUP_PERF=y, CONFIG_CGROUP_SCHED=y, CONFIG_BLK_CGROUP=y,
CONFIG_NET_CLS_CGROUP=y, CONFIG_NETPRIO_CGROUP=y. And since (as mentioned) we
only need the former (A), not the latter (B) you may disable all of the latter
options while enabling CONFIG_CGROUPS=y, if you want to run systemd on your
system.

What about the performance impact of these options? Well, every bit of code
comes at some price, so none of these options come entirely for free. However,
the grouping feature (A) alters the general logic very little, it just sticks
hierarchial labels on processes, and its impact is minimal since that is
usually not in any hot path of the OS. This is different for the various
controllers (B) which have a much bigger impact since they influence the resource
management of the OS and are full of hot paths. This means that the kernel
feature that systemd mandatorily requires (A) has a minimal effect on system
performance, but the actually performance-sensitive features of control groups
(B) are entirely optional.

On boot, systemd will mount all controller hierarchies it finds enabled
in the kernel to individual directories below /sys/fs/cgroup/. This is
the official place where kernel controllers are mounted to these days. The
/sys/fs/cgroup/ mount point in the kernel was created precisely for
this purpose. Since the control group controllers are a shared facility that
might be used by a number of different subsystems a few
projects have agreed on a set of rules in order to avoid that the various bits
of code step on each other’s toes when using these directories
.

systemd will also maintain its own, private, controller-less, named control
group hierarchy which is mounted to /sys/fs/cgroup/systemd/. This
hierarchy is private property of systemd, and other software should not try to
interfere with it. This hierarchy is how systemd makes use of the naming and
grouping feature of control groups (A) without actually requiring any kernel
controller enabled for that.

Now, you might notice that by default systemd does create per-service
cgroups in the “cpu” controller if it finds it enabled in the kernel. This is
entirely optional, however. We chose to make use of it by default to even out
CPU usage between system services. Example: On a traditional web server machine
Apache might end up having 100 CGI worker processes around, while MySQL only
has 5 processes running. Without the use of the “cpu” controller this means
that Apache all together ends up having 20x more CPU available than MySQL since
the kernel tries to provide every process with the same amount of CPU time. On
the other hand, if we add these two services to the “cpu” controller in
individual groups by default, Apache and MySQL get the same amount of CPU,
which we think is a good default.

Note that if the CPU controller is not enabled in the kernel systemd will not
attempt to make use of the “cpu” hierarchy as described above. Also, even if it is enabled in the kernel it
is trivial to tell systemd not to make use of it: Simply edit
/etc/systemd/system.conf and set DefaultControllers= to the
empty string.

Let’s discuss a few frequently heard complaints regarding systemd’s use of control groups:

  • systemd mounts all controllers to /sys/fs/cgroup/ even though
    my software requires it at /dev/cgroup/ (or some other place)!
    The
    standardization of /sys/fs/cgroup/ as mount point of the hierarchies
    is a relatively recent change in the kernel. Some software has not been updated
    yet for it. If you cannot change the software in question you are welcome to
    unmount the hierarchies from /sys/fs/cgroup/ and mount them wherever
    you need them instead. However, make sure to leave
    /sys/fs/cgroup/systemd/ untouched.
  • systemd makes use of the “cpu” hierarchy, but it should leave its dirty
    fingers from it!
    As mentioned above, just set the
    DefaultControllers= option of systemd to the empty string.
  • I need my two controllers “foo” and “bar” mounted into one hierarchy,
    but systemd mounts them in two!
    Use the JoinControllers= setting
    in /etc/systemd/system.conf to mount several controllers into a single
    hierarchy.
  • Control groups are evil and they make everything slower! Well,
    please read the text above and understand the difference between
    “control-groups-as-in-naming-and-grouping” (A) and “cgroups-as-in-controllers”
    (B). Then, please turn off all controllers in you kernel build (B) but leave
    CONFIG_CGROUPS=y (A) enabled.
  • I have heard some kernel developers really hate control groups
    and think systemd is evil because it requires them!
    Well, there are a
    couple of things behind the dislike of control groups by some folks.
    Primarily, this is probably caused because the hackers in question do not
    distuingish the naming-and-grouping bits of the control group logic (A) and the
    controllers that are based on it (B). Mainly, their beef is with the latter
    (which systemd does not require, which is the key point I am trying to make in
    the text above), but there are other issues as well: for example, the code of
    the grouping logic is not the most beautiful bit of code ever written by man
    (which is thankfully likely to get better now, since the control groups
    subsystem now has an active maintainer again). And then for some
    developers it is important that they can compare the runtime behaviour of many
    historic kernel versions in order to find bugs (git bisect). Since systemd
    requires kernels with basic control group support enabled, and this is a
    relatively recent feature addition to the kernel, this makes it difficult for
    them to use a newer distribution with all these old kernels
    that predate cgroups. Anyway, the summary is probably that what matters to
    developers is different from what matters to users and
    administrators.

I hope this explanation was useful for a reader or two! Thank you for your time!

/tmp or not /tmp?

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/tmp.html

A number of Linux distributions have recently switched (or started
switching) to /tmp on tmpfs by default (ArchLinux, Debian among
others). Other distributions have plans/are discussing doing the same (Ubuntu, OpenSUSE).
Since we believe this is a good idea and it’s good to keep the delta between
the distributions minimal we are proposing
the same for Fedora 18, too
. On Solaris a similar change has already been
implemented in 1994 (and other Unixes have made a similar change long ago,
too). Yet, not all of our software is written in a way that it works nicely
together with /tmp on tmpfs.

Another Fedora
feature (for Fedora 17)
changed the semantics of /tmp for many
system services to make them more secure, by isolating the /tmp namespaces of the
various services. Handling of temporary files in /tmp has been
security sensitive since it has been introduced since it traditionally has been
a world writable, shared namespace and unless all user code safely uses randomized file names
it is vulnerable to DoS attacks and worse.

In this blog story I’d like to shed some light on proper usage of
/tmp and what your Linux application should use for what purpose. We’ll not
discuss why /tmp on tmpfs is a good idea, for that refer to the Fedora feature
page
. Here we’ll just discuss what /tmp should be used for and for
what it shouldn’t be, as well as what should be used instead. All that in order
to make sure your application remains compatible with these new features
introduced to many newer Linux distributions.

/tmp is (as the name suggests) an area where temporary files
applications require during operation may be placed. Of course, temporary files
differ very much in their properties:

  • They can be large, or very small
  • They might be used for sharing between users, or be private to users
  • They might need to be persistent across boots, or very volatile
  • They might need to be machine-local or shared on the network

Traditionally, /tmp has not only been the place where actual
temporary files are stored, but some software used to place (and often still
continues to place) communication primitives such as sockets, FIFOs, shared
memory there as well. Notably X11, but many others too. Usage of world-writable
shared namespaces for communication purposes has always been problematic, since
to establish communication you need stable names, but stable names open the
doors for DoS attacks. This can be corrected partially, by establishing
protected per-app directories for certain services during early boot (like we
do for X11), but this only fixes the problem partially, since this only works
correctly if every package installation is followed by a reboot.

Besides /tmp there are various other places where temporary files
(or other files that traditionally have been stored in /tmp) can be
stored. Here’s a quick overview of the candidates:

  • /tmp, POSIX suggests this is flushed as boot, FHS says that files
    do not need to be persistent between two runs of the application. Old files are
    often cleaned up automatically after a time (“aging”). Usually it is
    recommended to use $TMPDIR if it is set before falling back to /tmp
    directly. As mentioned, this is a tmpfs on many Linuxes/Unixes (and most likely
    will be for most soon), and hence should be used only for small files. It’s
    generally a shared namespace, hence the only APIs for using it should be mkstemp(), mkdtemp() (and friends)
    to be entirely safe.[1] Recently, improvements have been made to
    turn this shared namespace into a private namespace (see above), but that doesn’t
    relieve developers from writing secure code that is also safe if /tmp is a shared
    namespace. Because /tmp is no longer necessarily a shared namespace it
    is generally unsuitable as a location for communication primitives. It is
    machine-private and local. It’s usually fully featured (locking, …). This
    directory is world writable and thus available for both privileged and
    unprivileged code.
  • /var/tmp, according to FHS “more persistent” than /tmp,
    and is less often cleaned up (it’s persistent across reboots, for example). It’s not on a tmpfs, but on a real disk, and
    hence can be used to store much larger files. The same namespace problems apply
    as with /tmp, hence also exclusively use
    mkstemp()/mkdtemp() for this directory. It is also
    automatically cleaned up by time. It is machine-private. It’s not necessarily
    fully featured (no locking, …). This directory is world writable and thus
    available for both privileged and unprivileged code. We suggest to also check
    $TMPDIR before falling back to /var/tmp. That way if
    $TMPDIR is set this overrides usage of both /tmp and
    /var/tmp.
  • /run (traditionally /var/run) where privileged daemons
    can store runtime data, such as communication primitives. This is where your
    daemon should place its sockets. It’s guaranteed to be a shared namespace, but
    is only writable by privileged code and hence very safe to use. This file
    system is guaranteed to be a tmpfs and is hence automatically flushed at boots.
    No automatic clean-up is done beyond that. It is machine-private and local. It
    is fully-featured, and provides all functionality the local OS can provide
    (locking, sockets, …).
  • $XDG_RUNTIME_DIR
    where unprivileged user software can store runtime data, such as communication
    primitives. This is similar to /run but for user applications. It’s a
    user private namespace, and hence very safe to use. It’s cleaned up
    automatically at logout and also is cleaned up by time via “aging”. It is
    machine-private and fully featured. In GLib applications use
    g_get_user_runtime_dir() to query the path of this directory.
  • $XDG_CACHE_HOME
    where unprivileged user software can store non-essential data. It’s a private
    namespace of the user. It might be shared between machines. It is not
    automatically cleaned up, and not fully featured (no locking, and so on, due to
    NFS). In GLib applications use g_get_user_cache_dir() to query this
    directory.
  • $XDG_DOWNLOAD_DIR
    where unprivileged user software can store downloads and downloads in progress.
    It should only be used for downloads, and is a private namespace fo the user,
    but might be shared between machines. It is not automatically cleaned up and
    not fully featured. In GLib applications use g_get_user_special_dir()
    to query the path of this directory.

Now that we have introduced the contestants, here’s a rough guide how we
suggest you (a Linux application developer) pick the right directory to use:

  1. You need a place to put your socket (or other communication primitive) and your code runs privileged: use a subdirectory beneath /run. (Or beneath /var/run for extra compatibility.)
  2. You need a place to put your socket (or other communication primitive) and your code runs unprivileged: use a subdirectory beneath $XDG_RUNTIME_DIR.
  3. You need a place to put your larger downloads and downloads in progress and run unprivileged: use $XDG_DOWNLOAD_DIR.
  4. You need a place to put cache files which should be persistent and run unprivileged: use $XDG_CACHE_HOME.
  5. Nothing of the above applies and you need to place a small file that needs no persistency: use $TMPDIR with a fallback on /tmp. And use mkstemp(), and mkdtemp() and nothing homegrown.
  6. Otherwise use $TMPDIR with a fallback on /var/tmp. Also use mkstemp()/mkdtemp().

Note that these rules above are only suggested by us. These rules
take into account everything we know about this topic and avoid problems with
current and future distributions, as far as we can see them. Please consider
updating your projects to follow these rules, and keep them in mind if you
write new code.

One thing we’d like to stress is that /tmp and /var/tmp
more often than not are actually not the right choice for your usecase. There
are valid uses of these directories, but quite often another directory might
actually be the better place. So, be careful, consider the other options, but
if you do go for /tmp or /var/tmp then at least make sure to
use mkstemp()/mkdtemp().

Thank you for your interest!

Oh, and if you now complain that we don’t understand Unix, and that we are
morons and worse, then please read this again, and you might notice that this
is just a best practice guide, not a specification we have written. Nothing that
introduces anything new, just something that explains how things are.

If you want to complain about the tmp-on-tmpfs or
ServicesPrivateTmp feature, then this is not the right place either,
because this blog post is not really about that. Please direct this to
fedora-devel instead. Thank you very much.

Footnotes

[1] Well, or to turn this around: unless you have a PhD in advanced
Unixology and are not using mkstemp()/mkdtemp() but use
/tmp nonetheless it’s very likely you are writing vulnerable
code.

/etc/os-release

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/os-release.html

One of
the new configuration files systemd introduced is /etc/os-release
.
It replaces the multitude of per-distribution release files[1] with
a single one. Yesterday we decided
to drop
support for systems lacking /etc/os-release
in systemd since recently the majority of the big distributions adopted
/etc/os-release and many small ones did, too[2]. It’s our
hope that by dropping support for non-compliant distributions we gently put
some pressure on the remaining hold-outs to adopt this scheme as well.

I’d like to take the opportunity to explain a bit what the new file offers,
why application developers should care, and why the distributions should adopt
it. Of course, this file is pretty much a triviality in many ways,
but I guess it’s still one that deserves explanation.

So, you ask why this all?

  • It relieves application developers who just want to know the
    distribution they are running on to check for a multitude of individual release files.
  • It provides both a “pretty” name (i.e. one to show to the user), and
    machine parsable version/OS identifiers (i.e. for use in build systems).
  • It is extensible, can easily learn new fields if needed. For example, since
    we want to print a welcome message in the color of your distribution at boot
    we make it possible to configure the ANSI color for that in the file.

FAQs

There’s already the lsb_release tool for this, why don’t you
just use that?
Well, it’s a very strange interface: a shell script you have
to invoke (and hence spawn asynchronously from your C code), and it’s not
written to be extensible. It’s an optional package in many distributions, and
nothing we’d be happy to invoke as part of early boot in order to show a
welcome message. (In times with sub-second userspace boot times we really don’t
want to invoke a huge shell script for a triviality like showing the welcome
message). The lsb_release tool to us appears to be an attempt of
abstracting distribution checks, where standardization of distribution checks
is needed. It’s simply a badly designed interface. In our opinion, it
has its use as an interface to determine the LSB version itself, but not for
checking the distribution or version.

Why haven’t you adopted one of the generic release files, such as
Fedora’s /etc/system-release?
Well, they are much nicer than
lsb_release, so much is true. However, they are not extensible and
are not really parsable, if the distribution needs to be identified
programmatically or a specific version needs to be verified.

Why didn’t you call this file /etc/bikeshed instead? The name
/etc/os-release sucks!
In a way, I think you kind of answered your
own question there already.

Does this mean my distribution can now drop our equivalent of
/etc/fedora-release?
Unlikely, too much code exists that still
checks for the individual release files, and you probably shouldn’t break that.
This new file makes things easy for applications, not for distributions:
applications can now rely on a single file only, and use it in a nice way.
Distributions will have to continue to ship the old files unless they are
willing to break compatibility here.

This is so useless! My application needs to be compatible with distros
from 1998, so how could I ever make use of the new file? I will have to
continue using the old ones!
True, if you need compatibility with really
old distributions you do. But for new code this might not be an issue, and in
general new APIs are new APIs. So if you decide to depend on it, you add a
dependency on it. However, even if you need to stay compatible it might make
sense to check /etc/os-release first and just fall back to the old
files if it doesn’t exist. The least it does for you is that you don’t need 25+
open() attempts on modern distributions, but just one.

You evil people are forcing my beloved distro $XYZ to adopt your awful
systemd schemes. I hate you!
You hate too much, my friend. Also, I am
pretty sure it’s not difficult to see the benefit of this new file
independently of systemd, and it’s truly useful on systems without systemd,
too.

I hate what you people do, can I just ignore this? Well, you really
need to work on your constant feelings of hate, my friend. But, to a certain
degree yes, you can ignore this for a while longer. But already, there are a
number of applications making use of this file. You lose compatibility with
those. Also, you are kinda working towards the further balkanization of the
Linux landscape, but maybe that’s your intention?

You guys add a new file because you think there are already too many? You
guys are so confused!
None of the existing files is generic and extensible
enough to do what we want it to do. Hence we had to introduce a new one. We
acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure,
it’s extensible, and we are happy if distributions extend it. Please prefix
your keys with your distribution’s name however. Or even better: talk to us and
we might be able update the documentation and make your field standard, if you
convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to
identify the OS it is being built on or is being run on, please consider making
use of this new file, we created it for you. If you work on a distribution, and
your distribution doesn’t support this file yet, please consider adopting this
file, too.

If you are working on a small/embedded distribution, or a legacy-free
distribution we encourage you to adopt only this file and not establish any
other per-distro release file.

Read the documentation for /etc/os-release.

Footnotes

[1] Yes, multitude, there’s at least: /etc/redhat-release,
/etc/SuSE-release, /etc/debian_version,
/etc/arch-release, /etc/gentoo-release,
/etc/slackware-version, /etc/frugalware-release,
/etc/altlinux-release, /etc/mandriva-release,
/etc/meego-release, /etc/angstrom-version,
/etc/mageia-release. And some distributions even have multiple, for
example Fedora has already four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom,
Frugalware have adopted this. (This list is not comprehensive, there are
probably more.)

The Case for the /usr Merge

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/the-usr-merge.html

One of the features of Fedora 17 is the /usr merge, put
forward by Harald Hoyer and Kay Sievers[1]. In the time since this
feature has been proposed repetitive discussions took place all over the various
Free Software communities, and usually the same questions were asked: what the reasons
behind this feature were, and whether it makes sense to adopt the same scheme for
distribution XYZ, too.

Especially in the Non-Fedora world it appears to be socially unacceptable to
actually have a look at the Fedora feature page
(where many of the questions are already brought up and answered) which is very unfortunate. To
improve the situation I spent some time today to summarize the reasons for the
/usr merge independently. I’d hence like to direct you to this new page I put
up which tries to summarize the reasons for this, with an emphasis on the
compatibility point of view:

The Case for the /usr Merge

Note that even though this page is in the systemd wiki, what it covers is
mostly orthogonal to systemd. systemd supports both systems with a merged /usr
and with a split /usr, and the /usr merge should be interesting for non-systemd
distributions as well.

Primarily I put this together to have a nice place to point all those folks
who continue to write me annoyed emails, even though I am actually not even
working on all of this…

Enjoy the read!

Footnotes:

[1] And not actually by me, I am just a supportive spectator and am
not doing any work on it. Unfortunately some tech press folks created the false
impression I was behind this. But credit where credit is due, this is all
Harald’s and Kay’s work.

Plumbers Wishlist, The Third Edition, a.k.a. "The Thank You Edition"

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist-3.html

Last October we published a
wishlist for plumbing related features
we’d like to see added to the Linux
kernel. Three months later it’s time to publish a short update, and explain
what has been implemented in the kernel, what people have started working on,
and what’s still missing.

The full, updated list is available
on Google Docs
.

In general, I must say that the list turned out to be a great success. It
shows how awesome the Open Source community is: Just ask nicely and there’s a
good chance they’ll fulfill your wishes! Thank you very much, Linux
community!

We’d like to thank everybody who worked on any of the features on that list:
Lucas De Marchi, Andi Kleen, Dan Ballard, Li Zefan, Kirill A. Shutemov,
Davidlohr Bueso, Cong Wang, Lennart Poettering, Kay Sievers.

Of the items on the list 5 have been fully implemented and are already part
of a released kernel, or already merged for inclusion for the next kernels
being released.

For 4 further items patches have been posted, and I am hoping they’ll get
merged eventually. Davidlohr, Wang, Zefan, Kirill, it would be great if you’d
continue working on your patches, as we think they are following the right
approach[1] even if there was some opposition to them on LKML. So,
please keep pushing to solve the outstanding issues and thanks for your work so far!

Footnotes

[1] Yes, I still believe that tmpfs quota should be implemented via
resource limits, as everything else wouldn’t work, as we don’t want to
implement complex and fragile userspace infrastructure to racily upload complex
quota data for all current and future UIDs ever used on the system into each
tmpfs mount point at mount time.

systemd for Administrators, Part XII

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/security.html

Here’s the twelfth installment
of

my ongoing series
on
systemd
for
Administrators:

Securing Your Services

One of the core features of Unix systems is the idea of privilege separation
between the different components of the OS. Many system services run under
their own user IDs thus limiting what they can do, and hence the impact they
may have on the OS in case they get exploited.

This kind of privilege separation only provides very basic protection
however, since in general system services run this way can still do at least as
much as a normal local users, though not as much as root. For security purposes
it is however very interesting to limit even further what services can do, and
shut them off a couple of things that normal users are allowed to do.

A great way to limit the impact of services is by employing MAC technologies
such as SELinux. If you are interested to secure down your server, running
SELinux is a very good idea. systemd enables developers and administrators to
apply additional restrictions to local services independently of a MAC. Thus,
regardless whether you are able to make use of SELinux you may still enforce
certain security limits on your services.

In this iteration of the series we want to focus on a couple of these
security features of systemd and how to make use of them in your services.
These features take advantage of a couple of Linux-specific technologies that have
been available in the kernel for a long time, but never have been exposed in a
widely usable fashion. These systemd features have been designed to be as easy to use
as possible, in order to make them attractive to administrators and upstream
developers:

  • Isolating services from the network
  • Service-private /tmp
  • Making directories appear read-only or inaccessible to services
  • Taking away capabilities from services
  • Disallowing forking, limiting file creation for services
  • Controlling device node access of services

All options described here are documented in systemd’s man pages, notably systemd.exec(5).
Please consult these man pages for further details.

All these options are available on all systemd systems, regardless if
SELinux or any other MAC is enabled, or not.

All these options are relatively cheap, so if in doubt use them. Even if you
might think that your service doesn’t write to /tmp and hence enabling
PrivateTmp=yes (as described below) might not be necessary, due to
today’s complex software it’s still beneficial to enable this feature, simply
because libraries you link to (and plug-ins to those libraries) which you do
not control might need temporary files after all. Example: you never know what
kind of NSS module your local installation has enabled, and what that NSS module
does with /tmp.

These options are hopefully interesting both for administrators to secure
their local systems, and for upstream developers to ship their services secure
by default. We strongly encourage upstream developers to consider using these
options by default in their upstream service units. They are very easy to make
use of and have major benefits for security.

Isolating Services from the Network

A very simple but powerful configuration option you may use in systemd
service definitions is PrivateNetwork=:

...
[Service]
ExecStart=...
PrivateNetwork=yes
...

With this simple switch a service and all the processes it consists of are
entirely disconnected from any kind of networking. Network interfaces became
unavailable to the processes, the only one they’ll see is the loopback device
“lo”, but it is isolated from the real host loopback. This is a very powerful
protection from network attacks.

Caveat: Some services require the network to be operational. Of
course, nobody would consider using PrivateNetwork=yes on a
network-facing service such as Apache. However even for non-network-facing
services network support might be necessary and not always obvious. Example: if
the local system is configured for an LDAP-based user database doing glibc name
lookups with calls such as getpwnam() might end up resulting in network access.
That said, even in those cases it is more often than not OK to use
PrivateNetwork=yes since user IDs of system service users are required to
be resolvable even without any network around. That means as long as the only
user IDs your service needs to resolve are below the magic 1000 boundary using
PrivateNetwork=yes should be OK.

Internally, this feature makes use of network namespaces of the kernel. If
enabled a new network namespace is opened and only the loopback device
configured in it.

Service-Private /tmp

Another very simple but powerful configuration switch is
PrivateTmp=:

...
[Service]
ExecStart=...
PrivateTmp=yes
...

If enabled this option will ensure that the /tmp directory the
service will see is private and isolated from the host system’s /tmp.
/tmp traditionally has been a shared space for all local services and
users. Over the years it has been a major source of security problems for a
multitude of services. Symlink attacks and DoS vulnerabilities due to guessable
/tmp temporary files are common. By isolating the service’s
/tmp from the rest of the host, such vulnerabilities become moot.

For Fedora 17 a feature has
been accepted
in order to enable this option across a large number of
services.

Caveat: Some services actually misuse /tmp as a location
for IPC sockets and other communication primitives, even though this is almost
always a vulnerability (simply because if you use it for communication you need
guessable names, and guessable names make your code vulnerable to DoS and symlink
attacks) and /run is the much safer replacement for this, simply
because it is not a location writable to unprivileged processes. For example,
X11 places it’s communication sockets below /tmp (which is actually
secure — though still not ideal — in this exception since it does so in a
safe subdirectory which is created at early boot.) Services which need to
communicate via such communication primitives in /tmp are no
candidates for PrivateTmp=. Thankfully these days only very few
services misusing /tmp like this remain.

Internally, this feature makes use of file system namespaces of the kernel.
If enabled a new file system namespace is opened inheritng most of the host
hierarchy with the exception of /tmp.

Making Directories Appear Read-Only or Inaccessible to Services

With the ReadOnlyDirectories= and InaccessibleDirectories=
options it is possible to make the specified directories inaccessible for
writing resp. both reading and writing to the service:

...
[Service]
ExecStart=...
InaccessibleDirectories=/home
ReadOnlyDirectories=/var
...

With these two configuration lines the whole tree below /home
becomes inaccessible to the service (i.e. the directory will appear empty and
with 000 access mode), and the tree below /var becomes read-only.

Caveat: Note that ReadOnlyDirectories= currently is not
recursively applied to submounts of the specified directories (i.e. mounts below
/var in the example above stay writable). This is likely to get fixed
soon.

Internally, this is also implemented based on file system namspaces.

Taking Away Capabilities From Services

Another very powerful security option in systemd is
CapabilityBoundingSet= which allows to limit in a relatively fine
grained fashion which kernel capabilities a service started retains:

...
[Service]
ExecStart=...
CapabilityBoundingSet=CAP_CHOWN CAP_KILL
...

In the example above only the CAP_CHOWN and CAP_KILL capabilities are
retained by the service, and the service and any processes it might create have
no chance to ever acquire any other capabilities again, not even via setuid
binaries. The list of currently defined capabilities is available in capabilities(7).
Unfortunately some of the defined capabilities are overly generic (such as
CAP_SYS_ADMIN), however they are still a very useful tool, in particular for
services that otherwise run with full root privileges.

To identify precisely which capabilities are necessary for a service to run
cleanly is not always easy and requires a bit of testing. To simplify this
process a bit, it is possible to blacklist certain capabilities that are
definitely not needed instead of whitelisting all that might be needed. Example: the
CAP_SYS_PTRACE is a particularly powerful and security relevant capability
needed for the implementation of debuggers, since it allows introspecting and
manipulating any local process on the system. A service like Apache obviously
has no business in being a debugger for other processes, hence it is safe to
remove the capability from it:

...
[Service]
ExecStart=...
CapabilityBoundingSet=~CAP_SYS_PTRACE
...

The ~ character the value assignment here is prefixed with inverts
the meaning of the option: instead of listing all capabalities the service
will retain you may list the ones it will not retain.

Caveat: Some services might react confused if certain capabilities are
made unavailable to them. Thus when determining the right set of capabilities
to keep around you need to do this carefully, and it might be a good idea to talk
to the upstream maintainers since they should know best which operations a
service might need to run successfully.

Caveat 2: Capabilities are
not a magic wand.
You probably want to combine them and use them in
conjunction with other security options in order to make them truly useful.

To easily check which processes on your system retain which capabilities use
the pscap tool from the libcap-ng-utils package.

Making use of systemd’s CapabilityBoundingSet= option is often a
simple, discoverable and cheap replacement for patching all system daemons
individually to control the capability bounding set on their own.

Disallowing Forking, Limiting File Creation for Services

Resource Limits may be used to apply certain security limits on services
being run. Primarily, resource limits are useful for resource control (as the
name suggests…) not so much access control. However, two of them can be
useful to disable certain OS features: RLIMIT_NPROC and RLIMIT_FSIZE may be
used to disable forking and disable writing of any files with a size >
0:

...
[Service]
ExecStart=...
LimitNPROC=1
LimitFSIZE=0
...

Note that this will work only if the service in question drops privileges
and runs under a (non-root) user ID of its own or drops the CAP_SYS_RESOURCE
capability, for example via CapabilityBoundingSet= as discussed above.
Without that a process could simply increase the resource limit again thus
voiding any effect.

Caveat: LimitFSIZE= is pretty brutal. If the service
attempts to write a file with a size > 0, it will immeidately be killed with
the SIGXFSZ which unless caught terminates the process. Also, creating files
with size 0 is still allowed, even if this option is used.

For more information on these and other resource limits, see setrlimit(2).

Controlling Device Node Access of Services

Devices nodes are an important interface to the kernel and its drivers.
Since drivers tend to get much less testing and security checking than the core
kernel they often are a major entry point for security hacks. systemd allows
you to control access to devices individually for each service:

...
[Service]
ExecStart=...
DeviceAllow=/dev/null rw
...

This will limit access to /dev/null and only this device node,
disallowing access to any other device nodes.

The feature is implemented on top of the devices cgroup controller.

Other Options

Besides the easy to use options above there are a number of other security
relevant options available. However they usually require a bit of preparation
in the service itself and hence are probably primarily useful for upstream
developers. These options are RootDirectory= (to set up
chroot() environments for a service) as well as User= and
Group= to drop privileges to the specified user and group. These
options are particularly useful to greatly simplify writing daemons, where all
the complexities of securely dropping privileges can be left to systemd, and
kept out of the daemons themselves.

If you are wondering why these options are not enabled by default: some of
them simply break seamntics of traditional Unix, and to maintain compatibility
we cannot enable them by default. e.g. since traditional Unix enforced that
/tmp was a shared namespace, and processes could use it for IPC we
cannot just go and turn that off globally, just because /tmp‘s role in
IPC is now replaced by /run.

And that’s it for now. If you are working on unit files for upstream or in
your distribution, please consider using one or more of the options listed
above. If you service is secure by default by taking advantage of these options
this will help not only your users but also make the Internet a safer
place.

Kernel Hackers Panel

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/linuxcon-kernel-panel.html

At LinuxCon Europe/ELCE I had the chance to moderate the kernel hackers
panel with Linus Torvalds, Alan Cox, Paul McKenney and Thomas Gleixner on
stage
. I like to believe it went quite well, but check it out for yourself, as
a video recording is now available online:

For me personally I think the most notable topic covered was Control Groups,
and the clarification that they are something that is needed even though their
implementation right now is in many ways less than perfect. But in the end there is no
reasonable way around it, and much like SMP, technology that complicates things
substantially but is ultimately unavoidable.

Other videos from ELCE are online now, too.

libabc

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/libabc.html

At the Kernel Summit in Prague last week Kay Sievers and I lead a session on
developing shared userspace libraries, for kernel hackers. More and more
userspace interfaces of the kernel (for example many which deal with storage,
audio, resource management, security, file systems or a number of other
subsystems) nowadays rely on a dedicated userspace component. As people who
work primarily in the plumbing layer of the Linux OS we noticed over and over
again that these libraries written by people who usually are at home on the
kernel side of things make the same mistakes repeatedly, thus making life for
the users of the libraries unnecessarily difficult. In our session we tried to
point out a number of these things, and in particular places where the usual
kernel hacking style translates badly into userspace shared library hacking.
Our hope is that maybe a few kernel developers have a look at our list of
recommendations and consider the points we are raising.

To make things easy we have put together an example skeleton library we
dubbed libabc, whose README
file includes all our points in terse form. It’s available on kernel.org:

The git repository and the README.

This list of recommendations draws inspiration from David Zeuthen’s and
Ulrich Drepper’s well known papers on the topic of writing shared libraries. In
the README linked above we try to distill this wealth of information into a
terse list of recommendations, with a couple of additions and with a strict
focus on a kernel hacker background.

Please have a look, and even if you are not a kernel hacker there might be
something useful to know in it, especially if you work on the lower layers of
our stack.

If you have any questions or additions, just ping us, or comment below!

Prague

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/linuxcon-europe.html

If you make it to Prague the coming week for the LinuxCon/ELCE/GStreamer/Kernel Summit/… superconference, make sure not to miss:

All of that at the Clarion Hotel. See you in Prague!

Plumbers Wishlist, The Second Edition

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist-2.html

Two weeks ago we published a Plumber’s
Wishlist for Linux
. So far, this has already created lively discussions in
the community (as reported on LWN among others), and patches for a few of the
items listed have already been posted (thanks a lot to those who worked on
this, your contributions are much appreciated!).

We
have now prepared a second version of the wish list.
It includes a number
of additions (tmpfs quota! hostname change notifications! and more!) and
updates to the previous items, including links to patches, and references to
other interesting material.

We hope to update this wishlist from time, so stay tuned!

And now, go and read the new wishlist!

Google doesn’t like my name

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/google-doesnt-like-my-name.html

Nice one, Google suspended my Google+ account because I created it under,
well, my name, which is “Lennart Poettering”, and Google+ thinks that wasn’t my
name, even though it says so in my passport, and almost every document I own
and I was never aware I had any other name. This is ricidulous. Google, give me
my name back! This is a really uncool move.

A Big Loss

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/a-big-loss.html

Google
announced today that they’ll be shutting down Google Code Search in
January
. I am quite sure that this would be a massive loss for the Free
Software community. The ability to learn from other people’s code is a key
idea of Free Software. There’s simply no better way to do that than with a
source code search engine. The day Google Code Search will be shut down will
be a sad day for the Free Software community.

Of course, there are a couple of alternatives around, but they all have one
thing in common: they, uh, don’t even remotely compare to the completeness,
performance and simplicity of the Google Code Search interface, and have
serious usability issues. (For example: koders.com is really really slow, and
splits up identifiers you search for at underscores, which kinda makes it
useless for looking for almost any kind of code.)

I think it must be of genuine interest to the Free Software community to
have a capable replacement for Google Code Search, for the day it is turned
off. In fact, it probably should be something the various foundations which
promote Free Software should be looking into, like the FSF or the Linux
Foundation. There are very few better ways to get Free Software into the heads
and minds of engineers than by examples — examples consisting of real life
code they can find with a source code search engine. I believe a source code
search engine is probably among the best vehicles to promote Free Software
towards engineers. In particular if it itself was Free Software (in contrast to
Google Code Search).

Ideally, all software available on web sites like SourceForge, Freshmeat, or
github should be indexed. But there’s also a chance for distributions here:
indexing the sources of all packages a distribution like Debian or Fedora
include would be a great tool for developers. In fact, a distribution offering
this functionality might benefit from such functionality, as it attracts
developer interest in the distribution.

It’s sad that Google Code Search will be gone soon. But maybe there’s
something positive in the bad news here, and a chance to create something better,
more comprehensive, that is free, and promotes our ideals better than Google
ever could. Maybe there’s a chance here for the Open Source foundations, for
the distributions and for the communities to create a better replacement!

A Plumber’s Wish List for Linux

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/plumbers-wishlist.html

Here’s a mail
we just sent to LKML
, for your consideration. Enjoy:

Subject: A Plumber’s Wish List for Linux

We’d like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope to find some help.

If you happen to be interested in working on something from this list or
able to help out, we’d be delighted. Please ping us in case you need
clarifications or more information on specific items.

Thanks,
Kay, Lennart, Harald, in the name of all the other plumbers


An here’s the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT labels is implemented as a hidden directory entry in the file
system which need to be renamed when changing the file system label,
this is impossible to do from userspace without unmounting. Hence we’d
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andy Kleen has a patch to create the alias file itself. CPU
‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available. And
we’d like to see that fix to make numerous ugly userspace work-arounds
to achieve the same go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren’t. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
“inverted” semantics of CapBnd in /proc/$PID/status.

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’
Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course it
is questionable if services like sendmail make use of this, but otoh for
services which fork but do not immediately exec() another binary being
able to rename this child processes in ps is of importance.

* module-init-tools: provide a proper libmodprobe.so from
module-init-tools:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster then a cgroup
supervisor process could kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an unefficient and an
ugly hack. Tools would prefer anything more lightweight like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, there are available a few ugly hacks to detect
this (for example a process wanting to know whether it is running in a
PID namespace could just look for a PID 2 being around and named
kthreadd which is a kernel thread only visible in the root namespace),
however all these solutions encode information and expectations that
better shouldn’t be encoded in a namespace test like this. This
functionality is needed in particular since the removal of the the ns
cgroup controller which provided the namespace membership information to
user code.

* allow making use of the “cpu” cgroup controller by default without
breaking RT. Right now creating a cgroup in the “cpu” hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in “cpu” whose
processes get an non-RT weight applied, but for RT take advantage of the
parent’s RT budget. We want the separation of RT and non-RT budget
assignment in the “cpu” hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of “cpu” hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog shall be shared among various containers (or service cgroups),
and the syslog implementation needs to be able to distinguish the
sending cgroup in order to separate the logs on disk. Of course stm
SCM_CREDENTIALS can be used to look up the PID of the sender followed by
a check in /proc/$PID/cgroup, but that is necessarily racy, and actually
a very real race in real life.

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.

What You Need to Know When Becoming a Free Software Hacker

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/hinter-den-kulissen.html

Earlier today I gave a presentation at the Technical University Berlin about
things you need to know, things you should expect and things you shouldn’t
expect when your are aspiring to become a successful Free Software Hacker.

I have put my slides up on Google Docs in case you are interested, either
because you are the target audience (i.e. a university student) or because you
need inspiration for a similar talk about the same topic.

The first two slides are in German language, so skip over them. The
interesting bits are all in English. I hope it’s quite comprehensive (though of
course terse). Enjoy:

In case your feed reader/planet messes this up, here’s the non-embedded version.

Oh, and thanks to everybody who reviewed and suggested additions to the the slides on +.