Tag Archives: Rust

Why systemd?

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/why.html

systemd is
still a young project, but it is not a baby anymore. The initial
announcement
I posted precisely a year ago. Since then most of the
big distributions have decided to adopt it in one way or another, many
smaller distributions have already switched. The first big
distribution with systemd by default will be Fedora 15, due end of
May. It is expected that the others will follow the lead a bit later
(with one exception). Many
embedded developers have already adopted it too, and there’s even a company specializing on engineering and
consulting services for systemd
. In short: within one year
systemd became a really successful project.

However, there are still folks who we haven’t won over yet. If you
fall into one of the following categories, then please have a look on
the comparison of init systems below:

  • You are working on an embedded project and are wondering whether
    it should be based on systemd.
  • You are a user or administrator and wondering which distribution
    to pick, and are pondering whether it should be based on systemd or
    not.
  • You are a user or administrator and wondering why your favourite
    distribution has switched to systemd, if everything already worked so
    well before.
  • You are developing a distribution that hasn’t switched yet, and
    you are wondering whether to invest the work and go systemd.

And even if you don’t fall into any of these categories, you might still
find the comparison interesting.

We’ll be comparing the three most relevant init systems for Linux:
sysvinit, Upstart and systemd. Of course there are other init systems
in existance, but they play virtually no role in the big
picture. Unless you run Android (which is a completely different beast
anyway), you’ll almost definitely run one of these three init systems
on your Linux kernel. (OK, or busybox, but then you are basically not
running any init system at all.) Unless you have a soft spot for
exotic init systems there’s little need to look further. Also, I am
kinda lazy, and don’t want to spend the time on analyzing those other
systems in enough detail to be completely fair to them.

Speaking of fairness: I am of course one of the creators of
systemd. I will try my best to be fair to the other two contenders,
but in the end, take it with a grain of salt. I am sure though that
should I be grossly unfair or otherwise incorrect somebody will point
it out in the comments of this story, so consider having a look on
those, before you put too much trust in what I say.

We’ll look at the currently implemented features in a released
version. Grand plans don’t count.

General Features

sysvinitUpstartsystemd
Interfacing via D-Busnoyesyes
Shell-free bootupnonoyes
Modular C coded early boot services includednonoyes
Read-Aheadnono[1]yes
Socket-based Activationnono[2]yes
Socket-based Activation: inetd compatibilitynono[2]yes
Bus-based Activationnono[3]yes
Device-based Activationnono[4]yes
Configuration of device dependencies with udev rulesnonoyes
Path-based Activation (inotify)nonoyes
Timer-based Activationnonoyes
Mount handlingnono[5]yes
fsck handlingnono[5]yes
Quota handlingnonoyes
Automount handlingnonoyes
Swap handlingnonoyes
Snapshotting of system statenonoyes
XDG_RUNTIME_DIR Supportnonoyes
Optionally kills remaining processes of users logging outnonoyes
Linux Control Groups Integrationnonoyes
Audit record generation for started servicesnonoyes
SELinux integrationnonoyes
PAM integrationnonoyes
Encrypted hard disk handling (LUKS)nonoyes
SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agentsnonoyes
Network Loopback device handlingnonoyes
binfmt_misc handlingnonoyes
System-wide locale handlingnonoyes
Console and keyboard setupnonoyes
Infrastructure for creating, removing, cleaning up of temporary and volatile filesnonoyes
Handling for /proc/sys sysctlnonoyes
Plymouth integrationnoyesyes
Save/restore random seednonoyes
Static loading of kernel modulesnonoyes
Automatic serial console handlingnonoyes
Unique Machine ID handlingnonoyes
Dynamic host name and machine meta data handlingnonoyes
Reliable termination of servicesnonoyes
Early boot /dev/log loggingnonoyes
Minimal kmsg-based syslog daemon for embedded usenonoyes
Respawning on service crash without losing connectivitynonoyes
Gapless service upgradesnonoyes
Graphical UInonoyes
Built-In Profiling and Toolsnonoyes
Instantiated servicesnoyesyes
PolicyKit integrationnonoyes
Remote access/Cluster support built into client toolsnonoyes
Can list all processes of a servicenonoyes
Can identify service of a processnonoyes
Automatic per-service CPU cgroups to even out CPU usage between themnonoyes
Automatic per-user cgroupsnonoyes
SysV compatibilityyesyesyes
SysV services controllable like native servicesyesnoyes
SysV-compatible /dev/initctlyesnoyes
Reexecution with full serialization of stateyesnoyes
Interactive boot-upno[6]no[6]yes
Container support (as advanced chroot() replacement)nonoyes
Dependency-based bootupno[7]noyes
Disabling of services without editing filesyesnoyes
Masking of services without editing filesnonoyes
Robust system shutdown within PID 1nonoyes
Built-in kexec supportnonoyes
Dynamic service generationnonoyes
Upstream support in various other OS componentsyesnoyes
Service files compatible between distributionsnonoyes
Signal delivery to servicesnonoyes
Reliable termination of user sessions before shutdownnonoyes
utmp/wtmp supportyesyesyes
Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management toolsnonoyes

[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.

[2] Socket activation implementation for Upstart available as preview, lacks parallelization support hence entirely misses the point of socket activation.

[3] Bus activation implementation for Upstart posted as patch, not merged.

[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.

[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.

[6] Some distributions offer this implemented in shell.

[7] LSB init scripts support this, if they are used.

Available Native Service Settings

sysvinitUpstartsystemd
OOM Adjustmentnoyes[1]yes
Working Directorynoyesyes
Root Directory (chroot())noyesyes
Environment Variablesnoyesyes
Environment Variables from external filenonoyes
Resource Limitsnosome[2]yes
umasknoyesyes
User/Group/Supplementary Groupsnonoyes
IO Scheduling Class/Prioritynonoyes
CPU Scheduling Nice Valuenoyesyes
CPU Scheduling Policy/Prioritynonoyes
CPU Scheduling Reset on fork() controlnonoyes
CPU affinitynonoyes
Timer Slacknonoyes
Capabilities Controlnonoyes
Secure Bits Controlnonoyes
Control Group Controlnonoyes
High-level file system namespace control: making directories inacessiblenonoyes
High-level file system namespace control: making directories read-onlynonoyes
High-level file system namespace control: private /tmpnonoyes
High-level file system namespace control: mount inheritancenonoyes
Input on Consoleyesyesyes
Output on Syslognonoyes
Output on kmsg/dmesgnonoyes
Output on arbitrary TTYnonoyes
Kill signal controlnonoyes
Conditional execution: by identified CPU virtualization/containernonoyes
Conditional execution: by file existancenonoyes
Conditional execution: by security frameworknonoyes
Conditional execution: by kernel command linenonoyes

[1] Upstart supports only the deprecated oom_score_adj mechanism, not the current oom_adj logic.

[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.

Note that some of these options are relatively easily added to SysV
init scripts, by editing the shell sources. The table above focusses
on easily accessible options that do not require source code
editing.

Miscellaneous

sysvinitUpstartsystemd
Maturity> 15 years6 years1 year
Specialized professional consulting and engineering services availablenonoyes
SCMSubversionBazaargit
Copyright-assignment-free contributingyesnoyes

Summary

As the tables above hopefully show in all clarity systemd
has left behind both sysvinit and Upstart in almost every
aspect. With the exception of the project’s age/maturity systemd wins
in every category. At this point in time it will be very hard for
sysvinit and Upstart to catch up with the features systemd provides
today. In one year we managed to push systemd forward much further
than Upstart has been pushed in six.

It is our intention to drive forward the development of the Linux
platform with systemd. In the next release cycle we will focus more
strongly on providing the same features and speed improvement we
already offer for the system to the user login session. This will
bring much closer integration with the other parts of the OS and
applications, making the most of the features the service manager
provides, and making it available to login sessions. Certain
components such as ConsoleKit will be made redundant by these
upgrades, and services relying on them will be updated. The
burden for maintaining these then obsolete components
will be passed on the vendors who plan to continue to rely on
them.

If you are wondering whether or not to adopt systemd, then systemd
obviously wins when it comes to mere features. Of course that should
not be the only aspect to keep in mind. In the long run, sticking with
the existing infrastructure (such as ConsoleKit) comes at a price:
porting work needs to take place, and additional maintainance work for
bitrotting code needs to be done. Going it on your own means increased
workload.

That said, adopting systemd is also not free. Especially if you
made investments in the other two solutions adopting systemd means
work. The basic work to adopt systemd is relatively minimal for
porting over SysV systems (since compatibility is provided), but can
mean substantial work when coming from Upstart. If you plan to go for
a 100% systemd system without any SysV compatibility (recommended for
embedded, long run goal for the big distributions) you need to be
willing to invest some work to rewrite init scripts as simple systemd
unit files.

systemd is in the process of becoming a comprehensive, integrated
and modular platform providing everything needed to bootstrap and
maintain an operating system’s userspace. It includes C rewrites of
all basic early boot init scripts that are shipped with the various
distributions. Especially for the embedded case adopting systemd
provides you in one step with almost everything you need, and you can
pick the modules you want. The other two init systems are singular
individual components, which to be useful need a great number of
additional components with differing interfaces. The emphasis of
systemd to provide a platform instead of just a component allows for
closer integration, and cleaner APIs. Sooner or later this will
trickle up to the applications. Already, there are accepted XDG
specifications (e.g. XDG basedir spec, more specifically
XDG_RUNTIME_DIR) that are not supported on the other init systems.

systemd is also a big opportunity for Linux standardization. Since
it standardizes many interfaces of the system that previously have
been differing on every distribution, on every implementation,
adopting it helps to work against the balkanization of the Linux
interfaces. Choosing systemd means redefining more closely
what the Linux platform is about. This improves the lifes of
programmers, users and administrators alike.

I believe that momentum is clearly with systemd. We invite you to
join our community and be part of that momentum.

Emulated atomic operations and real-time scheduling

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/atomic-rt.html

Unfortunately not all CPU architectures have native support for atomic
operations
, or only support a very limited subset. Most
prominently ARMv5 (and
older) hasn’t any support besides the most basic atomic swap
operation[1]. Now, more and more free code is starting to
use atomic operations and lock-free
algorithms
, one being my own project, PulseAudio. If you have ever done real-time
programming
you probably know that you cannot really do it without
support for atomic operations. One question remains however: what to
do on CPUs which support only the most basic atomic operations
natively?

On the kernel side atomic ops are very easy to emulate: just disable
interrupts temporarily, then do your operation non-atomically, and afterwards
enable them again. That’s relatively cheap and works fine (unless you are on SMP — which
fortunately you usually are not for those CPUs). The Linux
kernel does it this way and it is good. But what to do in user-space, where you cannot just go and disable interrupts?

Let’s see how the different userspace libraries/frameworks do it
for ARMv5, a very relevant architecture that only knows an atomic swap (exchange)
but no CAS
or even atomic arithmetics. Let’s start with an excerpt from glibc’s
atomic operations implementation for ARM
:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic
operations the more likely you’re going to to hit this race. Let’s
hope glibc is not a heavy user of atomic operations. PulseAudio however is, and
PulseAudio happens to be my focus.

Let’s have a look on how Qt4
does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
	*ptr = newval;
	ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard
situations it actually works. But it sucks big time, too. Why? It
contains a spin lock: the variable q_atomic_lock is used for
locking the atomic operation. The code tries to set it to non-zero,
and if that fails it tries again, until it succeeds, in the hope that
the other thread — which currently holds the lock — gives it up. The
big problem here is: it might take a while until that happens, up to
1/HZ time on Linux. Usually you want to use atomic operations to
minimize the need for mutexes and thus speed things up. Now, here you
got a lock, and it’s the worst kind: the spinning lock. Not
good. Also, if used from a real-time thread the machine simply locks
up when we enter the loop in contended state, because preemption is
disabled for RT threads and thus the loop will spin forever. Evil. And
then, there’s another problem: it’s a big bottleneck, because all
atomic operations are synchronized via a single variable which is
q_atomic_lock. Not good either. And let’s not forget that
only code that has access to q_atomic_lock actually can
execute this code safely. If you want to use it for
lock-free IPC via shared memory this is going to break. And let’s not
forget that it is unusable from signal handlers (which probably
doesn’t matter much, though). So, in summary: this code sucks,
too.

Next try, let’s have a look on how glib
does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of
sched_yield() for asking the OS to reschedule. It’s a bit
better than the Qt version, since it doesn’t spin just burning CPU,
but instead tells the kernel to execute something else, increasing the
chance that the thread currently holding the lock is scheduled. It’s a
bit friendlier, but it’s not great either because this might still delay
execution quite a bit. It’s better then the Qt version. And probably
one of the very few ligitimate occasions where using
sched_yield() is OK. It still doesn’t work for RT — because
sched_yield() in most cases is a NOP on for RT threads, so
you still get a machine lockup. And it still has the
one-lock-to-rule-them-all bottleneck. And it still is not compatible
with shared memory.

Then, there’s libatomic_ops. It’s
the most complex code, so I’ll spare you to paste it here. Basically
it uses the same spin loop. With three differences however:

  1. 16 lock variables instead of a single one are used. The variable
    that is used is picked via simple hashing of the pointer to the atomic variable
    that shall be modified. This removes the one-lock-to-rule-them-all
    bottleneck.
  2. Instead of pthread_yield() it uses select() with
    a small timeval parameter to give the current holder of the lock some
    time to give it up. To make sure that the select() is not
    optimized away by the kernel and the thread thus never is preempted
    the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.

It’s certainly the best implementation of the ones discussed here:
It doesn’t suffer by the one-lock-to-rule-them-all bottleneck. It’s
(supposedly) signal handler safe (which however comes at the cost of
doing two syscalls on every atomic operation — probably a very high
price). It actually works on RT, due to sleeping for an explicit
time. However it still doesn’t deal with priority
inversion
problems — which is a big issue for real-time
programming. Also, the time slept in the select() call might
be relatively long, since at least on Linux the time passed to
select() is rounded up to 1/HZ — not good for RT either. And
then, it still doesn’t work for shared memory IPC.

So, what do we learn from this? At least one thing: better don’t do
real-time programming with ARMv5[2]. But more practically, how
could a good emulation for atomic ops, solely based on atomic swap
look like? Here are a few ideas:

  • Use an implementation inspired by libatomic_ops. Right
    now it’s the best available. It’s probably a good idea, though, to
    replace select() by a nanosleep(), since on recent
    kernels the latter doesn’t round up to 1/HZ anymore, at least when you
    have high-resolution timers[3] Then, if you can live
    without signal handler safety, drop the signal mask changing.
  • If you use something based on libatomic_ops and want to
    use it for shared memory IPC, then you have the option to move the
    lock variables into shared memory too. Note however, that this allows
    evil applications to lock up your process by taking the locks and
    never giving them up. (Which however is always a problem if not all
    atomic operations you need are available in hardware) So if you do
    this, make sure that only trusted processes can attach to your memory
    segment.
  • Alternatively, spend some time and investigate if it is possible
    to use futexes to sleep on the lock variables. This is not trivial
    though, since futexes right now expect the availability of an atomic
    increment operation. But it might be possible to emulate this good
    enough with the swap operation. There’s now even a FUTEX_LOCK_PI
    operation which would allow priority inheritance.
  • Alternatively, find a a way to allow user space disabling
    interrupts cheaply (requires kernel patching). Since enabling
    RT scheduling is a priviliged operation already (since you may easily
    lock up your machine with it), it might not be too problematic to
    extend the ability to disable interrupts to user space: it’s just yet
    another way to lock up your machine.
  • For the libatomic_ops based algorithm: if you’re lucky
    and defined a struct type for your atomic integer types, like the
    kernel does, or like I do in PulseAudio with pa_atomic_t,
    then you can stick the lock variable directly into your
    structure. This makes shared memory support transparent, and removes
    the one-lock-to-rule-them-all bottleneck completely. Of course, OTOH it
    increases the memory consumption a bit and increases cache pressure
    (though I’d assume that this is neglible).
  • For the libatomic_ops based algorithm: start sleeping for
    the time returned by clock_getres()
    (cache the result!). You cannot sleep shorter than that anyway.

Yepp, that’s as good as it gets. Unfortunately I cannot serve you
the optimal solution on a silver platter. I never actually did
development for ARMv5, this blog story just sums up my thoughts on all
the code I saw which emulates atomic ops on ARMv5. But maybe someone
who actually cares about atomic operations on ARM finds this
interesting and maybe invests some time to prepare patches for Qt,
glib, glibc — and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip, N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let’s not even think about CPUs which don’t even have an atomic swap!

[3] Which however you probably won’t, given that they’re only available on x86 on stable Linux kernels for now — but still, it’s cleaner.

ZeroConf in Ubuntu

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/zeroconf-ubuntu.html

(Disclaimer: I am not an Ubuntu user myself. But I happen to be the lead developer of Avahi.)

It came to my attention that Ubuntu is
discussing
whether
to
enable Zeroconf/Avahi in default installations. I would like to point out a few
things:

The “No Open Ports” policy: This policy (or at least the
way many people interprete it) seems to be thought out by someone who
doesn’t have much experience with TCP/IP networking. While it might make sense
to enforce this for application-level protocols like HTTP or FTP it doesn’t
make sense to apply it to transport-level protocols such as DHCP, DNS or in
this case mDNS (the underlying protocol of Zeroconf/Avahi/Bonjour):

  • Even the simplest DNS lookup requires the opening of an UDP port for a
    short period of time to be able to recieve the response. This is usually not
    visible to the administrator, because the time is too short to show up in
    netstat -uln, but nonetheless it is an open port. (UDP is not
    session-based (like TCP is) so incoming packets are accepted regardless where
    they come from)
  • DHCP clients listen on UDP port 68 during their entire lifetime (which in
    most cases is the same as the uptime of the machine). DHCP may be misused for
    much worse things than mDNS. Evildoers can forge DHCP packets to change IP
    addresses and routing of machines. This is definitely something that cannot be
    done with mDNS.

All three protocols, DNS, DHCP and mDNS, require a little bit of trust in
the local LAN. They (usually) don’t come with any sort of authentication and
they all are very easy to forge. The impact of forged mDNS packets is clearly
less dangerous than forged DHCP or DNS packets. Why? Because mDNS doesn’t
allow you to change the IP address or routing setup (which forged DHCP allows)
and because it cannot be used to spoof host names outside the .local
domain (which forged DNS allows).

Enforcing the “No Open ports” policy everywhere in Ubuntu would require that
both DNS and DHCP are disabled by default. However, as everybody probably
agrees, this would be ridiculous because a standard Ubuntu installation
couldn’t even be used for the most basic things like web browsing.

Oh, and BTW: DNS lookups are usually done by an NSS plugin which is loaded
by the libc into every process which uses gethostbyname() (the function for doing host name resolutions). So, in
effect every single process that uses this function has an open port for a
short time. And the DNS client code runs with user priviliges, so an exploit
really hurts. dhclient (the DHCP client) runs as root during the entire
runtime, so an exploit of it hurts even more. Avahi in contrast runs as its own user and
chroot()s
.

It is not my intention to force anyone to use my
software
. However, enforcing the “No Open Ports” policy unconditionally is
not a good idea. Currently Ubuntu makes exceptions for DHCP/DNS and so
it should for mDNS.

I do agree that publishing all kinds of local services with Avahi in a
default install is indeed problematic. However, if the “No Open Ports” policy
is enforced on all other application-level software, there shouldn’t be any
application that would want to register a service with Avahi.

Starting Avahi “on-demand” is not an option either, because it offers useful
services even when no local application is accessing is. Most notably this is
host name resolution for the local host name. (Hey, yeah, Zeroconf is more than
just stealing music.)

Remember: Zeroconf is about
Zero Configuration. Requiring the user to toggle some obscure
configuration option before he can use Zeroconf would make it a paradox.
Zeroconf was designed to make things “just work”. If it isn’t enabled by
default it is impossible to reach that goal.

Oh, and I enabled commmenting in my blog, if anyone wants to flame me on this…

Announcing SECCURE

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/seccure.html

Yesterday my brother released his second Free Software package, the SECCURE Elliptic Curve Crypto Utility for Reliable Encryption. (Recursive acronyms, yay!)

The seccure toolset implements a selection of asymmetric algorithms based on elliptic curve cryptography (ECC). In particular, it offers public key encryption / decryption and signature generation / verification. ECC schemes offer a much better key size to security ratio than classical systems (RSA, DSA). Keys are short enough to make direct specification of keys on the command line possible (sometimes this is more convenient than the management of PGP-like key rings). seccure builds on this feature and therefore is the tool of choice whenever lightweight asymmetric cryptography — independent of key servers, revocation certificates, the Web of Trust, or even configuration files — is required.

Anyone willing to work on the Debian RFP?

(The first Free Software package of him is ssss, an implementation of Shamir’s secret sharing scheme)