
Attending and Speaking at GNOME.Asia 2017 Summit

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/attending-and-speaking-at-gnomeasia-2017-summit.html

The GNOME.Asia Summit 2017 organizers invited me to speak at their
conference in Chongqing, China, and it was an excellent event! Here’s
my brief report:

Because we arrived one day early in Chongqing, my GNOME friends Sri,
Matthias, Jonathan, David and I started our journey with an excursion
to the Dazu Rock Carvings, a short bus trip from Chongqing, and an
excellent (and sometimes quite surprising) sight. I mean, where else
can you see a centuries-old Buddha with 1000+ hands holding a Nexus 5
cell phone? Here’s proof:

The GNOME.Asia schedule was excellent, with various good talks,
including some about Flatpak, Endless OS, rpm-ostree, Blockchains and
more. My own talk was about The Path to a Fully Protected GNOME
Desktop OS Image (Slides available here). In the hallway track I did
my best to advocate casync to whoever was willing to listen, and I
think enough were ;-). As we all know attending conferences is at
least as much about the hallway track as about the talks, and
GNOME.Asia was a fantastic way to meet the Chinese GNOME and Open
Source communities.

The day after the conference, the GNOME.Asia organizers arranged a
Chongqing day trip. A particular highlight was the ubiquitous hot pot,
sometimes with the local speciality: fresh pig brain.

Here are some random photos from the trip: sights, food, the social
event and more.

I’d like to thank the GNOME Foundation for funding my trip to
GNOME.Asia. And that’s all for now. But let me close with an old
Chinese wisdom:

   The Trials Of A Long Journey Always Feeling, Civilized Travel Pass Reputation.

All Systems Go! 2017 Videos Online!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2017-videos-online.html

For those living under a rock, the videos from everybody’s favourite
Userspace Linux Conference All Systems Go! 2017 are now available
online.

All videos

The videos for my own two talks are available here:

Synchronizing Images with casync (Slides)

Containers without a Container Manager, with systemd (Slides)

Of course, this is the stellar work of the CCC VOC folks, who are
hard to beat when it comes to videotaping community conferences.

IP Accounting and Access Lists with systemd

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html

TL;DR: systemd can now do per-service IP traffic accounting, as well
as access control for IP address ranges.

Last Friday we released systemd 235. I already blogged about its
Dynamic User feature in detail, but there’s one more piece of new
functionality that I think deserves special attention: IP accounting
and access control.

Before v235 systemd already provided per-unit resource management
hooks for a number of different kinds of resources: consumed CPU time,
disk I/O, memory usage and number of tasks. With v235 another kind of
resource can be controlled per-unit with systemd: network traffic
(specifically IP).

Three new unit file settings have been added in this context:

  1. IPAccounting= is a boolean setting. If enabled for a unit, all IP
    traffic sent and received by processes associated with it is counted
    both in terms of bytes and of packets.

  2. IPAddressDeny= takes an IP address prefix (that means: an IP
    address with a network mask). All traffic from and to this address will be
    prohibited for processes of the service.

  3. IPAddressAllow= is the matching positive counterpart to
    IPAddressDeny=. All traffic matching this IP address/network mask
    combination will be allowed, even if otherwise listed in
    IPAddressDeny=.
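
For illustration, here’s a minimal sketch of how the three settings
might look together in a unit’s [Service] section (the addresses are
made up): all traffic of the service is counted, the whole 10.0.0.0/8
range is blocked, and one host inside that range is exempted:

[Service]
IPAccounting=yes
IPAddressDeny=10.0.0.0/8
IPAddressAllow=10.1.2.3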

The three options are thin wrappers around kernel functionality
introduced with Linux 4.11: the control group eBPF hooks. The actual
work is done by the kernel, systemd just provides a number of new
settings to configure this facet of it. Note that cgroup/eBPF is
unrelated to classic Linux firewalling,
i.e. NetFilter/iptables. It’s up to you whether you use one or the
other, or both in combination (or of course neither).

IP Accounting

Let’s have a closer look at the IP accounting logic mentioned
above. Let’s write a simple unit
/etc/systemd/system/ip-accounting-test.service:

[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes

This simple unit invokes the
ping(8) command to
send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which
is the Google DNS server IP; we use it for testing here, since it’s
easy to remember, reachable everywhere and known to react to ICMP
pings; any other IP address responding to pings would be fine to use,
too). The IPAccounting= option is used to turn on IP accounting for
the unit.

Let’s start this service after writing the file. Let’s then have a
look at the status output of systemctl:

# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
 Main PID: 32152 (ping)
       IP: 168B in, 168B out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:05:47 sigma systemd[1]: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping[32152]: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms

This shows the ping command running — it’s currently at its second
ping cycle as we can see in the logs at the end of the output. More
interesting however is the IP: line further up showing the current
IP byte counters. It currently shows 168 bytes have been received, and
168 bytes have been sent. That the two counters are at the same value
is not surprising: ICMP ping requests and responses are supposed to
have the same size. Note that this line is shown only if
IPAccounting= is turned on for the service, as only then is this data
collected.

Let’s wait a bit, and invoke systemctl status again:

# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
 Main PID: 32152 (ping)
       IP: 22.2K in, 22.2K out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:10:07 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms

As we can see, after 269 pings the counters are much higher: at 22K.

Note that while systemctl status shows only the byte counters,
packet counters are kept as well. Use the low-level systemctl show
command to query the current raw values of the in and out packet and
byte counters:

# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449

Of course, the same information is also available via the D-Bus
APIs. If you want to process this data further consider talking proper
D-Bus, rather than scraping the output of systemctl show.
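
If you want to go the D-Bus route, a rough sketch with busctl might
look like this: first ask the manager for the unit’s object path, then
read one of the raw counters from it (this assumes the counters are
exposed, like the other cgroup properties, on the unit’s
org.freedesktop.systemd1.Service interface):

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
      org.freedesktop.systemd1.Manager GetUnit s ip-accounting-test.service
# busctl get-property org.freedesktop.systemd1 \
      /org/freedesktop/systemd1/unit/ip_2daccounting_2dtest_2eservice \
      org.freedesktop.systemd1.Service IPIngressBytes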

Now, let’s stop the service again:

# systemctl stop ip-accounting-test

When a service with such accounting turned on terminates, a log line
about all its consumed resources is written to the logs. Let’s check
with journalctl:

# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd[1]: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd[1]: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.5K IP traffic, sent 49.5K IP traffic

The last line shown is the interesting one, that shows the accounting
data. It’s actually a structured log message, and among its metadata
fields it contains the more comprehensive raw data:

# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
    PRIORITY=6
    _BOOT_ID=4c7e7adcba0c45b69d612857270716d3
    _MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
    _HOSTNAME=sigma
    SYSLOG_FACILITY=3
    SYSLOG_IDENTIFIER=systemd
    _UID=0
    _GID=0
    _TRANSPORT=journal
    _PID=1
    _COMM=systemd
    _EXE=/usr/lib/systemd/systemd
    _CAP_EFFECTIVE=3fffffffff
    _SYSTEMD_CGROUP=/init.scope
    _SYSTEMD_UNIT=init.scope
    _SYSTEMD_SLICE=-.slice
    CODE_FILE=../src/core/unit.c
    _CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    UNIT=ip-accounting-test.service
    CODE_LINE=2115
    CODE_FUNC=unit_log_resources
    MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
    INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
    IP_METRIC_INGRESS_BYTES=50880
    IP_METRIC_INGRESS_PACKETS=605
    IP_METRIC_EGRESS_BYTES=50880
    IP_METRIC_EGRESS_PACKETS=605
    MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
    _SOURCE_REALTIME_TIMESTAMP=1507565752649028

The interesting fields of this log message are of course
IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=,
IP_METRIC_EGRESS_BYTES=, IP_METRIC_EGRESS_PACKETS= that show the
consumed data.

The log message carries a message
ID

that may be used to quickly search for all such resource log messages
(ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for
messages of this ID with journalctl‘s -u switch to quickly find
out about the resource usage of any invocation of a specific
service. Let’s try:

# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic

Of course, the output above shows only one message at the moment,
since we started the service only once, but a new one will appear
every time you start and stop it again.

The IP accounting logic is also hooked up with
systemd-run,
which is useful for transiently running a command as a systemd service
with IP accounting turned on. Let’s try it:

# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K

This uses wget to download the PDF version of the 2nd day schedule of
everybody’s favorite Linux user-space conference All Systems Go! 2017
(BTW, have you already booked your ticket? We are very close to
selling out, be quick!). The IP traffic this command generated was
231K ingress and 4K egress. In the systemd-run command line two
parameters are important. First of all, we use -p IPAccounting=yes
to turn on IP accounting for the transient service (as above). And
secondly we use --wait to tell systemd-run to wait for the service
to exit. If --wait is used, systemd-run will also show you various
statistics about the service that just ran and terminated, including
the IP statistics you are seeing if IP accounting has been turned on.

It’s fun to combine this sort of IP accounting with interactive
transient units. Let’s try that:

# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B

This uses systemd-run‘s --pty switch (or short: -t), which opens
an interactive pseudo-TTY connection to the invoked service process,
which is a Bourne shell in this case. Doing this means we have a full,
comprehensive shell with job control and everything. Since the shell
is running as part of a service with IP accounting turned on, all IP
traffic we generate or receive will be accounted for. And as soon as
we exit the shell, we’ll see what it consumed. (For the sake of
brevity I actually didn’t paste the whole output above, but truncated
core parts. Try it out for yourself, if you want to see the output in
full.)

Sometimes it might make sense to turn on IP accounting for a unit that
is already running. For that, use systemctl set-property
foobar.service IPAccounting=yes, which will instantly turn on
accounting for it. Note that it won’t count retroactively though: only
the traffic sent/received after the point in time you turned it on
will be collected. You may turn off accounting for the unit with the
same command.

Of course, sometimes it’s interesting to collect IP accounting data
for all services, and turning on IPAccounting=yes in every single
unit is cumbersome. To deal with that there’s a global option
DefaultIPAccounting=
available which can be set in /etc/systemd/system.conf.
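
For instance, this excerpt would turn it on globally (system.conf
settings live in its [Manager] section):

# /etc/systemd/system.conf (excerpt)
[Manager]
DefaultIPAccounting=yes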

IP Access Lists

So much about IP accounting. Let’s now have a look at IP access
control with systemd 235. As mentioned above, the two new unit file
settings IPAddressAllow= and IPAddressDeny= may be used for that. They
operate in the following way:

  1. If the source address of an incoming packet or the destination
    address of an outgoing packet matches one of the IP addresses/network
    masks in the relevant unit’s IPAddressAllow= setting then it will be
    allowed to go through.

  2. Otherwise, if a packet matches an IPAddressDeny= entry configured
    for the service it is dropped.

  3. If the packet matches neither of the above it is allowed to go
    through.

Or in other words, IPAddressDeny= implements a blacklist, but
IPAddressAllow= takes precedence.

Let’s try that out. Let’s modify our last example above in order to
get a transient service running an interactive shell which has such an
access list set:

# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms

--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit

The access list we set up uses IPAddressDeny=any in order to define
an IP white-list: all traffic will be prohibited for the session,
except for what is explicitly white-listed. In this command line, we
white-listed two address prefixes: 8.8.8.8 (with no explicit network
mask, which means the mask with all bits turned on is implied,
i.e. /32), and 127.0.0.0/8. Thus, the service can communicate with
Google’s DNS server and everything on the local loop-back, but nothing
else. The commands run in this interactive shell show this: first we
try pinging 8.8.8.8, which happily responds. Then we try to ping
8.8.4.4 (that’s Google’s other DNS server, but excluded from this
white-list), and as we see it is immediately refused with an Operation
not permitted error. As a last step we ping 127.0.0.2 (which is on the
local loop-back), and we see it works fine again, as expected.

In the example above we used IPAddressDeny=any. The any
identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it’s a
shortcut for everything, on both IPv4 and IPv6. A number of other
such shortcuts exist. For example, instead of spelling out
127.0.0.0/8 we could also have used the more descriptive shortcut
localhost which is expanded to 127.0.0.0/8 ::1/128, i.e. everything
on the local loopback device, on both IPv4 and IPv6.

Being able to configure IP access lists individually for each unit is
pretty nice already. However, typically one wants to configure this
comprehensively, not just for individual units, but for a set of units
in one go or even the system as a whole. In systemd, that’s possible
by making use of .slice units (for those who don’t know systemd that
well, slice units are a concept for organizing services in a
hierarchical tree for the purpose of resource management): the IP
access list in effect for a unit is the combination of the individual
IP access lists configured for the unit itself and those of all slice
units it is contained in.

By default, system services are assigned to system.slice, which in
turn is a child of the root slice -.slice. Either of these two slice
units is hence suitable for locking down all system services at
once. If an access list is configured on system.slice it will only
apply to system services; however, if configured on -.slice it will
apply to all user processes of the system, including all user session
processes (which are by default assigned to user.slice, a child of
-.slice) in addition to the system services.

Let’s make use of this:

# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8

The two commands above are a very powerful way to first turn off all
IP communication for all system services (with the exception of
loop-back traffic), followed by an explicit white-listing of
10.0.0.0/8 (which could refer to the local company network, you get
the idea) but only for the Apache service.
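
If you’d rather ship this as plain configuration files instead of
invoking systemctl set-property, drop-ins along the following lines
should be roughly equivalent (file names are only examples; run
systemctl daemon-reload afterwards):

# /etc/systemd/system/system.slice.d/50-ip-lockdown.conf
[Slice]
IPAddressDeny=any
IPAddressAllow=localhost

# /etc/systemd/system/apache.service.d/50-ip-allow.conf
[Service]
IPAddressAllow=10.0.0.0/8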

Use-cases

After playing around a bit with this, let’s talk about use-cases. Here
are a few ideas:

  1. The IP access list logic can in many ways provide a more modern
    replacement for the venerable TCP Wrapper, but unlike it, it applies
    to all IP sockets of a service unconditionally, and requires no
    explicit support in any way in the service’s code: no patching
    required. On the other hand, TCP wrappers have a number of features
    this scheme cannot cover; most importantly, systemd’s IP access
    lists operate solely on the level of IP addresses and network masks,
    and there is no way to configure access by DNS name (though quite
    frankly, that is a very dubious feature anyway, as doing networking,
    unsecured networking even, in order to restrict networking sounds
    quite questionable, at least to me).

  2. It can also replace (or augment) some facets of IP firewalling,
    i.e. Linux NetFilter/iptables. Right now, systemd’s access lists are
    of course a lot more minimal than NetFilter, but they have one major
    benefit: they understand the service concept, and thus are a lot more
    context-aware than NetFilter. Classic firewalls, such as NetFilter,
    derive most service context from the IP port number alone, but we live
    in a world where IP port numbers are a lot more dynamic than they used
    to be. As one example, a BitTorrent client or server may use any IP
    port it likes for its file transfer, and writing IP firewalling rules
    matching that precisely is hence hard. With the systemd IP access list
    implementing this is easy: just set the list for your BitTorrent
    service unit, and all is good.

    Let me stress though that you should be careful when comparing
    NetFilter with systemd’s IP address list logic; it’s really like
    comparing apples and oranges: to start with, the IP address list
    logic has a clearly local focus, it only knows what a local
    service is and manages access of it. NetFilter on the other hand
    may run on border gateways, at a point where the traffic flowing
    through is pure IP, carrying no information about a systemd unit
    concept or anything like that.

  3. It’s a simple way to lock down distribution/vendor supplied system
    services by default. For example, if you ship a service that you know
    never needs to access the network, then simply set IPAddressDeny=any
    (possibly combined with IPAddressAllow=localhost) for it, and it
    will live in a very tight networking sand-box it cannot escape
    from. systemd itself makes use of this for a number of its services by
    default now. For example, the logging service
    systemd-journald.service, the login manager systemd-logind or the
    core-dump processing unit systemd-coredump@.service all have such a
    rule set out-of-the-box, because we know that none of these
    services should be able to access the network, under any
    circumstances.

  4. Because the IP access list logic can be combined with transient
    units, it can be used to quickly and effectively sandbox arbitrary
    commands, and even include them in shell pipelines and such. For
    example, let’s say we don’t trust our curl implementation (maybe it
    got modified locally by a hacker, and phones home?), but want to use
    it anyway to download the slides of my most recent casync talk in
    order to print them, but want to make sure it doesn’t connect
    anywhere except where we tell it to (and to make this even more fun,
    let’s minimize privileges further, by setting DynamicUser=yes):

    # systemd-resolve 0pointer.de
    0pointer.de: 85.214.157.71
                 2a01:238:43ed:c300:10c3:bcf3:3266:da74
    -- Information acquired via protocol DNS in 2.8ms.
    -- Data is authenticated: no
    # systemd-run --pipe -p IPAddressDeny=any \
                         -p IPAddressAllow=85.214.157.71 \
                         -p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
                         -p DynamicUser=yes \
                         curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
    

So much about use-cases. This is by no means a comprehensive list of
what you can do with it; after all, both IP accounting and IP access
lists are very generic concepts. But I do hope the above inspires your
imagination.

What does that mean for packagers?

IP accounting and IP access control are primarily concepts for the
local administrator. However, as suggested above, it’s a very good
idea to ship services that by design have no network-facing
functionality with an access list of IPAddressDeny=any (and possibly
IPAddressAllow=localhost), in order to improve the out-of-the-box
security of our systems.
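
In unit file terms that boils down to something like this hypothetical
excerpt for a daemon that only ever talks to the local host:

[Service]
ExecStart=/usr/bin/my-local-daemon
IPAddressDeny=any
IPAddressAllow=localhost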

An option for security-minded distributions might be a more radical
approach: ship the system with -.slice or system.slice configured
to IPAddressDeny=any by default, and ask the administrator to punch
holes into that for each network-facing service with systemctl
set-property … IPAddressAllow=…. But of course, that’s only an option
for distributions willing to break compatibility with what was
before.

Notes

A couple of additional notes:

  1. IP accounting and access lists may be mixed with socket
    activation. In this case, it’s a good idea to configure access lists
    and accounting for both the socket unit that activates and the service
    unit that is activated, as both units maintain fully separate
    settings. Note that IP accounting and access lists configured on the
    socket unit applies to all sockets created on behalf of that unit, and
    even if these sockets are passed on to the activated services, they
    will still remain in effect and belong to the socket unit. This also
    means that IP traffic done on such sockets will be accounted to the
    socket unit, not the service unit. The fact that IP access lists are
    maintained separately for the kernel sockets created on behalf of the
    socket unit and for the kernel sockets created by the service code
    itself enables some interesting uses. For example, it’s possible to
    set a relatively open access list on the socket unit, but a very
    restrictive access list on the service unit, thus making the sockets
    configured through the socket unit the only way in and out of the
    service (see the sketch after these notes).

  2. systemd’s IP accounting and access lists apply to IP sockets only,
    not to sockets of any other address families. That also means that
    AF_PACKET (i.e. raw) sockets are not covered. This means it’s a good
    idea to combine IP access lists with
    RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 in order to lock
    this down.

  3. You may wonder if the per-unit resource log message and
    systemd-run --wait may also show you details about other types of
    resources consumed by a service. The answer is yes: if you turn on
    CPUAccounting= for a service, you’ll also see a summary of consumed
    CPU time in the log message and the command output. And we are
    planning to hook up IOAccounting= the same way too, soon.

  4. Note that IP accounting and access lists aren’t entirely
    free. systemd inserts an eBPF program into the IP pipeline to make
    this functionality work. However, eBPF execution has already been
    optimized for speed in recent kernel versions, and given that it is
    currently the focus of interest for many, I’d expect it to be
    optimized even further, so that the cost of enabling these features
    will be negligible, if it isn’t already.

  5. IP accounting is currently not recursive. That means you cannot use
    a slice unit to join the accounting of multiple units into one. This
    is something we definitely want to add, but requires some more kernel
    work first.

  6. You might wonder how the PrivateNetwork= setting relates to
    IPAddressDeny=any. Superficially they have similar effects: they
    make the network unavailable to services. However, looking more
    closely there are a number of differences. PrivateNetwork= is
    implemented using Linux network name-spaces. As such it entirely
    detaches all networking of a service from the host, including
    non-IP networking. It does so by creating a private little
    environment the service lives in, where communication with itself
    is still allowed though. In addition, using the JoinsNamespaceOf=
    dependency additional services may be added to the same
    environment, thus permitting communication with each other but not
    with anything outside of this group. IPAddressAllow= and
    IPAddressDeny= are much less invasive. First of all they apply to
    IP networking only, and can match against specific IP addresses. A
    service running with PrivateNetwork= turned off but
    IPAddressDeny=any turned on may enumerate the network interfaces
    and the IP addresses configured on them, even though it cannot
    actually do any IP communication. On the other hand, if you turn on
    PrivateNetwork= all network interfaces besides lo disappear. Long
    story short: depending on your use-case one, the other, both or
    neither might be suitable for sand-boxing of your service. If
    possible I’d always turn on both, for best security, and that’s
    what we do for all of systemd’s own long-running services.
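
To illustrate notes 1 and 2, here’s a rough sketch of a
socket-activated service (all names and the port are made up) where
the sockets passed in by the socket unit are the only way in and out:
the socket unit carries the relatively open access list, while the
service unit itself is locked down and additionally restricted to the
address families it actually needs:

# /etc/systemd/system/frontend.socket
[Socket]
ListenStream=6666
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8

# /etc/systemd/system/frontend.service
[Service]
ExecStart=/usr/bin/frontend-daemon
IPAddressDeny=any
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6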

And that’s all for now. Have fun with per-unit IP accounting and
access lists!

Dynamic Users with systemd

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/dynamic-users-with-systemd.html

TL;DR: you may now configure systemd to dynamically allocate a UNIX
user ID for service processes when it starts them and release it when
it stops them. It’s pretty secure, mixes well with transient services,
socket activated services and service templating.

Today we released systemd 235. Among other improvements this greatly
extends the dynamic user logic of systemd. Dynamic users are a
powerful but little-known concept, supported in its basic form since
systemd 232. With this blog story I hope to make it a bit better
known.

The UNIX user concept is the most basic and well-understood security
concept in POSIX operating systems. It is UNIX/POSIX’ primary security
concept, the one everybody can agree on, and most security concepts
that came after it (such as process capabilities, SELinux and other
MACs, user name-spaces, …) in some form or another build on it, extend
it or at least interface with it. If you build a Linux kernel with all
security features turned off, the user concept is pretty much the one
you’ll still retain.

Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today’s UNIX systems
don’t really use the user concept like that anymore though. Most of
today’s systems probably have only one actual human user (or even
less!), but their user databases (/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Even though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.

The people behind the Android OS realized the relevance of the UNIX
user concept as the primary security concept on UNIX, and took its use
even further: on Android not only do system services benefit from the
UNIX user concept, but each UI app gets its own, individual user
identity too — thus neatly separating app resources from each other,
and protecting app processes from each other, too.

Back in the more traditional Linux world things are a bit less
advanced in this area. Even though users are the quintessential UNIX
security concept, allocation and management of system users is still a
pretty limited, raw and static affair. In most cases, RPM or DEB
package installation scripts allocate a fixed number of (usually one)
system users when you install the package of a service that wants to
take benefit of the user concept, and from that point on the system
user remains allocated on the system and is never deallocated again,
even if the package is later removed again. Most Linux distributions
limit the number of system users to 1000 (which isn’t particularly
many). Allocating a system user is hence expensive: the number of
available users is limited, and there’s no defined way to dispose of
them after use. If you make use of system users too liberally, you are
very likely to run out of them sooner rather than later.

You may wonder why system users are generally not deallocated when the
package that registered them is uninstalled from a system (at least on
most distributions). The reason for that is one relevant property of
the user concept (you might even want to call this a design flaw):
user IDs are sticky to files (and other objects such as IPC
objects). If a service running as a specific system user creates a
file at some location, and is then terminated and its package and user
removed, then the created file still belongs to the numeric ID (“UID”)
the system user originally got assigned. When the next system user is
allocated and — due to ID recycling — happens to get assigned the same
numeric ID, then it will also gain access to the file, and that’s
generally considered a problem, given that the file belonged to a
potentially very different service once upon a time, and likely should
not be readable or changeable by anything coming after
it. Distributions hence tend to avoid UID recycling which means system
users remain registered forever on a system after they have been
allocated once.

The above is a description of the status quo ante. Let’s now focus on
what systemd’s dynamic user concept brings to the table, to improve
the situation.

Introducing Dynamic Users

With systemd dynamic users we hope to make it easier and cheaper to
allocate system users on-the-fly, thus substantially increasing the
possible uses of this core UNIX security concept.

If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the DynamicUser= option in its [Service]
section to yes. If you do, a system user is dynamically allocated the
instant the service binary is invoked, and released again when the
service terminates. The user is automatically allocated from the UID
range 61184–65519, by looking for a so far unused UID.

Now you may wonder, how does this concept deal with the sticky user
issue discussed above? In order to counter the problem, two strategies
easily come to mind:

  1. Prohibit the service from creating any files/directories or IPC objects

  2. Automatically remove the files/directories or IPC objects the
    service created when it shuts down.

In systemd we implemented both strategies, but for different parts of
the execution environment. Specifically:

  1. Setting DynamicUser=yes implies ProtectSystem=strict and
    ProtectHome=read-only. These
    sand-boxing options turn off write access to pretty much the whole OS
    directory tree, with a few relevant exceptions, such as the API file
    systems /proc, /sys and so on, as well as /tmp and
    /var/tmp. (BTW: setting these two options on your regular services
    that do not use DynamicUser= is a good idea too, as it drastically
    reduces the exposure of the system to exploited services.)

  2. Setting DynamicUser=yes implies PrivateTmp=yes. This
    option sets up /tmp and /var/tmp for the service in a way that it
    gets its own, disconnected version of these directories, that are not
    shared by other services, and whose life-cycle is bound to the
    service’s own life-cycle. Thus if the service goes down, the user is
    removed and all its temporary files and directories with it. (BTW: as
    above, consider setting this option for your regular services that do
    not use DynamicUser= too, it’s a great way to lock things down
    security-wise.)

  3. Setting DynamicUser=yes implies RemoveIPC=yes. This
    option ensures that when the service goes down all SysV and POSIX IPC
    objects (shared memory, message queues, semaphores) owned by the
    service’s user are removed. Thus, the life-cycle of the IPC objects is
    bound to the life-cycle of the dynamic user and service, too. (BTW:
    yes, here too, consider using this in your regular services, too!)
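
As an aside, here’s what applying those same four settings to a
regular, static-user service might look like, as a hardening drop-in
(the service name is hypothetical):

# /etc/systemd/system/mydaemon.service.d/hardening.conf
[Service]
ProtectSystem=strict
ProtectHome=read-only
PrivateTmp=yes
RemoveIPC=yes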

With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/tmp and /var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with
effectively.

The RuntimeDirectory= option may be used to open up the sandbox a bit
to external programs. If you set it to a directory name of your
choice, it will be created below /run when the service is started, and
removed in its entirety when it is terminated. The ownership of the
directory is assigned to the service’s dynamic user. This way, a
dynamic user service can expose API interfaces (AF_UNIX sockets, …) to
other services at a well-defined place and again bind its life-cycle
to the service’s own run-time. Example: set RuntimeDirectory=foobar in
your service, and watch how a directory /run/foobar appears at the
moment you start the service, and disappears the moment you stop it
again. (BTW: much like the other settings discussed above,
RuntimeDirectory= may be used outside of the DynamicUser= context too,
and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)
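
Here’s a quick way to watch that behavior with a transient unit,
assuming RuntimeDirectory= can be set on transient units just like
StateDirectory= is in the examples further down:

# systemd-run -p DynamicUser=yes -p RuntimeDirectory=foobar -t /bin/sh
sh-4.4$ ls -ld /run/foobar        # exists, owned by the dynamic user
sh-4.4$ exit
# ls -ld /run/foobar              # gone again, removed with the service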

Persistent Data

Of course, a service running in such an environment (although already
very useful for many cases!) has a major limitation: it cannot leave
persistent data around that it can reuse on a later run. As pretty
much the whole OS directory tree is read-only to it, there’s simply no
place it could put data that survives from one service invocation to
the next.

With systemd 235 this limitation is removed: there are now three new
settings: StateDirectory=, LogsDirectory= and CacheDirectory=. In
many ways they operate like
RuntimeDirectory=, but create sub-directories below /var/lib,
/var/log and /var/cache, respectively. There’s one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.
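
In a unit file this could look roughly like this (names hypothetical);
starting the service would create /var/lib/mydaemon,
/var/cache/mydaemon and /var/log/mydaemon, owned by the service’s
dynamic user:

[Service]
ExecStart=/usr/bin/mydaemond
DynamicUser=yes
StateDirectory=mydaemon
CacheDirectory=mydaemon
LogsDirectory=mydaemon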

Of course, the obvious question to ask now is: how do these three
settings deal with the sticky file ownership problem?

For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it’s even worse
than just getting access, due to the existence of setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(0700 and root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.

How is that applied to dynamic user services? Let’s say
StateDirectory=foobar is set for a service that has DynamicUser=
turned off. The instant the service is started, /var/lib/foobar is
created as state directory, owned by the service’s user and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory /var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service’s
dynamic user). The /var/lib/private directory is created as boundary
directory: it’s owned by root:root, and has a restrictive access
mode of 0700. Both the symlink and the service’s state directory will
survive the service’s life-cycle; the state directory continues to be
owned by the now disposed dynamic UID, but it is protected from other
host users (and other services which might get the same dynamic UID
assigned due to UID recycling) by the boundary directory.

The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo /tmp and
/var/tmp as mentioned above), except for /var/lib/private, which
is over-mounted with a read-only tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this tmpfs file system instance another mount is
placed: a bind mount to the host’s real /var/lib/private/foobar
directory, onto the same name. Putting this together, this means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the /var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink /var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if DynamicUser= was not used. Long
story short: for the daemon and from the view from the host the
indirection through /var/lib/private is mostly transparent.

This logic of course raises another question: what happens to the
state directory if a dynamic user service is started with a state
directory configured, gets UID X assigned on this first invocation,
then terminates and is restarted and now gets UID Y assigned on the
second invocation, with X ≠ Y? On the second invocation the directory
— and all the files and directories below it — will still be owned by
the original UID X so how could the second instance running as Y
access it? Our way out is simple: systemd will recursively change the
ownership of the directory and everything contained within it to UID Y
before invoking the service’s executable.

Of course, such recursive ownership changing (chown()ing) of whole
directory trees can become expensive (though according to my
experiences, IRL and for most services it’s much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service’s
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it’s preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).

Note that CacheDirectory= and LogsDirectory= work very similarly to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and /var/log directories, and their boundary
directories hence are /var/cache/private and /var/log/private,
respectively.

Examples

So, after all this introduction, let’s have a look how this all can be
put together. Here’s a trivial example:

# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user

In this example, we create a unit file with DynamicUser= turned on,
start it, check if it’s running correctly, have a look at the service
process’ user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn’t configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.

That’s already pretty cool. Let’s step it up a notch, by doing the
same in an interactive transient service (for those who don’t know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the systemd-run
command from the shell. Think: run a service without having to write a
unit file first):

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello

The above invokes an interactive shell as transient service
run-u15750.service (systemd-run picked that name automatically,
since we didn’t specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as /var/lib/wuff. In the interactive shell
running inside the service, the ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That’s because the user
ceased to exist at this point, and “ls” shows the raw UID for files
owned by users that don’t exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, while when we
looked from inside we saw it being world readable.

Now, let’s see how things look if we start another transient service,
reusing the state directory from the first invocation:

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit

Here, systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
recursive chown()ing was required.

And that’s the end of our example, which hopefully illustrated a bit
how this concept and implementation works.

Use-cases

Now that we had a look at how to enable this logic for a unit and how
it is implemented, let’s discuss where this actually could be useful
in real life.

  • One major benefit of dynamic user IDs is that running a
    privilege-separated service leaves no artifacts in the system. A
    system user is allocated and made use of, but it is discarded
    automatically in a safe and secure way after use, in a fashion that is
    safe for later recycling. Thus, quickly invoking a short-lived service
    for processing some job can be protected properly through a user ID
    without having to pre-allocate it and without this draining the
    available UID pool any longer than necessary.

  • In many cases, starting a service no longer requires
    package-specific preparation. Or in other words, quite often
    useradd/mkdir/chown/chmod invocations in “post-inst” package
    scripts, as well as sysusers.d and tmpfiles.d drop-ins become
    unnecessary, as the DynamicUser= and
    StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the
    necessary work automatically, on-demand and with a well-defined
    life-cycle.

  • By combining dynamic user IDs with the transient unit concept, new
    creative ways of sand-boxing are made available. For example, let’s say
    you don’t trust the correct implementation of the sort command. You
    can now lock it into a simple, robust, dynamic UID sandbox with a
    simple systemd-run and still integrate it into a shell pipeline like
    any other command. Here’s an example, showcasing a shell pipeline
    whose middle element runs as a dynamically on-the-fly allocated UID,
    that is released when the pipeline ends.

    # cat some-file.txt | systemd-run --pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
    
  • By combining dynamic user IDs with the systemd templating logic it
    is now possible to do much more fine-grained and fully automatic UID
    management. For example, let’s say you have a template unit file
    /etc/systemd/system/foobard@.service:

    [Service]
    ExecStart=/usr/bin/myfoobarserviced
    DynamicUser=1
    StateDirectory=foobar/%i
    

    Now, let’s say you want to start one instance of this service for
    each of your customers. All you need to do now for that is:

    # systemctl enable foobard@customerxyz.service --now
    

    And you are done. (Invoke this as many times as you like, each time
    replacing customerxyz by some customer identifier, you get the
    idea.)

  • By combining dynamic user IDs with socket activation you may easily
    implement a system where each incoming connection is served by a
    process instance running as a different, fresh, newly allocated UID
    within its own sandbox. Here’s an example waldo.socket:

    [Socket]
    ListenStream=2048
    Accept=yes
    

    With a matching waldo@.service:

    [Service]
    ExecStart=-/usr/bin/myservicebinary
    DynamicUser=yes
    

    With the two unit files above, systemd will listen on TCP/IP port
    2048, and for each incoming connection invoke a fresh instance of
    waldo@.service, each time utilizing a different, new,
    dynamically allocated UID, neatly isolated from any other
    instance.

  • Dynamic user IDs combine very well with state-less systems,
    i.e. systems that come up with an unpopulated /etc and /var. A
    service using dynamic user IDs and the StateDirectory=,
    CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts
    will implicitly allocate the users and directories it needs for
    running, right at the moment where it needs it.

Dynamic users are a very generic concept, hence a multitude of other
uses are thinkable; the list above is just supposed to trigger your
imagination.

What does this mean for you as a packager?

I am pretty sure that a large number of services shipped with today’s
distributions could benefit from using DynamicUser= and
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any sysusers.d
and tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
DynamicUser= and StateDirectory= (and friends) cannot or should
not be used. To name a few:

  1. Services that need to write to files outside of /run/<package>,
    /var/lib/<package>, /var/cache/<package>, /var/log/<package>,
    /var/tmp, /tmp, /dev/shm are generally incompatible with this
    scheme. This rules out daemons that upgrade the system as one example,
    as that involves writing to /usr.

  2. Services that maintain a herd of processes with different user
    IDs. Some SMTP services are like this. If your service has such a
    super-server design, UID management needs to be done by the
    super-server itself, which rules out systemd doing its dynamic UID
    magic for it.

  3. Services which run as root (obviously…) or are otherwise
    privileged.

  4. Services that need to live in the same mount name-space as the host
    system (for example, because they want to establish mount points
    visible system-wide). As mentioned DynamicUser= implies
    ProtectSystem=, PrivateTmp= and related options, which all require
    the service to run in its own mount name-space.

  5. Your focus is older distributions, i.e. distributions that do not
    have systemd 232 (for DynamicUser=) or systemd 235 (for
    StateDirectory= and friends) yet.

  6. If your distribution’s packaging guides don’t allow it. Consult
    your packaging guides, and possibly start a discussion on your
    distribution’s mailing list about this.

Notes

A couple of additional, random notes about the implementation and use
of these features:

  1. Do note that allocating or deallocating a dynamic user leaves
    /etc/passwd untouched. A dynamic user is added into the user
    database through the glibc NSS module nss-systemd, and this
    information never hits the disk.

  2. On traditional UNIX systems it was the job of the daemon process
    itself to drop privileges, while the DynamicUser= concept is
    designed around the service manager (i.e. systemd) being responsible
    for that. That said, since v235 there’s a way to marry DynamicUser=
    and such services which want to drop privileges on their own. For
    that, turn on DynamicUser= and set
    User=
    to the user name the service wants to setuid() to. This has the
    effect that systemd will allocate the dynamic user under the specified
    name when the service is started. Then, prefix the command line you
    specify in
    ExecStart=
    with a single ! character. If you do, the user is allocated for the
    service, but the daemon binary is invoked as root instead of the
    allocated user, under the assumption that the daemon changes its UID
    on its own the right way. Note that after registration the user will
    show up instantly in the user database, and is hence resolvable like
    any other by the daemon process. Example:
    ExecStart=!/usr/bin/mydaemond (a fuller unit sketch illustrating
    this combination follows after these notes).

  3. You may wonder why systemd uses the UID range 61184–65519 for its
    dynamic user allocations (side note: in hexadecimal this reads as
    0xEF00–0xFFEF). That’s because distributions (specifically Fedora)
    tend to allocate regular users from below the 60000 range, and we
    don’t want to step into that. We also want to stay away from 65535 and
    a bit around it, as some of these UIDs have special meanings (65535 is
    often used as special value for “invalid” or “no” UID, as it is
    identical to the 16bit value -1; 65534 is generally mapped to the
    “nobody” user, and is where some kernel subsystems map unmappable
    UIDs). Finally, we want to stay within the 16bit range. In a user
    name-spacing world each container tends to have much less than the full
    32bit UID range available that Linux kernels theoretically
    provide. Everybody apparently can agree that a container should at
    least cover the 16bit range though — if only to include a nobody
    user. (And quite frankly, I am pretty sure assigning 64K UIDs per
    container is nicely systematic, as the higher 16bit of the 32bit
    UID values this way become a container ID, while the lower 16bit
    become the logical UID within each container, if you still follow what
    I am babbling here…). And before you ask: no this range cannot be
    changed right now, it’s compiled in. We might change that eventually
    however.

  4. You might wonder what happens if you already used UIDs from the
    61184–65519 range on your system for other purposes. systemd should
    handle that mostly fine, as long as that usage is properly registered
    in the user database: when allocating a dynamic user we pick a UID,
    see if it is currently used somehow, and if yes pick a different one,
    until we find a free one. Whether a UID is used right now or not is
    checked through NSS calls. Moreover the IPC object lists are checked to
    see if there are any objects owned by the UID we are about to
    pick. This means systemd will avoid using UIDs you have assigned
    otherwise. Note however that this of course makes the pool of
    available UIDs smaller, and in the worst cases this means that
    allocating a dynamic user might fail because there simply are no
    unused UIDs in the range.

  5. If not specified otherwise the name for a dynamically allocated
    user is derived from the service name. Not everything that’s valid in
    a service name is valid in a user-name however, and in some cases a
    randomized name is used instead to deal with this. Often it makes
    sense to pick the user names to register explicitly. For that use
    User= and choose whatever you like.

  6. If you pick a user name with User= and combine it with
    DynamicUser= and the user already exists statically it will be used
    for the service and the dynamic user logic is automatically
    disabled. This permits automatic up- and downgrades between static and
    dynamic UIDs. For example, it provides a nice way to move a system
    from static to dynamic UIDs in a compatible way: as long as you select
    the same User= value before and after switching DynamicUser= on,
    the service will continue to use the statically allocated user if it
    exists, and only operates in the dynamic mode if it does not. This is
    useful for other cases as well, for example to adapt a service that
    normally would use a dynamic user to concepts that require statically
    assigned UIDs, for example to marry classic UID-based file system
    quota with such services.

  7. systemd always allocates a pair of dynamic UID and GID at the same
    time, with the same numeric ID.

  8. If the Linux kernel had a “shiftfs” or similar functionality,
    i.e. a way to mount an existing directory to a second place, but map
    the exposed UIDs/GIDs in some way configurable at mount time, this
    would be excellent for the implementation of StateDirectory= in
    conjunction with DynamicUser=. It would make the recursive
    chown()ing step unnecessary, as the host version of the state
    directory could simply be mounted into the service’s mount
    name-space, with a shift applied that maps the directory’s owner to the
    service’s UID/GID. But I don’t have high hopes in this regard, as all
    work being done in this area appears to be bound to user name-spacing
    — which is a concept not used here (and I guess one could say user
    name-spacing is probably more a source of problems than a solution to
    one, but you are welcome to disagree on that).
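
To make note 2 above more concrete, here’s a minimal sketch of a unit
for a daemon that insists on dropping privileges on its own (the
binary and user name are placeholders):

[Service]
DynamicUser=yes
User=quux
ExecStart=!/usr/bin/mydaemond

systemd allocates the dynamic “quux” user before the daemon starts,
the daemon binary itself is invoked as root, and it is expected to
setuid() to “quux” on its own. While the service runs the allocated
user resolves like any other, for example via getent passwd quux.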

And that’s all for now. Enjoy your dynamic users!

All Systems Go! 2017 Schedule Published

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2017-schedule-published.html

The All Systems Go! 2017 schedule has been published!

I am happy to announce that we have published the All Systems Go! 2017 schedule!
We are very happy with the large number and the quality of the
submissions we got, and the resulting schedule is exceptionally
strong.

Without further ado:

Here’s the schedule for the first day (Saturday, 21st of October).

And here’s the schedule for the second day (Sunday, 22nd of October).

Here are a couple of keywords from the topics of the talks:
1password, azure, bluetooth, build systems,
casync, cgroups, cilium, cockpit, containers,
ebpf, flatpak, habitat, IoT, kubernetes,
landlock, meson, OCI, rkt, rust, secureboot,
skydive, systemd, testing, tor, varlink,
virtualization, wifi, and more.

Our speakers are from all across the industry: Chef, CoreOS, Covalent,
Facebook, Google, Intel, Kinvolk, Microsoft, Mozilla, Pantheon,
Pengutronix, Red Hat, SUSE and more.

For further information about All Systems Go! visit our conference web site.

Make sure to buy your ticket for All Systems Go! 2017 now! A limited
number of tickets are left at this point, so make sure you get yours
before we are all sold out! Find all details here.

See you in Berlin!

All Systems Go! 2017 CfP Closes Soon!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2017-cfp-closes-soon.html

The All Systems Go! 2017 Call for Participation is Closing on September 3rd!

Please make sure to get your presentation proposals for All Systems Go! 2017 in now! The CfP closes on Sunday!

In case you haven’t heard about All Systems Go! yet, here’s a quick reminder what kind of conference it is, and why you should attend and speak there:

All Systems Go! is an Open Source community conference focused
on the projects and technologies at the foundation of modern Linux
systems — specifically low-level user-space technologies. Its goal is
to provide a friendly and collaborative gathering place for
individuals and communities working to push these technologies
forward. All Systems Go! 2017 takes place in Berlin,
Germany
on October 21st+22nd. All Systems Go! is a
2-day event with 2-3 talks happening in parallel. Full presentation
slots are 30-45 minutes in length and lightning talk slots are 5-10
minutes.

In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things,
talks about kernel projects are welcome too, as long as they have a
clear and direct relevance for user-space.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year, in favour of All
Systems Go!. All Systems Go! welcomes all projects that
contribute to Linux user space, which, of course, includes
systemd. Thus, anything you think was appropriate for submission to
systemd.conf is also fitting for All Systems Go!

All Systems Go! 2017 Speakers

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2017-speakers.html

The All Systems Go! 2017 Headline Speakers Announced!

Don’t forget to send in your submissions to the All Systems Go! 2017 CfP! Proposals are accepted until September 3rd!

A couple of headline speakers have been announced now:

  • Alban Crequy (Kinvolk)
  • Brian “Redbeard” Harrington (CoreOS)
  • Gianluca Borello (Sysdig)
  • Jon Boulle (NStack/CoreOS)
  • Martin Pitt (Debian)
  • Thomas Graf (covalent.io/Cilium)
  • Vincent Batts (Red Hat/OCI)
  • (and yours truly)

These folks will also review your submissions as part of the papers committee!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about
casync
I realized I never blogged about the
mkosi tool that combines nicely
with it. mkosi has been around for a while already, and it’s time to
make it a bit better known. mkosi stands for Make Operating System
Image
, and is a tool for precisely that: generating an OS tree or
image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what precisely sets mkosi apart?
mkosi is definitely a tool with a focus on developers’ needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions
Specification,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot
, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off of it, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)
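
For the fourth case, a service unit consuming such an image might
contain something along these lines (the paths are placeholders):

[Service]
RootImage=/var/lib/machines/foobar.raw
ExecStart=/usr/bin/mydaemond

systemd dissects the GPT image, mounts its root partition for the
service and, as just mentioned, sets up dm-verity automatically if
the image carries the necessary data.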

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process is used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.
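
For testing purposes the key pair doesn’t have to come from anywhere
special; a throw-away self-signed pair generated with plain openssl is
good enough to exercise the signing logic (this is just an
illustration, not part of mkosi itself; the file names follow item 8
above):

# openssl req -new -x509 -newkey rsa:2048 -nodes -days 365 \
      -subj "/CN=mkosi test key/" \
      -keyout mkosi.secure-boot.key -out mkosi.secure-boot.crt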

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar as the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The package will contain the
SSH client as well as everybody’s favorite editor.

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree:
mkosi.default
and
mkosi.build. This
way, any developer who wants to quickly test something with current
systemd git, or wants to prepare a patch based on it and test it can
check out the systemd repository and simply run mkosi in it and a
few minutes later he has a bootable image he can test in
systemd-nspawn or KVM. casync has similar files:
mkosi.default,
mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed along-side the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw command can download
    these files and the SHA256SUMS file automatically and verify things
    on download. With other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but with
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.
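
Putting several of the features above together, a command line for a
protected and signed image could look roughly like this (all switches
as described in the items above):

# mkosi -t raw_squashfs --bootable --verity --checksum --sign --secure-boot

What mkosi generates this way can then be published as-is and, as
noted in item 2, downloaded and verified with machinectl pull-raw.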

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a
COPR.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way, even though all pieces
may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this is exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus for me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables), can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems, only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them carry the current v3
    yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, already because systemd
    is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal

and/or buy a ticket for
All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

casync — A tool for distributing file system images

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

Introducing casync

In the past months I have been working on a new project:
casync. casync takes
inspiration from the popular rsync file
synchronization tool as well as the probably even more popular
git revision control system. It combines the
idea of the rsync algorithm with the idea of git-style
content-addressable file systems, and creates a new system for
efficiently storing and delivering file system images, optimized for
high-frequency update cycles over the Internet. Its current focus is
on delivering IoT, container, VM, application, portable service or OS
images, but I hope to extend it later in a generic fashion to become
useful for backups and home directory synchronization as well (but
more about that later).

The basic technological building blocks casync is built from are
neither new nor particularly innovative (at least not anymore),
however the way casync combines them is different from existing tools,
and that’s what makes it useful for a variety of use-cases that other
tools can’t cover that well.

Why?

I created casync after studying how today’s popular tools store and
deliver file system images. To briefly name a few: Docker has a
layered tarball approach,
OSTree serves the
individual files directly via HTTP and maintains packed deltas to
speed up updates, while other systems operate on the block layer and
place raw squashfs images (or other archival file systems, such as
ISO9660) for download on HTTP shares (in the better cases combined
with zsync data).

Neither of these approaches appeared fully convincing to me when used
in high-frequency update cycle systems. In such systems, it is
important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don’t think any of the tools mentioned above are really good on more
than a small subset of these points.

Specifically: Docker’s layered tarball approach dumps the “delta”
question onto the feet of the image creators: the best way to make
your image downloads minimal is basing your work on an existing image
clients might already have, and inherit its resources, maintaining full
history. Here, revision control (a tool for the developer) is
intermingled with update management (a concept for optimizing
production delivery). As container histories grow individual deltas
are likely to stay small, but on the other hand a brand-new deployment
usually requires downloading the full history onto the deployment
system, even though there’s no use for it there, and likely requires
substantially more disk space and download sizes.

OSTree’s serving of individual files is unfriendly to CDNs (as many
small files in file trees cause an explosion of HTTP GET
requests). To counter that OSTree supports placing pre-calculated
delta images between selected revisions on the delivery servers, which
means a certain amount of revision management, that leaks into the
clients.

Delivering direct squashfs (or other file system) images is almost
beautifully simple, but of course means every update requires a full
download of the newest image, which is both bad for disk usage and
generated traffic. Enhancing it with zsync makes this a much better
option, as it can reduce generated traffic substantially at very
little cost of history/meta-data (no explicit deltas between a large
number of versions need to be prepared server side). On the other hand
server requirements in disk space and functionality (HTTP Range
requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it’s not
my intention to badmouth them. The only point I am trying to make is
that for the use case I care about — file system image delivery with
high-frequency update cycles — each system comes with certain
drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn’t happy with the security
and reproducibility properties of these systems. In today’s world
where security breaches involving hacking and breaking into connected
systems happen every day, an image delivery system that cannot make
strong guarantees regarding data integrity is out of
date. Specifically, the tarball format is famously nondeterministic:
the very same file tree can result in any number of different
valid serializations depending on the tool used, its version and the
underlying OS and file system. Some tar implementations attempt to
correct that by guaranteeing that each file tree maps to exactly
one valid serialization, but such a property is always only specific
to the tool used. I strongly believe that any good update system must
guarantee on every single link of the chain that there’s only one
valid representation of the data to deliver, that can easily be
verified.

What casync Is

So much about the background why I created casync. Now, let’s have a
look what casync actually is like, and what it does. Here’s the brief
technical overview:

Encoding: Let’s take a large linear data stream, split it into
variable-sized chunks (the size of each being a function of the
chunk’s contents), and store these chunks in individual, compressed
files in some directory, each file named after a strong hash value of
its contents, so that the hash value may be used as key for
retrieving the full chunk data. Let’s call this directory a “chunk
store”. At the same time, generate a “chunk index” file that lists
these chunk hash values plus their respective chunk sizes in a simple
linear array. The chunking algorithm is supposed to create variable,
but similarly sized chunks from the data stream, and do so in a way
that the same data results in the same chunks even if placed at
varying offsets. For more information see this blog
story.

Decoding: Let’s take the chunk index file, and reassemble the large
linear data stream by concatenating the uncompressed chunks retrieved
from the chunk store, keyed by the listed chunk hash values.

As an extra twist, we introduce a well-defined, reproducible,
random-access serialization format for file trees (think: a more
modern tar), to permit efficient, stable storage of complete file
trees in the system, simply by serializing them and then passing them
into the encoding step explained above.

Finally, let’s put all this on the network: for each image you want to
deliver, generate a chunk index file and place it on an HTTP
server. Do the same with the chunk store, and share it between the
various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result
in mostly the same chunk files in the chunk store. This means it is
very efficient to store many related versions of a data stream in the
same chunk store, thus minimizing disk usage. Moreover, when
transferring linear data streams chunks already known on the receiving
side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well,
one major difference between casync and those tools is that we
remove file boundaries before chunking things up. This means that
small files are lumped together with their siblings and large files
are chopped into pieces, which permits us to recognize similarities in
files and directories beyond file boundaries, and makes sure our chunk
sizes are pretty evenly distributed, without the file boundaries
affecting them.

The “chunking” algorithm is based on the buzhash rolling hash
function. SHA256 is used as strong hash function to generate digests
of the chunks. xz is used to compress the individual chunks.

Here’s a diagram, hopefully explaining a bit how the encoding process
works, were it not for my crappy drawing skills:

[Diagram: the casync encoding process]

The diagram shows the encoding process from top to bottom. It starts
with a block device or a file tree, which is then serialized and
chunked up into variable sized blocks. The compressed chunks are then
placed in the chunk store, while a chunk index file is written listing
the chunk hashes in order. (The original SVG of this graphic may be
found here.)

Details

Note that casync operates on two different layers, depending on the
use-case of the user:

  1. You may use it on the block layer. In this case the raw block data
    on disk is taken as-is, read directly from the block device, split
    into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the
    file tree serialization format mentioned above comes into play:
    the file tree is serialized depth-first (much like tar would do
    it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer
opens it up for a variety of different use-cases. In the VM and IoT
ecosystems shipping images as block-level serializations is more
common, while in the container and application world file-system-level
serializations are more typically used.

Chunk index files referring to block-layer serializations carry the
.caibx suffix, while chunk index files referring to file system
serializations carry the .caidx suffix. Note that you may also use
casync as direct tar replacement, i.e. without the chunking, just
generating the plain linear file tree serialization. Such files
carry the .catar suffix. Internally .caibx are identical to
.caidx files, the only difference is semantical: .caidx files
describe a .catar file, while .caibx files may describe any other
blob. Finally, chunk stores are directories carrying the .castr
suffix.
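
As a quick illustration of the plain, unchunked serialization mode
just mentioned, something like this should do (paths made up):

$ casync make foobar.catar /some/directory
$ casync extract foobar.catar /some/other/directory

This produces and then unpacks a single .catar archive; no chunk
store is involved.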

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync‘s --seed=
    feature: each block device, file, or directory specified is processed
    using the same chunking logic described above, and is used as
    preferred source when putting together the downloaded image locally,
    avoiding network transfer of it. This of course is useful whenever
    updating an image: simply specify one or more old versions as seed and
    only download the chunks that truly changed since then. Note that
    using seeds requires no history relationship between seed and the new
    image to download. This has major benefits: you can even use it to
    speed up downloads of relatively foreign and unrelated data. For
    example, when downloading a container image built using Ubuntu you can
    use your Fedora host OS tree in /usr as seed, and casync will
    automatically use whatever it can from that tree, for example timezone
    and locale data that tends to be identical between
    distributions. Example: casync extract
    http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2
    . This
    will place the block-layer image described by the indicated URL in the
    /dev/sda2 partition, using the existing /dev/sda1 data as seeding
    source. An invocation like this could be typically used by IoT systems
    with an A/B partition setup. Example 2: casync extract
    http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1
    --seed=/srv/container-v2 /src/container-v3
    , is very similar but
    operates on the file system layer, and uses two old container versions
    to seed the new version.

  2. When operating on the file system level, the user has fine-grained
    control on the meta-data included in the serialization. This is
    relevant since different use-cases tend to require a different set of
    saved/restored meta-data. For example, when shipping OS images, file
    access bits/ACLs and ownership matter, while file modification times
    hurt. When doing personal backups OTOH file ownership matters little
    but file modification times are important. Moreover different backing
    file systems support different feature sets, and storing more
    information than necessary might make it impossible to validate a tree
    against an image if the meta-data cannot be replayed in full. Due to
    this, casync provides a set of --with= and --without= parameters
    that allow fine-grained control of the data stored in the file tree
    serialization, including the granularity of modification times and
    more. The precise set of selected meta-data features is also always
    part of the serialization, so that seeding can work correctly and
    automatically.

  3. casync tries to be as accurate as possible when storing file
    system meta-data. This means that besides the usual baseline of file
    meta-data (file ownership and access bits), and more advanced features
    (extended attributes, ACLs, file capabilities) a number of more exotic
    data is stored as well, including Linux
    chattr(1) file attributes, as
    well as FAT file
    attributes

    (you may wonder why the latter? — EFI is FAT, and /efi is part of
    the comprehensive serialization of any host). In the future I intend
    to extend this further, for example storing btrfs sub-volume
    information where available. Note that as described above every single
    type of meta-data may be turned off and on individually, hence if you
    don’t need FAT file bits (and I figure it’s pretty likely you don’t),
    then they won’t be stored.

  4. The user creating .caidx or .caibx files may control the desired
    average chunk length (before compression) freely, using the
    --chunk-size= parameter. Smaller chunks increase the number of
    generated files in the chunk store and increase HTTP GET load on the
    server, but also ensure that sharing between similar images is
    improved, as identical patterns in the images stored are more likely
    to be recognized. By default casync will use a 64K average chunk
    size. Tweaking this can be particularly useful when adapting the
    system to specific CDNs, or when delivering compressed disk images
    such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible,
    well-defined and strictly deterministic. As mentioned above this is a
    requirement to reach the intended security guarantees, but is also
    useful for many other use-cases. For example, the casync digest
    command may be used to calculate a hash value identifying a specific
    directory in all desired detail (use --with= and --without to pick
    the desired detail). Moreover the casync mtree command may be used
    to generate a BSD mtree(5) compatible manifest of a directory tree,
    .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this
    I mean that the serialization of a file tree is the concatenation of
    the serializations of all files and file sub-trees located at the
    top of the tree, with zero meta-data references from any of these
    serializations into the others. This property is essential to ensure
    maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync
    will automatically create
    reflinks
    from any specified seeds if the underlying file system supports it
    (such as btrfs, ocfs, and future xfs). After all, instead of
    copying the desired data from the seed, we can just tell the file
    system to link up the relevant blocks. This works both when extracting
    .caidx and .caibx files — the latter of course only when the
    extracted disk image is placed in a regular raw image file on disk,
    rather than directly on a plain block device, as plain block devices
    do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can
    create traditional UNIX hard-links for identical files in specified
    seeds (--hardlink=yes). This works on all UNIX file systems, and can
    save substantial amounts of disk space. However, this only works for
    very specific use-cases where disk images are considered read-only
    after extraction, as any changes made to one tree will propagate to
    all other trees sharing the same hard-linked files, as that’s the
    nature of hard-links. In this mode, casync exposes OSTree-like
    behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file
    system images. Implicitly, file systems such as procfs and sysfs are
    excluded from serialization, as they expose API objects, not real
    files. Moreover, the “nodump” (+d)
    chattr(1) flag is honored by
    default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an
    automatic or explicit UID/GID shift. This is particularly useful when
    transferring container images for use with Linux user name-spacing.

  11. In addition to local operation, casync currently supports HTTP,
    HTTPS, FTP and ssh natively for downloading chunk index files and
    chunks (the ssh mode requires installing casync on the remote host,
    though, but an sftp mode not requiring that should be easy to
    add). When creating index files or chunks, only ssh is supported as
    remote back-end.

  12. When operating on block-layer images, you may expose locally or
    remotely stored images as local block devices. Example: casync mkdev
    http://example.com/myimage.caibx
    exposes the disk image described by
    the indicated URL as local block device in /dev, which you then may
    use the usual block device tools on, such as mount or fdisk (only
    read-only though). Chunks are downloaded on access with high priority,
    and at low priority when idle in the background. Note that in this
    mode, casync also plays a role similar to “dm-verity”, as all blocks
    are validated against the strong digests in the chunk index file
    before passing them on to the kernel’s block layer. This feature is
    implemented through Linux’ NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may mount
    locally or remotely stored images as regular file systems. Example:
    casync mount http://example.com/mytree.caidx /srv/mytree mounts the
    file tree image described by the indicated URL as a local directory
    /srv/mytree. This feature is implemented through Linux’ FUSE kernel
    facility. Note that special care is taken that the images exposed this
    way can be packed up again with casync make and are guaranteed to
    return the bit-by-bit exact same serialization again that it was
    mounted from. No data is lost or changed while passing things through
    FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s
    hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in
    the two partitions are usually much shorter than the partition size,
    in order to keep some room for later, larger updates. casync is able
    to analyze the super-block of a number of common file systems in order
    to determine the actual size of a file system stored on a block
    device, so that writing a file system to such a partition and reading
    it back again will result in reproducible data. Moreover this speeds
    up the seeding process, as there’s little point in seeding the
    white-space after the file system within the partition.

Example Command Lines

Here’s how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local
directory, and populate the chunk store directory default.castr
located next to it with the chunks of the serialization (you can
change the name for the store directory with --store= if you
like). This command operates on the file-system level. A similar
command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local
directory describing the current contents of the /dev/sda1 block
device, and populates default.castr in the same way as above. Note
that you may as well read a raw disk image from a file instead of a
block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and
the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data
only. Now let’s make this more interesting, and reference remote
resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of
course assumes that foobar.caidx was uploaded to the HTTP server in
the first place, along with the chunk store. You can use any command
you like to accomplish that, for example scp or
rsync. Alternatively, you can let casync do this directly when
generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then
places the .caidx file and the chunks on it. Note that this mode of
operation is “smart”: this scheme will only upload chunks currently
missing on the server side, and not re-transmit what already is
available.

Note that you can always configure the precise path or URL of the
chunk store via the --store= option. If you do not do that, then the
store path is automatically derived from the path or URL: the last
component of the path or URL is replaced by default.castr.
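
For example, to place the chunks somewhere other than the default
location, something along these lines should work (the store path
here is made up; check casync --help for the precise syntax):

$ casync make --store=/srv/mirror/store.castr foobar.caidx /some/directory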

Of course, when extracting .caidx or .caibx files from remote sources,
using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/exising/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by
default store meta-data as accurately as possible. Let’s create a chunk
index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization
that supports three features beyond the absolute baseline: 1s
granularity time-stamps, symbolic links and a single read-only bit. In
this mode, none of the other meta-data bits are stored, including
nanosecond time-stamps, full UNIX permission bits, file ownership,
ACLs and extended attributes.

Now let’s make a .caidx file available locally as a mounted file
system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similarly, let’s make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device
node path to STDOUT.

As mentioned, casync is big about reproducibility. Let’s make use of
that to calculate a digest identifying a very specific version of
a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying
file system know about. Usually, to make this useful you want to
configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data
fields. Specifying --with=unix selects all meta-data that
traditional UNIX file systems support. It is a shortcut for writing out:
--with=16bit-uids --with=permissions --with=sec-time --with=symlinks
--with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may
also use the negative --without= option to remove specific features
while otherwise starting from the most precise set:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves
one feature out: chattr(1)‘s
immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and
directories, not too different from tar t or ls -al in its
output. The latter command will generate a BSD
mtree(5) compatible
manifest. Note that casync actually stores substantially more file
meta-data than mtree files can express, though.

What casync isn’t

  1. casync is not an attempt to minimize serialization and downloaded
    deltas to the extreme. Instead, the tool is supposed to find a good
    middle ground that is efficient on traffic and disk space, but not at
    the price of convenience or of requiring explicit revision control. If
    you care about updates that are absolutely minimal, there are binary
    delta systems around that might be an option for you, such as Google’s
    Courgette.

  2. casync is not a replacement for rsync, git, zsync or
    anything like that. They have very different use-cases and
    semantics. For example, rsync permits you to directly synchronize two
    file trees remotely. casync just cannot do that, and it is unlikely
    it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary
focus for now is delivery of OS images, but I’d like to make it useful
for a couple of other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have
    pretty concrete plans how to add that. When implemented, the tool
    might become an alternative to restic,
    BorgBackup or
    tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still
    need to validate the downloaded .caidx or .caibx file yourself, for
    example with some gpg signature. It is my intention to integrate with
    gpg in a minimal way so that signing and verifying chunk index files
    is done automatically.

  3. In the longer run, I’d like to build an automatic synchronizer for
    $HOME between systems from this. Each $HOME instance would be
stored automatically at regular intervals in the cloud using casync,
    and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet
    built as one. Specifically this means that almost all of casync‘s
functionality is supposed to be available as a C API soon, and
    applications can process casync files on every level. It is my
    intention to make this library useful enough so that it will be easy
    to write a module for GNOME’s gvfs subsystem in order to make remote
    or local .caidx files directly available to applications (as an
    alternative to casync mount). In fact the idea is to make this all
    flexible enough that even the remoting back-ends can be replaced
    easily, for example to replace casync‘s default HTTP/HTTPS back-ends
    built on CURL with GNOME’s own HTTP implementation, in order to share
    cookies, certificates, … There’s also an alternative method to
    integrate with casync in place already: simply invoke casync as a
    sub-process. casync will inform you about a certain set of state
    changes using a mechanism compatible with
    sd_notify(3). In the
    future it will also propagate progress data this way, and more.

  5. I intend to add a new seeding back-end that sources chunks from
    the local network. After downloading the new .caidx file off the
    Internet, casync would then search for the listed chunks on the local
    network first, before retrieving them from the Internet. This should
    speed things up on all installations that have multiple similar
    systems deployed in the same network.

Further plans are listed tersely in the
TODO file.

FAQ:

  1. Is this a systemd project? — casync is hosted under the
    github systemd umbrella, and the
    projects share the same coding style. However, the code-bases are
    distinct and without interdependencies, and casync works fine both
    on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and
    that’s what I code for. That said, I am open to accepting portability
    patches (unlike for systemd, which doesn’t really make sense on
    non-Linux systems), as long as they don’t interfere too much with the
    way casync works. Specifically this means that I am not too
    enthusiastic about merging portability patches for OSes lacking the
    openat(2) family
    of APIs.

  3. Does casync require reflink-capable file systems to work, such
    as btrfs?
    — No it doesn’t. The reflink magic in casync is
    employed when the file system permits it, and it’s good to have it,
    but it’s not a requirement, and casync will implicitly fall back to
    copying when it isn’t available. Note that casync supports a number
    of file system features on a variety of file systems that aren’t
    available everywhere, for example FAT’s system/hidden file flags or
    xfs‘s projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial
    release. While I have been working on it for quite some time and it
    is quite featureful, this is the first time I am advertising it
    publicly, and it has hence received very little testing outside of its
    own test suite. I am also not fully ready to commit to the stability
    of the current serialization or chunk index format. I don’t see any
    breakages coming for it though. casync is pretty light on
    documentation right now, and does not even have a man page. I intend
    to correct that soon, too.

  5. Are the .caidx/.caibx and .catar file formats open and
    documented?
    casync is Open Source, so if you want to know the
    precise format, have a look at the sources for now. It’s definitely my
    intention to add comprehensive docs for both formats however. Don’t
    forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing
    the wheel (again)?
    — Well, because casync isn’t “just like” some
    other tool. I am pretty sure I did my homework, and that there is no
    tool just like casync right now. The tools coming closest are probably
    rsync, zsync, tarsnap, restic, but they are quite different beasts
    each.

  7. Why did you invent your own serialization format for file trees?
    Why don’t you just use tar?
    — That’s a good question, and other
    systems — most prominently tarsnap — do that. However, as mentioned
    above tar doesn’t enforce reproducibility. It also doesn’t really do
    random access: if you want to access some specific file you need to
    read every single byte stored before it in the tar archive to find
    it, which is of course very expensive. The serialization casync
    implements places a focus on reproducibility, random access, and
    meta-data control. Much like traditional tar it can still be
    generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the
    moment, no. That’s not because I wouldn’t want it to, but simply
    because I am not a guru of either of these systems, and didn’t want to
    implement something I do not fully grok and cannot test. If you look
    at the sources you’ll find that there are already some definitions in
    place that keep room for them though. I’d be delighted to accept a
    patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking
    work on compressed serializations?
    – That’s a very good point!
    Usually, if you apply a chunking algorithm to a compressed data
    stream (let’s say a tar.gz file), then changing a single bit at the
    front will propagate into the entire remainder of the file, so that
    minimal changes will explode into major changes. Thankfully this
    doesn’t apply that strictly to squashfs images, as squashfs provides
    random access to files and directories and thus breaks up the
    compression streams at regular intervals to make seeking easy. This
    fact is beneficial for systems employing chunking, such as casync, as
    this means single bit changes might affect their vicinity but will not
    explode in an unbounded fashion. In order to achieve best results when
    delivering squashfs images through casync the block sizes of
    squashfs and the chunk sizes of casync should be matched up
    (using casync‘s --chunk-size= option; see the example sketch right
    after this FAQ). How precisely to choose both values is left as a
    research subject for the user, for now.

  10. What does the name casync mean? – It’s a synchronizing
    tool, hence the -sync suffix, following rsync‘s naming. It makes
    use of the content-addressable concept of git, hence the ca-
    prefix.

  11. Where can I get this stuff? Is it already packaged? – Check
    out the sources on GitHub. I just tagged the first version. Martin
    Pitt has packaged casync for Ubuntu. There is also an ArchLinux
    package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that
    hopefully will soon be included in the distribution.
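
To illustrate the block-size matching mentioned in question 9 above,
here’s a hedged sketch (the file names are placeholders, and the exact
syntax accepted by --chunk-size= should be double-checked against the
casync documentation): it builds a squashfs image with 128KiB blocks
and then chunks it with a matching average chunk size.

$ mksquashfs rootfs/ rootfs.squashfs -b 131072
$ casync make --chunk-size=131072 rootfs.caibx rootfs.squashfs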

Should you care? Is this a tool for you?

Well, that’s up to you really. If you are involved with projects that
need to deliver IoT, VM, container, application or OS images, then
maybe this is a great tool for you — but other options exist, some of
which are linked above.

Note that casync is an Open Source project: if it doesn’t do exactly
what you need, prepare a patch that adds what you need, and we’ll
consider it.

If you are interested in the project and would like to talk about this
in person, I’ll be presenting casync soon at Kinvolk’s Linux
Technologies Meetup in Berlin, Germany. You are invited. I also intend
to talk about it at All Systems Go!, also in Berlin.

All Systems Go! 2017 CfP Open

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2017-cfp-open.html

The All Systems Go! 2017 Call for Participation is Now Open!

We’d like to invite presentation proposals for All Systems Go! 2017!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

We are now accepting submissions for presentation proposals. In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

Please submit your proposals by September 3rd. Notification of acceptance will be sent out 1-2 weeks later.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year; All Systems Go! takes its place. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!

Avoiding CVE-2016-8655 with systemd

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/avoiding-cve-2016-8655-with-systemd.html

Avoiding CVE-2016-8655 with systemd

Just a quick note: on recent versions of
systemd it is
relatively easy to block the vulnerability described in
CVE-2016-8655 for
individual services.

Since systemd release v211 there’s an option
RestrictAddressFamilies=
for service unit files which takes away the right to create sockets of
specific address families for processes of the service. In your unit
file, add RestrictAddressFamilies=~AF_PACKET to the [Service]
section to make AF_PACKET unavailable to it (i.e. a blacklist),
which is sufficient to close the attack path. Safer of course is a
whitelist of address families which you can define by dropping the ~
character from the assignment. Here’s a trivial example:


[Service]
ExecStart=/usr/bin/mydaemon
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

This restricts access to socket families, so that the service may
access only AF_INET, AF_INET6 or AF_UNIX sockets, which is
usually the right, minimal set for most system daemons. (AF_INET is
the low-level name for the IPv4 address family, AF_INET6 for the
IPv6 address family, and AF_UNIX for local UNIX socket IPC).

Starting with systemd v232 we added RestrictAddressFamilies= to all
of systemd’s own unit files, always with the minimal set of socket
address families appropriate.

With the upcoming v233 release we’ll provide a second method for
blocking this vulnerability. Using
RestrictNamespaces=
it is possible to limit which types of Linux namespaces a service may
get access to. Use RestrictNamespaces=yes to prohibit access to any
kind of namespace, or set RestrictNamespaces=net ipc (or similar) to
restrict access to a specific set (in this case: network and IPC
namespaces). Given that user namespaces have been a major source of
security vulnerabilities in the past months it’s probably a good idea
to block namespaces on all services which don’t need them (which is
probably most of them).
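
Here’s a hedged sketch of how the two settings could be combined in a
unit file (the daemon path is just a placeholder, and
RestrictNamespaces= requires systemd v233 or newer):


[Service]
ExecStart=/usr/bin/mydaemon
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=yes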

Of course, ideally, distributions such as Fedora, as well as upstream
developers would turn on the various sandboxing settings systemd
provides like these ones by default, since they know best which kind
of address families or namespaces a specific daemon needs.

systemd.conf 2016 Over Now

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/systemdconf-2016-over-now.html

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our
second conference of this kind. I personally enjoyed this conference a
lot: the talks, the atmosphere, the audience, the organization, the
location, they all were excellent!

I’d like to take the opportunity to thank everybody involved. In
particular I’d like to thank Chris, Daniel, Sandra and Henrike
for organizing the conference; your work was stellar!

I’d also like to thank our sponsors, without which the conference
couldn’t take place like this, of course. In particular I’d like to
thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as
well as our silver sponsors CoreOS and Facebook. I’d also like to
thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix,
our supporting sponsor Codethink and last but not least our media
sponsor Linux Magazin. Thank you all!

I’d also like to thank the Video Operation Center
(“VOC”)
for their amazing work on live-streaming
the conference and making all talks available on YouTube. It’s amazing
how efficient the VOC is, it’s simply stunning! Thank you guys!

In case you missed this year’s iteration of the conference, please
have a look at our YouTube
Channel
. You’ll
find all of this year’s talks there, as well as the ones from last
year. (For example, my welcome talk is available
here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!

systemd.conf 2016 Workshop Tickets Available

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/systemdconf-2016-workshop-tickets-available.html

Tickets for the systemd.conf 2016 Workshop day are still available!

We still have a number of tickets for the workshop day of systemd.conf
2016
available. If you are a newcomer to
systemd, and would like to learn about various systemd facilities, or
if you already know your way around, but would like to know more: this
is the best chance to do so. The workshop day is the 28th of
September, one day before the main conference, at the betahaus in
Berlin, Germany. The schedule for the day is available
here. There
are five interesting, extensive sessions, run by the systemd hackers
themselves. Who better to learn systemd from, than the folks who wrote
it?

Note that the workshop day and the main conference days require
different tickets. (Also note: there are still a few tickets available for
the main conference!).

Buy a ticket here.

See you in Berlin!

Preliminary systemd.conf 2016 Schedule

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/preliminary-systemdconf-2016-now-available.html

A Preliminary systemd.conf 2016 Schedule is Now Available!

We have just published a first, preliminary version of the
systemd.conf 2016 schedule. There are still a few empty slots in the
schedule, because we’re missing confirmation from a small number of
presenters. The missing talks will be added as soon as they are
confirmed.

The schedule consists of 5 workshops by high-profile speakers during
the workshop day, 22 exciting talks during the main conference days,
followed by one full day of hackfests.

Please sign up for the conference soon! Only a limited number of
tickets are available, hence make sure to secure yours quickly before
they run out! (Last year we sold out.) Please sign up here for the
conference!

FINAL REMINDER! systemd.conf 2016 CfP Ends on Monday!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/final-reminder-systemdconf-2016-cfp-ends-on-monday.html

Please note that the systemd.conf 2016
Call for Participation ends on Monday, Aug. 1st! Please send
in your talk proposal by then! We’ve already got a good number of
excellent submissions, but we are very interested in yours, too!

We are looking for talks on all facets of systemd: deployment,
maintenance, administration, development. Regardless of whether you
use it in the cloud, on embedded, on IoT, on the desktop, on mobile,
in a container or on the server: we are interested in your
submissions!

In addition to proposals for talks for the main conference, we are
looking for proposals for workshop sessions held during our
Workshop Day (the first day of the conference). The workshop format
consists of a day of 2-3h training sessions that may cover any
systemd-related topic you’d like. We are interested both in
submissions from the developer community and in submissions from
organizations making use of systemd! Introductory workshop sessions
are particularly welcome, as the Workshop Day is intended to open up
our conference to newcomers and people who aren’t systemd gurus yet,
but would like to become more fluent.

For further details on the submissions we are looking for and the CfP
process, please consult the CfP
page
and
submit your proposal using the provided form!

ALSO: Please sign up for the conference soon! Only a
limited number of tickets are available, hence make sure to secure
yours quickly before they run out! (Last year we sold out.) Please
sign up here for the
conference!

AND OF COURSE: We are also looking for more sponsors for
systemd.conf! If you are working on systemd-related projects, or make
use of it in your company, please consider becoming a sponsor of
systemd.conf 2016!
Without our sponsors we couldn’t organize systemd.conf 2016!

Thank you very much, and see you in Berlin!

REMINDER! systemd.conf 2016 CfP Ends in Two Weeks!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/reminder-systemdconf-2016-cfp-ends-in-two-weeks.html

Please note that the systemd.conf 2016
Call for Participation ends in less than two weeks, on Aug. 1st!
Please send in your talk proposal by then! We’ve already got a good
number of excellent submissions, but we are interested in yours even
more!

We are looking for talks on all facets of systemd: deployment,
maintenance, administration, development. Regardless of whether you
use it in the cloud, on embedded, on IoT, on the desktop, on mobile,
in a container or on the server: we are interested in your
submissions!

In addition to proposals for talks for the main conference, we are
looking for proposals for workshop sessions held during our
Workshop Day (the first day of the conference). The workshop format
consists of a day of 2-3h training sessions that may cover any
systemd-related topic you’d like. We are interested both in
submissions from the developer community and in submissions from
organizations making use of systemd! Introductory workshop sessions
are particularly welcome, as the Workshop Day is intended to open up
our conference to newcomers and people who aren’t systemd gurus yet,
but would like to become more fluent.

For further details on the submissions we are looking for and the CfP
process, please consult the CfP
page
and
submit your proposal using the provided form!

And keep in mind:

REMINDER: Please sign up for the conference soon! Only a
limited number of tickets are available, hence make sure to secure
yours quickly before they run out! (Last year we sold out.) Please
sign up here for the
conference!

AND OF COURSE: We are also looking for more sponsors for
systemd.conf! If you are working on systemd-related projects, or make
use of it in your company, please consider becoming a sponsor of
systemd.conf 2016!
Without our sponsors we couldn’t organize systemd.conf 2016!

Thank you very much, and see you in Berlin!

CfP is now open

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/cfp-is-now-open.html

The systemd.conf 2016 Call for Participation is Now Open!

We’d like to invite presentation and workshop proposals for systemd.conf 2016!

The conference will consist of three parts:

  • One day of workshops, consisting of in-depth (2-3hr) training and learning-by-doing sessions (Sept. 28th)
  • Two days of regular talks (Sept. 29th-30th)
  • One day of hackfest (Oct. 1st)

We are now accepting submissions for the first three days: proposals
for workshops, training sessions and regular talks. In particular, we
are looking for sessions including, but not limited to, the following
topics:

  • Use Cases: systemd in today’s and tomorrow’s devices and applications
  • systemd and containers, in the cloud and on servers
  • systemd in distributions
  • systemd in embedded systems and IoT
  • systemd on the desktop
  • Networking with systemd
  • … and everything else related to systemd

Please submit your proposals by August 1st, 2016. Notification of acceptance will be sent out 1-2 weeks later.

If submitting a workshop proposal please contact the organizers for more details.

To submit a talk, please visit our CfP submission page.

For further information on systemd.conf 2016, please visit our conference web site.

Announcing systemd.conf 2016

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/announcing-systemdconf-2016.html

Announcing systemd.conf 2016

We are happy to announce the 2016 installment of systemd.conf, the conference of the systemd project!

After our successful first conference in 2015, we’d like to hold the event for the second time in 2016. The conference will take place from September 28th until October 1st, 2016 at betahaus in Berlin, Germany. The event is a few days before LinuxCon Europe, which is also located in Berlin this year. This year, the conference will consist of two days of presentations, a one-day hackfest and one day of hands-on training sessions.

The website is online now, please visit https://conf.systemd.io/.

Tickets at early-bird prices are available already. Purchase them at https://ti.to/systemdconf/systemdconf-2016.

The Call for Presentations will open soon, we are looking forward to your submissions! A separate announcement will be published as soon as the CfP is open.

systemd.conf 2016 is organized jointly by the systemd community and kinvolk.io.

We are looking for sponsors! We’ve got early commitments from some of last year’s sponsors: Collabora, Pengutronix & Red Hat. Please see the web site for details about how your company may become a sponsor, too.

If you have any questions, please contact us at [email protected].

Introducing sd-event

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/introducing-sd-event.html

The Event Loop API of libsystemd

When we began working on
systemd we built
it around a hand-written ad-hoc event loop, wrapping Linux
epoll
. The more
our project grew the more we realized the limitations of using raw
epoll:

  • As we used
    timerfd
    for our timer events, each event source cost one file descriptor and
    we had many of them! File descriptors are a scarce resource on UNIX,
    as
    RLIMIT_NOFILE
    is typically set to 1024 or similar, limiting the number of
    available file descriptors per process to 1021, which isn’t
    particularly many.

  • Ordering of event dispatching became a nightmare. In many cases, we
    wanted to make sure that a certain kind of event would always be
    dispatched before another kind of event, if both happen at the same
    time. For example, when the last process of a service dies, we might
    be notified about that via a SIGCHLD signal, via an
    sd_notify() “STATUS=”
    message, and via a control group notification. We wanted to get
    these events in the right order, to know when it’s safe to process
    and subsequently release the runtime data systemd keeps about the
    service or process: it shouldn’t be done if there are still events
    about it pending.

  • For each program we added to the systemd project we noticed we were
    adding similar code, over and over again, to work with epoll’s
    complex interfaces. For example, finding the right file descriptor
    and callback function to dispatch an epoll event to, without running
    into invalidated pointer issues is outright difficult and requires
    non-trivial code.

  • Integrating child process watching into our event loops was much
    more complex than one could hope, and even more so if child process
    events should be ordered against each other and unrelated kinds of
    events.

Eventually, we started working on
sd-bus. At
the same time we decided to seize the opportunity, put together a
proper event loop API in C, and then not only port sd-bus on top of
it, but also the rest of systemd. The result of this is
sd-event. After
almost two years of development we declared sd-event stable in systemd
version 221, and published it as an official API of libsystemd.

Why?

sd-event.h,
of course, is not the first event loop API around, and it doesn’t
implement any really novel concepts. When we started working on it we
tried to do our homework, and checked the various existing event loop
APIs, maybe looking for candidates to adopt instead of doing our own,
and to learn about the strengths and weaknesses of the various
implementations existing. Ultimately, we found no implementation that
could deliver what we needed, or where it would be easy to add the
missing bits: as usual in the systemd project, we wanted something
that allows us access to all the Linux-specific bits, instead of
limiting itself to the least common denominator of UNIX. We weren’t
looking for an abstraction API, but simply one that makes epoll usable
in system code.

With this blog story I’d like to take the opportunity to introduce you
to sd-event, and explain why it might be a good candidate to adopt as
event loop implementation in your project, too.

So, here are some features it provides:

  • I/O event sources, based on epoll’s file descriptor watching,
    including edge triggered events (EPOLLET). See
    sd_event_add_io(3).

  • Timer event sources, based on timerfd_create(), supporting the
    CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME clocks, as well
    as the CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM clocks that
    can resume the system from suspend. When creating timer events a
    required accuracy parameter may be specified which allows coalescing
    of timer events to minimize power consumption. For each clock only a
    single timer file descriptor is kept, and all timer events are
    multiplexed with a priority queue. See
    sd_event_add_time(3). (A minimal timer sketch follows this
    feature list.)

  • UNIX process signal events, based on
    signalfd(2),
    including full support for real-time signals, and queued
    parameters. See sd_event_add_signal(3).

  • Child process state change events, based on
    waitid(2). See
    sd_event_add_child(3).

  • Static event sources, of three types: defer, post and exit, for
    invoking calls in each event loop, after other event sources or at
    event loop termination. See
    sd_event_add_defer(3).

  • Event sources may be assigned a 64-bit priority value that controls
    the order in which event sources are dispatched if multiple are
    pending simultaneously. See
    sd_event_source_set_priority(3).

  • The event loop may automatically send watchdog notification messages
    to the service manager. See sd_event_set_watchdog(3).

  • The event loop may be integrated into foreign event loops, such as
    the GLib one. The event loop API is hence composable, the same way
    the underlying epoll logic is. See
    sd_event_get_fd(3)
    for an example.

  • The API is fully OOM safe.

  • A complete set of documentation in UNIX man page format is
    available, with
    sd-event(3)
    as the entry page.

  • It’s pretty widely available, and requires no extra
    dependencies. Since systemd is built on it, most major distributions
    ship the library in their default install set.

  • After two years of development, and after being used in all of
    systemd’s components, it has received a fair share of testing already,
    even though we only recently decided to declare it stable and turned
    it into a public API.
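
To make the timer coalescing mentioned above a bit more concrete,
here is a minimal sketch (not part of the original post, and with
error checking omitted for brevity): it arms a CLOCK_MONOTONIC timer
one second in the future with 250ms of permitted slack, re-arms it
from its own handler, and exits after five ticks. It assumes the
libsystemd development headers are installed.

/* timer-example.c: a periodic timer built on sd_event_add_time() */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#include <systemd/sd-event.h>

static int timer_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        int *count = userdata;

        printf("tick %d\n", ++*count);
        if (*count >= 5)
                return sd_event_exit(sd_event_source_get_event(s), 0);

        /* Timer event sources fire as one-shot; re-arm the source to
         * trigger again one second later and re-enable it */
        sd_event_source_set_time(s, usec + UINT64_C(1000000));
        return sd_event_source_set_enabled(s, SD_EVENT_ONESHOT);
}

int main(void) {
        sd_event *event = NULL;
        sd_event_source *timer = NULL;
        uint64_t now;
        int count = 0;

        sd_event_default(&event);
        sd_event_now(event, CLOCK_MONOTONIC, &now);

        /* First expiry one second from now, with 250ms of permitted
         * slack so the loop may coalesce it with other nearby timers */
        sd_event_add_time(event, &timer, CLOCK_MONOTONIC,
                          now + UINT64_C(1000000), UINT64_C(250000),
                          timer_handler, &count);

        sd_event_loop(event);

        timer = sd_event_source_unref(timer);
        event = sd_event_unref(event);
        return 0;
}

It can be compiled the same way as the larger example shown in the
Getting Started section below.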

Note that sd-event has some potential drawbacks too:

  • If portability is essential to you, sd-event is not your best
    option. sd-event is a wrapper around Linux-specific APIs, and that’s
    visible in the API. For example: our event callbacks receive
    structures defined by Linux-specific APIs such as signalfd.

  • It’s a low-level C API, and it doesn’t isolate you from the OS
    underpinnings. While I like to think that it is relatively nice and
    easy to use from C, it doesn’t compromise on exposing the low-level
    functionality. It just fills the gaps in what’s missing between
    epoll, timerfd, signalfd and related concepts, and it does not hide
    that away.

Either way, I believe that sd-event is a great choice when looking for
an event loop API, in particular if you work on system-level or
embedded software, where functionality like timer coalescing or
watchdog support matters.

Getting Started

Here’s a short example of how to use sd-event in a simple daemon. In this
example, we’ll not just use sd-event.h, but also sd-daemon.h to
implement a system service.

#include <alloca.h>
#include <endian.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <systemd/sd-daemon.h>
#include <systemd/sd-event.h>

static int io_handler(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
        void *buffer;
        ssize_t n;
        int sz;

        /* UDP enforces a somewhat reasonable maximum datagram size of 64K, we can just allocate the buffer on the stack */
        if (ioctl(fd, FIONREAD, &sz) < 0)
                return -errno;
        buffer = alloca(sz);

        n = recv(fd, buffer, sz, 0);
        if (n < 0) {
                if (errno == EAGAIN)
                        return 0;

                return -errno;
        }

        if (n == 5 && memcmp(buffer, "EXIT\n", 5) == 0) {
                /* Request a clean exit */
                sd_event_exit(sd_event_source_get_event(es), 0);
                return 0;
        }

        fwrite(buffer, 1, n, stdout);
        fflush(stdout);
        return 0;
}

int main(int argc, char *argv[]) {
        union {
                struct sockaddr_in in;
                struct sockaddr sa;
        } sa;
        sd_event_source *event_source = NULL;
        sd_event *event = NULL;
        int fd = -1, r;
        sigset_t ss;

        r = sd_event_default(&event);
        if (r < 0)
                goto finish;

        if (sigemptyset(&ss) < 0 ||
            sigaddset(&ss, SIGTERM) < 0 ||
            sigaddset(&ss, SIGINT) < 0) {
                r = -errno;
                goto finish;
        }

        /* Block SIGTERM first, so that the event loop can handle it */
        if (sigprocmask(SIG_BLOCK, &ss, NULL) < 0) {
                r = -errno;
                goto finish;
        }

        /* Let's make use of the default handler and "floating" reference features of sd_event_add_signal() */
        r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
        if (r < 0)
                goto finish;
        r = sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
        if (r < 0)
                goto finish;

        /* Enable automatic service watchdog support */
        r = sd_event_set_watchdog(event, true);
        if (r < 0)
                goto finish;

        fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
        if (fd < 0) {
                r = -errno;
                goto finish;
        }

        sa.in = (struct sockaddr_in) {
                .sin_family = AF_INET,
                .sin_port = htobe16(7777),
        };
        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                r = -errno;
                goto finish;
        }

        r = sd_event_add_io(event, &event_source, fd, EPOLLIN, io_handler, NULL);
        if (r < 0)
                goto finish;

        (void) sd_notifyf(false,
                          "READY=1\n"
                          "STATUS=Daemon startup completed, processing events.");

        r = sd_event_loop(event);

finish:
        event_source = sd_event_source_unref(event_source);
        event = sd_event_unref(event);

        if (fd >= 0)
                (void) close(fd);

        if (r < 0)
                fprintf(stderr, "Failure: %s\n", strerror(-r));

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

The example above shows how to write a minimal UDP/IP server, that
listens on port 7777. Whenever a datagram is received it outputs its
contents to STDOUT, unless it is precisely the string EXIT\n in
which case the service exits. The service will react to SIGTERM and
SIGINT and do a clean exit then. It also notifies the service manager
about its completed startup, if it runs under a service
manager. Finally, it sends watchdog keep-alive messages to the service
manager if it asked for that, and if it runs under a service manager.

When run as systemd service this service’s STDOUT will be connected to
the logging framework of course, which means the service can act as a
minimal UDP-based remote logging service.
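
A hedged sketch of a unit file for running it that way might look as
follows (the binary path is a placeholder); Type=notify matches the
sd_notifyf() READY=1 call above, and WatchdogSec= is what makes the
service manager request the watchdog keep-alive messages:


[Unit]
Description=Minimal sd-event UDP example

[Service]
ExecStart=/usr/local/bin/event-example
Type=notify
WatchdogSec=20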

To compile and link this example, save it as event-example.c, then run:

$ gcc event-example.c -o event-example `pkg-config --cflags --libs libsystemd`

For a first test, simply run the resulting binary from the command
line, and test it against the following netcat command line:

$ nc -u localhost 7777

For the sake of brevity error checking is minimal, and in a real-world
application should, of course, be more comprehensive. However, it
hopefully gets the idea across how to write a daemon that reacts to
external events with sd-event.

For further details on the functions used in the example above, please
consult the manual pages:
sd-event(3),
sd_event_exit(3),
sd_event_source_get_event(3),
sd_event_default(3),
sd_event_add_signal(3),
sd_event_set_watchdog(3),
sd_event_add_io(3),
sd_notifyf(3),
sd_event_loop(3),
sd_event_source_unref(3),
sd_event_unref(3).

Conclusion

So, is this the event loop to end all other event loops? Certainly
not. I actually believe in “event loop plurality”. There are many
reasons for that, but most importantly: sd-event is supposed to be an
event loop suitable for writing a wide range of applications, but it’s
definitely not going to solve all event loop problems. For example,
while the priority logic is important for many use cases it comes with
drawbacks for others: if not used carefully, high-priority event
sources can easily starve low-priority event sources. Also, in order
to implement the priority logic, sd-event needs to iterate
through the event structures returned by
epoll_wait(2)
and sort the events by their priority, resulting in worst case
O(n*log(n)) complexity on each event loop wakeup (for n = number of
file descriptors). Then, to implement priorities fully, sd-event only
dispatches a single event before going back to the kernel and asking
for new events. sd-event will hence not provide the theoretically
possible best scalability to huge numbers of file descriptors. Of
course, this could be optimized, by improving epoll, and making it
support how today’s event loops actually work (after all, this is
the problem set all event loops that implement priorities — including
GLib’s — have to deal with), but even then: the design of sd-event is focussed on
running one event loop per thread, and it dispatches events strictly
ordered. In many other important use cases a very different design is
preferable: one where events are distributed to a set of worker threads
and are dispatched out-of-order.

Hence, don’t mistake sd-event for what it isn’t. It’s not supposed to
unify everybody on a single event loop. It’s just supposed to be a
very good implementation of an event loop suitable for a large part of
the typical use cases.

Note that our APIs, including
sd-bus, integrate nicely into
sd-event event loops, but do not require it, and may be integrated
into other event loops too, as long as they support watching for time
and I/O events.

And that’s all for now. If you are considering using sd-event for your
project and need help or have questions, please direct them to the
systemd mailing list.