Tag Archives: kvm

Security advisories for Wednesday

Post Syndicated from ris original http://lwn.net/Articles/687859/rss

Arch Linux has updated expat (code execution) and lib32-expat (code execution).

CentOS has updated libndp (C7: man-in-the-middle attacks).

Debian has updated expat (code execution).

Debian-LTS has updated libidn (information disclosure), librsvg (denial of service), and xen (multiple vulnerabilities).

Fedora has updated dhcp (F22: denial of service).

openSUSE has updated cacti
(Leap42.1, 13.2: SQL injection), Chromium
(SPH for SLE12: multiple vulnerabilities), go (Leap42.1: two vulnerabilities), GraphicsMagick (Leap42.1, 13.2: multiple
vulnerabilities), imlib2 (13.2: multiple
vulnerabilities), libressl (13.2: multiple
vulnerabilities), librsvg (Leap42.1, 13.2:
denial of service), mercurial (Leap42.1,
13.2: code execution), mysql-community-server (Leap42.1, 13.2:
multiple vulnerabilities), ntp (Leap42.1:
multiple vulnerabilities), ocaml (13.2:
information leak), poppler (13.2: denial of
service), and proftpd (Leap42.1, 13.2: weak key usage).

Oracle has updated kernel (OL6:
multiple vulnerabilities), kernel 4.1.12 (OL7; OL6:
three vulnerabilities), libndp (OL7:
man-in-the-middle attacks), and qemu-kvm
(OL6: multiple vulnerabilities).

Scientific Linux has updated kernel (SL7: privilege escalation) and thunderbird (SL5,7: two vulnerabilities).

SUSE has updated xen (SLE12: multiple vulnerabilities).

Ubuntu has updated expat (code
execution), libarchive (code execution), libksba (multiple vulnerabilities), and samba (12.04: regression in previous update).

Thursday’s security advisories

Post Syndicated from jake original http://lwn.net/Articles/687221/rss

Debian-LTS has updated ocaml
(code execution) and xerces-c (code execution).

Fedora has updated kernel (F23:
information leak), ntp (F22: multiple
vulnerabilities), php (F22: multiple
vulnerabilities), subversion (F23: two
vulnerabilities), and xen (F23: two
vulnerabilities).

Mageia has updated libtasn1
(denial of service) and squid (two
vulnerabilities).

Oracle has updated pcre (OL7:
multiple vulnerabilities).

Red Hat has updated kernel
(RHEL7: privilege escalation), kernel-rt (RHEL7; RHEL6:
privilege escalation), and thunderbird (two
vulnerabilities).

Slackware has updated thunderbird
(multiple vulnerabilities).

SUSE has updated mysql (SLE11:
multiple vulnerabilities), ntp (SLE11:
multiple vulnerabilities), and php5 (SLE12:
multiple vulnerabilities).

Ubuntu has updated qemu, qemu-kvm
(multiple vulnerabilities).

Security advisories for Wednesday

Post Syndicated from ris original http://lwn.net/Articles/687038/rss

Arch Linux has updated cacti (SQL injection) and squid (multiple vulnerabilities).

Debian has updated libarchive
(code execution) and monotone, ovito, pdns,
qtcreator, and softhsm
(regression in previous update).

Debian-LTS has updated botan1.10
(regression in previous update). Not all Debian packages are fully
supported in Wheezy LTS. See the debian-security-support advisory for details.

Fedora has updated glibc (F23:
multiple vulnerabilities), graphite2 (F22:
multiple vulnerabilities), ntp (F23:
multiple vulnerabilities), openssl (F22:
multiple vulnerabilities), pgpdump (F23; F22:
denial of service), and thunderbird (F22: multiple vulnerabilities).

openSUSE has updated compat-openssl098 (Leap42.1: multiple
vulnerabilities) and php5 (13.2: multiple vulnerabilities).

Red Hat has updated file (RHEL6:
multiple vulnerabilities), icedtea-web
(RHEL6: applet execution), java-1.8.0-ibm
(RHEL6: multiple vulnerabilities), kernel
(RHEL6: multiple vulnerabilities), ntp
(RHEL6: multiple vulnerabilities), openshift (RHOSE3.1: information disclosure),
openssh (RHEL6: multiple vulnerabilities),
pcre (RHEL7: multiple vulnerabilities), and
qemu-kvm-rhev
(RHELOSP5 for RHEL6: code execution).

Scientific Linux has updated pcre
(SL7: multiple vulnerabilities).

Slackware has updated imagemagick (multiple vulnerabilities).

SUSE has updated ImageMagick
(SOSC5, SMP2.1, SM2.1, SLE11-SP4: multiple vulnerabilities).

Ubuntu has updated openjdk-6
(12.04: multiple vulnerabilities).

Security updates for Tuesday

Post Syndicated from ris original http://lwn.net/Articles/686856/rss

CentOS has updated ImageMagick (C7; C6:
multiple vulnerabilities), java-1.6.0-openjdk (C7; C6; C5: multiple vulnerabilities), and qemu-kvm (C7: code execution).

Debian has updated qemu (two vulnerabilities) and websvn (cross-site scripting).

Debian-LTS has updated ikiwiki (cross-site scripting), libav (code execution), and websvn (cross-site scripting).

Oracle has updated ImageMagick (OL7; OL6:
multiple vulnerabilities), java-1.6.0-openjdk (OL7; OL6; OL5: multiple vulnerabilities), and qemu-kvm (OL7: code execution).

Red Hat has updated ImageMagick
(RHEL6,7: multiple vulnerabilities), openssl (RHEL6: multiple vulnerabilities), qemu-kvm (RHEL7; RHEL6: code execution), and qemu-kvm-rhev (RHOSP8; RHELOSP7 for RHEL7; RHELOSP6 for RHEL7; RHELOSP5 for RHEL7: code execution).

Scientific Linux has updated ImageMagick (SL6,7: multiple vulnerabilities)
and qemu-kvm (SL7: code execution).

Ubuntu has updated kernel (15.10; 14.04;
12.04: multiple vulnerabilities), linux-lts-trusty (12.04: multiple
vulnerabilities), linux-lts-utopic (14.04:
multiple vulnerabilities), linux-lts-vivid
(14.04: multiple vulnerabilities), linux-lts-wily (14.04: multiple
vulnerabilities), linux-raspi2 (15.10:
multiple vulnerabilities), linux-ti-omap4
(12.04: multiple vulnerabilities), and openssh (15.10, 14.04, 12.04: multiple vulnerabilities).

Bugfixing KVM live migration

Post Syndicated from Michael Chapman original http://www.anchor.com.au/blog/2015/11/bugfixing-kvm-live-migration/

Here at Anchor we really love our virtualization. Our virtualization platform of choice, KVM, lets us provide a variety of different VPS products to meet our customers’ requirements.
Our KVM hosting platform has evolved considerably over the six years it’s been in operation, and we’re always looking at ways we can improve it. One important aspect of this process of continual improvement, and one I am heavily involved in, is the testing of software upgrades before they are rolled out. This post describes a recent problem encountered during this testing, the analysis that led to discovering its cause, and how we have fixed it. Strap yourself in, this might get technical.

The bug’s first sightings
Until now, we have built most of our KVM hosts on Red Hat Enterprise Linux 6 — it’s fast, stable, and supported for a long time. Since the release of RHEL 7 a year ago we have been looking at using it as well, perhaps eventually replacing all our existing RHEL 6 hypervisors.
Of course, a big change like this can’t be made without a huge amount of testing. One set of tests is to check that “live migration” of virtual machines works correctly, both between RHEL 7 hypervisors and from RHEL 6 to RHEL 7 and back again.
Live migration is a rather complex affair. Before I describe live migration, however, I ought to explain a bit about how KVM works. KVM is itself just a Linux kernel module. It provides access to the underlying hardware’s virtualization extensions, which allows guests to run at near-native speeds without emulation. However, we need to provide our guests with a set of “virtual hardware” — things like a certain number of virtual CPUs, some RAM, some disk space, and any virtual network connections the guest might need. This virtual hardware is provided by software called QEMU.
When live migrating a guest, it is QEMU that performs all the heavy lifting:

QEMU synchronizes any non-shared storage for the guest (the synchronization is maintained for the duration of the migration).
QEMU synchronizes the virtual RAM for the guest across the two hypervisors (again for the duration of the migration). But remember, this is a live migration, which means the guest could be continually changing the contents of RAM and disk, so…
QEMU waits for the amount of “out-of-sync” data to fall below a certain threshold, at which point it pauses the guest (i.e. it turns off the in-kernel KVM component for the guest).
QEMU synchronizes the remaining out-of-sync data, then resumes the guest on the new hypervisor.

Since the guest is only paused while synchronizing a small amount of out-of-sync RAM (and an even smaller amount of disk), we can limit the impact of the migration upon the guest’s operation. We’ve tuned things so that most migrations can be performed with the guest paused for no longer than a second.
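For context, on a libvirt-managed KVM host a live migration with non-shared storage can be kicked off with something like the following (the guest and host names are hypothetical, and our own tooling drives this through the libvirt API rather than by hand):

# Live-migrate guest "vps123" to kvm-host2, copying its non-shared storage as part of the migration
virsh migrate --live --copy-storage-all vps123 qemu+ssh://kvm-host2/system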
So this is where our testing encountered a problem. We had successfully tested live migrations between RHEL 7 hypervisors, as well as from those running RHEL 6 to those running RHEL 7. But when we tried to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 one, something went wrong: the guest remained paused after the migration! What could be the problem?
Some initial diagnosis
The first step in diagnosing any problem is to gather as much information as you can. We have a log file for each of our QEMU processes. Looking at the log file for the QEMU process “receiving” the live migration (i.e. on the target hypervisor) I found this:
KVM: entry failed, hardware error 0x80000021

If you’re running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00100
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00c00100
FS =0000 0000000000000000 ffffffff 00c00100
GS =0000 ffff88003fc00000 ffffffff 00c00100
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88003fc09000 0000007f
IDT= ffffffffff574000 00000fff
CR0=8005003b CR2=00007f6bee823000 CR3=000000003d2c0000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

What appears to have happened here is that the entire migration process worked correctly up to the point at which the QEMU process needed to resume the guest… but when it tried to actually resume the guest, it failed to start properly. QEMU dumps out the guest’s CPU registers when this occurs. “Hardware error 0x80000021” is unfortunately a rather generic error code — it simply means “invalid guest state”. But what could be wrong with the guest state? It was just running a moment ago on the other hypervisor; how did the migration make it invalid, if live migration is supposed to copy every part of the guest state intact?
Given that all of our other migration tests were passing, what I needed to do was compare this “bad” migration with one of the “good” ones. In particular, I wanted to get the very same register dump out of a “good” migration, so that I could compare it with this “bad” migration’s register dump.
QEMU itself does not seem to have the ability to do this (after all, if a migration is successful, why would you need a register dump?), which meant I would have to change the way QEMU works. Rather than patching the QEMU software then and there, I found it easiest to modify its behaviour through GDB. By attaching a debugger to the QEMU process, I could have it stop at just the right moment, dump out the guest’s CPU registers, then continue on as if nothing had occurred:
# gdb -p 8332

(gdb) break kvm_cpu_exec
Breakpoint 1 at 0x7f25ec576050: file /usr/src/debug/qemu-2.4.0/kvm-all.c, line 1788.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just “end”.
>call cpu_dump_state(cpu, stderr, fprintf, CPU_DUMP_CODE)
>disable 1
>continue
>end
(gdb) continue
Continuing.
[Thread 0x7f2596fff700 (LWP 8339) exited]
[New Thread 0x7f25941ff700 (LWP 8357)]
[New Thread 0x7f2596fff700 (LWP 8410)]
[New Thread 0x7f25939fe700 (LWP 8411)]
[Thread 0x7f25939fe700 (LWP 8411) exited]
[Thread 0x7f2596fff700 (LWP 8410) exited]
[Switching to Thread 0x7f25d8533700 (LWP 8336)]

Breakpoint 1, kvm_cpu_exec (cpu=cpu@entry=0x7f25ee8cc000) at /usr/src/debug/qemu-2.4.0/kvm-all.c:1788
1788 {
[Switching to Thread 0x7f25d7d32700 (LWP 8337)]

Success! This produced a new register dump:
RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff88003fc00000 ffffffff 00000000
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88003fc09000 0000007f
IDT= ffffffffff574000 00000fff
CR0=8005003b CR2=00007f0817db3000 CR3=000000003a45d000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

So now I was able to compare this “good” register dump with the previous “bad” one. The most important differences seemed to be related to the “segment registers”:
bad: ES =0000 0000000000000000 ffffffff 00c00100
good: ES =0000 0000000000000000 ffffffff 00000000

bad: DS =0000 0000000000000000 ffffffff 00c00100
good: DS =0000 0000000000000000 ffffffff 00000000

bad: FS =0000 0000000000000000 ffffffff 00c00100
good: FS =0000 0000000000000000 ffffffff 00000000

bad: GS =0000 ffff88003fc00000 ffffffff 00c00100
good: GS =0000 ffff88003fc00000 ffffffff 00000000

bad: LDT=0000 0000000000000000 ffffffff 00c00000
good: LDT=0000 0000000000000000 000fffff 00000000

Those fields at the end contained different values in the “bad” and “good” migrations. Could they be the cause of the “invalid guest state”?
Memory segmentation
To understand what’s going on here, we need to know a bit about how x86 memory segmentation works. Once upon a time, this was really simple: a 16-bit CS (code segment), DS (data segment) or SS (stack segment) register was simply shifted by 4 bits and added to a 16-bit offset in order to form a 20-bit absolute address.
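For instance, plugging a segment value of 0x1234 and an offset of 0x0010 into that formula gives the physical address 0x12350; a quick shell check (the values here are picked arbitrarily for illustration):

printf '0x%X\n' $(( (0x1234 << 4) + 0x0010 ))   # prints 0x12350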
But “protected mode” (introduced in the Intel 80286) complicated things greatly. Instead of a 16-bit segment number, each segment register held:

a 16-bit “segment selector”;
a “base address” for the segment;
the segment’s “size”;
a set of “flags” to keep track of things like whether the segment can be written to, whether the segment is actually present in physical RAM, and so on.

These are the four fields you can see in the segment registers shown above.
But hang on… this guest wasn’t running in “protected mode”. It was a 64-bit guest running a 64-bit operating system; it was running in what’s called “long mode”, and for the most part long mode doesn’t have segmentation. The particular values in the segment registers listed above are mostly irrelevant, because the CPU isn’t actively using those registers.
So at this point I knew that the segment registers had different flags in the “bad” migration than they did in the “good” migration. But if the registers weren’t being used, why would the flags matter?
“Unusable” memory segments
It took a fair bit of trawling through QEMU and kernel source code and Intel’s copious documentation before I found the answer. It turns out that there is a hidden flag, not visible in these register dumps, indicating whether a particular segment is “usable” or not. The usable flags are not part of the register dumps because they’re not really part of a guest’s CPU state; instead, they’re used by a hypervisor to tell the host CPU which of a guest’s segment registers should be loaded when a guest is started — and most importantly, this includes the times a guest is resumed immediately following a migration.
Next up, I needed to see how KVM and QEMU dealt with these “unusable” segments. So long as each register’s “unusable” flag is included in the migration, then the complete guest state should be recoverable after a migration.
Interestingly, it seems that QEMU does not track the “unusable” flag for each segment. The two functions (set_seg and get_seg) responsible for translating between KVM’s and QEMU’s representations of these segment registers would throw away an “unusable” flag when retrieving it from the kernel, and always clear it when loading the register back into the kernel. How could this ever have worked correctly?
This was finally answered when I looked at the kernel versions involved:

On the RHEL 6 kernel, when retrieving a guest’s segment registers the kernel would automatically clear the flags for a segment if the segment was marked “unusable”. When loading the guest’s segment registers again, it would treat a segment with a cleared set of flags as if it were “unusable”, even if QEMU had not said so.
On the RHEL 7 kernel, however, the kernel would not touch the flags at all when they were retrieved. On loading the segment registers again, it would treat a segment as “unusable” only if QEMU said so, or if one specific flag — the “segment is present” flag — were not set.

Although these kernels have different behaviour, they both work correctly if you stick to one kernel in a guest migration. But if you try to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 hypervisor, the flags aren’t cleared and the new kernel doesn’t know the register should be automatically marked unusable. The result is that the guest tries to use an invalid segment register, so the hardware throws an “invalid guest state” error. Bingo — that’s exactly what we’d seen!
The fix
The fix turned out to be quite simple: have QEMU clear the flags of any segment register that is marked unusable, and have it ensure that segment registers whose “present” flag is cleared are also marked unusable when loading them into the kernel:
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..588df76 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -997,7 +997,7 @@ static void set_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
     lhs->l = (flags >> DESC_L_SHIFT) & 1;
     lhs->g = (flags & DESC_G_MASK) != 0;
     lhs->avl = (flags & DESC_AVL_MASK) != 0;
-    lhs->unusable = 0;
+    lhs->unusable = !lhs->present;
     lhs->padding = 0;
 }

@@ -1006,14 +1006,18 @@ static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
     lhs->selector = rhs->selector;
     lhs->base = rhs->base;
     lhs->limit = rhs->limit;
-    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
-                 (rhs->present * DESC_P_MASK) |
-                 (rhs->dpl << DESC_DPL_SHIFT) |
-                 (rhs->db << DESC_B_SHIFT) |
-                 (rhs->s * DESC_S_MASK) |
-                 (rhs->l << DESC_L_SHIFT) |
-                 (rhs->g * DESC_G_MASK) |
-                 (rhs->avl * DESC_AVL_MASK);
+    if (rhs->unusable) {
+        lhs->flags = 0;
+    } else {
+        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
+                     (rhs->present * DESC_P_MASK) |
+                     (rhs->dpl << DESC_DPL_SHIFT) |
+                     (rhs->db << DESC_B_SHIFT) |
+                     (rhs->s * DESC_S_MASK) |
+                     (rhs->l << DESC_L_SHIFT) |
+                     (rhs->g * DESC_G_MASK) |
+                     (rhs->avl * DESC_AVL_MASK);
+    }
 }

static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set)

With both of these changes in place, a migration would work even if we were migrating to or from an “old” version of QEMU without the fix. Moreover, it would mean we could get the fix rolled out without having to change the kernels involved.
At present we are still testing these changes; however, we look forward to working with the upstream QEMU developers to have them added to the mainline version of QEMU.
In writing this blog post I’ve skipped over many of the dead ends I took in solving this problem. While the fix ended up reasonably straightforward (well, as much as can be expected when you’re dealing with kernels and hypervisors), it was a fun and educational journey getting there.
Got a question or comment? We’d love to hear from you!
The post Bugfixing KVM live migration appeared first on Anchor Cloud Hosting.

systemd for Administrators, Part XX

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/socket-activated-containers.html

This is no time for procrastination, here is already the twentieth
installment of my ongoing series on systemd for Administrators:

Socket Activated Internet Services and OS Containers

Socket Activation is an important feature of systemd. When
we first announced systemd we already tried to make the point how great
socket activation is for increasing parallelization and robustness of
socket services, but also for simplifying the dependency logic of the
boot. In this episode I’d like to explain why socket activation is an
important tool for drastically improving how many services and even
containers you can run on a single system with the same resource
usage. Or in other words, how you can drive up the density of customer
sites on a system while spending less on new hardware.

Socket Activated Internet Services

First, let’s take a step back. What was socket activation again? —
Basically, socket activation simply means that systemd sets up
listening sockets (IP or otherwise) on behalf of your services
(without these running yet), and then starts (activates) the
services as soon as the first connection comes in. Depending on the
technology the services might idle for a while after having processed
the connection and possible follow-up connections before they exit on
their own, so that systemd will again listen on the sockets and
activate the services again the next time they are connected to. For
the client it is not visible whether the service it is interested in
is currently running or not. The service’s IP socket stays continuously
connectable, no connection attempt ever fails, and all connects will
be processed promptly.

A setup like this lowers resource usage: as services are only
running when needed they only consume resources when required. Many
internet sites and services can benefit from that. For example, web
site hosters will have noticed that of the multitude of web sites that
are on the Internet only a tiny fraction gets a continuous stream of
requests: the huge majority of web sites still needs to be available
all the time but gets requests only very infrequently. With a scheme
like socket activation you can take advantage of this. Hosting many of
these sites on a single system like this and only activating their
services as necessary allows a large degree of over-commit: you can
run more sites on your system than the available resources actually
allow. Of course, one shouldn’t over-commit too much to avoid
contention during peak times.

Socket activation like this is easy to use in systemd. Many modern
Internet daemons already support socket activation out of the box (and
for those which don’t yet it’s not
hard
to add). Together with systemd’s instantiated
units support
it is easy to write a pair of service and socket
templates that then may be instantiated multiple times, once for each
site. Then, (optionally) make use of some of the security
features
of systemd to nicely isolate the customer’s site’s
services from each other (think: each customer’s service should only
see the home directory of the customer, everybody else’s directories
should be invisible), and there you go: you now have a highly scalable
and reliable server system that serves a maximum of securely
sandboxed services at a minimum of resources, and all nicely done with
built-in technology of your OS.
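As a rough illustration of what such a template pair might look like (only a sketch: the unit names, socket path, and the site-webserver binary are hypothetical, not something shipped by systemd):

/etc/systemd/system/site@.socket:

[Unit]
Description=Web socket for customer site %i

[Socket]
ListenStream=/run/sites/%i.socket

/etc/systemd/system/site@.service:

[Unit]
Description=Web service for customer site %i

[Service]
ExecStart=/usr/bin/site-webserver --root=/home/%i
User=%i

Each customer site then gets its own instance, e.g. systemctl start site@alice.socket, and the per-site service is only spawned when the first request for that site arrives. Sandboxing directives such as InaccessibleDirectories= can be layered on top for the isolation described above.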

This kind of setup is already in production use in a number of
companies. For example, the great folks at Pantheon are running their
scalable instant Drupal system on a setup that is similar to this. (In
fact, Pantheon’s David Strauss pioneered this scheme. David, you
rock!)

Socket Activated OS Containers

All of the above can already be done with older versions of
systemd. If you use a distribution that is based on systemd, you can
right-away set up a system like the one explained above. But let’s
take this one step further. With systemd 197 (to be included in Fedora
19), we added support for socket activating not only individual
services, but entire OS containers. And I really have to say it
at this point: this is stuff I am really excited
about. 😉

Basically, with socket activated OS containers, the host’s systemd
instance will listen on a number of ports on behalf of a container,
for example one for SSH, one for web and one for the database, and as
soon as the first connection comes in, it will spawn the container
this is intended for, and pass to it all three sockets. Inside of the
container, another systemd is running and will accept the sockets and
then distribute them further, to the services running inside the
container using normal socket activation. The SSH, web and database
services will only see the inside of the container, even though they
have been activated by sockets that were originally created on the
host! Again, to the clients this all is not visible. That an entire OS
container is spawned, triggered by a simple network connection, is entirely
transparent to the client side.[1]

The OS containers may contain (as the name suggests) a full
operating system, that might even be a different distribution than is
running on the host. For example, you could run your host on Fedora,
but run a number of Debian containers inside of it. The OS containers
will have their own systemd init system, their own SSH instances,
their own process tree, and so on, but will share a number of other
facilities (such as memory management) with the host.

For now, only systemd’s own trivial container manager, systemd-nspawn
has been updated to support this kind of socket activation. We hope
that libvirt-lxc will
soon gain similar functionality. At this point, let’s see in more
detail how such a setup is configured in systemd using nspawn:

First, please use a tool such as debootstrap or yum’s
--installroot to set up a container OS
tree[2]. The details of that are a bit out-of-focus
for this story, there’s plenty of documentation around how to do
this. Of course, make sure you have systemd v197 installed inside
the container. For accessing the container from the command line,
consider using systemd-nspawn
itself. After you configured everything properly, try to boot it up
from the command line with systemd-nspawn’s -b switch.
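In other words, something along these lines (using the same container path as the rest of this article) should boot the container interactively so you can verify it comes up:

# Boot the container from the host; log in on its console, then shut it down again
systemd-nspawn -bD /srv/mycontainer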

Assuming you now have a working container that boots up fine, let’s
write a service file for it, to turn the container into a systemd
service on the host you can start and stop. Let’s create
/etc/systemd/system/mycontainer.service on the host:

[Unit]
Description=My little container

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3
KillMode=process

This service can already be started and stopped via systemctl
start and systemctl stop. However, there’s no nice way
to actually get a shell prompt inside the container. So let’s add SSH
to it, and even more: let’s configure SSH so that a connection to the
container’s SSH port will socket-activate the entire container. First,
let’s begin with telling the host that it shall now listen on the SSH
port of the container. Let’s create
/etc/systemd/system/mycontainer.socket on the host:

[Unit]
Description=The SSH socket of my little container

[Socket]
ListenStream=23

If we start this unit with systemctl start on the host
then it will listen on port 23, and as soon as a connection comes in
it will activate our container service we defined above. We pick port
23 here, instead of the usual 22, as our host’s SSH is already
listening on that. nspawn virtualizes the process list and the file
system tree, but does not actually virtualize the network stack, hence
we just pick different ports for the host and the various containers
here.

Of course, the system inside the container doesn’t yet know what to
do with the socket it gets passed due to socket activation. If you’d
now try to connect to the port, the container would start up but the
incoming connection would be immediately closed since the container
can’t handle it yet. Let’s fix that!

All that’s necessary for that is to teach SSH inside the container about
socket activation. For that let’s simply write a pair of socket and
service units for SSH. Let’s create
/etc/systemd/system/sshd.socket in the container:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=23
Accept=yes

Then, let’s add the matching SSH service file
/etc/systemd/system/sshd@.service in the container:

[Unit]
Description=SSH Per-Connection Server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

Then, make sure to hook sshd.socket into the
sockets.target so that unit is started automatically when the
container boots up:

ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/

And that’s it. If we now activate mycontainer.socket on
the host, the host’s systemd will bind the socket and we can connect
to it. If we do this, the host’s systemd will activate the container,
and pass the socket in to it. The container’s systemd will then take
the socket, match it up with sshd.socket inside the
container. As there’s still our incoming connection queued on it, it
will then immediately trigger an instance of sshd@.service,
and we’ll have our login.

And that’s already everything there is to it. You can easily add
additional sockets to listen on to
mycontainer.socket. Everything listed therein will be passed
to the container on activation, and will be matched up as well as
possible with all socket units configured inside the
container. Sockets that cannot be matched up will be closed, and
sockets that aren’t passed in but are configured for listening will be
bound by the container’s systemd instance.

So, let’s take a step back again. What did we gain through all of
this? Well, basically, we can now offer a number of full OS containers
on a single host, and the containers can offer their services without
running continuously. The density of OS containers on the host can
hence be increased drastically.

Of course, this only works for kernel-based virtualization, not for
hardware virtualization; i.e., something like this can only be
implemented on systems such as libvirt-lxc or nspawn, but not in
qemu/kvm.

If you have a number of containers set up like this, here’s one
cool thing the journal allows you to do. If you pass -m to
journalctl on the host, it will automatically discover the
journals of all local containers and interleave them on
display. Nifty, eh?
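For example, run on the host, with nothing further to configure:

# Show the host journal interleaved with the journals of all local containers
journalctl -m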

With systemd 197 you have everything to set up your own socket
activated OS containers on-board. However, there are a couple of
improvements we’re likely to add soon: for example, right now even if
all services inside the container exit on idle, the container still
will stay around, and we really should make it exit on idle too, if
all its services exited and no logins are around. As it turns out we
already have much of the infrastructure for this around: we can reuse
the auto-suspend functionality we added for laptops: detecting when a
laptop is idle and suspending it then is a very similar problem to
detecting when a container is idle and shutting it down then.

Anyway, this blog story is already way too long. I hope I haven’t
lost you half-way already with all this talk of virtualization,
sockets, services, different OSes and stuff. I hope this blog story is
a good starting point for setting up powerful highly scalable server
systems. If you want to know more, consult the documentation and drop
by our IRC channel. Thank you!

Footnotes

[1] And BTW, this
is another reason
why fast boot times the way systemd offers them
are actually a really good thing on servers, too.

[2] To make it easy: you need a command line such as yum
--releasever=19 --nogpg --installroot=/srv/mycontainer/ --disablerepo='*'
--enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
to install Fedora, and debootstrap --arch=amd64 unstable
/srv/mycontainer/ to install Debian. Also see the bottom of systemd-nspawn(1).
Also note that auditing is currently broken for containers, and if enabled in
the kernel will cause all kinds of errors in the container. Use
audit=0 on the host’s kernel command line to turn it off.

systemd for Administrators, Part XX

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/socket-activated-containers.html

This is no time for procrastination,
here
is
already the twentieth
installment
of

my ongoing series
on
systemd
for
Administrators:

Socket Activated Internet Services and OS Containers

Socket
Activation
is an important feature of systemd. When
we first
announced
systemd we already tried to make the point how great
socket activation is for increasing parallelization and robustness of
socket services, but also for simplifying the dependency logic of the
boot. In this episode I’d like to explain why socket activation is an
important tool for drastically improving how many services and even
containers you can run on a single system with the same resource
usage. Or in other words, how you can drive up the density of customer
sites on a system while spending less on new hardware.

Socket Activated Internet Services

First, let’s take a step back. What was socket activation again? —
Basically, socket activation simply means that systemd sets up
listening sockets (IP or otherwise) on behalf of your services
(without these running yet), and then starts (activates) the
services as soon as the first connection comes in. Depending on the
technology the services might idle for a while after having processed
the connection and possible follow-up connections before they exit on
their own, so that systemd will again listen on the sockets and
activate the services again the next time they are connected to. For
the client it is not visible whether the service it is interested in
is currently running or not. The service’s IP socket stays continously
connectable, no connection attempt ever fails, and all connects will
be processed promptly.

A setup like this lowers resource usage: as services are only
running when needed they only consume resources when required. Many
internet sites and services can benefit from that. For example, web
site hosters will have noticed that of the multitude of web sites that
are on the Internet only a tiny fraction gets a continous stream of
requests: the huge majority of web sites still needs to be available
all the time but gets requests only very unfrequently. With a scheme
like socket activation you take benefit of this. By hosting many of
these sites on a single system like this and only activating their
services as necessary allows a large degree of over-commit: you can
run more sites on your system than the available resources actually
allow. Of course, one shouldn’t over-commit too much to avoid
contention during peak times.

Socket activation like this is easy to use in systemd. Many modern
Internet daemons already support socket activation out of the box (and
for those which don’t yet it’s not
hard
to add). Together with systemd’s instantiated
units support
it is easy to write a pair of service and socket
templates that then may be instantiated multiple times, once for each
site. Then, (optionally) make use of some of the security
features
of systemd to nicely isolate the customer’s site’s
services from each other (think: each customer’s service should only
see the home directory of the customer, everybody else’s directories
should be invisible), and there you go: you now have a highly scalable
and reliable server system, that serves a maximum of securely
sandboxed services at a minimum of resources, and all nicely done with
built-in technology of your OS.

This kind of setup is already in production use in a number of
companies. For example, the great folks at Pantheon are running their
scalable instant Drupal system on a setup that is similar to this. (In
fact, Pantheon’s David Strauss pioneered this scheme. David, you
rock!)

Socket Activated OS Containers

All of the above can already be done with older versions of
systemd. If you use a distribution that is based on systemd, you can
right-away set up a system like the one explained above. But let’s
take this one step further. With systemd 197 (to be included in Fedora
19), we added support for socket activating not only individual
services, but entire OS containers. And I really have to say it
at this point: this is stuff I am really excited
about. 😉

Basically, with socket activated OS containers, the host’s systemd
instance will listen on a number of ports on behalf of a container,
for example one for SSH, one for web and one for the database, and as
soon as the first connection comes in, it will spawn the container
this is intended for, and pass to it all three sockets. Inside of the
container, another systemd is running and will accept the sockets and
then distribute them further, to the services running inside the
container using normal socket activation. The SSH, web and database
services will only see the inside of the container, even though they
have been activated by sockets that were originally created on the
host! Again, to the clients this all is not visible. That an entire OS
container is spawned, triggered by simple network connection is entirely
transparent to the client side.[1]

The OS containers may contain (as the name suggests) a full
operating system, that might even be a different distribution than is
running on the host. For example, you could run your host on Fedora,
but run a number of Debian containers inside of it. The OS containers
will have their own systemd init system, their own SSH instances,
their own process tree, and so on, but will share a number of other
facilities (such as memory management) with the host.

For now, only systemd’s own trivial container manager, systemd-nspawn
has been updated to support this kind of socket activation. We hope
that libvirt-lxc will
soon gain similar functionality. At this point, let’s see in more
detail how such a setup is configured in systemd using nspawn:

First, please use a tool such as debootstrap or yum’s
--installroot to set up a container OS
tree[2]. The details of that are a bit out-of-focus
for this story, there’s plenty of documentation around how to do
this. Of course, make sure you have systemd v197 installed inside
the container. For accessing the container from the command line,
consider using systemd-nspawn
itself. After you configured everything properly, try to boot it up
from the command line with systemd-nspawn’s -b switch.

Assuming you now have a working container that boots up fine, let’s
write a service file for it, to turn the container into a systemd
service on the host you can start and stop. Let’s create
/etc/systemd/system/mycontainer.service on the host:

[Unit]
Description=My little container

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3
KillMode=process

This service can already be started and stopped via systemctl
start
and systemctl stop. However, there’s no nice way
to actually get a shell prompt inside the container. So let’s add SSH
to it, and even more: let’s configure SSH so that a connection to the
container’s SSH port will socket-activate the entire container. First,
let’s begin with telling the host that it shall now listen on the SSH
port of the container. Let’s create
/etc/systemd/system/mycontainer.socket on the host:

[Unit]
Description=The SSH socket of my little container

[Socket]
ListenStream=23

If we start this unit with systemctl start on the host
then it will listen on port 23, and as soon as a connection comes in
it will activate our container service we defined above. We pick port
23 here, instead of the usual 22, as our host’s SSH is already
listening on that. nspawn virtualizes the process list and the file
system tree, but does not actually virtualize the network stack, hence
we just pick different ports for the host and the various containers
here.

Of course, the system inside the container doesn’t yet know what to
do with the socket it gets passed due to socket activation. If you’d
now try to connect to the port, the container would start-up but the
incoming connection would be immediately closed since the container
can’t handle it yet. Let’s fix that!

All that’s necessary for that is teach SSH inside the container
socket activation. For that let’s simply write a pair of socket and
service units for SSH. Let’s create
/etc/systemd/system/sshd.socket in the container:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=23
Accept=yes

Then, let’s add the matching SSH service file
/etc/systemd/system/[email protected] in the container:

[Unit]
Description=SSH Per-Connection Server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

Then, make sure to hook sshd.socket into the
sockets.target so that unit is started automatically when the
container boots up:

ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/

And that’s it. If we now activate mycontainer.socket on
the host, the host’s systemd will bind the socket and we can connect
to it. If we do this, the host’s systemd will activate the container,
and pass the socket in to it. The container’s systemd will then take
the socket, match it up with sshd.socket inside the
container. As there’s still our incoming connection queued on it, it
will then immediately trigger an instance of [email protected],
and we’ll have our login.

And that’s already everything there is to it. You can easily add
additional sockets to listen on to
mycontainer.socket. Everything listed therein will be passed
to the container on activation, and will be matched up as good as
possible with all socket units configured inside the
container. Sockets that cannot be matched up will be closed, and
sockets that aren’t passed in but are configured for listening will be
bound be the container’s systemd instance.

So, let’s take a step back again. What did we gain through all of
this? Well, basically, we can now offer a number of full OS containers
on a single host, and the containers can offer their services without
running continously. The density of OS containers on the host can
hence be increased drastically.

Of course, this only works for kernel-based virtualization, not for
hardware virtualization. i.e. something like this can only be
implemented on systems such as libvirt-lxc or nspawn, but not in
qemu/kvm.

If you have a number of containers set up like this, here’s one
cool thing the journal allows you to do. If you pass -m to
journalctl on the host, it will automatically discover the
journals of all local containers and interleave them on
display. Nifty, eh?

With systemd 197 you have everything to set up your own socket
activated OS containers on-board. However, there are a couple of
improvements we’re likely to add soon: for example, right now even if
all services inside the container exit on idle, the container still
will stay around, and we really should make it exit on idle too, if
all its services exited and no logins are around. As it turns out we
already have much of the infrastructure for this around: we can reuse
the auto-suspend functionality we added for laptops: detecting when a
laptop is idle and suspending it then is a very similar problem to
detecting when a container is idle and shutting it down then.

Anyway, this blog story is already way too long. I hope I haven’t
lost you half-way already with all this talk of virtualization,
sockets, services, different OSes and stuff. I hope this blog story is
a good starting point for setting up powerful highly scalable server
systems. If you want to know more, consult the documentation and drop
by our IRC channel. Thank you!

Footnotes

[1] And BTW, this
is another reason
why fast boot times the way systemd offers them
are actually a really good thing on servers, too.

[2] To make it easy: you need a command line such as yum
--releasever=19 --nogpg --installroot=/srv/mycontainer/ --disablerepo='*'
--enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

to install Fedora, and debootstrap --arch=amd64 unstable
/srv/mycontainer/
to install Debian. Also see the bottom of systemd-nspawn(1).
Also note that auditing is currently broken for containers, and if enabled in
the kernel will cause all kinds of errors in the container. Use
audit=0 on the host’s kernel command line to turn it off.

systemd for Administrators, Part XX

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/socket-activated-containers.html

This is no time for procrastination,
here
is
already the twentieth
installment
of

my ongoing series
on
systemd
for
Administrators:

Socket Activated Internet Services and OS Containers

Socket
Activation
is an important feature of systemd. When
we first
announced
systemd we already tried to make the point how great
socket activation is for increasing parallelization and robustness of
socket services, but also for simplifying the dependency logic of the
boot. In this episode I’d like to explain why socket activation is an
important tool for drastically improving how many services and even
containers you can run on a single system with the same resource
usage. Or in other words, how you can drive up the density of customer
sites on a system while spending less on new hardware.

Socket Activated Internet Services

First, let’s take a step back. What was socket activation again? —
Basically, socket activation simply means that systemd sets up
listening sockets (IP or otherwise) on behalf of your services
(without these running yet), and then starts (activates) the
services as soon as the first connection comes in. Depending on the
technology the services might idle for a while after having processed
the connection and possible follow-up connections before they exit on
their own, so that systemd will again listen on the sockets and
activate the services again the next time they are connected to. For
the client it is not visible whether the service it is interested in
is currently running or not. The service’s IP socket stays continously
connectable, no connection attempt ever fails, and all connects will
be processed promptly.

A setup like this lowers resource usage: as services are only
running when needed they only consume resources when required. Many
internet sites and services can benefit from that. For example, web
site hosters will have noticed that of the multitude of web sites that
are on the Internet only a tiny fraction gets a continous stream of
requests: the huge majority of web sites still needs to be available
all the time but gets requests only very unfrequently. With a scheme
like socket activation you take benefit of this. By hosting many of
these sites on a single system like this and only activating their
services as necessary allows a large degree of over-commit: you can
run more sites on your system than the available resources actually
allow. Of course, one shouldn’t over-commit too much to avoid
contention during peak times.

Socket activation like this is easy to use in systemd. Many modern
Internet daemons already support socket activation out of the box (and
for those which don’t yet it’s not
hard
to add). Together with systemd’s instantiated
units support
it is easy to write a pair of service and socket
templates that then may be instantiated multiple times, once for each
site. Then, (optionally) make use of some of the security
features
of systemd to nicely isolate the customer’s site’s
services from each other (think: each customer’s service should only
see the home directory of the customer, everybody else’s directories
should be invisible), and there you go: you now have a highly scalable
and reliable server system, that serves a maximum of securely
sandboxed services at a minimum of resources, and all nicely done with
built-in technology of your OS.

This kind of setup is already in production use in a number of
companies. For example, the great folks at Pantheon are running their
scalable instant Drupal system on a setup that is similar to this. (In
fact, Pantheon’s David Strauss pioneered this scheme. David, you
rock!)

Socket Activated OS Containers

All of the above can already be done with older versions of
systemd. If you use a distribution that is based on systemd, you can
right-away set up a system like the one explained above. But let’s
take this one step further. With systemd 197 (to be included in Fedora
19), we added support for socket activating not only individual
services, but entire OS containers. And I really have to say it
at this point: this is stuff I am really excited
about. 😉

Basically, with socket activated OS containers, the host’s systemd
instance will listen on a number of ports on behalf of a container,
for example one for SSH, one for web and one for the database, and as
soon as the first connection comes in, it will spawn the container
this is intended for, and pass to it all three sockets. Inside of the
container, another systemd is running and will accept the sockets and
then distribute them further, to the services running inside the
container using normal socket activation. The SSH, web and database
services will only see the inside of the container, even though they
have been activated by sockets that were originally created on the
host! Again, to the clients this all is not visible. That an entire OS
container is spawned, triggered by simple network connection is entirely
transparent to the client side.[1]

The OS containers may contain (as the name suggests) a full
operating system, that might even be a different distribution than is
running on the host. For example, you could run your host on Fedora,
but run a number of Debian containers inside of it. The OS containers
will have their own systemd init system, their own SSH instances,
their own process tree, and so on, but will share a number of other
facilities (such as memory management) with the host.

For now, only systemd’s own trivial container manager, systemd-nspawn
has been updated to support this kind of socket activation. We hope
that libvirt-lxc will
soon gain similar functionality. At this point, let’s see in more
detail how such a setup is configured in systemd using nspawn:

First, please use a tool such as debootstrap or yum’s
--installroot to set up a container OS
tree[2]. The details of that are a bit out-of-focus
for this story, there’s plenty of documentation around how to do
this. Of course, make sure you have systemd v197 installed inside
the container. For accessing the container from the command line,
consider using systemd-nspawn
itself. After you configured everything properly, try to boot it up
from the command line with systemd-nspawn’s -b switch.

Assuming you now have a working container that boots up fine, let’s
write a service file for it, to turn the container into a systemd
service on the host you can start and stop. Let’s create
/etc/systemd/system/mycontainer.service on the host:

[Unit]
Description=My little container

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3
KillMode=process

This service can already be started and stopped via systemctl
start
and systemctl stop. However, there’s no nice way
to actually get a shell prompt inside the container. So let’s add SSH
to it, and even more: let’s configure SSH so that a connection to the
container’s SSH port will socket-activate the entire container. First,
let’s begin with telling the host that it shall now listen on the SSH
port of the container. Let’s create
/etc/systemd/system/mycontainer.socket on the host:

[Unit]
Description=The SSH socket of my little container

[Socket]
ListenStream=23

If we start this unit with systemctl start on the host
then it will listen on port 23, and as soon as a connection comes in
it will activate our container service we defined above. We pick port
23 here, instead of the usual 22, as our host’s SSH is already
listening on that. nspawn virtualizes the process list and the file
system tree, but does not actually virtualize the network stack, hence
we just pick different ports for the host and the various containers
here.

Of course, the system inside the container doesn’t yet know what to
do with the socket it gets passed due to socket activation. If you’d
now try to connect to the port, the container would start-up but the
incoming connection would be immediately closed since the container
can’t handle it yet. Let’s fix that!

All that’s necessary for that is teach SSH inside the container
socket activation. For that let’s simply write a pair of socket and
service units for SSH. Let’s create
/etc/systemd/system/sshd.socket in the container:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=23
Accept=yes

Then, let’s add the matching SSH service file
/etc/systemd/system/[email protected] in the container:

[Unit]
Description=SSH Per-Connection Server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

Then, make sure to hook sshd.socket into the
sockets.target so that unit is started automatically when the
container boots up:

ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/

And that’s it. If we now activate mycontainer.socket on
the host, the host’s systemd will bind the socket and we can connect
to it. If we do this, the host’s systemd will activate the container,
and pass the socket in to it. The container’s systemd will then take
the socket, match it up with sshd.socket inside the
container. As there’s still our incoming connection queued on it, it
will then immediately trigger an instance of [email protected],
and we’ll have our login.

And that’s already everything there is to it. You can easily add
additional sockets to listen on to
mycontainer.socket. Everything listed therein will be passed
to the container on activation, and will be matched up as good as
possible with all socket units configured inside the
container. Sockets that cannot be matched up will be closed, and
sockets that aren’t passed in but are configured for listening will be
bound be the container’s systemd instance.

So, let’s take a step back again. What did we gain through all of
this? Well, basically, we can now offer a number of full OS containers
on a single host, and the containers can offer their services without
running continously. The density of OS containers on the host can
hence be increased drastically.

Of course, this only works for kernel-based virtualization, not for
hardware virtualization. i.e. something like this can only be
implemented on systems such as libvirt-lxc or nspawn, but not in
qemu/kvm.

If you have a number of containers set up like this, here’s one
cool thing the journal allows you to do. If you pass -m to
journalctl on the host, it will automatically discover the
journals of all local containers and interleave them on
display. Nifty, eh?

With systemd 197 you have everything to set up your own socket
activated OS containers on-board. However, there are a couple of
improvements we’re likely to add soon: for example, right now even if
all services inside the container exit on idle, the container still
will stay around, and we really should make it exit on idle too, if
all its services exited and no logins are around. As it turns out we
already have much of the infrastructure for this around: we can reuse
the auto-suspend functionality we added for laptops: detecting when a
laptop is idle and suspending it then is a very similar problem to
detecting when a container is idle and shutting it down then.

Anyway, this blog story is already way too long. I hope I haven’t
lost you half-way already with all this talk of virtualization,
sockets, services, different OSes and stuff. I hope this blog story is
a good starting point for setting up powerful highly scalable server
systems. If you want to know more, consult the documentation and drop
by our IRC channel. Thank you!

Footnotes

[1] And BTW, this is another reason why fast boot times the way
systemd offers them are actually a really good thing on servers, too.

[2] To make it easy: you need a command line such as yum
--releasever=19 --nogpg --installroot=/srv/mycontainer/ --disablerepo='*'
--enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

to install Fedora, and debootstrap --arch=amd64 unstable
/srv/mycontainer/ to install Debian. Also see the bottom of systemd-nspawn(1).
Also note that auditing is currently broken for containers, and if enabled in
the kernel will cause all kinds of errors in the container. Use
audit=0 on the host’s kernel command line to turn it off.

systemd for Administrators, Part XIX

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/detect-virt.html

Happy new year 2013! Here is now the nineteenth installment of my
ongoing series on systemd for Administrators:

Detecting Virtualization

When we started working on systemd
we had a closer look at what the various existing init scripts used on
Linux were actually doing. Among other things we noticed that a
number of them were checking explicitly whether they were running in
a virtualized environment (i.e. in a kvm, VMWare, LXC guest or
suchlike) or not. Some init scripts disabled themselves in such
cases[1], others enabled themselves only in such
cases[2]. Frequently, it would probably have been a better
idea to check for other conditions rather than explicitly checking for
virtualization, but after looking at this from all sides we came to
the conclusion that in many cases explicitly conditionalizing services
based on detected virtualization is a valid thing to do. As a result
we added a new configuration option to systemd that can be used to
conditionalize services this way: ConditionVirtualization;
we also added a small tool that can be used in shell scripts to detect
virtualization: systemd-detect-virt(1);
and finally, we added a minimal bus interface to query this from other
applications.

Detecting whether your code is run inside a virtualized environment
is actually not that hard. Depending on what precisely you want to
detect it’s little more than running the CPUID instruction and maybe
checking a few files in /sys and /proc. The
complexity is mostly about knowing the strings to look for, and
keeping this list up-to-date. Currently, the virtualization
detection code in systemd can detect the following virtualization
systems:

  • Hardware virtualization (i.e. VMs):

    • qemu
    • kvm
    • vmware
    • microsoft
    • oracle
    • xen
    • bochs
  • Same-kernel virtualization (i.e. containers):

    • chroot
    • openvz
    • lxc
    • lxc-libvirt
    • systemd-nspawn

Let’s have a look at how one may make use of this functionality.

Conditionalizing Units

Adding ConditionVirtualization
to the [Unit] section of a unit file is enough to
conditionalize it depending on which virtualization is used or whether
one is used at all. Here’s an example:

[Unit]
Description=My Foobar Service (runs only on guests)
ConditionVirtualization=yes

[Service]
ExecStart=/usr/bin/foobard

Instead of specifying “yes” or “no” it is possible
to specify the ID of a specific virtualization solution (Example:
“kvm”, “vmware”, …), or either
“container” or “vm” to check whether the kernel is
virtualized or the hardware. Also, checks can be prefixed with an exclamation mark (“!”) to invert a check. For further details see the manual page.
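
A few more purely illustrative variants of this check, each of which would go into the [Unit] section of some unit:

# Start the unit only when running under KVM:
ConditionVirtualization=kvm

# Start the unit only inside containers, whatever the container manager:
ConditionVirtualization=container

# Skip the unit on hardware VMs:
ConditionVirtualization=!vm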

In Shell Scripts

In shell scripts it is easy to check for virtualized systems with
the systemd-detect-virt(1)
tool. Here’s an example:

if systemd-detect-virt -q ; then
        echo "Virtualization is used:" `systemd-detect-virt`
else
        echo "No virtualization is used."
fi

If this tool is run it will return with an exit code of zero
(success) if a virtualization solution has been found, non-zero
otherwise. It will also print a short identifier of the used
virtualization solution, which can be suppressed with
-q. Also, with the -c and -v parameters it is
possible to detect only kernel or only hardware virtualization
environments. For further details see the manual page.
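
For instance, a script that only cares about containers could use the -c switch; a small sketch:

if systemd-detect-virt -q -c ; then
        echo "Running in a container:" `systemd-detect-virt -c`
fi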

In Programs

Whether virtualization is available is also exported on the system bus:

$ gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --method org.freedesktop.DBus.Properties.Get org.freedesktop.systemd1.Manager Virtualization
(<'systemd-nspawn'>,)

This property contains the empty string if no virtualization is
detected. Note that some container environments cannot be detected
directly from unprivileged code. That’s why we expose this property on
the bus rather than providing a library — the bus implicitly solves
the privilege problem quite nicely.
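
If gdbus is not available, dbus-send can issue the same query; this is just an alternative sketch of the call shown above:

$ dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.DBus.Properties.Get string:org.freedesktop.systemd1.Manager string:Virtualization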

Note that all of this will only ever detect and return information
about the “inner-most” virtualization solution. If you stack
virtualization (“We must go deeper!”) then these interfaces will
expose the one the code is most directly interfacing
with. Specifically that means that if a container solution is used
inside of a VM, then only the container is generally detected and
returned.

Footnotes

[1] For example: running certain device management services in a
container environment that has no access to any physical hardware makes little sense.

[2] For example: some VM solutions work best if certain
vendor-specific userspace components are running that connect the
guest with the host in some way.

A Plumber’s Wish List for Linux

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/plumbers-wishlist.html

Here’s a mail we just sent to LKML, for your consideration. Enjoy:

Subject: A Plumber’s Wish List for Linux

We’d like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope to find some help.

If you happen to be interested in working on something from this list or
able to help out, we’d be delighted. Please ping us in case you need
clarifications or more information on specific items.

Thanks,
Kay, Lennart, Harald, in the name of all the other plumbers

And here’s the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT label is implemented as a hidden directory entry in the file
system which needs to be renamed when changing the file system label;
this is impossible to do from userspace without unmounting. Hence we’d
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andy Kleen has a patch to create the alias file itself. CPU
‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available. And
we’d like to see that fixed to make numerous ugly userspace work-arounds
to achieve the same go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren’t. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
“inverted” semantics of CapBnd in /proc/$PID/status.

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’
Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course it
is questionable if services like sendmail make use of this, but otoh for
services which fork but do not immediately exec() another binary being
able to rename these child processes in ps is of importance.

* module-init-tools: provide a proper libmodprobe.so from
module-init-tools:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster than a cgroup
supervisor process could kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an inefficient and
ugly hack. Tools would prefer anything more lightweight like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, a few ugly hacks are available to detect
this (for example a process wanting to know whether it is running in a
PID namespace could just look for a PID 2 being around and named
kthreadd which is a kernel thread only visible in the root namespace),
however all these solutions encode information and expectations that
better shouldn’t be encoded in a namespace test like this. This
functionality is needed in particular since the removal of the ns
cgroup controller which provided the namespace membership information to
user code.

* allow making use of the “cpu” cgroup controller by default without
breaking RT. Right now creating a cgroup in the “cpu” hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in “cpu” whose
processes get a non-RT weight applied, but for RT take advantage of the
parent’s RT budget. We want the separation of RT and non-RT budget
assignment in the “cpu” hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of “cpu” hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog shall be shared among various containers (or service cgroups),
and the syslog implementation needs to be able to distinguish the
sending cgroup in order to separate the logs on disk. Of course,
SCM_CREDENTIALS can be used to look up the PID of the sender followed by
a check in /proc/$PID/cgroup, but that is necessarily racy, and actually
a very real race in real life.

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.

systemd Status Update

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd-update-2.html

It has been a while since my last status update on systemd. Here’s
another short, incomplete status update on what we worked on for
systemd since then.

  • Fedora F15 (Rawhide) now includes a split up
    /etc/init.d/rc.sysinit (Bill Nottingham). This allows us to keep only
    a minimal compatibility set of shell scripts around, and boot otherwise a
    system without any shell scripts at all. In fact, shell scripts during early
    boot are only used in exceptional cases, i.e. when you enabled autoswapping
    (bad idea anyway), when a full SELinux relabel is necessary, during the first
    boot after initialization, if you have static kernel modules to load (which are
    not configured via the systemd-native way to do that), if you boot from a
    read-only NFS server, or when you rely on LVM/RAID/Multipath. If nothing of
    this applies to you, you can easily disable these parts of early boot and
    save several seconds on boot. How to do this I will describe in a later blog
    story.
  • We have a fully C coded shutdown logic that kills all remaining processes,
    unmounts all remaining file systems, detaches all loop devices and DM volumes
    and does that in the right way to ensure that all these things are properly
    torn down even if they depend on each other in arbitrary ways. This is not
    only considerably faster than the traditional shell hackery for this, but also
    a lot safer, since we try to unmount/remount the remaining file systems with a
    little bit of brains. This feature is available via systemctl --force
    poweroff
    to the administrator. The --force controls whether the
    usual shutdown of all services is run or whether this is skipped and we
    immediately shall enter this final C shutdown logic. Using --force
    hence is a much safer replacement for the old /sbin/reboot -f and does
    not leave dirty file systems behind. (Thanks to Fabiano Fidencio and his
    colleagues from ProFUSION for this).
  • systemd now includes a minimalistic readahead implementation, based on
    fanotify(), fadvise() and mincore(). It supports btrfs defragmentation and both
    SSD and HDD disks. While the effect on boots that are anyway fast (such as most
    stuff involving SSD) is minimal, slower and older machines benefit from this
    more substantially.
  • We now control fsck and quota during early boot with a C tool that ensures
    maximum parallelization but properly implements the necessary high-level
    administration logic.
  • Every service, every user and every user session now gets its own cgroup in
    the ‘cpu’ hierarchy thus creating better fairness between the logged in users
    and their sessions.
  • We now provide /dev/log logging from early boot to late shutdown.
    If no syslog daemon is running the output is passed on to kmsg. As soon as a
    proper syslog daemon starts up the kmsg buffer is flushed to syslog, and hence
    we will have complete log coverage in syslog even for early boot.
  • systemctl kill was introduced, an easy command to send a signal to
    all processes of a service. Expect a blog story with more details about this
    shortly.
  • systemd gained the ability to load the SELinux policy if necessary, thus
    supporting non-initrd boots and initrd boots from the same binary with no
    duplicate work. This is in fact (and surprisingly) a first among Linux init
    systems.
  • We now initialize and set the system locale inside PID 1 to be inherited by
    all services and users.
  • systemd has native support for /etc/crypttab and can activate
    encrypted LUKS/dm-crypt disks both at boot-up and during runtime. A minimal
    password querying infrastructure is available, where multiple agents can be
    used to present the password to the user. During boot the password is queried
    either via Plymouth or directly on the console. If a system crypto disk is
    plugged in after boot you are queried for the password via a GNOME agent, or a
    wall(1) agent. Finally, while you run systemctl start (or a similar
    command) a minimal TTY password agent is available which asks you for passwords
    right-away if this is necessary. The password querying logic is very simple,
    additional agents can be implemented in a trivial amount of code (Yupp, KDE folks, you
    can add an agent for this, too). Note that the password querying logic in
    systemd is only for non-user passwords, i.e. passwords that have no relation to
    a specific user, but rather to specific hardware or system software. In future
    we hope to extend this so that this can be used to query the password of SSL
    certificates when Apache or other servers start.
  • We offer a minimal interface that external projects can use to extend the
    dependency graph systemd manages. In fact, the cryptsetup logic mentioned above
    is implemented via this ‘plugin’-like system. Since we did not want to add code
    that deals with cryptographic disks into the systemd process itself we
    introduced this interface (after all cryptographic volumes are not an essential
    feature of a minimal OS, and unnecessary on most embedded uses; also the future
    might bring us STC which might make this at least partially obsolete). Simply
    by dropping a generator binary into
    /lib/systemd/system-generators, which should write out systemd unit
    files into a temporary directory, third-party packages may extend the systemd
    dependency tree dynamically. This could be useful for example to automatically
    create a systemd service for each KVM machine or LXC container. With that in
    place those containers/machines could be managed and supervised with the same
    tools as the usual system services.
  • We integrated automatic clean-up of directories such as /tmp into
    the tmpfiles logic we already had in place that recreates files and
    directories on volatile file systems such as /var/run,
    /var/lock or /tmp.
  • We now always measure and write to the log files the system startup time we
    measured, broken up into how much time was spent on the kernel, the initrd and
    the initialization of userspace.
  • We now safely destroy all user sessions before going down. This is a feature
    long missing on Linux: since user processes were not killed until the very last
    moment, the unhealthy situation that user code was running at a time where no
    other daemon was remaining was a normal part of shutdown.
  • systemd now understands an ‘extreme’ form of disabling a service: if you
    symlink a service name in /etc/systemd/system to /dev/null
    then systemd will mark it as masked and completely refuse starting it,
    regardless of whether this is requested manually or automatically. Normally it should
    be sufficient to simply call systemctl disable to disable a service
    which still allows manual activation but no automatic activation. Masking a
    service goes one step further (see the short example after this list).
  • There’s now a simple condition syntax in places which allows
    skipping or enabling units depending on the existence of a file, whether a
    directory is empty or whether a kernel command line option is set.
  • In addition to normal shutdowns for reboot, halt or poweroff we now
    similarly support a kexec reboot, that reboots the machine without going through
    the BIOS code again.
  • We have bash completion support for systemctl. (Ran Benita)
  • Andrew Edmunds contributed basic support to boot Ubuntu with systemd.
  • Michael Biebl and Tollef Fog Heen have worked on the systemd integration
    into Debian to a level that it is now possible to boot a system without having
    the old initscripts package installed. For more details see the Debian Wiki. Michael even
    tested this integration on an Ubuntu Natty system and as it turns out this
    works almost equally well on Ubuntu already. If you are interested in playing
    around with this, ping Michael.
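
Here’s the masking trick from the list above spelled out, for a hypothetical foobar.service (purely an illustration of the symlink-to-/dev/null idea):

# Mask the service so it can never be started, not even manually:
ln -s /dev/null /etc/systemd/system/foobar.service
systemctl daemon-reload

# Undo it again:
rm /etc/systemd/system/foobar.service
systemctl daemon-reload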

And that’s it for now. There’s a lot of other stuff in the git commits, but
most of it is smaller and I will thus spare you.

We have come quite far in the last year. systemd is about a year old now,
and we are now able to boot a system without legacy shell scripts remaining,
something that appeared to be a task for the distant future.

All of this is available in systemd 13 and in F15/Rawhide as I type
this. If you want to play around with this then consider installing Rawhide
(it’s fun!).

Ok, Be Afraid if Someone’s Got a Voltmeter Hooked to Your CPU

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2010/03/05/crypto-fear.html

Boy, do I hate it when a FLOSS project is given a hard time
unfairly. I was this morning greeted with news from many places that
OpenSSL, one of the most common FLOSS software libraries used for
cryptography, was somehow severely vulnerable.

I had a hunch what was going on. I quickly downloaded a copy of the
academic paper that was cited as the sole source for the
story and read it. As I feared, OpenSSL was getting some bad press
unfairly. One must really read this academic computer science article
in the context it was written; most commenting about this paper
probably did not.

First of all, I don’t claim to be an expert on cryptography, and I
think my knowledge level to opine on this subject remains limited to a
little blog post like this and nothing more. Between college and
graduate school, I worked as a system administrator focusing on network
security. While a computer science graduate student, I did take two
cryptography courses, two theory of computation courses, and one class
on complexity theory[0]. So, when
compared to the general population I probably am an expert, but compared to
people who actually work in cryptography regularly, I’m clearly a
novice. However, I suspect many who have hitherto opined about this
academic article to declare this severe vulnerability have even
less knowledge than I do on the subject.

This article, of course, wasn’t written for novices like me, and
certainly not for the general public nor the technology press. It was
written by and for professional researchers who spend much time each
week reading dozens of these academic papers, a task I haven’t done
since graduate school. Indeed, the paper is written in a style I know
well; my “welcome to CS graduate school” seminar in 1997
covered the format well.

The first thing you have to note about such papers is that informed
readers generally ignore the parts that a newbie is most likely to focus
on: the Abstract, Introduction and Conclusion sections. These sections
are promotional materials; they are equivalent to a sales brochure
selling you on how important and groundbreaking the research is. Some
research is groundbreaking, of course, but most is an incremental step
forward toward understanding some theoretical concept, or some report
about an isolated but interesting experimental finding.

Unfortunately, these promotional parts of the paper are the sections
that focus on the negative implications for OpenSSL. In the rest of the
paper, OpenSSL is merely the software component of the experiment
equipment. They likely could have used GNU TLS or any other
implementation of RSA taken from a book on
cryptography[1]. But this fact
is not even the primary reason that this article isn’t really that big
of a deal for daily use of cryptography.

The experiment described in the paper is very difficult to reproduce.
You have to cause very subtle faults in computation at specific times.
As I understand it, they had to assemble a specialized hardware copy of
a SPARC-based GNU/Linux environment to accomplish the experiment.

Next, the data generated during the run of the software on the
specially-constructed faulty hardware must be collected and operated
upon by a parallel processing computing environment over the course of
many hours. If it turns out all the needed data was gathered, the
output of this whole process is the private RSA key.

The details of the fault generation process deserve special mention.
Very specific faults have to occur, and they can’t occur such that any
other parts of the computation (such as, say, the normal running of the
operating system) are interrupted or corrupted. This is somewhat
straightforward to get done in a lab environment, but accomplishing it
in a production situation would be impractical and improbable. It would
also usually require physical access to the hardware holding the private
key. Such physical access would, of course, probably give you the
private key anyway by simply copying it off the hard drive or out of
RAM!

This is interesting research, and it does suggest some changes that
might be useful. For example, if it doesn’t slow a system down too
much, the integrity of RSA signatures should be verified, on a closely
controlled proxy unit with a separate CPU, before sending out to a wider
audience. But even that would be a process only for the most paranoid.
If faults are occurring on production hardware enough to generate the
bad computations this cracking process relies on, likely something else
will go wrong on the hardware too and it will be declared generally
unusable for production before an interloper could gather enough data to
crack the key. Thus, another useful change to make based on this
finding is to disable and discard RSA keys that were in use on
production hardware that went faulty.

Finally, I think this article does completely convince me that I would
never want to run any RSA computations on a system where the CPU was
emulated. Causing faults in an emulated CPU would only require changes
to the emulation software, and could be done with careful precision to
detect when an RSA-related computation was happening, and only give the
faulty result on those occasions. I’ve never heard of anyone running
production cryptography on an emulated CPU, since it would be too slow,
and virtualization technologies like Xen, KVM, and QEMU all
pass CPU instructions directly through to hardware (for speed reasons)
when the virtualized guest matches the hardware architecture of the
host.

The point, however, is that a proper description of the dangers of a “security vulnerability” requires more than a single bit field. Some security vulnerabilities are much worse than others. This one is substantially closer to the “oh, that’s cute” end of the spectrum than to the “ZOMG, everyone’s going to experience identity theft tomorrow” end.

[0] Many casual users don’t realize that cryptography — the stuff that secures your networked data from unwanted viewers — isn’t about math problems that are unsolvable. In fact, it’s often based on math problems that are trivially solvable, but take a very long time to solve. This is why algorithmic complexity questions are central to the question of cryptographic security.
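As a concrete (and purely illustrative) toy example of that point: trial division is a complete, correct algorithm for factoring an RSA modulus, and thereby recovering the private key; it is just uselessly slow at real key sizes.

    # Illustrative only: factoring by trial division is correct but slow.
    def smallest_factor(n: int) -> int:
        """Return the smallest prime factor of a composite n."""
        f = 2
        while f * f <= n:
            if n % f == 0:
                return f
            f += 1
        return n

    print(smallest_factor(1_022_117))  # 1009, found instantly at toy sizes
    # The loop runs on the order of sqrt(p) times, where p is the smallest
    # prime factor. For a 2048-bit modulus built from two ~1024-bit primes,
    # that is roughly 2**512 iterations: solvable in principle, hopeless in
    # practice. Better algorithms exist, but the gap remains astronomical.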

[1] I’m oversimplifying a bit here. A key factor in the paper appears to be the linear-time algorithm used to compute cryptographic digital signatures, and the fact that the signatures aren’t verified for integrity before being deployed. I suspect, though, that just about any RSA system is going to do this. (Although I do usually test the integrity of my GnuPG signatures before sending them out, I do so by hand as a user.)

Microsoft Releases GPL’d Software (Again): Does This Change Anything?

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2009/07/29/microsoft-gpl.html

Microsoft has received much undeserved press about their recent release
of Linux drivers for their virtualization technology under GPLv2. I say
“undeserved” because I don’t particularly see why Microsoft should be lauded merely for doing something that is in their own interest and that they’ve done before.

Most people have forgotten that Microsoft once had a GPL-based product
available for Windows NT. It was called Windows Services for
UNIX and, AFAICT, it remains available today (although perhaps
they’ve transitioned in recent years to no longer include GPL’d
software).

This product was acquired by Microsoft when they purchased Softway Systems. The product was based on GCC, and it included a variety of GNU system utilities ported to Windows. Microsoft was a compliant distributor of this software for years, right during the time when they were calling the GPL an un-American cancerous virus that eats up software like Pac-Man. The GPL is not a new license to Microsoft; they only pretend that it is to give bad press to the GPL or to give good press to themselves.

Another thing that’s not new to Microsoft is that they have no interest in contributing to Free Software unless it makes their proprietary software more desirable. In my old example above, they hoped to entice developers who preferred a Unix development environment to switch to Windows NT. In the recent Linux driver release, they seek to convince developers to switch from Xen and KVM to their proprietary virtualization technology.

In fact, the only difference in this particular release is that, unlike in the case of Softway’s software, Microsoft was apparently (according to Steve Hemminger) briefly out of compliance. According to Steve, Microsoft distributed binaries linked to various GPL’d parts.

Meanwhile, Sam Ramji claimed that Microsoft was already planning to release the software before Hemminger and Greg K-H contacted them. I do believe Sam when he says that there was already talk underway inside Microsoft about releasing the source before the Linux developers began their enforcement effort. However, that internal Microsoft talk
doesn’t mean that there wasn’t a problem. As soon as one distributes
the binaries of a GPL’d work, one must provide the source (or an offer therefor) alongside
those binaries. Thus, if Microsoft released binaries and delayed in
releasing source, there was a GPL violation.

Like all GPL violations (and potential GPL violations), it’s left to the copyright holders of the software to engage in enforcement. I think it’s great that, according to Steve and related press coverage, the Linux developers used the most common enforcement strategy in the GPL community — quietly contact the company, inform them of their obligations, and help them come into compliance in a friendly way. That process almost always works, and the fact that Microsoft came into compliance shows the value of our community’s standard enforcement practice.

Still, there is a more important item of note from the perspective of software freedom. This Linux driver — whether it is released properly
under the GPL or kept proprietary in violation of the GPL — is designed to convince users to give up Free
virtualization platforms like Xen and KVM and use Microsoft’s
virtualization technology instead. From that perspective, it matters
little that it was released as Free Software: people should avoid the
software and use platforms for virtualization that respect their
freedom.

Someday, perhaps, Microsoft will take a proper place among other large
companies that actually contribute code that improves the general
infrastructure of Free Software. Many companies give generally useful
improvements back to Linux, GCC, and various other parts of the
GNU/Linux system. Microsoft has never done this: they only contribute
code when it improves Free Software interoperability with their
proprietary technology. The day that Microsoft actually changes its
attitude toward Free Software did not occur last week. Microsoft’s old strategy stays the same: try to kill Free Software with patents, and in the meantime, convince as many Free Software users as possible to begin relying on Microsoft proprietary technology.