The Wondrous World of Discoverable GPT Disk Images

Post Syndicated from original http://0pointer.net/blog/the-wondrous-world-of-discoverable-gpt-disk-images.html

TL;DR: Tag your GPT partitions with the right, descriptive partition
types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions
Specification, which defines GPT partition type UUIDs and partition
flags for the various partitions
Linux systems typically deal with. Before the specification all Linux
partitions usually just used the same type, basically saying “Hey, I
am a Linux partition” and not much else. With this specification the
GPT partition type, flags and label system becomes a lot more
expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without specifically being configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)
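
To make the list above concrete, here is a minimal Python sketch of how a tool might read most of these answers off a single partition table entry. The type UUIDs and flag bit positions are quoted from the specification as I understand it, so please double-check them there before relying on them. The describe() helper, the dictionary layout and the sample label fooOS_0.7 are purely illustrative and are not part of any systemd API.

```python
import uuid

# Partition type UUIDs as listed in the specification (quoted from memory here,
# so verify them against the specification before relying on them).
# Each entry: (content, mount point, CPU architecture or None if generic).
KNOWN_TYPES = {
    uuid.UUID("4f68bce3-e8cd-4db1-96e7-fbcaf984b709"): ("file system", "/", "x86-64"),
    uuid.UUID("b921b045-1df0-41c3-af44-4c6f280d3fae"): ("file system", "/", "aarch64"),
    uuid.UUID("933ac7e1-2eb4-4f13-b844-0e14e2aef915"): ("file system", "/home/", None),
    uuid.UUID("3b8f8425-20e0-4f3b-907f-1a25a76f98e8"): ("file system", "/srv/", None),
    uuid.UUID("0657fd6d-a4ab-43c4-84e5-0933c84b4f4f"): ("swap", None, None),
}

# Partition flag bits defined by the specification (same caveat as above).
FLAG_NO_AUTO = 1 << 63    # do not mount automatically
FLAG_READ_ONLY = 1 << 60  # mount read-only
FLAG_GROW_FS = 1 << 59    # grow the file system to the partition size


def describe(type_uuid: str, flags: int, label: str) -> dict:
    """Summarize what one partition table entry says about its partition."""
    content, mount_point, arch = KNOWN_TYPES.get(
        uuid.UUID(type_uuid), ("unknown", None, None)
    )
    return {
        "content": content,
        "mount_point": mount_point,
        "architecture": arch,
        "auto_mount": not flags & FLAG_NO_AUTO,
        "read_only": bool(flags & FLAG_READ_ONLY),
        "grow_file_system": bool(flags & FLAG_GROW_FS),
        "label": label,  # by convention carries the version, e.g. "fooOS_0.7"
    }
```

Calling describe() with the x86-64 root type UUID, FLAG_READ_ONLY and the label fooOS_0.7 yields an entry that is mounted automatically at /, read-only, for x86-64. Nothing in the lookup touches a file system: everything comes from the partition table entry alone.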

By embedding all of this information inside the GPT partition table,
disk images become self-descriptive: if you look at a compliant GPT
disk image it is clear how the image is put together and how it should
be used and mounted, without requiring any other source of information
(such as /etc/fstab). This self-descriptiveness in particular breaks
one philosophical weirdness of traditional Linux installations: the
original source of information about which file system is the root
file system is typically embedded in the root file system itself, in
/etc/fstab. Thus, in a way, in order to know what the root file system
is you need to know what the root file system is. 🤯 🤯 🤯

(Of course, the way this recursion is traditionally broken up is by
then copying the root file system information from /etc/fstab into
the boot loader configuration, resulting in a situation where the
primary source of information for this — i.e. /etc/fstab — is
actually mostly irrelevant, and the secondary source — i.e. the copy
in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have
been adopted quite widely, by distributions and their installers, as
well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the systemd
project provides make use of the concepts the specification
introduces.

But before we start with that, let’s underline why tagging partitions
with these descriptive partition type UUIDs (and the associated
partition flags) is a good thing, besides the philosophical points
made above.

  1. Simplicity: in particular OS installers become simpler — adjusting
    /etc/fstab as part of the installation is not necessary anymore,
    as the partitioning step already put all information into place for
    assembling the system properly at boot. i.e. installing doesn’t
    mean that you always have to get fdisk and /etc/fstab into
    place, the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after
    installation the chance of corruption is much lower than if the
    data is stored in file systems (e.g. in /etc/fstab). Moreover by
    associating the metadata directly with the objects it describes the
    chance of things getting out of sync is reduced. (i.e. if you lose
    /etc/fstab, or forget to rerun your initrd builder you still know
    what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it’s much
    easier to automatically process them with various tools. In fact,
    this blog story is mostly about that: various systemd tools can
    naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot
    loader needs to be told which kernel command line option root= to
    use, which then provides access to the root file system, where
    /etc/fstab is then found which describes the rest of the file
    systems. Where precisely root= is configured for the boot loader
    highly depends on the boot loader and distribution used, and is
    typically encoded in a Turing complete programming language
    (Grub…). This makes it very hard to automatically determine the
    right root file system to use, to implement alternative entry points
    to the system. By alternative entry points I mean other ways to boot
    the disk image, specifically for running it as a systemd-nspawn
    container — but this extends to other mechanisms where the boot
    loader may be bypassed to boot up the system, for example qemu
    when configured without a boot loader.

  5. User friendliness: it’s simply a lot nicer for the user looking at
    a partition table if the partition table explains what is what,
    instead of just saying “Hey, this is a Linux partition!” and
    nothing else.

Uses for the concept

Now that we cleared up the Why?, let’s have a closer look at how this is
currently used and exposed in systemd’s various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partitions Specification then
systemd-nspawn has all it needs to just boot it up. Specifically, if
you have a GPT
disk image in a file foobar.raw and you want to boot it up in a
container, just run systemd-nspawn -i foobar.raw -b, and that’s it
(you can specify a block device like /dev/sdb too if you like). It
becomes easy and natural to prepare disk images that can be booted
either on a physical machine, inside a virtual machine manager or
inside such a container manager: the necessary meta-information is
included in the image, easily accessible before actually looking into
its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification, in many cases you can
remove /etc/fstab (or never even install it), as the basic information
needed is already included in the partition table. The
systemd-gpt-auto-generator logic implements automatic discovery of the
root file system as well as all auxiliary file systems. (Note that
discovering the root file system requires an initrd that uses systemd;
some more conservative distributions unfortunately do not support that
yet.) Effectively this means you can boot
up a kernel/initrd with an entirely empty kernel command line, and the
initrd will automatically find the root file system (by looking for a
suitably marked partition on the same drive the EFI System Partition
was found on).

(Note, if /etc/fstab or root= exist and contain relevant
information they always take precedence over the automatic logic. This
is in particular useful to tweak things by specifying additional mount
options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The systemd-dissect tool may be used to introspect and manipulate OS
disk images that
implement the specification. If you pass the path to a disk image (or
block device) it will extract various bits of useful information from
the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be
mounted to some location. This is useful for looking at what is inside
it, or changing its contents. This will dissect the image and then
automatically mount all contained file systems matching their GPT
partition description to the right places, so that you subsequently
could chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)

Use #4: Copying files in and out of a disk image

The systemd-dissect tool also has two switches, --copy-from and
--copy-to, which allow
copying files out of or into a compliant disk image, taking all
included file systems and the resulting mount hierarchy into account.

Use #5: Running services directly off a disk image

The RootImage= setting in service unit files accepts paths to
compliant disk images (or block device nodes), and can mount them
automatically, running service binaries directly off them (in chroot()
style). In fact, this is the base for the Portable Service concept of
systemd.

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning
disk images in an “offline” mode. Specifically:

systemd-tmpfiles

With the --image= switch systemd-tmpfiles
can directly operate on a disk image, and for example create all
directories and other inodes defined in its declarative configuration
files included in the image. This can be useful for example to set up
the /var/ or /etc/ tree according to such configuration before
first boot.

systemd-sysusers

Similarly, the --image= switch of systemd-sysusers tells the tool to
read the declarative system user specifications included in the image
and synthesize system users from them, writing them to the /etc/passwd
(and related) files in the image. This is useful for provisioning these
users before the first boot, for example to ensure UID/GID numbers are
pre-allocated, and such allocations are not delayed until first boot.

systemd-machine-id-setup

The --image= switch of systemd-machine-id-setup may be used to
provision a fresh machine ID into /etc/machine-id of a disk image,
before first boot.

systemd-firstboot

The --image= switch of systemd-firstboot may be used to set various
basic system settings (such as root password, locale information,
hostname, …) on the specified disk image, before booting it up.

Use #7: Extracting log information

The journalctl switch --image= may be used to show the journal log
data included in
a disk image (or, as usual, the specified block device). This is very
useful for analyzing failed systems offline, as it gives direct access
to the logs without any further, manual analysis.

Use #8: Automatic repartitioning/growing of file systems

The systemd-repart tool may be used to repartition a disk or image in
a declarative and additive way. One primary use-case for it is to run
during boot on physical or VM systems to grow the root file system to
the disk size, or to add in, format, encrypt and populate additional
partitions at boot.

With its --image= switch the tool may operate on compliant disk
images in an offline mode of operation: it will then read the
definitions of the partitions that shall be grown or created off the
image itself, and then apply them to the image. This is particularly
useful in combination with the --size= switch, which allows growing
disk images to the specified size.

Specifically, consider the following work-flow: you download a
minimized disk image foobar.raw that contains only the minimized
root file system (and maybe an ESP, if you want to boot it on
bare-metal, too). You then run systemd-repart --image=foobar.raw
--size=15G to enlarge the image to 15G, based on the declarative
rules defined in the repart.d/ drop-in files included in the image
(this means this can grow the root partition, and/or add in more
partitions, for example for /srv or so, maybe encrypted with a locally
generated key or so). Then, you proceed to boot it up with
systemd-nspawn --image=foobar.raw -b, making use of the full 15G.

Versioning + Multi-Arch

Disk images implementing this specification can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/ file system (in which case the two are
    combined, the /usr/ file system mounted into the root file system,
    possibly in read-only fashion)

They may also contain OS executables for different architectures,
permitting “multi-arch” disk images that can safely boot up on
multiple CPU architectures. As the root and /usr/ partition type
UUIDs are specific to architectures this is easily done by including
one such partition for x86-64, and another for aarch64. If the
image is now used on an x86-64 system automatically the former
partition is used, on aarch64 the latter.

Moreover, these OS executables may be contained in different versions,
to implement a simple versioning scheme: when tools such as
systemd-nspawn or systemd-gpt-auto-generator dissect a disk image,
and they find two or more root or /usr/ partitions of the same type
UUID, they will automatically pick the one whose GPT partition label
(a 36 character free-form string every GPT partition may have) is the
newest according to strverscmp() (OK, truth be told, we don’t use
strverscmp() as-is, but a modified
version with some more modern syntax and semantics, but conceptually
identical).
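
The multi-arch and versioning rules above boil down to a two-step selection: keep only the root partitions whose type UUID matches the local architecture, then take the one with the newest label. Here is a rough Python sketch of that idea. The comparison below is a simple "digit runs compare as numbers" ordering that I am using as a stand-in; it is neither strverscmp() nor the modified comparison systemd actually uses. The partition dictionaries, pick_root() and the fooOS labels are made up for illustration, and the two UUIDs should be verified against the specification.

```python
import re

# Root partition type UUIDs per architecture, as given in the specification
# (quoted from memory; verify before relying on them).
ROOT_TYPE_FOR_ARCH = {
    "x86-64": "4f68bce3-e8cd-4db1-96e7-fbcaf984b709",
    "aarch64": "b921b045-1df0-41c3-af44-4c6f280d3fae",
}

_TOKEN = re.compile(r"[0-9]+|[^0-9]+")


def version_key(label: str) -> list:
    """Sort key: runs of ASCII digits compare numerically, other runs as text.

    This is a simplified stand-in, not systemd's real comparison function.
    """
    key = []
    for token in _TOKEN.findall(label):
        if token[0] in "0123456789":
            key.append((1, int(token)))
        else:
            key.append((0, token))
    return key


def pick_root(partitions: list, arch: str):
    """Return the newest root partition for `arch`, or None if there is none."""
    wanted = ROOT_TYPE_FOR_ARCH[arch]
    candidates = [p for p in partitions if p["type"].lower() == wanted]
    if not candidates:
        return None
    return max(candidates, key=lambda p: version_key(p["label"]))
```

Given two x86-64 root partitions labelled fooOS_0.9 and fooOS_0.10, pick_root() returns the second one, whereas a plain string comparison would have preferred fooOS_0.9. That difference is the whole reason for using a version-aware comparison here.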

This logic allows implementing a very simple and natural A/B update
scheme: an updater can drop multiple versions of the OS into separate
root or /usr/ partitions, always updating the partition label to the
version included therein once the download is complete. All of the
tools described here will then honour this, and always automatically
pick the newest version of the OS.

Verity

When building modern OS appliances, security is highly
relevant. Specifically, offline security matters: an attacker with
physical access should have a difficult time modifying the OS in a way
that isn’t noticed. Think of a car or a cell network base
station: these appliances are usually parked/deployed in environments
attackers can get physical access to. It’s essential that in this case
the OS itself is sufficiently protected, so that the attacker cannot just
mount the OS file system image, make modifications (inserting a
backdoor, spying software or similar) and have the system otherwise
continue to run without this being immediately detected.

A great way to implement offline security is via Linux’ dm-verity
subsystem: it allows securely binding immutable disk IO to a single,
short trusted hash value: if an attacker manages to offline modify the
disk image the modified disk image won’t match the trusted hash
anymore, and will not be trusted anymore (depending on policy this
then just results in IO errors being generated, or automatic
reboot/power-off).
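
To illustrate how a whole disk image can be bound to one short hash, here is a toy hash tree in Python. It only demonstrates the principle: the binary tree shape, the 4096 byte block size and the use of unsalted SHA-256 are assumptions of this sketch and do not reproduce dm-verity's actual on-disk format or its parameters.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed data block size for this sketch


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def root_hash(image: bytes, block_size: int = BLOCK_SIZE) -> str:
    """Compute the root of a simple binary hash tree over the image's blocks."""
    # Leaf level: one hash per data block (a single hash of b"" if the image is empty).
    level = [
        _h(image[i:i + block_size]) for i in range(0, len(image), block_size)
    ] or [_h(b"")]
    # Hash neighbouring pairs of hashes until only the root hash is left.
    while len(level) > 1:
        level = [_h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
    return level[0].hex()


def verify(image: bytes, trusted_root_hash: str) -> bool:
    """True if the image still matches the trusted root hash."""
    return root_hash(image) == trusted_root_hash
```

Flipping a single bit anywhere in the image changes the hash of its block, which changes every hash on the path up to the root, so verify() fails. The real subsystem uses the tree to check individual blocks as they are read rather than rehashing the whole image, which this sketch does not attempt.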

The Discoverable Partitions Specification declares how to include
Verity validation data in disk images, and how to relate them to the file
systems they protect, thus making it very easy to deploy and work with
such protected images. For example systemd-nspawn supports a
--root-hash= switch, which accepts the Verity root hash and then
will automatically assemble dm-verity with this, automatically
matching up the payload and verity partitions. (Alternatively, just
place a .roothash file next to the image file).
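
As a small illustration of the .roothash convention, the following Python sketch locates and reads such a file. The naming rule it implements (a trailing .raw is dropped before .roothash is appended, any other name just gets .roothash appended) is my reading of the systemd-nspawn documentation and should be checked there. Both helper functions are hypothetical and not part of systemd.

```python
from pathlib import Path
from typing import Optional


def roothash_path(image: Path) -> Path:
    """Where a root hash file for `image` would be expected (assumed naming rule)."""
    if image.suffix == ".raw":
        # foobar.raw -> foobar.roothash
        return image.with_suffix(".roothash")
    # anything else -> <name>.roothash
    return image.with_name(image.name + ".roothash")


def read_root_hash(image: Path) -> Optional[str]:
    """Return the hex root hash stored next to `image`, or None if there is no file."""
    candidate = roothash_path(image)
    if not candidate.is_file():
        return None
    text = candidate.read_text().strip()
    bytes.fromhex(text)  # raises ValueError if the content is not valid hex
    return text.lower()
```

For an image foobar.raw this looks for foobar.roothash in the same directory and returns its content as a lowercase hex string, which could then be handed to a tool in place of an explicit --root-hash= value.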

Future

The above already is a powerful tool set for working with disk
images. However, there are some more areas I’d like to extend this
logic to:

bootctl

Similar to the other tools mentioned above, bootctl (which is a tool
to interface with the boot loader, and install/update systemd’s own
EFI boot loader sd-boot) should learn a --image= switch, to make
installation of the boot
loader on disk images easy and natural. It would automatically find
the ESP and other relevant partitions in the image, and copy the boot
loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl
tool should also gain an --image= switch for extracting coredumps
from compliant disk images. The combination of journalctl --image=
and coredumpctl --image= would make it exceptionally easy to work
with OS disk images of appliances and to extract logging and debugging
information from them after failures.

And that’s all for now. Please refer to the specification and the man
pages for further details. If your distribution’s installer does not
yet tag the GPT partitions it creates with the right GPT type UUIDs,
consider asking them to do so.

Thank you for your time.