Tag Archives: File Systems

H1 Instances – Fast, Dense Storage for Big Data Applications

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-h1-instances-fast-dense-storage-for-big-data-applications/

The scale of AWS and the diversity of our customer base gives us the opportunity to create EC2 instance types that are purpose-built for many different types of workloads. For example, a number of popular big data use cases depend on high-speed, sequential access to multiple terabytes of data. Our customers want to build and run very large MapReduce clusters, host distributed file systems, use Apache Kafka to process voluminous log files, and so forth.

New H1 Instances
The new H1 instances are designed specifically for this use case. In comparison to the existing D2 (dense storage) instances, the H1 instances provide more vCPUs and more memory per terabyte of local magnetic storage, along with increased network bandwidth, giving you the power to address more complex challenges with a nicely balanced mix of resources.

The instances are based on Intel Xeon E5-2686 v4 processors running at a base clock frequency of 2.3 GHz and come in four instance sizes (all VPC-only and HVM-only):

Instance Name    vCPUs    RAM        Local Storage    Network Bandwidth
h1.2xlarge       8        32 GiB     2 TB             Up to 10 Gbps
h1.4xlarge       16       64 GiB     4 TB             Up to 10 Gbps
h1.8xlarge       32       128 GiB    8 TB             10 Gbps
h1.16xlarge      64       256 GiB    16 TB            25 Gbps

The two largest sizes support Intel Turbo and CPU power management, with all-core Turbo at 2.7 GHz and single-core Turbo at 3.0 GHz.

Local storage is optimized to deliver high throughput for sequential I/O; you can expect to transfer up to 1.15 gigabytes per second if you use a 2 megabyte block size. The storage is encrypted at rest using 256-bit XTS-AES and one-time keys.
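
If you want to verify that on an instance, a quick sequential read test at the stated 2 megabyte block size can be run with dd; the device name below (/dev/xvdb) is only a placeholder for whichever instance store volume your H1 instance actually exposes:

$ sudo dd if=/dev/xvdb of=/dev/null bs=2M count=4096 iflag=direct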

Moving large amounts of data on and off these instances is facilitated by Enhanced Networking, which gives you up to 25 Gbps of network bandwidth within Placement Groups.

Launch One Today
H1 instances are available today in the US East (Northern Virginia), US West (Oregon), US East (Ohio), and EU (Ireland) Regions. You can launch them in On-Demand or Spot form. Dedicated Hosts, Dedicated Instances, and Reserved Instances (both 1-year and 3-year) are also available.

Jeff;

The casync filesystem image distribution tool

Post Syndicated from corbet original https://lwn.net/Articles/726005/rss

Lennart Poettering announces casync, a tool for distributing system
images. “casync takes inspiration from the popular rsync file
synchronization tool as well as the probably even more popular git
revision control system. It combines the idea of the rsync algorithm
with the idea of git-style content-addressable file systems, and
creates a new system for efficiently storing and delivering file
system images, optimized for high-frequency update cycles over the
Internet. Its current focus is on delivering IoT, container, VM,
application, portable service or OS images, but I hope to extend it
later in a generic fashion to become useful for backups and home
directory synchronization as well.”

casync — A tool for distributing file system images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

Introducing casync

In the past months I have been working on a new project:
casync. casync takes
inspiration from the popular rsync file
synchronization tool as well as the probably even more popular
git revision control system. It combines the
idea of the rsync algorithm with the idea of git-style
content-addressable file systems, and creates a new system for
efficiently storing and delivering file system images, optimized for
high-frequency update cycles over the Internet. Its current focus is
on delivering IoT, container, VM, application, portable service or OS
images, but I hope to extend it later in a generic fashion to become
useful for backups and home directory synchronization as well (but
more about that later).

The basic technological building blocks casync is built from are
neither new nor particularly innovative (at least not anymore),
however the way casync combines them is different from existing tools,
and that’s what makes it useful for a variety of use-cases that other
tools can’t cover that well.

Why?

I created casync after studying how today’s popular tools store and
deliver file system images. To briefly name a few: Docker has a
layered tarball approach,
OSTree serves the
individual files directly via HTTP and maintains packed deltas to
speed up updates, while other systems operate on the block layer and
place raw squashfs images (or other archival file systems, such as
ISO 9660) for download on HTTP shares (in the better cases combined
with zsync data).

Neither of these approaches appeared fully convincing to me when used
in high-frequency update cycle systems. In such systems, it is
important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to update between would mean keeping an exponentially growing number of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don’t think any of the tools mentioned above are really good on more
than a small subset of these points.

Specifically: Docker’s layered tarball approach dumps the “delta”
question onto the feet of the image creators: the best way to make
your image downloads minimal is basing your work on an existing image
clients might already have, and inherit its resources, maintaining full
history. Here, revision control (a tool for the developer) is
intermingled with update management (a concept for optimizing
production delivery). As container histories grow, individual deltas
are likely to stay small, but on the other hand a brand-new deployment
usually requires downloading the full history onto the deployment
system, even though there’s no use for it there, which likely means
substantially more disk space and larger downloads.

OSTree’s serving of individual files is unfriendly to CDNs (many
small files in file trees cause an explosion of HTTP GET
requests). To counter that, OSTree supports placing pre-calculated
delta images between selected revisions on the delivery servers, which
means a certain amount of revision management that leaks into the
clients.

Delivering direct squashfs (or other file system) images is almost
beautifully simple, but of course means every update requires a full
download of the newest image, which is both bad for disk usage and
generated traffic. Enhancing it with zsync makes this a much better
option, as it can reduce generated traffic substantially at very
little cost of history/meta-data (no explicit deltas between a large
number of versions need to be prepared server side). On the other hand
server requirements in disk space and functionality (HTTP Range
requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it’s not
my intention to badmouth them. The only point I am trying to make is
that for the use case I care about — file system image delivery with
high-frequency update cycles — each system comes with certain
drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn’t happy with the security
and reproducibility properties of these systems. In today’s world
where security breaches involving hacking and breaking into connected
systems happen every day, an image delivery system that cannot make
strong guarantees regarding data integrity is out of
date. Specifically, the tarball format is famously nondeterministic:
the very same file tree can result in any number of different
valid serializations depending on the tool used, its version and the
underlying OS and file system. Some tar implementations attempt to
correct that by guaranteeing that each file tree maps to exactly
one valid serialization, but such a property is always only specific
to the tool used. I strongly believe that any good update system must
guarantee on every single link of the chain that there’s only one
valid representation of the data to deliver, that can easily be
verified.

What casync Is

So much about the background why I created casync. Now, let’s have a
look what casync actually is like, and what it does. Here’s the brief
technical overview:

Encoding: Let’s take a large linear data stream, split it into
variable-sized chunks (the size of each being a function of the
chunk’s contents), and store these chunks in individual, compressed
files in some directory, each file named after a strong hash value of
its contents, so that the hash value may be used as the key for
retrieving the full chunk data. Let’s call this directory a “chunk
store”. At the same time, generate a “chunk index” file that lists
these chunk hash values plus their respective chunk sizes in a simple
linear array. The chunking algorithm is supposed to create variable,
but similarly sized chunks from the data stream, and do so in a way
that the same data results in the same chunks even if placed at
varying offsets. For more information see this blog story.
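
To make the encoding step concrete, here is a minimal Python sketch of
a content-addressable chunk store plus chunk index. It assumes the
chunk boundaries have already been chosen (casync derives them with a
rolling hash, sketched further below) and uses SHA-256 digests and xz
compression as described; the directory layout and file naming are
illustrative, not casync's actual on-disk format.

    import hashlib, lzma, os

    def store_chunks(chunks, store_dir):
        """Write each chunk as a compressed, hash-named file; return the chunk index."""
        os.makedirs(store_dir, exist_ok=True)
        index = []                                 # linear array of (digest, size) entries
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            path = os.path.join(store_dir, digest + ".xz")
            if not os.path.exists(path):           # content-addressed: duplicates stored once
                with open(path, "wb") as f:
                    f.write(lzma.compress(chunk))
            index.append((digest, len(chunk)))
        return index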

Decoding: Let’s take the chunk index file, and reassemble the large
linear data stream by concatenating the uncompressed chunks retrieved
from the chunk store, keyed by the listed chunk hash values.
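
The decoding step is then just the inverse; again this is an
illustrative sketch using the same hypothetical layout as above:

    import lzma, os

    def assemble(index, store_dir, out_path):
        """Rebuild the original stream by concatenating the chunks listed in the index."""
        with open(out_path, "wb") as out:
            for digest, size in index:
                with open(os.path.join(store_dir, digest + ".xz"), "rb") as f:
                    data = lzma.decompress(f.read())
                assert len(data) == size           # sanity-check against the index
                out.write(data)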

As an extra twist, we introduce a well-defined, reproducible,
random-access serialization format for file trees (think: a more
modern tar), to permit efficient, stable storage of complete file
trees in the system, simply by serializing them and then passing them
into the encoding step explained above.

Finally, let’s put all this on the network: for each image you want to
deliver, generate a chunk index file and place it on an HTTP
server. Do the same with the chunk store, and share it between the
various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result
in mostly the same chunk files in the chunk store. This means it is
very efficient to store many related versions of a data stream in the
same chunk store, thus minimizing disk usage. Moreover, when
transferring linear data streams chunks already known on the receiving
side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well,
one major difference between casync and those tools is that we
remove file boundaries before chunking things up. This means that
small files are lumped together with their siblings and large files
are chopped into pieces, which permits us to recognize similarities in
files and directories beyond file boundaries, and makes sure our chunk
sizes are pretty evenly distributed, without the file boundaries
affecting them.

The “chunking” algorithm is based on the buzhash rolling hash
function. SHA256 is used as the strong hash function to generate
digests of the chunks. xz is used to compress the individual chunks.
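
To illustrate the idea of content-defined chunking (though not
casync's actual parameters, hash function or limits), here is a toy
Python version: a hash of a small sliding window decides where chunk
boundaries fall, so identical data produces identical chunks no matter
where it sits in the stream. A real implementation such as buzhash
updates the rolling hash incrementally and enforces minimum and
maximum chunk sizes; this sketch does neither.

    import hashlib

    WINDOW = 48             # length of the sliding window, in bytes
    MASK = (1 << 16) - 1    # cut when the low 16 bits are zero: ~64 KiB average chunks

    def is_boundary(window_bytes):
        # Stand-in for a rolling hash: any hash over the last WINDOW bytes will do here.
        h = int.from_bytes(hashlib.sha256(window_bytes).digest()[:4], "big")
        return (h & MASK) == 0

    def chunk(data):
        chunks, start = [], 0
        for i in range(WINDOW, len(data) + 1):
            if is_boundary(data[i - WINDOW:i]):
                chunks.append(data[start:i])
                start = i
        if start < len(data):
            chunks.append(data[start:])            # trailing partial chunk
        return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]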

Here’s a diagram that hopefully explains a bit how the encoding
process works, despite my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts
with a block device or a file tree, which is then serialized and
chunked up into variable sized blocks. The compressed chunks are then
placed in the chunk store, while a chunk index file is written listing
the chunk hashes in order. (The original SVG of this graphic may be
found here.)

Details

Note that casync operates on two different layers, depending on the
use-case of the user:

  1. You may use it on the block layer. In this case the raw block data
    on disk is taken as-is, read directly from the block device, split
    into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the
    file tree serialization format mentioned above comes into play:
    the file tree is serialized depth-first (much like tar would do
    it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer
opens it up for a variety of different use-cases. In the VM and IoT
ecosystems shipping images as block-level serializations is more
common, while in the container and application world file-system-level
serializations are more typically used.

Chunk index files referring to block-layer serializations carry the
.caibx suffix, while chunk index files referring to file system
serializations carry the .caidx suffix. Note that you may also use
casync as a direct tar replacement, i.e. without the chunking, just
generating the plain linear file tree serialization. Such files
carry the .catar suffix. Internally, .caibx files are identical to
.caidx files; the only difference is semantic: .caidx files
describe a .catar file, while .caibx files may describe any other
blob. Finally, chunk stores are directories carrying the .castr
suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync‘s --seed=
    feature: each block device, file, or directory specified is processed
    using the same chunking logic described above, and is used as
    preferred source when putting together the downloaded image locally,
    avoiding network transfer of it. This of course is useful whenever
    updating an image: simply specify one or more old versions as seed and
    only download the chunks that truly changed since then. Note that
    using seeds requires no history relationship between seed and the new
    image to download. This has major benefits: you can even use it to
    speed up downloads of relatively foreign and unrelated data. For
    example, when downloading a container image built using Ubuntu you can
    use your Fedora host OS tree in /usr as seed, and casync will
    automatically use whatever it can from that tree, for example timezone
    and locale data that tends to be identical between
    distributions. Example: casync extract
    http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This
    will place the block-layer image described by the indicated URL in
    the /dev/sda2 partition, using the existing /dev/sda1 data as
    seeding source. An invocation like this would typically be used by
    IoT systems with an A/B partition setup. Example 2: casync extract
    http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1
    --seed=/srv/container-v2 /srv/container-v3 is very similar but
    operates on the file system layer, and uses two old container
    versions to seed the new version.

  2. When operating on the file system level, the user has fine-grained
    control on the meta-data included in the serialization. This is
    relevant since different use-cases tend to require a different set of
    saved/restored meta-data. For example, when shipping OS images, file
    access bits/ACLs and ownership matter, while file modification times
    hurt. When doing personal backups OTOH file ownership matters little
    but file modification times are important. Moreover different backing
    file systems support different feature sets, and storing more
    information than necessary might make it impossible to validate a tree
    against an image if the meta-data cannot be replayed in full. Due to
    this, casync provides a set of --with= and --without= parameters
    that allow fine-grained control of the data stored in the file tree
    serialization, including the granularity of modification times and
    more. The precise set of selected meta-data features is also always
    part of the serialization, so that seeding can work correctly and
    automatically.

  3. casync tries to be as accurate as possible when storing file
    system meta-data. This means that besides the usual baseline of file
    meta-data (file ownership and access bits), and more advanced features
    (extended attributes, ACLs, file capabilities), a number of more
    exotic attributes are stored as well, including Linux chattr(1) file
    attributes as well as FAT file attributes (you may wonder why the
    latter — EFI is FAT, and /efi is part of the comprehensive
    serialization of any host). In the future I intend to extend this
    further, for example storing btrfs sub-volume information where
    available. Note that as described above every single type of
    meta-data may be turned off and on individually, hence if you don’t
    need FAT file bits (and I figure it’s pretty likely you don’t), then
    they won’t be stored.

  4. The user creating .caidx or .caibx files may control the desired
    average chunk length (before compression) freely, using the
    --chunk-size= parameter. Smaller chunks increase the number of
    generated files in the chunk store and increase HTTP GET load on the
    server, but also ensure that sharing between similar images is
    improved, as identical patterns in the images stored are more likely
    to be recognized. By default casync will use a 64K average chunk
    size. Tweaking this can be particularly useful when adapting the
    system to specific CDNs, or when delivering compressed disk images
    such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible,
    well-defined and strictly deterministic. As mentioned above this is a
    requirement to reach the intended security guarantees, but is also
    useful for many other use-cases. For example, the casync digest
    command may be used to calculate a hash value identifying a specific
    directory in all desired detail (use --with= and --without= to pick
    the desired detail). Moreover the casync mtree command may be used
    to generate a BSD mtree(5) compatible manifest of a directory tree,
    .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this
    I mean that the serialization of a file tree is the concatenation of
    the serializations of all files and file sub-trees located at the
    top of the tree, with zero meta-data references from any of these
    serializations into the others. This property is essential to ensure
    maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync
    will automatically create
    reflinks
    from any specified seeds if the underlying file system supports it
    (such as btrfs, ocfs, and future xfs). After all, instead of
    copying the desired data from the seed, we can just tell the file
    system to link up the relevant blocks. This works both when extracting
    .caidx and .caibx files — the latter of course only when the
    extracted disk image is placed in a regular raw image file on disk,
    rather than directly on a plain block device, as plain block devices
    do not know the concept of reflinks. (A brief generic illustration of
    reflink copies follows after this list.)

  8. Optionally, when extracting file trees, casync can
    create traditional UNIX hard-links for identical files in specified
    seeds (--hardlink=yes). This works on all UNIX file systems, and can
    save substantial amounts of disk space. However, this only works for
    very specific use-cases where disk images are considered read-only
    after extraction, as any changes made to one tree will propagate to
    all other trees sharing the same hard-linked files, as that’s the
    nature of hard-links. In this mode, casync exposes OSTree-like
    behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file
    system images. Implicitly, file systems such as procfs and sysfs are
    excluded from serialization, as they expose API objects, not real
    files. Moreover, the “nodump” (+d)
    chattr(1) flag is honored by
    default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an
    automatic or explicit UID/GID shift. This is particularly useful when
    transferring container images for use with Linux user namespacing.

  11. In addition to local operation, casync currently supports HTTP,
    HTTPS, FTP and ssh natively for downloading chunk index files and
    chunks (the ssh mode requires installing casync on the remote host,
    but an sftp mode not requiring that should be easy to add). When
    creating index files or chunks, only ssh is supported as a remote
    back-end.

  12. When operating on block-layer images, you may expose locally or
    remotely stored images as local block devices. Example: casync mkdev
    http://example.com/myimage.caibx
    exposes the disk image described by
    the indicated URL as local block device in /dev, which you then may
    use the usual block device tools on, such as mount or fdisk (only
    read-only though). Chunks are downloaded on access with high priority,
    and at low priority when idle in the background. Note that in this
    mode, casync also plays a role similar to “dm-verity”, as all blocks
    are validated against the strong digests in the chunk index file
    before passing them on to the kernel’s block layer. This feature is
    implemented through Linux’ NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may
    mount locally or remotely stored images as regular file systems.
    Example: casync mount http://example.com/mytree.caidx /srv/mytree
    mounts the file tree image described by the indicated URL as a local
    directory /srv/mytree. This feature is implemented through Linux’
    FUSE kernel facility. Note that special care is taken that the images
    exposed this way can be packed up again with casync make and are
    guaranteed to return the bit-by-bit identical serialization they were
    mounted from. No data is lost or changed while passing things through
    FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s
    hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed-size partition setups the file systems placed in
    the two partitions are usually much smaller than the partition size,
    in order to keep some room for later, larger updates. casync is able
    to analyze the super-block of a number of common file systems in order
    to determine the actual size of a file system stored on a block
    device, so that writing a file system to such a partition and reading
    it back again will result in reproducible data. Moreover this speeds
    up the seeding process, as there’s little point in seeding the empty
    space after the file system within the partition.
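
(As a side note to point 7 above, here is a generic illustration of
the reflink concept, independent of casync: on a reflink-capable file
system such as btrfs, a file can be "copied" by sharing its blocks
rather than duplicating them. The file names are of course just
placeholders.)

$ cp --reflink=always big-image.raw clone-of-big-image.raw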

Example Command Lines

Here’s how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local
directory, and populate the chunk store directory default.castr
located next to it with the chunks of the serialization (you can
change the name for the store directory with --store= if you
like). This command operates on the file-system level. A similar
command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local
directory describing the current contents of the /dev/sda1 block
device, and populates default.castr in the same way as above. Note
that you may as well read a raw disk image from a file instead of a
block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and
the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data
only. Now let’s make this more interesting, and reference remote
resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of
course assumes that foobar.caidx was uploaded to the HTTP server in
the first place, along with the chunk store. You can use any command
you like to accomplish that, for example scp or
rsync. Alternatively, you can let casync do this directly when
generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then
place the .caidx file and the chunks on it. Note that this mode of
operation is “smart”: this scheme will only upload chunks currently
missing on the server side, and will not re-transmit what is already
available.

Note that you can always configure the precise path or URL of the
chunk store via the --store= option. If you do not do that, then the
store path is automatically derived from the path or URL: the last
component of the path or URL is replaced by default.castr.
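
For example, to place the chunks somewhere other than the default
location (the store path below is just an arbitrary example):

$ casync make --store=/var/lib/backups.castr foobar.caidx /some/directory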

Of course, when extracting .caidx or .caibx files from remote sources,
using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by
default store meta-data as accurately as possible. Let’s create a chunk
index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization
that has three features above the absolute baseline supported: 1s
granularity time-stamps, symbolic links and a single read-only bit. In
this mode, none of the other meta-data is stored, including nanosecond
time-stamps, full UNIX permission bits, file ownership, ACLs and
extended attributes.

Now let’s make a .caidx file available locally as a mounted file
system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similar, let’s make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device
node path to STDOUT.

As mentioned, casync is big on reproducibility. Let’s make use of
that to calculate a digest identifying a very specific version of
a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying
file system know about. Usually, to make this useful you want to
configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data
fields. Specifying --with=unix selects all meta-data that
traditional UNIX file systems support. It is a shortcut for writing out:
--with=16bit-uids --with=permissions --with=sec-time --with=symlinks
--with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may
also use the negative --without= option to remove specific features,
starting from the most precise set:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves
one feature out: chattr(1)‘s
immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and
directories, not too different from tar t or ls -al in its
output. The latter command will generate a BSD
mtree(5) compatible
manifest. Note that casync actually stores substantially more file
meta-data than mtree files can express, though.

What casync isn’t

  1. casync is not an attempt to minimize serialization and downloaded
    deltas to the extreme. Instead, the tool is supposed to find a good
    middle ground, that is good on traffic and disk space, but not at the
    price of convenience or requiring explicit revision control. If you
    care about updates that are absolutely minimal, there are binary delta
    systems around that might be an option for you, such as Google’s
    Courgette.

  2. casync is not a replacement for rsync, or git or zsync or
    anything like that. They have very different use-cases and
    semantics. For example, rsync permits you to directly synchronize two
    file trees remotely. casync just cannot do that, and it is unlikely
    it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary
focus for now is delivery of OS images, but I’d like to make it useful
for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have
    pretty concrete plans how to add that. When implemented, the tool
    might become an alternative to restic,
    BorgBackup or
    tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still
    need to validate the downloaded .caidx or .caibx file yourself, for
    example with some gpg signature. It is my intention to integrate with
    gpg in a minimal way so that signing and verifying chunk index files
    is done automatically.

  3. In the longer run, I’d like to build an automatic synchronizer for
    $HOME between systems from this. Each $HOME instance would be
    stored automatically in regular intervals in the cloud using casync,
    and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet
    built as one. Specifically this means that almost all of casync‘s
    functionality is supposed to be available as C API soon, and
    applications can process casync files on every level. It is my
    intention to make this library useful enough so that it will be easy
    to write a module for GNOME’s gvfs subsystem in order to make remote
    or local .caidx files directly available to applications (as an
    alternative to casync mount). In fact the idea is to make this all
    flexible enough that even the remoting back-ends can be replaced
    easily, for example to replace casync‘s default HTTP/HTTPS back-ends
    built on CURL with GNOME’s own HTTP implementation, in order to share
    cookies, certificates, … There’s also an alternative method to
    integrate with casync in place already: simply invoke casync as a
    sub-process. casync will inform you about a certain set of state
    changes using a mechanism compatible with
    sd_notify(3). In
    future it will also propagate progress data this way and more.

  5. I intend to add a new seeding back-end that sources chunks from
    the local network. After downloading the new .caidx file off the
    Internet casync would then search for the listed chunks on the local
    network first before retrieving them from the Internet. This should
    speed things up on all installations that have multiple similar
    systems deployed in the same network.

Further plans are listed tersely in the
TODO file.

FAQ:

  1. Is this a systemd project? — casync is hosted under the
    github systemd umbrella, and the
    projects share the same coding style. However, the code-bases are
    distinct and without interdependencies, and casync works fine both
    on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and
    that’s what I code for. That said, I am open to accepting portability
    patches (unlike for systemd, which doesn’t really make sense on
    non-Linux systems), as long as they don’t interfere too much with the
    way casync works. Specifically this means that I am not too
    enthusiastic about merging portability patches for OSes lacking the
    openat(2) family
    of APIs.

  3. Does casync require reflink-capable file systems to work, such
    as btrfs?
    — No it doesn’t. The reflink magic in casync is
    employed when the file system permits it, and it’s good to have it,
    but it’s not a requirement, and casync will implicitly fall back to
    copying when it isn’t available. Note that casync supports a number
    of file system features on a variety of file systems that aren’t
    available everywhere, for example FAT’s system/hidden file flags or
    xfs‘s projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial
    release. While I have been working on it for quite some time and it
    is quite featureful, this is the first time I am advertising it
    publicly, and it has hence received very little testing outside of
    its own test suite. I am also not fully ready to commit to the
    stability of the current serialization or chunk index format. I don’t
    see any breakages coming for it though. casync is pretty light on
    documentation right now, and does not even have a man page. I intend
    to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and
    documented?
    casync is Open Source, so if you want to know the
    precise format, have a look at the sources for now. It’s definitely my
    intention to add comprehensive docs for both formats however. Don’t
    forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing
    the wheel (again)?
    — Well, because casync isn’t “just like” some
    other tool. I am pretty sure I did my homework, and that there is no
    tool just like casync right now. The tools coming closest are probably
    rsync, zsync, tarsnap, restic, but they are quite different beasts
    each.

  7. Why did you invent your own serialization format for file trees?
    Why don’t you just use tar?
    — That’s a good question, and other
    systems — most prominently tarsnap — do that. However, as mentioned
    above tar doesn’t enforce reproducibility. It also doesn’t really do
    random access: if you want to access some specific file you need to
    read every single byte stored before it in the tar archive to find
    it, which is of course very expensive. The serialization casync
    implements places a focus on reproducibility, random access, and
    meta-data control. Much like traditional tar it can still be
    generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the
    moment, no. That’s not because I wouldn’t want it to, but simply
    because I am not a guru of either of these systems, and didn’t want to
    implement something I do not fully grok nor can test. If you look at
    the sources you’ll find that there are already some definitions in
    place that keep room for them though. I’d be delighted to accept a
    patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking
    work on compressed serializations? – That’s a very good point!
    Usually, if you apply a chunking algorithm to a compressed data
    stream (let’s say a tar.gz file), then changing a single bit at the
    front will propagate into the entire remainder of the file, so that
    minimal changes will explode into major changes. Thankfully this
    doesn’t apply that strictly to squashfs images, as squashfs provides
    random access to files and directories and thus breaks up the
    compression streams at regular intervals to make seeking easy. This
    fact is beneficial for systems employing chunking, such as casync, as
    this means single-bit changes might affect their vicinity but will not
    explode in an unbounded fashion. In order to achieve the best results
    when delivering squashfs images through casync, the block sizes of
    squashfs and the chunk sizes of casync should be matched up (using
    casync‘s --chunk-size= option; see the sketch after this FAQ). How
    precisely to choose both values is left as a research subject for the
    user, for now.

  10. What does the name casync mean? – It’s a synchronizing
    tool, hence the -sync suffix, following rsync‘s naming. It makes
    use of the content-addressable concept of git hence the ca-
    prefix.

  11. Where can I get this stuff? Is it already packaged? – Check
    out the sources on GitHub. I just tagged the first version. Martin
    Pitt has packaged casync for Ubuntu. There is also an ArchLinux
    package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that
    hopefully will soon be included in the distribution.
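
As a rough illustration of the block-size matching mentioned in
question 9, a hypothetical pairing could look like the following; the
file names are made up, and 131072 bytes (128 KiB) is simply
squashfs's default block size, not a recommendation:

$ mksquashfs rootfs/ myimage.squashfs -b 131072 -noappend
$ casync make --chunk-size=131072 myimage.caibx myimage.squashfs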

Should you care? Is this a tool for you?

Well, that’s up to you really. If you are involved with projects that
need to deliver IoT, VM, container, application or OS images, then
maybe this is a great tool for you — but other options exist, some of
which are linked above.

Note that casync is an Open Source project: if it doesn’t do exactly
what you need, prepare a patch that adds what you need, and we’ll
consider it.

If you are interested in the project and would like to talk about this
in person, I’ll be presenting casync soon at Kinvolk’s Linux
Technologies Meetup in Berlin, Germany. You are invited. I also intend
to talk about it at All Systems Go!, also in Berlin.

Amazon EFS Update – On-Premises Access via Direct Connect

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-efs-update-on-premises-access-via-direct-connect-vpc/

I introduced you to Amazon Elastic File System last year (Amazon Elastic File System – Shared File Storage for Amazon EC2) and announced production readiness earlier this year (Amazon Elastic File System – Production-Ready in Three Regions). Since the launch earlier this year, thousands of AWS customers have used it to set up, scale, and operate shared file storage in the cloud.

Today we are making EFS even more useful with the introduction of simple and reliable on-premises access via AWS Direct Connect. This has been a much-requested feature and I know that it will be useful for migration, cloudbursting, and backup. To use this feature for migration, you simply attach an EFS file system to your on-premises servers, copy your data to it, and then process it in the cloud as desired, leaving your data in AWS for the long term.  For cloudbursting, you would copy on-premises data to an EFS file system, analyze it at high speed using a fleet of Amazon Elastic Compute Cloud (EC2) instances, and then copy the results back on-premises or visualize them in Amazon QuickSight.

You’ll get the same file system access semantics including strong consistency and file locking, whether you access your EFS file systems from your on-premises servers or from your EC2 instances (of course, you can do both concurrently). You will also be able to enjoy the same multi-AZ availability and durability that is part-and-parcel of EFS.

In order to take advantage of this new feature, you will need to use Direct Connect to set up a dedicated network connection between your on-premises data center and an Amazon Virtual Private Cloud. Then you need to make sure that your file systems have mount targets in subnets that are reachable via the Direct Connect connection, and add a rule to the mount target’s security group in order to allow inbound TCP and UDP traffic to port 2049 (NFS) from your on-premises servers.

After you create the file system, you can reference the mount targets by their IP addresses, NFS-mount them on-premises, and start copying files. The IP addresses are available from within the AWS Management Console.
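
Mounting by IP from an on-premises server looks roughly like this; the mount target address (10.0.1.32) is just a placeholder, and the options shown are the commonly recommended NFSv4.1 settings:

$ sudo mkdir -p /mnt/efs
$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 10.0.1.32:/ /mnt/efs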

The Management Console also provides step-by-step directions: simply click on the On-premises mount instructions and follow along.

This feature is available today at no extra charge in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and US East (Ohio) Regions.

Jeff;

 

Pi 3 booting part II: Ethernet

Post Syndicated from Gordon Hollingworth original https://www.raspberrypi.org/blog/pi-3-booting-part-ii-ethernet-all-the-awesome/

Yesterday, we introduced the first of two new boot modes which have now been added to the Raspberry Pi 3. Today, we introduce an even more exciting addition: network booting a Raspberry Pi with no SD card.

Again, rather than go through a description of the boot mode here, we’ve written a fairly comprehensive guide on the Raspberry Pi documentation pages, and you can find a tutorial to get you started here. Below are answers to what we think will be common questions, and a look at some limitations of the boot mode.

Note: this is still in beta testing and uses the “next” branch of the firmware. If you’re unsure about using the new boot modes, it’s probably best to wait until we release it fully.

What is network booting?

Network booting is a computer’s ability to load all its software over a network. This is useful in a number of cases, such as remotely operated systems or those in data centres; network booting means they can be updated, upgraded, and completely re-imaged, without anyone having to touch the device!

The main advantages when it comes to the Raspberry Pi are:

  1. SD cards are difficult to make reliable unless they are treated well; they must be powered down correctly, for example. A Network File System (NFS) is much better in this respect, and is easy to fix remotely.
  2. NFS file systems can be shared between multiple Raspberry Pis, meaning that you only have to update and upgrade a single Pi, and are then able to share users in a single file system.
  3. Network booting allows for completely headless Pis with no external access required. The only desirable addition would be an externally controlled power supply.

I’ve tried doing things like this before and it’s really hard editing DHCP configurations!

It can be quite difficult to edit DHCP configurations to allow your Raspberry Pi to boot while not breaking the whole network in the process. Because of this, and thanks to input from Andrew Mulholland, I added support for proxy DHCP, as used with PXE-booting computers.

What’s proxy DHCP and why does it make it easier?

Standard DHCP is the protocol that gives a system an IP address when it powers up. It’s one of the most important protocols, because it allows all the different systems to coexist. The problem is that if you edit the DHCP configuration, you can easily break your network.

So proxy DHCP is a special protocol: instead of handing out IP addresses, it only hands out the TFTP server address. This means it will only reply to devices trying to do netboot. This is much easier to enable and manage, because we’ve given you a tutorial!
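
For a sense of what that looks like in practice, here is a minimal proxy DHCP sketch using dnsmasq, in the spirit of the tutorial; the subnet and the TFTP root directory are assumptions for illustration:

    # /etc/dnsmasq.conf (sketch, assuming a 192.168.1.0/24 network)
    # Proxy mode: hand out boot information only, never IP address leases
    dhcp-range=192.168.1.255,proxy
    log-dhcp
    enable-tftp
    # Directory holding bootcode.bin and the rest of the boot files
    tftp-root=/tftpboot
    # The boot service string the Pi 3 boot ROM looks for
    pxe-service=0,"Raspberry Pi Boot"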

Are there any bugs?

At the moment we know of three problems which need to be worked around:

  • When the boot ROM enables the Ethernet link, it first waits for the link to come up, then sends its first DHCP request packet. This is sometimes too quick for the switch to which the Raspberry Pi is connected: we believe that the switch may throw away packets it receives very soon after the link first comes up.
  • The second bug is in the retransmission of the DHCP packet: the retransmission loop is not timing out correctly, so the DHCP packet will not be retransmitted.

The solution to both these problems is to find a suitable switch which works with the Raspberry Pi boot system. We have been using a Netgear GS108 without a problem.

  • Finally, the failing timeout has a knock-on effect. This means it can require the occasional random packet to wake it up again, so having the Raspberry Pi network wired up to a general network with lots of other computers actually helps!

Can I use network boot with Raspberry Pi / Pi 2?

Unfortunately, because the code is actually in the boot ROM, this won’t work with Pi 1, Pi B+, Pi 2, and Pi Zero. But as with the MSD instructions, there’s a special mode in which you can copy the ‘next’ firmware bootcode.bin to an SD card on its own, and then it will try and boot from the network.

This is also useful if you’re having trouble with the bugs above, since I’ve fixed them in the bootcode.bin implementation.

Finally, I would like to thank my Slack beta testing team who provided a great testing resource for this work. It’s been a fun few weeks! Thanks in particular to Rolf Bakker for this current handy status reference…

Current state of network boot on all Pis

The post Pi 3 booting part II: Ethernet appeared first on Raspberry Pi.

Looking for: Systems Administrator

Post Syndicated from Yev original https://www.backblaze.com/blog/looking-systems-administrator/

Want to join a rapidly expanding team and help us grow Backblaze to new heights? We’re looking for a Sys Admin who wants a challenging and fast-paced working environment. The position can be either in San Mateo, California or in our Rancho Cordova datacenter! Interested? Check out the job description and application details below:

Here’s what you’ll be working on:

    – Rebuild failed RAID arrays, diagnose and repair file system problems (ext4) and debug other operations problems with minimal supervision.
    – Administrative proficiency in software patches, releases and system upgrades.
    – Troubleshoot and resolve operational problems.
    – Help deploy, configure and maintain production systems.
    – Assist with networks and services (static/dynamic web servers, etc) as needed.
    – Assist in efforts to automate provisioning and other tasks that need to be run across hundreds of servers.
    – Help maintain monitoring systems to measure system availability and detect issues.
    – Help qualify hardware and components.
    – Participate in the 24×7 on-call pager rotation and respond to alerts as needed. This may include occasional trips to Backblaze datacenter(s).
    – Write, design, maintain and support operational Documentation and scripts.
    – Help train operations staff as needed.

This is a must:

    – Strong knowledge of Linux system administration, Debian experience preferred.
    – 4+ years of experience.
    – Bash scripting skills required.
    – Ability to lift/move 50-75 lbs and work down near the floor as needed.
    – Position based in the San Mateo Corporate Office or the Rancho Cordova Datacenter, California.

It would be nice if you had:

    – Experience configuring and supporting (Debian) Linux software RAID (mdadm).
    – Experience configuring and supporting file systems on Linux (Debian).
    – Experience troubleshooting server hardware/component issues.
    – Experience supporting Apache, Tomcat, and Java services.
    – Experience with automation in a production environment (Puppet/Chef/Ansible).
    – Experience supporting network equipment (layer 2 switches).

Required for all Backblaze Employees:

    – Good attitude and willingness to do whatever it takes to get the job done.
    – Strong desire to work for a small fast paced company.
    – Desire to learn and adapt to rapidly changing technologies and work environment.
    – Occasional visits to Backblaze datacenters necessary.
    – Rigorous adherence to best practices.
    – Relentless attention to detail.
    – Excellent interpersonal skills and good oral/written communication.
    – Excellent troubleshooting and problem solving skills.
    – OK with pets in office.

Backblaze is an Equal Opportunity Employer and we offer competitive salary and benefits, including our no policy vacation policy.

If this sounds like you — follow these steps:

  • Send an email to [email protected] with the position in the subject line.
  • Include your resume.
  • Tell us a bit about your Sys Admin experience and why you’re excited to work with Backblaze.

The post Looking for: Systems Administrator appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

My Raspberry Pi cluster

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/07/my-raspeberry-pi-cluster.html

So I accidentally ordered too many Raspberry Pi’s. Therefore, I built a small cluster out of them. I thought I’d write up a parts list for others wanting to build a cluster.

To start with, here are some pics of the cluster. What you see is a stack of 7 RPis. At the bottom of the stack is a USB multiport charger and also an Ethernet hub. You see USB cables coming out of the charger to power the RPis, and out the other side you see Ethernet cables connecting the RPis to a network. I’ve included the mouse and keyboard in the picture to give you a sense of perspective.

Here is the same stack turned around, seen from the other side. Out the bottom left you see three external cables: one Ethernet to my main network, and power cables for the USB charger and Ethernet hub. You can see that the USB hub is nicely tied down to the frame, but that the Ethernet hub is just sort of jammed in there somehow.

The concept is to get things as cheap as possible on a per-unit basis. Otherwise, one might as well just buy more expensive computers. My parts list for a 7x Pi cluster is:

$35.00/unit Raspberry Pi
 $6.50/unit stacking case from Amazon
 $5.99/unit micro SD flash from Newegg
 $4.30/unit power supply from Amazon
 $1.41/unit Ethernet hub from Newegg
 $0.89/unit 6 inch and 1-foot micro USB cable from Monoprice
 $0.57/unit 1 foot Ethernet cable from Monoprice

…or $54.65 per unit (or $383 for the entire cluster), which is around 50% more than the base Raspberry Pis alone. This is getting a bit expensive, as Newegg always has cheap Android tablets on closeout for $30 to $50.

So here’s a discussion of the parts.

Raspberry Pi 2

These are old boards I’d ordered a while back. They are up to RPi3 now with slightly faster processors and WiFi/Bluetooth on board, neither of which are useful for a cluster. It has four CPUs each running at 900 MHz as opposed to the RPi3 which has four 1.2 GHz processors. If you order a Raspberry Pi now, it’ll be the newer, better one.

The case

You’ll notice that the RPi’s are mounted on acrylic sheets, which are in turn held together with standoffs/spacers. This is a relatively expensive option.

A cheaper solution would be just to buy the spacers/standoffs yourself. They are a little hard to find, because the screws need to fit the 2.9mm holes, which are unusually tiny. Such spacers/standoffs are usually made of brass, but you can also find nylon ones. For the ends, you need some washers and screws. This will bring the price down to about $2/unit — or a lot cheaper if you are buying in bulk for a lot of units.

The micro-SD

The absolute cheapest micro SDs I could find were $2.95/unit for 4 GB, or half the price of the ones I bought. But the ones I chose are 4x the size and 2x the speed. RPi distros are getting large enough that they no longer fit well on 4 GB cards, and are even approaching 8 GB. Thus, 16 GB cards are the best choice, especially when I could get them for $6/unit. By the time you read this, the price of flash will have changed up or down. I search on Newegg, because that’s the easiest way to focus on the cheapest. Most cards should work, but check http://elinux.org/RPi_SD_cards to avoid any known bad chips.

Note that different cards have different speeds, which can have a major impact on performance. You probably don’t care for a cluster, but if you are buying a card for a development system, get the faster ones. The Samsung EVO cards are a good choice for something fast.

USB Charging Hub

What we want here is a charger, not a hub. Both can work, but the charger works better.

A normal hub is about connecting all your USB devices to your desktop/laptop. That doesn’t work for the RPi — its micro USB connector is just for power. It’s just leveraging the fact that there are already lots of USB power cables/chargers out there, so that it doesn’t have to invent a custom one.

USB hubs can supply some power to the RPi, enough to boot it. However, under load, or when you connect further USB devices to the RPi, there may not be enough power available. You might be able to run a couple of RPis from a normal hub, but when you’ve got all seven running (as in this stack), there might not be enough power. Power problems can outright crash the devices, but worse, they can lead to things like corrupt writes to the flash drives, slowly corrupting the system until it fails.

Luckily, in the last couple years we’ve seen suppliers of multiport chargers. These are designed for families (and workplaces) that have a lot of phones and tablets to charge. They can charge high-capacity batteries on all ports — supplying much more power than your RPi will ever need.

If you want to go ultra cheap, then cheap hubs at $1/port may be adequate. Chargers cost around $4/port.

The charger I chose in particular is the Bolse 60W 7-port charger. I only need exactly 7 ports. More ports would be nicer, in case I needed to power something else along with the stack, but this Bolse unit has the nice property that it fits snugly within the stack. The frame came with extra spacers which I could screw together to provide room. I then used zip ties to hold it firmly in place.

Ethernet hub

The RPis only have 100mbps Ethernet. Therefore, you don’t need a gigabit hub, which you’d normally get, but can choose a 100mbps hub instead: it’s cheaper, smaller, and lower power. The downside is that while each RPi only does 100-mbps, combined they will do 700-mbps, which the hub can’t handle.

I got a $10 hub from Newegg. As you can see, it fits within the frame, though not well. Every gigabit hub I’ve seen is bigger and could not fit this way.

Note that I have a couple extra RPis, but I only built a 7-high stack, because of the Ethernet hub. Hubs have only 8 ports, one of which is needed for the uplink. That leaves 7 devices. I’d have to upgrade to an unwieldy 16-port hub if I wanted more ports, which wouldn’t fit the nice clean case I’ve got.

For a gigabit option, Ethernet switches cost between $23 and $35. That $35 option is a “smart” switch that supports not only gigabit, but also a web-based configuration tool, VLANs, and some other high-end features. If I paid more for a switch, I’d probably go with the smart/managed one.

Cables (Ethernet, USB)

Buying cables is expensive, as everyone knows who’s bought an Apple cable for $30. But buying in bulk from specialty sellers can reduce the price to under $1/cable.

The chief buying factor is length. We want short cables that will just barely be long enough. In the pictures above, the Ethernet cables are 1-foot, as are two of the USB cables. The colored USB cables are 6-inch. I got these off Amazon because they looked cool, but now I’m regretting it.

The easiest, cheapest, and highest quality place to buy cables is Monoprice.com. It allows you to easily select the length and color.

To reach everything in this stack, you’ll need 1-foot cables, though 6-inch cables will work for some (but not all) of the USB devices. Instead of putting the hubs on the bottom, I could’ve put them in the middle of the stack; then 6-inch cables would’ve worked better — but I didn’t think that’d look as pretty. (I chose these colored cables because somebody suggested them, but they won’t work for the full seven-high tower.)

Power consumption

The power consumption of the entire stack is 13.3 watts while it’s idle. The Ethernet hub by itself was 1.3 watts (so low because it’s 100-mbps instead of gigabit).

So, rounding up from (13.3 W − 1.3 W) / 7 ≈ 1.7 W, that’s about 2 watts per RPi while idle.

In previous power tests, it’s an extra 2 to 3 watts while doing heavy computations, so for the entire stack, that can start consuming a significant amount of power. I mention this because people think in terms of a low-power alternative to Intel’s big CPUs, but in truth, once you’ve gotten enough RPis in a cluster to equal the computational power of an Intel processor, you’ll probably be consuming more electricity.

The operating system

I grabbed the latest Raspbian image and installed it on one of the RPis. I then removed the card, copied the files off (cp -a), reformatted it to use the f2fs flash file system, then copied the files back on. I then made an image of the card (using dd) and wrote that image to the 6 other cards. Finally, I logged into each one and renamed them rpi-a1, …, rpi-a7. (Security note: this means they all have the same SSH private key, but I don’t care.)
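
Roughly, that workflow looks like the following; the device names and mount points are assumptions for illustration, so double-check them against your own setup before running anything destructive:

$ sudo mount /dev/mmcblk0p2 /mnt/sdcard-root
$ sudo cp -a /mnt/sdcard-root/. /tmp/rootfs-backup/     # copy the files off the freshly installed card
$ sudo umount /mnt/sdcard-root
$ sudo mkfs.f2fs /dev/mmcblk0p2                         # reformat the root partition as f2fs
$ sudo mount /dev/mmcblk0p2 /mnt/sdcard-root
$ sudo cp -a /tmp/rootfs-backup/. /mnt/sdcard-root/     # copy the files back on
$ sudo umount /mnt/sdcard-root
$ sudo dd if=/dev/mmcblk0 of=rpi-f2fs.img bs=4M         # image the whole card
$ sudo dd if=rpi-f2fs.img of=/dev/sdX bs=4M             # write the image to each remaining card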

About flash file systems

The micro SD flash has a bit of wear leveling, but not enough. A lot of RPi servers I've installed in the past have failed after a few months with corrupted cards. I don't know exactly why, but I suspect flash wear is to blame.

Thus, I installed f2fs, a wear leveling file system designed especially for this sort of situation. We’ll see if that helps at all.

One big thing is to make sure atime is disabled, a massively brain dead feature inherited from 1980s Unix that writes to the disk every time you read from a file.
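
On Raspbian that's just a mount option; here's a minimal sketch, assuming an f2fs root on /dev/mmcblk0p2 (the device name and options are assumptions; adjust for your own card):

# /etc/fstab -- root entry with atime updates disabled
/dev/mmcblk0p2  /  f2fs  defaults,noatime  0  1

# apply to the running system without a reboot
sudo mount -o remount,noatime /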

I noticed that the green LED on the RPi, which indicates disk activity, flashes very briefly once per second (so quickly you'll miss it unless you look closely at the light). I used iotop -a to find out whether something was writing that often; I think the blink is just a hardware behavior and not related to actual disk activity. Even so, it's worth tracking down what writes might be happening in the background that will affect flash lifetime.

What I found was that there is some kernel thread that writes rarely to the disk, and an "f2fs garbage collector" that's cleaning up the disk for wear leveling. I saw nothing that looked like it was writing regularly to the disk.
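
If you want to repeat that check, iotop can accumulate per-process write totals over time; a quick sketch (-a accumulates I/O since iotop started, -o shows only processes actually doing I/O):

sudo iotop -a -o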

What to use it for?

So here’s the thing about an RPi cluster — it’s technically useless. If you run the numbers, it’s got less compute power and higher power consumption than a normal desktop/laptop computer. Thus, an entire cluster of them will still perform slower than laptops/desktops.

Thus, the point of a cluster is to have something to play with and experiment on, not to be the best form of computation. The point of individual RPis is not that they have better performance per watt, but that for many tasks you don't need much performance and you want a package that draws very few watts.

With that said, I should do some password cracking benchmarks with them, compared across CPUs and GPUs, measuring power consumption. That’ll be a topic for a later post.

That said, I will be using these, though as individual computers rather than as a "cluster". There are lots of services I want to run, but I don't want to run a full desktop running VMware; I'd rather control individual devices.

Conclusion

I’m not sure what I’m going to do with my little RPi stack/cluster, but I wanted to document everything about it so that others can replicate it if they want to.

Amazon EBS Update – New Cold Storage and Throughput Options

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ebs-update-new-cold-storage-and-throughput-options/

The AWS team spends a lot of time looking into ways to deliver innovation based around improvements in price/performance. Quite often, this means wrestling with interesting economic and technical dilemmas.

For example, it turns out that there are some really interesting trade-offs between HDD and SSD storage. On the one hand, today’s SSD devices provide more IOPS per dollar, more throughput per gigabyte, and lower latency than today’s HDD devices. On the other hand, continued density improvements in HDD technology drive the cost per gigabyte down, but also reduce the effective throughput per gigabyte. We took this as a challenge and asked ourselves—could we use cost-effective HDD devices to build a high-throughput storage option for EBS that would deliver consistent performance for common workloads like big data and log processing?

Of course we could!

Today we are launching a new pair of low-cost EBS volume types that take advantage of the scale of the cloud to deliver high throughput on a consistent basis, for use with EC2 instances and Amazon EMR clusters (prices are for the US East (Northern Virginia) Region; please see the EBS Pricing page for other regions):

  • Throughput Optimized HDD (st1) – Designed for high-throughput MapReduce, Kafka, ETL, log processing, and data warehouse workloads; $0.045 / gigabyte / month.
  • Cold HDD (sc1) – Designed for workloads similar to those for Throughput Optimized HDD that are accessed less frequently; $0.025 / gigabyte / month.

Like the existing General Purpose SSD (gp2) volume type, the new magnetic volumes give you baseline performance, burst performance, and a burst credit bucket. While the SSD volumes define performance in terms of IOPS (Input/Output Operations Per Second), the new volumes define it in terms of throughput. The burst values are based on the amount of storage provisioned for the volume:

  • Throughput Optimized HDD (st1) – Starts at 250 MB/s for a 1 terabyte volume, and grows by 250 MB/s for every additional provisioned terabyte until reaching a maximum burst throughput of 500 MB/s.
  • Cold HDD (sc1) – Starts at 80 MB/s for a 1 terabyte volume, and grows by 80 MB/s for every additional provisioned terabyte until reaching a maximum burst throughput of 250 MB/s.

Evolution of EBS
I like to think of customer-driven product and feature development in evolutionary terms. New offerings within a category often provide broad solutions that are a good fit for a wide variety of use cases. Over time, as we see how customers put the new offering to use and provide us with feedback on how we can do even better, a single initial offering will often speciate into several new offerings, each one tuned to the needs of a particular customer type and/or use case.

The various storage options for EC2 instances are a great example of this. Here’s a brief timeline of some of the most significant developments:

  • 2006 – EC2 launched with instance storage.
  • 2008 – EBS (Elastic Block Storage) launched on magnetic storage.
  • 2012 – EBS Provisioned IOPS and EBS-Optimized instances.
  • 2014 – SSD-Backed general purpose storage.
  • 2014 – EBS data volume encryption.
  • 2015 – Larger and faster EBS volumes.
  • 2015 – EBS boot volume encryption.
  • 2016 – EBS Throughput Optimized HDD (st1) and Cold HDD (sc1) volume types.

Workload Characteristics
We tuned these volumes to deliver great price/performance when used for big data workloads. In order to achieve the levels of performance that are possible with the volumes, your application must perform large and sequential I/O operations, which is typical of big data workloads. This is due to the nature of the underlying magnetic storage, which can transfer contiguous data with great rapidity. Small random access I/O operations (often generated by database engines) are less efficient and will result in lower throughput. The General Purpose SSD volumes are a much better fit for this access pattern.

For both of the new magnetic volume types, the burst credit bucket can grow until it reaches the size of the volume. In other words, when a volume’s bucket is full, you can scan the entire volume at the burst rate. Each I/O request of 1 megabyte or less counts as 1 megabyte’s worth of credit. Sequential I/O operations are merged into larger ones where possible; this can increase throughput and maximizes the value of the burst credit bucket (to learn more about how the bucket operates, visit the Performance Burst Details section of my New SSD-Backed Elastic Block Storage post).

If your application makes use of the file system and the operating system’s page cache (as just about all applications do), we recommend that you set the volume’s read-ahead buffer to 1 MiB on the EC2 instance that the volume is attached to. Here’s how you do that using an instance that is running Ubuntu or that was booted from the Amazon Linux AMI (adjust the device name as needed):

$ sudo blockdev --setra 2048 /dev/xvdf

The value is expressed as the number of 512-byte sectors to be used for buffering.

This value will improve read performance for workloads that consist of large, sequential reads. However, it may increase latency for workloads that consist of small, random read operations.
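
To verify the setting, you can read the read-ahead value back; it should report the number of 512-byte sectors you set (the device name here is just an example):

$ sudo blockdev --getra /dev/xvdf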

Most customers are using Linux kernel versions before 4.2 and the read ahead setting is all they need to tune. For customers using newer kernels, we also recommend setting xen_blkfront.max to 256 for the best performance. To set this parameter on an instance that runs the Amazon Linux AMI, edit /boot/grub/menu.list so that it invokes the kernel as follows:

kernel /boot/vmlinuz-4.4.5-15.26.amzn1.x86_64 root=LABEL=/ console=ttyS0 xen_blkfront.max=256

If your file contains multiple entries, edit the one that corresponds to the active kernel.  This is a boot-time setting so you’ll need to reboot the instance in order for the setting to take effect. If you are using a Linux distribution that does not use the Grub bootloader, you will need to figure out how to make the equivalent change to your configuration.
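
If your instance uses GRUB 2 rather than the legacy menu.list approach shown above (an assumption about your particular AMI or distribution), the equivalent change is usually made in /etc/default/grub, after which you regenerate the configuration and reboot:

# append xen_blkfront.max=256 to the GRUB_CMDLINE_LINUX line in /etc/default/grub, e.g.:
# GRUB_CMDLINE_LINUX="console=ttyS0 xen_blkfront.max=256"

sudo update-grub                                   # Debian/Ubuntu
# or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL/CentOS-style systems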

For more performance tuning tips, please read Amazon EBS Volume Performance on Linux Instances and Amazon EBS Volume Performance on Windows Instances.

Comparing EBS Volume Types
Here's a summary of the specifications and use cases of each EBS volume type (although not listed below, the original EBS Magnetic offering is still available if needed for your application):

  • Provisioned IOPS SSD (io1) – SSD. Use cases: I/O-intensive NoSQL and relational databases. Volume size: 4 GB – 16 TB. Max IOPS/volume: 20,000 (16 KB I/O size). Max IOPS/instance (using multiple volumes): 48,000. Max throughput/volume: 320 MB/s. Max throughput/instance: 800 MB/s. Price: $0.125/GB-month plus $0.065 per provisioned IOPS per month. Dominant performance attribute: IOPS.
  • General Purpose SSD (gp2) – SSD. Use cases: boot volumes, low-latency interactive applications, dev, test. Volume size: 1 GB – 16 TB. Max IOPS/volume: 10,000 (16 KB I/O size). Max IOPS/instance (using multiple volumes): 48,000. Max throughput/volume: 160 MB/s. Max throughput/instance: 800 MB/s. Price: $0.100/GB-month. Dominant performance attribute: IOPS.
  • Throughput Optimized HDD (st1) – HDD. Use cases: big data, data warehouses, log processing. Volume size: 500 GB – 16 TB. Max IOPS/volume: 500 (1 MB I/O size). Max IOPS/instance (using multiple volumes): 48,000. Max throughput/volume: 500 MB/s. Max throughput/instance: 800 MB/s. Price: $0.045/GB-month. Dominant performance attribute: MB/s.
  • Cold HDD (sc1) – HDD. Use cases: colder data requiring fewer scans per day. Volume size: 500 GB – 16 TB. Max IOPS/volume: 250 (1 MB I/O size). Max IOPS/instance (using multiple volumes): 48,000. Max throughput/volume: 250 MB/s. Max throughput/instance: 800 MB/s. Price: $0.025/GB-month. Dominant performance attribute: MB/s.

You also have the option to further boost performance by using EBS-Optimized instances and RAID to create file systems that are larger and/or support more IOPS. Read about RAID Configuration on Linux and RAID Configuration on Windows to learn more.
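
As a sketch of the RAID 0 approach on Linux (the volume device names and mount point are assumptions; striping raises aggregate throughput up to the instance limit, but a single volume failure takes out the whole array):

# create a RAID 0 array from two attached EBS volumes
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg

# build a file system and mount it
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /data
sudo mount /dev/md0 /data

# the read-ahead advice above applies to the md device as well
sudo blockdev --setra 2048 /dev/md0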

CloudFormation Template for Testing
In order to make it as easy as possible for you to set up a test environment on a reproducible basis, we have created a simple CloudFormation template. You can launch the st1 template to create an EC2 instance with a 2 terabyte st1 volume attached. The st1 template instructions contain some additional information.

Available Now
The new volume types are available now and you can start using them today with EC2 and EMR. You can create them from the AWS Management Console, AWS Command Line Interface (CLI), AWS Tools for Windows PowerShell, AWS CloudFormation templates, the AWS SDKs, and so forth.

As you can see from the comparison above, this new offering gives you a unique combination of high throughput and a very low cost per gigabyte.

I am looking forward to your feedback so that we can continue to evolve EBS to meet your ever-growing (and continually diversifying) needs. Leave me a comment and I’ll make sure that the team sees it.


Jeff;

PS – If you are a developer, development manager, or product manager and would like to build systems like this, please visit the EBS Jobs page.

Top 10 Performance Tuning Techniques for Amazon Redshift

Post Syndicated from Ian Meyers original https://blogs.aws.amazon.com/bigdata/post/Tx31034QG0G3ED1/Top-10-Performance-Tuning-Techniques-for-Amazon-Redshift

Ian Meyers is a Principal Solutions Architect with Amazon Web Services

Zach Christopherson, an Amazon Redshift Database Engineer, contributed to this post

Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. Customers use Amazon Redshift for everything from accelerating existing database environments that are struggling to scale, to ingestion of web logs for big data analytics use cases. Amazon Redshift provides an industry standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and re-use existing analytics queries.

Amazon Redshift can run any type of data model, from a production transaction system third-normal-form model, to star and snowflake schemas, or simple flat tables. As customers adopt Amazon Redshift, they must consider its architecture in order to ensure that their data model is correctly deployed and maintained by the database. This post takes you through the most common issues that customers find as they adopt Amazon Redshift, and gives you concrete guidance on how to address each. If you address each of these items, you should be able to achieve optimal performance of queries and be able to scale effectively to meet customer demand.

Issue #1: Incorrect column encoding

Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded (see Choosing a Column Compression Type in the Amazon Redshift Database Developer Guide) , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.

Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied. To determine if you are deviating from this best practice, run the following query to determine if any tables have NO column encoding applied:

SELECT database, schema || '.' || "table" AS "table", encoded, size
FROM svv_table_info
WHERE encoded='N'
ORDER BY 2;

Afterward, review the tables and columns which aren’t encoded by running the following query:

SELECT trim(n.nspname || '.' || c.relname) AS "table",trim(a.attname) AS "column",format_type(a.atttypid, a.atttypmod) AS "type",
format_encoding(a.attencodingtype::integer) AS "encoding", a.attsortkeyord AS "sortkey"
FROM pg_namespace n, pg_class c, pg_attribute a
WHERE n.oid = c.relnamespace AND c.oid = a.attrelid AND a.attnum > 0 AND NOT a.attisdropped and n.nspname NOT IN ('information_schema','pg_catalog','pg_toast') AND format_encoding(a.attencodingtype::integer) = 'none' AND c.relkind='r' AND a.attsortkeyord != 1 ORDER BY n.nspname, c.relname, a.attnum;

If you find that you have tables without optimal column encoding, then use the Amazon Redshift Column Encoding Utility on AWS Labs GitHub to apply encoding. This command line utility uses the ANALYZE COMPRESSION command on each table. If encoding is required, it generates a SQL script which creates a new table with the correct encoding, copies all the data into the new table, and then transactionally renames the new table to the old name while retaining the original data. (Please note that the first column in a compound sort key should not be encoded, and is not encoded by this utility.)
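
If you want to try the utility, it lives in the amazon-redshift-utils repository on AWS Labs GitHub; a quick sketch (the directory layout reflects the repository at the time of writing and may change):

git clone https://github.com/awslabs/amazon-redshift-utils.git
cd amazon-redshift-utils/src/ColumnEncodingUtility
# the Python utility in this directory connects to your cluster and generates the encoding DDL;
# see its README for the connection and schema flags it expects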

Issue #2 – Skewed table data

Amazon Redshift is a distributed, shared-nothing database architecture where each node in the cluster stores a subset of the data. When a table is created, you decide whether to spread the data evenly among the nodes (the default) or to place data on nodes based on the values in one of the columns. By choosing columns for distribution that are commonly joined together, you can minimize the amount of data transferred over the network during the join. This can significantly increase performance on these types of queries.

The selection of a good distribution key is the topic of many AWS articles, including Choose the Best Distribution Style; see a definitive guide to distribution and sorting of star schemas in the Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift blog post. In general, a good distribution key should exhibit the following properties:

High cardinality – There should be a large number of unique data values in the column relative to the number of nodes in the cluster.

Uniform distribution/low skew – Each unique value in the distribution key should occur in the table about the same number of times. This allows Amazon Redshift to put the same number of records on each node in the cluster.

Commonly joined – The columns in a distribution key should be those that you usually join to other tables. If you have many possible columns that fit this criterion, then you may choose the column that joins to the largest table.

A skewed distribution key results in some nodes working harder than others during query execution, with unbalanced CPU or memory use, and queries ultimately running only as fast as the slowest node.

If skew is a problem, you typically see that node performance is uneven on the cluster. Use one of the admin scripts in the Amazon Redshift Utils GitHub repository, such as table_inspector.sql, to see how data blocks in a distribution key map to the slices and nodes in the cluster.
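
The admin scripts are plain SQL files, so any SQL client can run them; for example, with psql (the cluster endpoint, database, user, and file path below are placeholders based on the repository layout):

psql -h mycluster.abc123xyz789.us-east-1.redshift.amazonaws.com -p 5439 -d mydb -U admin \
     -f amazon-redshift-utils/src/AdminScripts/table_inspector.sql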

If you find that you have tables with skewed distribution keys, then consider changing the distribution key to a column that exhibits high cardinality and uniform distribution. Evaluate a candidate column as a distribution key by creating a new table using CTAS:

CREATE TABLE MY_TEST_TABLE DISTKEY (<COLUMN NAME>) AS SELECT * FROM <TABLE NAME>;

Run the table_inspector.sql script against the table again to analyze data skew.

If there is no good distribution key in any of your records, you may find that moving to EVEN distribution works better, due to the lack of a single node being a hotspot. For small tables, you can also use DISTSTYLE ALL to place table data onto every node in the cluster.

Issue #3 – Queries not benefiting from sort keys

Amazon Redshift tables can have a sort key column identified, which acts like an index in other databases but which does not incur a storage cost as with other platforms (for more information, see Choosing Sort Keys). A sort key should be created on those columns which are most commonly used in WHERE clauses. If you have a known query pattern, then COMPOUND sort keys give the best performance; if end users query different columns equally, then use an INTERLEAVED sort key.

To determine which tables don’t have sort keys, and how often they have been queried, run the following query:

SELECT database, table_id, schema || '.' || "table" AS "table", size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;

You can run a tutorial that walks you through how to address unsorted tables in the Amazon Redshift Developer Guide. You can also take advantage of another GitHub admin script that recommends sort keys based on query activity. Bear in mind that queries evaluated against a sort key column must not apply a SQL function to the sort key; instead, ensure that you apply the functions to the compared values so that the sort key is used. This is commonly found on TIMESTAMP columns that are used as sort keys.

Issue #4 – Tables without statistics or which need vacuum

Amazon Redshift, like other databases, requires statistics about tables and the composition of data blocks being stored in order to make good decisions when planning a query (for more information, see Analyzing Tables). Without good statistics, the optimiser may make suboptimal or incorrect choices about the order in which to access tables, or how to join datasets together.

The ANALYZE Command History topic in the Amazon Redshift Developer Guide supplies queries to help you address missing or stale statistics, and you can also simply run the missing_table_stats.sql admin script to determine which tables are missing stats, or the statement below to determine tables that have stale statistics:

SELECT database, schema || '.' || "table" AS "table", stats_off
FROM svv_table_info
WHERE stats_off > 5
ORDER BY 2;

In Amazon Redshift, data blocks are immutable. When rows are DELETED or UPDATED, they are simply logically deleted (flagged for deletion) but not physically removed from disk. Updates result in a new block being written with new data appended. Both of these operations cause the previous version of the row to continue consuming disk space and continue being scanned when a query scans the table. As a result, table storage space is increased and performance degraded due to otherwise avoidable disk I/O during scans. A VACUUM command recovers the space from deleted rows and restores the sort order.

To address issues with tables with missing or stale statistics or where vacuum is required, run another AWS Labs utility, Analyze & Vacuum Schema. This ensures that you always keep up-to-date statistics, and only vacuum tables that actually need reorganisation.

Issue #5 – Tables with very large VARCHAR columns

During processing of complex queries, intermediate query results might need to be stored in temporary blocks. These temporary tables are not compressed, so unnecessarily wide columns consume excessive memory and temporary disk space, which can affect query performance. For more information, see Use the Smallest Possible Column Size.

Use the following query to generate a list of tables that should have their maximum column widths reviewed:

SELECT database, schema || '.' || "table" AS "table", max_varchar
FROM svv_table_info
WHERE max_varchar > 150
ORDER BY 2;

After you have a list of tables, identify which table columns have wide varchar columns and then determine the true maximum width for each wide column, using the following query:

SELECT max(len(rtrim(column_name)))
FROM table_name;

In some cases, you may have large VARCHAR type columns because you are storing JSON fragments in the table, which you then query with JSON functions. If you query the top running queries for the database using the top_queries.sql admin script, pay special attention to SELECT * queries which include the JSON fragment column. If end users query these large columns but don't actually execute JSON functions against them, consider moving them into another table that contains only the primary key column of the original table and the JSON column.

If you find that the table has columns that are wider than necessary, then you need to re-create a version of the table with appropriate column widths by performing a deep copy.

Issue #6 – Queries waiting on queue slots

Amazon Redshift runs queries using a queuing system known as workload management (WLM). You can define up to 8 queues to separate workloads from each other, and set the concurrency on each queue to meet your overall throughput requirements.

In some cases, the queue to which a user or query has been assigned is completely busy and a user’s query must wait for a slot to be open. During this time, the system is not executing the query at all, which is a sign that you may need to increase concurrency.

First, you need to determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but they share the same memory allocation (unless you increase it). You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal, as we'll see next.

Issue #7 – Queries that are disk-based

If a query isn’t able to completely execute in memory, it may need to use disk-based temporary storage for parts of an explain plan. The additional disk I/O slows down the query, and can be addressed by increasing the amount of memory allocated to a session (for more information, see WLM Dynamic Memory Allocation).

To determine if any queries have been writing to disk, use the following query:

SELECT
q.query, trim(q.cat_text)
FROM (SELECT query, replace( listagg(text,' ') WITHIN GROUP (ORDER BY sequence), '\n', ' ') AS cat_text FROM stl_querytext WHERE userid>1 GROUP BY query) q
JOIN
(SELECT distinct query FROM svl_query_summary WHERE is_diskbased='t' AND (LABEL LIKE 'hash%' OR LABEL LIKE 'sort%' OR LABEL LIKE 'aggr%') AND userid > 1) qs ON qs.query = q.query;

Based on the user or the queue assignment rules, you can increase the amount of memory given to the selected queue to prevent queries needing to spill to disk to complete. You can also increase the WLM_QUERY_SLOT_COUNT (http://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html) for the session from the default of 1 to the maximum concurrency for the queue. As outlined in Issue #6, this may result in queueing queries, so use it with care.

Issue #8 – Commit queue waits

Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to a commit queue.

If you are committing too often on your database, you will start to see waits on the commit queue increase, which can be viewed with the commit_stats.sql admin script. This script shows the largest queue length and queue time for queries run in the past two days. If you have queries that are waiting on the commit queue, then look for sessions that are committing multiple times per session, such as ETL jobs that are logging progress or inefficient data loads.

Issue #9 – Inefficient data loads

Amazon Redshift best practices suggest the use of the COPY command to perform data loads. This API operation uses all compute nodes in the cluster to load data in parallel, from sources such as Amazon S3, Amazon DynamoDB, Amazon EMR HDFS file systems, or any SSH connection.

When performing data loads, you should compress the files to be loaded whenever possible; Amazon Redshift supports both GZIP and LZO compression. It is more efficient to load a large number of small files than one large one, and the ideal file count is a multiple of the slice count. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 16 slices. By ensuring you have an equal number of files per slice, you know that COPY execution will use cluster resources evenly and complete as quickly as possible.

An anti-pattern is to insert data directly into Amazon Redshift, with single record inserts or the use of a multi-value INSERT statement, which allows up to 16 MB of data to be inserted at one time. These are leader node–based operations, and can create significant performance bottlenecks by maxing out the leader node CPU or memory.

Issue #10 – Inefficient use of Temporary Tables

Amazon Redshift provides temporary tables, which are like normal tables except that they are only visible within a single session. When the user disconnects the session, the tables are automatically deleted. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table, while the SELECT … INTO and C(T)TAS commands use the input data to determine column names, sizes, and data types, and use default storage properties.

These default storage properties may cause issues if not carefully considered. Amazon Redshift’s default table structure is to use EVEN distribution with no column encoding. This is a sub-optimal data structure for many types of queries, and if you are using select/into syntax you cannot set the column encoding or distribution and sort keys.

It is highly recommended that you convert all select/into syntax to use the CREATE statement. This ensures that your temporary tables have column encoding and are distributed in a fashion that is sympathetic to the other entities that are part of the workflow. To convert a statement which uses:

select column_a, column_b into #my_temp_table from my_table;

You would first analyse the temporary table for optimal column encoding (for example, with the ANALYZE COMPRESSION command, which reports a suggested encoding for each column).

And then convert the select/into statement to:

BEGIN;
create temporary table my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b); -- Assuming you are sorting or grouping by column_b

insert into my_temp_table select column_a, column_b from my_table;
COMMIT;

You may also wish to analyze statistics on the temporary table, if it is used as a join table for subsequent queries:

analyze my_temp_table;

This way, you retain the functionality of using temporary tables but control data placement on the cluster through distkey assignment and take advantage of the columnar nature of Amazon Redshift through use of Column Encoding.

Tip: Using explain plan alerts

The last tip is to use diagnostic information from the cluster during query execution. This is stored in an extremely useful view called STL_ALERT_EVENT_LOG. Use the perf_alert.sql admin script to diagnose issues that the cluster has encountered over the last seven days. This is an invaluable resource in understanding how your cluster develops over time.

Summary

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. While Amazon Redshift can run any type of data model, you can avoid possible pitfalls that might decrease performance or increase cost, by being aware of how data is stored and managed. Run a simple set of diagnostic queries for common issues and ensure that you get the best performance possible.

If you have questions or suggestions, please leave a comment below.

UPDATE: This blog post has been translated into Japanese:

————————————

Related

Best Practices for Micro-Batch Loading on Amazon Redshift

;

systemd For Administrators, Part XXI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemd-for-administrators-part-xxi.html

Container Integration
For a while now, containers have been one of the hot topics on
Linux. Container managers such as libvirt-lxc, LXC or Docker are
widely known and used these days. In this blog story I want to shed
some light on systemd's integration points with container managers, to
allow seamless management of services across container boundaries.
We’ll focus on OS containers here, i.e. the case where an init system
runs inside the container, and the container hence in most ways
appears like an independent system of its own. Much of what I describe
here is available on pretty much any container manager that implements
the logic described here, including libvirt-lxc. However, to make
things easy we'll focus on
systemd-nspawn,
the mini-container manager that is shipped with systemd
itself. systemd-nspawn uses the same kernel interfaces as the other
container managers, however is less flexible as it is designed to be a
container manager that is as simple to use as possible and “just
works”, rather than trying to be a generic tool you can configure in
every low-level detail. We use systemd-nspawn extensively when
developing systemd.
Anyway, so let’s get started with our run-through. Let’s start by
creating a Fedora container tree in a subdirectory:
# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in
/srv/mycontainer. This command line is Fedora-specific, but most
distributions provide similar functionality in one way or another. The
examples section in the systemd-nspawn(1) man page contains a list of
the various command lines for other distributions.
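For example, a rough Debian/Ubuntu equivalent might look like the following (the suite, mirror, and extra packages here are assumptions; check the man page for the exact recommended invocation for your distribution):
# debootstrap --include=systemd,dbus stable /srv/mycontainer http://deb.debian.org/debian
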
We now have the new container installed, let’s set an initial root password:
# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then
use passwd to set the root password. After that the initial setup is done,
hence let’s boot it up and log in as root with our new password:
$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[ OK ] Reached target Remote File Systems.
[ OK ] Created slice Root Slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Created slice System Slice.
[ OK ] Created slice system-getty.slice.
[ OK ] Reached target Slices.
[ OK ] Listening on Delayed Shutdown Socket.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket.
Starting Journal Service…
[ OK ] Started Journal Service.
[ OK ] Reached target Paths.
Mounting Debug File System…
Mounting Configuration File System…
Mounting FUSE Control File System…
Starting Create static device nodes in /dev…
Mounting POSIX Message Queue File System…
Mounting Huge Pages File System…
[ OK ] Reached target Encrypted Volumes.
[ OK ] Reached target Swap.
Mounting Temporary Directory…
Starting Load/Save Random Seed…
[ OK ] Mounted Configuration File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Mounted Temporary Directory.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Mounted Debug File System.
[ OK ] Mounted Huge Pages File System.
[ OK ] Started Load/Save Random Seed.
[ OK ] Started Create static device nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Trigger Flushing of Journal to Persistent Storage…
Starting Recreate Volatile Files and Directories…
[ OK ] Started Recreate Volatile Files and Directories.
Starting Update UTMP about System Reboot/Shutdown…
[ OK ] Started Trigger Flushing of Journal to Persistent Storage.
[ OK ] Started Update UTMP about System Reboot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Login Service…
Starting Permit User Sessions…
Starting D-Bus System Message Bus…
[ OK ] Started D-Bus System Message Bus.
Starting Cleanup of Temporary Directories…
[ OK ] Started Cleanup of Temporary Directories.
[ OK ] Started Permit User Sessions.
Starting Console Getty…
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
[ OK ] Started Login Service.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container
integration of systemd. Let’s have a look at the first tool,
machinectl. When run without parameters it shows a list of all
locally running containers:
$ machinectl
MACHINE CONTAINER SERVICE
mycontainer container nspawn

1 machines listed.

The “status” subcommand shows details about the container:
$ machinectl status mycontainer
mycontainer:
Since: Mi 2014-11-12 16:47:19 CET; 51s ago
Leader: 5374 (systemd)
Service: nspawn; class container
Root: /srv/mycontainer
Address: 192.168.178.38
10.36.6.162
fd00::523f:56ff:fe00:4994
fe80::523f:56ff:fe00:4994
OS: Fedora 20 (Heisenbug)
Unit: machine-mycontainer.scope
├─5374 /usr/lib/systemd/systemd
└─system.slice
├─dbus.service
│ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act…
├─systemd-journald.service
│ └─5383 /usr/lib/systemd/systemd-journald
├─systemd-logind.service
│ └─5411 /usr/lib/systemd/systemd-logind
└─console-getty.service
└─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container,
including its control group tree (with processes), IP addresses and
root directory.
The “login” subcommand gets us a new login shell in the container:
# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The “reboot” subcommand reboots the container:
# machinectl reboot mycontainer

The “poweroff” subcommand powers the container off:
# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple more
commands; please check the man page for details. Note again that even
though we use systemd-nspawn as the container manager here, the
concepts apply to any container manager that implements the logic
described here, including libvirt-lxc for example.
machinectl is not the only tool that is useful in conjunction with
containers. Many of systemd’s own tools have been updated to
explicitly support containers too! Let’s try this (after first
starting the container up again, repeating the systemd-nspawn command
from above):
# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its
hostname. Similarly, many other tools have been updated for
connecting to local containers. Here’s systemctl(1)’s -M switch in
action:
# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified
container, not the host. (Output is shortened here; the blog story is
already getting too long.)
Let’s use this to restart a service within our container:
# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support than just the -M switch,
though. With the -r switch it shows the units running on the host,
plus all units of all local, running containers:
# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0x2dLVDSx2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, followed by the units of
the one container we currently have running. The units of the
container are prefixed with the container name and a colon
(“:”). (The output is shortened again for brevity’s sake.)
The list-machines subcommand of systemctl shows a list of all
running containers, querying the system managers within the
containers about their state and health. More specifically, it shows
whether the containers have properly booted up and whether there are
any failed services:
# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting, we have started two more containers
in parallel. One of them has a failed service, which results in the
machine state being reported as degraded.
Let’s have a look at
journalctl(1)‘s
container support. It too supports -M to show the logs of a specific
container:
# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes…
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the
host and all local containers:
# journalctl -m -e

(Let’s skip the output here completely; I figure you can extrapolate
how this looks.)
But it’s not only systemd’s own tools that understand containers
these days; procps sports support for them, too:
# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the
container a process belongs to. All processes shown with “-” belong to
the host itself.
But it doesn’t stop there. The new “sd-bus” D-Bus client library we
have been preparing in the systemd/kdbus context knows containers
too. While you use sd_bus_open_system() to connect to your local
host’s system bus, sd_bus_open_system_container() may be used to
connect to the system bus of any local container, so that you can
execute bus methods on it.
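As a rough illustration, here is a minimal sketch of what that looks
like from C. Assumptions: the libsystemd headers are installed and you
link with -lsystemd; querying the Hostname property of hostname1 is
just an arbitrary example of a bus call; in later systemd versions
this call was renamed sd_bus_open_system_machine().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus *bus = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        char *hostname = NULL;
        int r;

        /* Connect to the system bus inside the container "mycontainer" */
        r = sd_bus_open_system_container(&bus, "mycontainer");
        if (r < 0) {
                fprintf(stderr, "Failed to connect: %s\n", strerror(-r));
                return 1;
        }

        /* Execute a bus method on it: read hostname1's Hostname property */
        r = sd_bus_get_property_string(bus,
                                       "org.freedesktop.hostname1",
                                       "/org/freedesktop/hostname1",
                                       "org.freedesktop.hostname1",
                                       "Hostname",
                                       &error, &hostname);
        if (r < 0)
                fprintf(stderr, "Query failed: %s\n", error.message);
        else
                printf("Hostname of the container: %s\n", hostname);

        free(hostname);
        sd_bus_error_free(&error);
        sd_bus_unref(bus);
        return r < 0;
}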
sd-login.h and machined’s bus interface provide a number of APIs to
add container support to other programs too. They support enumeration
of containers as well as retrieving the machine name from a PID, and
similar.
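For instance, a minimal sketch using sd-login.h might look like the
following (again linking with -lsystemd; the PID 5750 is simply taken
from the ps listing above as an example of a container process):

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-login.h>

int main(void) {
        char **machines = NULL;
        char *machine = NULL;
        int n, i;

        /* Enumerate all currently running machines */
        n = sd_get_machine_names(&machines);
        for (i = 0; i < n; i++)
                printf("machine: %s\n", machines[i]);

        /* Look up which machine a PID belongs to; fails for host processes */
        if (sd_pid_get_machine_name(5750, &machine) >= 0)
                printf("PID 5750 belongs to machine %s\n", machine);

        for (i = 0; i < n; i++)
                free(machines[i]);
        free(machines);
        free(machine);
        return 0;
}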
systemd-networkd also has support for containers. When run inside a
container, it will by default run a DHCP client and IPv4LL on any
veth network interface named host0 (this interface is special under
the logic described here). When run on the host, networkd will by
default provide a DHCP server and IPv4LL on any veth network
interface named “ve-” followed by the container name.
Let’s have a look at one last facet of systemd’s container
integration: the hook-up with the name service switch. Recent systemd
versions contain a new NSS module, nss-mymachines, that makes the
names of all local containers resolvable via gethostbyname() and
getaddrinfo(). This only applies to containers that run within their
own network namespace. With the systemd-nspawn command shown above,
however, the container shares the network configuration with the
host; hence let’s restart the container, this time with a virtual
veth network link between host and container:
# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now (assuming that networkd is used both inside the container and on
the host) we can already ping the container using its name, due to
the simple magic of nss-mymachines:
# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping; it works with
all other tools that use libc’s gethostbyname() or getaddrinfo() too,
among them the venerable ssh.
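To illustrate, a small getaddrinfo() sketch is enough to resolve the
container name programmatically (this assumes the mymachines module
is listed in the hosts line of /etc/nsswitch.conf on the host):

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
        struct addrinfo hints, *res, *p;
        char buf[INET6_ADDRSTRLEN];
        int r;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;        /* IPv4 and IPv6 */
        hints.ai_socktype = SOCK_STREAM;

        /* Resolved through NSS, i.e. via nss-mymachines on the host */
        r = getaddrinfo("mycontainer", NULL, &hints, &res);
        if (r != 0) {
                fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(r));
                return 1;
        }

        for (p = res; p; p = p->ai_next) {
                const void *addr = p->ai_family == AF_INET ?
                        (const void *) &((struct sockaddr_in *) p->ai_addr)->sin_addr :
                        (const void *) &((struct sockaddr_in6 *) p->ai_addr)->sin6_addr;
                printf("mycontainer -> %s\n",
                       inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }

        freeaddrinfo(res);
        return 0;
}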
And this is pretty much all I want to cover for now. We briefly
touched on a variety of integration points, and there’s a lot more
still if you look closely. We are working on even more container
integration all the time, so expect more new features in this area
with every systemd release.
Note that the whole machine concept is actually not limited to
containers, but covers VMs too, to a certain degree. However, the
integration is not as close, since access to a VM’s internals is not
as easy as it is for containers: it usually requires a network
transport instead of allowing direct syscall access.
Anyway, I hope this is useful. For further details, please have a look
at the linked man pages and other documentation.

Mucking About With SquashFS

Post Syndicated from Craig original http://www.devttys0.com/2014/08/mucking-about-with-squashfs/

SquashFS is an incredibly popular file system for embedded Linux devices. Unfortunately, it is also notorious for being hacked up by vendors, causing the standard SquashFS tools (i.e., unsquashfs) to fail when extracting these file systems.
While projects like the Firmware-Mod-Kit (FMK) have amassed many unsquashfs utilities to work with a wide range of SquashFS variations found in the wild, this approach has several drawbacks, most notably that each individual unsquashfs tool only supports its one particular variation. If you run into a SquashFS image that is mostly compatible with a given unsquashfs tool, but has some minor modification, you can’t extract it – and worse, you probably don’t know why.
So what are these “minor modifications” that cause unsquashfs to fail?

It generally comes down to compression, specifically, lzma. Although SquashFS 4.0 now supports a wide variety of compression types, ’twas not always thus. Prior to version 4, SquashFS only officially supported zlib compression. However, lzma compresses much smaller, so many embedded vendors hacked in lzma support, and of course they all did it in a slightly different way.
Some vendors put the standard 13-byte lzma header in front of all their compressed data blocks, which includes important compression metadata, most notably the lzma properties used to compress the data block:

struct lzma_header
{
    uint8_t  properties;        // Contains the lc, lp, and pb property values
    uint32_t dictionary_size;
    uint64_t uncompressed_size;
};

This makes decompressing each data block straightforward; even so, the official SquashFS tools assume that any SquashFS file system prior to 4.0 is compressed using zlib, requiring special lzma versions of these tools to be built in order to support lzma-compressed file systems prior to version 4.
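For reference, that single properties byte packs the three values together as (pb * 5 + lp) * 9 + lc, so a decompressor can recover them with a few lines of arithmetic. Here is a small sketch of that decoding (an illustration, not code from any particular unsquashfs variant):

#include <stdint.h>
#include <stdio.h>

struct lzma_props { uint8_t lc, lp, pb; };

/* Unpack the lzma "properties" byte: props = (pb * 5 + lp) * 9 + lc */
static int lzma_decode_properties(uint8_t props, struct lzma_props *out)
{
    if (props >= 9 * 5 * 5)     /* anything >= 225 is invalid */
        return -1;
    out->lc = props % 9;
    props /= 9;
    out->lp = props % 5;
    out->pb = props / 5;
    return 0;
}

int main(void)
{
    struct lzma_props p;

    /* 0x5D (93) is the common default: lc=3, lp=0, pb=2 */
    if (lzma_decode_properties(0x5D, &p) == 0)
        printf("lc=%u lp=%u pb=%u\n", p.lc, p.lp, p.pb);
    return 0;
}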
Some vendors omitted the uncompressed size field from the lzma header of each data block:

struct lzma_header
{
    uint8_t  properties;        // Contains the lc, lp, and pb property values
    uint32_t dictionary_size;
    //uint64_t uncompressed_size;
};

This kind of makes sense, since the uncompressed size field is not really required anyway; SquashFS code will know the exact, or at least the maximum, size of each data block, and lzma itself will just keep decompressing data until it’s done. While it is valid to set the uncompressed size field to -1 in the lzma header if the size of the original data is not known at compression time, lzma decompressors still expect this field to exist. If it doesn’t, the decompressor will interpret whatever bytes happen to be there as the uncompressed size field, which likely won’t make sense, and decompression will fail.
Other implementations decided to encode lzma properties for each compressed data block using their own custom structure. Take DD-WRT for example:

struct lzma_header
{
    uint8_t pb;
    uint8_t lc;
    uint8_t lp;
    uint8_t unk;
};

Some just use hard-coded compression properties for all data blocks, so there’s no lzma header at the beginning of their compressed data blocks at all. Further, these properties are not necessarily the default lzma property values:

// lzma zlib simplified wrapper
#include <zlib.h>

#define ZLIB_LC 0 // The default value for lc is 3; here, it's been changed to 0
#define ZLIB_LP 0
#define ZLIB_PB 2

Still others throw seemingly unnecessary data into the beginning of their data blocks, like the string “7zip”.
Due to the use of non-standard compression, many vendors also change the SquashFS “magic bytes”, which makes standard unsquashfs utilities think that the SquashFS image is invalid.
All this, coupled with the fact that most unsquashfs utilities are pedantic about which SquashFS version(s) they support, requires anyone interested in extracting embedded file systems to litter their system with many different unsquashfs variants.

Luckily, the latest unsquashfs utility supports all versions of SquashFS (v1 – v4). While it still suffers from all the other problems described above, it provides a useful base from which to develop a more “hacker friendly” tool.
In a (perhaps futile) attempt to write one extraction tool to support as many SquashFS variations as possible, sasquatch was born. It’s basically unsquashfs v4.3 that has been modified with some nifty features:

Doesn’t care about the SquashFS magic bytes
Doesn’t trust the reported compression header field
Tries all supported decompressors until it finds one that works, regardless of the SquashFS version
Adds some vendor-specific lzma implementations to the supported decompressor list
Includes an “adaptive” lzma decompressor that attempts to dynamically identify lzma compression options
Provides more fine-grained command line control over decompression and debug output

The adaptive lzma decompressor is perhaps the best feature, as it not only generically auto-detects and decompresses several known vendor variations, but can potentially detect and decompress as-yet-unknown variations as well. In fact, it has already been able to extract SquashFS images that could not be extracted by any of the unsquashfs utilities in the Firmware-Mod-Kit.
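To give a feel for the general idea (this is an illustrative sketch under stated assumptions, not sasquatch’s actual code), an adaptive approach boils down to sweeping the plausible lc/lp/pb ranges with a raw decoder until one combination decodes a headerless block cleanly. The sketch below uses liblzma’s raw LZMA1 decoder with the default dictionary size and is built with -llzma:

#include <stdint.h>
#include <stdio.h>
#include <lzma.h>

/* Try one lc/lp/pb combination; returns 0 on success and sets *out_len. */
static int try_raw_lzma(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t out_cap, size_t *out_len,
                        uint32_t lc, uint32_t lp, uint32_t pb)
{
    lzma_options_lzma opt = {
        .dict_size = LZMA_DICT_SIZE_DEFAULT,
        .lc = lc, .lp = lp, .pb = pb,
    };
    lzma_filter filters[] = {
        { .id = LZMA_FILTER_LZMA1, .options = &opt },
        { .id = LZMA_VLI_UNKNOWN,  .options = NULL },
    };
    size_t in_pos = 0, out_pos = 0;

    if (lzma_raw_buffer_decode(filters, NULL, in, &in_pos, in_len,
                               out, &out_pos, out_cap) != LZMA_OK)
        return -1;
    *out_len = out_pos;
    return 0;
}

/* Sweep the practically used property ranges until something decodes. */
int adaptive_decompress(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t out_cap, size_t *out_len)
{
    for (uint32_t pb = 0; pb <= 4; pb++)
        for (uint32_t lp = 0; lp <= 4; lp++)
            for (uint32_t lc = 0; lc <= 4; lc++)
                if (try_raw_lzma(in, in_len, out, out_cap,
                                 out_len, lc, lp, pb) == 0) {
                    fprintf(stderr, "decoded with lc=%u lp=%u pb=%u\n",
                            lc, lp, pb);
                    return 0;
                }
    return -1;
}

A real tool would additionally sanity-check the decompressed output (for example against the block size recorded in the SquashFS superblock) before trusting a combination, since a wrong parameter set can occasionally decode without error but produce garbage.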
With that said, the code is still beta and there are a couple of known SquashFS images that sasquatch can’t extract (yet). Bug reports and patches welcome.

For your eyes only (or Adding better encryption to MariaDB)

Post Syndicated from Michael "Monty" Widenius original http://monty-says.blogspot.com/2014/05/for-your-eyes-only-or-adding-better.html

With MariaDB and MySQL we have always taken security seriously. In MariaDB 10.0 we added roles to make it easier to administrate many users. MariaDB and MySQL also have many different encryption functions, but what has been neglected in the past is making encryption easy to use. This is now about to change.

I recently had a meeting with Elmar Eperiesi-Beck from eperi about simplifying the usage of encryption. We agreed to start a close collaboration around encryption for MariaDB, with an agenda to deliver something very secure and easy to use soon. The things we are initially focusing on are:

Adding column-level encryption. This will be done at the field level, invisible to the storage engine.
Block-level encryption for certain storage engines. Initially we will target InnoDB and XtraDB.

MariaDB will initially support storing the security keys on a remote file system, accessed only at startup, and later also support using a daemon for key management. The above will make your encrypted data in MariaDB secure against:

Database users that have user access to the database.
Anyone who would attempt to steal the hard disk with the database.

By using the daemon approach, a MariaDB installation will even be secure against database administrators, as they will not have any way to access the key data. eperi has 11 years of experience with encryption and I am very happy to see them engage with MariaDB to provide better security to MariaDB users!
