Amazon Elastic File System was designed to be the file system of choice for cloud-native applications that require shared access to file-based storage. We launched EFS in mid-2016 and have added several important features since then including on-premises access via Direct Connect and encryption of data at rest. We have also made EFS available in additional AWS Regions, most recently US West (Northern California). As was the case with EFS itself, these enhancements were made in response to customer feedback, and reflect our desire to serve an ever-widening customer base.
Encryption in Transit Today we are making EFS even more useful with the addition of support for encryption of data in transit. When used in conjunction with the existing support for encryption of data at rest, you now have the ability to protect your stored files using a defense-in-depth security strategy.
In order to make it easy for you to implement encryption in transit, we are also releasing an EFS mount helper. The helper (available in source code and RPM form) takes care of setting up a TLS tunnel to EFS, and also allows you to mount file systems by ID. The two features are independent; you can use the helper to mount file systems by ID even if you don’t make use of encryption in transit. The helper also supplies a recommended set of default options to the actual mount command.
Setting up Encryption I start by installing the EFS mount helper on my Amazon Linux instance:
$ sudo yum install -y amazon-efs-utils
Next, I visit the EFS Console and capture the file system ID:
Then I specify the ID (and the TLS option) to mount the file system:
$ sudo mount -t efs fs-92758f7b -o tls /mnt/efs
And that’s it! The encryption is transparent and has an almost negligible impact on data transfer speed.
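If you want the file system to be remounted automatically after a reboot, the mount helper also understands /etc/fstab entries. An entry along these lines should work (a sketch; adjust the file system ID and mount point, and double-check the option syntax against the EFS documentation):

fs-92758f7b /mnt/efs efs _netdev,tls 0 0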
Available Now You can start using encryption in transit today in all AWS Regions where EFS is available.
The mount helper is available for Amazon Linux. If you are running another distribution of Linux you will need to clone the GitHub repo and build your own RPM, as described in the README.
We launched EFS File Sync a few days before AWS re:Invent 2017 and I finally have time to tell you about it!
If you need to move a large collection of files from an on-premises or in-cloud file system to Amazon Elastic File System, this tool is for you. Simple, single-threaded command line tools such as cp and rsync predate the cloud and cannot deliver the throughput required to move massive amounts of data from place to place. These tools are generally used as building blocks, often within scripts that take care of scheduling, orchestration, and network security.
Secure & Parallel EFS File Sync uses a secure, highly parallel data transfer mechanism that can run up to 5 times faster than the tools I mentioned above. It is available as an agent that runs within VMware ESXi or on an EC2 instance, accesses the source file system via NFS (v3 and v4), and can be used in all AWS Regions where EFS is available. Because the agent is responsible for initiating all communication with AWS, you don’t need to set up VPNs or allow inbound connections through your firewall.
You can launch, control, and monitor the agent and your sync tasks from the AWS Management Console. Jobs can specify the transfer of an entire file system or a specific directory tree, with the option to detect and skip files that are already present in the destination. File metadata (modification and access time, POSIX ownership and permissions, symbolic links, and hard links) is also copied.
Using EFS File Sync In order to write this blog post, I launched an EC2 instance, exported an NFS file system (/data), and populated the file system with the Linux kernel source code.
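For reference, the export on the source instance was nothing fancy. A setup along these lines would do (a sketch, using a wide-open read-only export for simplicity; your export options and NFS service name may differ):

$ echo "/data *(ro)" | sudo tee -a /etc/exports
$ sudo exportfs -ra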
I open the EFS Console in the same Region as my instance, and click File syncs:
I click on Get started, choose Amazon EC2 as my host platform and click Launch instance, and click Connect to agent to proceed:
Clicking Launch instance opens the EC2 console in a separate tab. I pick a Memory optimized instance type (xlarge or bigger), configure it with a public IP address and with a security group that allows inbound traffic on port 80, and launch it as I would any other EC2 instance. Then I wait a minute or two (time to water my plants or check on my dog), and wait until the status checks pass:
Then I capture the instance’s public IP address, return to the EFS tab, enter the address, and click on Activate agent:
This step retrieves the activation key from the sync agent. After it completes, I enter a name for it and click Activate agent to proceed:
Now that the agent is running and activated, I click on Create sync task to start moving some files to EFS:
I configure the source location (the EC2 instance that I mentioned at the start of this section):
I also choose the destination EFS file system and specify a target location within it for my files:
Then I select my sync options and click Next to review my configuration:
The review looks good and I click Create sync task to start copying my files:
After the sync task has been created and its status becomes Available, I can select it and choose Start from the Actions menu to initiate a sync:
I fine-tune the settings that I established when I created the task, and click Start to proceed:
I can track the status of the sync task on the History tab:
It completes within minutes and my EFS file system now includes the new files:
Available Now EFS File Sync is available in all AWS Regions where EFS is available. You pay for the EFS and EC2 resources that you consume and $0.01 per GB of data copied (see the EFS Pricing page for more info).
This post summarizes the responses we received to our November 28 post asking our readers how they handle the challenge of digital asset management (DAM). You can read the previous posts in this series below:
This past November, we published a blog post entitled What’s the Best Solution for Managing Digital Photos and Videos? We asked our readers to tell us how they’re currently backing up their digital media assets and what their ideal system might be. We posed these questions:
How are you currently backing up your digital photos, video files, and/or file libraries/catalogs? Do you have a backup system that uses attached drives, a local network, the cloud, or offline storage media? Does it work well for you?
Imagine your ideal digital asset backup setup. What would it look like? Don’t be constrained by current products, technologies, brands, or solutions. Invent a technology or product if you wish. Describe an ideal system that would work the way you want it to.
We were thrilled to receive a large number of responses from readers. What was clear from the responses is that there is no consensus on solutions among either amateurs or professionals, and that users had many ideas for how digital media management could be improved to meet their needs.
We asked our readers to contribute to this dialog for a number of reasons. As a cloud backup and cloud storage service provider, we want to understand how our users are working with digital media so we know how to improve our services. Also, we want to participate in the digital media community, and hope that sharing the challenges our readers are facing and the solutions they are using will make a contribution to that community.
The State of Managing Digital Media
While a few readers told us they had settled on a system that worked for them, most said that they were still looking for a better solution. Many expressed frustration with the growing volume of digital photo and video data, which only increases as the resolution of still and video cameras goes up. Amateurs are making do with a number of consumer services, while professionals employ a wide range of commercial, open source, or jury-rigged solutions for managing data and maintaining its integrity.
I’ve summarized the responses we received in three sections: 1) what readers are doing today, 2) common wishes they have for improvements, and 3) concerns expressed by a number of respondents.
The Digital Media Workflow
Protecting Media From Camera to Cloud
We heard from a wide range of smartphone users, DSLR and other format photographers, and digital video creators. Speed of operation, the ability to share files with collaborators and clients, and product feature sets were frequently cited as reasons for selecting their particular solution. Also of great importance was protecting the integrity of media through the entire capture, transfer, editing, and backup workflow.
Avid Media Composer
Many readers said they backed up their camera memory cards as soon as possible to a computer or external drive and erased cards only when they had more than one backup of the media. Some said that they used dual memory cards that are written to simultaneously by the camera for peace of mind.
While some cameras now come equipped with Wi-Fi, no one other than smartphone users said they were using Wi-Fi as part of their workflow. Also, we didn’t receive feedback from any photographers who regularly shoot tethered.
Some readers said they still use CDs and DVDs for storing media. One user admitted to previously using VHS tape.
NAS (Network Attached Storage) is in wide use. Synology, Drobo, FreeNAS, and other RAID and non-RAID storage devices were frequently mentioned.
A number were backing up their NAS to the cloud for archiving. Others said they had duplicate external drives that were stored onsite or offsite, including in a physical safe, other business locations, a bank lock box, and even “mom’s house.”
Many said they had regular backup practices, including nightly backups, weekly and other regularly scheduled backups, often in non-work hours.
One reader said that a monthly data scrub was performed on the NAS to ensure data integrity.
Hardware used for backups included Synology, QNAP, Drobo, and FreeNAS systems.
Services used by our readers for backing up included Backblaze Backup, Backblaze B2 Cloud Storage, CrashPlan, SmugMug, Amazon Glacier, Google Photos, Amazon Prime Photos, Adobe Creative Cloud, Apple Photos, Lima, DropBox, and Tarsnap. Some readers made a distinction between how they used sync (such as DropBox), backup (such as Backblaze Backup), and storage (such as Backblaze B2), but others did not. (See Sync vs. Backup vs. Storage on our blog for an explanation of the differences.)
Software used for backups and maintaining file integrity included Arq, Carbon Copy Cloner, ChronoSync, SoftRAID, FreeNAS, corz checksum, rclone, rsync, Apple Time Machine, Capture One, Btrfs, BorgBackup, SuperDuper, restic, Acronis True Image, custom Python scripts, and smartphone apps PhotoTransfer and PhotoSync.
Cloud torrent services mentioned were Offcloud, Bitport, and Seedr.
A common practice mentioned is to use SSD (Solid State Drives) in the working computer or attached drives (or both) to improve speed and reliability. Protection from magnetic fields was another reason given to use SSDs.
Many users copy their media to multiple attached or network drives for redundancy.
Users of Lightroom reported keeping their Lightroom catalog on a local drive and their photo files on an attached drive. They frequently had different backup schemes for the catalog and the media. Many readers are careful to have multiple backups of their Lightroom catalog. Some expressed the desire to back up both their original raw files and their edited (working) raw files, but limitations in bandwidth and backup media caused some to give priority to good backups of their raw files, since the edited files could be recreated if necessary.
A number of smartphone users reported using Apple or Google Photos to store their photos and share them.
Digital Editing and Enhancement
Adobe still rules for many users for photo editing. Some expressed interest in alternatives from Phase One, Skylum (formerly Macphun), ON1, and DxO.
Adobe Lightroom
While Adobe Lightroom (and Adobe Photoshop for some) are the foundation of many users’ photo media workflow, others are still looking for something that might better suit their needs. A number of comments were made regarding Adobe’s switch to a subscription model.
Software used for image and video editing and enhancement included Adobe Lightroom, Adobe Photoshop, Luminar, Affinity Photo, Phase One, DxO, ON1, GoPro Quik, Apple Aperture (discontinued), Avid Media Composer, Adobe Premiere, and Apple Final Cut Studio (discontinued) or Final Cut Pro.
Luminar 2018 DAM preview
Managing, Archiving, Adding Metadata, Searching for Media Files
While some of our respondents are casual or serious amateur digital media users, others make a living from digital photography and videography. A number of our readers report having hundreds of thousands of files and many terabytes of data — even approaching one petabyte of data for one professional who responded. Whether amateur or professional, all shared the desire to preserve their digital media assets for the future. Consequently, they want to be able to attach metadata quickly and easily, and search for and retrieve files from wherever they are stored when necessary.
It’s not surprising that metadata was of great interest to our readers. Tagging, categorizing, and maintaining searchable records is important to anyone dealing with digital media.
While Lightroom was frequently used to manage catalogs, metadata, and files, others used spreadsheets to record archive location and grep for searching records.
Some liked the idea of Adobe’s Creative Cloud but weren’t excited about its cost and lack of choice in cloud providers.
Others reported using Photo Mechanic, DxO, digiKam, Google Photos, Daminion, Photo Supreme, Phraseanet, Phase One Media Pro, Google Picasa (discontinued), Adobe Bridge, Synology Photo Station, FotoStation, PhotoShelter, Flickr, and SmugMug.
Photo Mechanic 5
Common Wishes For Managing Digital Media in the Future
Our readers came through with numerous suggestions for how digital media management could be improved. There were a number of common themes centered around bigger and better storage, faster broadband or other ways to get data into the cloud, managing metadata, and ensuring integrity of their data.
Many wished for faster internet speeds that would make transferring and backing up files more efficient; this desire came up repeatedly. Many said that the sheer volume of digital data they worked with made cloud services and storage impractical.
A number of readers would like the option to ship files on a physical device to a cloud provider so that the initial large transfer would not take as long. Some wished to be able to send monthly physical transfers with incremental transfers sent over the internet. (Note that Backblaze supports adding data via a hardware drive to B2 Cloud Storage with our Fireball service.)
Reasonable service cost, not surprisingly, was a desire expressed by just about everyone.
Many wished for not just backup, but long-term archiving of data. One suggestion was to be able to specify the length-of-term for archiving and pay by that metric for specific sets of files.
An easy-to-use Windows, Macintosh, or Linux client was a feature that many appreciated. Some were comfortable with using third-party apps for cloud storage and others wanted a vendor-supplied client.
A number of users like the combination of NAS and cloud. Many backed up their NAS devices to the cloud. Some suggested that the NAS should be the local gateway to unlimited virtual storage in the cloud. (They should read our recent blog post on Morro Data’s CloudNAS solution.)
Some just wanted the storage problem solved. They would like the computer system to manage storage intelligently so they don’t have to. One reader said that storage should be managed and optimized by the system, as RAM is, and not by the user.
Common Concerns Expressed by our Readers
Over and over again our readers expressed similar concerns about the state of digital asset management.
Dealing with large volumes of data was a common challenge. As digital media files increase in size, readers struggle to manage the amount of data they have to deal with. As one reader wrote, “Why don’t I have an online backup of my entire library? Because it’s too much damn data!”
Many said they would back up more often, or back up even more files if they had the bandwidth or storage media to do so.
The cloud is attractive to many, but some said that they didn’t have the bandwidth to get their data into the cloud efficiently, that the cloud was too expensive, or that they had other concerns about trusting the cloud with their data.
Most of our respondents are using Apple computer systems, some Windows, and a few Linux. A lot of the Mac users are using Time Machine. Some liked the concept of Time Machine but said they had experienced corrupted data when using it.
Visibility into the backup process was mentioned many times. Users want to know what’s happening to their data. A number said they wanted automatic integrity checks of their data and reports sent to them if anything changes.
A number of readers said they didn’t want to be locked into one vendor’s proprietary solution. They prefer open standards to prevent loss if a vendor leaves the market, changes the product, or makes a turn in strategy that they don’t wish to follow.
A number of users talked about how their practices differed depending on whether they were working in the field or working in a studio or at home. Access to the internet and data transfer speed were issues for many.
It’s clear that people working in high resolution photography and videography are pushing the envelope for moving data between storage devices and the cloud.
Some readers expressed concern about the integrity of their stored data. They were concerned that over time, files would degrade. Some asked for tools to verify data integrity manually, or that data integrity should be monitored and reported by the storage vendor on a regular basis. The OpenZFS and Btrfs file systems were mentioned by some.
A few readers mentioned that they preferred redundant data centers for cloud storage.
Metadata is an important element for many, and making sure that metadata is easily and permanently associated with their files is essential.
The ability to share working files with collaborators or finished media with clients, friends, and family also is a common requirement.
Thank You for Your Comments and Suggestions
As a cloud backup and storage provider, we found your contributions of great interest. A number of readers made suggestions for how we can improve or augment our services to increase the options for digital media management. We listened and are considering your comments. They will be included in our discussions and planning for possible future services and offerings from Backblaze. We thank everyone for contributing.
Digital media management
Let’s Keep the Conversation Going!
Were you surprised by any of the responses? Do you have something further to contribute? This is by no means the end of our exploration of how to better serve media professionals, so let’s keep the lines of communication open.
As Simon mentioned in his recent blog post about Raspbian Stretch, we have developed a new piece of software called PiServer. Use this tool to easily set up a network of client Raspberry Pis connected to a single x86-based server via Ethernet. With PiServer, you don’t need SD cards, you can control all clients via the server, and you can add and configure user accounts — it’s ideal for the classroom, your home, or an industrial setting.
Client? Server?
Before I go into more detail, let me quickly explain some terms.
Server — the server is the computer that provides the file system, boot files, and password authentication to the client(s)
Client — a client is a computer that retrieves boot files from the server over the network, and then uses a file system the server has shared. More than one client can connect to a server, but all clients use the same file system.
User – a user is a user name/password combination that allows someone to log into a client to access the file system on the server. Any user can log into any client with their credentials, and will always see the same server and share the same file system. Users do not have sudo capability on a client, meaning they cannot make significant changes to the file system and software.
I see no SD cards
Last year we described how the Raspberry Pi 3 Model B can be booted without an SD card over an Ethernet network from another computer (the server). This is called network booting or PXE (pronounced ‘pixie’) booting.
Why would you want to do this?
A client computer (the Raspberry Pi) doesn’t need any permanent storage (an SD card) to boot.
You can network a large number of clients to one server, and all clients are exactly the same. If you log into one of the clients, you will see the same file system as if you logged into any other client.
The server can be run on an x86 system, which means you get to take advantage of the performance, network, and disk speed on the server.
Sounds great, right? Of course, for the less technical, creating such a network is very difficult. For example, there’s setting up all the required DHCP and TFTP servers, and making sure they behave nicely with the rest of the network. If you get this wrong, you can break your entire network.
PiServer to the rescue
To make network booting easy, I thought it would be nice to develop an application which did everything for you. Let me introduce: PiServer!
PiServer has the following functionalities:
It automatically detects Raspberry Pis trying to network boot, so you don’t have to work out their Ethernet addresses.
It sets up a DHCP server — the thing inside the router that gives all network devices an IP address — either in proxy mode or in full IP mode. No matter the mode, the DHCP server will only reply to the Raspberry Pis you have specified, which is important for network safety.
It creates user names and passwords for the server. This is great for a classroom full of Pis: just set up all the users beforehand, and everyone gets to log in with their passwords and keep all their work in a central place. Moreover, users cannot change the software, so educators have control over which programs their learners can use.
It uses a slightly altered Raspbian build which allows separation of temporary spaces, doesn’t have the default ‘pi’ user, and has LDAP enabled for log-in.
What can I do with PiServer?
Serve a whole classroom of Pis
In a classroom, PiServer allows all files for lessons or projects to be stored on a central x86-based computer. Each user can have their own account, and any files they create are also stored on the server. Moreover, the networked Pis don’t need to be connected to the internet. The teacher has centralised control over all Pis, and all Pis are user-agnostic, meaning there’s no need to match a person with a computer or an SD card.
Build a home server
PiServer could be used in the home to serve file systems for all Raspberry Pis around the house — either a single common Raspbian file system for all Pis or a different operating system for each. Hopefully, our extensive OS suppliers will provide suitable build files in future.
Use it as a controller for networked Pis
In an industrial scenario, it is possible to use PiServer to develop a network of Raspberry Pis (maybe even using Power over Ethernet (PoE)) such that the control software for each Pi is stored remotely on a server. This enables easy remote control and provisioning of the Pis from a central repository.
How to use PiServer
The client machines
So that you can use a Pi as a client, you need to enable network booting on it. Power it up using an SD card with a Raspbian Lite image, and open a terminal window. Type in
echo program_usb_boot_mode=1 | sudo tee -a /boot/config.txt
and press Return. This adds the line program_usb_boot_mode=1 to the end of the config.txt file in /boot. Now power the Pi down and remove the SD card. The next time you connect the Pi to a power source, you will be able to network boot it.
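If you’d like to confirm that the boot mode was programmed before removing the SD card, you can inspect the OTP bits (the expected value below comes from the Raspberry Pi network boot documentation):

vcgencmd otp_dump | grep 17:

If this reports 17:3020000a, network booting is enabled.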
The server machine
As a server, you will need an x86 computer on which you can install x86 Debian Stretch. Refer to Simon’s blog post for additional information on this. It is possible to use a Raspberry Pi to serve to the client Pis, but the file system will be slower, especially at boot time.
Make sure your server has a good amount of disk space available for the file system — in general, we recommend at least 16 GB SD cards for Raspberry Pis. The whole client file system is stored locally on the server, so the disk space requirement is fairly significant.
Next, start PiServer by clicking on the start icon and then clicking Preferences > PiServer. This will open a graphical user interface — the wizard — that will walk you through setting up your network. Skip the introduction screen, and you should see a screen looking like this:
If you’ve enabled network booting on the client Pis and they are connected to a power source, their MAC addresses will automatically appear in the table shown above. When you have added all your Pis, click Next.
On the Add users screen, you can set up users on your server. These are pairs of user names and passwords that will be valid for logging into the client Raspberry Pis. Don’t worry, you can add more users at any point. Click Next again when you’re done.
The Add software screen allows you to select the operating system you want to run on the attached Pis. (You’ll have the option to assign an operating system to each client individually in the settings after the wizard has finished its job.) There are some automatically populated operating systems, such as Raspbian and Raspbian Lite. Hopefully, we’ll add more in due course. You can also provide your own operating system from a local file, or install it from a URL. For further information about how these operating system images are created, have a look at the scripts in /var/lib/piserver/scripts.
Once you’re done, click Next again. The wizard will then install the necessary components and the operating systems you’ve chosen. This will take a little time, so grab a coffee (or decaffeinated drink of your choice).
When the installation process is finished, PiServer is up and running — all you need to do is reboot the Pis to get them to run from the server.
Shooting troubles
If you have trouble getting clients connected to your network, there are a few things you can do to debug:
If some clients are connecting but others are not, check whether you’ve enabled the network booting mode on the Pis that give you issues. To do that, plug an Ethernet cable into the Pi (with the SD card removed) — the LEDs on the Pi and connector should turn on. If that doesn’t happen, you’ll need to follow the instructions above to boot the Pi and edit its /boot/config.txt file.
If you can’t connect to any clients, check whether your network is suitable: format an SD card, and copy bootcode.bin from /boot on a standard Raspbian image onto it. Plug the card into a client Pi, and check whether it appears as a new MAC address in the PiServer GUI. If it does, then the problem is a known issue, and you can head to our forums to ask for advice about it (the network booting code has a couple of problems which we’re already aware of). For a temporary fix, you can clone the SD card on which bootcode.bin is stored for all your clients.
If neither of these things fix your problem, our forums are the place to find help — there’s a host of people there who’ve got PiServer working. If you’re sure you have identified a problem that hasn’t been addressed on the forums, or if you have a request for a functionality, then please add it to the GitHub issues.
The scale of AWS and the diversity of our customer base gives us the opportunity to create EC2 instance types that are purpose-built for many different types of workloads. For example, a number of popular big data use cases depend on high-speed, sequential access to multiple terabytes of data. Our customers want to build and run very large MapReduce clusters, host distributed file systems, use Apache Kafka to process voluminous log files, and so forth.
New H1 Instances The new H1 instances are designed specifically for this use case. In comparison to the existing D2 (dense storage) instances, the H1 instances provide more vCPUs and more memory per terabyte of local magnetic storage, along with increased network bandwidth, giving you the power to address more complex challenges with a nicely balanced mix of resources.
The instances are based on Intel Xeon E5-2686 v4 processors running at a base clock frequency of 2.3 GHz and come in four instance sizes (all VPC-only and HVM-only):
Instance Name    vCPUs    RAM        Local Storage    Network Bandwidth
h1.2xlarge       8        32 GiB     2 TB             Up to 10 Gbps
h1.4xlarge       16       64 GiB     4 TB             Up to 10 Gbps
h1.8xlarge       32       128 GiB    8 TB             10 Gbps
h1.16xlarge      64       256 GiB    16 TB            25 Gbps
The two largest sizes support Intel Turbo and CPU power management, with all-core Turbo at 2.7 GHz and single-core Turbo at 3.0 GHz.
Local storage is optimized to deliver high throughput for sequential I/O; you can expect to transfer up to 1.15 gigabytes per second if you use a 2 megabyte block size. The storage is encrypted at rest using 256-bit XTS-AES and one-time keys.
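If you want a rough sanity check of that figure on your own instance, you could format and mount one of the local volumes and run a sequential write with a 2 MB block size (a sketch; the /dev/xvdb device name is an assumption and will vary by instance and AMI):

$ sudo mkfs.ext4 /dev/xvdb && sudo mkdir -p /data && sudo mount /dev/xvdb /data
$ sudo dd if=/dev/zero of=/data/seqtest bs=2M count=2048 oflag=direct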
Moving large amounts of data on and off of these instances is facilitated by the use of Enhanced Networking, giving you up to 25 Gbps of network bandwidth within Placement Groups.
Launch One Today H1 instances are available today in the US East (Northern Virginia), US West (Oregon), US East (Ohio), and EU (Ireland) Regions. You can launch them in On-Demand or Spot form. Dedicated Hosts, Dedicated Instances, and Reserved Instances (both 1-year and 3-year) are also available.
Encryption at Rest Today we are adding support for encryption of data at rest. When you create a new file system, you can select a key that will be used to encrypt the contents of the files that you store on the file system. The key can be a built-in key that is managed by AWS or a key that you created yourself using AWS Key Management Service (KMS). File metadata (file names, directory names, and directory contents) will be encrypted using a key managed by AWS. Both forms of encryption are implemented using an industry-standard AES-256 algorithm.
You can set this up in seconds when you create a new file system. You simply choose the built-in key (aws/elasticfilesystem) or one of your own:
EFS will take care of the rest! You can select the filesystem in the console to verify that it is encrypted as desired:
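If you prefer to script your setup, a roughly equivalent AWS CLI call might look like the following (a sketch; the creation token and key ARN are placeholders, and the second command simply lists the encryption status of your file systems):

$ aws efs create-file-system --creation-token my-encrypted-fs --encrypted --kms-key-id arn:aws:kms:us-west-2:123456789012:key/abcd1234-ef56-7890-abcd-1234567890ab
$ aws efs describe-file-systems --query 'FileSystems[].[FileSystemId,Encrypted]'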
A FIPS 140-2 approved cryptographic algorithm is used to encrypt data and metadata. The encryption is transparent and has a minimal effect on overall performance.
You can use AWS Identity and Access Management (IAM) to control access to the Customer Master Key (CMK). The CMK must be enabled in order to grant access to the file system; disabling the key prevents it from being used to create new file systems and blocks access (after a period of time) to existing file systems that it protects. To learn more about your options, read Managing Access to Encrypted File Systems.
Available Now Encryption of data at rest is available now in all regions where EFS is supported, at no additional charge.
Lennart Poettering announces casync, a tool for distributing system images.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of usecases that other tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and deliver file system images. To name a few, very briefly and incompletely: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as ISO 9660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Be simple to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta” question onto the feet of the image creators: the best way to make your image downloads minimal is to base your work on an existing image clients might already have, inherit its resources, and maintain full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow, individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there’s no use for it there, and likely requires substantially more disk space and larger downloads.
OSTree’s serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that, OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is bad for both disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/metadata (no explicit deltas between a large number of versions need to be prepared server side). On the other hand, server requirements in disk space and functionality (HTTP Range requests) are minus points for the usecase I am interested in.
(Note: all the mentioned systems have great properties, and it’s not my intention to badmouth them. The only point I am trying to make is that for the use case I care about — file system image delivery with high-frequency update cycles — each system comes with certain drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security and reproducibility properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously non-deterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representation of the data to deliver, that can easily be verified.
What casync Is
So much for the background on why I created casync. Now, let’s have a look at what casync actually is like, and what it does. Here’s the brief technical overview:
Encoding: Let’s take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk’s contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as the key for retrieving the full chunk data. Let’s call this directory a “chunk store”. At the same time, generate a “chunk index” file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
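In shell terms that could be as simple as something like the following sketch (the image name and the web server’s document root are assumptions, and rsync is just one way to get the files onto the server):

$ casync make webimage.caidx /srv/rootfs
$ rsync -a webimage.caidx default.castr www.example.com:/var/www/html/images/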
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The “chunking” algorithm is based on the buzhash rolling hash function. SHA256 is used as the strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
Here’s a diagram that hopefully explains a bit of how the encoding process works, my crappy drawing skills notwithstanding:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)
Details
Note that casync operates on two different layers, depending on the usecase of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different usecases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as a direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally, .caibx and .caidx files are identical; the only difference is semantic: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
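For example, to use casync as a plain tar replacement (assuming the invocations mirror the .caidx examples shown further below), one might run:

$ casync make foobar.catar /some/directory
$ casync extract foobar.catar /some/other/directory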
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync‘s --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the metadata included in the serialization. This is relevant since different usecases tend to require a different set of saved/restored metadata. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the metadata cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected metadata features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system metadata. This means that besides the usual baseline of file metadata (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs subvolume information where available. Note that as described above every single type of metadata may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other usecases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file subtrees located at the top of the tree, with zero metadata references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hardlinks for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific usecases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hardlinked files, as that’s the nature of hardlinks. In this mode, casync exposes OSTree-like behaviour, which is built heavily around read-only hardlink trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honoured by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user namespacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote backend.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented through Linux’ NBD kernel facility.
Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux’ FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that they were mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much smaller than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the superblock of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the empty space after the file system within the partition.
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:
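$ casync extract http://example.com/images/foobar.caidx /some/other/directory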
This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then place the .caidx file and the chunks on it. Note that this mode of operation is “smart”: this scheme will only upload chunks currently missing on the server side, and not retransmit what already is available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
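For example, a hypothetical invocation that writes the chunks somewhere other than the default location might look like this:

$ casync make --store=/var/lib/backup.castr foobar.caidx /some/directory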
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
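$ casync extract http://example.com/images/foobar.caidx --seed=/some/directory /some/other/directory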
When creating chunk indexes on the file system layer casync will by default store metadata as accurately as possible. Let’s create a chunk index with reduced metadata:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity timestamps, symbolic links and a single read-only bit. In this mode, all the other metadata bits are not stored, including nanosecond timestamps, full unix permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.com/images/foobar.caidx /mnt/foobar
And similar, let’s make a .caibx file available locally as a block device:
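$ casync mkdev http://example.com/images/foobar.caibx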
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big about reproducibility. Let’s make use of that to calculate a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all metadata bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what metadata to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting metadata fields. Specifying --with=unix selects all metadata that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate metadata, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file, use one of the following commands:
$ casync list http://example.com/images/foobar.caidx
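$ casync mtree http://example.com/images/foobar.caidx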
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file metadata than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different usecases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other usecases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting backends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS backends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a subprocess. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to a add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the codebases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it since quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casyncisn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducability. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducability, random access, and metadata control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? At the moment not. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there’s already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? – That’s a very good point! Usually, if you apply the a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunks sizes of casync should be matched up (using casync‘s --chunk-size= option). How precisely to choose both values is left to reasearch by the user, for now.
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of use-cases that other tools can’t cover that well.
Why?
I created casync after studying how today's popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as ISO 9660) for download on HTTP shares (in the better cases combined with zsync data).
None of these approaches appeared fully convincing to me for systems with high-frequency update cycles. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to update between would mean storing an exponentially growing number of deltas on the servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Be simple to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker's layered tarball approach dumps the "delta" question into the laps of the image creators: the best way to make your image downloads minimal is to base your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow, individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, which likely means substantially more disk space and larger downloads.
OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that, OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.
(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. The only point I am trying to make is that for the use case I care about — file system image delivery with high-frequency update cycles — each system comes with certain drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security and reproducibility properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representation of the data to deliver, that can easily be verified.
What casync Is
So much for the background on why I created casync. Now, let's have a look at what casync actually is like, and what it does. Here's the brief technical overview:
Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as a key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The "chunking" algorithm is based on the buzhash rolling hash function. SHA256 is used as the strong hash function to generate digests of the chunks, and xz is used to compress the individual chunks.
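To make the encoding and decoding steps described above a bit more concrete, here is a deliberately simplified Python sketch of the same idea. It is not casync's implementation and matches none of its on-disk formats: the buzhash rolling hash is replaced by a plain CRC32 over a small trailing window, and the window size, minimum chunk size and store layout are made up for illustration. It does show the essential properties though: chunk boundaries depend only on content, each chunk is stored once under its SHA256 digest, and the index is just an ordered list of digests and sizes.

import hashlib
import lzma
import os
import zlib

WINDOW = 48                  # bytes of context the cut decision looks at (illustrative)
MIN_CHUNK = 16 * 1024        # never cut before this many bytes (illustrative)
AVG_MASK = (64 * 1024) - 1   # cut when the low 16 bits of the hash are zero (~64K average)

def chunk_boundaries(data):
    # Yield (start, end) offsets of content-defined chunks. Because the cut
    # decision only looks at the trailing WINDOW bytes, identical content
    # produces identical chunks even when it appears at different offsets.
    start = 0
    for i in range(WINDOW, len(data)):
        if i - start < MIN_CHUNK:
            continue
        if (zlib.crc32(data[i - WINDOW:i]) & AVG_MASK) == 0:
            yield start, i
            start = i
    if start < len(data):
        yield start, len(data)

def encode(data, store_dir):
    # Write xz-compressed chunks named after their SHA256 digest into the chunk
    # store, and return the chunk index as an ordered list of (digest, size).
    os.makedirs(store_dir, exist_ok=True)
    index = []
    for start, end in chunk_boundaries(data):
        chunk = data[start:end]
        digest = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(store_dir, digest + ".xz")
        if not os.path.exists(path):       # identical chunks are stored only once
            with open(path, "wb") as f:
                f.write(lzma.compress(chunk))
        index.append((digest, len(chunk)))
    return index

def decode(index, store_dir):
    # Reassemble the original stream by fetching each chunk from the store and
    # verifying it against the digest and size listed in the index.
    out = bytearray()
    for digest, size in index:
        with open(os.path.join(store_dir, digest + ".xz"), "rb") as f:
            chunk = lzma.decompress(f.read())
        assert len(chunk) == size and hashlib.sha256(chunk).hexdigest() == digest
        out += chunk
    return bytes(out)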
Here's a diagram which hopefully explains a bit how the encoding process works, despite my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)
Details
Note that casync operates on two different layers, depending on the use-case of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as a direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally, .caibx files are identical to .caidx files; the only difference is semantic: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could typically be used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /srv/container-v3 is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities), a number of more exotic pieces of data are stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don't need FAT file bits (and I figure it's pretty likely you don't), then they won't be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail; a toy sketch of this idea follows right after this list of features). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that’s the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user namespacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as a local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to "dm-verity", as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel's block layer. This feature is implemented through Linux's NBD kernel facility.
Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux's FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that's hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much smaller than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there's little point in seeding the empty space after the file system within the partition.
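To make the meta-data selection and reproducibility points above a bit more concrete, here is a toy Python sketch. It is emphatically not casync's digest algorithm or serialization format; it merely walks a directory in a fixed order and hashes only the explicitly selected meta-data fields together with the file contents, so that two trees differing only in unselected meta-data produce the same digest:

import hashlib
import os

def tree_digest(root, with_permissions=True, with_symlinks=True, with_sec_time=False):
    # Walk the tree in sorted order so the result is stable, and feed only the
    # selected meta-data fields plus the file contents into a single SHA256.
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            st = os.lstat(path)
            if with_permissions:
                h.update(oct(st.st_mode & 0o7777).encode())
            if with_sec_time:
                h.update(str(int(st.st_mtime)).encode())
            if os.path.islink(path):
                if with_symlinks:
                    h.update(os.readlink(path).encode())
            else:
                with open(path, "rb") as f:
                    h.update(f.read())
    return h.hexdigest()

print(tree_digest("/some/dir", with_sec_time=True))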
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store created by the first command (here fetched from an HTTP server), use:
$ casync extract http://example.com/images/foobar.caidx /some/other/directory
This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then place the .caidx file and the chunks on it. Note that this mode of operation is "smart": this scheme will only upload chunks currently missing on the server side, and not re-transmit what is already available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
$ casync extract http://example.com/images/foobar.caidx --seed=/some/directory /some/other/directory
When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let’s create a chunk index with reduced meta-data:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.com/images/foobar.caidx /mnt/foobar
And similarly, let's make a .caibx file available locally as a block device:
$ casync mkdev http://example.com/images/foobar.caibx
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big on reproducibility. Let's make use of that to calculate a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with=unix selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file use one of the following commands:
$ casync list http://example.com/images/foobar.caidx
$ casync mtree http://example.com/images/foobar.caidx
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other use-cases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I'd like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically at regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS back-ends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it has hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don't see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn't "just like" some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, and restic, but each is quite a different beast.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? — That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducibility. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? — Not at the moment. That's not because I wouldn't want it to, but simply because I am not a guru of either of these systems, and didn't want to implement something I do not fully grok nor can test. If you look at the sources you'll find that there are already some definitions in place that keep room for them though. I'd be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? — That's a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let's say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn't apply that strictly to squashfs images, as squashfs provides random access to files and directories and thus breaks up the compression streams at regular intervals to make seeking easy. This is beneficial for systems employing chunking, such as casync, as it means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve the best results when delivering squashfs images through casync, the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync's --chunk-size= option). How precisely to choose both values is left a research subject for the user, for now.
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
Should you care? Is this a tool for you?
Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
Today we are making EFS even more useful with the introduction of simple and reliable on-premises access via AWS Direct Connect. This has been a much-requested feature and I know that it will be useful for migration, cloudbursting, and backup. To use this feature for migration, you simply attach an EFS file system to your on-premises servers, copy your data to it, and then process it in the cloud as desired, leaving your data in AWS for the long term. For cloudbursting, you would copy on-premises data to an EFS file system, analyze it at high speed using a fleet of Amazon Elastic Compute Cloud (EC2) instances, and then copy the results back on-premises or visualize them in Amazon QuickSight.
You’ll get the same file system access semantics including strong consistency and file locking, whether you access your EFS file systems from your on-premises servers or from your EC2 instances (of course, you can do both concurrently). You will also be able to enjoy the same multi-AZ availability and durability that is part-and-parcel of EFS.
In order to take advantage of this new feature, you will need to use Direct Connect to set up a dedicated network connection between your on-premises data center and an Amazon Virtual Private Cloud. Then you need to make sure that your filesystems have mount targets in subnets that are reachable via the Direct Connect connection:
You also need to add a rule to the mount target’s security group in order to allow inbound TCP and UDP traffic to port 2049 (NFS) from your on-premises servers:
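If you script this step with boto3 instead of using the console, adding that rule might look roughly like the sketch below; the security group ID and the on-premises CIDR range are placeholders you would need to replace:

import boto3

ec2 = boto3.client("ec2")

MOUNT_TARGET_SG = "sg-0123456789abcdef0"   # placeholder: your mount target's security group
ON_PREM_CIDR = "10.10.0.0/16"              # placeholder: your on-premises address range

ec2.authorize_security_group_ingress(
    GroupId=MOUNT_TARGET_SG,
    IpPermissions=[
        {"IpProtocol": proto, "FromPort": 2049, "ToPort": 2049,
         "IpRanges": [{"CidrIp": ON_PREM_CIDR}]}
        for proto in ("tcp", "udp")         # NFS port 2049 over TCP and UDP
    ],
)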
After you create the file system, you can reference the mount targets by their IP addresses, NFS-mount them on-premises, and start copying files. The IP addresses are available from within the AWS Management Console:
The Management Console also provides you with access to step-by-step directions! Simply click on the On-premises mount instructions:
And follow along:
This feature is available today at no extra charge in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and US East (Ohio) Regions.
I entered college in the fall of 1978. The Computer Science department at Montgomery College was built around a powerful (for its time) IBM 370/168 mainframe. I quickly learned how to use the keypunch machine to prepare my card decks, prefacing the actual code with some cryptic Job Control Language (JCL) statements that set the job's name & priority, and then invoked the FORTRAN, COBOL, or PL/I compiler. I would take the deck to the submission window, hand it to the operator in exchange for a job identifier, and then come back several hours later to collect the printed output and the card deck. I studied that printed output with care, and was always shocked to find that after my job spent several hours waiting for its turn to run, the actual run time was just a few seconds. As my fellow students and I quickly learned, jobs launched by the school's IT department ran at priority 4 while ours ran at 8; their jobs took precedence over ours. The goal of the entire priority mechanism was to keep the expensive hardware fully occupied whenever possible. Student productivity was assuredly secondary to efficient use of resources.
Batch Computing Today Today, batch computing remains important! Easier access to compute power has made movie studios, scientists, researchers, numerical analysts, and others with an insatiable appetite for compute cycles hungrier than ever. Many organizations have attempted to feed these needs by building in-house compute clusters powered by open source or commercial job schedulers. Once again, priorities come into play, and there never seems to be enough compute power to go around. Clusters are expensive to build and to maintain, and are often comprised of a large array of identical, undifferentiated processors, all of the same vintage and built to the same specifications.
We believe that cloud computing has the potential to change the batch computing model for the better, with fast access to many different types of EC2 instances, the ability to scale up and down in response to changing needs, and a pricing model that allows you to bid for capacity and to obtain it as economically as possible. In the past, many AWS customers have built their own batch processing systems using EC2 instances, containers, notifications, CloudWatch monitoring, and so forth. This turned out to be a very common AWS use case and we decided to make it even easier to achieve.
Introducing AWS Batch Today I would like to tell you about a new set of fully-managed batch capabilities. AWS Batch allows batch administrators, developers, and users to have access to the power of the cloud without having to provision, manage, monitor, or maintain clusters. There's nothing to buy and no software to install. AWS Batch takes care of the undifferentiated heavy lifting and allows you to run your container images and applications on a dynamically scaled set of EC2 instances. It is efficient, easy to use, and designed for the cloud, with the ability to run massively parallel jobs that take advantage of the elasticity and selection provided by Amazon EC2 and EC2 Spot, and can easily and securely interact with other AWS services such as Amazon S3, DynamoDB, and SNS.
Let’s start by taking a look at some important AWS Batch terms and concepts (if you are already doing batch computing, many of these terms will be familiar to you, and still apply). Here goes:
Job – A unit of work (a shell script, a Linux executable, or a container image) that you submit to AWS Batch. It has a name, and runs as a containerized app on EC2 using parameters that you specify in a Job Definition. Jobs can reference other jobs by name or by ID, and can be dependent on the successful completion of other jobs.
Job Definition – Specifies how Jobs are to be run. Includes an AWS Identity and Access Management (IAM) role to provide access to AWS resources, and also specifies both memory and CPU requirements. The definition can also control container properties, environment variables, and mount points. Many of the specifications in a Job Definition can be overridden by specifying new values when submitting individual Jobs.
Job Queue – Where Jobs reside until scheduled onto a Compute Environment. A priority value is associated with each queue.
Scheduler – Attached to a Job Queue, a Scheduler decides when, where, and how to run Jobs that have been submitted to a Job Queue. The AWS Batch Scheduler is FIFO-based, and is aware of dependencies between jobs. It enforces priorities, and runs jobs from higher-priority queues in preference to lower-priority ones when the queues share a common Compute Environment. The Scheduler also ensures that the jobs are run in a Compute Environment of an appropriate size.
Compute Environment – A set of managed or unmanaged compute resources that are used to run jobs. Managed environments allow you to specify desired instance types at several levels of detail. You can set up Compute Environments that use a particular type of instance, a particular model such as c4.2xlarge or m4.10xlarge, or simply specify that you want to use the newest instance types. You can also specify the minimum, desired, and maximum number of vCPUs for the environment, along with a percentage value for bids on the Spot Market and a target set of VPC subnets. Given these parameters and constraints, AWS Batch will efficiently launch, manage, and terminate EC2 instances as needed. You can also launch your own Compute Environments. In this case you are responsible for setting up and scaling the instances in an Amazon ECS cluster that AWS Batch will create for you.
The Status Dashboard displays my Jobs, Job Queues, and Compute Environments:
I need a place to run my Jobs, so I will start by selecting Compute environments and clicking on Create environment. I begin by choosing to create a Managed environment, giving it a name, and choosing the IAM roles (these were created automatically for me):
Then I set up the provisioning model (On-Demand or Spot), choose the desired instance families (or specific types), and set the size of my Compute Environment (measured in vCPUs):
I wrap up by choosing my VPC, the desired subnets for compute resources, and the security group that will be associated with those resources:
I click on Create and my first Compute Environment (MainCompute) is ready within seconds:
Next, I need a Job Queue to feed work to my Compute Environment. I select Queues and click on Create Queue to set this up. I accept all of the defaults, connect the Job Queue to my new Compute Environment, and click on Create queue:
Again, it is available within seconds:
Now I can set up a Job Definition. I select Job definitions and click on Create, then set up my definition (this is a very simple job; I am sure you can do better). My job runs the sleep command, needs 1 vCPU, and fits into 128 MB of memory:
I can also pass in environment variables, disable privileged access, specify the user name for the process, and arrange to make file systems available within the container:
I click on Save and my Job Definition is ready to go:
Now I am ready to run my first Job! I select Jobs and click on Submit job:
I can also override many aspects of the job, add additional tags, and so forth. I'll leave everything as-is and click on Submit:
And there it is:
I can also submit jobs by specifying the Ruby, Python, Node, or Bash script that implements the job. For example:
The command line equivalents to the operations that I used in the console include create-compute-environment, describe-compute-environments, create-job-queue, describe-job-queues, register-job-definition, submit-job, list-jobs, and describe-jobs.
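If you prefer to script the same setup, a rough boto3 sketch of those calls might look like the following; every name, role ARN, subnet, and security group here is a placeholder, and only a small subset of the available parameters is shown:

import boto3

batch = boto3.client("batch")

# A managed Compute Environment (roles, subnets and security groups are placeholders).
batch.create_compute_environment(
    computeEnvironmentName="MainCompute",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-11111111", "subnet-22222222"],
        "securityGroupIds": ["sg-33333333"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)

# A Job Queue that feeds work to the Compute Environment.
batch.create_job_queue(
    jobQueueName="MainQueue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "MainCompute"}],
)

# A simple Job Definition: run sleep with 1 vCPU and 128 MB of memory.
batch.register_job_definition(
    jobDefinitionName="sleep-demo",
    type="container",
    containerProperties={
        "image": "busybox",
        "vcpus": 1,
        "memory": 128,
        "command": ["sleep", "60"],
    },
)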
I expect to see the AWS Batch APIs used in some interesting ways. For example, imagine a Lambda function that is invoked when a new object (a digital X-Ray, a batch of seismic observations, or a 3D scene description) is uploaded to an S3 bucket. The function can examine the object, extract some metadata, and then use the SubmitJob function to submit one or more Jobs to process the data, with updated data stored in Amazon DynamoDB and notifications sent to Amazon Simple Notification Service (SNS) along the way.
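A minimal sketch of such a Lambda function in Python might look like this; the queue and job definition names are placeholders and error handling is omitted:

import boto3
from urllib.parse import unquote_plus

batch = boto3.client("batch")

def handler(event, context):
    # Invoked by S3 for each uploaded object; submit one Batch job per object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # Job names may only contain letters, numbers, hyphens, and underscores.
        job_name = "process-" + "".join(c if c.isalnum() or c in "-_" else "-" for c in key)[:100]
        batch.submit_job(
            jobName=job_name,
            jobQueue="MainQueue",            # placeholder queue name
            jobDefinition="process-object",  # placeholder job definition
            parameters={"bucket": bucket, "key": key},
        )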
Pricing & Availability AWS Batch is in Preview today in the US East (Northern Virginia) Region. In addition to regional expansion, we have many other interesting features on the near-term AWS Batch roadmap. For example, you will be able to use an AWS Lambda function as a Job.
There’s no charge for the use of AWS Batch; you pay only for the underlying AWS resources that you consume.
Editor’s note: We posted this article originally in August 2016. We’ve since updated it with new information.
APFS, or Apple File System, is one of the biggest changes coming to every new Apple device. It makes its public debut with the release of iOS 10.3, but it’s also coming to the Mac (in fact, it’s already available if you’re a developer). APFS changes how the Mac, iPhone and iPad store files. Backing up your data is our job, so we think a new file system is fascinating. Let’s take a look at APFS to understand what it is and why it’s so important, then answer some questions about it.
File systems are a vital component of any computer or electronic device. The file system tells the computer how to interact with data. Whether it’s a picture you’ve taken on your phone, a Microsoft Word document, or an invisible file the computer needs, the file system accounts for all of that stuff.
File systems may not be the sexiest feature, but the underlying technology is so important that it gets developers interested. Apple revealed plans for APFS at its annual Worldwide Developers Conference in June 2016. APFS thoroughly modernizes the way Apple devices track stored information. APFS also adds some cool features that we haven’t seen before in other file systems.
APFS first appeared with macOS 10.12 Sierra as an early test release for developers to try out. Its first general release is iOS 10.3. Apple will migrate everything to use it in the future. Since our Mac client is a native app, we, like many Apple developers, have been boning up on APFS and what it means.
What Is APFS?
Apple hasn't said what the P in APFS stands for, but the name differentiates it from Apple File Service (AFS), a term used to describe older Apple file and network services.
APFS is designed to scale from the smallest Apple device to the biggest. It works with watchOS, tvOS, iOS and macOS, spanning the entire Apple product line. It’s designed from the get-go to work well on modern Apple device architectures and offers plenty of scalability going forward.
APFS won't change how you see files. The Finder, the main way you interact with files on your Mac, won't undergo any major cosmetic changes because of APFS (at least none Apple has told us about yet). Neither will iOS, which mostly hides file management from you. What changes is the under-the-hood stuff that tells the computer where to put data and how to work with it.
Why Did Apple Make APFS?
The current file system Apple uses is HFS+. HFS was introduced in 1985, back when the Mac was still new. That’s right, more than thirty years ago now. (HFS+ came later with some improvements for newer Macs.)
To give you an idea of how "state of the art" has changed since then, consider this. My first Mac, which came out late in 1984, had 512 KB of RAM (four times the original Mac's memory) and a single floppy drive that could store 400K. This computer I'm writing from now has 8 GB of RAM – about 16 thousand times more RAM than my first Mac – and 512 GB of storage capacity. That's more than 1.2 million times what that first Mac's floppy could hold. Think about that the next time you get a message that your drive is full!
Given the pace of computer technology and development, it’s a bit startling that we still use anything developed so long ago. But that’s how essential and important a file system is to a computer’s operation.
HFS+ was cutting-edge for its time, but Apple made it for computers with floppy disk drives and hard drives. Floppies are long gone. Most Apple devices now use solid state storage like built-in Flash and Solid State Drives (SSDs), and those store data differently than hard drives and floppies did.
Why Is APFS Better?
APFS better suits the needs of today’s and tomorrow’s computers and mobile devices because it’s made for solid-state storage – Flash, and SSDs. These storage technologies work differently than spinning drives do, so it only makes sense to optimize the file system to take advantage.
Apple's paving the way to store lots more data with APFS. HFS+ supports 32-bit file IDs, for example, while APFS ups that to 64-bit. That means that today, your Mac can keep track of about 4 billion individual pieces of information on its hard drive. APFS ups that to more than 9 quintillion. That's a nine followed by 18 zeroes.
Even though APFS can keep track of orders of magnitude more data than HFS+, you'll see much faster performance. When you need to save or duplicate files, APFS shares data between files whenever possible. Instead of duplicating information like HFS+ does, APFS updates metadata links to the actual stored information. Clones, or copies of files or folders, can be created instantly. You won't have to sit and watch as gigabytes of files are duplicated en masse, wasting extreme amounts of space in the process. In fact, clones take up no additional space, since they're pointing back to the original data! You'll get much better bang for your storage buck with APFS than HFS+ can manage.
Speaking of space, Space Sharing is another new feature of APFS. Space Sharing helps the Mac manage free space on its hard drives more efficiently. You can set up multiple partitions, even multiple file systems, on a single physical device, and all of them can share the same space. You presently have to jump through hoops if you’re resizing partitions and want to re-use de-allocated space. APFS views individual physical devices as “containers,” with multiple “volumes” inside.
How Does APFS Affect Performance?
Networking is crucial for almost all computers and computing devices. Over the years there’s been a lot of emphasis on tuning operating system performance for maximum throughput. That’s helpful to developers like us because we store data in the cloud. But that’s not the whole story. Latency – the amount of time between you telling your computer to do something and when it happens – also has a significant effect on performance.
Has “the Beachball of Death” ever plagued you? You’ll click a button or try to open something, and the cursor changes to a spinning disk that looks for all the world like a beachball. Apple’s doing a lot more with APFS to make beachballs go away. That’s because they’re prioritizing latency – the amount of time between when you ask your device to do something and when it does it.
Apple has found other ways to improve performance wherever possible. Take crash protection, for example. HFS+ uses journaling as a form of crash protection: It keeps track of changes not yet made to the file system in log files. Unfortunately, journaling creates performance overhead. Those log files are always being written and read. APFS replaces that with a new copy-on-write metadata scheme that’s much more efficient.
How Is APFS Security?
Apple is very concerned with user privacy. Their protection of their users’ privacy has occasionally put Apple at loggerheads with governments and individuals who want your data. Apple’s taking your privacy seriously with APFS, thanks to much more sophisticated encryption options than before.
Apple’s current encryption scheme is called FileVault. FileVault is “whole disk” encryption. You turn it on, and your Mac encrypts your hard drive. That encrypted data is, for all intents and purposes, unrecognizable unless you enter a password or key to unlock it.
The problem is that FileVault is either on or off, and it’s on or off for the whole volume. So once you’ve unlocked it, your data is potentially vulnerable. APFS still supports full disk encryption, but it can also encrypt individual files and metadata, with single or multi-key support. That provides additional security for your most sensitive data.
As a backup company, one feature of APFS we’re particularly interested in is its support for snapshots. Snapshots are a pretty standard feature of enterprise backups, but we haven’t seen them yet on the Mac. A snapshot contains pointers to the data stored on your disk rather than the data itself, so it’s compact, quick to create, and very fast to access.
How Do I Get APFS?
If you’ve upgraded to iOS 10.3 or later, your iPhone or iPad has already made the switch. There’s nothing more to do. If you’re a Mac user, you’re best off waiting for now. APFS support on the Mac is still provisional and mainly the purview of developers. But it’s coming soon, and when it does, Apple promises the same sort of seamless conversion that iPhone and iPad customers have.
When the time is right, make sure to back up your Mac before making any major changes – just as you should with your iPhone or iPad if you haven’t yet installed 10.3. If you need help, head over to our Computer Backup Guide for more tips.
There’s a lot more under the hood in APFS, but that gives you a broad overview of what it is and why we’re excited. We hope you are too. APFS is an “under the hood” enhancement in iOS 10.3 that shouldn’t have any significant effect on how your Apple gear works today, but it paves the way for what’s to come in the future.
Yesterday, we introduced the first of two new boot modes which have now been added to the Raspberry Pi 3. Today, we introduce an even more exciting addition: network booting a Raspberry Pi with no SD card.
Again, rather than go through a description of the boot mode here, we’ve written a fairly comprehensive guide on the Raspberry Pi documentation pages, and you can find a tutorial to get you started here. Below are answers to what we think will be common questions, and a look at some limitations of the boot mode.
Note: this is still in beta testing and uses the “next” branch of the firmware. If you’re unsure about using the new boot modes, it’s probably best to wait until we release it fully.
What is network booting?
Network booting is a computer’s ability to load all its software over a network. This is useful in a number of cases, such as remotely operated systems or those in data centres; network booting means they can be updated, upgraded, and completely re-imaged, without anyone having to touch the device!
The main advantages when it comes to the Raspberry Pi are:
SD cards are difficult to make reliable unless they are treated well; they must be powered down correctly, for example. A Network File System (NFS) is much better in this respect, and is easy to fix remotely.
NFS file systems can be shared between multiple Raspberry Pis, meaning that you only have to update and upgrade a single Pi, and are then able to share users in a single file system.
Network booting allows for completely headless Pis with no external access required. The only desirable addition would be an externally controlled power supply.
I’ve tried doing things like this before and it’s really hard editing DHCP configurations!
It can be quite difficult to edit DHCP configurations to allow your Raspberry Pi to boot, while not breaking the whole network in the process. Because of this, and thanks to input from Andrew Mulholland, I added support for proxy DHCP, as used by PXE-booting computers.
What’s proxy DHCP and why does it make it easier?
Standard DHCP is the protocol that gives a system an IP address when it powers up. It’s one of the most important protocols, because it allows all the different systems to coexist. The problem is that if you edit the DHCP configuration, you can easily break your network.
So proxy DHCP is a special protocol: instead of handing out IP addresses, it only hands out the TFTP server address, and it only replies to devices that are trying to netboot. That means you can run it alongside your existing DHCP server without touching its configuration, which makes it much easier to enable and manage – especially since we’ve given you a tutorial!
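To give a rough idea of what this looks like in practice, here is a minimal sketch that uses dnsmasq as the proxy DHCP and TFTP server. The subnet (192.168.1.0), interface (eth0), and TFTP root (/tftpboot, holding bootcode.bin and the rest of the boot files) are placeholders for your own setup – the tutorial linked above walks through the real configuration:
$ sudo dnsmasq --no-daemon --port=0 --interface=eth0 \
    --dhcp-range=192.168.1.0,proxy --log-dhcp \
    --enable-tftp --tftp-root=/tftpboot \
    --pxe-service=0,"Raspberry Pi Boot"
The --port=0 switch turns off dnsmasq’s DNS function, and the proxy keyword in --dhcp-range means it never hands out addresses, so your existing DHCP server keeps doing that job.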
Are there any bugs?
At the moment we know of three problems which need to be worked around:
When the boot ROM enables the Ethernet link, it first waits for the link to come up, then sends its first DHCP request packet. This is sometimes too quick for the switch to which the Raspberry Pi is connected: we believe that the switch may throw away packets it receives very soon after the link first comes up.
The second bug is in the retransmission of the DHCP packet: the retransmission loop is not timing out correctly, so the DHCP packet will not be retransmitted.
The solution to both these problems is to find a suitable switch which works with the Raspberry Pi boot system. We have been using a Netgear GS108 without a problem.
Finally, the failing timeout has a knock-on effect: the boot ROM can need the occasional stray packet to wake it up again, so having the Raspberry Pi wired up to a general network with lots of other computers actually helps!
Can I use network boot with Raspberry Pi / Pi 2?
Unfortunately, because the code is actually in the boot ROM, this won’t work with Pi 1, Pi B+, Pi 2, and Pi Zero. But as with the MSD instructions, there’s a special mode in which you can copy the ‘next’ firmware bootcode.bin to an SD card on its own, and then it will try and boot from the network.
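If you want to try that, it’s just a FAT-formatted card with that single file on it. A rough sketch, assuming the card shows up as /dev/mmcblk0 with a FAT32 first partition and that you have already downloaded bootcode.bin from the “next” branch of the firmware repository:
$ sudo mount /dev/mmcblk0p1 /mnt
$ sudo cp bootcode.bin /mnt/     # the only file the card needs
$ sudo umount /mnt
Pop the card into the older Pi and it will load bootcode.bin from the card, then try to boot over the network.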
This is also useful if you’re having trouble with the bugs above, since I’ve fixed them in the bootcode.bin implementation.
Finally, I would like to thank my Slack beta testing team who provided a great testing resource for this work. It’s been a fun few weeks! Thanks in particular to Rolf Bakker for this current handy status reference…
Want to a join a rapidly expanding team and help us grow Backblaze to new heights? We’re looking for a Sys Admin who is looking for a challenging and fast-paced working environment. The position can either be in San Mateo, California or in our Rancho Cordova datacenter! Interested? Check out the job description and application details below:
Here’s what you’ll be working on:
– Rebuild failed RAID arrays, diagnose and repair file system problems (ext4) and debug other operations problems with minimal supervision.
– Administrative proficiency in software patches, releases and system upgrades.
– Troubleshoot and resolve operational problems.
– Help deploy, configure and maintain production systems.
– Assist with networks and services (static/dynamic web servers, etc.) as needed.
– Assist in efforts to automate provisioning and other tasks that need to be run across hundreds of servers.
– Help maintain monitoring systems to measure system availability and detect issues.
– Help qualify hardware and components.
– Participate in the 24×7 on-call pager rotation and respond to alerts as needed. This may include occasional trips to Backblaze datacenter(s).
– Write, design, maintain and support operational documentation and scripts.
– Help train operations staff as needed.
This is a must:
– Strong knowledge of Linux system administration, Debian experience preferred.
– 4+ years of experience.
– Bash scripting skills required.
– Ability to lift/move 50-75 lbs and work down near the floor as needed.
– Position based in the San Mateo Corporate Office or the Rancho Cordova Datacenter, California.
It would be nice if you had:
– Experience configuring and supporting (Debian) Linux software RAID (mdadm).
– Experience configuring and supporting file systems on Linux (Debian).
– Experience troubleshooting server hardware/component issues.
– Experience supporting Apache, Tomcat, and Java services.
– Experience with automation in a production environment (Puppet/Chef/Ansible).
– Experience supporting network equipment (layer 2 switches).
Required for all Backblaze Employees:
– Good attitude and willingness to do whatever it takes to get the job done.
– Strong desire to work for a small fast-paced company.
– Desire to learn and adapt to rapidly changing technologies and work environment.
– Occasional visits to Backblaze datacenters necessary.
– Rigorous adherence to best practices.
– Relentless attention to detail.
– Excellent interpersonal skills and good oral/written communication.
– Excellent troubleshooting and problem solving skills.
– OK with pets in office.
Backblaze is an Equal Opportunity Employer, and we offer a competitive salary and benefits, including our “no policy” vacation policy.
If this sounds like you — follow these steps:
Send an email to [email protected] with the position in the subject line.
Include your resume.
Tell us a bit about your Sys Admin experience and why you’re excited to work with Backblaze.
So I accidentally ordered too many Raspberry Pis. Therefore, I built a small cluster out of them, and thought I’d write up a parts list for others wanting to build one of their own.
To start with, here are some pics of the cluster. What you see is a stack of 7 RPis. At the bottom of the stack are a USB multiport charger and an Ethernet hub. You see USB cables coming out of the charger to power the RPis, and out the other side you see Ethernet cables connecting the RPis to a network. I’ve included the mouse and keyboard in the picture to give you a sense of perspective.
Here is the same stack turned around, seen from the other side. Out of the bottom left you see three external cables: one Ethernet cable to my main network, plus power cables for the USB charger and the Ethernet hub. You can see that the USB charger is nicely tied down to the frame, but that the Ethernet hub is just sort of jammed in there somehow.
The concept is to get things as cheap as possible on a per-unit basis; otherwise, one might as well just buy more expensive computers. My parts list for a 7x Pi cluster is:
…or $54.65 per unit ($383 for the entire cluster), or around 50% more than the base Raspberry Pis alone. This is getting a bit expensive, as Newegg always has cheap Android tablets on closeout for $30 to $50.
So here’s a discussion of the parts.
Raspberry Pi 2
These are old boards I’d ordered a while back. The current model is the RPi 3, with slightly faster processors and WiFi/Bluetooth on board, neither of which is useful for a cluster. The RPi 2 has four CPUs each running at 900 MHz, as opposed to the RPi 3’s four 1.2 GHz processors. If you order a Raspberry Pi now, it’ll be the newer, better one.
The case
You’ll notice that the RPis are mounted on acrylic sheets, which are in turn held together with standoffs/spacers. This is a relatively expensive option.
A cheaper solution would be to just buy the spacers/standoffs yourself. They are a little hard to find, because the screws need to fit the RPi’s 2.9mm mounting holes, which are unusually tiny. Such spacers/standoffs are usually made of brass, but you can also find nylon ones. For the ends, you need some washers and screws. This brings the price down to about $2/unit – or a lot cheaper if you are buying in bulk for a lot of units.
The micro-SD
The absolute cheapest micro SD cards I could find were $2.95/unit for 4GB, or half the price of the ones I bought. But the ones I chose are 4x the size and 2x the speed. RPi distros are getting large enough that they no longer fit well on 4GB cards, and are even approaching 8GB. Thus, 16GB cards are the best choice, especially when I could get them for $6/unit. By the time you read this, the price of flash will have changed up or down. I searched on Newegg, because that’s the easiest way to focus on the cheapest. Most cards should work, but check http://elinux.org/RPi_SD_cards to avoid any known bad chips.
Note that different cards have different speeds, which can have a major impact on performance. You probably don’t care for a cluster, but if you are buying a card for a development system, get the faster ones. The Samsung EVO cards are a good choice for something fast.
USB Charging Hub
What we want here is a charger not a hub. Both can work, but the charger works better.
A normal hub is about connecting all your USB devices to your desktop/laptop. That’s not what the RPi connector is for – it’s just for power. The RPi simply leverages the fact that there are already lots of USB power cables and chargers out there, so that it doesn’t have to invent a custom one.
USB hubs can supply some power to the RPi – enough to boot it. However, under load, or when you connect further USB devices to the RPi, there may not be enough power available. You might be able to run a couple of RPis from a normal hub, but when you’ve got all seven running (as in this stack), there might not be enough power. Power problems can outright crash the devices, but worse, they can lead to things like corrupt writes to the flash drives, slowly corrupting the system until it fails.
Luckily, in the last couple of years multiport chargers have come onto the market. These are designed for families (and workplaces) that have a lot of phones and tablets to charge. They can charge high-capacity batteries on all ports – supplying much more power than your RPi will ever need.
If you want to go ultra cheap, then cheap hubs at $1/port may be adequate. Chargers cost around $4/port.
The charger I chose in particular is the Bolse 60W 7-port charger. I only need exactly 7 ports. More ports would be nicer, in case I needed to power something else along with the stack, but this Bolse unit has the nice property that it fits snugly within the stack. The frame came with extra spacers which I could screw together to provide room. I then used zip ties to hold it firmly in place.
Ethernet hub
The RPis only have 100mbps Ethernet. Therefore, you don’t need a gigabit hub, which you’d normally get, but can choose a 100mbps hub instead: it’s cheaper, smaller, and lower power. The downside is that while each RPi only does 100-mbps, combined they will do 700-mbps, which the hub can’t handle.
I got a $10 hub from Newegg. As you can see, it fits within the frame, though not well. Every gigabit hub I’ve seen is bigger and could not fit this way.
Note that I have a couple extra RPis, but I only built a 7-high stack, because of the Ethernet hub. Hubs have only 8 ports, one of which is needed for the uplink. That leaves 7 devices. I’d have to upgrade to an unwieldy 16-port hub if I wanted more ports, which wouldn’t fit the nice clean case I’ve got.
For a gigabit option, Ethernet switches will cost between $23 and $35. That $35 option is a “smart” switch that supports not only gigabit, but also a web-based configuration tool, VLANs, and some other high-end features. If I paid more for a switch, I’d probably go with the smart/managed one.
Cables (Ethernet, USB)
Buying cables is expensive, as anyone who’s bought a $30 Apple cable knows. But buying in bulk from specialty sellers can reduce the price to under $1/cable.
The chief buying factor is length. We want short cables that will just barely be long enough. In the pictures above, the Ethernet cables are 1-foot, as are two of the USB cables. The colored USB cables are 6-inch. I got these off Amazon because they looked cool, but now I’m regretting it.
The easiest, cheapest, and highest quality place to buy cables is Monoprice.com. It allows you to easily select the length and color.
To reach everything in this stack, you’ll need 1-foot cables, though 6-inch cables will work for some (but not all) of the USB devices. If I had put the hubs in the middle of the stack instead of at the bottom, the 6-inch cables would have worked better – but I didn’t think that would look as pretty. (I chose these colored cables because somebody suggested them, but they won’t work for the full seven-high tower.)
Power consumption
The power consumption of the entire stack is 13.3 watts while it’s idle. The Ethernet hub by itself was 1.3 watts (so low because it’s 100-mbps instead of gigabit).
Rounding up, that’s about 2 watts per RPi while idle.
In previous power tests, each RPi drew an extra 2 to 3 watts while doing heavy computations, so under load the entire stack can start consuming a significant amount of power. I mention this because people think of the RPi as a low-power alternative to Intel’s big CPUs, but in truth, once you’ve got enough RPis in a cluster to equal the computational power of an Intel processor, you’ll probably be consuming more electricity.
The operating system
I grabbed the latest Raspbian image and installed it on one of the RPis. I then removed the card, copied the files off (cp -a), reformatted it to use the f2fs flash file system, then copied the files back on. I then made an image of the card (using dd), and wrote that image to the 6 other cards. Finally, I logged into each one and renamed them rpi-a1, …, rpi-a7. (Security note: this means they all have the same SSH private key, but I don’t care.)
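Here’s a rough sketch of those steps in command form. The device names (/dev/sdb for the card being converted, /dev/sdc for each card being cloned) and paths are placeholders, and a real f2fs conversion also needs the boot configuration to point at the new root file system type, so treat this as an outline of the process rather than a complete recipe:
$ mkdir -p /tmp/rootfs && sudo mkdir -p /mnt/sd
$ sudo mount /dev/sdb2 /mnt/sd                 # root partition of the freshly installed card
$ sudo cp -a /mnt/sd/. /tmp/rootfs/            # copy the files off
$ sudo umount /mnt/sd
$ sudo mkfs.f2fs /dev/sdb2                     # reformat the root partition as f2fs
$ sudo mount /dev/sdb2 /mnt/sd
$ sudo cp -a /tmp/rootfs/. /mnt/sd/            # copy the files back on
$ sudo umount /mnt/sd
$ sudo dd if=/dev/sdb of=rpi-f2fs.img bs=4M    # image the whole card…
$ sudo dd if=rpi-f2fs.img of=/dev/sdc bs=4M    # …and write that image to each of the other cards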
About flash file systems
The micro SD flash has a bit of wear leveling, but not enough. A lot of RPi servers I’ve installed in the past have failed after a few months with corrupt drives. I don’t know why for certain, but I suspect it’s because the flash is getting corrupted.
Thus, I installed f2fs, a wear leveling file system designed especially for this sort of situation. We’ll see if that helps at all.
One big thing is to make sure atime is disabled – a massively brain-dead feature inherited from 1980s Unix that writes to the disk every time you read from a file.
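Turning it off is just a mount option. A quick sketch (the device, mount point, and file system type are placeholders for whatever is in your own /etc/fstab):
$ sudo mount -o remount,noatime /
# or make it permanent in /etc/fstab:
# /dev/mmcblk0p2  /  f2fs  defaults,noatime  0  0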
I noticed that the green LED on the RPi, which indicates disk activity, flashes very briefly once per second (so quick you’ll miss it unless you look closely at the light). I used iotop -a to find out what it was. I think it’s just a hardware feature and not related to disk activity. On the other hand, it’s worth tracking down what writes might be happening in the background that will affect flash lifetime.
What I found was that there is some kernel thread that writes rarely to the disk, and a “f2fs garbage collector” that’s cleaning up the disk for wear leveling. I saw nothing that looked like it was writing regularly to the disk.
What to use it for?
So here’s the thing about an RPi cluster – it’s technically useless. If you run the numbers, it’s got less compute power and higher power consumption than a normal desktop/laptop computer. Thus, an entire cluster of them will still perform slower than laptops/desktops.
Thus, the point of a cluster is to have something to play with, to experiment with, not that it’s the best form of computation. The point of individual RPis is not that they have better performance/watt – it’s that you don’t need much performance, and you want a package with very low watts.
With that said, I should do some password cracking benchmarks with them, compared across CPUs and GPUs, measuring power consumption. That’ll be a topic for a later post.
With that said, I will be using these, though as individual computers rather than as a “cluster”. There are lots of services I want to run, but I don’t want to run a full desktop running VMware. I’d rather control individual devices.
Conclusion
I’m not sure what I’m going to do with my little RPi stack/cluster, but I wanted to document everything about it so that others can replicate it if they want to.
The portfolio of AWS storage products has grown increasingly rich and diverse over time. Amazon S3 started out with a single storage class and has grown to include storage classes for regular, infrequently accessed, and archived objects. Similarly, Amazon Elastic Block Store (EBS) began with a single volume type and now offers a choice of four types of SAN-style block storage, each designed to be a great fit for a particular set of access patterns and data types.
With object storage and block storage capably addressed by S3 and EBS, we turned our attention to the file system. We announced the Amazon Elastic File System (EFS) last year in order to provide multiple EC2 instances with shared, low-latency access to a fully-managed file system.
I am happy to announce that EFS is now available for production use in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) Regions.
We are launching today after an extended preview period that gave us insights into an extraordinarily wide range of customer use cases. The EFS preview was a great fit for large-scale, throughput-heavy processing workloads, along with many forms of content and web serving. During the preview we received a lot of positive feedback about the performance of EFS for these workloads, along with requests to provide equally good support for workloads that are sensitive to latency and/or make heavy use of file system metadata. We’ve been working to address this feedback and today’s launch is designed to handle a very wide range of use cases. Based on what I have heard so far, our customers are really excited about EFS and plan to put it to use right away.
Why We Built EFS Many AWS customers have asked us for a way to more easily manage file storage on a scalable basis. Some of these customers run farms of web servers or content management systems that benefit from a common namespace and easy access to a corporate or departmental file hierarchy. Others run HPC and Big Data applications that create, process, and then delete many large files, resulting in storage utilization and throughput demands that vary wildly over time. Our customers also insisted on high availability and durability, along with a strongly consistent model for access and modification.
Amazon Elastic File System EFS lets you create POSIX-compliant file systems and attach them to one or more of your EC2 instances via NFS. The file system grows and shrinks as necessary (there’s no fixed upper limit and you can grow to petabyte scale) and you don’t pre-provision storage space or bandwidth. You pay only for the storage that you use.
EFS protects your data by storing copies of your files, directories, links, and metadata in multiple Availability Zones.
In order to provide the performance needed to support large file systems accessed by multiple clients simultaneously, Elastic File System performance scales with storage (I’ll say more about this later).
Each Elastic File System is accessible from a single VPC, and is accessed by way of mount targets that you create within the VPC. You have the option to create a mount target in any desired subnet of your VPC. Access to each mount target is controlled, as usual, via Security Groups.
EFS offers two distinct performance modes. The first mode, General Purpose, is the default. You should use this mode unless you expect to have tens, hundreds, or thousands of EC2 instances access the file system concurrently. The second mode, Max I/O, is optimized for higher levels of aggregate throughput and operations per second, but incurs slightly higher latencies for file operations. In most cases, you should start with general purpose mode and watch the relevant CloudWatch metric (PercentIOLimit). When you begin to push the I/O limit of General Purpose mode, you can create a new file system in Max I/O mode, migrate your files, and enjoy even higher throughput and operations per second.
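If you prefer to script this rather than click through the console, creating a file system with an explicit performance mode looks roughly like the following (the creation tokens are just unique names you choose; consider this a sketch rather than a complete recipe):
$ aws efs create-file-system --creation-token my-gp-fs --performance-mode generalPurpose
$ aws efs create-file-system --creation-token my-maxio-fs --performance-mode maxIO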
I opened the console and clicked on the Create file system button:
Then I selected one of my VPCs and created a mount target in my public subnet:
My security group (corp-vpc-mount-target) allows my EC2 instance to access the mount point on port 2049. Here’s the inbound rule; the outbound one is the same:
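If you manage security groups from the command line instead, the equivalent inbound rule looks something like this (both group IDs are placeholders; the source group is the one attached to my EC2 instance):
$ aws ec2 authorize-security-group-ingress --group-id sg-11111111 \
    --protocol tcp --port 2049 --source-group sg-22222222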
I added Name and Owner tags, and opted for the General Purpose performance mode:
Then I confirmed the information and clicked on Create File System:
My file system was ready right away (the mount targets took another minute or so):
I clicked on EC2 mount instructions to learn how to mount my file system on an EC2 instance:
I mounted my file system as /efs, and there it was:
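For reference, the mount itself is a standard NFSv4.1 mount against the file system’s DNS name. Here’s a representative command with a placeholder file system ID and Region – the EC2 mount instructions in the console show the exact DNS name and recommended options for your own file system:
$ sudo mkdir -p /efs
$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /efs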
I copied a bunch of files over, and spent some time watching the NFS stats:
The console reports on the amount of space consumed by my file systems (this information is collected every hour and is displayed 2-3 hours after it is collected):
CloudWatch Metrics Each file system delivers the following metrics to CloudWatch:
BurstCreditBalance – The amount of data that can be transferred at the burst level of throughput.
ClientConnections – The number of clients that are connected to the file system.
DataReadIOBytes – The number of bytes read from the file system.
DataWriteIOBytes – The number of bytes written to the file system.
MetadataIOBytes – The number of bytes of metadata read and written.
TotalIOBytes – The sum of the preceding three metrics.
PermittedThroughput – The maximum allowed throughput, based on file system size.
PercentIOLimit – The percentage of the available I/O utilized in General Purpose mode.
You can see the metrics in the CloudWatch Console:
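The same numbers are available from the command line as well. Here’s a hedged sketch that pulls PermittedThroughput for a placeholder file system ID over a placeholder time range:
$ aws cloudwatch get-metric-statistics --namespace AWS/EFS \
    --metric-name PermittedThroughput \
    --dimensions Name=FileSystemId,Value=fs-12345678 \
    --statistics Average --period 300 \
    --start-time 2016-06-28T00:00:00Z --end-time 2016-06-29T00:00:00Z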
EFS Bursting, Workloads, and Performance The throughput available to each of your EFS file systems will grow as the file system grows. Because file-based workloads are generally spiky, with demands for high levels of throughput for short amounts of time and low levels the rest of the time, EFS is designed to burst to high throughput levels on an as-needed basis.
All file systems can burst to 100 MB per second of throughput. Those over 1 TB can burst to an additional 100 MB per second for each TB stored. For example, a 2 TB file system can burst to 200 MB per second and a 10 TB file system can burst to 1,000 MB per second of throughput. File systems larger than 1 TB can always burst for 50% of the time if they are inactive for the other 50%.
EFS uses a credit system to determine when a file system can burst. Each one accumulates credits at a baseline rate (50 MB per second per TB of storage) that is determined by the size of the file system, and spends them whenever it reads or writes data. The accumulated credits give the file system the ability to drive throughput beyond the baseline rate.
Here are some examples to give you a better idea of what this means in practice:
A 100 GB file system can burst up to 100 MB per second for up to 72 minutes each day, or drive up to 5 MB per second continuously.
A 10 TB file system can burst up to 1 GB per second for 12 hours each day, or drive 500 MB per second continuously.
To learn more about how the credit system works, read about File System Performance in the EFS documentation.
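As a quick sanity check of the first example, using the 50 MB per second per TB baseline rate described above: a 100 GB file system accrues credits at 5 MB per second, which adds up to 432,000 MB of credits over a day, and spending them at 100 MB per second lasts 72 minutes:
# 100 GB at 50 MB/s per TB      = 5 MB/s baseline (the continuous rate)
# 5 MB/s x 86,400 seconds       = 432,000 MB of credits accrued per day
# 432,000 MB / 100 MB/s         = 4,320 seconds = 72 minutes of bursting per day
$ echo $(( 5 * 86400 / 100 / 60 ))
72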
In order to gain a better understanding of this feature, I spent a couple of days copying and concatenating files, ultimately ending up using well over 2 TB of space on my file system. I watched the PermittedThroughput metric grow in concert with my usage as soon as my file collection exceeded 1 TB. Here’s what I saw:
As is the case with any file system, the throughput you’ll see is dependent on the characteristics of your workload. The average I/O size, the number of simultaneous connections to EFS, the file access pattern (random or sequential), the request model (synchronous or asynchronous), the NFS client configuration, and the performance characteristics of the EC2 instances running the NFS clients each have an effect (positive or negative). Briefly:
Average I/O Size – The work associated with managing the metadata associated with small files via the NFS protocol, coupled with the work that EFS does to make your data highly durable and highly available, combine to create some per-operation overhead. In general, overall throughput will increase in concert with the average I/O size since the per-operation overhead is amortized over a larger amount of data. Also, reads will generally be faster than writes.
Simultaneous Connections – Each EFS file system can accommodate connections from thousands of clients. Environments that can drive highly parallel behavior (from multiple EC2 instances) will benefit from the ability that EFS has to support a multitude of concurrent operations.
Request Model – If you enable asynchronous writes to the file system by including the async option at mount time, pending writes will be buffered on the instance and then written to EFS asynchronously. Accessing a file system that has been mounted with the sync option or opening files using an option that bypasses the cache (e.g. O_DIRECT) will, in turn, issue synchronous requests to EFS.
NFS Client Configuration – Some NFS clients use laughably small (by today’s standards) values for the read and write buffers by default. Consider increasing them to 1 MiB (again, this is an option to the mount command). You can use an NFS 4.0 or 4.1 client with EFS; the latter will provide better performance.
EC2 Instances – Applications that perform large amounts of I/O sometimes require a large amount of memory and/or compute power as well. Be sure that you have plenty of both; choose an appropriate instance size and type. If you are performing asynchronous reads and writes, the kernel will use additional memory for caching. As a side note, the performance characteristics of EFS file systems are not dependent on the use of EBS-optimized instances.
Benchmarking of file systems is a blend of art and science. Make sure that you use mature, reputable tools, run them more than once, and make sure that you examine your results in light of the considerations listed above. You can also find some detailed data regarding expected performance on the Amazon Elastic File System page.
Available Now EFS is available now in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) Regions and you can start using it today. Pricing is based on the amount of data that you store, sampled several times per day and charged by the Gigabyte-month, pro-rated as usual, starting at $0.30 per GB per month in the US East (Northern Virginia) Region. There are no minimum fees and no setup costs (see the EFS Pricing page for more information). If you are eligible for the AWS Free Tier, you can use up to 5 GB of EFS storage per month at no charge.
The AWS team spends a lot of time looking in to ways to deliver innovation based around improvements in price/performance. Quite often, this means wrestling with interesting economic and technical dilemmas.
For example, it turns out that there are some really interesting trade-offs between HDD and SSD storage. On the one hand, today’s SSD devices provide more IOPS per dollar, more throughput per gigabyte, and lower latency than today’s HDD devices. On the other hand, continued density improvements in HDD technology drive the cost per gigabyte down, but also reduce the effective throughput per gigabyte. We took this as a challenge and asked ourselves—could we use cost-effective HDD devices to build a high-throughput storage option for EBS that would deliver consistent performance for common workloads like big data and log processing?
Of course we could!
Today we are launching a new pair of low-cost EBS volume types that take advantage of the scale of the cloud to deliver high throughput on a consistent basis, for use with EC2 instances and Amazon EMR clusters (prices are for the US East (Northern Virginia) Region; please see the EBS Pricing page for other regions):
Throughput Optimized HDD (st1) – Designed for high-throughput MapReduce, Kafka, ETL, log processing, and data warehouse workloads; $0.045 / gigabyte / month.
Cold HDD (sc1) – Designed for workloads similar to those for Throughput Optimized HDD that are accessed less frequently; $0.025 / gigabyte / month.
Like the existing General Purpose SSD (gp2) volume type, the new magnetic volumes give you baseline performance, burst performance, and a burst credit bucket. While the SSD volumes define performance in terms of IOPS (Input/Output Operations Per Second), the new volumes define it in terms of throughput. The burst values are based on the amount of storage provisioned for the volume:
Throughput Optimized HDD (st1) – Starts at 250 MB/s for a 1 terabyte volume, and grows by 250 MB/s for every additional provisioned terabyte until reaching a maximum burst throughput of 500 MB/s.
Cold HDD (sc1) – Starts at 80 MB/s for a 1 terabyte volume, and grows by 80 MB/s for every additional provisioned terabyte until reaching a maximum burst throughput of 250 MB/s.
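If you want to experiment with one of the new volume types, creating an st1 volume from the command line looks roughly like this (the size, Availability Zone, and Region are placeholders):
$ aws ec2 create-volume --volume-type st1 --size 1000 \
    --availability-zone us-east-1a --region us-east-1
Swap in sc1 for the Cold HDD variant.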
Evolution of EBS I like to think of customer-driven product and feature development in evolutionary terms. New offerings within a category often provide broad solutions that are a good fit for a wide variety of use cases. Over time, as we see how customers put the new offering to use and provide us with feedback on how we can do even better, a single initial offering will often speciate into several new offerings, each one tuned to the needs of a particular customer type and/or use case.
The various storage options for EC2 instances are a great example of this. Here’s a brief timeline of some of the most significant developments:
Workload Characteristics We tuned these volumes to deliver great price/performance when used for big data workloads. In order to achieve the levels of performance that are possible with the volumes, your application must perform large and sequential I/O operations, which is typical of big data workloads. This is due to the nature of the underlying magnetic storage, which can transfer contiguous data with great rapidity. Small random access I/O operations (often generated by database engines) are less efficient and will result in lower throughput. The General Purpose SSD volumes are a much better fit for this access pattern.
For both of the new magnetic volume types, the burst credit bucket can grow until it reaches the size of the volume. In other words, when a volume’s bucket is full, you can scan the entire volume at the burst rate. Each I/O request of 1 megabyte or less counts as 1 megabyte’s worth of credit. Sequential I/O operations are merged into larger ones where possible; this can increase throughput and maximizes the value of the burst credit bucket (to learn more about how the bucket operates, visit the Performance Burst Details section of my New SSD-Backed Elastic Block Storage post).
If your application makes use of the file system and the operating system’s page cache (as just about all applications do), we recommend that you set the volume’s read-ahead buffer to 1 MiB on the EC2 instance that the volume is attached to. Here’s how you do that using an instance that is running Ubuntu or that was booted from the Amazon Linux AMI (adjust the device name as needed):
$ sudo blockdev --setra 2048 /dev/xvdf
The value is expressed as the number of 512-byte sectors to be used for buffering; 2,048 sectors works out to 1 MiB.
This value will improve read performance for workloads that consist of large, sequential reads. However, it may increase latency for workloads that consist of small, random read operations.
Most customers are using Linux kernel versions before 4.2, and the read-ahead setting is all they need to tune. For customers using newer kernels, we also recommend setting xen_blkfront.max to 256 for the best performance. To set this parameter on an instance that runs the Amazon Linux AMI, edit /boot/grub/menu.list so that it invokes the kernel as follows:
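(The entry on your instance will differ; the kernel version, root device, and console arguments below are placeholders, and the only change needed is appending xen_blkfront.max=256 to the existing kernel line.)
kernel /boot/vmlinuz-4.4.10-22.54.amzn1.x86_64 root=LABEL=/ console=ttyS0 xen_blkfront.max=256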
If your file contains multiple entries, edit the one that corresponds to the active kernel. This is a boot-time setting so you’ll need to reboot the instance in order for the setting to take effect. If you are using a Linux distribution that does not use the Grub bootloader, you will need to figure out how to make the equivalent change to your configuration.
Comparing EBS Volume Types Here’s a table that summarizes the specifications and use cases of each EBS volume type (Although not shown in the table, the original EBS Magnetic offering is still available if needed for your application):
CloudFormation Template for Testing In order to make it as easy as possible for you to set up a test environment on a reproducible basis, we have created a simple CloudFormation template. You can launch the st1 template to create an EC2 instance with a 2 terabyte st1 volume attached. The st1 template instructions contain some additional information.
As you can see from the table above, this new offering gives you a unique combination of high throughput and a very low cost per gigabyte.
I am looking forward to your feedback so that we can continue to evolve EBS to meet your ever-growing (and continually diversifying) needs. Leave me a comment and I’ll make sure that the team sees it.
Ian Meyers is a Principal Solutions Architect with Amazon Web Services
Zach Christopherson, an Amazon Redshift Database Engineer, contributed to this post
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. Customers use Amazon Redshift for everything from accelerating existing database environments that are struggling to scale, to ingestion of web logs for big data analytics use cases. Amazon Redshift provides an industry standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and re-use existing analytics queries.
Amazon Redshift can run any type of data model, from a production transaction system third-normal-form model, to star and snowflake schemas, or simple flat tables. As customers adopt Amazon Redshift, they must consider its architecture in order to ensure that their data model is correctly deployed and maintained by the database. This post takes you through the most common issues that customers find as they adopt Amazon Redshift, and gives you concrete guidance on how to address each. If you address each of these items, you should be able to achieve optimal performance of queries and be able to scale effectively to meet customer demand.
Issue #1: Incorrect column encoding
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded (see Choosing a Column Compression Type in the Amazon Redshift Database Developer Guide), which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied. To determine if you are deviating from this best practice, run the following query to determine if any tables have NO column encoding applied:
SELECT database, schema || '.' || "table" AS "table", encoded, size FROM svv_table_info WHERE encoded='N' ORDER BY 2;
Afterward, review the tables and columns which aren’t encoded by running the following query:
SELECT trim(n.nspname || '.' || c.relname) AS "table", trim(a.attname) AS "column", format_type(a.atttypid, a.atttypmod) AS "type", format_encoding(a.attencodingtype::integer) AS "encoding", a.attsortkeyord AS "sortkey" FROM pg_namespace n, pg_class c, pg_attribute a WHERE n.oid = c.relnamespace AND c.oid = a.attrelid AND a.attnum > 0 AND NOT a.attisdropped AND n.nspname NOT IN ('information_schema','pg_catalog','pg_toast') AND format_encoding(a.attencodingtype::integer) = 'none' AND c.relkind='r' AND a.attsortkeyord != 1 ORDER BY n.nspname, c.relname, a.attnum;
If you find that you have tables without optimal column encoding, then use the Amazon Redshift Column Encoding Utility on AWS Labs GitHub to apply encoding. This command line utility uses the ANALYZE COMPRESSION command on each table. If encoding is required, it generates a SQL script which creates a new table with the correct encoding, copies all the data into the new table, and then transactionally renames the new table to the old name while retaining the original data. (Please note that the first column in a compound sort key should not be encoded, and is not encoded by this utility.)
Issue #2 – Skewed table data
Amazon Redshift is a distributed, shared-nothing database architecture where each node in the cluster stores a subset of the data. When a table is created, you decide whether to spread the data evenly among nodes (the default), or to place data on a node on the basis of one of the columns (the distribution key). By choosing columns for distribution that are commonly joined together, you can minimize the amount of data transferred over the network during the join. This can significantly increase performance on these types of queries. A good distribution key has the following characteristics:
High cardinality – There should be a large number of unique data values in the column relative to the number of nodes in the cluster.
Uniform distribution/low skew – Each unique value in the distribution key should occur in the table an even number of times. This allows Amazon Redshift to put the same number of records on each node in the cluster.
Commonly joined – The columns in a distribution key should be those that you usually join to other tables. If you have many possible columns that fit this criterion, then you may choose the column that joins to the largest table.
A skewed distribution key results in some nodes working harder than others during query execution, with unbalanced CPU and memory usage, and the query ultimately runs only as fast as the slowest node:
If skew is a problem, you typically see that node performance is uneven on the cluster. Use one of the admin scripts in the Amazon Redshift Utils GitHub repository, such as table_inspector.sql, to see how data blocks in a distribution key map to the slices and nodes in the cluster.
If you find that you have tables with skewed distribution keys, then consider changing the distribution key to a column that exhibits high cardinality and uniform distribution. Evaluate a candidate column as a distribution key by creating a new table using CTAS:
CREATE TABLE MY_TEST_TABLE DISTKEY (<COLUMN NAME>) AS SELECT * FROM <TABLE NAME>;
Run the table_inspector.sql script against the table again to analyze data skew.
If there is no good distribution key in any of your records, you may find that moving to EVEN distribution works better, due to the lack of a single node being a hotspot. For small tables, you can also use DISTSTYLE ALL to place table data onto every node in the cluster.
Issue #3 – Queries not benefiting from sort keys
Amazon Redshift tables can have a sort key column identified, which acts like an index in other databases but does not incur the storage cost that indexes do on other platforms (for more information, see Choosing Sort Keys). A sort key should be created on those columns which are most commonly used in WHERE clauses. If you have a known query pattern, then COMPOUND sort keys give the best performance; if end users query different columns equally, then use an INTERLEAVED sort key.
To determine which tables don’t have sort keys, and how often they have been queried, run the following query:
SELECT database, table_id, schema || '.' || "table" AS "table", size, nvl(s.num_qs,0) num_qs FROM svv_table_info t LEFT JOIN (SELECT tbl, COUNT(distinct query) num_qs FROM stl_scan s WHERE s.userid > 1 AND s.perm_table_name NOT IN ('Internal Worktable','S3') GROUP BY tbl) s ON s.tbl = t.table_id WHERE t.sortkey1 IS NULL ORDER BY 5 desc;
You can run a tutorial that walks you through how to address unsorted tables in the Amazon Redshift Developer Guide. You can also take advantage of another GitHub admin script that recommends sort keys based on query activity. Bear in mind that queries evaluated against a sort key column must not apply a SQL function to the sort key; instead, ensure that you apply the functions to the compared values so that the sort key is used. This is commonly found on TIMESTAMP columns that are used as sort keys.
Issue #4 – Tables without statistics or which need vacuum
Amazon Redshift, like other databases, requires statistics about tables and the composition of data blocks being stored in order to make good decisions when planning a query (for more information, see Analyzing Tables). Without good statistics, the optimiser may make suboptimal or incorrect choices about the order in which to access tables, or how to join datasets together.
The ANALYZE Command History topic in the Amazon Redshift Developer Guide supplies queries to help you address missing or stale statistics, and you can also simply run the missing_table_stats.sql admin script to determine which tables are missing stats, or the statement below to determine tables that have stale statistics:
SELECT database, schema || '.' || "table" AS "table", stats_off FROM svv_table_info WHERE stats_off > 5 ORDER BY 2;
In Amazon Redshift, data blocks are immutable. When rows are DELETED or UPDATED, they are simply logically deleted (flagged for deletion) but not physically removed from disk. Updates result in a new block being written with new data appended. Both of these operations cause the previous version of the row to continue consuming disk space and continue being scanned when a query scans the table. As a result, table storage space is increased and performance degraded due to otherwise avoidable disk I/O during scans. A VACUUM command recovers the space from deleted rows and restores the sort order.
To address issues with tables with missing or stale statistics or where vacuum is required, run another AWS Labs utility, Analyze & Vacuum Schema. This ensures that you always keep up-to-date statistics, and only vacuum tables that actually need reorganisation.
Issue #5 – Tables with very large VARCHAR columns
During processing of complex queries, intermediate query results might need to be stored in temporary blocks. These temporary tables are not compressed, so unnecessarily wide columns consume excessive memory and temporary disk space, which can affect query performance. For more information, see Use the Smallest Possible Column Size.
Use the following query to generate a list of tables that should have their maximum column widths reviewed:
SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;
After you have a list of tables, identify which table columns have wide varchar columns and then determine the true maximum width for each wide column, using the following query:
SELECT max(len(rtrim(column_name))) FROM table_name;
In some cases, you may have large VARCHAR type columns because you are storing JSON fragments in the table, which you then query with JSON functions. If you query the top running queries for the database using the top_queries.sql admin script, pay special attention to SELECT * queries which include the JSON fragment column. If end users query these large columns but don’t actually execute JSON functions against them, consider moving them into another table that only contains the primary key column of the original table and the JSON column.
If you find that the table has columns that are wider than necessary, then you need to re-create a version of the table with appropriate column widths by performing a deep copy.
Issue #6 – Queries waiting on queue slots
Amazon Redshift runs queries using a queuing system known as workload management (WLM). You can define up to 8 queues to separate workloads from each other, and set the concurrency on each queue to meet your overall throughput requirements.
In some cases, the queue to which a user or query has been assigned is completely busy and a user’s query must wait for a slot to be open. During this time, the system is not executing the query at all, which is a sign that you may need to increase concurrency.
First, you need to determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but they share the same memory allocation (unless you increase it). You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal, as we’ll see next.
Issue #7 – Queries that are disk-based
If a query isn’t able to completely execute in memory, it may need to use disk-based temporary storage for parts of an explain plan. The additional disk I/O slows down the query, and can be addressed by increasing the amount of memory allocated to a session (for more information, see WLM Dynamic Memory Allocation).
To determine if any queries have been writing to disk, use the following query:
SELECT q.query, trim(q.cat_text) FROM (SELECT query, replace(listagg(text,' ') WITHIN GROUP (ORDER BY sequence), '\n', ' ') AS cat_text FROM stl_querytext WHERE userid>1 GROUP BY query) q JOIN (SELECT distinct query FROM svl_query_summary WHERE is_diskbased='t' AND (LABEL LIKE 'hash%' OR LABEL LIKE 'sort%' OR LABEL LIKE 'aggr%') AND userid > 1) qs ON qs.query = q.query;
Based on the user or the queue assignment rules, you can increase the amount of memory given to the selected queue to prevent queries needing to spill to disk to complete. You can also increase the WLM_QUERY_SLOT_COUNT (http://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html) for the session from the default of 1 to the maximum concurrency for the queue. As outlined in Issue #6, this may result in queries queuing, so use it with care.
Issue #8 – Commit queue waits
Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to a commit queue.
If you are committing too often on your database, you will start to see waits on the commit queue increase, which can be viewed with the commit_stats.sql admin script. This script shows the largest queue length and queue time for queries run in the past two days. If you have queries that are waiting on the commit queue, then look for sessions that are committing multiple times per session, such as ETL jobs that are logging progress or inefficient data loads.
Issue #9 – Inefficient data loads
Amazon Redshift best practices suggest the use of the COPY command to perform data loads. This API operation uses all compute nodes in the cluster to load data in parallel, from sources such as Amazon S3, Amazon DynamoDB, Amazon EMR HDFS file systems, or any SSH connection.
When performing data loads, you should compress the files to be loaded whenever possible; Amazon Redshift supports both GZIP and LZO compression. It is more efficient to load a large number of smaller files than one large one, and the ideal file count is a multiple of the slice count. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 16 slices. By ensuring you have an equal number of files per slice, you know that the COPY will use cluster resources evenly and complete as quickly as possible.
An anti-pattern is to insert data directly into Amazon Redshift, with single record inserts or the use of a multi-value INSERT statement, which allows up to 16 MB of data to be inserted at one time. These are leader node–based operations, and can create significant performance bottlenecks by maxing out the leader node CPU or memory.
Issue #10 – Inefficient use of Temporary Tables
Amazon Redshift provides temporary tables, which are like normal tables except that they are only visible within a single session. When the user disconnects the session, the tables are automatically deleted. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table, while the SELECT … INTO and C(T)TAS commands use the input data to determine column names, sizes and data types, and use default storage properties.
These default storage properties may cause issues if not carefully considered. Amazon Redshift’s default table structure is to use EVEN distribution with no column encoding. This is a sub-optimal data structure for many types of queries, and if you are using select/into syntax you cannot set the column encoding or distribution and sort keys.
It is highly recommended that you convert all select/into syntax to use the CREATE statement. This ensures that your temporary tables have column encoding and are distributed in a fashion that is sympathetic to the other entities that are part of the workflow. To perform a conversion of a statement which uses:
select column_a, column_b into #my_temp_table from my_table;
You would first analyse the data for optimal column encoding, for example by running ANALYZE COMPRESSION against the source table:
analyze compression my_table;
And then convert the select/into statement to:
BEGIN;
create temporary table my_temp_table(
    column_a varchar(128) encode lzo,
    column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b); -- Assuming you are sorting or grouping by column_b
insert into my_temp_table select column_a, column_b from my_table; COMMIT;
You may also wish to analyze statistics on the temporary table, if it is used as a join table for subsequent queries:
analyze my_temp_table;
This way, you retain the functionality of using temporary tables but control data placement on the cluster through distkey assignment and take advantage of the columnar nature of Amazon Redshift through use of Column Encoding.
Tip: Using explain plan alerts
The last tip is to use diagnostic information from the cluster during query execution. This is stored in an extremely useful view called STL_ALERT_EVENT_LOG. Use the perf_alert.sql admin script to diagnose issues that the cluster has encountered over the last seven days. This is an invaluable resource in understanding how your cluster develops over time.
Summary
Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. While Amazon Redshift can run any type of data model, you can avoid possible pitfalls that might decrease performance or increase cost, by being aware of how data is stored and managed. Run a simple set of diagnostic queries for common issues and ensure that you get the best performance possible.
If you have questions or suggestions, please leave a comment below.
Ian Meyers is a Principal Solutions Architect with Amazon Web Services
Zach Christopherson, an Amazon Redshift Database Engineer, contributed to this post
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. Customers use Amazon Redshift for everything from accelerating existing database environments that are struggling to scale, to ingestion of web logs for big data analytics use cases. Amazon Redshift provides an industry standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and re-use existing analytics queries.
Amazon Redshift can run any type of data model, from a production transaction system third-normal-form model, to star and snowflake schemas, or simple flat tables. As customers adopt Amazon Redshift, they must consider its architecture in order to ensure that their data model is correctly deployed and maintained by the database. This post takes you through the most common issues that customers find as they adopt Amazon Redshift, and gives you concrete guidance on how to address each. If you address each of these items, you should be able to achieve optimal performance of queries and be able to scale effectively to meet customer demand.
Issue #1: Incorrect column encoding
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded (see Choosing a Column Compression Type in the Amazon Redshift Database Developer Guide) , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied. To determine if you are deviating from this best practice, run the following query to determine if any tables have NO column encoding applied:
SELECT database, schema || '.' || "table" AS "table", encoded, size FROM svv_table_info WHERE encoded = 'N' ORDER BY 2;
Afterward, review the tables and columns which aren’t encoded by running the following query:
SELECT trim(n.nspname || '.' || c.relname) AS "table",
       trim(a.attname) AS "column",
       format_type(a.atttypid, a.atttypmod) AS "type",
       format_encoding(a.attencodingtype::integer) AS "encoding",
       a.attsortkeyord AS "sortkey"
FROM pg_namespace n, pg_class c, pg_attribute a
WHERE n.oid = c.relnamespace
  AND c.oid = a.attrelid
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND n.nspname NOT IN ('information_schema', 'pg_catalog', 'pg_toast')
  AND format_encoding(a.attencodingtype::integer) = 'none'
  AND c.relkind = 'r'
  AND a.attsortkeyord != 1
ORDER BY n.nspname, c.relname, a.attnum;
If you find that you have tables without optimal column encoding, then use the Amazon Redshift Column Encoding Utility on AWS Labs GitHub to apply encoding. This command line utility uses the ANALYZE COMPRESSION command on each table. If encoding is required, it generates a SQL script which creates a new table with the correct encoding, copies all the data into the new table, and then transactionally renames the new table to the old name while retaining the original data. (Please note that the first column in a compound sort key should not be encoded, and is not encoded by this utility.)
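If you want to spot-check a single table before running the utility, you can run the underlying ANALYZE COMPRESSION command yourself; it samples the table and reports a suggested encoding per column. The schema and table names below are placeholders. Note that ANALYZE COMPRESSION takes a table-level lock, so avoid running it against tables with heavy concurrent write activity.
analyze compression my_schema.my_table;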
Issue #2 – Skewed table data
Amazon Redshift is a distributed, shared-nothing database architecture in which each node in the cluster stores a subset of the data. When a table is created, you decide whether to spread the data evenly among nodes (the default) or to place data on a node on the basis of one of the columns. By choosing distribution columns that are commonly joined together, you can minimize the amount of data transferred over the network during the join, which can significantly increase performance for these types of queries. A good distribution key has the following properties:
High cardinality – There should be a large number of unique data values in the column relative to the number of nodes in the cluster.
Uniform distribution/low skew – Each unique value in the distribution key should occur in the table an even number of times. This allows Amazon Redshift to put the same number of records on each node in the cluster.
Commonly joined – The columns in a distribution key should be those that you usually join to other tables. If you have many possible columns that fit this criterion, then you may choose the column that joins to the largest table.
A skewed distribution key results in some nodes working harder than others during query execution, with unbalanced CPU and memory consumption, and the query ultimately runs only as fast as the slowest node.
If skew is a problem, you typically see that node performance is uneven on the cluster. Use one of the admin scripts in the Amazon Redshift Utils GitHub repository, such as table_inspector.sql, to see how data blocks in a distribution key map to the slices and nodes in the cluster.
If you find that you have tables with skewed distribution keys, then consider changing the distribution key to a column that exhibits high cardinality and uniform distribution. Evaluate a candidate column as a distribution key by creating a new table using CTAS:
CREATE TABLE MY_TEST_TABLE DISTKEY (<COLUMN NAME>) AS SELECT * FROM <TABLE NAME>;
Run the table_inspector.sql script against the table again to analyze data skew.
If no column in a table makes a good distribution key, you may find that moving to EVEN distribution works better, because it avoids any single node becoming a hotspot. For small tables, you can also use DISTSTYLE ALL to place table data onto every node in the cluster.
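As an illustration, both alternatives can be expressed with CTAS; the table names here are hypothetical:
-- Small, frequently joined dimension table: replicate a copy to every node
create table my_small_dim_all diststyle all as select * from my_small_dim;
-- No suitable distribution key: fall back to round-robin (EVEN) placement
create table my_fact_even diststyle even as select * from my_fact;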
Issue #3 – Queries not benefiting from sort keys
Amazon Redshift tables can have a sort key column identified, which acts like an index in other databases but which does not incur a storage cost as with other platforms (for more information, see Choosing Sort Keys). A sort key should be created on those columns which are most commonly used in WHERE clauses. If you have a known query pattern, then COMPOUND sort keys give the best performance; if end users query different columns equally, then use an INTERLEAVED sort key.
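For example, a fact table that is almost always filtered by date might declare a compound sort key that leads on the timestamp column; all names in this sketch are hypothetical:
create table my_fact_table (
  event_time timestamp not null,
  customer_id integer,
  amount decimal(18,2))
distkey (customer_id)
compound sortkey (event_time, customer_id);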
To determine which tables don’t have sort keys, and how often they have been queried, run the following query:
SELECT database, table_id, schema || '.' || "table" AS "table", size, nvl(s.num_qs, 0) num_qs
FROM svv_table_info t
LEFT JOIN (SELECT tbl, COUNT(distinct query) num_qs
           FROM stl_scan s
           WHERE s.userid > 1
             AND s.perm_table_name NOT IN ('Internal Worktable', 'S3')
           GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
You can run a tutorial that walks you through how to address unsorted tables in the Amazon Redshift Developer Guide. You can also take advantage of another GitHub admin script that recommends sort keys based on query activity. Bear in mind that queries evaluated against a sort key column must not apply a SQL function to the sort key; instead, ensure that you apply the functions to the compared values so that the sort key is used. This is commonly found on TIMESTAMP columns that are used as sort keys.
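To illustrate the point about functions on sort key columns, the two predicates below are logically similar, but only the second lets Amazon Redshift use the sort key effectively (table and column names are hypothetical):
-- Suboptimal: the function applied to the sort key column limits range-restricted scans
select count(*) from my_fact_table where trunc(event_time) = '2017-01-01';
-- Better: compare the sort key column directly and apply any manipulation to the literals
select count(*) from my_fact_table where event_time >= '2017-01-01' and event_time < '2017-01-02';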
Issue #4 – Tables without statistics or which need vacuum
Amazon Redshift, like other databases, requires statistics about tables and the composition of data blocks being stored in order to make good decisions when planning a query (for more information, see Analyzing Tables). Without good statistics, the optimiser may make suboptimal or incorrect choices about the order in which to access tables, or how to join datasets together.
The ANALYZE Command History topic in the Amazon Redshift Developer Guide supplies queries to help you address missing or stale statistics, and you can also simply run the missing_table_stats.sql admin script to determine which tables are missing stats, or the statement below to determine tables that have stale statistics:
SELECT database, schema || '.' || "table" AS "table", stats_off FROM svv_table_info WHERE stats_off > 5 ORDER BY 2;
In Amazon Redshift, data blocks are immutable. When rows are DELETED or UPDATED, they are simply logically deleted (flagged for deletion) but not physically removed from disk. Updates result in a new block being written with new data appended. Both of these operations cause the previous version of the row to continue consuming disk space and continue being scanned when a query scans the table. As a result, table storage space is increased and performance degraded due to otherwise avoidable disk I/O during scans. A VACUUM command recovers the space from deleted rows and restores the sort order.
To address issues with tables with missing or stale statistics or where vacuum is required, run another AWS Labs utility, Analyze & Vacuum Schema. This ensures that you always keep up-to-date statistics, and only vacuum tables that actually need reorganisation.
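The utility wraps the standard maintenance commands, so if you simply want to address one table directly you can run them yourself (the table name is a placeholder):
analyze my_schema.my_table;
vacuum my_schema.my_table;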
Issue #5 – Tables with very large VARCHAR columns
During processing of complex queries, intermediate query results might need to be stored in temporary blocks. These temporary tables are not compressed, so unnecessarily wide columns consume excessive memory and temporary disk space, which can affect query performance. For more information, see Use the Smallest Possible Column Size.
Use the following query to generate a list of tables that should have their maximum column widths reviewed:
SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;
After you have a list of tables, identify which table columns have wide varchar columns and then determine the true maximum width for each wide column, using the following query:
SELECT max(len(rtrim(column_name))) FROM table_name;
In some cases, you may have large VARCHAR type columns because you are storing JSON fragments in the table, which you then query with JSON functions. If you query the top running queries for the database using the top_queries.sql admin script, pay special attention to SELECT * queries which include the JSON fragment column. If end users query these large columns but don't actually execute JSON functions against them, consider moving them into another table that only contains the primary key column of the original table and the JSON column.
If you find that the table has columns that are wider than necessary, then you need to re-create a version of the table with appropriate column widths by performing a deep copy.
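A deep copy might look like the following sketch, which assumes a column that never actually holds more than 100 characters; all table and column names are hypothetical:
begin;
create table my_table_new (
  id bigint,
  short_note varchar(100) encode lzo)
distkey (id)
sortkey (id);
insert into my_table_new select id, short_note from my_table;
drop table my_table;
alter table my_table_new rename to my_table;
commit;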
Issue #6 – Queries waiting on queue slots
Amazon Redshift runs queries using a queuing system known as workload management (WLM). You can define up to 8 queues to separate workloads from each other, and set the concurrency on each queue to meet your overall throughput requirements.
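To see how your queues are currently configured (slots and working memory per queue), you can query the WLM configuration system table; the filter below assumes the usual layout in which user-defined queues have service_class values greater than 5:
select service_class, num_query_tasks, query_working_mem
from stv_wlm_service_class_config
where service_class > 5;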
In some cases, the queue to which a user or query has been assigned is completely busy and a user’s query must wait for a slot to be open. During this time, the system is not executing the query at all, which is a sign that you may need to increase concurrency.
First, you need to determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but they share the same memory allocation (unless you increase it). You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal, as we'll see next.
Issue #7 – Queries that are disk-based
If a query isn’t able to completely execute in memory, it may need to use disk-based temporary storage for parts of an explain plan. The additional disk I/O slows down the query, and can be addressed by increasing the amount of memory allocated to a session (for more information, see WLM Dynamic Memory Allocation).
To determine if any queries have been writing to disk, use the following query:
SELECT q.query, trim(q.cat_text)
FROM (SELECT query,
             replace(listagg(text, ' ') WITHIN GROUP (ORDER BY sequence), '\n', ' ') AS cat_text
      FROM stl_querytext
      WHERE userid > 1
      GROUP BY query) q
JOIN (SELECT distinct query
      FROM svl_query_summary
      WHERE is_diskbased = 't'
        AND (LABEL LIKE 'hash%' OR LABEL LIKE 'sort%' OR LABEL LIKE 'aggr%')
        AND userid > 1) qs ON qs.query = q.query;
Based on the user or the queue assignment rules, you can increase the amount of memory given to the selected queue to prevent queries needing to spill to disk to complete. You can also increase the WLM_QUERY_SLOT_COUNT (http://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html) for the session from the default of 1 up to the maximum concurrency for the queue. As outlined in Issue #6, this may result in queueing queries, so use it with care.
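For example, you might temporarily let a memory-intensive statement use three of its queue's slots for the duration of a session; this sketch assumes the target queue has at least three slots available:
set wlm_query_slot_count to 3;
-- run the memory-intensive statement here, for example a large CTAS or VACUUM
set wlm_query_slot_count to 1; -- return to the default once finished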
Issue #8 – Commit queue waits
Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to a commit queue.
If you are committing too often on your database, you will start to see waits on the commit queue increase, which can be viewed with the commit_stats.sql admin script. This script shows the largest queue length and queue time for queries run in the past two days. If you have queries that are waiting on the commit queue, then look for sessions that are committing multiple times per session, such as ETL jobs that are logging progress or inefficient data loads.
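The usual remedy for chatty ETL jobs is to wrap the individual steps in a single transaction so that only one commit is issued at the end; the table names in this sketch are hypothetical:
begin;
delete from my_target where load_date = '2017-01-01';
insert into my_target select * from my_staging where load_date = '2017-01-01';
commit;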
Issue #9 – Inefficient data loads
Amazon Redshift best practices suggest the use of the COPY command to perform data loads. This API operation uses all compute nodes in the cluster to load data in parallel, from sources such as Amazon S3, Amazon DynamoDB, Amazon EMR HDFS file systems, or any SSH connection.
When performing data loads, you should compress the files to be loaded whenever possible; Amazon Redshift supports both GZIP and LZO compression. It is more efficient to load a large number of small files than one large one, and the ideal file count is a multiple of the slice count. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 16 slices. By spreading an equal number of files across the slices, you know that COPY execution will use cluster resources evenly and complete as quickly as possible.
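A typical load of GZIP-compressed, pipe-delimited files split across a common key prefix might look like the following; the bucket, prefix, and IAM role are placeholders:
copy my_table
from 's3://my-bucket/data/part_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
gzip
delimiter '|';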
An anti-pattern is to insert data directly into Amazon Redshift, with single record inserts or the use of a multi-value INSERT statement, which allows up to 16 MB of data to be inserted at one time. These are leader node–based operations, and can create significant performance bottlenecks by maxing out the leader node CPU or memory.
Issue #10 – Inefficient use of Temporary Tables
Amazon Redshift provides temporary tables, which are like normal tables except that they are only visible within a single session. When the user disconnects the session, the tables are automatically deleted. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table, while the SELECT … INTO and C(T)TAS commands use the input data to determine column names, sizes and data types, and use default storage properties.
These default storage properties may cause issues if not carefully considered. Amazon Redshift’s default table structure is to use EVEN distribution with no column encoding. This is a sub-optimal data structure for many types of queries, and if you are using select/into syntax you cannot set the column encoding or distribution and sort keys.
It is highly recommended that you convert all SELECT … INTO syntax to use the CREATE statement. This ensures that your temporary tables have column encoding and are distributed in a fashion that is sympathetic to the other entities that are part of the workflow. For example, to convert a statement such as:
select column_a, column_b into #my_temp_table from my_table;
You would first determine the optimal column encoding for the temporary table's columns:
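One way to obtain these recommendations is to run ANALYZE COMPRESSION against the source table that feeds the temporary table, and then carry the suggested encodings into the new definition:
analyze compression my_table;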
And then convert the select/into statement to:
BEGIN;
create temporary table my_temp_table(
  column_a varchar(128) encode lzo,
  column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b); -- Assuming you are sorting or grouping by column_b

insert into my_temp_table select column_a, column_b from my_table;
COMMIT;
You may also wish to analyze statistics on the temporary table, if it is used as a join table for subsequent queries:
analyze my_temp_table;
This way, you retain the functionality of using temporary tables but control data placement on the cluster through distkey assignment and take advantage of the columnar nature of Amazon Redshift through use of Column Encoding.
Tip: Using explain plan alerts
The last tip is to use diagnostic information from the cluster during query execution. This is stored in an extremely useful view called STL_ALERT_EVENT_LOG. Use the perf_alert.sql admin script to diagnose issues that the cluster has encountered over the last seven days. This is an invaluable resource in understanding how your cluster develops over time.
Summary
Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. While Amazon Redshift can run any type of data model, you can avoid possible pitfalls that might decrease performance or increase cost by being aware of how data is stored and managed. Run a simple set of diagnostic queries for common issues and ensure that you get the best performance possible.
If you have questions or suggestions, please leave a comment below.
Container Integration
For a while now, containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries. We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, but is less flexible, as it is designed to be a container manager that is as simple to use as possible and "just works", rather than a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd. Anyway, let's get started with our run-through, beginning by creating a Fedora container tree in a subdirectory:
# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions. Now that the container is installed, let's set an initial root password:
# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#
We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. With the initial setup done, let's boot it up and log in as root with our new password:
$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.
Welcome to Fedora 20 (Heisenbug)!
[ OK ] Reached target Remote File Systems.
[ OK ] Created slice Root Slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Created slice System Slice.
[ OK ] Created slice system-getty.slice.
[ OK ] Reached target Slices.
[ OK ] Listening on Delayed Shutdown Socket.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket.
Starting Journal Service…
[ OK ] Started Journal Service.
[ OK ] Reached target Paths.
Mounting Debug File System…
Mounting Configuration File System…
Mounting FUSE Control File System…
Starting Create static device nodes in /dev…
Mounting POSIX Message Queue File System…
Mounting Huge Pages File System…
[ OK ] Reached target Encrypted Volumes.
[ OK ] Reached target Swap.
Mounting Temporary Directory…
Starting Load/Save Random Seed…
[ OK ] Mounted Configuration File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Mounted Temporary Directory.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Mounted Debug File System.
[ OK ] Mounted Huge Pages File System.
[ OK ] Started Load/Save Random Seed.
[ OK ] Started Create static device nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Trigger Flushing of Journal to Persistent Storage…
Starting Recreate Volatile Files and Directories…
[ OK ] Started Recreate Volatile Files and Directories.
Starting Update UTMP about System Reboot/Shutdown…
[ OK ] Started Trigger Flushing of Journal to Persistent Storage.
[ OK ] Started Update UTMP about System Reboot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Login Service…
Starting Permit User Sessions…
Starting D-Bus System Message Bus…
[ OK ] Started D-Bus System Message Bus.
Starting Cleanup of Temporary Directories…
[ OK ] Started Cleanup of Temporary Directories.
[ OK ] Started Permit User Sessions.
Starting Console Getty…
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
[ OK ] Started Login Service.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Fedora release 20 (Heisenbug) Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)
mycontainer login: root
Password:
-bash-4.2#
Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:
$ machinectl
MACHINE     CONTAINER SERVICE
mycontainer container nspawn
1 machines listed.
The “status” subcommand shows details about the container:
$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act…
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600
With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory. The “login” subcommand gets us a new login shell in the container:
# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.
Fedora release 20 (Heisenbug) Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)
mycontainer login:
The “reboot” subcommand reboots the container: # machinectl reboot mycontainer
The “poweroff” subcommand powers the container off: # machinectl poweroff mycontainer
So much about the machinectl tool. The tool knows a couple more commands; please check the man page for details. Note again that even though we use systemd-nspawn as the container manager here, the concepts apply to any container manager that implements the logic described here, including libvirt-lxc for example. machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, repeating the systemd-nspawn command from above):
# hostnamectl -M mycontainer set-hostname "wuff"
This uses hostnamectl(1) on the local container and sets its hostname. Similarly, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:
# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB     DESCRIPTION
-.mount                              loaded active mounted /
dev-hugepages.mount                  loaded active mounted Huge Pages File System
dev-mqueue.mount                     loaded active mounted POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted /proc/sys/kernel/random/boot_id
[…]
time-sync.target                     loaded active active  System Time Synchronized
timers.target                        loaded active active  Timers
systemd-tmpfiles-clean.timer         loaded active waiting Daily Cleanup of Temporary Directories
LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here; the blog story is already getting too long.) Let's use this to restart a service within our container:
# systemctl -M mycontainer restart systemd-resolved.service
systemctl has more container support than just the -M switch, though. With the -r switch it shows the units running on the host, plus all units of all local, running containers:
# systemctl -r
UNIT                                        LOAD   ACTIVE SUB     DESCRIPTION
boot.automount                              loaded active waiting EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0x2dLVDSx2d1-intel_backlight.device loaded active plugged /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[…]
timers.target                               loaded active active  Timers
mandb.timer                                 loaded active waiting Daily man-db cache update
systemd-tmpfiles-clean.timer                loaded active waiting Daily Cleanup of Temporary Directories
mycontainer:-.mount                         loaded active mounted /
mycontainer:dev-hugepages.mount             loaded active mounted Huge Pages File System
mycontainer:dev-mqueue.mount                loaded active mounted POSIX Message Queue File System
[…]
mycontainer:time-sync.target                loaded active active  System Time Synchronized
mycontainer:timers.target                   loaded active active  Timers
mycontainer:systemd-tmpfiles-clean.timer    loaded active waiting Daily Cleanup of Temporary Directories
LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
We can see here first the units of the host, followed by the units of the one container we have currently running. The units of the containers are prefixed with the container name and a colon (“:”). (The output is shortened again for brevity's sake.) The list-machines subcommand of systemctl shows a list of all running containers, inquiring the system managers within the containers about system state and health. More specifically, it shows whether containers are properly booted up, or whether there are any failed services:
# systemctl list-machines
NAME         STATE    FAILED JOBS
delta (host) running  0      0
mycontainer  running  0      0
miau         degraded 1      0
waldi        running  0      0
4 machines listed.
To make things more interesting, we have started two more containers in parallel. One of them has a failed service, which results in the machine state being degraded. Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:
# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes…
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.
However, it also supports -m to show the combined log stream of the host and all local containers: # journalctl -m -e
(Let's skip the output here completely; I figure you can extrapolate how this looks.) But it's not only systemd's own tools that understand container support these days; procps sports support for it, too:
# ps -eo pid,machine,args
 PID MACHINE     COMMAND
   1 -           /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[…]
2915 -           emacs contents/projects/containers.md
3403 -           [kworker/u16:7]
3415 -           [kworker/u16:9]
4501 -           /usr/libexec/nm-vpnc-service
4519 -           /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -           /usr/libexec/dconf-service
4980 -           /usr/lib/systemd/systemd-resolved
5006 -           /usr/lib64/firefox/firefox
5168 -           [kworker/u16:0]
5192 -           [kworker/u16:4]
5193 -           [kworker/u16:5]
5497 -           [kworker/u16:1]
5591 -           [kworker/u16:8]
5711 -           sudo -s
5715 -           /bin/bash
5749 -           /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer /usr/lib/systemd/systemd
5799 mycontainer /usr/lib/systemd/systemd-journald
5862 mycontainer /usr/lib/systemd/systemd-logind
5863 mycontainer /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer /usr/sbin/sshd -D
6527 mycontainer /usr/lib/systemd/systemd-resolved
[…]
This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself. But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus, sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it. sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar. systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host, networkd will by default provide a DHCP server and IPv4LL on any veth network interface named ve- followed by the container name. Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module nss-mymachines that makes the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above the container shares the network configuration with the host, however; hence let's restart the container, this time with a virtual veth network link between host and container:
# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b
Now (assuming that networkd is used in the container and outside), we can already ping the container using its name, due to the simple magic of nss-mymachines:
# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms
Of course, name resolution doesn't only work with ping; it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them the venerable ssh. And this is pretty much all I want to cover for now. We briefly touched on a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release. Note that the whole machine concept is actually not limited to containers, but covers VMs too, to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, since it usually requires a network transport instead of allowing direct syscall access. Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.