Python code creates curious, wordless comic strips at random, spewing them from the thermal printer mouth of a laser-cut body reminiscent of Disney Pixar’s WALL-E: meet the Vomit Comic Robot!
The age of the thermal printer!
Thermal printers allow you to instantly print photos, data, and text using a few lines of code, with no need for ink. More and more makers are using this handy, low-maintenance bit of kit for truly creative projects, from Pierre Muth’s tiny PolaPi-Zero camera to the sound-printing Waves project by Eunice Lee, Matthew Zhang, and Bomani McClendon (and our own Secret Santa Babbage).
Interaction designer and developer Cadin Batrack, whose background is in game design and interactivity, has built the Vomit Comic Robot, which creates “one-of-a-kind comics on demand by processing hand-drawn images through a custom software algorithm.”
The robot is made up of a Raspberry Pi 3, a USB thermal printer, and a handful of LEDs.
At the press of a button, Processing code selects one of a set of Cadin’s hand-drawn empty comic grids and then randomly picks images from a library to fill in the gaps.
Each image is associated with data that allows the code to fit it correctly into the available panels. Cadin says about the concept behind his build:
Although images are selected and placed randomly, the comic panel format suggests relationships between elements. Our minds create a story where there is none in an attempt to explain visuals created by a non-intelligent machine.
The Raspberry Pi saves the final image as a high-resolution PNG file (so that Cadin can sell prints on thick paper via Etsy), and a Python script sends it to be vomited up by the thermal printer.
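The random-assembly step can be sketched in a few lines of Python (a rough illustration, not Cadin's actual Processing code — the template names, panel shapes, and library layout here are all hypothetical):

```python
import random

# Hypothetical metadata: each template lists its empty panel shapes,
# and the library maps each shape to images known to fit it.
TEMPLATES = [
    {"file": "grid_2x2.png", "panels": ["square", "square", "square", "square"]},
    {"file": "grid_1x3.png", "panels": ["tall", "tall", "tall"]},
]
LIBRARY = {
    "square": ["cat.png", "rock.png", "door.png"],
    "tall": ["tree.png", "ladder.png", "ghost.png"],
}

def pick_comic(rng=random):
    """Choose a template at random, then a random image that fits each panel."""
    template = rng.choice(TEMPLATES)
    images = [rng.choice(LIBRARY[shape]) for shape in template["panels"]]
    return template["file"], images
```

From there the script would composite the chosen images onto the template, save the high-resolution PNG, and hand it to the printer (a library such as python-escpos is one common way to drive a USB thermal printer from Python, though the post doesn't say which Cadin uses).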
We have a soft spot for cute robots here at Pi Towers, and of course we make no exception for the Vomit Comic Robot. If, like us, you’re a fan of adorable bots, check out Mira, the tiny interactive robot by Alonso Martinez, and Peeqo, the GIF bot by Abhishek Singh.
Last week Backblaze made the exciting announcement that through partnerships with Packet and ServerCentral, cloud computing is available to Backblaze B2 Cloud Storage customers.
Those of you familiar with cloud computing will understand the significance of this news. We are now offering the least expensive cloud storage + cloud computing available anywhere. You no longer have to submit to the lock-in tactics and exorbitant prices charged by the other big players in the cloud services biz.
We understand that some of our cloud backup and storage customers might be unfamiliar with cloud computing. Backblaze made its name in cloud backup and object storage, and that’s what our customers know us for. In response to customers’ requests, we’ve directly connected our B2 cloud object storage with cloud compute providers. This adds the ability to use and run programs on data once it’s in the B2 cloud, opening up a world of new uses for B2. Just some of the possibilities include media transcoding and rendering, web hosting, application development and testing, business analytics, disaster recovery, on-demand computing capacity (cloud bursting), AI, and mobile and IoT applications.
The world has been moving to a multi-cloud / hybrid cloud world, and customers are looking for more choices than those offered by the existing cloud players. Our B2 compute partnerships build on our mission to offer cloud storage that’s astonishingly easy and low-cost. They enable our customers to move into a more flexible and affordable cloud services ecosystem that provides a greater variety of choices and costs far less. We believe we are helping to fulfill the promise of the internet by allowing customers to choose the best-of-breed services from the best vendors.
If You’re Not Familiar with Cloud Computing, Here’s a Quick Overview
Cloud computing is another component of cloud services, like object storage, that replicates in the cloud a basic function of a computer system. Think of services that operate in a cloud as an infinitely scalable version of what happens on your desktop computer. In your desktop computer you have computing/processing (CPU), fast storage (like an SSD), data storage (like your disk drive), and memory (RAM). Their counterparts in the cloud are computing (CPU), block storage (fast storage), object storage (data storage), and processing memory (RAM).
CPU, RAM, fast internal storage, and a hard drive are the basic building blocks of a computer. They are also the basic building blocks of cloud computing.
Some customers require only some of these services, such as cloud storage. B2 as a standalone service has proven to be an outstanding solution for those customers interested in backing up or archiving data. There are many customers that would like additional capabilities, such as performing operations on that data once it’s in the cloud. They need object storage combined with computing.
With the just announced compute partnerships, Backblaze is able to offer computing services to anyone using B2. A direct connection between Backblaze’s and our partners’ data centers means that our customers can process data stored in B2 with high speed, low latency, and zero data transfer costs.
Cloud service providers package up CPU, storage, and memory into services that you can rent on an hourly basis. You can scale up and down and add or remove services as you need them.
How Does Computing + B2 Work?
Those wanting to use B2 with computing will need to sign up for accounts with Backblaze and either Packet or ServerCentral. Packet customers need only select “SJC1” as their region and then get started. The process is also simple for ServerCentral customers — they just need to register with a ServerCentral account rep.
The direct connection between B2 and our compute partners means customers will experience very low latency (less than 10ms) between services. Even better, all data transfers between B2 and the compute partner are free. When combined with Backblaze B2, customers can obtain cloud computing services for as little as 50% of the cost of Amazon’s Elastic Compute Cloud (EC2).
Opening Up the Cloud “Walled Garden”
Traditionally, cloud vendors charge fees for customers to move data outside the “walled garden” of that particular vendor. These fees reach upwards of $0.12 per gigabyte (GB) for data egress. Such large fees for customers accessing their own data discourage a multi-cloud approach and prevent users from taking advantage of less expensive or better-performing options. With free transfers between B2 and Packet or ServerCentral, customers now have a predictable, scalable solution for computing and data storage while avoiding vendor lock-in. Dropbox made waves when they saved $75 million by migrating off of AWS. Adding computing to B2 helps anyone interested in moving some or all of their computing off of AWS, potentially cutting their AWS bill by 50% or more.
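To see how quickly egress fees add up, here is a back-of-the-envelope sketch using the $0.12/GB figure quoted above (real pricing is tiered and varies by vendor; the workload is invented for illustration):

```python
EGRESS_PER_GB = 0.12  # upper-end big-cloud egress fee quoted above, in USD

def egress_cost(gb_transferred, per_gb=EGRESS_PER_GB):
    """Cost of pulling your own data back out of a 'walled garden' cloud."""
    return gb_transferred * per_gb

# A hypothetical workload retrieving 10 TB per month for a year:
annual = egress_cost(10 * 1024) * 12
print(f"Annual egress at ${EGRESS_PER_GB:.2f}/GB: ${annual:,.2f}")
# ~ $14,745.60 per year -- versus $0 between B2 and Packet or ServerCentral
```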
What are the Advantages of Cloud Storage + Computing?
Using computing and storage in the cloud provides a number of advantages over using in-house resources.
You don’t have to purchase the actual hardware, software licenses, and provide space and IT resources for the systems.
Cloud computing is available with just a few minutes’ notice, and you pay only for the period of time you need. You avoid having additional hardware on your balance sheet.
Resources are in the cloud and can provide online services to customers, mobile users, and partners located anywhere in the world.
You can isolate the work on these systems from your normal production environment, making them ideal for testing and trying out new applications and development projects.
Computing resources scale when you need them to, providing temporary or ongoing extra resources for expected or unexpected demand.
They can provide redundant and failover services when and if your primary systems are unavailable for whatever reason.
Where Can I Learn More?
We encourage B2 customers to explore the options available at our partner sites, Packet and ServerCentral. They are happy to help customers understand what services are available and how to get started.
We are excited to see what you build! And please tell us in the comments what you are doing or have planned with B2 + computing.
In Part 2, we take a deeper look at the differences between HDDs and SSDs, how both HDD and SSD technologies are evolving, and how Backblaze takes advantage of SSDs in our operations and data centers.
The first time you booted a computer or opened an app on a computer with a solid-state drive (SSD), you were likely delighted. I know I was. I loved the speed, the silence, and just the wow factor of this new technology that seemed better in just about every way compared to hard drives.
I was ready to fully embrace the promise of SSDs. And I have. My desktop uses an SSD for booting, applications, and for working files. My laptop has a single 512GB SSD. I still use hard drives, however. The second, third, and fourth drives in my desktop computer are HDDs. The external USB RAID I use for local backup uses HDDs in four drive bays. When my laptop is at my desk it is attached to a 1.5TB USB backup hard drive. HDDs still have a place in my personal computing environment, as they likely do in yours.
Nothing stays the same for long, however, especially in the fast-changing world of computing, so we are certain to see new storage technologies coming to the fore, perhaps with even more wow factor.
Before we get to what’s coming, let’s review the primary differences between HDDs and SSDs in a little more detail in the following table.
A Comparison of HDDs to SSDs
| Attribute | HDD | SSD |
| --- | --- | --- |
| Power Draw / Battery Life | More power draw; averages 6–7 watts, so uses more battery | Less power draw; averages 2–3 watts, resulting in a 30+ minute battery boost |
| Cost | Only around $0.03 per gigabyte, very cheap (buying a 4TB model) | Expensive, roughly $0.20–$0.30 per gigabyte (based on buying a 1TB drive) |
| Capacity | Typically around 500GB to 2TB maximum for notebook-size drives; 10TB max for desktops | Typically not larger than 1TB for notebook-size drives; 4TB for desktops |
| Operating System Boot Time | Around 30–40 seconds average bootup time | Around 8–13 seconds average bootup time |
| Noise | Audible clicks and spinning platters can be heard | No moving parts, hence no sound |
| Vibration | The spinning of the platters can sometimes result in vibration | No vibration, as there are no moving parts |
| Heat Produced | Doesn’t produce much heat, but measurably more than an SSD due to moving parts and higher power draw | Lower power draw and no moving parts, so little heat is produced |
| Failure Rate | Mean time between failures of 1.5 million hours | Mean time between failures of 2.0 million hours |
| File Copy / Write Speed | Anywhere from 50–120 MB/s | Generally above 200 MB/s, up to 550 MB/s for cutting-edge drives |
| Encryption | Full Disk Encryption (FDE) supported on some models | Full Disk Encryption (FDE) supported on some models |
The HDD has an amazing history of improvement and innovation. From its inception in 1956 the hard drive has decreased in size 57,000 times, increased storage 1 million times, and decreased cost 2,000 times. In other words, the cost per gigabyte has decreased by 2 billion times in about 60 years.
Hard drive manufacturers made these dramatic advances by reducing the size, and consequently the seek times, of platters while increasing their density, improving disk reading technologies, adding multiple arms and read/write heads, developing better bus interfaces, and increasing spin speed and reducing friction with techniques such as filling drives with helium.
In 2005, the drive industry introduced perpendicular recording technology to replace the older longitudinal recording technology, which enabled areal density to reach more than 100 gigabits per square inch. Longitudinal recording aligns data bits horizontally in relation to the drive’s spinning platter, parallel to the surface of the disk, while perpendicular recording aligns bits vertically, perpendicular to the disk surface.
Other technologies such as bit patterned media recording (BPMR) are contributing to increased densities, as well. Introduced by Toshiba in 2010, BPMR is a proposed hard disk drive technology that could succeed perpendicular recording. It records data using nanolithography in magnetic islands, with one bit per island. This contrasts with current disk drive technology where each bit is stored in 20 to 30 magnetic grains within a continuous magnetic film.
Shingled magnetic recording (SMR) is a magnetic storage data recording technology used in HDDs to increase storage density and overall per-drive storage capacity. Shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track narrower and allowing for higher track density. Thus, the tracks partially overlap similar to roof shingles. This approach was selected because physical limitations prevent recording magnetic heads from having the same width as reading heads, leaving recording heads wider.
Track Spacing Enabled by SMR Technology (Seagate)
To increase the amount of data stored on a drive’s platter requires cramming the magnetic regions closer together, which means the grains need to be smaller so they won’t interfere with each other. In 2002, Seagate successfully demoed heat-assisted magnetic recording (HAMR). HAMR records magnetically using laser-thermal assistance that ultimately could lead to a 20 terabyte drive by 2019. (See our post on HAMR by Seagate’s CTO Mark Re, What is HAMR and How Does It Enable the High-Capacity Needs of the Future?)
Western Digital claims that its competing microwave-assisted magnetic recording (MAMR) could enable drive capacity to increase up to 40TB by the year 2025. Some industry watchers and drive manufacturers predict increases in areal density from today’s 0.86 terabits per square inch (Tbpsi) to 10 Tbpsi by 2025, resulting in as much as 100TB drive capacity in the next decade.
The future certainly does look bright for HDDs continuing to be with us for a while.
The Outlook for SSDs
SSDs are also in for some amazing advances.
SATA (Serial Advanced Technology Attachment) is the common hardware interface that allows the transfer of data to and from HDDs and SSDs. SATA SSDs are fine for the majority of home users: they are generally cheaper, though they operate at lower speeds and have a shorter write life.
While fine for everyday computing, in a RAID (Redundant Array of Independent Disks), server array, or data center environment, a better alternative has often been to use ‘SAS’ drives, which stands for Serial Attached SCSI. This is another type of interface that, again, is usable with either HDDs or SSDs. ‘SCSI’ stands for Small Computer System Interface (which is why SAS drives are sometimes referred to as ‘scuzzy’ drives). SAS offers increased IOPS (input/output operations per second) over SATA, meaning it can read and write data faster. This has made SAS an optimal choice for systems that require high performance and availability.
On an enterprise level, SAS prevails over SATA, as SAS supports over-provisioning to prolong write life and has been specifically designed to run in environments that require constant drive usage.
PCIe (Peripheral Component Interconnect Express) is a high speed serial computer expansion bus standard that supports drastically higher data transfer rates over SATA or SAS interfaces due to the fact that there are more channels available for the flow of data.
Many leading drive manufacturers have been adopting PCIe as the standard for new home and enterprise storage and some peripherals. For example, you’ll see that the latest Apple Macbooks ship with PCIe-based flash storage, something that Apple has been adopting over the years with their consumer devices.
PCIe can also be used within data centers for RAID systems and to create high-speed networking capabilities, increasing overall performance and supporting the newer and higher capacity HDDs.
As we covered in Part 1, SSDs are based on a type of non-volatile flash memory called NAND. NAND is subdivided into types based on how many bits of data are stored in each physical memory cell: SLC (single-level cell) stores one bit, MLC (multi-level cell) stores two, TLC (triple-level cell) stores three, and QLC (quad-level cell) stores four. The latest trend in NAND flash is QLC NAND.
Storing more data per cell makes NAND more dense, but it also makes the memory slower — it takes more time to read and write data when so much additional information (and so many more charge states) are stored within the same cell of memory.
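The slowdown follows directly from the arithmetic: each extra bit per cell doubles the number of distinct charge states the cell must reliably hold and distinguish. A quick illustration:

```python
# Bits stored per NAND cell -> distinct charge states the cell must hold.
for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    states = 2 ** bits
    print(f"{name}: {bits} bit(s)/cell -> {states} charge states")
# QLC must distinguish 16 voltage levels where SLC needs only 2,
# which is why its reads and writes take longer.
```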
QLC NAND memory is built on older process nodes with larger cells that can more easily store multiple bits of data. The new NAND tech offers higher overall reliability and a higher total number of program/erase (P/E) cycles.
QLC NAND wafer from which individual microcircuits are made
QLC NAND promises to produce faster and denser SSDs. The effect on price also could be dramatic. Tom’s Hardware is predicting that the advent of QLC could push 512GB SSDs down to $100.
Beyond HDDs and SSDs
There is significant work being done that is pushing the bounds of data storage beyond what is possible with spinning platters and microcircuits. A team at Harvard University has used genome-editing to encode video into live bacteria.
We’ve already discussed the benefits of SSDs; those that apply particularly to the data center are:
Low power consumption — When you are running lots of drives, power usage adds up. Anywhere you can conserve power is a win.
Speed — Data can be accessed faster, which is especially beneficial for caching databases and other data affecting overall application or system performance.
Lack of vibration — Reducing vibration improves reliability, thereby reducing problems and maintenance. Racks housing SSDs also don’t need the size and structural rigidity required for racks housing HDDs.
Low noise — Data centers will become quieter as more SSDs are deployed.
Low heat production — The less heat generated the less cooling and power required in the data center.
Faster booting — The faster a storage chassis can get online or a critical server can be rebooted after maintenance or a problem, the better.
Greater areal density — Data centers will be able to store more data in less space, which increases efficiency in all areas (power, cooling, etc.)
The top drive manufacturers say that they expect HDDs and SSDs to coexist for the foreseeable future in all areas — home, business, and data center, with customers choosing which technology and product will best fit their application.
How Backblaze Uses SSDs
In just about all respects, SSDs are superior to HDDs. So why don’t we replace the 100,000+ hard drives we have spinning in our data centers with SSDs?
Our operations team takes advantage of the benefits and savings of SSDs wherever they can, using them in every place that’s appropriate other than primary data storage. They’re particularly useful in our caching and restore layers, where we use them strategically to speed up data transfers. SSDs also speed up access to B2 Cloud Storage metadata. Our operations team is considering moving to SSDs to boot our Storage Pods, where the cost of a small SSD is competitive with hard drives, and their other attributes (small size, lack of vibration, speed, low power consumption, reliability) are all pluses.
A Future with Both HDDs and SSDs
IDC predicts that total data created will grow from approximately 33 zettabytes in 2018 to about 160 zettabytes in 2025. (See What’s a Byte? if you’d like help understanding the size of a zettabyte.)
Annual Size of the Global Datasphere
Over 90% of enterprise drive shipments today are HDD, according to IDC. By 2025, SSDs will comprise almost 20% of drive shipments. SSDs will gain share, but total growth in data created will result in massive sales of both HDDs and SSDs.
Enterprise Byte Shipments: HDD and SSD
As both HDD and SSD sales grow, so does the capacity of both technologies. Given the benefits of SSDs in many applications, we’re likely going to see SSDs replacing HDDs in all but the highest capacity uses.
It’s clear that there are merits to both HDDs and SSDs. If you’re not running a data center, and don’t have more than one or two terabytes of data to store on your home or business computer, your first choice likely should be an SSD. They provide a noticeable improvement in performance during boot-up and data transfer, and are smaller, quieter, and more reliable as well. Save the HDDs for secondary drives, NAS, RAID, and local backup devices in your system.
Perhaps some day we’ll look back at the days of spinning platters with the same nostalgia we feel for stereo LPs, and some of us will have an HDD paperweight on our floating anti-gravity desks as a conversation piece. Until the day that SSDs’ performance, capacity, and, finally, price expel the last HDD from the home and the data center, we can expect to live in a world that contains both solid-state SSDs and magnetic-platter HDDs, and as users we will reap the benefits of both technologies.
Don’t miss future posts on HDDs, SSDs, and other topics, including hard drive stats, cloud storage, and tips and tricks for backing up to the cloud. Use the Join button above to receive notification of future posts on our blog.
Jason Barnett used the pots feature of the Monzo banking API to create a simple e-paper display so that his kids can keep track of their pocket money.
For those outside the UK: Monzo is a smartphone-based bank that allows customers to manage their money and payment cards via an app, removing the bank clerk middleman.
In the Monzo banking app, users can set up pots, which allow them to organise their money into various, you guessed it, pots. You want to put aside holiday funds, budget your food shopping, or, like Jason, manage your kids’ pocket money? Using pots is an easy way to do it.
Jason’s Monzo Pot ePaper tracker
After failed attempts at keeping track of his sons’ pocket money via a scrap of paper stuck to the fridge, Jason decided to try a new approach.
He started his build by installing Stretch Lite to the SD card of his Raspberry Pi Zero W. “The Pi will be running headless (without screen, mouse or keyboard)”, he explains on his blog, “so there is no need for a full-fat Raspbian image.” While Stretch Lite was downloading, he set up the Waveshare ePaper HAT on his Zero W. He notes that Pimoroni’s “Inky pHAT would be easiest,” but his tutorial is specific to the Waveshare device.
Before ejecting the SD card, Jason updated the boot partition to allow him to access the Pi via SSH. He talks makers through that process here.
Among the libraries he installed for the project is pyMonzo, a Python wrapper for the Monzo API created by Paweł Adamczak. Monzo is still in its infancy, and the API is partly under construction. Until it’s completed, Paweł’s wrapper offers a more stable way to use it.
After installing the software, it was time to set up the e-paper screen for the tracker. Jason adjusted the code for the API so that the screen reloads information every 15 minutes, displaying the up-to-date amount of pocket money in both kids’ pots.
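A minimal sketch of such a polling loop (our approximation, not Jason's code: the pot names are placeholders, the token handling is simplified, and the print call stands in for the e-paper drawing routine). It calls the Monzo REST API's `/pots` endpoint directly; pyMonzo wraps the same endpoints:

```python
import json
import time
import urllib.request

ACCESS_TOKEN = "your-monzo-access-token"                  # placeholder
POT_NAMES = {"Pocket money: Sam", "Pocket money: Alex"}   # hypothetical pots

def format_balance(pence):
    """Monzo reports balances in pence; render them as pounds."""
    return f"£{pence / 100:.2f}"

def fetch_pot_balances(token):
    """Return {pot name: balance in pence} for the pots we track."""
    req = urllib.request.Request(
        "https://api.monzo.com/pots",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        pots = json.load(resp)["pots"]
    return {p["name"]: p["balance"] for p in pots if p["name"] in POT_NAMES}

if __name__ == "__main__":
    while True:
        for name, balance in fetch_pot_balances(ACCESS_TOKEN).items():
            print(name, format_balance(balance))  # draw to the ePaper HAT here
        time.sleep(15 * 60)  # refresh every 15 minutes
```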
Here is how Jason describes going to the supermarket with his sons, now that he has completed the tracker:
“Daddy, I want (insert first thing picked up here), I’ve always wanted one of these my whole life!” […] Even though you have never seen that (insert thing here) before, I can quickly open my Monzo app, flick to Account, and say “You have £3.50 in your money box”. If my boy wants it, a 2-second withdrawal is made whilst queueing, and done — he walks away with a new (again, insert whatever he wanted his whole life here) and is happy!
Jason’s blog offers a full breakdown of his project, including all necessary code and the specs for the physical build. Be sure to head over and check it out.
Have you used an API in your projects? What would you build with one?
Halloween: that glorious time of year when you’re officially allowed to make your friends jump out of their skin with your pranks. For those among us who enjoy dressing up, Halloween is also the occasion to go all out with costumes. And so, dear reader, we present to you: a steampunk tentacle hat, created by Derek Woodroffe.
Derek is an engineer who loves all things electronics. He’s part of Extreme Kits, and he runs the website Extreme Electronics. Raspberry Pi Zero-controlled Tesla coils are Derek’s speciality — he’s even been on one of the Royal Institution’s Christmas Lectures with them! Skip ahead to 15:06 in this video to see Derek in action:
The first lecture from Professor Saiful Islam’s 2016 series of CHRISTMAS LECTURES, ‘Supercharged: Fuelling the future’. Watch all three lectures at http://richannel.org/christmas-lectures.
Wearables are electronically augmented items you can wear. They might take the form of spy eyeglasses, clothes with integrated sensors, or, in this case, headgear adorned with mechanised tentacles.
Why did Derek make this? We’re not entirely sure, but we suspect he’s a fan of the Cthulhu Mythos. In any case, we were a little astounded by his project. This is how we reacted when Derek tweeted us about it:
@ExtElec @extkits This is beyond incredible and completely unexpected.
In fact, we had to recover from a fit of laughter before we actually managed to type this answer.
Making a steampunk tentacle hat
Derek made the ‘skeleton’ of each tentacle out of a net curtain spring, acrylic rings, and four lengths of fishing line. Two servomotors connect to two ends of fishing line each, and pull them to move the tentacle.
Then he covered the tentacles with nylon stockings and liquid latex, glued suckers cut out of MDF onto them, and mounted them on an acrylic base. The eight motors connect to a Raspberry Pi via an I2C 8-port PWM controller board.
The Pi makes the servos pull the tentacles so that they move in sine waves in both the x and y directions, seemingly of their own accord. Derek cut open the top of a hat to insert the mounted tentacles, and he used more liquid latex to give the whole thing a slimy-looking finish.
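The motion itself reduces to a little trigonometry. Here is a sketch of the idea (our illustration, not Derek's code — the channel numbering, pulse widths, and the commented-out PWM call are all assumptions):

```python
import math
import time

def servo_pulse(t, channel, centre=1500, swing=400):
    """Sine-wave servo pulse width (in microseconds) at time t.

    Giving the second servo of each x/y pair a 90-degree phase offset
    makes the tentacle tip trace a slow circle, so it appears to move
    in both directions at once, seemingly of its own accord.
    """
    phase = (math.pi / 2) * (channel % 2)  # x servo vs. y servo of a pair
    return centre + swing * math.sin(t + phase)

if __name__ == "__main__":
    t = 0.0
    while True:
        for channel in range(8):           # eight servos, four tentacles
            pulse = servo_pulse(t, channel)
            # pwm.set_pwm(channel, 0, int(pulse))  # I2C PWM board call here
        t += 0.1
        time.sleep(0.05)
```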
Iä! Iä! Cthulhu fhtagn!
You can read more about Derek’s steampunk tentacle hat here. He will be at the Beeston Raspberry Jam in November to show off his build, so if you’re in the Nottingham area, why not drop by?
Wearables for Halloween
This build is already pretty creepy, but just imagine it with a sensor- or camera-powered upgrade that makes the tentacles reach for people nearby. You’d have nightmare fodder for weeks.
With the help of the Raspberry Pi, any Halloween costume can be taken to the next level. How could Pi technology help you to win that coveted ‘Scariest costume’ prize this year? Tell us your ideas in the comments, and be sure to share pictures of you in your get-up with us on Twitter, Facebook, or Instagram.
For fun, Eunice Lee, Matthew Zhang, and Bomani McClendon have worked together to create Waves, an audiovisual project that records people’s spoken responses to personal questions and prints them in the form of a sound wave as a gift for being truthful.
Waves is a Raspberry Pi project centered around transforming the transience of the spoken word into something concrete and physical. In our setup, a user presses a button corresponding to an intimate question (ex: what’s your motto?) and answers it into a microphone while pressing down on the button.
What are you grateful for?
“I’m grateful for finishing this project,” admits maker Eunice Lee as she presses a button and speaks into the microphone that is part of the Waves project build. After a brief moment, her confession appears on receipt paper as a waveform, and she grins toward the camera, happy with the final piece.
Sound wave machine
Alongside a Raspberry Pi 3, the Waves device comprises four tactile buttons, a standard USB microphone, and a thermal receipt printer. This type of printer has become easily available to makers from suppliers such as Adafruit and Pimoroni.
Definitely more fun than a polygraph test
The trio designed four colour-coded cards that represent four questions, each of which has a matching button on the breadboard. Press the button that belongs to the question to be answered, and Python code directs the Pi to record audio via the microphone. Releasing the button stops the audio recording. “Once the recording has been saved, the script viz.py is launched,” explains Lee. “This script takes the audio file and, using Python matplotlib magic, turns it into a nice little waveform image.”
From there, the Raspberry Pi instructs the thermal printer to produce a printout of the sound wave image along with the question.
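A rough stand-in for that step (our reconstruction, not the actual viz.py — the bucket-peak approach and every name here are assumptions): read the recorded 16-bit PCM WAV, reduce it to per-bucket peak amplitudes, and plot those as the familiar bar-style waveform.

```python
import struct
import wave

def wav_peaks(path, buckets=200):
    """Normalised (0..1) peak amplitude per bucket of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as w:
        n = w.getnframes() * w.getnchannels()
        samples = struct.unpack(f"<{n}h", w.readframes(w.getnframes()))
    step = max(1, len(samples) // buckets)
    peaks = [max(abs(s) for s in samples[i:i + step])
             for i in range(0, len(samples), step)]
    top = max(peaks) or 1
    return [p / top for p in peaks]

def plot_waveform(wav_path, out_png):
    """Render the peaks as a bar chart and save the image for printing."""
    import matplotlib                  # third-party, as used in the project
    matplotlib.use("Agg")              # headless Pi: draw straight to a file
    import matplotlib.pyplot as plt
    peaks = wav_peaks(wav_path)
    plt.bar(range(len(peaks)), peaks, width=1.0, color="black")
    plt.axis("off")
    plt.savefig(out_png, bbox_inches="tight")
    plt.close()
```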
Making for fun
Eunice, Bomani, and Matt, students of design and computer science at Northwestern University in Illinois, built Waves as a side project. They wanted to make something at the intersection of art and technology and were motivated by the pure joy of creating.
Making makes people happy
They have noted improvements that can be made to increase the scope of their sound wave project. We hope to see many more interesting builds from these three, and in the meantime we invite you all to look up their code on Eunice’s GitHub to create your own Waves at home.
Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to try and reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment strategies that we’ve made on the Route 53 team and how these choices affect the availability of our service.
To begin, I’ll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service’s availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production, and the production stage itself. The natural tension in trying to reduce the number of failed production deployments is that doing so tends to add rigidity and processes that slow down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production.

Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a “wave” and consists of some portion of our fleet. A pipeline is a good abstraction because each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified — this can include automatic, scheduled, and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of a production deployment at least 24 hours apart in order to allow the changes to “bake.” Letting our software bake means rolling out changes slowly so that we can validate them and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that a failed deployment impacts just the portion of the fleet containing the new code.
Another benefit of our deployment infrastructure is that it provides a mechanism to quickly “roll back” a deployment to a previous software version if any problems are detected, which in many cases enables us to quickly mitigate a failed deployment.
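A minimal sketch of that roll-back mechanism, under the assumption that the deployment system tracks the previously running version per host (the class and names here are hypothetical):

```python
class HostFleet:
    """Tracks current and previous software versions so a bad deploy can be reverted."""

    def __init__(self, hosts, version):
        self.current = {h: version for h in hosts}

    def deploy(self, version, healthy):
        """Deploy a new version everywhere; roll back if post-deploy checks fail."""
        previous = dict(self.current)
        self.current = {h: version for h in self.current}
        if not healthy:
            self.current = previous     # quick mitigation: revert to the prior version

fleet = HostFleet(["h1", "h2"], "v1")
fleet.deploy("v2", healthy=False)       # the failed deployment is rolled back
```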
Based on our experience, we have further organized our deployments to match our failure conditions and reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of the deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and our DNS change propagation system. Essentially, this is the part of the system that accepts a customer request to create or delete a DNS record and then transmits that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers, which are responsible for answering DNS queries on behalf of our customers; these servers currently reside in more than 50 locations around the world. Each of these components has its own set of failure conditions, and the two differ in how a failed deployment impacts customers. Further, a failure of one component may not impact the other. For example, an API outage in which customers are unable to create new hosted zones or records has no impact on the data plane, which continues to answer queries for all records created prior to the outage. Given their distinct failure conditions, the control plane and the data plane have their own deployment strategies, each discussed in more detail below.
Control Plane Deployments
The bulk of the control plane actually consists of two APIs. The first is our external API, which is reachable from the Internet and is the entry point for customers to create, delete, and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just those needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers, as well as other operations needed to operate and monitor the service. A failed deployment to the external API typically impacts a customer’s ability to view or modify their DNS records. The availability of this API is critical because our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.
Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each deployment wave consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for the hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in that host being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected.

A similar approach could be applied by services that use Route 53 plus ELB across multiple regions and availability zones. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see the ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration enables automatic failover at the DNS level for an unhealthy region or availability zone. To enable manual failover, convert the alias resource record set for your ELB to a weighted alias or associate it with a health check whose health you control; to initiate a failover, set the weight to 0 or fail the health check.
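The weighted-alias setup described above can be expressed as a Route 53 ChangeBatch, the JSON shape used by the ChangeResourceRecordSets API. The record name, ELB DNS name, and hosted zone ID below are example values; with boto3 you would pass the resulting dict to `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`.

```python
def weighted_alias_change(name, elb_dns, elb_zone_id, set_id, weight):
    """Build an UPSERT for a weighted alias record; weight=0 fails traffic away."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": set_id,   # distinguishes records in the weighted set
                "Weight": weight,
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,
                    "DNSName": elb_dns,
                    # Fail away from this alias if all instances behind the ELB
                    # are unhealthy:
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    }

# To initiate a manual failover for one region, set its weight to 0
# (all identifiers below are example values):
batch = weighted_alias_change("www.example.com.",
                              "my-elb.us-east-1.elb.amazonaws.com.",
                              "Z35SXDOTRQ7X7K", "us-east-1", 0)
```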
A weighted alias also gives you the ability to slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.
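Because Route 53 weights are relative, shifting weight between an "old" and a "new" record out of a fixed total yields a predictable traffic split. A hedged sketch of such a ramp schedule (the step percentages are illustrative, not a recommendation):

```python
def ramp_schedule(total=100, steps=(1, 5, 25, 50, 100)):
    """Yield (new_weight, old_weight) pairs for a gradual traffic shift."""
    return [(s, total - s) for s in steps]

schedule = ramp_schedule()
# The first step sends roughly 1% of traffic to the new back end; the last
# step sends all of it. Between steps you would verify your service metrics.
```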
For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (the batch fleet), (2) all other operations used to operate and monitor the service (the operations fleet), and (3) all customer operations (the customer fleet). Deployments to the production internal API are then ordered by how critical each fleet’s traffic is to the service as a whole. For instance, the batch fleet is deployed to first because its operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below the customer fleet, because we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to let us verify that code changes work in a split world where multiple versions of the software are running simultaneously; we have found this useful for catching incompatibilities between software versions. Since we never deploy software to 100% of our production fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed one host at a time. For this API, we also include a deep application health check as part of the deployment. As with the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.
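The ordering rule for the internal API fleets reduces to "least critical first, customer traffic last." A small sketch of that idea, with illustrative criticality scores:

```python
# Lower score = less critical = deployed to earlier (scores are illustrative).
CRITICALITY = {"batch": 1, "operations": 2, "customer": 3}

def deployment_order(fleets):
    """Order fleets so the least critical receive new code first."""
    return sorted(fleets, key=lambda f: CRITICALITY[f])

order = deployment_order(["customer", "batch", "operations"])
# A failed deployment caught in the batch or operations fleet never reaches
# the fleet serving customer traffic.
```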
Data Plane Deployments
As mentioned earlier, our data plane consists of Route 53’s DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an “edge location”). An important consideration for our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belongs to a “stripe” (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is announced by roughly one quarter of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API are important, the availability and correctness of our data plane are even more critical: in this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified because that answer may be cached by intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully, both to prevent failed deployments and to reduce potential impact.
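An illustrative model of this striping may help: each hosted zone gets one delegation name server per stripe, each edge location announces a single stripe, and BGP directs a query to the closest announcing edge for that stripe. All names, stripes, and latencies below are made up.

```python
STRIPES = ["a", "b", "c", "d"]

def delegation_set(zone_id):
    """One delegation name server per stripe (names are hypothetical)."""
    return [f"ns-{zone_id}.stripe-{s}.example" for s in STRIPES]

def closest_per_stripe(edge_latency, edge_stripe):
    """For each stripe, pick the closest (lowest-latency) announcing edge,
    approximating how BGP routes a resolver's query."""
    best = {}
    for edge, ms in edge_latency.items():
        s = edge_stripe[edge]
        if s not in best or ms < edge_latency[best[s]]:
            best[s] = edge
    return best

# Two stripes and four edges, from one resolver's point of view:
edge_stripe = {"iad": "a", "sea": "a", "nrt": "b", "dub": "b"}
edge_latency = {"iad": 10, "sea": 70, "nrt": 120, "dub": 90}
targets = closest_per_stripe(edge_latency, edge_stripe)
# Queries for stripe "a" land at iad; queries for stripe "b" land at dub.
```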
The safest way to deploy and minimize impact would be to deploy to a single edge location at a time. However, with manual deployments that are overseen by a developer, this approach simply does not scale given how frequently we deploy new software to over 50 locations (with more added each year). Thus, most of our production deployment waves consist of multiple locations; the one exception is our first wave, which includes just a single location. Furthermore, that location is specifically chosen because it runs our oldest hardware, which gives us quick notification of any unintended performance degradation. It is important to note that while resolver caching behavior can cause issues if we serve an incorrect answer, resolvers handle other types of failures well. When a recursive resolver receives a query for a record that is not cached, it will typically issue queries to at least three of the four delegation name servers in parallel and use the first response it receives. Thus, in the event that one of our locations is black holing customer queries (i.e., not replying to DNS queries), the resolver should receive a response from one of the other delegation name servers. In this case, the only impact is to resolvers for which the edge location that is not answering would have been the fastest responder; those resolvers will effectively be waiting for the response from the second-fastest stripe. To take advantage of this resiliency, our other waves are organized so that they include edge locations that are geographically diverse, with the intent that for any single resolver there will be nearby locations that are not included in the current deployment wave. Furthermore, to guarantee that at most a single name server for each customer is affected, waves are actually organized by stripe. Finally, each stripe is spread across multiple waves so that a failure impacts only a single name server for a portion of our customers.
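The invariant at the heart of this organization can be checked mechanically: a wave is safe if every edge location in it announces the same stripe, so a failed wave can affect at most one of a customer's four delegation name servers. A sketch with made-up locations and stripe names:

```python
def wave_is_safe(wave, stripe_of):
    """A wave is safe if all of its edge locations announce a single stripe."""
    return len({stripe_of[loc] for loc in wave}) == 1

stripe_of = {"iad": "blue", "sea": "blue", "nrt": "orange", "dub": "orange"}

safe = wave_is_safe(["iad", "sea"], stripe_of)    # one stripe: at most one
                                                  # name server affected
unsafe = wave_is_safe(["iad", "nrt"], stripe_of)  # mixes stripes: two of the
                                                  # four name servers at risk
```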
An example of this strategy is depicted below. A few notes: first, our staging environment consists of a much smaller number of edge locations than production, so single-location waves are possible there. Second, each stripe is denoted by color; in this example, we see deployments spread across a blue stripe and an orange stripe. You, too, can organize your deployment strategy around your failure conditions. For example, if you have a database schema used by both your production system and a warehousing system, deploy the change to the warehousing system first to ensure you haven’t broken compatibility; you may catch problems in the warehousing system before they affect customer traffic.
Our team’s experience operating Route 53 over the last 3+ years has highlighted the importance of reducing the impact of failed deployments. Over the years, we have identified some of the common failure conditions and organized our software deployments so that we increase the ease of mitigation while decreasing the potential impact to our customers.
– Nick Trebon