# Conundrum

Post Syndicated from Eevee original https://eev.ee/blog/2018/03/20/conundrum/

Here’s a problem I’m having. Or, rather, a problem I’m solving, but so slowly that I wonder if I’m going about it very inefficiently.

I intended to just make a huge image out of this and tweet it, but it takes so much text to explain that I might as well put it on my internet website.

## The setup

I want to do pathfinding through a Doom map. The ultimate goal is to be able to automatically determine the path the player needs to take to reach the exit — what switches to hit in what order, what keys to get, etc.

Doom maps are 2D planes cut into arbitrary shapes. Everything outside a shape is ｔｈｅ　ｖｏｉｄ, which we don’t care about. Here are some shapes.

The shapes are defined implicitly by their edges. All of the edges touching the red area, for example, say that they’re red on one side.

That’s very nice, because it means I don’t have to do any geometry to detect which areas touch each other. I can tell at a glance that the red and blue areas touch, because the line between them says it’s red on one side and blue on the other.

Unfortunately, this doesn’t seem to be all that useful. The player can’t necessarily move from the red area to the blue area, because there’s a skinny bottleneck. If the yellow area were a raised platform, the player couldn’t fit through the gap. Worse, if there’s a switch somewhere that lowers that platform, then the gap is conditionally passable.

I thought this would be uncommon enough that I could get started only looking at neighbors and do actual geometry later, but that “conditionally passable” pattern shows up all the time in the form of locked “bars” that let you peek between or around them. So I might as well just do the dang geometry.

The player is a 32×32 square and always axis-aligned (i.e., the hitbox doesn’t actually rotate). That’s very convenient, because it means I can “dilate the world” — expand all the walls by 16 units in both directions, while shrinking the player to a single point. That expansion eliminates narrow gaps and leaves a map of everywhere the player’s center is allowed to be. Allegedly this is how Quake did collision detection — but in 3D! How hard can it be in 2D?

The plan, then, is to do this:

This creates a bit of an unholy mess. (I could avoid some of the overlap by being clever at points where exactly two lines touch, but I have to deal with a ton of overlap anyway so I’m not sure if that buys anything.)

The gray outlines are dilations of inner walls, where both sides touch a shape. The black outlines are dilations of outer walls, touching ｔｈｅ　ｖｏｉｄ on one side. This map tells me that the player’s center can never go within 16 units of an outer wall, which checks out — their hitbox would get in the way! So I can delete all that stuff completely.

Consider that bottom-left outline, where red and yellow touch horizontally. If the player is in the red area, they can only enter that outlined part if they’re also allowed to be in the yellow area. Once they’re inside it, though, they can move around freely. I’ll color that piece orange, and similarly blend colors for the other outlines. (A small sliver at the top requires access to all three areas, so I colored it gray, because I can’t be bothered to figure out how to do a stripe pattern in Inkscape.)

This is the final map, and it’s easy to traverse because it works like a graph! Each contiguous region is a node, and each border is an edge. Some of the edges are one-way (falling off a ledge) or conditional (walking through a door), but the player can move freely within a region, so I don’t need to care about world geometry any more.

## The problem

I’m having a hell of a time doing this mass-intersection of a big pile of shapes.

I’m writing this in Rust, and I would very very very strongly prefer not to wrap a C library (or, god forbid, a C++ library), because that will considerably complicate actually releasing this dang software. Unfortunately, that also limits my options rather a lot.

I was referred to a paper (A simple algorithm for Boolean operations on polygons, Martínez et al, 2013) that describes doing a Boolean operation (union, intersection, difference, xor) on two shapes, and works even with self-intersections and holes and whatnot.

I spent an inordinate amount of time porting its reference implementation from very bad C++ to moderately bad Rust, and I extended it to work with an arbitrary number of polygons and to spit out all resulting shapes. It has been a very bumpy ride, and I keep hitting walls — the latest is that it panics when intersecting everything results in two distinct but exactly coincident edges, which obviously happens a lot with this approach.

So the question is: is there some better way to do this that I’m overlooking, or should I just keep fiddling with this algorithm and hope I come out the other side with something that works?

Bear in mind, the input shapes are not necessarily convex, and quite frequently aren’t. Also, they can have holes, and quite frequently do. That rules out most common algorithms. It’s probably possible to triangulate everything, but I’m a little wary of cutting the map into even more microscopic shards; feel free to convince me otherwise.

Also, the map format technically allows absolutely any arbitrary combination of lines, so all of these are possible:

It would be nice to handle these gracefully somehow, or at least not crash on them. But they’re usually total nonsense as far as the game is concerned. But also that middle one does show up in the original stock maps a couple times.

Another common trick is that lines might be part of the same shape on both sides:

The left example suggests that such a line is redundant and can simply be ignored without changing anything. The right example shows why this is a problem.

A common trick in vanilla Doom is the so-called self-referencing sector. Here, the edges of the inner yellow square all claim to be yellow — on both sides. The outer edges all claim to be blue only on the inside, as normal. The yellow square therefore doesn’t neighbor the blue square at all, because no edges that are yellow on one side and blue on the other. The effect in-game is that the yellow area is invisible, but still solid, so it can be used as an invisible bridge or invisible pit for various effects.

This does raise the question of exactly how Doom itself handles all these edge cases. Vanilla maps are preprocessed by a node builder and split into subsectors, which are all convex polygons. So for any given weird trick or broken geometry, the answer to “how does this behave” is: however the node builder deals with it.

Subsectors are built right into vanilla maps, so I could use those. The drawback is that they’re optional for maps targeting ZDoom (and maybe other ports as well?), because ZDoom has its own internal node builder. Also, relying on built nodes in general would make this code less useful for map editing, or generating, or whatever.

ZDoom’s node builder is open source, so I could bake it in? Or port it to Rust? (It’s only, ah, ten times bigger than the shape algorithm I ported.) It’d be interesting to have a fairly-correct reflection of how the game sees broken geometry, which is something no map editor really tries to do. Is it fast enough? Running it on the largest map I know to exist (MAP14 of Sunder) takes 1.4 seconds, which seems like a long time, but also that’s from scratch, and maybe it could be adapted to work incrementally…? Christ.

I’m not sure I have the time to dedicate to flesh this out beyond a proof of concept anyway, so maybe this is all moot. But all the more reason to avoid spending a lot of time on dead ends.

# A printing GIF camera? Is that even a thing?

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/printing-gif-camera/

Abhishek Singh’s printing GIF camera uses two Raspberry Pis, the Model 3 and the Zero W, to take animated images and display them on an ejectable secondary screen.

#### Instagif – A DIY Camera that prints GIFs instantly

I built a camera that snaps a GIF and ejects a little cartridge so you can hold a moving photo in your hand! I’m calling it the “Instagif NextStep”.

## The humble GIF

Created in 1987, Graphics Interchange Format files, better known as GIFs, have somewhat taken over the internet. And whether you pronounce it G-IF or J-IF, you’ve probably used at least one to express an emotion, animate images on your screen, or create small, movie-like memories of events.

In 2004, all patents on the humble GIF expired, which added to the increased usage of the file format. And by the early 2010s, sites such as giphy.com and phone-based GIF keyboards were introduced into our day-to-day lives.

Welcome to the age of the GIF

## Polaroid cameras

Polaroid cameras have a somewhat older history. While the first documented instant camera came into existence in 1923, commercial iterations made their way to market in the 1940s, with Polaroid’s model 95 Land Camera.

In recent years, the instant camera has come back into fashion, with camera stores and high street fashion retailers alike stocking their shelves with pastel-coloured, affordable models. But nothing beats the iconic look of the Polaroid Spirit series, and the rainbow colour stripe that separates it from its competitors.

Shake it like a Polaroid picture…

And if you’re one of our younger readers and find yourself wondering where else you’ve seen those stripes, you’re probably more familiar with previous versions of the Instagram logo, because, well…

I’m sorry for the comment on the previous image. It was just too easy.

## Abhishek Singh’s printing GIF camera

Abhishek labels his creation the Instagif NextStep, and cites his inspiration for the project as simply wanting to give it a go, and to see if he could hold a ‘moving photo’.

“What I love about these kinds of projects is that they involve a bunch of different skill sets and disciplines”, he explains at the start of his lengthy, highly GIFed and wonderfully detailed imugr tutorial. “Hardware, software, 3D modeling, 3D printing, circuit design, mechanical/electrical engineering, design, fabrication etc. that need to be integrated for it to work seamlessly. Ironically, this is also what I hate about these kinds of projects”

Care to see how the whole thing comes together? Well, in the true spirit of the project, Abhishek created this handy step-by-step GIF.

#### Piecing it together

I thought I’ll start off with the entire assembly and then break down the different elements. As you can see, everything is assembled from the base up in layers helping in easy assembly and quick disassembly for troubleshooting

The build comes in two parts – the main camera housing a Raspberry Pi 3 and Camera Module V2, and the ejectable cartridge fitted with Raspberry Pi Zero W and Adafruit PiTFT screen.

When the capture button is pressed, the camera takes 3 seconds’ worth of images and converts them into .gif format via a Python script. Once compressed and complete, the Pi 3 sends the file to the Zero W via a network connection. When it is satisfied that the Zero W has the image, the Pi 3 automatically ejects the ‘printed GIF’ cartridge, and the image is displayed.

For a full breakdown of code, 3D-printable files, and images, check out the full imgur post. You can see more of Abhishek’s work at his website here.

## Create GIFs with a Raspberry Pi

Want to create GIFs with your Raspberry Pi? Of course you do. Who wouldn’t? So check out our free time-lapse animations resource. As with all our learning resources, the project is free for you to use at home and in your clubs or classrooms. And once you’ve mastered the art of Pi-based GIF creation, why not incorporate it into another project? Say, a motion-detecting security camera or an on-the-go tweeting GIF camera – the possibilities are endless.

And make sure you check out Abhishek’s other Raspberry Pi GIF project, Peeqo, who we covered previously in the blog. So cute. SO CUTE.

The post A printing GIF camera? Is that even a thing? appeared first on Raspberry Pi.

# Top 10 Most Obvious Hacks of All Time (v0.9)

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/top-10-most-obvious-hacks-of-all-time.html

For teaching hacking/cybersecurity, I thought I’d create of the most obvious hacks of all time. Not the best hacks, the most sophisticated hacks, or the hacks with the biggest impact, but the most obvious hacks — ones that even the least knowledgeable among us should be able to understand. Below I propose some hacks that fit this bill, though in no particular order.

The reason I’m writing this is that my niece wants me to teach her some hacking. I thought I’d start with the obvious stuff first.

If you use the same password for every website, and one of those websites gets hacked, then the hacker has your password for all your websites. The reason your Facebook account got hacked wasn’t because of anything Facebook did, but because you used the same email-address and password when creating an account on “beagleforums.com”, which got hacked last year.

I’ve heard people say “I’m sure, because I choose a complex password and use it everywhere”. No, this is the very worst thing you can do. Sure, you can the use the same password on all sites you don’t care much about, but for Facebook, your email account, and your bank, you should have a unique password, so that when other sites get hacked, your important sites are secure.

And yes, it’s okay to write down your passwords on paper.

Tools: HaveIBeenPwned.com

### PIN encrypted PDFs

My accountant emails PDF statements encrypted with the last 4 digits of my Social Security Number. This is not encryption — a 4 digit number has only 10,000 combinations, and a hacker can guess all of them in seconds.
PIN numbers for ATM cards work because ATM machines are online, and the machine can reject your card after four guesses. PIN numbers don’t work for documents, because they are offline — the hacker has a copy of the document on their own machine, disconnected from the Internet, and can continue making bad guesses with no restrictions.
Passwords protecting documents must be long enough that even trillion upon trillion guesses are insufficient to guess.

Tools: Hashcat, John the Ripper

### SQL and other injection

The lazy way of combining websites with databases is to combine user input with an SQL statement. This combines code with data, so the obvious consequence is that hackers can craft data to mess with the code.
No, this isn’t obvious to the general public, but it should be obvious to programmers. The moment you write code that adds unfiltered user-input to an SQL statement, the consequence should be obvious. Yet, “SQL injection” has remained one of the most effective hacks for the last 15 years because somehow programmers don’t understand the consequence.
CGI shell injection is a similar issue. Back in early days, when “CGI scripts” were a thing, it was really important, but these days, not so much, so I just included it with SQL. The consequence of executing shell code should’ve been obvious, but weirdly, it wasn’t. The IT guy at the company I worked for back in the late 1990s came to me and asked “this guy says we have a vulnerability, is he full of shit?”, and I had to answer “no, he’s right — obviously so”.

XSS (“Cross Site Scripting”) [*] is another injection issue, but this time at somebody’s web browser rather than a server. It works because websites will echo back what is sent to them. For example, if you search for Cross Site Scripting with the URL https://www.google.com/search?q=cross+site+scripting, then you’ll get a page back from the server that contains that string. If the string is JavaScript code rather than text, then some servers (thought not Google) send back the code in the page in a way that it’ll be executed. This is most often used to hack somebody’s account: you send them an email or tweet a link, and when they click on it, the JavaScript gives control of the account to the hacker.

Cross site injection issues like this should probably be their own category, but I’m including it here for now.

More: Wikipedia on SQL injection, Wikipedia on cross site scripting.
Tools: Burpsuite, SQLmap

### Buffer overflows

In the C programming language, programmers first create a buffer, then read input into it. If input is long than the buffer, then it overflows. The extra bytes overwrite other parts of the program, letting the hacker run code.
Again, it’s not a thing the general public is expected to know about, but is instead something C programmers should be expected to understand. They should know that it’s up to them to check the length and stop reading input before it overflows the buffer, that there’s no language feature that takes care of this for them.
We are three decades after the first major buffer overflow exploits, so there is no excuse for C programmers not to understand this issue.

What makes particular obvious is the way they are wrapped in exploits, like in Metasploit. While the bug itself is obvious that it’s a bug, actually exploiting it can take some very non-obvious skill. However, once that exploit is written, any trained monkey can press a button and run the exploit. That’s where we get the insult “script kiddie” from — referring to wannabe-hackers who never learn enough to write their own exploits, but who spend a lot of time running the exploit scripts written by better hackers than they.

More: Wikipedia on buffer overflow, Wikipedia on script kiddie,  “Smashing The Stack For Fun And Profit” — Phrack (1996)
Tools: bash, Metasploit

### SendMail DEBUG command (historical)

The first popular email server in the 1980s was called “SendMail”. It had a feature whereby if you send a “DEBUG” command to it, it would execute any code following the command. The consequence of this was obvious — hackers could (and did) upload code to take control of the server. This was used in the Morris Worm of 1988. Most Internet machines of the day ran SendMail, so the worm spread fast infecting most machines.
This bug was mostly ignored at the time. It was thought of as a theoretical problem, that might only rarely be used to hack a system. Part of the motivation of the Morris Worm was to demonstrate that such problems was to demonstrate the consequences — consequences that should’ve been obvious but somehow were rejected by everyone.

More: Wikipedia on Morris Worm

I’m conflicted whether I should add this or not, because here’s the deal: you are supposed to click on attachments and links within emails. That’s what they are there for. The difference between good and bad attachments/links is not obvious. Indeed, easy-to-use email systems makes detecting the difference harder.
On the other hand, the consequences of bad attachments/links is obvious. That worms like ILOVEYOU spread so easily is because people trusted attachments coming from their friends, and ran them.
We have no solution to the problem of bad email attachments and links. Viruses and phishing are pervasive problems. Yet, we know why they exist.

### Default and backdoor passwords

The Mirai botnet was caused by surveillance-cameras having default and backdoor passwords, and being exposed to the Internet without a firewall. The consequence should be obvious: people will discover the passwords and use them to take control of the bots.
Surveillance-cameras have the problem that they are usually exposed to the public, and can’t be reached without a ladder — often a really tall ladder. Therefore, you don’t want a button consumers can press to reset to factory defaults. You want a remote way to reset them. Therefore, they put backdoor passwords to do the reset. Such passwords are easy for hackers to reverse-engineer, and hence, take control of millions of cameras across the Internet.
The same reasoning applies to “default” passwords. Many users will not change the defaults, leaving a ton of devices hackers can hack.

### Masscan and background radiation of the Internet

I’ve written a tool that can easily scan the entire Internet in a short period of time. It surprises people that this possible, but it obvious from the numbers. Internet addresses are only 32-bits long, or roughly 4 billion combinations. A fast Internet link can easily handle 1 million packets-per-second, so the entire Internet can be scanned in 4000 seconds, little more than an hour. It’s basic math.
Because it’s so easy, many people do it. If you monitor your Internet link, you’ll see a steady trickle of packets coming in from all over the Internet, especially Russia and China, from hackers scanning the Internet for things they can hack.
People’s reaction to this scanning is weirdly emotional, taking is personally, such as:
1. Why are they hacking me? What did I do to them?
2. Great! They are hacking me! That must mean I’m important!
3. Grrr! How dare they?! How can I hack them back for some retribution!?

I find this odd, because obviously such scanning isn’t personal, the hackers have no idea who you are.

Tools: masscan, firewalls

### Packet-sniffing, sidejacking

If you connect to the Starbucks WiFi, a hacker nearby can easily eavesdrop on your network traffic, because it’s not encrypted. Windows even warns you about this, in case you weren’t sure.

At DefCon, they have a “Wall of Sheep”, where they show passwords from people who logged onto stuff using the insecure “DefCon-Open” network. Calling them “sheep” for not grasping this basic fact that unencrypted traffic is unencrypted.

To be fair, it’s actually non-obvious to many people. Even if the WiFi itself is not encrypted, SSL traffic is. They expect their services to be encrypted, without them having to worry about it. And in fact, most are, especially Google, Facebook, Twitter, Apple, and other major services that won’t allow you to log in anymore without encryption.

But many services (especially old ones) may not be encrypted. Unless users check and verify them carefully, they’ll happily expose passwords.

What’s interesting about this was 10 years ago, when most services which only used SSL to encrypt the passwords, but then used unencrypted connections after that, using “cookies”. This allowed the cookies to be sniffed and stolen, allowing other people to share the login session. I used this on stage at BlackHat to connect to somebody’s GMail session. Google, and other major websites, fixed this soon after. But it should never have been a problem — because the sidejacking of cookies should have been obvious.

Tools: Wireshark, dsniff

### Stuxnet LNK vulnerability

Again, this issue isn’t obvious to the public, but it should’ve been obvious to anybody who knew how Windows works.
When Windows loads a .dll, it first calls the function DllMain(). A Windows link file (.lnk) can load icons/graphics from the resources in a .dll file. It does this by loading the .dll file, thus calling DllMain. Thus, a hacker could put on a USB drive a .lnk file pointing to a .dll file, and thus, cause arbitrary code execution as soon as a user inserted a drive.
I say this is obvious because I did this, created .lnks that pointed to .dlls, but without hostile DllMain code. The consequence should’ve been obvious to me, but I totally missed the connection. We all missed the connection, for decades.

### Social Engineering and Tech Support [* * *]

After posting this, many people have pointed out “social engineering”, especially of “tech support”. This probably should be up near #1 in terms of obviousness.

The classic example of social engineering is when you call tech support and tell them you’ve lost your password, and they reset it for you with minimum of questions proving who you are. For example, you set the volume on your computer really loud and play the sound of a crying baby in the background and appear to be a bit frazzled and incoherent, which explains why you aren’t answering the questions they are asking. They, understanding your predicament as a new parent, will go the extra mile in helping you, resetting “your” password.

One of the interesting consequences is how it affects domain names (DNS). It’s quite easy in many cases to call up the registrar and convince them to transfer a domain name. This has been used in lots of hacks. It’s really hard to defend against. If a registrar charges only \$9/year for a domain name, then it really can’t afford to provide very good tech support — or very secure tech support — to prevent this sort of hack.

Social engineering is such a huge problem, and obvious problem, that it’s outside the scope of this document. Just google it to find example after example.

A related issue that perhaps deserves it’s own section is OSINT [*], or “open-source intelligence”, where you gather public information about a target. For example, on the day the bank manager is out on vacation (which you got from their Facebook post) you show up and claim to be a bank auditor, and are shown into their office where you grab their backup tapes. (We’ve actually done this).

More: Wikipedia on Social Engineering, Wikipedia on OSINT, “How I Won the Defcon Social Engineering CTF” — blogpost (2011), “Questioning 42: Where’s the Engineering in Social Engineering of Namespace Compromises” — BSidesLV talk (2016)

### Blue-boxes (historical) [*]

Telephones historically used what we call “in-band signaling”. That’s why when you dial on an old phone, it makes sounds — those sounds are sent no differently than the way your voice is sent. Thus, it was possible to make tone generators to do things other than simply dial calls. Early hackers (in the 1970s) would make tone-generators called “blue-boxes” and “black-boxes” to make free long distance calls, for example.

These days, “signaling” and “voice” are digitized, then sent as separate channels or “bands”. This is call “out-of-band signaling”. You can’t trick the phone system by generating tones. When your iPhone makes sounds when you dial, it’s entirely for you benefit and has nothing to do with how it signals the cell tower to make a call.

Early hackers, like the founders of Apple, are famous for having started their careers making such “boxes” for tricking the phone system. The problem was obvious back in the day, which is why as the phone system moves from analog to digital, the problem was fixed.

More: Wikipedia on blue box, Wikipedia article on Steve Wozniak.

### Thumb drives in parking lots [*]

A simple trick is to put a virus on a USB flash drive, and drop it in a parking lot. Somebody is bound to notice it, stick it in their computer, and open the file.

This can be extended with tricks. For example, you can put a file labeled “third-quarter-salaries.xlsx” on the drive that required macros to be run in order to open. It’s irresistible to other employees who want to know what their peers are being paid, so they’ll bypass any warning prompts in order to see the data.

Another example is to go online and get custom USB sticks made printed with the logo of the target company, making them seem more trustworthy.

We also did a trick of taking an Adobe Flash game “Punch the Monkey” and replaced the monkey with a logo of a competitor of our target. They now only played the game (infecting themselves with our virus), but gave to others inside the company to play, infecting others, including the CEO.

Thumb drives like this have been used in many incidents, such as Russians hacking military headquarters in Afghanistan. It’s really hard to defend against.

More: “Computer Virus Hits U.S. Military Base in Afghanistan” — USNews (2008), “The Return of the Worm That Ate The Pentagon” — Wired (2011), DoD Bans Flash Drives — Stripes (2008)

### Googling [*]

Search engines like Google will index your website — your entire website. Frequently companies put things on their website without much protection because they are nearly impossible for users to find. But Google finds them, then indexes them, causing them to pop up with innocent searches.
There are books written on “Google hacking” explaining what search terms to look for, like “not for public release”, in order to find such documents.

More: Wikipedia entry on Google Hacking, “Google Hacking” book.

### URL editing [*]

At the top of every browser is what’s called the “URL”. You can change it. Thus, if you see a URL that looks like this:

http://www.example.com/documents?id=138493

Then you can edit it to see the next document on the server:

http://www.example.com/documents?id=138494

The owner of the website may think they are secure, because nothing points to this document, so the Google search won’t find it. But that doesn’t stop a user from manually editing the URL.
An example of this is a big Fortune 500 company that posts the quarterly results to the website an hour before the official announcement. Simply editing the URL from previous financial announcements allows hackers to find the document, then buy/sell the stock as appropriate in order to make a lot of money.
Another example is the classic case of Andrew “Weev” Auernheimer who did this trick in order to download the account email addresses of early owners of the iPad, including movie stars and members of the Obama administration. It’s an interesting legal case because on one hand, techies consider this so obvious as to not be “hacking”. On the other hand, non-techies, especially judges and prosecutors, believe this to be obviously “hacking”.

### DDoS, spoofing, and amplification [*]

For decades now, online gamers have figured out an easy way to win: just flood the opponent with Internet traffic, slowing their network connection. This is called a DoS, which stands for “Denial of Service”. DoSing game competitors is often a teenager’s first foray into hacking.
A variant of this is when you hack a bunch of other machines on the Internet, then command them to flood your target. (The hacked machines are often called a “botnet”, a network of robot computers). This is called DDoS, or “Distributed DoS”. At this point, it gets quite serious, as instead of competitive gamers hackers can take down entire businesses. Extortion scams, DDoSing websites then demanding payment to stop, is a common way hackers earn money.
Another form of DDoS is “amplification”. Sometimes when you send a packet to a machine on the Internet it’ll respond with a much larger response, either a very large packet or many packets. The hacker can then send a packet to many of these sites, “spoofing” or forging the IP address of the victim. This causes all those sites to then flood the victim with traffic. Thus, with a small amount of outbound traffic, the hacker can flood the inbound traffic of the victim.
This is one of those things that has worked for 20 years, because it’s so obvious teenagers can do it, yet there is no obvious solution. President Trump’s executive order of cyberspace specifically demanded that his government come up with a report on how to address this, but it’s unlikely that they’ll come up with any useful strategy.

More: Wikipedia on DDoS, Wikipedia on Spoofing

### Conclusion

Tweet me (@ErrataRob) your obvious hacks, so I can add them to the list.

# GnuPG funding campaign

Post Syndicated from ris original https://lwn.net/Articles/724721/rss

The GnuPG Project has announced the launch of a funding campaign to further
support and improve its mail and data encryption software, GnuPG.
The 6 person development team is currently financed from a
successful campaign in early 2015, regular donations from the Linux
Foundation, Stripe, Facebook, and a few paid development projects. To
ensure long-term stability the new campaign focuses on recurring donations
and not one-time donations.

# Operating OpenStack at Scale

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/159795571841

By James Penick, Cloud Architect & Gurpreet Kaur, Product Manager

A version of this byline was originally written for and appears in CIO Review.

A successful private cloud presents a consistent and reliable facade over the complexities of hyperscale infrastructure. It must simultaneously handle constant organic traffic growth, unanticipated spikes, a multitude of hardware vendors, and discordant customer demands. The depth of this complexity only increases with the age of the business, leaving a private cloud operator saddled with legacy hardware, old network infrastructure, customers dependent on legacy operating systems, and the list goes on. These are the foundations of the horror stories told by grizzled operators around the campfire.

Providing a plethora of services globally for over a billion active users requires a hyperscale infrastructure. Yahoo’s on-premises infrastructure is comprised of datacenters housing hundreds of thousands of physical and virtual compute resources globally, connected via a multi-terabit network backbone. As one of the very first hyperscale internet companies in the world, Yahoo’s infrastructure had grown organically – things were built, and rebuilt, as the company learned and grew. The resulting web of modern and legacy infrastructure became progressively more difficult to manage. Initial attempts to manage this via IaaS (Infrastructure-as-a-Service) taught some hard lessons. However, those lessons served us well when OpenStack was selected to manage Yahoo’s datacenters, some of which are shared below.

Centralized team offering Infrastructure-as-a-Service

Chief amongst the lessons learned prior to OpenStack was that IaaS must be presented as a core service to the whole organization by a dedicated team. An a-la-carte-IaaS, where each user is expected to manage their own control plane and inventory, just isn’t sustainable at scale. Multiple teams tackling the same challenges involved in the curation of software, deployment, upkeep, and security within an organization is not just a duplication of effort; it removes the opportunity for improved synergy with all levels of the business. The first OpenStack cluster, with a centralized dedicated developer and service engineering team, went live in June 2012.  This model has served us well and has been a crucial piece of making OpenStack succeed at Yahoo. One of the biggest advantages to a centralized, core team is the ability to collaborate with the foundational teams upon which any business is built: Supply chain, Datacenter Site-Operations, Finance, and finally our customers, the engineering teams. Building a close relationship with these vital parts of the business provides the ability to streamline the process of scaling inventory and presenting on-demand infrastructure to the company.

Developers love instant access to compute resources

Our developer productivity clusters, named “OpenHouse,” were a huge hit. Ideation and experimentation are core to developers’ DNA at Yahoo. It empowers our engineers to innovate, prototype, develop, and quickly iterate on ideas. No longer is a developer reliant on a static and costly development machine under their desk. OpenHouse enables developer agility and cost savings by obviating the desktop.

Dynamic infrastructure empowers agile products

From a humble beginning of a single, small OpenStack cluster, Yahoo’s OpenStack footprint is growing beyond 100,000 VM instances globally, with our single largest virtual machine cluster running over a thousand compute nodes, without using Nova Cells.

Until this point, Yahoo’s production footprint was nearly 100% focused on baremetal – a part of the business that one cannot simply ignore. In 2013, Yahoo OpenStack Baremetal began to manage all new compute deployments. Interestingly, after moving to a common API to provision baremetal and virtual machines, there was a marked increase in demand for virtual machines.

Developers across all major business units ranging from Yahoo Mail, Video, News, Finance, Sports and many more, were thrilled with getting instant access to compute resources to hit the ground running on their projects. Today, the OpenStack team is continuing to fully migrate the business to OpenStack-managed. Our baremetal footprint is well beyond that of our VMs, with over 100,000 baremetal instances provisioned by OpenStack Nova via Ironic.

How did Yahoo hit this scale?

Scaling OpenStack begins with understanding how its various components work and how they communicate with one another. This topic can be very deep and for the sake of brevity, we’ll hit the high points.

1. Start at the bottom and think about the underlying hardware

Do not overlook the unique resource constraints for the services which power your cloud, nor the fashion in which those services are to be used. Leverage that understanding to drive hardware selection. For example, when one examines the role of the database server in an OpenStack cluster, and considers the multitudinous calls to the database: compute node heartbeats, instance state changes, normal user operations, and so on; they would conclude this core component is extremely busy in even a modest-sized Nova cluster, and in need of adequate computational resources to perform. Yet many deployers skimp on the hardware. The performance of the whole cluster is bottlenecked by the DB I/O. By thinking ahead you can save yourself a lot of heartburn later on.

2. Think about how things communicate

Our cluster databases are configured to be multi-master single-writer with automated failover. Control plane services have been modified to split DB reads directly to the read slaves and only write to the write-master. This distributes load across the database servers.

3. Scale wide

OpenStack has many small horizontally-scalable components which can peacefully cohabitate on the same machines: the Nova, Keystone, and Glance APIs, for example. Stripe these across several small or modest hardware. Some services, such as the Nova scheduler, run the risk of race conditions when running multi-active. If the risk of race conditions is unacceptable, use ZooKeeper to manage leader election.

4. Remove dependencies

In a Yahoo datacenter, DHCP is only used to provision baremetal servers. By statically declaring IPs in our instances via cloud-init, our infrastructure is less prone to outage from a failure in the DHCP infrastructure.

5. Don’t be afraid to replace things

Neutron used Dnsmasq to provide DHCP services, however it was not designed to address the complexity or scale of a dynamic environment. For example, Dnsmasq must be restarted for any config change, such as when a new host is being provisioned.  In the Yahoo OpenStack clusters this has been replaced by ISC-DHCPD, which scales far better than Dnsmasq and allows dynamic configuration updates via an API.

6. Or split them apart

Some of the core imaging services provided by Ironic, such as DHCP, TFTP, and HTTPS communicate with a host during the provisioning process. These services are normally  part of the Ironic Conductor (IC) service. In our environment we split these services into a new and physically-distinct service called the Ironic Transport Service (ITS). This brings value by:

• Adding security: Splitting the ITS from the IC allows us to block all network traffic from production compute nodes to the IC, and other parts of our control plane. If a malicious entity attacks a node serving production traffic, they cannot escalate from it  to our control plane.
• Scale: The ITS hosts allow us to horizontally scale the core provisioning services with which nodes communicate.
• Flexibility: ITS allows Yahoo to manage remote sites, such as peering points, without building a new cluster in that site. Resources in those sites can now be managed by the nearest Yahoo owned & operated (O&O) datacenter, without needing to build a whole cluster in each site.

Be prepared for faulty hardware!

Running IaaS reliably at hyperscale is more than just scaling the control plane. One must take a holistic look at the system and consider everything. In fact, when examining provisioning failures, our engineers determined the majority root cause was faulty hardware. For example, there are a number of machines from varying vendors whose IPMI firmware fails from time to time, leaving the host inaccessible to remote power management. Some fail within minutes or weeks of installation. These failures occur on many different models, across many generations, and across many hardware vendors. Exposing these failures to users would create a very negative experience, and the cloud must be built to tolerate this complexity.

Focus on the end state

Yahoo’s experience shows that one can run OpenStack at hyperscale, leveraging it to wrap infrastructure and remove perceived complexity. Correctly leveraged, OpenStack presents an easy, consistent, and error-free interface. Delivering this interface is core to our design philosophy as Yahoo continues to double down on our OpenStack investment. The Yahoo OpenStack team looks forward to continue collaborating with the OpenStack community to share feedback and code.

# Now Available – I3 Instances for Demanding, I/O Intensive Applications

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/

On the first day of AWS re:Invent I published an EC2 Instance Update and promised to share additional information with you as soon as I had it.

Today I am happy to be able to let you know that we are making six sizes of our new I3 instances available in fifteen AWS regions! Designed for I/O intensive workloads and equipped with super-efficient NVMe SSD storage, these instances can deliver up to 3.3 million IOPS at a 4 KB block and up to 16 GB/second of sequential disk throughput. This makes them a great fit for any workload that requires high throughput and low latency including relational databases, NoSQL databases, search engines, data warehouses, real-time analytics, and disk-based caches. When compared to the I2 instances, I3 instances deliver storage that is less expensive and more dense, with the ability to deliver substantially more IOPS and more network bandwidth per CPU core.

The Specs
Here are the instance sizes and the associated specs:

 Instance Name vCPU Count Memory Instance Storage (NVMe SSD) Price/Hour i3.large 2 15.25 GiB 0.475 TB \$0.15 i3.xlarge 4 30.5 GiB 0.950 TB \$0.31 i3.2xlarge 8 61 GiB 1.9 TB \$0.62 i3.4xlarge 16 122 GiB 3.8 TB (2 disks) \$1.25 i3.8xlarge 32 244 GiB 7.6 TB (4 disks) \$2.50 i3.16xlarge 64 488 GiB 15.2 TB (8 disks) \$4.99

The prices shown are for On-Demand instances in the US East (Northern Virginia) Region; see the EC2 pricing page for more information.

I3 instances are available in On-Demand, Reserved, and Spot form in the US East (Northern Virginia), US West (Oregon), US West (Northern California), US East (Ohio), Canada (Central), South America (São Paulo), EU (Ireland), EU (London), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai), Asia Pacific (Sydney), and AWS GovCloud (US) Regions. You can also use them as Dedicated Hosts and as Dedicated Instances.

These instances support Hardware Virtualization (HVM) AMIs only, and must be run within a Virtual Private Cloud. In order to benefit from the performance made possible by the NVMe storage, you must run one of the following operating systems:

• Amazon Linux AMI
• RHEL – 6.5 or better
• CentOS – 7.0 or better
• Ubuntu – 16.04 or 16.10
• SUSE 12
• SUSE 11 with SP3
• Windows Server 2008 R2, 2012 R2, and 2016

The I3 instances offer up to 8 NVMe SSDs. In order to achieve the best possible throughput and to get as many IOPS as possible, you can stripe multiple volumes together, or spread the I/O workload across them in another way.

Each vCPU (Virtual CPU) is a hardware hyperthread on an Intel E5-2686 v4 (Broadwell) processor running at 2.3 GHz. The processor supports the AVX2 instructions, along with Turbo Boost and NUMA.

Go For Launch
The I3 instances are available today in fifteen AWS regions and you can start to use them right now.

Jeff;

# Kosovo’s First Pi Wars

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/kosovos-first-pi-wars/

British Engineer Andy Moxon recently contacted us to highlight a Pi Wars event he was organising in Kosovo.

I write to inform you about an event I am running akin to Pi Wars, here in the newly independent country of Kosovo, South-East Europe.

I am a British engineer and have been living in Kosovo as a volunteer for the last two and a half years.  For the past eight months I have been working with two groups of twelve to fifteen year old students in a club we have called ‘Young Innovators’.  It is an after-school club centred around the Raspberry Pi.  We have mainly focused on physical computing, with the aim of building Raspberry Pi powered robots, similar to those that compete in Pi Wars.

Eager to see the outcome of the event, Liz asked if he would write a blog post for us and, being the lovely chap he is, Andy agreed. We think Mike and Tim, creators of the original Pi Wars, will be thrilled to see this.

Here’s his rundown of the successful event:

Many people are confused about the country of Kosovo, and there’s much that could be written here to rectify this – perhaps most important to us is the fact that it declared independence from Serbia just eight years ago. However, even more importantly (for this blog at least!), the country is not without Python coding, physical computing, robots and a good number of Raspberry Pis.

Since the start of 2016, I’ve been running an after-school club called ‘Young Innovators’, diving into the world of the Raspberry Pi, to prepare for our (much smaller) version of Pi Wars, happening this December. Based in the small town of Shtime, the club aims to bring to life maths and physics, while also teaching the students programming and robotics.

In one sense our robots are pretty standard. A single Raspberry Pi Zero is powered by a thin mobile phone power bank, and four AA batteries power two motors via a L293D motor-controller chip. At the front, we have a HC-SR04 ultrasonic distance sensor and two infra-red line sensors underneath. Additionally, we use two additional infra-red sensors to count wheel revolutions, having painted white stripes on our wheels using nail polish! This opens the robots up to some interesting autonomous challenges, such as the three-point turn, which was included in the last Pi Wars competition.

An area which has caused a lot of excitement in the club has been the recent introduction of an Ultimaker 2+ 3D printer.  Using FreeCAD (available for the Pi2 and above) we have designed the chassis of the robots from nothing. This has been a tough but worthwhile exercise, demonstrating the wonders of 3D prototyping.

At the time of writing, the robots have been screwed together and the electronics connected. We’re now in the thick of programming using Pygame (now integral to Python), preparing our eight robots for the battle.

Big thanks must go to the Raspberry Pi blogging community.  I first used a Raspberry Pi just a year ago and, without the dedication of excellent bloggers, we would never have been able to reach this stage.

You can follow our progress on our blog: www.younginnovators-ks.com

See? Told you he was a lovely chap!

The post Kosovo’s First Pi Wars appeared first on Raspberry Pi.

# Amazon Aurora Update – PostgreSQL Compatibility

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-aurora-update-postgresql-compatibility/

Just two years ago (it seems like yesterday), I introduced you to Amazon Aurora in my post Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS. In that post I told you how the RDS team took a fresh, unconstrained look at the relational database model and explained how they built a relational database for the cloud.

The feedback that we have received from our customers since then has been heart-warming.  Customers love the MySQL compatibility, the focus on high availability, and the built-in encryption. They count on the fact that Aurora is built around fault-tolerant, self-healing storage that allows them to scale from 10 GB all the way up to 64 TB without pre-provisioning. They know that Aurora makes six copies of their data across three Availability Zones and backs it up to Amazon Simple Storage Service (S3) without impacting performance or availability. As they scale, they know that they can create up to 15 low-latency read replicas that draw from common storage. To learn more about how our customers are using Aurora in world-scale production environments, take some time to read our Amazon Aurora Testimonials.

Of course, customers are always asking for more, and we do our best to understand their needs and to oblige. Here is a look back at some recent updates that were made in response to specific feedback from customers:

And Now, PostgreSQL Compatibility
In addition to the feature-level feedback, we received many requests for additional database compatibility. At the top of the list was compatibility with PostgreSQL. This open source database has been under continuous development for 20 years and has found a home in many enterprises and startups. Customers like the enterprise features (similar to those offered by SQL Server and Oracle), performance benefits, and the geospatial objects associated with PostgreSQL.  They would love to have access to these capabilities while also taking advantage of all that Aurora has to offer.

Today we are launching a preview of Amazon Aurora PostgreSQL-Compatible Edition. It offers all of the benefits that I listed above, including high durability, high availability, and the ability to quickly create and deploy read replicas. Here are some of the things you will love about it:

Performance – Aurora delivers up to 2x the performance of PostgreSQL running in traditional environments.

Compatibility – Aurora is fully compatible with the open source version of PostgreSQL (version 9.6.1). On the stored procedure side, we are planning to support Perl, pgSQL, Tcl, and JavaScript (via the V8 JavaScript engine). We are also planning to support all of the PostgreSQL features and extensions that are supported in Amazon RDS for PostgreSQL.

Cloud Native – Aurora takes full advantage of the fact that it is running within AWS. Here are some of the touch points:

Here’s how you access all of this from the RDS Console. You start by selecting the PostgresSQL Compatible option:

Then you choose your database instance type, decide on Multi-AZ deployment, name your database instance, and set up a user name & password:

We are making a preview of PostgreSQL compatibility for Amazon Aurora available in the US East (Northern Virginia) Region today and you can sign up now for access!

A Quick Comparison
My colleagues David Wein and Grant McAlister ran some tests that compared the performance of PostgreSQL compatibility for Amazon Aurora against PostgreSQL 9.6.1. The database servers were run on m4.16xlarge instances and the test clients were run on c4.8xlarge instances.

PostgreSQL was run using 45K of Provisioned IOPS storage consisting of three 15K IOPS EBS volumes striped into one logical volume, topped off with an ext4 file system. They enabled WAL compression and aggressive autovacuum, both of which improve the performance of PostgreSQL on the workloads that they tested.

David & Grant ran the standard PostgreSQL pgbench benchmarking tool. They used a scaling factor of 2000 which creates a 30 GiB database and uses several different client counts. Each data point ran for one hour, with the database recreated before each run. The graph below shows the results:

David also shared the final seconds of one of his runs:

``````progress: 3597.0 s, 39048.4 tps, lat 26.075 ms stddev 9.883
progress: 3598.0 s, 38047.7 tps, lat 26.959 ms stddev 10.197
progress: 3599.0 s, 38111.1 tps, lat 27.009 ms stddev 10.257
progress: 3600.0 s, 34371.7 tps, lat 29.363 ms stddev 14.468
transaction type:
scaling factor: 2000
query mode: prepared
number of clients: 1024
number of threads: 1024
duration: 3600 s
number of transactions actually processed: 137508938
latency average = 26.800 ms
latency stddev = 19.222 ms
tps = 38192.805529 (including connections establishing)
tps = 38201.099738 (excluding connections establishing)

``````

They also shared a per-second throughput graph that covered the last 40 minutes of a similar run:

As you can see, Amazon Aurora delivered higher throughput than PostgreSQL, with about 1/3 of the jitter (standard deviations of 1395 TPS and 5081 TPS, respectively).

David and Grant are now collecting data for a more detailed post that they plan to publish in early 2017.

Coming Soon – Performance Insights
We are also working on a new tool that is designed to help you to understand database performance at a very detailed level. You will be able to look inside of each query and learn more about how your database handles it. Here’s a sneak preview screen shot:

You will be able to access the new Performance Insights as part of the preview. I’ll have more details and a full tour later.

Jeff;

# Bringing the Viewer In: The Video Opportunity in Virtual Reality

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/151940036881

By Satender Saroha, Video Engineering

Virtual reality (VR) 360° videos are the next frontier of how we engage with and consume content. Unlike a traditional scenario in which a person views a screen in front of them, VR places the user inside an immersive experience. A viewer is “in” the story, and not on the sidelines as an observer.

Ivan Sutherland, widely regarded as the father of computer graphics, laid out the vision for virtual reality in his famous speech, “Ultimate Display” in 1965 [1]. In that he said, “You shouldn’t think of a computer screen as a way to display information, but rather as a window into a virtual world that could eventually look real, sound real, move real, interact real, and feel real.”

Over the years, significant advancements have been made to bring reality closer to that vision. With the advent of headgear capable of rendering 3D spatial audio and video, realistic sound and visuals can be virtually reproduced, delivering immersive experiences to consumers.

When it comes to entertainment and sports, streaming in VR has become the new 4K HEVC/UHD of 2016. This has been accelerated by the release of new camera capture hardware like GoPro and streaming capabilities such as 360° video streaming from Facebook and YouTube. Yahoo streams lots of engaging sports, finance, news, and entertainment video content to tens of millions of users. The opportunity to produce and stream such content in 360° VR opens a unique opportunity to Yahoo to offer new types of engagement, and bring the users a sense of depth and visceral presence.

While this is not an experience that is live in product, it is an area we are actively exploring. In this blog post, we take a look at what’s involved in building an end-to-end VR streaming workflow for both Live and Video on Demand (VOD). Our experiments and research goes from camera rig setup, to video stitching, to encoding, to the eventual rendering of videos on video players on desktop and VR headsets. We also discuss challenges yet to be solved and the opportunities they present in streaming VR.

1. The Workflow

Yahoo’s video platform has a workflow that is used internally to enable streaming to an audience of tens of millions with the click of a few buttons. During experimentation, we enhanced this same proven platform and set of APIs to build a complete 360°/VR experience. The diagram below shows the end-to-end workflow for streaming 360°/VR that we built on Yahoo’s video platform.

Figure 1: VR Streaming Workflow at Yahoo

1.1. Capturing 360° video

In order to capture a virtual reality video, you need access to a 360°-capable video camera. Such a camera uses either fish-eye lenses or has an array of wide-angle lenses to collectively cover a 360 (θ) by 180 (ϕ) sphere as shown below.

Though it sounds simple, there is a real challenge in capturing a scene in 3D 360° as most of the 360° video cameras offer only 2D 360° video capture.

In initial experiments, we tried capturing 3D video using two cameras side-by-side, for left and right eyes and arranging them in a spherical shape. However this required too many cameras – instead we use view interpolation in the stitching step to create virtual cameras.

Another important consideration with 360° video is the number of axes the camera is capturing video with. In traditional 360° video that is captured using only a single-axis (what we refer as horizontal video), a user can turn their head from left to right. But this setup of cameras does not support a user tilting their head at 90°.

To achieve true 3D in our setup, we went with 6-12 GoPro cameras having 120° field of view (FOV) arranged in a ring, and an additional camera each on top and bottom, with each one outputting 2.7K at 30 FPS.

1.2. Stitching 360° video

Projection Layouts

Because a 360° view is a spherical video, the surface of this sphere needs to be projected onto a planar surface in 2D so that video encoders can process it. There are two popular layouts:

Equirectangular layout: This is the most widely-used format in computer graphics to represent spherical surfaces in a rectangular form with an aspect ratio of 2:1. This format has redundant information at the poles which means some pixels are over-represented, introducing distortions at the poles compared to the equator (as can be seen in the equirectangular mapping of the sphere below).

Figure 2: Equirectangular Layout [2]

CubeMap layout: CubeMap layout is a format that has also been used in computer graphics. It contains six individual 2D textures that map to six sides of a cube. The figure below is a typical cubemap representation. In a cubemap layout, the sphere is projected onto six faces and the images are folded out into a 2D image, so pieces of a video frame map to different parts of a cube, which leads to extremely efficient compact packing. Cubemap layouts require about 25% fewer pixels compared to equirectangular layouts.

Figure 3: CubeMap Layout [3]

Stitching Videos

In our setup, we experimented with a couple of stitching softwares. One was from Vahana VR [4], and the other was a modified version of the open-source Surround360 technology that works with a GoPro rig [5]. Both softwares output equirectangular panoramas for the left and the right eye. Here are the steps involved in stitching together a 360° image:

Raw frame image processing: Converts uncompressed raw video data to RGB, which involves several steps starting from black-level adjustment, to applying Demosaic algorithms in order to figure out RGB color parts for each pixel based on the surrounding pixels. This also involves gamma correction, color correction, and anti vignetting (undoing the reduction in brightness on the image periphery). Finally, this stage applies sharpening and noise-reduction algorithms to enhance the image and suppress the noise.

Calibration: During the calibration step, stitching software takes steps to avoid vertical parallax while stitching overlapping portions in adjacent cameras in the rig. The purpose is to align everything in the scene, so that both eyes see every point at the same vertical coordinate. This step essentially matches the key points in images among adjacent camera pairs. It uses computer vision algorithms for feature detection like Binary Robust Invariant Scalable Keypoints (BRISK) [6] and AKAZE [7].

Optical Flow: During stitching, to cover the gaps between adjacent real cameras and provide interpolated view, optical flow is used to create virtual cameras. The optical flow algorithm finds the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or camera. It uses OpenCV algorithms to find the optical flow [8].

Below are the frames produced by the GoPro camera rig:

Figure 4: Individual frames from 12-camera rig

Figure 5: Stitched frame output with PtGui

Figure 6: Stitched frame with barrel distortion using Surround360

Figure 7: Stitched frame after removing barrel distortion using Surround360

To get the full depth in stereo, the rig is set-up so that i = r * sin(FOV/2 – 360/n). where:

• i = IPD/2 where IPD is the inter-pupillary distance between eyes.\
• r = Radius of the rig.
• FOV = Field of view of GoPro cameras, 120 degrees.
• n = Number of cameras which is 12 in our setup.

Given IPD is normally 6.4 cms, i should be greater than 3.2 cm. This implies that with a 12-camera setup, the radius of the the rig comes to 14 cm(s). Usually, if there are more cameras it is easier to avoid black stripes.

Reducing Bandwidth – FOV-based adaptive transcoding

For a truly immersive experience, users expect 4K (3840 x 2160) quality resolution at 60 frames per second (FPS) or higher. Given typical HMDs have a FOV of 120 degrees, a full 360° video needs a resolution of at least 12K (11520 x 6480). 4K streaming needs a bandwidth of 25 Mbps [9]. So for 12K resolution, this effectively translates to > 75 Mbps and even more for higher framerates. However, average wifi in US has bandwidth of 15 Mbps [10].

One way to address the bandwidth issue is by reducing the resolution of areas that are out of the field of view. Spatial sub-sampling is used during transcoding to produce multiple viewport-specific streams. Each viewport-specific stream has high resolution in a given viewport and low resolution in the rest of the sphere.

On the player side, we can modify traditional adaptive streaming logic to take into account field of view. Depending on the video, if the user moves his head around a lot, it could result in multiple buffer fetches and could result in rebuffering. Ideally, this will work best in videos where the excessive motion happens in one field of view at a time and does not span across multiple fields of view at the same time. This work is still in an experimental stage.

The default output format from stitching software of both Surround360 and Vahana VR is equirectangular format. In order to reduce the size further, we pass it through a cubemap filter transform integrated into ffmpeg to get an additional pixel reduction of ~25%  [11] [12].

At the end of above steps, the stitching pipeline produces high-resolution stereo 3D panoramas which are then ingested into the existing Yahoo Video transcoding pipeline to produce multiple bit-rates HLS streams.

1.3. Adding a stitching step to the encoding pipeline

Live – In order to prepare for multi-bitrate streaming over the Internet, a live 360° video-stitched stream in RTMP is ingested into Yahoo’s video platform. A live Elemental encoder was used to re-encode and package the live input into multiple bit-rates for adaptive streaming on any device (iOS, Android, Browser, Windows, Mac, etc.)

Video on Demand – The existing Yahoo video transcoding pipeline was used to package multiple bit-rates HLS streams from raw equirectangular mp4 source videos.

1.4. Rendering 360° video into the player

The spherical video stream is delivered to the Yahoo player in multiple bit rates. As a user changes their viewing angle, different portion of the frame are shown, presenting a 360° immersive experience. There are two types of VR players currently supported at Yahoo:

WebVR based Javascript Player – The Web community has been very active in enabling VR experiences natively without plugins from within browsers. The W3C has a Javascript proposal [13], which describes support for accessing virtual reality (VR) devices, including sensors and head-mounted displays on the Web. VR Display is the main starting point for all the device APIs supported. Some of the key interfaces and attributes exposed are:

• VR Display Capabilities: It has attributes to indicate position support, orientation support, and has external display.
• VR Layer: Contains the HTML5 canvas element which is presented by VR Display when its submit frame is called. It also contains attributes defining the left bound and right bound textures within source canvas for presenting to an eye.
• VREye Parameters: Has information required to correctly render a scene for given eye. For each eye, it has offset the distance from middle of the user’s eyes to the center point of one eye which is half of the interpupillary distance (IPD). In addition, it maintains the current FOV of the eye, and the recommended renderWidth and render Height of each eye viewport.
• Get VR Displays: Returns a list of VR Display(s) HMDs accessible to the browser.

We implemented a subset of webvr spec in the Yahoo player (not in production yet) that lets you watch monoscopic and stereoscopic 3D video on supported web browsers (Chrome, Firefox, Samsung), including Oculus Gear VR-enabled phones. The Yahoo player takes the equirectangular video and maps its individual frames on the Canvas javascript element. It uses the webGL and Three.JS libraries to do computations for detecting the orientation and extracting the corresponding frames to display.

For web devices which support only monoscopic rendering like desktop browsers without HMD, it creates a single Perspective Camera object specifying the FOV and aspect ratio. As the device’s requestAnimationFrame is called it renders the new frames. As part of rendering the frame, it first calculates the projection matrix for FOV and sets the X (user’s right), Y (Up), Z (behind the user) coordinates of the camera position.

For devices that support stereoscopic rendering like mobile phones from Samsung Gear, the webvr player creates two PerspectiveCamera objects, one for the left eye and one for the right eye. Each Perspective camera queries the VR device capabilities to get the eye parameters like FOV, renderWidth and render Height every time a frame needs to be rendered at the native refresh rate of HMD. The key difference between stereoscopic and monoscopic is the perceived sense of depth that the user experiences, as the video frames separated by an offset are rendered by separate canvas elements to each individual eye.

Cardboard VR – Google provides a VR sdk for both iOS and Android [14]. This simplifies common VR tasks like-lens distortion correction, spatial audio, head tracking, and stereoscopic side-by-side rendering. For iOS, we integrated Cardboard VR functionality into our Yahoo Video SDK, so that users can watch stereoscopic 3D videos on iOS using Google Cardboard.

2. Results

With all the pieces in place, and experimentation done, we were able to successfully do a 360° live streaming of an internal company-wide event.

Figure 8: 360° Live streaming of Yahoo internal event

In addition to demonstrating our live streaming capabilities, we are also experimenting with showing 360° VOD videos produced with a GoPro-based camera rig. Here is a screenshot of one of the 360° videos being played in the Yahoo player.

Figure 9: Yahoo Studios produced 360° VOD content in the Yahoo Player

3. Challenges and Opportunities

3.1. Enormous amounts of data

As we alluded to in the video processing section of this post, delivering 4K resolution videos for each eye for each FOV at a high frame-rate remains a challenge. While FOV-adaptive streaming does reduce the size by providing high resolution streams separately for each FOV, providing an impeccable 60 FPS or more viewing experience still requires a lot more data than the current internet pipes can handle. Some of the other possible options which we are closely paying attention to are:

Compression efficiency with HEVC and VP9 – New codecs like HEVC and VP9 have the potential to provide significant compression gains. HEVC open source codecs like x265 have shown a 40% compression performance gain compared to the currently ubiquitous H.264/AVC codec. LIkewise, a VP9 codec from Google has shown similar 40% compression performance gains. The key challenge is the hardware decoding support and the browser support. But with Apple and Microsoft very much behind HEVC and Firefox and Chrome already supporting VP9, we believe most browsers would support HEVC or VP9 within a year.

Using 10 bit color depth vs 8 bit color depth – Traditional monitors support 8 bpc (bits per channel) for displaying images. Given each pixel has 3 channels (RGB), 8 bpc maps to 256x256x256 color/luminosity combinations to represent 16 million colors. With 10 bit color depth, you have the potential to represent even more colors. But the biggest stated advantage of using 10 bit color depth is with respect to compression during encoding even if the source only uses 8 bits per channel. Both x264 and x265 codecs support 10 bit color depth, with ffmpeg already supporting encoding at 10 bit color depth.

3.2. Six degrees of freedom

With current camera rig workflows, users viewing the streams through HMD are able to achieve three degrees of Freedom (DoF) i.e., the ability to move up/down, clockwise/anti-clockwise, and swivel. But you still can’t get a different perspective when you move inside it i.e., move forward/backward. Until now, this true six DoF immersive VR experience has only been possible in CG VR games. In video streaming, LightField technology-based video cameras produced by Lytro are the first ones to capture light field volume data from all directions [15]. But Lightfield-based videos require an order of magnitude more data than traditional fixed FOV, fixed IPD, fixed lense camera rigs like GoPro. As bandwidth problems get resolved via better compressions and better networks, achieving true immersion should be possible.

4. Conclusion

VR streaming is an emerging medium and with the addition of 360° VR playback capability, Yahoo’s video platform provides us a great starting point to explore the opportunities in video with regard to virtual reality. As we continue to work to delight our users by showing immersive video content, we remain focused on optimizing the rendering of high-quality 4K content in our players. We’re looking at building FOV-based adaptive streaming capabilities and better compression during delivery. These capabilities, and the enhancement of our webvr player to play on more HMDs like HTC Vive and Oculus Rift, will set us on track to offer streaming capabilities across the entire spectrum. At the same time, we are keeping a close watch on advancements in supporting spatial audio experiences, as well as advancements in the ability to stream volumetric lightfield videos to achieve true six degrees of freedom, with the aim of realizing the true potential of VR.

Glossary – VR concepts:

VR – Virtual reality, commonly referred to as VR, is an immersive computer-simulated reality experience that places viewers inside an experience. It “transports” viewers from their physical reality into a closed virtual reality. VR usually requires a headset device that takes care of sights and sounds, while the most-involved experiences can include external motion tracking, and sensory inputs like touch and smell. For example, when you put on VR headgear you suddenly start feeling immersed in the sounds and sights of another universe, like the deck of the Star Trek Enterprise. Though you remain physically at your place, VR technology is designed to manipulate your senses in a manner that makes you truly feel as if you are on that ship, moving through the virtual environment and interacting with the crew.

360 degree video – A 360° video is created with a camera system that simultaneously records all 360 degrees of a scene. It is a flat equirectangular video projection that is morphed into a sphere for playback on a VR headset. A standard world map is an example of equirectangular projection, which maps the surface of the world (sphere) onto orthogonal coordinates.

Spatial Audio – Spatial audio gives the creator the ability to place sound around the user. Unlike traditional mono/stereo/surround audio, it responds to head rotation in sync with video. While listening to spatial audio content, the user receives a real-time binaural rendering of an audio stream [17].

FOV – A human can naturally see 170 degrees of viewable area (field of view). Most consumer grade head mounted displays HMD(s) like Oculus Rift and HTC Vive now display 90 degrees to 120 degrees.

Monoscopic video – A monoscopic video means that both eyes see a single flat image, or video file. A common camera setup involves six cameras filming six different fields of view. Stitching software is used to form a single equirectangular video. Max output resolution on 2D scopic videos on Gear VR is 3480×1920 at 30 frames per second.

Presence – Presence is a kind of immersion where the low-level systems of the brain are tricked to such an extent that they react just as they would to non-virtual stimuli.

Latency – It’s the time between when you move your head, and when you see physical updates on the screen. An acceptable latency is anywhere from 11 ms (for games) to 20 ms (for watching 360 vr videos).

Head Tracking – There are two forms:

• Positional tracking – movements and related translations of your body, eg: sway side to side.
• Traditional head tracking – left, right, up, down, roll like clock rotation.

References:

[1] Ultimate Display Speech as reminisced by Fred Brooks: http://www.roadtovr.com/fred-brooks-ivan-sutherlands-1965-ultimate-display-speech/

[2] Equirectangular Layout Image: https://www.flickr.com/photos/[email protected]/10111691364/

[3] CubeMap Layout: http://learnopengl.com/img/advanced/cubemaps_skybox.png

[4] Vahana VR: http://www.video-stitch.com/

[5] Surround360 Stitching software: https://github.com/facebook/Surround360

[6] Computer Vision Algorithm BRISK: https://www.robots.ox.ac.uk/~vgg/rg/papers/brisk.pdf

[7] Computer Vision Algorithm AKAZE: http://docs.opencv.org/3.0-beta/doc/tutorials/features2d/akaze_matching/akaze_matching.html

[8] Optical Flow: http://docs.opencv.org/trunk/d7/d8b/tutorial_py_lucas_kanade.html

[9] 4K connection speeds: https://help.netflix.com/en/node/306

[10] Average connection speeds in US: https://www.akamai.com/us/en/about/news/press/2016-press/akamai-releases-fourth-quarter-2015-state-of-the-internet-report.jsp

[11] CubeMap transform filter for ffmpeg: https://github.com/facebook/transform

[12] FFMPEG software: https://ffmpeg.org/

[13] WebVR Spec: https://w3c.github.io/webvr/

[15] Lytro LightField Volume for six DoF: https://www.lytro.com/press/releases/lytro-immerge-the-worlds-first-professional-light-field-solution-for-cinematic-vr

[16] 10 bit color depth: https://gist.github.com/l4n9th4n9/4459997

# AWS Week in Review – August 22, 2016

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-august-22-2016/

Here’s the first community-driven edition of the AWS Week in Review. In response to last week’s blog post (AWS Week in Review – Coming Back With Your Help!), 9 other contributors helped to make this post a reality. That’s a great start; let’s see if we can go for 20 this week.

New & Notable Open Source

New SlideShare Presentations

Upcoming Events

Help Wanted

Stay tuned for next week, and please consider helping to make this a community-driven effort!

Jeff;

# Pi 3 booting part I: USB mass storage boot beta

Post Syndicated from Gordon Hollingworth original https://www.raspberrypi.org/blog/pi-3-booting-part-i-usb-mass-storage-boot/

When we originally announced the Raspberry Pi 3, we announced that we’d implemented several new boot modes. The first of these is the USB mass storage boot mode, and we’ll explain a little bit about it in this post; stay tuned for the next part on booting over Ethernet tomorrow. We’ve also supplied a boot modes tutorial over on the Raspberry Pi documentation pages.

Note: the new boot modes are still in beta testing and use the “next” branch of the firmware. If you’re unsure about using the new boot modes, it’s probably best to wait until we release it fully.

How did we do this?

Inside the 2835/6/7 devices there’s a small boot ROM, which is an unchanging bit of code used to boot the device. It’s the boot ROM that can read files from SD cards and execute them. Previously, there were two boot modes: SD boot and USB device boot (used for booting the Compute Module). When the Pi is powered up or rebooted, it tries to talk to an attached SD card and looks for a file called bootcode.bin; if it finds it, then it loads it into memory and jumps to it. This piece of code then continues to load up the rest of the Pi system, such as the firmware and ARM kernel.

While squeezing in the Quad A53 processors, I spent a fair amount of time writing some new boot modes. If you’d like to get into a little more detail, there’s more information in the documentation. Needless to say, it’s not easy squeezing SD boot, eMMC boot, SPI boot, NAND flash, FAT filesystem, GUID and MBR partitions, USB device, USB host, Ethernet device, and mass storage device support into a mere 32kB.

What is a mass storage device?

The USB specification allows for a mass storage class which many devices implement, from the humble flash drive to USB attached hard drives. This includes micro SD readers, but generally it refers to anything you can plug into a computer’s USB port and use for file storage.

I’ve tried plugging in a flash drive before and it didn’t do anything. What’s wrong?

We haven’t enabled this boot mode by default, because we first wanted to check that it worked as expected. The boot modes are enabled in One-Time Programmable (OTP) memory, so you have to enable the boot mode on your Pi 3 first. This is done using a config.txt parameter.

Instructions for implementing the mass storage boot mode, and changing a suitable Raspbian image to boot from a flash drive, can be found here.

Are there any bugs / problems?

There are a couple of known issues:

1. Some flash drives power up too slowly. There are many spinning disk drives that don’t respond within the allotted two seconds. It’s possible to extend this timeout to five seconds, but there are devices that fail to respond within this period as well, such as the Verbatim PinStripe 64GB.
2. Some flash drives have a very specific protocol requirement that we don’t handle; as a result of this, we can’t talk to these drives correctly. An example of such a drive would be the Kingston Data Traveller 100 G3 32G.

These bugs exist due to the method used to develop the boot code and squeeze it into 32kB. It simply wasn’t possible to run comprehensive tests.

However, thanks to a thorough search of eBay and some rigorous testing by our awesome work experience student Henry Budden, we’ve found the following devices work perfectly well:

• Sandisk Cruzer Fit 16GB
• Sandisk Cruzer Blade 16Gb
• Samsung 32GB USB 3.0 drive
• MeCo 16GB USB 3.0

If you find some devices we haven’t been able to test, we’d be grateful if you’d let us know your results in the comments.

Will it be possible to boot a Pi 1 or Pi 2 using MSD?

Unfortunately not. The boot code is stored in the BCM2837 device only, so the Pi 1, Pi 2, and Pi Zero will all require SD cards.

However, I have been able to boot a Pi 1 and Pi 2 using a very special SD card that only contains the single file bootcode.bin. This is useful if you want to boot a Pi from USB, but don’t want the possible unreliability of an SD card. Don’t mount the SD card from Linux, and it will never get corrupted!

My MSD doesn’t work. Is there something else I can do to get it working?

If you can’t boot from the MSD, then there are some steps that you can take to diagnose the problem. Please note, though, this is very much still a work in progress:

• Format an SD card as FAT32
• Copy the current next branch bootcode.bin from GitHub onto the SD card
• Plug it into the Pi and try again

If this still doesn’t work, please open an issue in the firmware repository.

The post Pi 3 booting part I: USB mass storage boot beta appeared first on Raspberry Pi.

# Credential Stealing as an Attack Vector

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/05/credential_stea.html

Traditional computer security concerns itself with vulnerabilities. We employ antivirus software to detect malware that exploits vulnerabilities. We have automatic patching systems to fix vulnerabilities. We debate whether the FBI should be permitted to introduce vulnerabilities in our software so it can get access to systems with a warrant. This is all important, but what’s missing is a recognition that software vulnerabilities aren’t the most common attack vector: credential stealing is.

The most common way hackers of all stripes, from criminals to hacktivists to foreign governments, break into networks is by stealing and using a valid credential. Basically, they steal passwords, set up man-in-the-middle attacks to piggy-back on legitimate logins, or engage in cleverer attacks to masquerade as authorized users. It’s a more effective avenue of attack in many ways: it doesn’t involve finding a zero-day or unpatched vulnerability, there’s less chance of discovery, and it gives the attacker more flexibility in technique.

Rob Joyce, the head of the NSA’s Tailored Access Operations (TAO) group — basically the country’s chief hacker — gave a rare public talk at a conference in January. In essence, he said that zero-day vulnerabilities are overrated, and credential stealing is how he gets into networks: “A lot of people think that nation states are running their operations on zero days, but it’s not that common. For big corporate networks, persistence and focus will get you in without a zero day; there are so many more vectors that are easier, less risky, and more productive.”

This is true for us, and it’s also true for those attacking us. It’s how the Chinese hackers breached the Office of Personnel Management in 2015. The 2014 criminal attack against Target Corporation started when hackers stole the login credentials of the company’s HVAC vendor. Iranian hackers stole US login credentials. And the hacktivist that broke into the cyber-arms manufacturer Hacking Team and published pretty much every proprietary document from that company used stolen credentials.

As Joyce said, stealing a valid credential and using it to access a network is easier, less risky, and ultimately more productive than using an existing vulnerability, even a zero-day.

Our notions of defense need to adapt to this change. First, organizations need to beef up their authentication systems. There are lots of tricks that help here: two-factor authentication, one-time passwords, physical tokens, smartphone-based authentication, and so on. None of these is foolproof, but they all make credential stealing harder.

Second, organizations need to invest in breach detection and — most importantly — incident response. Credential-stealing attacks tend to bypass traditional IT security software. But attacks are complex and multi-step. Being able to detect them in process, and to respond quickly and effectively enough to kick attackers out and restore security, is essential to resilient network security today.

Vulnerabilities are still critical. Fixing vulnerabilities is still vital for security, and introducing new vulnerabilities into existing systems is still a disaster. But strong authentication and robust incident response are also critical. And an organization that skimps on these will find itself unable to keep its networks secure.

This essay originally appeared on Xconomy.

# A Case Study in Global Fault Isolation

Post Syndicated from AWS Architecture Blog original https://www.awsarchitectureblog.com/2014/05/a-case-study-in-global-fault-isolation.html

In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we'll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we'll discuss how you can employ some of these concepts in your own applications.

Overview of Anycast DNS

One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

[Blue circles represent edge locations; orange circles represent AWS regions]

For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe;' thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names one from each of stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.

DNS Resolution

Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers, on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

[Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.]

Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations. Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

For any given domain, in general, resolvers learn the lowest latency nameserver based upon the round trip time of the query (this technique is often called SRTT or smooth round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

Not all resolvers do this. Some choose randomly amongst the nameservers. Others may end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest RTT nameserver. For additional information, this presentation presents information on how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if a resolver fails to resolve against a particular nameserver.

The Latency-Availability Decision

Given the above resolver behavior, one option, for a DNS provider like Route 53, might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver choses, it will always go to the closest network location. However, we believe this provides a poor availability model.

Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years another serious risk has arisen: large-scale transit network congestion due to DDOS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in "stripe order"; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

Route 53's foremost goal is to always meet our promise of a 100% SLA for DNS queries – that all of our customers' DNS names should resolve all the time. Our customers also tell us that latency is next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availibility necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, our method has no impact on the minimum RTT, or fastest nameserver, and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-levels domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

Applications

You, too, can apply the same techniques in your own systems and applications. If your system isn't end-user facing, you could also consider utilizing multiple TLDs for resilience as well. Especially in the case where you control your own API and clients calling the API, there's no reason to place all your eggs in one TLD basket.

Another application of what we've discussed is minimizing downtime during failovers. For high availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down, but failover has not triggered yet. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record's TTL. To minimize this failover time, you could write your clients to behave similar to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fallback to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.

– Lee-Ming Zen

# A Case Study in Global Fault Isolation

Post Syndicated from Lee-Ming Zen original https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/

In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we’ll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we’ll discuss how you can employ some of these concepts in your own applications.

# Overview of Anycast DNS

One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

(Blue circles represent edge locations; orange circles represent AWS regions)

For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe;’ thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names one from each of stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.

# DNS Resolution

Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers, on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

(Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.)

Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations. Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

For any given domain, in general, resolvers learn the lowest latency nameserver based upon the round trip time of the query (this technique is often called SRTT or smooth round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

Not all resolvers do this. Some choose randomly amongst the nameservers. Others may end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest RTT nameserver. For additional information, this presentation presents information on how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if a resolver fails to resolve against a particular nameserver.

# The Latency-Availability Decision

Given the above resolver behavior, one option, for a DNS provider like Route 53, might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver choses, it will always go to the closest network location. However, we believe this provides a poor availability model.

Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years another serious risk has arisen: large-scale transit network congestion due to DDOS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in “stripe order”; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

Route 53’s foremost goal is to always meet our promise of a 100% SLA for DNS queries – that all of our customers’ DNS names should resolve all the time. Our customers also tell us that latency is next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availibility necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, our method has no impact on the minimum RTT, or fastest nameserver, and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-levels domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

# Applications

You, too, can apply the same techniques in your own systems and applications. If your system isn’t end-user facing, you could also consider utilizing multiple TLDs for resilience as well. Especially in the case where you control your own API and clients calling the API, there’s no reason to place all your eggs in one TLD basket.

Another application of what we’ve discussed is minimizing downtime during failovers. For high availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down, but failover has not triggered yet. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record’s TTL. To minimize this failover time, you could write your clients to behave similar to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fallback to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.

– Lee-Ming Zen

# Organizing Software Deployments to Match Failure Conditions

Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to try and reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment strategies that we’ve made on the Route 53 team and how these choices affect the availability of our service.

To begin, I’ll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide some context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service’s availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension with trying to reduce the number of failed deployments in production is to add more rigidity and processes that slow down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production. Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a “wave” and consists of some portion of our fleet. A pipeline is a good abstraction as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified — this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of production deployment at least 24 hours apart, in order to allow the changes to “bake.” Letting our software bake refers to rolling out software changes slowly to allow us to validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. Another benefit of our deployment infrastructure is that it provides us a mechanism to quickly “roll back” a deployment to a previous software version if any problems are detected which, in many cases, enables us to quickly mitigate a failed deployment.

Based on our experiences, we have further organized our deployments to try and match our failure conditions to further reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of our deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and DNS change propagation system. Essentially, this is the part of our system that accepts a customer request to create or delete a DNS record and then the transmission of that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own set of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct set of failure conditions, the control plane and data plane have their own deployment strategies, which are each discussed in more detail below.

# Control Plane Deployments

The bulk of the of the control plane actually consists of two APIs. The first is our external API that is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just the ones needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer’s ability to view or modify their DNS records. The availability of this API is critical as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for our hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in it being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected. A similar approach could be applied to services that utilize Route 53 plus ELB in multiple regions and availability zones for their services. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration will enable automatic failover at the DNS-level for an unhealthy region or availability zone. To enable manual failover, simply convert the alias resource record set for your ELB to either a weighted alias or associate it with a health check whose health you control. To initiate a failover, simply set the weight to 0 or fail the health check. A weighted alias also allows you the ability to slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.

For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (batch fleet), (2) all other operations used to operate and monitor the service (operations fleet) and (3) all customer operations (customer fleet). Deployments to the production internal API are then organized by how critical their traffic is to the service as a whole. For instance, the batch fleet is deployed to first because their operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below that of customer traffic as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split-world where multiple versions of the software are running simultaneously. We have found this to be useful in catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed with a single host at a time. For this API, we also include a deep application health check as part of the deployment. Similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.

# Data Plane Deployments

As mentioned earlier, our data plane consists of Route 53’s DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an ‘edge location’). An important consideration with our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belong to a “stripe” (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is therefore announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API is important, the availability and correctness of our data plane are even more critical. In this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully to both prevent failed deployments and to reduce potential impact.

# Conclusions

Our team’s experience with operating Route 53 over the last 3+ years have highlighted the importance of reducing the impact from failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way so that we increase the ease of mitigation while decreasing the potential impact to our customers.

– Nick Trebon

# Organizing Software Deployments to Match Failure Conditions

Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to try and reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment strategies that we've made on the Route 53 team and how these choices affect the availability of our service.

To begin, I'll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide some context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service's availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension with trying to reduce the number of failed deployments in production is to add more rigidity and processes that slow down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production. Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a "wave" and consists of some portion of our fleet. A pipeline is a good abstraction as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified — this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of production deployment at least 24 hours apart, in order to allow the changes to "bake." Letting our software bake refers to rolling out software changes slowly to allow us to validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. Another benefit of our deployment infrastructure is that it provides us a mechanism to quickly "roll back" a deployment to a previous software version if any problems are detected which, in many cases, enables us to quickly mitigate a failed deployment.

Based on our experiences, we have further organized our deployments to try and match our failure conditions to further reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of our deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and DNS change propagation system. Essentially, this is the part of our system that accepts a customer request to create or delete a DNS record and then the transmission of that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own set of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct set of failure conditions, the control plane and data plane have their own deployment strategies, which are each discussed in more detail below.

Control Plane Deployments

The bulk of the of the control plane actually consists of two APIs. The first is our external API that is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just the ones needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer's ability to view or modify their DNS records. The availability of this API is critical as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for our hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in it being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected. A similar approach could be applied to services that utilize Route 53 plus ELB in multiple regions and availability zones for their services. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration will enable automatic failover at the DNS-level for an unhealthy region or availability zone. To enable manual failover, simply convert the alias resource record set for your ELB to either a weighted alias or associate it with a health check whose health you control. To initiate a failover, simply set the weight to 0 or fail the health check. A weighted alias also allows you the ability to slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.

For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (batch fleet), (2) all other operations used to operate and monitor the service (operations fleet) and (3) all customer operations (customer fleet). Deployments to the production internal API are then organized by how critical their traffic is to the service as a whole. For instance, the batch fleet is deployed to first because their operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below that of customer traffic as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split-world where multiple versions of the software are running simultaneously. We have found this to be useful in catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed with a single host at a time. For this API, we also include a deep application health check as part of the deployment. Similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.

Data Plane Deployments

As mentioned earlier, our data plane consists of Route 53's DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an 'edge location'). An important consideration with our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belong to a "stripe" (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is therefore announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API is important, the availability and correctness of our data plane are even more critical. In this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully to both prevent failed deployments and to reduce potential impact.

Conclusions

Our team's experience with operating Route 53 over the last 3+ years have highlighted the importance of reducing the impact from failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way so that we increase the ease of mitigation while decreasing the potential impact to our customers.

– Nick Trebon

# systemd for Administrators, Part X

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/instances.html

Here’s the tenth installment
of
my ongoing series
on
systemd
for

Instantiated Services

Most services on Linux/Unix are singleton services: there’s
usually only one instance of Syslog, Postfix, or Apache running on a
specific system at the same time. On the other hand some select
services may run in multiple instances on the same host. For example,
an Internet service like the Dovecot IMAP service could run in
multiple instances on different IP ports or different local IP
addresses. A more common example that exists on all installations is
getty, the mini service that runs once for each TTY and
presents a login prompt on it. On most systems this service is
instantiated once for each of the first six virtual consoles
tty1 to tty6. On some servers depending on
administrator configuration or boot-time parameters an additional
getty is instantiated for a serial or virtualizer console. Another
common instantiated service in the systemd world is fsck, the
file system checker that is instantiated once for each block device
that needs to be checked. Finally, in systemd socket activated
per-connection services (think classic inetd!) are also implemented
via instantiated services: a new instance is created for each incoming
connection. In this installment I hope to explain a bit how systemd
implements instantiated services and how to take advantage of them as

If you followed the previous episodes of this series you are
probably aware that services in systemd are named according to the
pattern foobar.service, where foobar is an
identification string for the service, and .service simply a
fixed suffix that is identical for all service units. The definition files
for these services are searched for in /etc/systemd/system
and /lib/systemd/system (and possibly other directories) under this name. For
instantiated services this pattern is extended a bit: the service name becomes
[email protected] where foobar is the
common service identifier, and quux the instance
identifier. Example: [email protected] is the serial
getty service instantiated for ttyS2.

Service instances can be created dynamically as needed. Without
further configuration you may easily start a new getty on a serial
port simply by invoking a systemctl start command for the new
instance:

# systemctl start [email protected]

If a command like the above is run systemd will first look for a
unit configuration file by the exact name you requested. If this
service file is not found (and usually it isn’t if you use
instantiated services like this) then the instance id is removed from
the name and a unit configuration file by the resulting
template name searched. In other words, in the above example,
if the precise [email protected] unit file cannot
be found, [email protected] is loaded instead. This unit
template file will hence be common for all instances of this
service. For the serial getty we ship a template unit file in systemd
(/lib/systemd/system/[email protected]) that looks
something like this:

[Unit]
Description=Serial Getty on %I
BindTo=dev-%i.device
After=dev-%i.device systemd-user-sessions.service

[Service]
ExecStart=-/sbin/agetty -s %I 115200,38400,9600
Restart=always
RestartSec=0

(Note that the unit template file we actually ship along with
systemd for the serial gettys is a bit longer. If you are interested,
have a look at the actual
file
which includes additional directives for compatibility with
SysV, to clear the screen and remove previous users from the TTY
device. To keep things simple I have shortened the unit file to the
relevant lines here.)

This file looks mostly like any other unit file, with one
distinction: the specifiers %I and %i are used at
multiple locations. At unit load time %I and %i are
replaced by systemd with the instance identifier of the service. In
our example above, if a service is instantiated as
[email protected] the specifiers %I and
%i will be replaced by ttyUSB0. If you introspect
the instanciated unit with systemctl status
[email protected] you will see these replacements
having taken place:

\$ systemctl status [email protected]
[email protected] – Getty on ttyUSB0
Active: active (running) since Mon, 26 Sep 2011 04:20:44 +0200; 2s ago
Main PID: 5443 (agetty)
CGroup: name=systemd:/system/[email protected]/ttyUSB0
└ 5443 /sbin/agetty -s ttyUSB0 115200,38400,9600

And that is already the core idea of instantiated services in
systemd. As you can see systemd provides a very simple templating
system, which can be used to dynamically instantiate services as
needed. To make effective use of this, a few more notes:

You may instantiate these services on-the-fly in
.wants/ symbolic links in the file system. For example, to
make sure the serial getty on ttyUSB0 is started
automatically at every boot, create a symlink like this:

# ln -s /lib/systemd/system/[email protected] /etc/systemd/system/getty.target.wants/[email protected]

systemd will instantiate the symlinked unit file with the
instance name specified in the symlink name.

You cannot instantiate a unit template without specifying an
instance identifier. In other words systemctl start
[email protected] will necessarily fail since the instance
name was left unspecified.

Sometimes it is useful to opt-out of the generic template
for one specific instance. For these cases make use of the fact that
systemd always searches first for the full instance file name before
falling back to the template file name: make sure to place a unit file
under the fully instantiated name in /etc/systemd/system and
it will override the generic templated version for this specific
instance.

The unit file shown above uses %i at some places and
%I at others. You may wonder what the difference between
these specifiers are. %i is replaced by the exact characters
of the instance identifier. For %I on the other hand the
instance identifier is first passed through a simple unescaping
algorithm. In the case of a simple instance identifier like
ttyUSB0 there is no effective difference. However, if the
device name includes one or more slashes (“/”) this cannot be
part of a unit name (or Unix file name). Before such a device name can
be used as instance identifier it needs to be escaped so that “/”
becomes “-” and most other special characters (including “-“) are
replaced by “xAB” where AB is the ASCII code of the character in
hexadecimal notation[1]. Example: to refer to a USB serial port by its
bus path we want to use a port name like
serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0. The
escaped version of this name is
serial-byx2dpath-pcix2d0000:00:1d.0x2dusbx2d0:1.4:1.1x2dport0. %I
will then refer to former, %i to the latter. Effectively this
means %i is useful wherever it is necessary to refer to other
units, for example to express additional dependencies. On the other
hand %I is useful for usage in command lines, or inclusion in
pretty description strings. Let’s check how this looks with the above unit file:

# systemctl start ‘[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service’
# systemctl status ‘[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service’
[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service – Serial Getty on serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0
Active: active (running) since Mon, 26 Sep 2011 05:08:52 +0200; 1s ago
Main PID: 5788 (agetty)
CGroup: name=systemd:/system/[email protected]/serial-byx2dpath-pcix2d0000:00:1d.0x2dusbx2d0:1.4:1.1x2dport0
└ 5788 /sbin/agetty -s serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0 115200 38400 9600

As we can see the while the instance identifier is the escaped
string the command line and the description string actually use the
unescaped version, as expected.

(Side note: there are more specifiers available than just
%i and %I, and many of them are actually
available in all unit files, not just templates for service
instances. For more details see the man
page
which includes a full list and terse explanations.)

And at this point this shall be all for now. Stay tuned for a
follow-up article on how instantiated services are used for
inetd-style socket activation.

Footnotes

[1] Yupp, this escaping algorithm doesn’t really result in
particularly pretty escaped strings, but then again, most escaping
algorithms don’t help readability. The algorithm we used here is
inspired by what udev does in a similar case, with one change. In the
end, we had to pick something. If you’ll plan to comment on the
escaping algorithm please also mention where you live so that I can
come around and paint your bike shed yellow with blue stripes. Thanks!

# systemd for Administrators, Part X

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/instances.html

Here’s the tenth installment
of
my ongoing series
on
systemd
for

#### Instantiated Services

Most services on Linux/Unix are singleton services: there’s
usually only one instance of Syslog, Postfix, or Apache running on a
specific system at the same time. On the other hand some select
services may run in multiple instances on the same host. For example,
an Internet service like the Dovecot IMAP service could run in
multiple instances on different IP ports or different local IP
addresses. A more common example that exists on all installations is
getty, the mini service that runs once for each TTY and
presents a login prompt on it. On most systems this service is
instantiated once for each of the first six virtual consoles
tty1 to tty6. On some servers depending on
administrator configuration or boot-time parameters an additional
getty is instantiated for a serial or virtualizer console. Another
common instantiated service in the systemd world is fsck, the
file system checker that is instantiated once for each block device
that needs to be checked. Finally, in systemd socket activated
per-connection services (think classic inetd!) are also implemented
via instantiated services: a new instance is created for each incoming
connection. In this installment I hope to explain a bit how systemd
implements instantiated services and how to take advantage of them as

If you followed the previous episodes of this series you are
probably aware that services in systemd are named according to the
pattern foobar.service, where foobar is an
identification string for the service, and .service simply a
fixed suffix that is identical for all service units. The definition files
for these services are searched for in /etc/systemd/system
and /lib/systemd/system (and possibly other directories) under this name. For
instantiated services this pattern is extended a bit: the service name becomes
foobar@quux.service where foobar is the
common service identifier, and quux the instance
identifier. Example: [email protected] is the serial
getty service instantiated for ttyS2.

Service instances can be created dynamically as needed. Without
further configuration you may easily start a new getty on a serial
port simply by invoking a systemctl start command for the new
instance:

`# systemctl start [email protected]`

If a command like the above is run systemd will first look for a
unit configuration file by the exact name you requested. If this
service file is not found (and usually it isn’t if you use
instantiated services like this) then the instance id is removed from
the name and a unit configuration file by the resulting
template name searched. In other words, in the above example,
if the precise [email protected] unit file cannot
be found, [email protected] is loaded instead. This unit
template file will hence be common for all instances of this
service. For the serial getty we ship a template unit file in systemd
(/lib/systemd/system/[email protected]) that looks
something like this:

```[Unit]
Description=Serial Getty on %I
BindTo=dev-%i.device
After=dev-%i.device systemd-user-sessions.service

[Service]
ExecStart=-/sbin/agetty -s %I 115200,38400,9600
Restart=always
RestartSec=0
```

(Note that the unit template file we actually ship along with
systemd for the serial gettys is a bit longer. If you are interested,
have a look at the actual
file
which includes additional directives for compatibility with
SysV, to clear the screen and remove previous users from the TTY
device. To keep things simple I have shortened the unit file to the
relevant lines here.)

This file looks mostly like any other unit file, with one
distinction: the specifiers %I and %i are used at
multiple locations. At unit load time %I and %i are
replaced by systemd with the instance identifier of the service. In
our example above, if a service is instantiated as
[email protected] the specifiers %I and
%i will be replaced by ttyUSB0. If you introspect
the instanciated unit with systemctl status
[email protected]
you will see these replacements
having taken place:

```\$ systemctl status [email protected]
[email protected] - Getty on ttyUSB0
Active: active (running) since Mon, 26 Sep 2011 04:20:44 +0200; 2s ago
Main PID: 5443 (agetty)
CGroup: name=systemd:/system/[email protected]/ttyUSB0
└ 5443 /sbin/agetty -s ttyUSB0 115200,38400,9600
```

And that is already the core idea of instantiated services in
systemd. As you can see systemd provides a very simple templating
system, which can be used to dynamically instantiate services as
needed. To make effective use of this, a few more notes:

You may instantiate these services on-the-fly in
.wants/ symbolic links in the file system. For example, to
make sure the serial getty on ttyUSB0 is started
automatically at every boot, create a symlink like this:

`# ln -s /lib/systemd/system/[email protected] /etc/systemd/system/getty.target.wants/[email protected]ttyUSB0.service`

systemd will instantiate the symlinked unit file with the
instance name specified in the symlink name.

You cannot instantiate a unit template without specifying an
instance identifier. In other words systemctl start
[email protected]
will necessarily fail since the instance
name was left unspecified.

Sometimes it is useful to opt-out of the generic template
for one specific instance. For these cases make use of the fact that
systemd always searches first for the full instance file name before
falling back to the template file name: make sure to place a unit file
under the fully instantiated name in /etc/systemd/system and
it will override the generic templated version for this specific
instance.

The unit file shown above uses %i at some places and
%I at others. You may wonder what the difference between
these specifiers are. %i is replaced by the exact characters
of the instance identifier. For %I on the other hand the
instance identifier is first passed through a simple unescaping
algorithm. In the case of a simple instance identifier like
ttyUSB0 there is no effective difference. However, if the
device name includes one or more slashes (“/“) this cannot be
part of a unit name (or Unix file name). Before such a device name can
be used as instance identifier it needs to be escaped so that “/”
becomes “-” and most other special characters (including “-“) are
replaced by “xAB” where AB is the ASCII code of the character in
hexadecimal notation[1]. Example: to refer to a USB serial port by its
bus path we want to use a port name like
serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0. The
escaped version of this name is
serial-byx2dpath-pcix2d0000:00:1d.0x2dusbx2d0:1.4:1.1x2dport0. %I
will then refer to former, %i to the latter. Effectively this
means %i is useful wherever it is necessary to refer to other
units, for example to express additional dependencies. On the other
hand %I is useful for usage in command lines, or inclusion in
pretty description strings. Let’s check how this looks with the above unit file:

```# systemctl start '[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service'
# systemctl status '[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service'
[email protected]:00:1d.0x2dusbx2d0:1.4:1.1x2dport0.service - Serial Getty on serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0
Active: active (running) since Mon, 26 Sep 2011 05:08:52 +0200; 1s ago
Main PID: 5788 (agetty)
CGroup: name=systemd:/system/[email protected]/serial-byx2dpath-pcix2d0000:00:1d.0x2dusbx2d0:1.4:1.1x2dport0
└ 5788 /sbin/agetty -s serial/by-path/pci-0000:00:1d.0-usb-0:1.4:1.1-port0 115200 38400 9600
```

As we can see the while the instance identifier is the escaped
string the command line and the description string actually use the
unescaped version, as expected.

(Side note: there are more specifiers available than just
%i and %I, and many of them are actually
available in all unit files, not just templates for service
instances. For more details see the man
page
which includes a full list and terse explanations.)

And at this point this shall be all for now. Stay tuned for a
follow-up article on how instantiated services are used for
inetd-style socket activation.

Footnotes

[1] Yupp, this escaping algorithm doesn’t really result in
particularly pretty escaped strings, but then again, most escaping
algorithms don’t help readability. The algorithm we used here is
inspired by what udev does in a similar case, with one change. In the
end, we had to pick something. If you’ll plan to comment on the
escaping algorithm please also mention where you live so that I can
come around and paint your bike shed yellow with blue stripes. Thanks!