Tag Archives: responder

A quick look at the Ikea Trådfri lighting platform

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/47803.html

Ikea recently launched their Trådfri smart lighting platform in the US. The idea of Ikea plus internet security together at last seems like a pretty terrible one, but having taken a look it’s surprisingly competent. Hardware-wise, the device is pretty minimal – it seems to be based on the Cypress[1] WICED IoT platform, with 100MBit ethernet and a Silicon Labs Zigbee chipset. It’s running the Express Logic ThreadX RTOS, has no services running on any TCP ports and appears to listen on only a single UDP port. As IoT devices go, it’s pleasingly minimal.

That single port seems to be a COAP server running with DTLS and a pre-shared key that’s printed on the bottom of the device. When you start the app for the first time it prompts you to scan a QR code that’s just a machine-readable version of that key. The Android app has code for using the insecure COAP port rather than the encrypted one, but the device doesn’t respond to queries there so it’s presumably disabled in release builds. It’s also local only, with no cloud support. You can program timers, but they run on the device. The only other service it seems to run is an mdns responder, which responds to the _coap._udp.local query to allow for discovery.
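
For anyone who wants to poke at this themselves, here is a minimal discovery sketch. It assumes the third-party python-zeroconf package; the service type comes straight from the observation above, and 5684 is the standard CoAP-over-DTLS port, so that is the port you would expect the gateway to advertise.

```python
# Minimal sketch: discover a Trådfri-style gateway advertising _coap._udp.local via mDNS.
# Assumes the third-party python-zeroconf package; names here are illustrative only.
import time
from zeroconf import Zeroconf, ServiceBrowser

class CoapListener:
    def add_service(self, zc, service_type, name):
        info = zc.get_service_info(service_type, name)
        if info:
            # Newer python-zeroconf exposes parsed_addresses(); older versions expose info.address.
            print(f"Found {name} at {info.parsed_addresses()} port {info.port}")

    def update_service(self, zc, service_type, name):
        pass  # required by newer zeroconf versions

    def remove_service(self, zc, service_type, name):
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, "_coap._udp.local.", CoapListener())
time.sleep(5)   # give the gateway a moment to answer
zc.close()
```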

From a security perspective, this is pretty close to ideal. Having no remote APIs means that security is limited to what’s exposed locally. The local traffic is all encrypted. You can only authenticate with the device if you have physical access to read the (decently long) key off the bottom. I haven’t checked whether the DTLS server is actually well-implemented, but it doesn’t seem to respond unless you authenticate first which probably covers off a lot of potential risks. The SoC has wireless support, but it seems to be disabled – there’s no antenna on board and no mechanism for configuring it.

However, there’s one minor issue. On boot the device grabs the current time from pool.ntp.org (fine) but also hits http://fw.ota.homesmart.ikea.net/feed/version_info.json . That file contains a bunch of links to firmware updates, all of which are also downloaded over http (and not https). The firmware images themselves appear to be signed, but downloading untrusted objects and then parsing them isn’t ideal. Realistically, this is only a problem if someone already has enough control over your network to mess with your DNS, and being wired-only makes this pretty unlikely. I’d be surprised if it’s ever used as a real avenue of attack.

Overall: as far as design goes, this is one of the most secure IoT-style devices I’ve looked at. I haven’t examined the COAP stack in detail to figure out whether it has any exploitable bugs, but the attack surface is pretty much as minimal as it could be while still retaining any functionality at all. I’m impressed.

[1] Formerly Broadcom

Security Orchestration and Incident Response

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/security_orches.html

Last month at the RSA Conference, I saw a lot of companies selling security incident response automation. Their promise was to replace people with computers – sometimes with the addition of machine learning or other artificial intelligence techniques – and to respond to attacks at computer speeds.

While this is a laudable goal, there’s a fundamental problem with doing this in the short term. You can only automate what you’re certain about, and there is still an enormous amount of uncertainty in cybersecurity. Automation has its place in incident response, but the focus needs to be on making the people effective, not on replacing them: security orchestration, not automation.

This isn’t just a choice of words – it’s a difference in philosophy. The US military went through this in the 1990s. What was called the Revolution in Military Affairs (RMA) was supposed to change how warfare was fought. Satellites, drones and battlefield sensors were supposed to give commanders unprecedented information about what was going on, while networked soldiers and weaponry would enable troops to coordinate to a degree never before possible. In short, the traditional fog of war would be replaced by perfect information, providing certainty instead of uncertainty. They, too, believed certainty would fuel automation and, in many circumstances, allow technology to replace people.

Of course, it didn’t work out that way. The US learned in Afghanistan and Iraq that there are a lot of holes in both its collection and coordination systems. Drones have their place, but they can’t replace ground troops. The advances from the RMA brought with them some enormous advantages, especially against militaries that didn’t have access to the same technologies, but never resulted in certainty. Uncertainty still rules the battlefield, and soldiers on the ground are still the only effective way to control a region of territory.

But along the way, we learned a lot about how the feeling of certainty affects military thinking. Last month, I attended a lecture on the topic by H.R. McMaster. This was before he became President Trump’s national security advisor-designate. Then, he was the director of the Army Capabilities Integration Center. His lecture touched on many topics, but at one point he talked about the failure of the RMA. He confirmed that military strategists mistakenly believed that data would give them certainty. But he took this change in thinking further, outlining the ways this belief in certainty had repercussions in how military strategists thought about modern conflict.

McMaster’s observations are directly relevant to Internet security incident response. We too have been led to believe that data will give us certainty, and we are making the same mistakes that the military did in the 1990s. In a world of uncertainty, there’s a premium on understanding, because commanders need to figure out what’s going on. In a world of certainty, knowing what’s going on becomes a simple matter of data collection.

I see this same fallacy in Internet security. Many companies exhibiting at the RSA Conference promised to collect and display more data and that the data will reveal everything. This simply isn’t true. Data does not equal information, and information does not equal understanding. We need data, but we also must prioritize understanding the data we have over collecting ever more data. Much like the problems with bulk surveillance, the “collect it all” approach provides minimal value over collecting the specific data that’s useful.

In a world of uncertainty, the focus is on execution. In a world of certainty, the focus is on planning. I see this manifesting in Internet security as well. My own Resilient Systems – now part of IBM Security – allows incident response teams to manage security incidents and intrusions. While the tool is useful for planning and testing, its real focus is always on execution.

Uncertainty demands initiative, while certainty demands synchronization. Here, again, we are heading too far down the wrong path. The purpose of all incident response tools should be to make the human responders more effective. They need both the ability and the capability to exercise initiative effectively.

When things are uncertain, you want your systems to be decentralized. When things are certain, centralization is more important. Good incident response teams know that decentralization goes hand in hand with initiative. And finally, a world of uncertainty prioritizes command, while a world of certainty prioritizes control. Again, effective incident response teams know this, and effective managers aren’t scared to release and delegate control.

Like the US military, we in the incident response field have shifted too much into the world of certainty. We have prioritized data collection, preplanning, synchronization, centralization and control. You can see it in the way people talk about the future of Internet security, and you can see it in the products and services offered on the show floor of the RSA Conference.

Automation, too, is fixed. Incident response needs to be dynamic and agile, because you are never certain and there is an adaptive, malicious adversary on the other end. You need a response system that has human controls and can modify itself on the fly. Automation just doesn’t allow a system to do that to the extent that’s needed in today’s environment. Just as the military shifted from trying to replace the soldier to making the best soldier possible, we need to do the same.

For some time, I have been talking about incident response in terms of OODA loops. This is a way of thinking about real-time adversarial relationships, originally developed for airplane dogfights, but much more broadly applicable. OODA stands for observe-orient-decide-act, and it’s what people responding to a cybersecurity incident do constantly, over and over again. We need tools that augment each of those four steps. These tools need to operate in a world of uncertainty, where there is never enough data to know everything that is going on. We need to prioritize understanding, execution, initiative, decentralization and command.
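
To make the observe-orient-decide-act framing concrete, here is a deliberately toy Python sketch of an orchestration loop in which tooling can feed each phase but a human makes the decision; all of the names are hypothetical and nothing here is drawn from any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Incident:
    observations: list = field(default_factory=list)
    context: dict = field(default_factory=dict)
    resolved: bool = False

def ooda_loop(incident: Incident,
              observe: Callable[[], list],
              orient: Callable[[list], dict],
              decide: Callable[[dict], str],
              act: Callable[[str], None]) -> None:
    """Run observe-orient-decide-act until a human marks the incident resolved.

    observe and orient are good candidates for automation (log collection,
    enrichment, correlation); decide is where the human responder stays in
    the loop, and act executes whatever playbook step they chose."""
    while not incident.resolved:
        incident.observations = observe()                   # gather fresh telemetry
        incident.context = orient(incident.observations)    # enrich and correlate
        action = decide(incident.context)                    # human-in-the-loop decision
        if action == "close":
            incident.resolved = True
        else:
            act(action)                                      # execute the chosen step
```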

At the same time, we’re going to have to make all of this scale. If anything, the most seductive promise of a world of certainty and automation is that it allows defense to scale. The problem is that we’re not there yet. We can automate and scale parts of IT security, such as antivirus, automatic patching and firewall management, but we can’t yet scale incident response. We still need people. And we need to understand what can be automated and what can’t be.

The word I prefer is orchestration. Security orchestration represents the union of people, process and technology. It’s computer automation where it works, and human coordination where that’s necessary. It’s networked systems giving people understanding and capabilities for execution. It’s making those on the front lines of incident response the most effective they can be, instead of trying to replace them. It’s the best approach we have for cyberdefense.

Automation has its place. If you think about the product categories where it has worked, they’re all areas where we have pretty strong certainty. Automation works in antivirus, firewalls, patch management and authentication systems. None of them is perfect, but all those systems are right almost all the time, and we’ve developed ancillary systems to deal with it when they’re wrong.

Automation fails in incident response because there’s too much uncertainty. Actions can be automated once the people understand what’s going on, but people are still required. For example, IBM’s Watson for Cyber Security provides insights for incident response teams based on its ability to ingest and find patterns in an enormous amount of freeform data. It does not attempt a level of understanding necessary to take people out of the equation.

From within an orchestration model, automation can be incredibly powerful. But it’s the human-centric orchestration model – the dashboards, the reports, the collaboration – that makes automation work. Otherwise, you’re blindly trusting the machine. And when an uncertain process is automated, the results can be dangerous.

Technology continues to advance, and this is all a changing target. Eventually, computers will become intelligent enough to replace people at real-time incident response. My guess, though, is that computers are not going to get there by collecting enough data to be certain. More likely, they’ll develop the ability to exhibit understanding and operate in a world of uncertainty. That’s a much harder goal.

Yes, today, this is all science fiction. But it’s not stupid science fiction, and it might become reality during the lifetimes of our children. Until then, we need people in the loop. Orchestration is a way to achieve that.

This essay previously appeared on the Security Intelligence blog.

Squarespace OCSP Stapling Implementation

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2016/10/24/squarespace-ocsp-impl.html

We’re excited that Squarespace has decided to protect the millions of sites they host with HTTPS! While talking with their team we learned they were deploying OCSP Stapling from the get-go, and we were impressed. We asked them to share their experience with our readers in our first guest blog post (hopefully more to come).

– Josh Aas, Executive Director, ISRG / Let’s Encrypt

OCSP stapling is an alternative approach to the Online Certificate Status Protocol (OCSP) for checking the revocation status of certificates. It allows the presenter of a certificate to bear the resource cost involved in providing OCSP responses by appending (“stapling”) a time-stamped OCSP response signed by the CA to the initial TLS handshake, eliminating the need for clients to contact the CA. The certificate holder queries the OCSP responder at regular intervals and caches the responses.

Traditional OCSP requires the CA to provide responses to each client that requests certificate revocation information. When a certificate is issued for a popular website, a large number of queries start hitting the CA’s OCSP responder server. This poses a privacy risk because information must pass through a third party, and the third party is able to determine who browsed which site at what time. It can also create performance problems, since most browsers will contact the OCSP responder before loading anything on the web page. OCSP stapling is efficient because the user doesn’t have to make a separate connection to the CA, and it’s safe because the OCSP response is digitally signed so it cannot be modified without detection.
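
To illustrate the certificate holder’s side of this, here is a hedged Python sketch of fetching an OCSP response that could then be stapled and cached; it assumes the third-party cryptography and requests packages, and the file paths are placeholders.

```python
# Sketch of the server-side half of stapling: periodically fetch an OCSP response
# for your own certificate and cache the DER bytes to hand to the TLS stack.
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID

def fetch_staple(cert_path="leaf.pem", issuer_path="issuer.pem") -> bytes:
    cert = x509.load_pem_x509_certificate(open(cert_path, "rb").read())
    issuer = x509.load_pem_x509_certificate(open(issuer_path, "rb").read())

    # The responder URL is published in the certificate's Authority Information Access extension.
    aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess).value
    responder_url = next(
        desc.access_location.value
        for desc in aia
        if desc.access_method == AuthorityInformationAccessOID.OCSP
    )

    request = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1()).build()
    reply = requests.post(
        responder_url,
        data=request.public_bytes(serialization.Encoding.DER),
        headers={"Content-Type": "application/ocsp-request"},
    )
    reply.raise_for_status()
    # The raw DER bytes are what gets stapled into the TLS handshake; re-fetch before they expire.
    return reply.content
```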

OCSP Stapling @ Squarespace

As we were planning our rollout of SSL for all custom domains on the Squarespace platform, we decided that we wanted to support OCSP stapling at time of launch. A reverse proxy built by our Edge Infrastructure team is responsible for terminating all SSL traffic; it’s written in Java and powered by Netty. Unfortunately, JDK 8 has only preliminary, client-only OCSP stapling support. JDK 9 introduces OCSP stapling with JEP 249, but it is not available yet.

Our reverse proxy does not use the JDK’s SSL implementation. Instead, we use OpenSSL via netty-tcnative. At this time, neither the original tcnative nor Netty’s fork has OCSP stapling support. However, the tcnative library exposes the inner workings of OpenSSL, including the address pointers for the SSL context and engine. We were able to use JNI to extend the netty-tcnative library and add OCSP stapling support using the tlsext_status OpenSSL C functions. Our extension is a standalone library, but we could equally well fold it into the netty-tcnative library itself. If there is interest, we can contribute it upstream as part of Netty’s next API-breaking development cycle.

One of the goals of our initial OCSP stapling implementation was to take as much load as possible off the OCSP responder’s operator, in this case Let’s Encrypt. Due to the nature of the website traffic on our platform, we have a very long tail. At least to start, we don’t pre-fetch and cache all OCSP responses. We decided to fetch OCSP responses asynchronously, and we try to do so only if more than one client is going to use the response in the foreseeable future. We use Bloom filters to identify “one-hit wonders” that are not worth caching.
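
Squarespace’s actual code isn’t shown in this post, but the idea is easy to sketch: a Bloom filter remembers domains seen once, and only a repeat visit triggers an OCSP fetch. The toy Python below (sizes and names are illustrative) shows that gating logic.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; a real deployment would size it from expected traffic."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

seen_once = BloomFilter()
ocsp_cache = {}  # domain -> cached DER staple

def should_fetch_staple(domain):
    # First hit: remember the domain but don't spend an OCSP fetch on it yet.
    if domain not in seen_once:
        seen_once.add(domain)
        return False
    # Second (or later) hit: worth fetching and caching a staple.
    return domain not in ocsp_cache
```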

Squarespace invests in the security of our customers’ websites and their visitors. We will continue to make refinements to our OCSP stapling implementation to eventually have OCSP staples on all requests. For a more in-depth discussion about the security challenges of traditional OCSP, we recommend this blog post.

How to Help Achieve Mobile App Transport Security (ATS) Compliance by Using Amazon CloudFront and AWS Certificate Manager

Post Syndicated from Lee Atkinson original https://aws.amazon.com/blogs/security/how-to-help-achieve-mobile-app-transport-security-compliance-by-using-amazon-cloudfront-and-aws-certificate-manager/

Web and application users and organizations have expressed a growing desire to conduct most of their HTTP communication securely by using HTTPS. At its 2016 Worldwide Developers Conference, Apple announced that starting in January 2017, apps submitted to its App Store will be required to support App Transport Security (ATS). ATS requires all connections to web services to use HTTPS and TLS version 1.2. In addition, Google has announced that starting in January 2017, new versions of its Chrome web browser will mark HTTP websites as being “not secure.”

In this post, I show how you can generate Secure Sockets Layer (SSL) or Transport Layer Security (TLS) certificates by using AWS Certificate Manager (ACM), apply the certificates to your Amazon CloudFront distributions, and deliver your websites and APIs over HTTPS.

Background

Hypertext Transfer Protocol (HTTP) was proposed originally without the need for security measures such as server authentication and transport encryption. As HTTP evolved from covering simple document retrieval to sophisticated web applications and APIs, security concerns emerged. For example, if someone were able to spoof a website’s DNS name (perhaps by altering the DNS resolver’s configuration), they could direct users to another web server. Users would be unaware of this because the URL displayed by the browser would appear just as the user expected. If someone were able to gain access to network traffic between a client and server, that individual could eavesdrop on HTTP communication and either read or modify the content, without the client or server being aware of such malicious activities.

Hypertext Transfer Protocol Secure (HTTPS) was introduced as a secure version of HTTP. It uses either SSL or TLS protocols to create a secure channel through which HTTP communication can be transported. Using SSL/TLS, servers can be authenticated by using digital certificates. These certificates can be digitally signed by one of the certificate authorities (CAs) trusted by the web client. Certificates help mitigate website spoofing, and they can later be revoked by the CA, providing additional security. Revoked certificates are published by the authority on a certificate revocation list, or their status is made available via an Online Certificate Status Protocol (OCSP) responder. The SSL/TLS “handshake” that initiates the secure channel exchanges encryption keys in order to encrypt the data sent over it.

To avoid warnings from client applications regarding untrusted certificates, a CA that is trusted by the application must sign the certificates. The process of obtaining a certificate from a CA begins with generating a key pair and a certificate signing request. The certificate authority uses various methods in order to verify that the certificate requester is the owner of the domain for which the certificate is requested. Many authorities charge for verification and generation of the certificate.

Use ACM and CloudFront to deliver HTTPS websites and APIs

The process of requesting and paying for certificates, storing and transporting them securely, and repeating the process at renewal time can be a burden for website owners. ACM enables you to easily provision, manage, and deploy SSL/TLS certificates for use with AWS services, including CloudFront. ACM removes the time-consuming manual process of purchasing, uploading, and renewing certificates. With ACM, you can quickly request a certificate, deploy it on your CloudFront distributions, and let ACM handle certificate renewals. In addition to requesting SSL/TLS certificates provided by ACM, you can import certificates that you obtained outside of AWS.
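
For readers who script their infrastructure, requesting such a certificate looks roughly like the following boto3 sketch; the domain is a placeholder, and note that a certificate intended for use with CloudFront must be requested in the us-east-1 region.

```python
import boto3

# Certificates used with CloudFront must live in us-east-1.
acm = boto3.client("acm", region_name="us-east-1")

response = acm.request_certificate(
    DomainName="*.example.com",               # placeholder wildcard domain
    SubjectAlternativeNames=["example.com"],  # optional additional names
    ValidationMethod="EMAIL",                 # the console flow in this post uses email validation
)
print("Requested:", response["CertificateArn"])
```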

CloudFront is a global content delivery network (CDN) service that accelerates the delivery of your websites, APIs, video content, and other web assets. CloudFront’s proportion of traffic delivered via HTTPS continues to increase as more customers use the secure protocol to deliver their websites and APIs.

CloudFront supports Apple’s ATS requirements for TLS 1.2, Perfect Forward Secrecy, server certificates with 2048-bit Rivest-Shamir-Adleman (RSA) keys, and a choice of ciphers. See more details in Supported Protocols and Ciphers.

The following diagram illustrates an architecture with ACM, a CloudFront distribution and its origins, and how they integrate to provide HTTPS access to end users and applications.

Solution architecture diagram

  1. ACM automates the creation and renewal of SSL/TLS certificates and deploys them to AWS resources such as CloudFront distributions and Elastic Load Balancing load balancers at your instruction.
  2. Users communicate with CloudFront over HTTPS. CloudFront terminates the SSL/TLS connection at the edge location.
  3. You can configure CloudFront to communicate to the origin over HTTP or HTTPS.

CloudFront enables easy HTTPS adoption. It provides a default *.cloudfront.net wildcard certificate and supports custom certificates, which can be either created by a third-party CA or created and managed by ACM. ACM automates the process of generating and associating certificates with your CloudFront distribution for the first time and on each renewal. CloudFront supports the Server Name Indication (SNI) TLS extension (enabling efficient use of IP addresses when hosting multiple HTTPS websites) and dedicated-IP SSL/TLS (for older browsers and legacy clients that do not support SNI).

Keeping that background information in mind, I will now show you how you can generate a certificate with ACM and associate it with your CloudFront distribution.

Generate a certificate with ACM and associate it with your CloudFront distribution

In order to help deliver websites and APIs that are compliant with Apple’s ATS requirements, you can generate a certificate in ACM and associate it with your CloudFront distribution.

To generate a certificate with ACM and associate it with your CloudFront distribution:

  1. Go to the ACM console and click Get started.
    ACM "Get started" page
  2. On the next page, type the website’s domain name for your certificate. If applicable, you can enter multiple domains here so that the same certificate can be used for multiple websites. In my case, I type *.leeatk.com to create what is known as a wildcard certificate that can be used for any domain ending in .leeatk.com (that is a domain I own). Click Review and request.
    Request a certificate page
  3. Click Confirm and request. You must now validate that you own the domain. ACM sends an email with a verification link to the domain registrant, technical contact, and administrative contact registered in the Whois record for the domain. ACM also sends the verification link to email addresses commonly associated with an administrator of a domain: administrator, hostmaster, postmaster, and webmaster. ACM sends the same verification email to all these addresses in the expectation that at least one address is monitored by the domain owner. The link in any of the emails can be used to verify the domain.
    List of email addresses to which the email with verification link will be sent
  4. Until the certificate has been validated, the status of the certificate remains Pending validation. When I went through this approval process for *.leeatk.com, I received the verification email shown in the following screenshot. When you receive the verification email, click the link in the email to approve the request.
    Example verification email
  5. After you click I Approve on the landing page, you will then see a page that confirms that you have approved an SSL/TLS certificate for your domain name.
    SSL/TLS certificate confirmation page
  6. Return to the ACM console, and the certificate’s status should become Issued. You may need to refresh the webpage.
    ACM console showing the certificate has been issued
  7. Now that you have created your certificate, go to the CloudFront console and select the distribution with which you want to associate the certificate.
    Screenshot of associating the CloudFront distribution with which to associate the certificate
  8. Click Edit. Scroll down to SSL Certificate and select Custom SSL certificate. From the drop-down list, select the certificate provided by ACM. Select Only Clients that Support Server Name Indication (SNI). You could select All Clients if you want to support older clients that do not support SNI.
    Screenshot of choosing a custom SSL certificate
  9. Save the configuration by clicking Yes, Edit at the bottom of the page.
  10. Now, when you view the website in a browser (Firefox is shown in the following screenshot), you see a green padlock in the address bar, confirming that this page is secured with a certificate trusted by the browser.
    Screenshot showing green padlock in address bar

Configure CloudFront to redirect HTTP requests to HTTPS

We encourage you to use HTTPS to help make your websites and APIs more secure. Therefore, we recommend that you configure CloudFront to redirect HTTP requests to HTTPS.

To configure CloudFront to redirect HTTP requests to HTTPS:

  1. Go to the CloudFront console, select the distribution again, and then click Cache Behavior.
    Screenshot showing Cache Behavior button
  2. In my case, I have only one behavior in my distribution, so I select it and click Edit. (If I had more behaviors, I would repeat this process for each behavior for which I wanted HTTP-to-HTTPS redirection.)
  3. Next to Viewer Protocol Policy, choose Redirect HTTP to HTTPS, and click Yes, Edit at the bottom of the page. (A scripted equivalent of this change, together with the certificate association from the previous section, is sketched after these steps.)
    Screenshot of choosing Redirect HTTP to HTTPS
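
For completeness, here is a hedged boto3 sketch of both changes described in this post (attaching the ACM certificate with SNI and switching the default behavior to redirect HTTP to HTTPS); the distribution ID and certificate ARN are placeholders.

```python
import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1234567890ABC"                                          # placeholder
CERT_ARN = "arn:aws:acm:us-east-1:111122223333:certificate/example"         # placeholder

cfg = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config, etag = cfg["DistributionConfig"], cfg["ETag"]

# Associate the ACM certificate and serve it to SNI-capable clients only.
config["ViewerCertificate"] = {
    "ACMCertificateArn": CERT_ARN,
    "SSLSupportMethod": "sni-only",
    "MinimumProtocolVersion": "TLSv1",
}

# Redirect plain HTTP viewers to HTTPS on the default cache behavior.
config["DefaultCacheBehavior"]["ViewerProtocolPolicy"] = "redirect-to-https"

cloudfront.update_distribution(Id=DISTRIBUTION_ID, DistributionConfig=config, IfMatch=etag)
```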

I could also consider employing an HTTP Strict Transport Security (HSTS) policy on my website. In this case, I would add a Strict-Transport-Security response header at my origin to instruct browsers and other applications to make only HTTPS requests to my website for a period of time specified in the header’s value. This ensures that if a user submits a URL to my website specifying only HTTP, the browser will make an HTTPS request anyway. This is also useful for websites that link to my website using HTTP URLs.
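
As a minimal illustration of what that origin-side header could look like, the Python standard-library handler below adds an HSTS header to every response; the one-year max-age is just an example value.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HstsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Tell browsers to use HTTPS for the next year, for this host and its subdomains.
        self.send_header("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello over HTTPS\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), HstsHandler).serve_forever()
```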

Summary

CloudFront and ACM enable more secure communication between your users and your websites. CloudFront allows you to adopt HTTPS for your websites and APIs. ACM provides a simple way to request, manage, and renew your SSL/TLS certificates, and deploy them to AWS services such as CloudFront. Mobile application developers and API providers can now more easily meet Apple’s ATS requirements by using CloudFront, in time for the January 2017 deadline.

If you have comments about this post, submit them in the “Comments” section below. If you have implementation questions, please start a new thread on the CloudFront forum.

– Lee

Responder – LLMNR, MDNS and NBT-NS Poisoner

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/RT7cVz5g4FU/

Responder is an LLMNR, NBT-NS and MDNS poisoner. It answers specific NBT-NS (NetBIOS Name Service) queries based on their name suffix (see: NetBIOS Suffixes). By default, the tool only answers File Server Service requests, which are for SMB. The concept behind this is to target our answers, and be stealthier on […]

Read the full post at darknet.org.uk

Receiving Email with Amazon SES

Post Syndicated from Peter Winckles original http://sesblog.amazon.com/post/Tx3HQEJUJNVABG8/Receiving-Email-with-Amazon-SES

The Amazon SES team is pleased to announce that you can now use SES to receive email!

For the past four years, SES has strived to make your life easier by maintaining a fleet of SMTP servers ready to send mail when you want it. There’s no need to worry about scaling, ensuring message delivery, or navigating relationships with countless email service providers.

However, you’d still need to manage a fleet of SMTP servers if you wanted to receive mail. As with sending mail, receiving mail comes with its own set of headaches: scaling for traffic spikes, blocking malicious senders, filtering out spam and viruses, and ultimately routing mail to your application, to name a few.

As of today, the SES team would like to invite you to say goodbye to these hassles, and rely on SES to simply receive your mail just as you rely on us to simply send your mail.

Why should I use SES to receive mail?

SES is ideally suited for servicing mail that is programmatically actionable. The following are a handful of common use cases that you can now leverage SES to solve:

  • Automatically create support tickets from customer email.
  • Implement an email auto-responder.
  • Process email list unsubscribe requests.
  • Process email bounces and complaints.
  • Create an email archival solution.
  • Update correspondence in tickets, forums, etc. by email.
  • Receive files from customers via email.

You can also use SES to manage your organization’s entire mail stream, directing mail destined to personal inboxes to Amazon WorkMail and processing customer service mail, etc. programmatically with SES.

How does it work?

Think of SES as an email gateway to the AWS ecosystem. After onboarding your domain onto SES, we will receive mail on your behalf, and allow you to consume it through a variety of different AWS services. For example, you can configure SES to deliver all of your mail to an Amazon S3 bucket, and process it directly using AWS Lambda.
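
As a rough sketch of that S3-plus-Lambda pattern, the handler below assumes a receipt rule whose S3 action stores each raw message under its SES message ID in a placeholder bucket, and whose Lambda action then invokes this function.

```python
import email
import boto3

s3 = boto3.client("s3")
BUCKET = "incoming-mail-example"  # placeholder: the bucket named in the rule's S3 action

def handler(event, context):
    # SES invokes the function with one record per received message.
    for record in event["Records"]:
        mail = record["ses"]["mail"]
        # With a default S3 action, the object key is the SES message ID
        # (plus any ObjectKeyPrefix you configured).
        raw = s3.get_object(Bucket=BUCKET, Key=mail["messageId"])["Body"].read()
        message = email.message_from_bytes(raw)
        print("From:", message["From"], "| Subject:", message["Subject"])
```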

SES empowers you to make decisions about how your mail is processed through the concept of a rule set. Every account that receives mail using SES has a single active rule set that you customize to dictate to SES what you’d like done with your mail across all of your SES-managed domains.

A rule set is simply an ordered list of rules, and a rule is a combination of a matching condition and an ordered list of actions. A condition is something like "All mail to [email protected]" or "All mail to example.com and all subdomains." Actions are things like "Encrypt my mail using my AWS KMS key, write it to my S3 bucket, and notify me of the delivery via Amazon SNS" or "Asynchronously execute my Lambda function that updates my mailing list based on unsubscribe emails" or "Send me a SNS notification containing the email." A more thorough discussion of rule sets, rules, and actions can be found in our developer guide.
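
To make the rule and action vocabulary concrete, here is a hedged boto3 sketch that creates and activates a rule set containing one rule; the recipients, bucket, and Lambda ARN are placeholders, and the full set of options is in the developer guide.

```python
import boto3

ses = boto3.client("ses")

ses.create_receipt_rule_set(RuleSetName="default-rule-set")
ses.create_receipt_rule(
    RuleSetName="default-rule-set",
    Rule={
        "Name": "store-and-process",
        "Enabled": True,
        "ScanEnabled": True,  # let SES flag spam and viruses
        "Recipients": ["support@example.com", "a.example.com"],  # placeholder addresses/domains
        "Actions": [
            {"S3Action": {"BucketName": "incoming-mail-example", "ObjectKeyPrefix": "support/"}},
            {"LambdaAction": {
                "FunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:process-mail",
                "InvocationType": "Event",
            }},
        ],
    },
)
ses.set_active_receipt_rule_set(RuleSetName="default-rule-set")
```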

Your rule set is sequentially evaluated for every message SES receives, and only the actions that apply to the message are executed. This enables you to write rules that route mail differently based on individual message characteristics. You can have a rule that drops mail that SES flags as spam across all of your domains, another that writes mail to a.example.com to one S3 bucket, another that writes mail to b.example.com to a different bucket and then executes a Lambda function but only when the email contains a specific header value, and so on.

The system was designed to be both highly customizable and convenient to use. Our goal is to minimize the amount of custom email routing or parsing logic that your application needs to do, and, if you capitalize on our Lambda integration, you may not even need an application at all!

How do I get started?

The best place to start is the SES developer guide. It provides detailed instructions on how to onboard a domain onto SES to receive mail, as well as walks you through the process of setting up rules to govern your mail flows. Then, head over to the SES console to set up your domains to begin receiving mail!

Finally, if you’re heading to AWS re:Invent this year, be sure to check out our presentation showcasing our new features!

Organizing Software Deployments to Match Failure Conditions

Post Syndicated from Nick Trebon original https://aws.amazon.com/blogs/architecture/organizing-software-deployments-to-match-failure-conditions/

Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs or misconfigurations) will occasionally occur. As a service owner, the goal is to reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment choices we’ve made on the Route 53 team and how those choices affect the availability of our service.

To begin, I’ll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide some context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service’s availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension is that efforts to reduce the number of failed deployments in production tend to add rigidity and processes that slow down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production. Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a “wave” and consists of some portion of our fleet. A pipeline is a good abstraction as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified; this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of production deployment at least 24 hours apart, in order to allow the changes to “bake.” Letting our software bake refers to rolling out software changes slowly to allow us to validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. Another benefit of our deployment infrastructure is that it provides us with a mechanism to quickly “roll back” a deployment to a previous software version if any problems are detected, which, in many cases, enables us to quickly mitigate a failed deployment.

Based on our experiences, we have further organized our deployments to try and match our failure conditions to further reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of our deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and DNS change propagation system. Essentially, this is the part of our system that accepts a customer request to create or delete a DNS record and then the transmission of that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own set of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct set of failure conditions, the control plane and data plane have their own deployment strategies, which are each discussed in more detail below.

Control Plane Deployments

The bulk of the control plane actually consists of two APIs. The first is our external API that is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just the ones needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer’s ability to view or modify their DNS records. The availability of this API is critical as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for our hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in it being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected. A similar approach could be applied to services that utilize Route 53 plus ELB in multiple regions and availability zones for their services. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration will enable automatic failover at the DNS-level for an unhealthy region or availability zone. To enable manual failover, simply convert the alias resource record set for your ELB to either a weighted alias or associate it with a health check whose health you control. To initiate a failover, simply set the weight to 0 or fail the health check. A weighted alias also allows you the ability to slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.
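
As a concrete (and hedged) illustration of the Route 53 configuration described above, the boto3 sketch below upserts a weighted alias record that evaluates target health; the hosted zone, ELB DNS name, and zone IDs are placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Placeholder values: your hosted zone, plus the ELB's DNS name and its canonical hosted zone ID.
HOSTED_ZONE_ID = "Z3EXAMPLE"
ELB_DNS = "my-elb-1234567890.us-east-1.elb.amazonaws.com"
ELB_ZONE_ID = "Z35SXDOTRQ7X7K"

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": "us-east-1",
                "Weight": 100,  # set to 0 to manually fail away from this ELB
                "AliasTarget": {
                    "HostedZoneId": ELB_ZONE_ID,
                    "DNSName": ELB_DNS,
                    # Fail over automatically if every instance behind the ELB is unhealthy.
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    },
)
```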

For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (batch fleet), (2) all other operations used to operate and monitor the service (operations fleet) and (3) all customer operations (customer fleet). Deployments to the production internal API are then organized by how critical their traffic is to the service as a whole. For instance, the batch fleet is deployed to first because their operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below that of customer traffic as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split-world where multiple versions of the software are running simultaneously. We have found this to be useful in catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed with a single host at a time. For this API, we also include a deep application health check as part of the deployment. Similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.

Data Plane Deployments

As mentioned earlier, our data plane consists of Route 53’s DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an ‘edge location’). An important consideration with our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belong to a “stripe” (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is therefore announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API is important, the availability and correctness of our data plane are even more critical. In this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully to both prevent failed deployments and to reduce potential impact.

The safest way to deploy and minimize impact would be to deploy to a single edge location at a time. However, with manual deployments that are overseen by a developer, this approach is just not scalable with how frequently we deploy new software to over 50 locations (with more added each year). Thus, most of our production deployment waves consist of multiple locations; the one exception is our first wave that includes just a single location. Furthermore, this location is specifically chosen because it runs our oldest hardware, which provides us a quick notification for any unintended performance degradation. It is important to note that while the caching behavior for resolvers can cause issues if we serve an incorrect answer, they handle other types failures well. When a recursive resolver receives a query for a record that is not cached, it will typically issue queries to at least three of the four delegation name servers in parallel and it will use the first response it receives. Thus, in the event where one of our locations is black holing customer queries (i.e., not replying to DNS queries), the resolver should receive a response from one of the other delegation name servers. In this case, the only impact is to resolvers where the edge location that is not answering would have been the fastest responder. Now, that resolver will effectively be waiting for the response from the second fastest stripe. To take advantage of this resiliency, our other waves are organized such that they include edge locations that are geographically diverse, with the intent that for any single resolver, there will be nearby locations that are not included in the current deployment wave. Furthermore, to guarantee that at most a single nameserver for all customers is affected, waves are actually organized by stripe. Finally, each stripe is spread across multiple waves so that failures impact only a single name server for a portion of our customers. An example of this strategy is depicted below. A few notes: our staging environment consists of a much smaller number of edge locations than production, so single-location waves are possible. Second, each stripe is denoted by color; in this example, we see deployments spread across a blue and orange stripe. You, too, can think about organizing your deployment strategy around your failure conditions. For example, if you have a database schema used by both your production system and a warehousing system, deploy the change to the warehousing system first to ensure you haven’t broken any compatibility. You might catch problems with the warehousing system before it affects customer traffic.

Conclusions

Our team’s experience with operating Route 53 over the last 3+ years have highlighted the importance of reducing the impact from failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way so that we increase the ease of mitigation while decreasing the potential impact to our customers.

– Nick Trebon

Organizing Software Deployments to Match Failure Conditions

Post Syndicated from AWS Architecture Blog original https://www.awsarchitectureblog.com/2014/05/organizing-software-deployments-to-match-failure-conditions.html

Deploying new software into production always carries some amount of risk, and failed deployments (software bugs, misconfigurations and the like) will occasionally occur. As a service owner, your goal is to reduce the number of these incidents and to limit customer impact when they do occur. One way to reduce potential impact is to shape your deployment strategies around the failure conditions of your service; when a deployment fails, the service owner then has more control over the blast radius and the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures affect your customers. This blog post discusses some of the deployment strategy choices we've made on the Route 53 team and how those choices affect the availability of our service.

To begin, I'll briefly describe some of our deployment procedures and the Route 53 architecture in order to provide context for the deployment strategies we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service's availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension in trying to reduce the number of failed deployments in production is that adding more rigidity and process slows down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production. Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a "wave" and consists of some portion of our fleet. A pipeline is a good abstraction, as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified: this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of a production deployment at least 24 hours apart, in order to allow the changes to "bake." Letting our software bake refers to rolling out software changes slowly so that we can validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. Another benefit of our deployment infrastructure is that it gives us a mechanism to quickly "roll back" a deployment to a previous software version if any problems are detected, which in many cases lets us quickly mitigate a failed deployment.
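To make the wave/bake/rollback idea concrete, here is a minimal sketch in Python. The wave names, hosts, bake times and helper functions are entirely hypothetical (this is not Route 53's actual tooling); the point is simply that each wave is verified, allowed to bake, and can halt the pipeline so the previous version can be restored.

```python
# Hypothetical sketch of a wave-based deployment pipeline with bake times.
# Wave names, hosts, bake durations and the deploy()/metrics_healthy() helpers
# are illustrative placeholders, not a real deployment system.
import time

WAVES = [
    {"name": "staging",     "hosts": ["stg-1", "stg-2"],   "bake_hours": 0},
    {"name": "prod-wave-1", "hosts": ["prod-1"],           "bake_hours": 24},
    {"name": "prod-wave-2", "hosts": ["prod-2", "prod-3"], "bake_hours": 24},
    {"name": "prod-wave-3", "hosts": ["prod-4", "prod-5"], "bake_hours": 0},
]

def deploy(host, version):
    # Placeholder for the real deployment step; return False to simulate a failure.
    print(f"deploying {version} to {host}")
    return True

def metrics_healthy(wave):
    # Placeholder for automatic, scheduled and manual verification of service metrics.
    return True

def run_pipeline(version):
    for wave in WAVES:
        for host in wave["hosts"]:
            if not deploy(host, version):
                raise RuntimeError(f"deployment halted at {host}; roll back {wave['name']}")
        if not metrics_healthy(wave):
            raise RuntimeError(f"metrics regressed after {wave['name']}; roll back")
        # Let the change bake with production traffic before the next wave.
        time.sleep(wave["bake_hours"] * 3600)

# run_pipeline("v2.1")
```

The important property is that any problem surfaces while only a small, well-defined slice of the fleet is running the new code.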

Based on our experiences, we have further organized our deployments to match our failure conditions and further reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of the deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and our DNS change propagation system. Essentially, this is the part of our system that accepts a customer's request to create or delete a DNS record and then transmits that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own sets of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct sets of failure conditions, the control plane and data plane have their own deployment strategies, each of which is discussed in more detail below.

Control and Data Planes

Control Plane Deployments

The bulk of the control plane actually consists of two APIs. The first is our external API, which is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second is our internal API, which supports a much larger set of operations than just the ones needed by the external API: it also includes the operations required to monitor and propagate DNS changes to our DNS servers, as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer's ability to view or modify their DNS records. The availability of this API is critical, as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures are quickly caught and mitigated by the load balancer for the hosts in that availability zone, which is responsible for health checking them: hosts that fail these load balancer health checks are automatically removed from service. Thus, a failed deployment to a single host results in that host being removed from service and the deployment being halted, all without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected.

A similar approach can be applied to services that use Route 53 plus ELB across multiple regions and availability zones. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see the ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration enables automatic failover at the DNS level for an unhealthy region or availability zone. To enable manual failover, convert the alias resource record set for your ELB to a weighted alias or associate it with a health check whose status you control; to initiate a failover, set the weight to 0 or fail the health check. A weighted alias also lets you slowly increase the traffic to an ELB, which can be useful for verifying your own software deployments to the back-end instances.
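As a rough illustration of the manual failover setup described above, here is a sketch using the boto3 SDK. The hosted zone ID, record name, set identifier and ELB DNS name are placeholders you would replace with your own values.

```python
# Sketch: a weighted alias record pointing at an ELB, with target health evaluation.
# All IDs and names below are placeholders.
import boto3

route53 = boto3.client("route53")

def upsert_weighted_alias(weight):
    route53.change_resource_record_sets(
        HostedZoneId="Z_MY_ZONE_ID",                 # your hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com.",
                "Type": "A",
                "SetIdentifier": "us-east-1",        # one weighted record per region/AZ
                "Weight": weight,
                "AliasTarget": {
                    "HostedZoneId": "Z_ELB_ZONE_ID", # the ELB's canonical hosted zone ID
                    "DNSName": "my-elb-1234.us-east-1.elb.amazonaws.com.",
                    "EvaluateTargetHealth": True,    # fail away if all instances are unhealthy
                },
            },
        }]},
    )

upsert_weighted_alias(100)   # normal operation
# To fail away from this ELB manually, drop its weight to zero:
# upsert_weighted_alias(0)
```

Setting the weight back to a non-zero value restores traffic, and stepping it up gradually is one way to verify a new back-end deployment with a small slice of traffic first.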

For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (the batch fleet), (2) all other operations used to operate and monitor the service (the operations fleet) and (3) all customer operations (the customer fleet). Deployments to the production internal API are then organized by how critical each fleet's traffic is to the service as a whole. For instance, the batch fleet is deployed to first because its operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below the customer fleet, as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves: in staging, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split world where multiple versions of the software are running simultaneously, which we have found useful for catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed one host at a time. For this API, we also include a deep application health check as part of the deployment; similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.

Internal API Wave Deployments
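A simplified sketch of this ordering might look like the following. The fleet membership, the deploy() helper and the health-check endpoint are all hypothetical, but the shape (least-critical fleet first, one host at a time, halt on a failed deep health check) mirrors the strategy described above.

```python
# Illustrative sketch: deploy fleet by fleet, one host at a time, halting on a
# failed deep application health check. Host names and the endpoint are made up.
import urllib.request

FLEETS = {
    "batch":      ["batch-1", "batch-2"],
    "operations": ["ops-1", "ops-2"],
    "customer":   ["cust-1", "cust-2", "cust-3"],
}

def deploy(host, version):
    # Placeholder for the real per-host deployment step.
    print(f"deploying {version} to {host}")

def deep_health_check(host):
    # Hypothetical deep health endpoint exposed by the application.
    try:
        with urllib.request.urlopen(f"http://{host}:8080/deep-health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def deploy_internal_api(version):
    for fleet in ("batch", "operations", "customer"):   # least critical first
        for host in FLEETS[fleet]:
            deploy(host, version)
            if not deep_health_check(host):
                raise RuntimeError(f"deep health check failed on {host}; halting deployment")

# deploy_internal_api("build-1234")
```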

Data Plane Deployments

As mentioned earlier, our data plane consists of Route 53's DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an 'edge location'). An important consideration in our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belongs to a "stripe" (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API are important, the availability and correctness of our data plane are even more critical: here, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully, both to prevent failed deployments and to reduce their potential impact.

The safest way to deploy and minimize impact would be to deploy to a single edge location at a time. However, with manual deployments overseen by a developer, this approach simply does not scale given how frequently we deploy new software to over 50 locations (with more added each year). Thus, most of our production deployment waves consist of multiple locations; the one exception is our first wave, which includes just a single location. This location is specifically chosen because it runs our oldest hardware, which gives us quick notification of any unintended performance degradation.

It is important to note that while the caching behavior of resolvers can cause issues if we serve an incorrect answer, resolvers handle other types of failures well. When a recursive resolver receives a query for a record that is not cached, it will typically issue queries to at least three of the four delegation name servers in parallel and use the first response it receives. Thus, in the event that one of our locations is black holing customer queries (i.e., not replying to DNS queries at all), the resolver should still receive a response from one of the other delegation name servers. In this case, the only impact is to resolvers for which the non-answering edge location would have been the fastest responder; that resolver will effectively be waiting for the response from the second-fastest stripe. To take advantage of this resiliency, our other waves are organized so that they include edge locations that are geographically diverse, with the intent that for any single resolver there will be nearby locations not included in the current deployment wave. Furthermore, to guarantee that at most a single name server is affected for any customer, waves are organized by stripe. Finally, each stripe is spread across multiple waves so that a failure impacts only a single name server for a portion of our customers.

An example of this strategy is depicted below. A few notes: first, our staging environment consists of a much smaller number of edge locations than production, so single-location waves are possible there. Second, each stripe is denoted by color; in this example, we see deployments spread across a blue and an orange stripe. You, too, can think about organizing your deployment strategy around your failure conditions. For example, if you have a database schema used by both your production system and a warehousing system, deploy the change to the warehousing system first to ensure you haven't broken any compatibility; you might catch problems with the warehousing system before they affect customer traffic.

Data Plane Wave Deployments
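For readers who want to experiment with a similar scheme, here is a small, self-contained sketch that groups made-up edge locations into waves by stripe, with a single canary location as the first wave. A real implementation would also pick geographically diverse sites within each wave; the location names, stripe assignments and wave sizes below are invented for illustration.

```python
# Sketch: organize edge locations into deployment waves keyed by stripe, so that
# any one wave touches at most one of a zone's four delegation name servers.
# Locations, stripes and wave sizes are made up.
from collections import defaultdict

EDGE_LOCATIONS = [
    # (location, stripe)
    ("oldest-hw-site", "blue"),
    ("site-a", "blue"), ("site-b", "blue"), ("site-c", "blue"),
    ("site-d", "orange"), ("site-e", "orange"), ("site-f", "orange"),
]

def build_waves(locations, wave_size=2):
    by_stripe = defaultdict(list)
    for loc, stripe in locations:
        by_stripe[stripe].append(loc)

    waves = [["oldest-hw-site"]]                     # wave 1: single canary location
    for stripe, locs in by_stripe.items():
        remaining = [l for l in locs if l != "oldest-hw-site"]
        # Split each stripe across multiple waves; a wave never mixes stripes.
        for i in range(0, len(remaining), wave_size):
            waves.append(remaining[i:i + wave_size])
    return waves

for i, wave in enumerate(build_waves(EDGE_LOCATIONS), start=1):
    print(f"wave {i}: {wave}")
```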

Conclusions

Our team's experience operating Route 53 over the last 3+ years has highlighted the importance of reducing the impact of failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way that we increase the ease of mitigation while decreasing the potential impact to our customers.

– Nick Trebon

Avahi Gains Compatibility Layers for Apple Bonjour and HOWL

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/avahi-compat.html

A short while ago I checked into SVN two API/ABI compatibility modules which implement the HOWL and the Apple Bonjour (dns_sd.h) DNS-SD/mDNS APIs on top of Avahi’s native API. Effectively this means that you can run *all* Zeroconf-enabled software that is available for free operating systems seamlessly on top of Avahi. Or at least the software that uses the limited subset of API functions we support; missing functions will be implemented on demand. Gnome-VFS/Nautilus works perfectly, as does Gobby, which are the only real-world applications we have tested so far.

The list of supported/unsupported functions is available from SVN, both for HOWL and for dns-sd.h.

The compatibility layers are actually pretty interesting pieces of code: for compatibility with the way HOWL/Bonjour integrates with event loops we had to hook up the timeout and I/O watches D-BUS depends on to a single file descriptor. This involves all kinds of ugly things like threading and “creative” ways to use the event loop abstraction Avahi provides. Some might call this “cracktastic”, but it actually works pretty well.

The compatibility layers are not intended to be long-term solutions. For every session object we create a background thread that polls for events, plus a DBUS session object. This is an utter waste of resources, especially with dns_sd.h, where every basic operation uses a session object of its own. In addition, our compatibility layers are incomplete: we do not offer the full set of functions or the full semantics. Our compatibility is just good enough to make most Zeroconf-aware programs work with Avahi right now.

We consider neither dns_sd.h nor the HOWL API a “well designed” API and encourage people to port their programs to our more powerful native API. To stress this, the two modules warn the user about their usage by writing a warning line to STDERR and syslog. Hopefully this will annoy people sufficiently that Avahi adoption speeds up a little.

To our own surprise, we actually support at least one more API function than each of the reference implementations! From dns_sd.h we support DNSServiceEnumerateDomains(), which is actually unsupported by Apple Bonjour on POSIX/Linux systems. The documented HOWL function sw_ipv4_address_decompose() is actually a NOOP in the reference implementation, but isn’t in our compatibility layer.

Since dns_sd.h is the only file licensed under a BSD license in the otherwise APSL-licensed mDNSResponder distribution, we were able to copy it into our sources untouched.

Here’s a screenshot of Nautilus and Gobby running on top of Avahi through the HOWL compatibility layers.
