Tag Archives: software updates

Introducing AWS Directory Service for Microsoft Active Directory (Standard Edition)

Post Syndicated from Peter Pereira original https://aws.amazon.com/blogs/security/introducing-aws-directory-service-for-microsoft-active-directory-standard-edition/

Today, AWS introduced AWS Directory Service for Microsoft Active Directory (Standard Edition), also known as AWS Microsoft AD (Standard Edition), which is managed Microsoft Active Directory (AD) that is performance optimized for small and midsize businesses. AWS Microsoft AD (Standard Edition) offers you a highly available and cost-effective primary directory in the AWS Cloud that you can use to manage users, groups, and computers. It enables you to join Amazon EC2 instances to your domain easily and supports many AWS and third-party applications and services. It also can support most of the common use cases of small and midsize businesses. When you use AWS Microsoft AD (Standard Edition) as your primary directory, you can manage access and provide single sign-on (SSO) to cloud applications such as Microsoft Office 365. If you have an existing Microsoft AD directory, you can also use AWS Microsoft AD (Standard Edition) as a resource forest that contains primarily computers and groups, allowing you to migrate your AD-aware applications to the AWS Cloud while using existing on-premises AD credentials.

In this blog post, I help you get started by answering three main questions about AWS Microsoft AD (Standard Edition):

  1. What do I get?
  2. How can I use it?
  3. What are the key features?

After answering these questions, I show how you can get started with creating and using your own AWS Microsoft AD (Standard Edition) directory.

1. What do I get?

When you create an AWS Microsoft AD (Standard Edition) directory, AWS deploys two Microsoft AD domain controllers powered by Microsoft Windows Server 2012 R2 in your Amazon Virtual Private Cloud (VPC). To help deliver high availability, the domain controllers run in different Availability Zones in the AWS Region of your choice.

As a managed service, AWS Microsoft AD (Standard Edition) configures directory replication, automates daily snapshots, and handles all patching and software updates. In addition, AWS Microsoft AD (Standard Edition) monitors and automatically recovers domain controllers in the event of a failure.

AWS Microsoft AD (Standard Edition) has been optimized as a primary directory for small and midsize businesses with the capacity to support approximately 5,000 employees. With 1 GB of directory object storage, AWS Microsoft AD (Standard Edition) has the capacity to store 30,000 or more total directory objects (users, groups, and computers). AWS Microsoft AD (Standard Edition) also gives you the option to add domain controllers to meet the specific performance demands of your applications. You also can use AWS Microsoft AD (Standard Edition) as a resource forest with a trust relationship to your on-premises directory.

2. How can I use it?

With AWS Microsoft AD (Standard Edition), you can share a single directory for multiple use cases. For example, you can share a directory to authenticate and authorize access for .NET applications, Amazon RDS for SQL Server with Windows Authentication enabled, and Amazon Chime for messaging and video conferencing.

The following diagram shows some of the use cases for your AWS Microsoft AD (Standard Edition) directory, including the ability to grant your users access to external cloud applications and allow your on-premises AD users to manage and have access to resources in the AWS Cloud.

Diagram showing some ways you can use AWS Microsoft AD (Standard Edition)

Use case 1: Sign in to AWS applications and services with AD credentials

You can enable multiple AWS applications and services such as the AWS Management Console, Amazon WorkSpaces, and Amazon RDS for SQL Server to use your AWS Microsoft AD (Standard Edition) directory. When you enable an AWS application or service in your directory, your users can access the application or service with their AD credentials.

For example, you can enable your users to sign in to the AWS Management Console with their AD credentials. To do this, you enable the AWS Management Console as an application in your directory, and then assign your AD users and groups to IAM roles. When your users sign in to the AWS Management Console, they assume an IAM role to manage AWS resources. This makes it easy for you to grant your users access to the AWS Management Console without needing to configure and manage a separate SAML infrastructure.

Use case 2: Manage Amazon EC2 instances

Using familiar AD administration tools, you can apply AD Group Policy objects (GPOs) to centrally manage your Amazon EC2 for Windows or Linux instances by joining your instances to your AWS Microsoft AD (Standard Edition) domain.

In addition, your users can sign in to your instances with their AD credentials. This eliminates the need to use individual instance credentials or distribute private key (PEM) files, and it lets you grant or revoke access immediately with the AD user administration tools you already use.

Use case 3: Provide directory services to your AD-aware workloads

AWS Microsoft AD (Standard Edition) is an actual Microsoft AD that enables you to run traditional AD-aware workloads such as Remote Desktop Licensing Manager, Microsoft SharePoint, and Microsoft SQL Server Always On in the AWS Cloud. AWS Microsoft AD (Standard Edition) also helps you to simplify and improve the security of AD-integrated .NET applications by using group Managed Service Accounts (gMSAs) and Kerberos constrained delegation (KCD).

Use case 4: SSO to Office 365 and other cloud applications

You can use AWS Microsoft AD (Standard Edition) to provide SSO for cloud applications. You can use Azure AD Connect to synchronize your users into Azure AD, and then use Active Directory Federation Services (AD FS) so that your users can access Microsoft Office 365 and other SAML 2.0 cloud applications by using their AD credentials.

Use case 5: Extend your on-premises AD to the AWS Cloud

If you already have an AD infrastructure and want to use it when migrating AD-aware workloads to the AWS Cloud, AWS Microsoft AD (Standard Edition) can help. You can use AD trusts to connect AWS Microsoft AD (Standard Edition) to your existing AD. This means your users can access AD-aware and AWS applications with their on-premises AD credentials, without needing you to synchronize users, groups, or passwords.

For example, your users can sign in to the AWS Management Console and Amazon WorkSpaces by using their existing AD user names and passwords. Also, when you use AD-aware applications such as SharePoint with AWS Microsoft AD (Standard Edition), your logged-in Windows users can access these applications without needing to enter credentials again.

3. What are the key features?

AWS Microsoft AD (Standard Edition) includes the features detailed in this section.

Extend your AD schema

With AWS Microsoft AD, you can run customized AD-integrated applications that require changes to your directory schema, which defines the structures of your directory. The schema is composed of object classes such as user objects, which contain attributes such as user names. AWS Microsoft AD lets you extend the schema by adding new AD attributes or object classes that are not present in the core AD attributes and classes.

For example, if you have a human resources application that uses employee badge color to assign specific benefits, you can extend the schema to include a badge color attribute in the user object class of your directory. To learn more, see How to Move More Custom Applications to the AWS Cloud with AWS Directory Service.

Create user-specific password policies

With user-specific password policies, you can apply specific restrictions and account lockout policies to different types of users in your AWS Microsoft AD (Standard Edition) domain. For example, you can enforce strong passwords and frequent password change policies for administrators, and use less-restrictive policies with moderate account lockout policies for general users.

Add domain controllers

You can increase the performance and redundancy of your directory by adding domain controllers. This can help improve application performance by enabling directory clients to load-balance their requests across a larger number of domain controllers.

Encrypt directory traffic

You can use AWS Microsoft AD (Standard Edition) to encrypt Lightweight Directory Access Protocol (LDAP) communication between your applications and your directory. By enabling LDAP over Secure Sockets Layer (SSL)/Transport Layer Security (TLS), also called LDAPS, you encrypt your LDAP communications end to end. This helps you to protect sensitive information you keep in your directory when it is accessed over untrusted networks.

Improve the security of signing in to AWS services by using multi-factor authentication (MFA)

You can improve the security of signing in to AWS services, such as Amazon WorkSpaces and Amazon QuickSight, by enabling MFA in your AWS Microsoft AD (Standard Edition) directory. With MFA, your users must enter a one-time passcode (OTP) in addition to their AD user names and passwords to access AWS applications and services you enable in AWS Microsoft AD (Standard Edition).

Get started

To get started, use the Directory Service console to create your first directory with just a few clicks. If you have not used Directory Service before, you may be eligible for a 30-day limited free trial.

Summary

In this blog post, I explained what AWS Microsoft AD (Standard Edition) is and how you can use it. With a single directory, you can address many use cases for your business, making it easier to migrate and run your AD-aware workloads in the AWS Cloud, provide access to AWS applications and services, and connect to other cloud applications. To learn more about AWS Microsoft AD, see the Directory Service home page.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about this blog post, start a new thread on the Directory Service forum.

– Peter

Solus 3 released

Post Syndicated from corbet original https://lwn.net/Articles/731121/rss

The Solus distribution project has announced the availability of Solus 3. “This is the third iteration of Solus since our move to become a rolling release operating system. Unlike the previous iterations, however, this is a release and not a snapshot. We’ve now moved away from the ‘regular snapshot’ model to accommodate the best hybrid approach possible – feature rich releases with explicit goals and technology enabling, along with the benefits of a curated rolling release operating system.” Headline features include support for the Snap packaging format, a lot of desktop changes, and numerous software updates. (LWN looked at Solus in 2016).

Blue/Green Deployments with Amazon EC2 Container Service

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/bluegreen-deployments-with-amazon-ecs/

This post and accompanying code were generously contributed by:

Jeremy Cowan
Solutions Architect
Anuj Sharma
DevOps Cloud Architect
Peter Dalbhanjan
Solutions Architect

Deploying software updates in traditional non-containerized environments is hard and fraught with risk. When you write your deployment package or script, you have to assume that the target machine is in a particular state. If your staging environment is not an exact mirror image of your production environment, your deployment could fail. These failures frequently cause outages that persist until you re-deploy the last known good version of your application. If you are an Operations Manager, this is what keeps you up at night.

Increasingly, customers want to do testing in production environments without exposing customers to the new version until the release has been vetted. Others want to expose a small percentage of their customers to the new release to gather feedback about a feature before it’s released to the broader population. This is often referred to as canary analysis or canary testing. In this post, I introduce patterns to implement blue/green and canary deployments using Application Load Balancers and target groups.

If you’d like to try this approach to blue/green deployments, we have open sourced the code and AWS CloudFormation templates in the ecs-blue-green-deployment GitHub repo. The workflow builds an automated CI/CD pipeline that deploys your service onto an ECS cluster and offers a controlled process to swap target groups when you’re ready to promote the latest version of your code to production. You can quickly set up the environment in three steps and see the blue/green swap in action. We’d love for you to try it and send us your feedback!

Benefits of blue/green

Blue/green deployments are a type of immutable deployment that help you deploy software updates with less risk. The risk is reduced by creating separate environments for the current running or “blue” version of your application, and the new or “green” version of your application.

This type of deployment gives you an opportunity to test features in the green environment without impacting the current running version of your application. When you’re satisfied that the green version is working properly, you can gradually reroute the traffic from the old blue environment to the new green environment by modifying DNS. By following this method, you can update and roll back features with near zero downtime.

A typical blue/green deployment involves shifting traffic between 2 distinct environments.

This ability to quickly roll traffic back to the still-operating blue environment is one of the key benefits of blue/green deployments. With blue/green, you should be able to roll back to the blue environment at any time during the deployment process. This limits downtime to the time it takes to realize there’s an issue in the green environment and shift the traffic back to the blue environment. Furthermore, the impact of the outage is limited to the portion of traffic going to the green environment, not all traffic. If the blast radius of deployment errors is reduced, so is the overall deployment risk.

Containers make it simpler

Historically, blue/green deployments were not often used to deploy software on-premises because of the cost and complexity associated with provisioning and managing multiple environments. Instead, applications were upgraded in place.

Although this approach worked, it had several flaws, including the inability to roll back quickly from failures. Rollbacks typically involved re-deploying a previous version of the application, which could extend the length of an outage caused by a bad release. Fixing the issue took precedence over the need to debug, so there were fewer opportunities to learn from your mistakes.

Containers can ease the adoption of blue/green deployments because they’re easily packaged and behave consistently as they’re moved between environments. This consistency comes partly from their immutability. To change the configuration of a container, update its Dockerfile and rebuild and re-deploy the container rather than updating the software in place.

Containers also provide process and namespace isolation for your applications, which allows you to run multiple versions of them side by side on the same Docker host without conflicts. Given their small sizes relative to virtual machines, you can binpack more containers per host than VMs. This lets you make more efficient use of your computing resources, reducing the cost of blue/green deployments.

Fully Managed Updates with Amazon ECS

Amazon EC2 Container Service (ECS) performs rolling updates when you update an existing Amazon ECS service. A rolling update involves replacing the current running version of the container with the latest version. The number of containers Amazon ECS adds or removes from service during a rolling update is controlled by adjusting the minimum and maximum number of healthy tasks allowed during service deployments.
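As a rough illustration of how those two settings bound a rolling update, here is a minimal sketch. It assumes the limits are expressed as percentages of the desired task count (as with the ECS deployment configuration values minimumHealthyPercent and maximumPercent), and that the lower bound rounds up while the upper bound rounds down. The function name and the sample numbers are hypothetical.

```python
import math


def rolling_update_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) task counts during a rolling update.

    Assumption: the lower bound is rounded up and the upper bound is
    rounded down, so the scheduler never dips below or exceeds the limits.
    """
    # Tasks that must stay in service while old tasks are being replaced.
    min_running = int(math.ceil(desired_count * minimum_healthy_percent / 100.0))
    # Total tasks (old plus new) allowed to exist at the same time.
    max_running = int(math.floor(desired_count * maximum_percent / 100.0))
    return min_running, max_running


if __name__ == "__main__":
    low, high = rolling_update_bounds(4, 50, 200)
    print("keep at least %d tasks running, allow at most %d" % (low, high))
```

With a desired count of 4, a 50 percent minimum and a 200 percent maximum, the service could stop two old tasks before starting replacements, or start four new tasks before stopping any old ones, depending on available cluster capacity.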

When you update your service’s task definition with the latest version of your container image, Amazon ECS automatically starts replacing the old version of your container with the latest version. During a deployment, Amazon ECS drains connections from the current running version and registers your new containers with the Application Load Balancer as they come online.

Target groups

A target group is a logical construct that allows you to run multiple services behind the same Application Load Balancer. This is possible because each target group has its own listener.

When you create an Amazon ECS service that’s fronted by an Application Load Balancer, you have to designate a target group for your service. Ordinarily, you would create a target group for each of your Amazon ECS services. However, the approach we’re going to explore here involves creating two target groups: one for the blue version of your service, and one for the green version of your service. We’re also using a different listener port for each target group so that you can test the green version of your service using the same path as the blue service.

With this configuration, you can run both environments in parallel until you’re ready to cut over to the green version of your service. You can also do things such as restricting access to the green version to testers on your internal network, using security group rules and placement constraints. For example, you can target the green version of your service to only run on instances that are accessible from your corporate network.

Swapping Over

When you’re ready to replace the old blue service with the new green service, call the ModifyListener API operation to swap the listener’s rules for the target group rules. The change happens instantaneously. Afterward, the green service is running in the target group with the port 80 listener and the blue service is running in the target group with the port 8080 listener. The diagram below is an illustration of the approach described.
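The swap can be sketched as two ModifyListener requests, one per listener, each pointing the listener's default forward action at the other target group. The sketch below only builds the request parameters. The helper name and the placeholder ARNs are hypothetical, and it assumes each listener has a single default forward action and no additional rules.

```python
def build_swap_requests(listeners):
    """Build ModifyListener parameters that swap two listeners' target groups.

    `listeners` is a list of two dicts, each with a 'ListenerArn' and the
    'TargetGroupArn' that the listener currently forwards to.
    """
    if len(listeners) != 2:
        raise ValueError("expected exactly two listeners (blue and green)")
    first, second = listeners
    return [
        {
            "ListenerArn": first["ListenerArn"],
            "DefaultActions": [
                {"Type": "forward", "TargetGroupArn": second["TargetGroupArn"]}
            ],
        },
        {
            "ListenerArn": second["ListenerArn"],
            "DefaultActions": [
                {"Type": "forward", "TargetGroupArn": first["TargetGroupArn"]}
            ],
        },
    ]


if __name__ == "__main__":
    # Placeholder values; real ARNs come from your load balancer.
    requests = build_swap_requests([
        {"ListenerArn": "port-80-listener-arn", "TargetGroupArn": "blue-tg-arn"},
        {"ListenerArn": "port-8080-listener-arn", "TargetGroupArn": "green-tg-arn"},
    ])
    for params in requests:
        print(params)
    # With boto3 installed and credentials configured, each request could be sent as:
    #   boto3.client("elbv2").modify_listener(**params)
```

Because the swap is two separate requests, it is worth sending the production (port 80) change first so that live traffic moves to green before blue is moved to the test port.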

Scenario

Two services are defined, each with its own target group registered to the same Application Load Balancer but listening on different ports. Deployment is completed by swapping the listener rules between the two target groups.

The second service is deployed with a new target group listening on a different port but registered to the same Application Load Balancer.

By using two listeners, requests to blue services are directed to the target group with the port 80 listener, while requests to the green services are directed to the target group with the port 8080 listener.

After automated or manual testing, the deployment can be completed by swapping the listener rules on the Application Load Balancer and sending traffic to the green service.

Caveats

There are a few caveats to be mindful of when using this approach. This method:

  • Assumes that your application code is completely stateless. Store state outside of the container.
  • Doesn’t gracefully drain connections. The swapping of target groups is sudden and abrupt. Therefore, be cautious about using this approach if your service has long-running transactions.
  • Doesn’t allow you to perform canary deployments. While the method gives you the ability to quickly switch between different versions of your service, it does not allow you to divert a portion of the production traffic to a canary or control the rate at which your service is deployed across the cluster.

Canary testing

While the standard Amazon ECS rolling deployment automates much of the heavy lifting, it doesn’t allow you to interrupt the deployment if you discover an issue midstream. Rollbacks using the standard Amazon ECS deployment require updating the service’s task definition with the last known good version of the container. Then, you wait for Amazon ECS to schedule and deploy it across the cluster. If the latest version introduces a breaking change that went undiscovered during testing, this might be too slow.

With canary testing, if you discover the green environment is not operating as expected, there is no impact on the blue environment. You can route traffic back to it, minimizing impaired operation or downtime, and limiting the blast radius of impact.

This type of deployment is particularly useful for A/B testing where you want to expose a new feature to a subset of users to get their feedback before making it broadly available.

For canary style deployments, you can use a variation of the blue/green swap that involves deploying the blue and the green service to the same target group. Although this method is not as fast as the swap, it allows you to control the rate at which your containers are replaced by adjusting the task count for each service. Furthermore, it gives you the ability to roll back by adjusting the number of tasks for the blue and green services respectively. Unlike the swap approach described above, connections to your containers are drained gracefully. We plan to address canary style deployments for Amazon ECS in a future post.
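To make the task-count idea concrete, here is a small sketch of how the share of traffic reaching green could be estimated at each step. It assumes the load balancer spreads requests evenly across all healthy tasks in the shared target group, which is a simplification. The function names and the step sizes are hypothetical.

```python
def green_traffic_share(blue_tasks, green_tasks):
    """Estimate the fraction of requests served by green tasks.

    Assumption: requests are spread evenly across every task in the
    shared target group.
    """
    total = blue_tasks + green_tasks
    if total == 0:
        raise ValueError("at least one task must be running")
    return float(green_tasks) / total


def canary_plan(total_tasks, green_counts):
    """Return (blue, green, share) per step, keeping the total task count fixed."""
    plan = []
    for green in green_counts:
        if not 0 <= green <= total_tasks:
            raise ValueError("green task count must be between 0 and total_tasks")
        blue = total_tasks - green
        plan.append((blue, green, green_traffic_share(blue, green)))
    return plan


if __name__ == "__main__":
    for blue, green, share in canary_plan(10, [1, 5, 10]):
        print("blue=%d green=%d green share=%.0f%%" % (blue, green, share * 100))
```

Rolling back is the same plan run in reverse: lower the green task count and raise the blue task count until green reaches zero.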

Conclusion

With AWS, you can operationalize your blue/green deployments using Amazon ECS, an Application Load Balancer, and target groups. I encourage you to adapt the code published to the ecs-blue-green-deployment GitHub repo for your use cases and look forward to reading your feedback.

If you’re interested in learning more, I encourage you to read the Blue/Green Deployments on AWS and Practicing Continuous Integration and Continuous Delivery on AWS whitepapers.

If you have questions or suggestions, please comment below.

AWS Greengrass – Run AWS Lambda Functions on Connected Devices

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-greengrass-run-aws-lambda-functions-on-connected-devices/

I first told you about AWS Greengrass in the post that I published during re:Invent (AWS Greengrass – Ubiquitous Real-World Computing). We launched a limited preview of Greengrass at that time and invited you to sign up if you were interested.

As I noted at the time, many AWS customers want to collect and process data out in the field, where connectivity is often slow and sometimes either intermittent or unreliable. Greengrass allows them to extend the AWS programming model to small, simple, field-based devices. It builds on AWS IoT and AWS Lambda, and supports access to the ever-increasing variety of services that are available in the AWS Cloud.

Greengrass gives you access to compute, messaging, data caching, and syncing services that run in the field, and that do not depend on constant, high-bandwidth connectivity to an AWS Region. You can write Lambda functions in Python 2.7 and deploy them to your Greengrass devices from the cloud while using device shadows to maintain state. Your devices and peripherals can talk to each other using local messaging that does not pass through the cloud.

Now Generally Available
Today we are making Greengrass generally available in the US East (Northern Virginia) and US West (Oregon) Regions. During the preview, AWS customers were able to get hands-on experience with Greengrass and to start building applications and businesses around it. I’ll share a few of these early successes later in this post.

The Greengrass Core code runs on each device. It allows you to deploy and run Lambda applications on the device, supports local MQTT messaging across a secure network, and also ensures that conversations between devices and the cloud are made across secure connections. The Greengrass Core also supports secure, over-the-air software updates, including Lambda functions. It includes a message broker, a Lambda runtime, a Thing Shadows implementation, and a deployment agent. Greengrass Core and (optionally) other devices make up a Greengrass Group. The group includes configuration data, the list of devices and the identity of the Greengrass Core, a list of Lambda functions, and a set of subscriptions that define where the messages should go. All of this information is copied to the Greengrass core devices during the deployment process.

Your Lambda functions can use APIs in three distinct SDKs:

AWS SDK for Python – This SDK allows your code to interact with Amazon Simple Storage Service (S3), Amazon DynamoDB, Amazon Simple Queue Service (SQS), and other AWS services.

AWS IoT Device SDK – This SDK (available for Node.js, Python, Java, and C++) helps you to connect your hardware devices to AWS IoT. The C++ SDK has a few extra features including access to the Greengrass Discovery Service and support for root CA downloads.

AWS Greengrass Core SDK – This SDK provides APIs that let you invoke other Lambda functions locally, publish messages, and work with thing shadows.

You can run the Greengrass Core on x86 and ARM devices that have version 4.4.11 (or newer) of the Linux kernel, with the OverlayFS and user namespace features enabled. While most deployments of Greengrass will be targeted at specialized, industrial-grade hardware, you can also run the Greengrass Core on a Raspberry Pi or an EC2 instance for development and test purposes.

For this post, I used a Raspberry Pi attached to a BrickPi, connected to my home network via WiFi:

The Raspberry Pi, the BrickPi, the case, and all of the other parts are available in the BrickPi 3 Starter Kit. You will need some Linux command-line expertise and a decent amount of manual dexterity to put all of this together, but if I did it then you surely can.

Greengrass in Action
I can access Greengrass from the Console, API, or CLI. I’ll use the Console. The intro page of the Greengrass Console lets me define groups, add Greengrass Cores, and add devices to my groups:

I click on Get Started and then on Use easy creation:

Then I name my group:

And name my first Greengrass Core:

I’m ready to go, so I click on Create Group and Core:

This runs for a few seconds and then offers up my security resources (two keys and a certificate) for downloading, along with the Greengrass Core:

I download the security resources and put them in a safe place, and select and download the desired version of the Greengrass Core software (ARMv7l for my Raspberry Pi), and click on Finish.

Now I power up my Pi, and copy the security resources and the software to it (I put them in an S3 bucket and pulled them down with wget). Here’s my shell history at that point:

Following the directions in the user guide, I create a new user and group, run the rpi-update script, and install several packages including sqlite3 and openssl. After a couple of reboots, I am ready to proceed!

Next, still following the directions, I untar the Greengrass Core software and move the security resources to their final destination (/greengrass/configuration/certs), giving them generic names along the way. Here’s what the directory looks like:

The next step is to associate the core with an AWS IoT thing. I return to the Console, click through the group and the Greengrass Core, and find the Thing ARN:

I insert the names of the certificates and the Thing ARN into the config.json file, and also fill in the missing sections of the iotHost and ggHost:
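The edited file is not reproduced in the text, so the following is only a sketch of the kind of content involved. The key names (coreThing, caPath, certPath, keyPath, thingArn, iotHost, ggHost) are assumptions based on the fields mentioned above, and every value is a placeholder; check the config.json that ships with your version of the Core software for the actual layout.

```python
import json


def fill_core_config(ca_file, cert_file, key_file, thing_arn, iot_host, gg_host):
    """Return a config dictionary with certificate names, Thing ARN, and hosts filled in.

    The key names are assumed, not taken from the post.
    """
    return {
        "coreThing": {
            "caPath": ca_file,
            "certPath": cert_file,
            "keyPath": key_file,
            "thingArn": thing_arn,
            "iotHost": iot_host,
            "ggHost": gg_host,
        }
    }


if __name__ == "__main__":
    config = fill_core_config(
        ca_file="root-ca.pem",          # hypothetical generic file names
        cert_file="cloud.pem.crt",
        key_file="cloud.pem.key",
        thing_arn="<your Thing ARN>",   # copied from the Console
        iot_host="<your AWS IoT endpoint>",
        gg_host="<your Greengrass endpoint>",
    )
    print(json.dumps(config, indent=2, sort_keys=True))
```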

I start the Greengrass daemon (this was my second attempt; I had a typo in one of my path names the first time around):

After all of this pleasant time at the command line (taking me back to my Unix v7 and BSD 4.2 days), it is time to go visual once again! I visit my AWS IoT dashboard and see that my Greengrass Core is making connections to IoT:

I go to the Lambda Console and create a Lambda function using the Python 2.7 runtime (the IAM role does not matter here):
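The function body is not reproduced in the text. A minimal sketch of a function that publishes to the hello/world topic might look like the following. The message contents and helper names are hypothetical, and the publish callable is passed in so that the sketch runs anywhere; on a Greengrass Core it is assumed to be the publish method of the Greengrass Core SDK's iot-data client.

```python
import json
import time


def build_message(now=None):
    """Build the JSON payload for one greeting."""
    timestamp = time.time() if now is None else now
    return json.dumps({"message": "Hello from Greengrass Core", "sent_at": timestamp})


def publish_hello(publish, topic="hello/world"):
    """Publish one greeting through the supplied publish callable."""
    payload = build_message()
    publish(topic=topic, payload=payload)
    return payload


if __name__ == "__main__":
    # Stand-in publisher that prints instead of sending. On a Core device,
    # the callable would be (assumption): greengrasssdk.client("iot-data").publish
    def print_publish(topic, payload):
        print("%s: %s" % (topic, payload))

    publish_hello(print_publish)
```

Keeping the publisher injectable also makes it easy to exercise the message-building logic on a laptop before deploying to the device.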

I publish the function in the usual way, hop over to the Greengrass Console, click on my group, and choose to add a Lambda function:

Then I choose the version to deploy:

I also configure the function to be long-lived instead of on-demand:

My code will publish messages to AWS IoT, so I create a subscription by specifying the source and destination:

I set up a topic filter (hello/world) on the subscription as well:

I confirm my settings and save my subscription and I am just about ready to deploy my code. I revisit my group, click on Deployments, and choose Deploy from the Actions menu:

I choose Automatic detection to move forward:

Since this is my first deployment, I need to create a service-level role that gives Greengrass permission to access other AWS services. I simply click on Grant permission:

I can see the status of each deployment:

The code is now running on my Pi! It publishes messages to topic hello/world; I can see them by going to the IoT Console, clicking on Test, and subscribing to the topic:

And here are the messages:

With all of the setup work taken care of, I can do iterative development by uploading, publishing, and deploying new versions of my code. I plan to use the BrickPi to control some LEGO Technic motors and to publish data collected from some sensors. Stay tuned for that post!

Greengrass Pricing
You can run the Greengrass Core on three devices free for one year as part of the AWS Free Tier. At the next level (3 to 10,000 devices) two options are available:

  • Pay as You Go – $0.16 per month per device.
  • Annual Commitment – $1.49 per year per device, a savings of roughly 22% compared with twelve months of Pay as You Go ($1.92).

If you want to run the Greengrass Core on more than 10,000 devices or make a longer commitment, please get in touch with us; details on all pricing models are on the Greengrass Pricing page.

Jeff;

How to Update AWS CloudHSM Devices and Client Instances to the Software and Firmware Versions Supported by AWS

Post Syndicated from Tracy Pierce original https://aws.amazon.com/blogs/security/how-to-update-aws-cloudhsm-devices-and-client-instances-to-the-software-and-firmware-versions-supported-by-aws/

As I explained in my previous Security Blog post, a hardware security module (HSM) is a hardware device designed with the security of your data and cryptographic key material in mind. It is tamper-resistant hardware designed to resist attempts by unauthorized users to pry open the device, plug in any extra devices to access data or keys such as subtokens, or damage the outside housing. The HSM device AWS CloudHSM offers is the Luna SA 7000 (also called SafeNet Network HSM 7000), which is created by Gemalto. Depending on the firmware version you install, many of the security properties of these HSMs will have been validated under Federal Information Processing Standard (FIPS) 140-2, a standard issued by the National Institute of Standards and Technology (NIST) for cryptographic modules. These standards are in place to protect the integrity and confidentiality of the data stored on cryptographic modules.

To help ensure its continued use, functionality, and support from AWS, we suggest that you update your AWS CloudHSM device software and firmware as well as the client instance software to current versions offered by AWS. As of the publication of this blog post, the current non-FIPS-validated versions are 5.4.9/client, 5.3.13/software, and 6.20.2/firmware, and the current FIPS-validated versions are 5.4.9/client, 5.3.13/software, and 6.10.9/firmware. (The firmware version determines FIPS validation.) It is important to know your current versions before updating so that you can follow the correct update path.
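One way to keep track of what still needs updating is to compare the versions you are running against the supported versions listed above. The sketch below does only that comparison. The function names and the sample "current" versions are hypothetical, and it does not model any intermediate versions that an actual update path may require.

```python
# Supported versions as listed in this post (the firmware determines FIPS validation).
SUPPORTED = {
    "fips": {"client": "5.4.9", "software": "5.3.13", "firmware": "6.10.9"},
    "non-fips": {"client": "5.4.9", "software": "5.3.13", "firmware": "6.20.2"},
}

# Order used in this post: client instance, then device software, then firmware.
UPDATE_ORDER = ["client", "software", "firmware"]


def parse_version(text):
    """Turn a dotted version string such as '5.3.13' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def pending_updates(current, track):
    """Return the components, in update order, that are older than the supported version."""
    target = SUPPORTED[track]
    return [
        component
        for component in UPDATE_ORDER
        if parse_version(current[component]) < parse_version(target[component])
    ]


if __name__ == "__main__":
    # Hypothetical versions, for illustration only.
    current = {"client": "5.4.9", "software": "5.1.3", "firmware": "6.10.9"}
    print(pending_updates(current, "fips"))
```

Comparing parsed tuples rather than raw strings matters here because, as strings, "5.3.13" would sort before "5.3.9".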

In this post, I demonstrate how to update your current CloudHSM devices and client instances so that you are using the most current versions of software and firmware. If you contact AWS Support for CloudHSM hardware and application issues, you will be required to update to these supported versions before proceeding. Also, any newly provisioned CloudHSM devices will use these supported software and firmware versions only, and AWS does not offer “downgrade options.”

Note: Before you perform any updates, check with your local CloudHSM administrator and application developer to verify that these updates will not conflict with your current applications or architecture.

Overview of the update process

To update your client and CloudHSM devices, you must use both update paths offered by AWS. The first path involves updating the software on your client instance, also known as a control instance. Following the second path updates the software first and then the firmware on your CloudHSM devices. The CloudHSM software must be updated before the firmware because of the firmware’s dependencies on the software in order to work appropriately.

As I demonstrate in this post, the correct update order is:

  1. Updating your client instance
  2. Updating your CloudHSM software
  3. Updating your CloudHSM firmware

To update your client instance, you must have the private SSH key you created when you first set up your client environment. This key is used to connect via SSH protocol on port 22 of your client instance. If you have more than one client instance, you must repeat this connection and update process on each of them. The following diagram shows the flow of an SSH connection from your local network to your client instances in the AWS Cloud.

Diagram that shows the flow of an SSH connection from your local network to your client instances in the AWS Cloud

After you update your client instance to the most recent client software (5.4.9), you then must update the CloudHSM device software and firmware. First, you must initiate an SSH connection from any one client instance to each CloudHSM device, as illustrated in the following diagram. A successful SSH connection will have you land at the Luna shell, denoted by lunash:>. Second, you must be able to initiate a Secure Copy (SCP) of files to each device from the client instance. Because the software and firmware updates require an elevated level of privilege, you must have the Security Officer (SO) password that you created when you initialized your CloudHSM devices.

Diagram illustrating the initiation of an SSH connection from any one client instance to each CloudHSM device

After you have completed all updates, you can receive enhanced troubleshooting assistance from AWS, if you need it. When new versions of software and firmware are released, AWS performs extensive testing to ensure your smooth transition from version to version.

Detailed guidance for updating your client instance, CloudHSM software, and CloudHSM firmware

1.  Updating your client instance

Let’s start by updating your client instances. My client instance and CloudHSM devices are in the eu-west-1 region, but these steps work the same in any AWS region. Because Gemalto offers client instances in both Linux and Windows, I will cover steps to update both. I will start with Linux. Please note that all commands should be run as the “root” user.

Updating the Linux client

  1. SSH from your local network into the client instance. You can do this from Linux or Windows. Typically, you would do this from the directory where you have stored your private SSH key by using a command like the following in a terminal or PuTTY. This initiates the SSH connection by pointing to the path of your SSH key and denoting the user name and IP address of your client instance. Replace <user_name> and <client_ip_address> with the values for your client instance.
    ssh -i /Users/Bob/Keys/CloudHSM_SSH_Key.pem <user_name>@<client_ip_address>

  2. After the SSH connection is established, you must stop all applications and services on the instance that are using the CloudHSM device. This is required because you are removing old software and installing new software in its place. After you have stopped all applications and services, you can move on to remove the existing version of the client software.
    /usr/safenet/lunaclient/bin/uninstall.sh

    This command will remove the old client software, but will not remove your configuration file or certificates. These will be saved in the Chrystoki.conf file of your /etc directory and your /usr/safenet/lunaclient/cert directory. Do not delete these files because you will lose the configuration of your CloudHSM devices and client connections.

  3. Download the new client software package: cloudhsm-safenet-client. Extract the archive, and then run the installation script.
    SafeNet-Luna-client-5-4-9/linux/64/install.sh

    Make sure you choose the Luna SA option when presented with it. Because the directory where your certificates are installed is the same, you do not need to copy these certificates to another directory. You do, however, need to ensure that the Chrystoki.conf file, located at /etc/Chrystoki.conf, has the same path and name for the certificates as when you created them. (The path and names should not have changed, but you should still verify they are the same as before the update.)

  4. Check that the PATH environment variable includes the directory /usr/safenet/lunaclient/bin to avoid issues when you restart applications and services. The update process for your Linux client instance is now complete.
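The PATH check in the last step can also be scripted. The following is a minimal sketch in Python, not part of the Gemalto or AWS tooling; the helper name `path_contains` is hypothetical, and only the directory name comes from the step above.

```python
import os

# Directory named in the step above. The helper below is a hypothetical
# convenience function, not part of the Luna client software.
LUNA_BIN_DIR = "/usr/safenet/lunaclient/bin"


def path_contains(path_value, directory, sep=os.pathsep):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    wanted = directory.rstrip("/")
    entries = [entry.rstrip("/") for entry in path_value.split(sep) if entry]
    return wanted in entries


if __name__ == "__main__":
    current_path = os.environ.get("PATH", "")
    if path_contains(current_path, LUNA_BIN_DIR):
        print("PATH includes", LUNA_BIN_DIR)
    else:
        print("PATH is missing", LUNA_BIN_DIR)
```

Running the script on the client instance before you restart applications shows whether the directory still needs to be added to PATH.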

Updating the Windows client

Use the following steps to update your Windows client instances:

  1. Use Remote Desktop Protocol (RDP) from your local network into the client instance. You can accomplish this with the RDP application of your choice.
  2. After you establish the RDP connection, stop all applications and services on the instance that are using the CloudHSM device. This is required because you will remove old software and install new software in its place, or overwrite it. If your client software version is older than 5.4.1, you need to completely remove it and all patches by using Programs and Features in the Windows Control Panel. If your client software version is 5.4.1 or newer, proceed without removing the software. Your configuration file will remain intact in the crystoki.ini file of your C:\Program Files\SafeNet\Lunaclient\ directory. All certificates are preserved in the C:\Program Files\SafeNet\Lunaclient\cert\ directory. Again, do not delete these files, or you will lose all configuration and client connection data.
  3. After you have completed these steps, download the new client software: cloudhsm-safenet-client. Extract the archive from the downloaded file, and launch the SafeNet-Luna-client-5-4-9\win\64\Lunaclient.msi installer. Choose the Luna SA option when it is presented to you. Because the directory where your certificates are installed is the same, you do not need to copy these certificates to another directory. You do, however, need to ensure that the crystoki.ini file, which is located at C:\Program Files\SafeNet\Lunaclient\crystoki.ini, has the same path and name for the certificates as when you created them. (The path and names should not have changed, but you should still verify they are the same as before the update.)
  4. Make one last check to ensure the PATH environment variable points to the directory C:\Program Files\SafeNet\Lunaclient\ to help ensure no issues when you restart applications and services. The update process for your Windows client instance is now complete.

2.  Updating your CloudHSM software

Now that your clients are up to date with the most current software version, it’s time to move on to your CloudHSM devices. A few important notes:

  • Back up your data to a Luna SA Backup device. AWS does not sell or support the Luna SA Backup devices, but you can purchase them from Gemalto. We do, however, offer the steps to back up your data to a Luna SA Backup device. Do not update your CloudHSM devices without backing up your data first.
  • If the name of a client used for Network Trust Link Service (NTLS) connections has a capital “T” as the eighth character, that client will not work after this update. This is because of a Gemalto naming convention. Before upgrading, ensure you modify your client names accordingly. The NTLS connection uses two-way digital certificate authentication and SSL data encryption to protect sensitive data transmitted between your CloudHSM device and the client instances.
  • The syslog configuration for the CloudHSM devices will be lost. After the update is complete, notify AWS Support and we will update the configuration for you.
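The client-name rule in the second note is easy to check before you start. The sketch below is a hypothetical Python helper (the function name and the sample client names are made up for illustration); it only tests whether the eighth character of a name is a capital “T”.

```python
def has_problem_client_name(name):
    """Return True if the eighth character (index 7) of `name` is a capital 'T'."""
    return len(name) >= 8 and name[7] == "T"


if __name__ == "__main__":
    # Hypothetical NTLS client names, used only to exercise the check.
    for client_name in ["clientATest", "hsmclient01", "short"]:
        status = "rename before updating" if has_problem_client_name(client_name) else "ok"
        print(client_name, "->", status)
```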

Now on to updating the software versions. There are actually three different update paths to follow, and I will cover each. Depending on the current software versions on your CloudHSM devices, you might need to follow all three or just one.
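To make the choice of update path concrete, here is a small Python sketch that maps a current software version to the remaining updates described in this section. It is an illustration under stated assumptions, not an AWS tool: the function names are hypothetical, and any version that this post does not cover (for example, one between 5.1.5 and 5.3.10) is rejected rather than guessed at.

```python
# Ordered software versions on the update path described in this post.
UPDATE_CHAIN = ["5.1.5", "5.3.10", "5.3.13"]


def parse_version(text):
    """Turn a dotted version string such as '5.3.10' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def software_update_path(current):
    """Return the list of software versions still to be installed, in order."""
    cur = parse_version(current)
    chain = [parse_version(version) for version in UPDATE_CHAIN]
    if cur < chain[0]:
        # Anything older than 5.1.5 must go through all three updates.
        return list(UPDATE_CHAIN)
    if cur in chain:
        # Already on the path: only the later versions remain.
        return UPDATE_CHAIN[chain.index(cur) + 1:]
    raise ValueError("No update path is described in this post for version " + current)


if __name__ == "__main__":
    for version in ["5.1.2", "5.1.5", "5.3.10", "5.3.13"]:
        print(version, "->", software_update_path(version))
```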

Updating the software from version 5.1.x to 5.1.5

If you are running any version of the software older than 5.1.5, you must first update to version 5.1.5 before proceeding. To update to version 5.1.5:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download the Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command from your client instance to copy the lunasa_update-5.1.5-2.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.1.5-2.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM elastic network interface (ENI). The ENI is the network endpoint that permits access to your CloudHSM device. The IP address was supplied to you when the CloudHSM device was provisioned.

  5. Use the following command to connect to your CloudHSM device and log in with your Security Officer (SO) password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.1.5-2.spkg -authcode <auth_code>
    
    lunash:> package update lunasa_update-5.1.5-2.spkg -authcode <auth_code>

    The value you will use for <auth_code> is contained in the lunasa_update-5.1.5-2.auth file found in the 630-010165-018_reva.tar archive you downloaded in Step 2.

  7. Reboot the CloudHSM device by running the following command.
    lunash:> sysconf appliance reboot

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.1.5. You can now move on to update to version 5.3.10.
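Because the same copy, connect, verify, install, and reboot sequence repeats for every package, it can help to generate the command text once per package and per device. The following Python sketch only builds strings for you to review and run yourself; it does not connect to anything. The function name is hypothetical, and the `manager` account name is an assumption taken from the SSH command shown in the firmware section of this post.

```python
def build_update_commands(package, key_file, hsm_ip, user="manager"):
    """Return (client_commands, lunash_commands) for one update package.

    The `user` default is an assumption; <auth_code> is left as a placeholder
    because it comes from the .auth file shipped with each package.
    """
    target = f"{user}@{hsm_ip}"
    client_commands = [
        # The trailing colon tells scp to copy to the remote home directory.
        f"scp -i {key_file} {package} {target}:",
        f"ssh -i {key_file} {target}",
    ]
    lunash_commands = [
        "hsm login",
        f"package verify {package} -authcode <auth_code>",
        f"package update {package} -authcode <auth_code>",
        "sysconf appliance reboot",
    ]
    return client_commands, lunash_commands


if __name__ == "__main__":
    # Hypothetical key file name and IP address, for illustration only.
    client, lunash = build_update_commands(
        "lunasa_update-5.1.5-2.spkg", "my_key.pem", "10.0.0.5"
    )
    for line in client:
        print("$", line)
    for line in lunash:
        print("lunash:>", line)
```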

Updating the software to version 5.3.10

You can update to version 5.3.10 only if you are currently running version 5.1.5. To update to version 5.3.10:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download version 5.3.10 of the Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command to copy the lunasa_update-5.3.10-7.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.3.10-7.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  5. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.3.10-7.spkg -authcode <auth_code>
    
    lunash:> package update lunasa_update-5.3.10-7.spkg -authcode <auth_code>

    The value you will use for <auth_code> is contained in the lunasa_update-5.3.10-7.auth file found in the SafeNet-Luna-SA-5-3-10.zip archive you downloaded in Step 2.

  7. Reboot the CloudHSM device by running the following command.
    lunash:> sysconf appliance reboot

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.3.10. You can now move on to update to version 5.3.13.

Note: Do not configure your applog settings at this point; you must first update the software to version 5.3.13 in the following step.

Updating the software to version 5.3.13

You can update to version 5.3.13 only if you are currently running version 5.3.10. If you are not already running version 5.3.10, follow the two update paths mentioned previously in this section.

To update to version 5.3.13:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download the Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command to copy the lunasa_update-5.3.13-1.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.3.13-1.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  5. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.3.13-1.spkg -authcode <auth_code>
    
    lunash:> package update lunasa_update-5.3.13-1.spkg -authcode <auth_code>

    The value you will use for <auth_code> is contained in the lunasa_update-5.3.13-1.auth file found in the SafeNet-Luna-SA-5-3-13.zip archive that you downloaded in Step 2.

  7. When updating to this software version, the option to update the firmware also is offered. If you do not require a version of the firmware validated under FIPS 140-2, accept the firmware update to version 6.20.2. If you do require a version of the firmware validated under FIPS 140-2, do not accept the firmware update and instead update by using the steps in the next section, “Updating your CloudHSM FIPS 140-2 validated firmware.”
  8. After updating the CloudHSM device, reboot it by running the following command.
    lunash:> sysconf appliance reboot

  9. Disable NTLS IP checking on the CloudHSM device so that it can operate within its VPC. To do this, run the following command.
    lunash:> ntls ipcheck disable

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.3.13. If you don’t need the FIPS 140-2 validated firmware, you will have also updated the firmware to version 6.20.2. If you do need the FIPS 140-2 validated firmware, proceed to the next section.
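The firmware decision above comes down to one question: do you need FIPS 140-2 validated firmware? As a minimal sketch, the Python function below (a hypothetical helper, using only the version numbers given in this post) records that decision so it can be written into a runbook before you start the 5.3.13 update.

```python
def firmware_plan(require_fips):
    """Return the firmware choice described in this post for the 5.3.13 update."""
    if require_fips:
        # Decline the bundled firmware and follow the FIPS firmware section instead.
        return {"accept_bundled_firmware": False, "target_firmware": "6.10.9"}
    # Accept the firmware update offered during the 5.3.13 software update.
    return {"accept_bundled_firmware": True, "target_firmware": "6.20.2"}


if __name__ == "__main__":
    print("FIPS required:", firmware_plan(True))
    print("FIPS not required:", firmware_plan(False))
```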

3.  Updating your CloudHSM FIPS 140-2 validated firmware

To update the FIPS 140-2 validated version of the firmware to 6.10.9, use the following steps:

  1. Download version 6.10.9 of the firmware package.
  2. Extract all files from the archive.
  3. Run the following command to copy the 630-010430-010_SPKG_LunaFW_6.10.9.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> 630-010430-010_SPKG_LunaFW_6.10.9.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair, and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  4. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  5. Run the following commands to verify and then install the updated Luna SA firmware package.
    lunash:> package verify 630-010430-010_SPKG_LunaFW_6.10.9.spkg -authcode <auth_code>
    
    lunash:> package update 630-010430-010_SPKG_LunaFW_6.10.9.spkg -authcode <auth_code>

    The value you will use for <auth_code> is contained in the 630-010430-010_SPKG_LunaFW_6.10.9.auth file found in the 630-010430-010_SPKG_LunaFW_6.10.9.zip archive that you downloaded in Step 1.

  6. Run the following command to update the firmware of the CloudHSM devices.
    lunash:> hsm update firmware

  7. After you have updated the firmware, reboot the CloudHSM devices to complete the installation.
    lunash:> sysconf appliance reboot

Summary

In this blog post, I walked you through how to update your existing CloudHSM devices and clients so that they are using supported client, software, and firmware versions. Per AWS Support and CloudHSM Terms and Conditions, your devices and clients must use the most current supported software and firmware for continued troubleshooting assistance. Software and firmware versions regularly change based on customer use cases and requirements. Because AWS tests and validates all updates from Gemalto, you must install all updates for firmware and software by using our package links described in this post and elsewhere in our documentation.

If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing this solution, please start a new thread on the CloudHSM forum.

– Tracy

Of course smart homes are targets for hackers

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/45483.html

The Wirecutter, an in-depth comparative review site for various electrical and electronic devices, just published an opinion piece on whether users should be worried about security issues in IoT devices. The summary: avoid devices that don’t require passwords (or don’t force you to change a default) and devices that want you to disable security, follow general network security best practices but otherwise don’t worry – criminals aren’t likely to target you.

This is terrible, irresponsible advice. It’s true that most users aren’t likely to be individually targeted by random criminals, but that’s a poor threat model. As I’ve mentioned before, you need to worry about people with an interest in you. Making purchasing decisions based on the assumption that you’ll never end up dating someone with enough knowledge to compromise a cheap IoT device (or even meeting an especially creepy one in a bar) is not safe, and giving advice that doesn’t take that into account is a huge disservice to many potentially vulnerable users.

Of course, there’s also the larger question raised by the last week’s problems. Insecure IoT devices still pose a threat to the wider internet, even if the owner’s data isn’t at risk. I may not be optimistic about the ease of fixing this problem, but that doesn’t mean we should just give up. It is important that we improve the security of devices, and many vendors are just bad at that.

So, here’s a few things that should be a minimum when considering an IoT device:

  • Does the vendor publish a security contact? (If not, they don’t care about security)
  • Does the vendor provide frequent software updates, even for devices that are several years old? (If not, they don’t care about security)
  • Has the vendor ever denied a security issue that turned out to be real? (If so, they care more about PR than security)
  • Is the vendor able to provide the source code to any open source components they use? (If not, they don’t know which software is in their own product and so don’t care about security, and also they’re probably infringing my copyright)
  • Do they mark updates as fixing security bugs? (If not, they care more about hiding security issues than fixing them)
  • Has the vendor ever threatened to prosecute a security researcher? (If so, again, they care more about PR than security)
  • Does the vendor provide a public minimum support period for the device? (If not, they don’t care about security or their users)

    I’ve worked with big name vendors who did a brilliant job here. I’ve also worked with big name vendors who responded with hostility when I pointed out that they were selling a device with arbitrary remote code execution. Going with brand names is probably a good proxy for many of these requirements, but it’s insufficient.

    So here’s my recommendations to The Wirecutter – talk to a wide range of security experts about the issues that users should be concerned about, and figure out how to test these things yourself. Don’t just ask vendors whether they care about security, ask them what their processes and procedures look like. Look at their history. And don’t assume that just because nobody’s interested in you, everybody else’s level of risk is equal.


    Organizing Software Deployments to Match Failure Conditions

    Post Syndicated from Nick Trebon original https://aws.amazon.com/blogs/architecture/organizing-software-deployments-to-match-failure-conditions/

    Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to try and reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment strategies that we’ve made on the Route 53 team and how these choices affect the availability of our service.

    To begin, I’ll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide some context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service’s availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension with trying to reduce the number of failed deployments in production is to add more rigidity and processes that slow down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production. Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a “wave” and consists of some portion of our fleet. A pipeline is a good abstraction as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified — this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of production deployment at least 24 hours apart, in order to allow the changes to “bake.” Letting our software bake refers to rolling out software changes slowly to allow us to validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. 
Another benefit of our deployment infrastructure is that it provides us a mechanism to quickly “roll back” a deployment to a previous software version if any problems are detected which, in many cases, enables us to quickly mitigate a failed deployment.
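The wave-by-wave flow described above can be sketched in a few lines of Python. This is an illustration only, not Route 53's deployment tooling: the function name and the `deploy`, `verify`, and `roll_back` callables are hypothetical stand-ins, and the bake time between waves is left out.

```python
def run_pipeline(waves, deploy, verify, roll_back):
    """Deploy wave by wave; if a wave fails verification, roll back and stop.

    `deploy`, `verify`, and `roll_back` are caller-supplied callables that
    take a wave name. In practice a bake period would separate the waves.
    """
    deployed = []
    for wave in waves:
        deploy(wave)
        deployed.append(wave)
        if not verify(wave):
            # Undo the change on every wave that received it, newest first.
            rolled_back = list(reversed(deployed))
            for name in rolled_back:
                roll_back(name)
            return {"succeeded": False, "failed_wave": wave, "rolled_back": rolled_back}
    return {"succeeded": True, "failed_wave": None, "rolled_back": []}


if __name__ == "__main__":
    # Simulated run in which the second wave fails its verification.
    result = run_pipeline(
        ["wave-1", "wave-2", "wave-3"],
        deploy=lambda wave: print("deploying", wave),
        verify=lambda wave: wave != "wave-2",
        roll_back=lambda wave: print("rolling back", wave),
    )
    print(result)
```

Because the pipeline stops at the first failed wave, later waves never receive the change, which is what keeps the impact limited to the portion of the fleet already deployed.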

    Based on our experiences, we have further organized our deployments to try and match our failure conditions to further reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of our deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and DNS change propagation system. Essentially, this is the part of our system that accepts a customer request to create or delete a DNS record and then transmits that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own set of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct set of failure conditions, the control plane and data plane have their own deployment strategies, which are each discussed in more detail below.

    Control Plane Deployments

    The bulk of the control plane actually consists of two APIs. The first is our external API that is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just the ones needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer’s ability to view or modify their DNS records. The availability of this API is critical as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

    Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for our hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in it being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected. A similar approach could be applied to services that utilize Route 53 plus ELB in multiple regions and availability zones for their services. ELBs automatically health check their back-end instances and remove unhealthy instances from service. By creating Route 53 alias records marked to evaluate target health (see ELB documentation for how to set this up), if all instances behind an ELB are unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration will enable automatic failover at the DNS-level for an unhealthy region or availability zone. To enable manual failover, simply convert the alias resource record set for your ELB to either a weighted alias or associate it with a health check whose health you control. To initiate a failover, simply set the weight to 0 or fail the health check. 
A weighted alias also allows you the ability to slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.
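As a rough illustration of the manual failover described above, the Python sketch below builds the change batch for a weighted alias record that points at an ELB. It only constructs a dictionary in the shape accepted by the Route 53 ChangeResourceRecordSets API; it makes no API call. The function name, record name, set identifier, ELB DNS name, and hosted zone ID are all hypothetical placeholders.

```python
def weighted_alias_change(record_name, set_identifier, weight, elb_dns_name, elb_hosted_zone_id):
    """Build a change batch that upserts one weighted alias A record for an ELB."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    # SetIdentifier distinguishes records that share a name and type.
                    "SetIdentifier": set_identifier,
                    # A weight of 0 stops Route 53 from routing traffic to this record
                    # as long as another record in the set has a non-zero weight.
                    "Weight": weight,
                    "AliasTarget": {
                        "HostedZoneId": elb_hosted_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder values for illustration; substitute your own record and ELB details.
    batch = weighted_alias_change(
        record_name="www.example.com.",
        set_identifier="zone-a",
        weight=0,
        elb_dns_name="my-elb.example.elb.amazonaws.com.",
        elb_hosted_zone_id="ZEXAMPLEELBZONE",
    )
    print(batch)
```

With an AWS SDK such as boto3, a dictionary like this would typically be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with the ID of your own hosted zone. Raising the weight again in small steps gives the gradual traffic shift mentioned above.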

    For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (batch fleet), (2) all other operations used to operate and monitor the service (operations fleet) and (3) all customer operations (customer fleet). Deployments to the production internal API are then organized by how critical their traffic is to the service as a whole. For instance, the batch fleet is deployed to first because their operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below that of customer traffic as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split-world where multiple versions of the software are running simultaneously. We have found this to be useful in catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed with a single host at a time. For this API, we also include a deep application health check as part of the deployment. Similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.
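The ordering and halting rules for the internal API can be sketched as follows. This Python example is a simplified model under stated assumptions, not our deployment system: the host names are made up, `deploy_host` and `health_check` are hypothetical callables, and each fleet is treated as one wave.

```python
# Least critical fleet first, customer traffic last, as described above.
FLEET_ORDER = ["batch", "operations", "customer"]


def deploy_fleets(fleets, deploy_host, health_check):
    """Deploy one host at a time in fleet order; halt on the first failed check.

    `fleets` maps a fleet name to its list of hosts. Returns the host where
    the deployment halted (or None) and the hosts updated successfully.
    """
    updated = []
    for fleet in FLEET_ORDER:
        for host in fleets.get(fleet, []):
            deploy_host(host)
            if not health_check(host):
                # Stop immediately so later hosts and fleets keep the old version.
                return {"halted_at": host, "updated": updated}
            updated.append(host)
    return {"halted_at": None, "updated": updated}


if __name__ == "__main__":
    example_fleets = {
        "customer": ["customer-1", "customer-2"],
        "batch": ["batch-1"],
        "operations": ["ops-1", "ops-2"],
    }
    outcome = deploy_fleets(
        example_fleets,
        deploy_host=lambda host: print("deploying to", host),
        health_check=lambda host: host != "ops-2",
    )
    print(outcome)
```

In this simulated run the health check fails on an operations host, so the customer fleet is never touched and continues to serve traffic on the previous version.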

    Data Plane Deployments

    As mentioned earlier, our data plane consists of Route 53’s DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an ‘edge location’). An important consideration with our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belongs to a “stripe” (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is therefore announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API are important, the availability and correctness of our data plane are even more critical. In this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully to both prevent failed deployments and to reduce potential impact.

    The safest way to deploy and minimize impact would be to deploy to a single edge location at a time. However, with manual deployments that are overseen by a developer, this approach is just not scalable with how frequently we deploy new software to over 50 locations (with more added each year). Thus, most of our production deployment waves consist of multiple locations; the one exception is our first wave that includes just a single location. Furthermore, this location is specifically chosen because it runs our oldest hardware, which provides us a quick notification for any unintended performance degradation. It is important to note that while the caching behavior for resolvers can cause issues if we serve an incorrect answer, they handle other types of failures well. When a recursive resolver receives a query for a record that is not cached, it will typically issue queries to at least three of the four delegation name servers in parallel and it will use the first response it receives. Thus, in the event where one of our locations is black holing customer queries (i.e., not replying to DNS queries), the resolver should receive a response from one of the other delegation name servers. In this case, the only impact is to resolvers where the edge location that is not answering would have been the fastest responder. Now, that resolver will effectively be waiting for the response from the second fastest stripe. To take advantage of this resiliency, our other waves are organized such that they include edge locations that are geographically diverse, with the intent that for any single resolver, there will be nearby locations that are not included in the current deployment wave. Furthermore, to guarantee that at most a single nameserver for all customers is affected, waves are actually organized by stripe. Finally, each stripe is spread across multiple waves so that failures impact only a single name server for a portion of our customers. 
An example of this strategy is depicted below. A few notes: our staging environment consists of a much smaller number of edge locations than production, so single-location waves are possible. Second, each stripe is denoted by color; in this example, we see deployments spread across a blue and orange stripe. You, too, can think about organizing your deployment strategy around your failure conditions. For example, if you have a database schema used by both your production system and a warehousing system, deploy the change to the warehousing system first to ensure you haven’t broken any compatibility. You might catch problems with the warehousing system before it affects customer traffic.
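The resolver behavior that makes stripe-organized waves safe can be shown with a small simulation. The Python sketch below is a simplified model with made-up latencies: it assumes a resolver queries all four delegation name servers in parallel (real resolvers typically query at least three) and uses the first response, and it treats a black-holing edge location as a stripe that never answers.

```python
def first_response_ms(stripe_latency_ms, black_holed=()):
    """Return the latency of the first answer a resolver would receive.

    `stripe_latency_ms` maps each stripe to the round-trip time, in
    milliseconds, of its closest edge location for this resolver.
    Stripes listed in `black_holed` never reply. Returns None if no
    stripe answers.
    """
    answering = [
        latency
        for stripe, latency in stripe_latency_ms.items()
        if stripe not in black_holed
    ]
    return min(answering) if answering else None


if __name__ == "__main__":
    # Hypothetical latencies from one resolver to the nearest location per stripe.
    latencies = {"stripe-1": 12, "stripe-2": 20, "stripe-3": 35, "stripe-4": 50}
    print("all healthy:", first_response_ms(latencies))
    print("fastest stripe black holed:", first_response_ms(latencies, {"stripe-1"}))
    print("slower stripe black holed:", first_response_ms(latencies, {"stripe-3"}))
```

A failed deployment confined to one stripe therefore costs a resolver at most the difference between its fastest and second fastest stripe, and resolvers whose fastest stripe is unaffected see no change at all.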

    Conclusions

    Our team’s experience with operating Route 53 over the last 3+ years has highlighted the importance of reducing the impact from failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way that we increase the ease of mitigation while decreasing the potential impact to our customers.

    – Nick Trebon