Tag Archives: Amazon EC2

Offloading SQL for Amazon RDS using the Heimdall Proxy

Post Syndicated from Antony Prasad Thevaraj original https://aws.amazon.com/blogs/architecture/offloading-sql-for-amazon-rds-using-the-heimdall-proxy/

Getting the maximum scale from your database often requires fine-tuning the application. This can increase time and incur cost – effort that could be used towards other strategic initiatives. The Heimdall Proxy was designed to intelligently manage SQL connections to help you get the most out of your database.

In this blog post, we demonstrate two SQL offload features offered by this proxy:

  1. Automated query caching
  2. Read/Write split for improved database scale

By leveraging the solution shown in Figure 1, you can save on development costs and accelerate the onboarding of applications into production.

Figure 1. Heimdall Proxy distributed, auto-scaling architecture

Figure 1. Heimdall Proxy distributed, auto-scaling architecture

Why query caching?

For ecommerce websites with high read calls and infrequent data changes, query caching can drastically improve your Amazon Relational Database Sevice (RDS) scale. You can use Amazon ElastiCache to serve results. Retrieving data from cache has a shorter access time, which reduces latency and improves I/O operations.

It can take developers considerable effort to create, maintain, and adjust TTLs for cache subsystems. The proxy technology covered in this article has features that allow for automated results caching in grid-caching chosen by the user, without code changes. What makes this solution unique is the distributed, scalable architecture. As your traffic grows, scaling is supported by simply adding proxies. Multiple proxies work together as a cohesive unit for caching and invalidation.

View video: Heimdall Data: Query Caching Without Code Changes

Why Read/Write splitting?

It can be fairly straightforward to configure a primary and read replica instance on the AWS Management Console. But it may be challenging for the developer to implement such a scale-out architecture.

Some of the issues they might encounter include:

  • Replication lag. A query read-after-write may result in data inconsistency due to replication lag. Many applications require strong consistency.
  • DNS dependencies. Due to the DNS cache, many connections can be routed to a single replica, creating uneven load distribution across replicas.
  • Network latency. When deploying Amazon RDS globally using the Amazon Aurora Global Database, it’s difficult to determine how the application intelligently chooses the optimal reader.

The Heimdall Proxy streamlines the ability to elastically scale out read-heavy database workloads. The Read/Write splitting supports:

  • ACID compliance. Determines the replication lag and know when it is safe to access a database table, ensuring data consistency.
  • Database load balancing. Tracks the status of each DB instance for its health and evenly distribute connections without relying on DNS.
  • Intelligent routing. Chooses the optimal reader to access based on the lowest latency to create local-like response times. Check out our Aurora Global Database blog.

View video: Heimdall Data: Scale-Out Amazon RDS with Strong Consistency

Customer use case: Tornado

Hayden Cacace, Director of Engineering at Tornado

Tornado is a modern web and mobile brokerage that empowers anyone who aspires to become a better investor.

Our engineering team was tasked to upgrade our backend such that it could handle a massive surge in traffic. With a 3-month timeline, we decided to use read replicas to reduce the load on the main database instance.

First, we migrated from Amazon RDS for PostgreSQL to Aurora for Postgres since it provided better data replication speed. But we still faced a problem – the amount of time it would take to update server code to use the read replicas would be significant. We wanted the team to stay focused on user-facing enhancements rather than server refactoring.

Enter the Heimdall Proxy: We evaluated a handful of options for a database proxy that could automatically do Read/Write splits for us with no code changes, and it became clear that Heimdall was our best option. It had the Read/Write splitting “out of the box” with zero application changes required. And it also came with database query caching built-in (integrated with Amazon ElastiCache), which promised to take additional load off the database.

Before the Tornado launch date, our load testing showed the new system handling several times more load than we were able to previously. We were using a primary Aurora Postgres instance and read replicas behind the Heimdall proxy. When the Tornado launch date arrived, the system performed well, with some background jobs averaging around a 50% hit rate on the Heimdall cache. This has really helped reduce the database load and improve the runtime of those jobs.

Using this solution, we now have a data architecture with additional room to scale. This allows us to continue to focus on enhancing the product for all our customers.

Download a free trial from the AWS Marketplace.


Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced Tier ISV partner. They have Amazon Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL improving database scale. Deployment does not require code changes. For other proxy options, consider the Amazon RDS Proxy, PgBouncer, PgPool-II, or ProxySQL.

Securely extend and access on-premises Active Directory domain controllers in AWS

Post Syndicated from Mangesh Budkule original https://aws.amazon.com/blogs/security/securely-extend-and-access-on-premises-active-directory-domain-controllers-in-aws/

If you have an on-premises Windows Server Active Directory infrastructure, it’s important to plan carefully how to extend it into Amazon Web Services (AWS) when you’re migrating or implementing cloud-based applications. In this scenario, existing applications require Active Directory for authentication and identity management. When you migrate these applications to the cloud, having a locally accessible Active Directory domain controller is an important factor in achieving fast, reliable, and secure Active Directory authentication.

In this blog post, I’ll provide guidance on how to securely extend your existing Active Directory domain to AWS and optimize your infrastructure for maximum performance. I’ll also show you a best practice that implements a remote desktop gateway solution to access your domain controllers securely while using the minimum required ports. Additionally, you will learn about how AWS Systems Manager Session Manager port forwarding helps provide a secure and simple way to manage your domain resources remotely, without the need to open inbound ports and maintain RDGW hosts.

Administrators can use this blog post as guidance to design Active Directory on Amazon Elastic Compute Cloud (Amazon EC2) domain controllers. This post can also be used to determine which ports and protocols are required for domain controller infrastructure communication in a segmented network.

Design and guidelines for EC2-hosted domain controllers

This section provides a set of best practices for designing and deploying EC2-hosted domain controllers in AWS.

AWS has multiple options for hosting Active Directory on AWS, which are discussed in detail in the Active Directory Domain Services on AWS Design and Planning Guide. One option is to use AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD). AWS Managed Microsoft AD provides you with a complete new forest and domain to start your Active Directory deployment on AWS. However, if you prefer to extend your existing Active Directory domain infrastructure to AWS and manage it yourself, you have the option of running Active Directory on EC2-hosted domain controllers. See our Quick Start guide for instructions on how to deploy both of these options (AWS Managed Microsoft AD or EC2-hosted domain controllers on AWS).

If you’re operating in more than one AWS Region and require Active Directory to be available in all these Regions, use the best practices in the Design and Planning Guide for a multi-Region deployment strategy. Within each of the Regions, follow the guidelines and best practices described in this blog post.

Figure 1 shows an example of how to deploy Active Directory on EC2 instances in multiple Regions with multiple virtual private clouds (VPCs). In this example, I’m showing the Active Directory design in multiple Regions that interconnect to each other by using AWS Transit Gateway.

Figure 1: Extended EC2 domain controllers architecture

Figure 1: Extended EC2 domain controllers architecture

In order to extend your existing Active Directory deployment from on-premises to AWS as shown in the example, you do two things. First, you add additional domain controllers (running on Amazon EC2) to your existing domain. Second, you place the domain controllers in multiple Availability Zones (AZs) within your VPC, in multiple Regions, by keeping the same forest (Example.com) and domain structure.

Consider these best practices when you deploy or extend Active Directory on EC2 instances:

  1. We recommend deploying at least two domain controllers (DCs) in each Region and configuring a minimum of two AZs, to provide high availability.
  2. If you require additional domain controllers to achieve your performance goals, add more domain controllers to existing AZs or deploy to another available AZ.
  3. It’s important to define Active Directory sites and subnets correctly to prevent clients from using domain controllers that are located in different Regions, which causes increased latency.
  4. Configure the VPC in a Region as a single Active Directory site and configure Active Directory subnets accordingly in the AD Sites and Services console. This configuration confirms that your clients correctly select the closest available domain controller.
  5. If you have multiple VPCs, centralize the Active Directory services in one of your existing VPCs or create a shared services VPC to centralize the domain controllers.
  6. Make sure that robust inter-Region connectivity exists between all of the Regions. Within AWS, you can leverage cross-Region VPC peering to achieve highly available private connectivity between Regions. You can also use the Transit Gateway VPC solution, as shown in Figure 1, to interconnect multiple Regions.
  7. Make sure that you’re deploying your domain controllers in a private subnet without internet access.
  8. Keep your security patches up to date on EC2 domain controllers. You can use AWS Systems Manager to patch your domain controllers in a controlled manner.
  9. Have a dedicated AWS account for directory services and don’t share the account with other general services and applications. This helps you to control access to the AWS account and add domain controller–specific automation.
  10. If your users need to manage AWS services and access AWS applications with their Active Directory credentials, we recommend integrating your identity service with the management account in AWS Organizations. You can configure the AWS Single Sign-On (AWS SSO) service to use AD Connector in a primary account VPC to connect to self-managed Active Directory domain controllers that are hosted in a Shared Services account.

    Alternatively, you can deploy AWS Managed Microsoft AD in the management account, with trust to your EC2 Active Directory domain, to allow users from any trusted domain to access AWS applications. However, you could host these EC2 domain controllers in the primary account, similar to the AWS Managed AD option.

  11. Build domain controllers with end-to-end automation using version control (for example, GIT and AWS CodeCommit) and Desired State Configuration (DSC)/PowerShell.

Security considerations for EC2-hosted domains

This section explains how you can maximize the security of your extended EC2-hosted domain controller infrastructure, and use AWS services to help achieve security compliance. You should also refer to your organization’s IT system security policies to determine the most relevant recommendations to implement.

AWS operates under a shared security responsibility model, where AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing workloads you deploy in AWS.

Our recommendations for security for EC2-hosted domains are as follows:

  1. We recommend that you place EC2-hosted domain controllers in a single dedicated AWS account or deploy them in your AWS Organizations management account. This makes it possible for you to use your Active Directory credentials for authentication to access the AWS Management Console and other AWS applications.
  2. Use tag-based policies to restrict access to domain controllers if you’re using the Shared Services account for hosting domain controllers.
  3. Take advantage of the EC2 Image Builder service to deploy a domain controller that uses a CIS standard base image. By doing this, you can avoid manual deployment by setting up an image pipeline.
  4. Secure the AWS account where the domain controllers are running by following the principle of least privilege and by using role-based access control.
  5. Take advantage of these AWS services to help secure your workloads and application:
    • AWS Landing Zone–A solution that helps you more quickly set up a secure, multi-account AWS environment, based on AWS best practices.
    • AWS Organizations–A service that helps you centrally manage and govern your environment as you grow and scale your AWS resources.
    • Amazon Guard Duty–An automated threat detection service that continuously monitors for suspicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data that are stored in Amazon Simple Storage Service (Amazon S3).
    • Amazon Detective–A service that can analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities.
    • Amazon Inspector–An automated security assessment service that helps improve the security and compliance of applications that are deployed on AWS.
    • AWS Security Hub–A service that provides customers with a comprehensive view of their security and compliance status across their AWS accounts. You can import critical patch compliance findings into Security Hub for easy reference.

Use data encryption

AWS offers you the ability to add a layer of security to your data at rest in the cloud, providing scalable and efficient encryption features. These are some best practices for data encryption:

  1. Encrypt the Amazon Elastic Block Store (Amazon EBS) volumes that are attached to the domain controllers, and keep the customer master key (CMK) safe with AWS Key Management Service (AWS KMS) or AWS CloudHSM, according to your security team’s guidance and policies.
  2. Consider using a separate CMK for the Active Directory and restrict access to the CMK to a specific team.
  3. Enable LDAP over SSL (LDAPS) on all domain controllers, for secure authentication, if your application supports LDAPS authentication.
  4. Deploy and manage a public key infrastructure (PKI) on AWS. For more information, see the Microsoft PKI Quick Start guide.

Restrict account and instance access

Provide management access for directory service accounts and domain controller instances only to the specific team that manages the Active Directory. To do this, follow these guidelines:

  1. Restrict access to an EC2 domain controller’s start, stop, and terminate behavior by using AWS Identity and Access Management (IAM) policy and resources tags. Example: Restrict-ec2-iam
  2. Restrict access to Amazon EBS volumes and snapshots.
  3. Restrict account root access and implement multi-factor authentication (MFA) for this access.

Network access control for domain controllers

Whenever possible, block all unnecessary traffic to and from your domain controllers to limit the communication so that only the necessary ports are opened between a domain controller and another computer. Use these best practices:

  1. Allow only the required network ports between the client and domain controllers, and between domain controllers.
  2. Use a security group to narrow down the access to domain controllers.
  3. Use network access control lists (network ACLs) to filter Active Directory ports as this gives you better control than using ephemeral ports.
  4. Deploy domain controllers in private subnets.
  5. Route only the required subnets into the VPC that contains the domain controllers.

Secure administration

AWS provides services that continuously monitor your data, accounts, and workloads to help protect them from unauthorized access. We recommend that you take advantage of the following services to securely administer your domain controller’s deployment:

  1. Use AWS Systems Manager Session Manager or Run Command to manage your instances remotely. The command actions are sent to Amazon CloudWatch Logs or Amazon S3 for auditing purposes. Leaving inbound Remote Desktop Protocol (RDP), WinRM ports, and remote PowerShell ports open on your instances greatly increases the risk of entities running unauthorized or malicious commands on the instances. Session Manager helps you improve your security posture by letting you close these inbound ports, which frees you from managing SSH keys and certificates, bastion hosts, and jump boxes.
  2. Use Amazon EventBridge to set up rules to detect when changes happen to your domain controller EC2 instances and to send notifications by using Amazon Simple Notification Service (Amazon SNS) when a command is run.
  3. Manage configuration drift on EC2 instances. Systems Manager State Manager helps you automate the process of keeping your domain controller EC2 instances in the desired state and integrates with Systems Manager Compliance.
  4. Avoid any manual interventions while you build and manage domain controllers. Automate the domain join process for Amazon EC2 instances from multiple AWS accounts and Regions.
  5. For developing your applications with domain controllers, use the Windows DC locator service or use the Dynamic DNS (DDNS) service of your AWS Managed Microsoft AD to locate domain controllers. Do not hard-code applications with the address of a domain controller.
  6. Use AWS Config to manage your domain controller configuration.
  7. Use Systems Manager Parameter Store or Secrets Manager to store all secrets, as well as configurations for your domain controller automation.
  8. Use version control to update the domain controller source code with pipeline approvals to avoid any misconfigurations and faulty deployments.

Logging and monitoring

AWS provides tools and features that you can use to see what’s happening in your AWS environment. We recommend that you use these logging and monitoring practices for your EC2-hosted domain controllers:

  1. Enable VPC Flow Logs data for each domain controller’s accounts to monitor the traffic that’s reaching your domain controller instance.
  2. Log Windows and Active Directory events in Amazon CloudWatch Logs for increased visibility.
  3. Consider setting up alerts and notifications for key security events for EC2 domain controllers, in real time. These alerts can be sent to your Red and Blue security response teams for further analysis.
  4. Deploy the CloudWatch agent or the Amazon Kinesis Agent for Windows on EC2 for detail monitoring and alerting at the domain controller operating system level.
  5. Log Systems Manager API calls with AWS CloudTrail.

Other security considerations

As a best practice, implement domain controller security at the operating system level, according to your security team’s recommendations. We recommend these options:

  1. Block executables from running on domain controllers.
  2. Prevent web browsing from domain controllers.
  3. Configure a Windows Server Core base image for domain controllers.
  4. Integrate bastion hosts with Systems Manager Session Manager and use MFA to manage domain controllers remotely.
  5. Perform regular system state backups of your Active Directory environments. Encrypt these backups.
  6. Perform Active Directory administrative management from a remote server, and avoid logging in to domain controllers interactively unless needed.
  7. For FSMO roles, you can follow the same recommendations you would follow for your on-premises deployment to determine FSMO roles on domain controllers. For more information, see these best practices from Microsoft. In the case of AWS Managed Microsoft AD, all domain controllers and FSMO role assignments are managed by AWS and don’t require you to manage or change them.

Domain controller ports

In this section, I’m going to cover the network ports and protocols that are needed to deploy domain services securely. Understanding how traffic flows and is processed by a network firewall is essential when someone requests or implements firewall rules, to avoid any connectivity issues.

Here are some common problems that you might observe because of network port blockage:

  • The RPC server is unavailable
  • Name resolution issues
  • A connectivity issue is detected with LDAP, LDAPS, and Kerberos
  • Domain replication issues
  • Domain authentication issues
  • Domain trust issues between on-premises Active Directory and AWS Managed Microsoft AD
  • AD Connector connectivity issues
  • Issues with domain join, password reset, and more

Understand Active Directory firewall ports

You must allow traffic from your on-premises network to the VPC that contains your extended domain controllers. To do this, make sure that the firewall ports that opened with the VPC subnets that were used to deploy your EC2-hosted domain controllers and the security group rules that are configured on your domain controllers both allow the network traffic to support domain trusts.

Domain controller to domain controller core ports requirements

The following table lists the port requirements for establishing DC-to-DC communication in all versions of Windows Server.

Source Destination Protocol Port Type Active Directory usage Type of traffic
Any domain controller Any domain controller TCP and UDP 53 Bi-directional User and computer authentication, name resolution, trusts DNS
TCP and UDP 88 Bi-directional User and computer authentication, forest level trusts Kerberos
UDP 123 Bi-directional Windows Time, trusts Windows Time
TCP 135 Bi-directional Replication RPC, Endpoint Mapper (EPM)
UDP 137 Bi-directional User and computer authentication NetLogon, NetBIOS name resolution
UDP 138 Bi-directional Distributed File System (DFS), Group Policy DFSN, NetLogon, NetBIOS Datagram Service
TCP 139 Bi-directional User and computer authentication, replication DFSN, NetBIOS Session Service, NetLogon
TCP and UDP 389 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trustss LDAP
TCP and UDP 445 Bi-directional Replication, user, and computer authentication, Group Policy, trusts SMB, CIFS, SMB2, DFSN, LSARPC, NetLogonR, SamR, SrvSvc
TCP and UDP 464 Bi-directional Replication, user, and computer authentication, trusts Kerberos change/set password
TCP 636 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP SSL (required only if LDAP over SSL is configured)
TCP 3268 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP Global Catalog (GC)
TCP 3269 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP GC SSL (required only if LDAP over SSL is configured)
TCP 5722 Bi-directional File replication RPC, DFSR (SYSVOL)
TCP 9389 Bi-directional AD DS web services SOAP
TCP Dynamic 49152–65535 Bi-directional Replication, user, and computer authentication, Group Policy, trusts RPC, DCOM, EPM, DRSUAPI, NetLogonR, SamR, File Replication Service (FRS)
UDP Dynamic 49152–65535 Bi-directional Group Policy DCOM, RPC, EPM

Note: There is no need to open a DNS port on domain controllers if you are not using a domain controller as a DNS server, or if you’re using any third-party DNS solutions.

Client to domain controller core ports requirements

The following table lists the port requirements for establishing client to domain controller communication for Active Directory.

Source Destination Protocol Port Type Usage Type of traffic
All internal company client network IP subnets Any domain controller TCP 53 Uni-directional DNS DNS
UDP 53 Uni-directional DNS Kerberos
TCP 88 Uni-directional Kerberos Auth Kerberos
UDP 88 Uni-directional Kerberos Auth Kerberos
UDP 123 Uni-directional Windows Time Windows Time
TCP 135 Uni-directional RPC, EPM RPC, EPM
UDP 137 Uni-directional NetLogon, NetBIOS name User and computer authentication
UDP 138 Uni-directional DFSN, NetLogon, NetBIOS datagram service DFS, Group Policy, NetBIOS, NetLogon, browsing
TCP 389 Uni-directional LDAP Directory, replication, user, and computer authentication, Group Policy, trust
UDP 389 Uni-directional LDAP Directory, replication, user, and computer authentication, Group Policy, trust
TCP 445 Uni-directional SMB, CIFS, SMB3, DFSN, LSARPC, NetLogonR, SamR, SrvSvc Replication, user, and computer authentication, Group Policy, trust
TCP 464 Uni-directional Kerberos change/set password Replication, user, and computer authentication, trust
UDP 464 Uni-directional Kerberos change/set password Replication, user, and computer authentication, trust
TCP 636 Uni-directional LDAP SSL Directory, replication, user, and computer authentication, Group Policy, trust
TCP 3268 Uni-directional LDAP GC Directory, replication, user, and computer authentication, Group Policy, trust
TCP 3269 Uni-directional LDAP GC SSL Directory, replication, user, and computer authentication, Group Policy, trust
TCP 9389 Uni-directional SOAP AD DS web service
TCP 49152–65535 Uni-directional DCOM, RPC, EPM Group Policy


  • You must allow network traffic communication from your on-premises network to the VPC that contains your AWS-hosted EC2 domain controllers.
  • You also can restrict DC-to-DC replication traffic and DC-to-client communications to specific ports.
  • Packet fragmentation can cause issues with services such as Kerberos. You should make sure that maximum transmission unit (MTU) sizes match your network devices.
  • Additionally, unless a tunneling protocol is used to encapsulate traffic to Active Directory, ranges of ephemeral TCP ports between 49152 to 65535 are required. Ephemeral ports are also known as service response ports. These ports are created dynamically for session responses for each client that establishes a session. These ports are required not only for Windows but for Linux and UNIX.

Manage domain controllers securely using a bastion host and RDGW

We recommend that you restrict the domain controller’s management by using a secure, highly available, and scalable Microsoft Remote Desktop Gateway (RDGW) solution in conjunction with bastion hosts. A bastion host that is designed to work with a specific part of the infrastructure should work with that unit only, and nothing else. Limiting the use of bastion hosts to a specific instance example domain controller can help improve your security posture.

The reference architecture shown in Figure 2 restricts management access to your domain controllers and access via port 443. The bastion hosts in the diagram are configured to only allow RDP from the RDGW.

For additional security, follow these best practices:

  • Configure RDGW and bastions hosts to use MFA for logins.
  • Implement login restrictions by using a Group Policy Object (GPO), so that only required administrators log in to RDGW and the bastion host, based on their group membership.

Bastion host to domain controllers ports requirements

The following table lists the port requirements for establishing bastion host-to-DC communication in all versions of Windows Server.

Source Destination Protocol Ports Type Usage Type of traffic
Bastion host to domain controller Any domain controller subset TCP 443 Uni-directional TPKT Remote Protocol Gateway access
UDP 3389 Uni-directional TPKT Remote Desktop Protocol
TCP 3389 Uni-directional WS-Man Remote Desktop Protocol
TCP 5985 Bi-directional HTTPS Windows Remote Management (WinRM)
TCP 5985 Bi-directional WS-Man Windows Remote Management (WinRM)

You can also take advantage of Systems Manager Session Manager to manage domain joined resources instead of using bastion hosts for management. This option eliminates the need to manage bastion infrastructure and open any inbound rules. It also integrates natively with IAM and AWS CloudTrail, two services that enhance your security and audit posture. In the next section, I’ll discuss Session Manager and how it is useful in this context.

Session Manager port forwarding

Active Directory administrators are accustomed to managing domain resources by using Remote Server Administrators Tools (RSAT) that are installed on either their workstations or a member server in the domain (for example, RDP to a bastion host). Although RDP is effective, using RDP requires more management, such as managing inbound rules for port 3389. In some cases, having this port exposed to the internet might put your systems at risk. For example, systems can be susceptible to brute force or unauthorized dictionary activity. Instead of using a RDGW host and opening RDP inbound RDP ports, we recommend using the Session Manager Service, which provides port-forwarding ability without opening inbound ports.

Port forwarding provides the ability to forward traffic between your clients to open ports on your EC2 instance. After you configure port forwarding, you can connect to the local port and access the server application that is running inside the instance, as shown in Figure 3. To configure the port-forwarding feature in Session Manager, you can use IAM policies and the AWS-StartPortForwardingSession document.

Figure 3: Session Manager tunnel

Figure 3: Session Manager tunnel

To start a session using the AWS Command Line Interface (AWS CLI), run the following command.

aws ssm start-session --target "instance-id" --document-name AWS StartPortForwardingSession -parameters portNumber="3389",localPortNumber="9999"

Note: You can use any available ephemeral port. 9999 is just an example. Install and configure the AWS CLI, if you haven’t already.

You can also start a session by using an IAM policy like the one shown in the following example. To learn more about creating IAM policies for Session Manager, see the topic Quickstart default IAM policies for Session Manager.

In this policy example, I created the policy for Systems Manager for both AWS-StartPortForwadingSession and AWS-StartSSHSession for Linux (SSH) environments, for your reference and guidance.

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:ec2:*: <AccountID>:instance/*",
                "arn:aws:ssm:*: <AccountID>:document/SSM-SessionManagerRunShell"
            "Condition": {
                "StringLike": {
                    "ssm:resourceTag/TAGKEY": [

            "Effect": "Allow",
            "Action": [
            "Resource": [

            "Effect": "Allow",

            "Action": [



            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": [

When you use the port-forwarding feature in Session Manager, you have the option to use an auditing service like AWS CloudTrail to provide a record of the connections made to your instances. You can also monitor the session by using Amazon CloudWatch Events with Amazon SNS to receive notifications when a user starts or ends session activity.

There is no additional charge for accessing EC2 instances by using Session Manager port forwarding. Port forwarding is available today in all AWS Regions where Systems Manager is available. You will be charged for the outgoing bandwidth from the NAT Gateway or your VPC Private Link.

Bastion host architecture using Session Manager

In this section, I discuss how to use a bastion host with Session Manager. Session Manager uses the Systems Manager infrastructure to create an SSH-like session with an instance. Session Manager tunnels real SSH connections, which allows you to tunnel to another resource within your VPC directly from your local machine. A managed instance that you create, acts as a bastion host, or gateway, to your AWS resources. The benefits of this configuration are:

  • Increased security: This configuration uses only one EC2 instance (the bastion host), and connects outbound port 443 to Systems Manager infrastructure. This allows you to use Session Manager without any inbound connections. The local resource must allow inbound traffic only from the instance that is acting as bastion host. Therefore, there is no need to open any inbound rule publicly.
  • Ease of use: You can access resources in your private VPC directly from your local machine.

In the example shown in Figure 4, the EC2 instance is acting as a domain controller that must be accessed securely by an Active Directory administrator who is working remotely via bastion host. To support this use case, I’ve chosen to use an interface VPC endpoint for Systems Manager, in order to facilitate private connectivity between Systems Manager Agent (SSM Agent) on the EC2 instance that is acting as a bastion host, and the Systems Manager service endpoints. You can configure Session Manager to enable port forwarding between the administrator’s local workstation and the private EC2 bastion instances, so that they can securely access the bastion host from the internet. This architecture helps you to eliminate RDGW infrastructure setup and reduce management efforts. You can add MFA at the bastion host level to enhance security.


  • If you want to use the AWS CLI to start and end sessions that connect you to your managed instances, you must first install the Session Manager plugin on your local machine.
  • Make sure that the bastion host has SSM Agent installed, because Session Manager only works with Systems Manager managed instances.
  • Follow the steps in Creating an interface endpoint to create the following interface endpoints:
    • com.amazonaws.<region>.ssm – The endpoint for the Systems Manager service.
    • com.amazonaws.><region>.ec2messages – Systems Manager uses this endpoint to make calls from the SSM Agent to the Systems Manager service.
    • com.amazonaws.<region>.ec2 – The endpoint to the EC2 service. If you’re using Systems Manager to create VSS-enabled snapshots, you must ensure that you have this endpoint. Without the EC2 endpoint defined, a call to enumerate attached EBS volumes fails. This causes the Systems Manager command to fail.
    • com.amazonaws.<region>.ssmmessages – This endpoint is required for connecting to your instances through a secure data channel by using Session Manager, in this case the port-forwarding requirement.

Support for domain controllers in Session Manager

You can use Session Manager to connect EC2 domain controllers directly, as well. To initiate a connection with either the default Session Manager connection or the port-forwarding feature discussed in this post, complete these steps.

To initiation a connection

  1. Create the ssm-user in your domain.
  2. Add the ssm-user to the domain groups that grant the user local access to the domain controller. One example is to add the user to the Domain Admins group.

IMPORTANT: Follow your organization’s security best practices when you grant the ssm-user access to the domain.


In this blog post, I described best practices for deploying domain controllers on EC2 instances and extending on-premises Active Directory to AWS for your guidance and quick reference. I also covered how you can maximize security for your extended EC2-hosted domain controller infrastructure by using AWS services. In addition, you learned about how AWS Systems Manager Session Manager port forwarding to RDP provides a simple and secure way to manage your domain resources remotely, without the need to open inbound ports and maintain RDGW hosts. Port forwarding works for Windows and Linux instances. It’s available today in all AWS Regions where Systems Manager is available. Depending on your use case, you should consider additional protection mechanisms per your organization’s security best practices.

To learn more about migrating Windows Server or SQL Server, visit Windows on AWS. For more information about how AWS can help you modernize your legacy Windows applications, see Modernize Windows Workloads with AWSContact us to start your modernization journey today.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Mangesh Budkule

Mangesh is a Microsoft Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance and technical assistance on AWS services, improving the value of their solutions when using AWS.

Migrate Resources Between AWS Accounts

Post Syndicated from Ashok Srirama original https://aws.amazon.com/blogs/architecture/migrate-resources-between-aws-accounts/

Have you ever wondered how to move resources between Amazon Web Services (AWS) accounts? You can really view this as a migration of resources. Migrating resources from one AWS account to another may be desired or required due to your business needs. Following are a few scenarios where this may be of benefit:

  1. When you acquire, sell, or merge overseas operations from other businesses.
  2. When you move regional operations from one Managed Service Provider (MSP) to another.
  3. When you reorganize your AWS account and organizational structure.

This process may involve migrating the infrastructure either partially or completely.

In this blog, we will discuss various approaches to migrating resources based on type, configuration, and workload needs. Usually, the first consideration is infrastructure. What’s in your environment? What are the interdependencies? How will you migrate each resource?

Using this information, you can outline a plan on how you will approach migrating each of the resources in your portfolio, and in what order.

Here are some considerations to address for a typical migration:

Let’s look at each of these considerations in detail.

Migrating infrastructure

To migrate infrastructure that includes ephemeral resources, you can use one of the following Infrastructure as Code (IaC) approaches, shown in Figure 1. IaC templates are like programming scripts that automate the provisioning of IT resources.

Figure 1. Approaches to migrate infrastructure using IaC

Figure 1. Approaches to migrate infrastructure using IaC

1. If you are already using AWS CloudFormation templates, you can easily import the existing templates to the target AWS account.

AWS CloudFormation simplifies provisioning and management on AWS. You can create templates for quick and reliable provisioning of services or applications (called “stacks”).

2. You can use tools like Former2 to templatize your existing resources in the source AWS account and deploy them in the target account.

Former2 is an open-source project that allows you to generate IaC templates. For example, AWS CloudFormation or HashiCorp Terraform can be generated from the existing resources within your AWS account. Read Accelerate infrastructure as code development with open source Former2 for step-by-step guidance.

Migrating compute resources

To migrate compute resources that have a persistent state, you can use one of the following approaches, shown in Figure 2. These provide a virtual computing environment, allowing you to launch instances with a variety of operating systems.

Figure 2. Approaches to migrate compute resources

Figure 2. Approaches to migrate compute resources

1. If you are already using AWS Backup service and AWS Organizations to centrally manage backup policies, you can enable AWS Backup cross-account management feature. This manages, monitors, restores your backup, and copies jobs across AWS accounts. Ensure you have both accounts in same AWS Organization. Once the backups are available in the target account, you can restore EC2 instances. Follow detailed instructions at Creating backup copies across AWS accounts.

AWS Backup is a fully managed data protection service that centralizes and automates data across AWS services, in the cloud, and on-premises. You can configure backup policies and monitor activity for your AWS resources. You can automate and consolidate backup tasks that were previously performed service-by-service. This removes the need to create custom scripts and use manual processes.

2. Create an Amazon Machine Image of your EC2 instances and share it with the target account. You can launch new EC2 instances using the shared AMI. Follow step-by-step instructions: How do I transfer an Amazon EC2 instance or AMI to a different AWS account?

Amazon Machine Image (AMI) provides the information required to launch an instance. Specify an AMI and then launch multiple instances from a single AMI with the same configuration. You can use different AMIs to launch instances when you need instances with different configurations.

For migrating non-persistent compute resources, refer Migrating Infrastructure section.

Migrating storage resources

AWS offers various storage services including object, file, and block storage. To migrate objects from a S3 bucket, you can take the following approaches, shown in Figure 3a.

Figure 3a. Approaches to migrate S3 buckets

Figure 3a. Approaches to migrate S3 buckets

1. Use Amazon S3 command line interface (CLI) commands to copy the initial load of objects from the source account to the target account. Read How can I copy S3 objects from another AWS account? After the initial copy, you can enable Amazon S3 replication feature to continuously replicate object changes across accounts. Add a bucket policy to grant source bucket permission to replicate objects in destination bucket. Read this walkthrough on how to configure replications.

2. If the S3 bucket contains large number of objects, use Amazon S3 Batch operations to copy objects across AWS accounts in bulk. Read Cross-account bulk transfer of files using Amazon S3 Batch Operations.

To migrate files from an Amazon EFS file system, you can take the following approach, shown in Figure 3b.

Figure 3b. Approach to migrate EFS file systems

Figure 3b. Approach to migrate EFS file systems

Use AWS DataSync agent to transfer data from one EFS file system to another. AWS DataSync is an online transfer service that simplifies moving, copying, and synchronizing large amounts of data between on-premises storage systems and AWS storage services. Read Transferring file data across AWS Regions and accounts using AWS DataSync for step-by-step guidance.

Migrating database resources

AWS offers various purpose-built database engines. These include relational, key-value, document, in-memory, graph, time series, wide column, and ledger databases. To migrate relational databases, you can take one of the following approaches, shown in Figure 4.

Figure 4. Approaches to migrate relational database resources

Figure 4. Approaches to migrate relational database resources

1. If you want to continuously replicate data changes, use AWS Database Migration Service (AWS DMS) to replicate your data across AWS accounts with high availability. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. You can set up a DMS task for either one-time migration or on-going replication. An on-going replication task keeps your source and target databases in sync. Once set up, the on-going replication task will continuously apply source changes to the target with minimal latency. Learn how to Set Up AWS DMS for Cross-Account Migration.

AWS DMS is a cloud service that streamlines the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.

2. Use RDS Snapshots to create and share database backups across AWS accounts. Use the shared snapshots to launch new Amazon Relational Database Service (RDS) instances in the target account. Read step-by-step instructions: How can I share an encrypted Amazon RDS DB snapshot with another account?

3. Use AWS Backup to create backup policies that automatically back up your AWS resources. Use AWS Backup cross-account management feature to manage and monitor your backup, restore, and copy jobs across AWS accounts. Once the backups are available in the target account, you can restore RDS instances. Learn about Creating backup copies across AWS accounts.

In this section, we discussed relational databases migration. You can also use AWS DMS for migrating other databases. Read supported AWS DMS source and target databases.


In this blog post, we discussed various approaches you can take to migrate resources from one account to another depending upon the resource type and configuration. Additionally, you can also explore CloudEndure Migration for continuous data replication. Learn more about Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes: Set Up a Highly Available Database on AWS with IBM Db2 Pacemaker

Post Syndicated from Sai Parthasaradhi original https://aws.amazon.com/blogs/architecture/field-notes-set-up-a-highly-available-database-on-aws-with-ibm-db2-pacemaker/

Many AWS customers need to run mission-critical workloads—like traffic control system, online booking system, and so forth—using the IBM Db2 LUW database server. Typically, these workloads require the right high availability (HA) solution to make sure that the database is available in the event of a host or Availability Zone failure.

This HA solution for the Db2 LUW database with automatic failover is managed using IBM Tivoli System Automation for Multiplatforms (Tivoli SA MP) technology with IBM Db2 high availability instance configuration utility (db2haicu). However, this solution is not supported on AWS Cloud deployment because the automatic failover may not work as expected.

In this blog post, we will go through the steps to set up an HA two-host Db2 cluster with automatic failover managed by IBM Db2 Pacemaker with quorum device setup on a third EC2 instance. We will also set up an overlay IP as a virtual IP pointing to a primary instance initially. This instance is used for client connections and in case of failover, the overlay IP will automatically point to a new primary instance.

IBM Db2 Pacemaker is an HA cluster manager software integrated with Db2 Advanced Edition and Standard Edition on Linux (RHEL 8.1 and SLES 15). Pacemaker can provide HA and disaster recovery capabilities on AWS, and an alternative to Tivoli SA MP technology.

Note: The IBM Db2 v11.5.5 database server implemented in this blog post is a fully featured 90-day trial version. After the trial period ends, you can select the required Db2 edition when purchasing and installing the associated license files. Advanced Edition and Standard Edition are supported by this implementation.

Overview of solution

For this solution, we will go through the steps to install and configure IBM Db2 Pacemaker along with overlay IP as virtual IP for the clients to connect to the database. This blog post also includes prerequisites, and installation and configuration instructions to achieve an HA Db2 database on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1. Cluster management using IBM Db2 Pacemaker

Prerequisites for installing Db2 Pacemaker

To set up IBM Db2 Pacemaker on a two-node HADR (high availability disaster recovery) cluster, the following prerequisites must be met.

  • Set up instance user ID and group ID.

Instance user id and group id’s must be set up as part of Db2 Server installation which can be verified as follows:

grep db2iadm1 /etc/group
grep db2inst1 /etc/group

  • Set up host names for all the hosts in /etc/hosts file on all the hosts in the cluster.

For both of the hosts in the HADR cluster, ensure that the host names are set up as follows.

Format: ipaddress fully_qualified_domain_name alias

  • Install kornshell (ksh) on both of the hosts.

sudo yum install ksh -y

  • Ensure that all instances have TCP/IP connectivity between their ethernet network interfaces.
  • Enable password less secure shell (ssh) for the root and instance user IDs across both instances.After the password less root ssh is enabled, verify it using the “ssh <host name> -l root ls” command (hostname is either an alias or fully-qualified domain name).

ssh <host name> -l root ls

  • Activate HADR for the Db2 database cluster.
  • Make available the IBM Db2 Pacemaker binaries in the /tmp folder on both hosts for installation. The binaries can be downloaded from IBM download location (login required).

Installation steps

After completing all prerequisites, run the following command to install IBM Db2 Pacemaker on both primary and standby hosts as root user.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y
dnf install */*.rpm -y

cp /tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2/db2cm /home/db2inst1/sqllib/adm

chmod 755 /home/db2inst1/sqllib/adm/db2cm

Run the following command by replacing the -host parameter value with the alias name you set up in prerequisites.

/home/db2inst1/sqllib/adm/db2cm -copy_resources
/tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2agents -host <host>

After the installation is complete, verify that all required resources are created as shown in Figure 2.

ls -alL /usr/lib/ocf/resource.d/heartbeat/db2*

Figure 2. List of heartbeat resources

Configuring Pacemaker

After the IBM Db2 Pacemaker is installed on both primary and standby hosts, initiate the following configuration commands from only one of the hosts (either primary or standby hosts) as root user.

  1. Create the cluster using db2cm utility.Create the Pacemaker cluster using db2cm utility using the following command. Before running the command, replace the -domain and -host values appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -cluster -domain <anydomainname> -publicEthernet eth0 -host <primary host alias> -publicEthernet eth0 -host <standby host alias>

Note: Run ifconfig to get the –publicEthernet value and replace in the former command.

  1. Create instance resource model using the following commands.Modify -instance and -host parameter values in the following command before running.

/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <primary host alias>
/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <standby host alias>

  1. Create the database instance using db2cm utility. Modify -db parameter value accordingly.

/home/db2inst1/sqllib/adm/db2cm -create -db TESTDB -instance db2inst1

After configuring Pacemaker, run crm status command from both the primary and standby hosts to check if the Pacemaker is running with automatic failover activated.

Figure 3. Pacemaker cluster status

Quorum device setup

Next, we shall set up a third lightweight EC2 instance that will act as a quorum device (QDevice) which will act as a tie breaker avoiding a potential split-brain scenario. We need to install only corsync-qnetd* package from the Db2 Pacemaker cluster software.

Prerequisites (quorum device setup)

  1. Update /etc/hosts file on Db2 primary and standby instances to include the host details of QDevice EC2 instance.
  2. Set up password less root ssh access between Db2 instances and the QDevice instance.
  3. Ensure TCP/IP connectivity between the Db2 instances and the QDevice instance on port 5403.

Steps to set up quorum device

Run the following commands on the quorum device EC2 instance.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/
dnf install */corosync-qnetd* -y

  1. Run the following command from one of the Db2 instances to join the quorum device to the cluster by replacing the QDevice value appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -qdevice <hostnameofqdevice>

  1. Verify the setup using the following commands.

From any Db2 servers:

/home/db2inst1/sqllib/adm/db2cm -list

From QDevice instance:

corosync-qnetd-tool -l

Figure 4. Quorum device status

Setting up overlay IP as virtual IP

For HADR activated databases, virtual IP provides a common connection point for the clients so that in case of failovers there is no need to update the connection strings with the actual IP address of the hosts. Furtermore, the clients can continue to establish the connection to the new primary instance.

We can use the overlay IP address routing on AWS to send the network traffic to HADR database servers within Amazon Virtual Private Cloud (Amazon VPC) using a route table so that the clients can connect to the database using the overlay IP from the same VPC (any Availability Zone) where the database exists. aws-vpc-move-ip is a resource agent from AWS which is available along with the Pacemaker software that helps to update the route table of the VPC.

If you need to connect to the database using overlay IP from on-premises or outside of the VPC (different VPC than database servers), then additional setup is needed using either AWS Transit Gateway or Network Load Balancer.

Prerequisites (setting up overlay IP as virtual IP)

  • Choose the overlay IP address range which needs to be configured. This IP should not be used anywhere in the VPC or on-premises, and should be a part of the private IP address range as defined in RFC 1918. If the VPC is configured in the range of or, we can use the overlay IP from the range of will use the following IP and ethernet settings.

  • To route traffic through overlay IP, we need to disable source and target destination checks on the primary and standby EC2 instances.

aws ec2 modify-instance-attribute –profile <AWS CLI profile> –instance-id EC2-instance-id –no-source-dest-check

Steps to configure overlay IP

The following commands can be run as root user on the primary instance.

  1. Create the following AWS Identity and Access Management (IAM) policy and attach it to the instance profile. Update region, account_id, and routetableid values.
  "Version": "2012-10-17",
  "Statement": [
      "Sid": "Stmt0",
      "Effect": "Allow",
      "Action": "ec2:ReplaceRoute",
      "Resource": "arn:aws:ec2:<region>:<account_id>:route-table/<routetableid>"
      "Sid": "Stmt1",
      "Effect": "Allow",
      "Action": "ec2:DescribeRouteTables",
      "Resource": "*"
  1. Add the overlay IP on the primary instance.

ip address add dev eth0

  1. Update the route table (used in Step 1) with the overlay IP specifying the node with the Db2 primary instance. The following command returns True.

aws ec2 create-route –route-table-id <routetableid> –destination-cidr-block –instance-id <primrydb2instanceid>

  1. Create a file overlayip.txt with the following command to create the resource manager for overlay ip.


primitive db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP ocf:heartbeat:aws-vpc-move-ip \
  params ip= routing_table=<routetableid> interface=<ethernet> profile=<AWS CLI profile name> \
  op start interval=0 timeout=180s \
  op stop interval=0 timeout=180s \
  op monitor interval=30s timeout=60s

eifcolocation db2_db2inst1_db2inst1_TESTDB_AWS_primary-colocation inf:


order order-rule-db2_db2inst1_db2inst1_TESTDB-then-primary-oip Mandatory:

db2_db2inst1_db2inst1_TESTDB-clone db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP
location prefer-node1_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <primaryhostname>
location prefer-node2_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <standbyhostname>

The following parameters must be replaced in the resource manager create command in the file.

    • Name of the database resource agent (This can be found through crm config show | grep primitive | grep DBNAME command. For this example, we will use: db2_db2inst1_db2inst1_TESTDB)
    • Overlay IP address (created earlier)
    • Routing table ID (used earlier)
    • AWS command-line interface (CLI) profile name
    • Primary and standby host names
  1. After the file with commands is ready, run the following command to create the overlay IP resource manager.

crm configure load update overlayip.txt

  1. Next, create the VIP resource manager—not in managed state. Run the following command to manage and start the resource.

crm resource manage db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

  1. Validate the setup with crm status command.

Figure 5. Pacemaker cluster status along with overlay IP resource

Test failover with client connectivity

For the purpose of this testing, launch another EC2 instance with Db2 client installed, and catalog the Db2 database server using overlay IP.

Figure 6. Database directory list

Establish a connection with the Db2 primary instance using the cataloged alias (created earlier) using overlay IP address.

Figure 7. Connect to database

If we connect to the primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 8.

Check client connections before failover

Figure 8. Check client connections before failover

Next, let’s stop the primary Db2 instance and check if the Pacemaker cluster promoted the standby to primary and we can still connect to the database using the overlay IP, which now points to the new primary instance.

If we check the CRM status from the new primary instance, we can see that the Pacemaker cluster has promoted the standby database to new primary database as shown in Figure 9.

Figure 9. Automatic failover to standby

Let’s go back to our client and reestablish the connection using the cataloged DB alias created using overlay IP.

Figure 10. Database reconnection after failover

If we connect to the new promoted primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 11.

Check client connections after failover

Figure 11. Check client connections after failover

Cleaning up

To avoid incurring future charges, terminate all EC2 instances which were created as part of the setup referencing this blog post.


In this blog post, we have set up automatic failover using IBM Db2 Pacemaker with overlay (virtual) IP to route traffic to secondary database instance during failover, which helps to reconnect to the database without any manual intervention. In addition, we can also enable automatic client reroute using the overlay IP address to achieve a seamless failover connectivity to the database for mission-critical workloads.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Build Your Own Game Day to Support Operational Resilience

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/build-your-own-game-day-to-support-operational-resilience/

Operational resilience is your firm’s ability to provide continuous service through people, processes, and technology that are aware of and adaptive to constant change. Downtime of your mission-critical applications can not only damage your reputation, but can also make you liable to multi-million-dollar financial fines.

One way to test operational resilience is to simulate life-like system failures. An effective way to do this is by running events in your organization known as game days. Game days test systems, processes, and team responses and help evaluate your readiness to react and recover from operational issues. The AWS Well-Architected Framework recommends game days as a key strategy to develop and operate highly resilient systems because they focus not only on technology resilience issues but identify people and process gaps.

This blog post will explain how you can apply game day concepts to your workloads to help achieve a highly resilient workload.

Why does operational resilience matter from a regulatory perspective?

In March 2021, the Bank of England, Prudential Regulation Authority, and Financial Conduct Authority published their Building operational resilience: Feedback to CP19/32 and final rules policy. In this policy, operational resilience refers to a firm’s ability to prevent, adapt, and respond to and return to a steady system state when a disruption occurs. Further, firms are expected to learn and implement process improvements from prior disruptions.

This policy will not apply to everyone. However, across the board if you don’t establish operational resilience strategies, you are likely operating at an increased risk. If you have a service disruption, you may incur lost revenue and reputational damage.

What does it mean to be operationally resilient?

The final policy provides guidance on how firms should achieve operational resilience, which includes but is not limited to the following:

  • Identify and prioritize services based on the potential of intolerable harm to end consumers or risk to market integrity.
  • Define appropriate maximum impact tolerance of an important business service. This is reviewed annually using metrics to measure impact tolerance and answers questions like, “How long (in hours) can a service be offline before causing intolerable harm to end consumers?”
  • Document a complete view of all the aspects required to deliver each important service. This includes people, processes, technology, facilities, and information (resources). Firms should also test their ability to remain within the impact tolerances and provide assurance of resilience along with areas that need to be addressed.

What is a game day?

The AWS Well-Architected Framework defines a game day as follows:

“A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds “muscle memory” on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.

In AWS, your game days can be carried out with replicas of your production environment using AWS CloudFormation. This enables you to test in a safe environment that resembles your production environment closely.”

Running game days that simulate system failure helps your organization evaluate and build operational resilience.

How can game days help build operational resilience?

Running a game day alone is not sufficient to ensure operational resilience. However, by navigating the following process to set up and perform a game day, you will establish a best practice-based approach for operating resilient systems.

Stage 1 – Identify key services

As part of setting up a game day event, you will catalog and identify business-critical services.

Game days are performed to test services where operational failure could result in significant financial, customer, and/or reputational impact to the firm. Game days can also evaluate other key factors, like the impact of a failure on the wider market where your firm operates.

For example, a firm may identify its digital banking mobile application from which their customers can initiate payments as one of its important business services.

Stage 2 – Map people, process, and technology supporting the business service

Game days are holistic events. To get a full picture of how the different aspects of your workload operate together, you’ll generate a detailed map of people and processes as they interact and operate the technical and non-technical components of the system. This mapping also helps your end consumers understand how you will provide them reliable support during a failure.

Stage 3 – Define and perform failure scenarios

Systems fail, and failures often happen when a system is operating at scale because various services working together can introduce complexity. To ensure operational resilience, you must understand how systems react and adapt to failures. To do this, you’ll identify and perform failure scenarios so you can understand how your systems will react and adapt and build “muscle memory” for actual events.

AWS builds to guard against outages and incidents, and accounts for them in the design of AWS services—so when disruptions do occur, their impact on customers and the continuity of services is as minimal as possible. At AWS, we employ compartmentalization throughout our infrastructure and services. We have multiple constructs that provide different levels of independent, redundant components.

Stage 4 – Observe and document people, process, and technology reactions

In running a failure scenario, you’ll observe how technological and non-technological components react to and recover from failure. This helps you identify failures and fix them as they cascade through impacted components across your workload. This also helps identify technical and operational challenges that might not otherwise be obvious.

Stage 5 – Conduct lessons learned exercises

Game days generate information on people, processes, and technology and also capture data on customer impact, incident response and remediation timelines, contributing factors, and corrective actions. By incorporating these data points into the system design process, you can implement continuous resilience for critical systems.

How to run your own game day in AWS

You may have heard of AWS GameDay events. This is an AWS organized event for our customers. In this team-based event, AWS provides temporary AWS accounts running fictional systems. Failures are injected into these systems and teams work together on completing challenges and improving the system architecture.

However, the method and tooling and principles we use to conduct AWS GameDays are agnostic and can be applied to your systems using the following services:

  • AWS Fault Injection Simulator is a fully managed service that runs fault injection experiments on AWS, which makes it easier to improve an application’s performance, observability, and resiliency.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.
  • AWS X-Ray helps you analyze and debug production and distributed applications (such as those built using a microservices architecture). X-Ray helps you understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors.

Please note you are not limited to the tools listed for simulating failure scenarios. For complete coverage of failure scenarios, we encourage you to explore additional tools and strategies.

Figure 1 shows a reference architecture example that demonstrates conducting a game day for an Open Banking implementation.

Game day reference architecture example

Figure 1. Game day reference architecture example

Game day operators use Fault Injection Simulator to catalog and perform failure scenarios to be included in your game day. For example, in our Open Banking use case in Figure 1, a failure scenario might be for the business API functions servicing Open Banking requests to abruptly stop working. You can also combine such simple failure scenarios into a more complex one with failures injected across multiple components of the architecture.

Game day participants use CloudWatch, X-Ray, and their own custom observability and monitoring tooling to identify failures as they cascade through systems.

As you go through the process of identifying, communicating, and fixing issues, you’ll also document impact of failures on end-users. From there, you’ll generate lessons learned to holistically improve your workload’s resilience.


In this blog, we discussed the significance of ensuring operational resilience. We demonstrated how to set up game days and how they can supplement your efforts to ensure operational resilience. We discussed how using AWS services such as Fault Injection Simulator, X-Ray, and CloudWatch can be used to facilitate and implement game day failure scenarios.

Ready to get started? For more information, check out our AWS Fault Injection Simulator User Guide.

Related information:

What to Consider when Selecting a Region for your Workloads

Post Syndicated from Saud Albazei original https://aws.amazon.com/blogs/architecture/what-to-consider-when-selecting-a-region-for-your-workloads/

The AWS Cloud is an ever-growing network of Regions and points of presence (PoP), with a global network infrastructure that connects them together. With such a vast selection of Regions, costs, and services available, it can be challenging for startups to select the optimal Region for a workload. This decision must be made carefully, as it has a major impact on compliance, cost, performance, and services available for your workloads.

Evaluating Regions for deployment

There are four main factors that play into evaluating each AWS Region for a workload deployment:

  1. Compliance. If your workload contains data that is bound by local regulations, then selecting the Region that complies with the regulation overrides other evaluation factors. This applies to workloads that are bound by data residency laws where choosing an AWS Region located in that country is mandatory.
  2. Latency. A major factor to consider for user experience is latency. Reduced network latency can make substantial impact on enhancing the user experience. Choosing an AWS Region with close proximity to your user base location can achieve lower network latency. It can also increase communication quality, given that network packets have fewer exchange points to travel through.
  3. Cost. AWS services are priced differently from one Region to another. Some Regions have lower cost than others, which can result in a cost reduction for the same deployment.
  4. Services and features. Newer services and features are deployed to Regions gradually. Although all AWS Regions have the same service level agreement (SLA), some larger Regions are usually first to offer newer services, features, and software releases. Smaller Regions may not get these services or features in time for you to use them to support your workload.

Evaluating all these factors can make coming to a decision complicated. This is where your priorities as a business should influence the decision.

Assess potential Regions for the right option

Evaluate by shortlisting potential Regions.

  • Check if these Regions are compliant and have the services and features you need to run your workload using the AWS Regional Services website.
  • Check feature availability of each service and versions available, if your workload has specific requirements.
  • Calculate the cost of the workload on each Region using the AWS Pricing Calculator.
  • Test the network latency between your user base location and each AWS Region.

At this point, you should have a list of AWS Regions with varying cost and network latency that looks something Table 1:

Region Compliance Latency Cost Services / Features
Region A

15 ms $$
Region B

20 ms



Region C

80 ms $

Table 1. Region evaluation matrix

Many workloads such as high performance computing (HPC), analytics, and machine learning (ML), are not directly linked to a customer-facing application. These would not be sensitive to network latency, so you may want to select the Region with the lowest cost.

Alternatively, you may have a backend service for a game or mobile application in which network latency has a direct impact on user experience. Measure the difference in network latency between each Region, and determine if it is worth the increased cost. You can leverage the Amazon CloudFront edge network, which helps reduce latency and increases communication quality. This is because it uses a fully managed AWS network infrastructure, which connects your application to the edge location nearest to your users.

Multi-Region deployment

You can also split the workload across multiple Regions. The same workload may have some components that are sensitive to network latency and some that are not. You may determine you can benefit from both lower network latency and reduced cost at the same time. Here’s an example:

Figure 1. Multi-Region deployment optimized for feature availability

Figure 1. Multi-Region deployment optimized for feature availability

Figure 1 shows a serverless application deployed at the Bahrain Region (me-south-1) which has a close proximity to the customer base in Riyadh, Saudi Arabia. Application users enjoy a lower latency network connecting to the AWS Cloud. Analytics workloads are deployed in the Ireland Region (eu-west-1), which has a lower cost for Amazon Redshift and other features.

Note that data transfer between Regions is not free and, in this example, costs $0.115 per GB. However, even with this additional cost factored in, running the analytical workload in Ireland (eu-west-1) is still more cost-effective. You can also benefit from additional capabilities and features that may have not yet been released in the Bahrain (me-south-1) Region.

This multi-Region setup could also be beneficial for applications with a global user base. The application can be deployed in multiple secondary AWS Regions closer to the user base locations. It uses a primary AWS Region with a lower cost for consolidated services and latency-insensitive workloads.

Figure 2. Multi-Region deployment optimized for network latency

Figure 2. Multi-Region deployment optimized for network latency

Figure 2 allows for an application to span multiple Regions to serve read requests with the lowest network latency possible. Each client will be routed to the nearest AWS Region. For read requests, an Amazon Route 53 latency routing policy will be used. For write requests, an endpoint routed to the primary Region will be used. This primary endpoint can also have periodic health checks to failover to a secondary Region for disaster recovery (DR).

Other factors may also apply for certain applications such as ones that require Amazon EC2 Spot Instances. Regions differ in size, with some having three, and others up to six Availability Zones (AZ). This results in varying Spot Instance capacity available for Amazon EC2. Choosing larger Regions offers larger Spot capacity. A multi-Region deployment offers the most Spot capacity.


Selecting the optimal AWS Region is an important first step when deploying new workloads. There are many other scenarios in which splitting the workload across multiple AWS Regions can result in a better user experience and cost reduction. The four factors mentioned in this blog post can be evaluated together to find the most appropriate Region to deploy your workloads.

If the workload is bound by any regulations, shortlist the Regions that are compliant. Measure the network latency between each Region and the location of the user base. Estimate the workload cost for each Region. Check that the shortlisted Regions have the services and features your workload requires. And finally, determine if your workload can benefit from running in multiple Regions.

Dive deeper into the AWS Global Infrastructure Website for more information.

Optimizing Cloud Infrastructure Cost and Performance with Starburst on AWS

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/architecture/optimizing-cloud-infrastructure-cost-and-performance-with-starburst-on-aws/

Amazon Web Services (AWS) Cloud is elastic, convenient to use, easy to consume, and makes it simple to onboard workloads. Because of this simplicity, the cost associated with onboarding workloads is sometimes overlooked.

There is a notion that when an organization moves its workload to the cloud, agility, scalability, performance, and cost issues will disappear. While this may be true for agility and scalability, you must optimize your workload. You can do this with services like Amazon EC2 Auto Scaling via Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot Instances to realize the performance and cost benefits the cloud offers.

In this blog post, we show you how Starburst Enterprise (Starburst) addressed a sudden increase in cost for their data analytics platform as they grew their internal teams and scaled out their infrastructure. After reviewing their architecture and deployments with AWS specialist architects, Starburst and AWS concluded that they could take the following steps to greatly reduce costs:

  1. Use Spot Instances to run workloads.
  2. Add Amazon EC2 Auto Scaling into their training and demonstration environments, because the Starburst platform is designed to elastically scale up and down.

For analytics workloads, when you rein in costs, you typically rein in performance. Starburst and AWS worked together to balance the cost and performance of Starburst’s data analytics platform while also harnessing the flexibility, scalability, security, and performance of the cloud.

What is Starburst Enterprise?

Starburst provides a Massively Parallel Processing SQL (MPPSQL) engine based on open source Trino. It is an analytics platform that provides the cornerstone in customers’ intelligent data mesh and offers the following benefits and services:

  • The platform gives you a single point to access, monitor, and secure your data mesh.
  • The platform gives you options for your data compute. You no longer have to wait on data migrations or extract, transform, and load (ETL), there is no vendor lock-in, and there is no need to swap out your existing analytics tools.
  • Starburst Stargate (Stargate) ensures that large jobs are completed within each data domain of your data mesh. Only the result set is retrieved from the domain.
    • Stargate reduces data output, which reduces costs and increases performance.
    • Data governance policies can also be applied uniquely in each data domain, ensuring security compliance and federation.

As shown in Figure 1, there are many connectors for input and output that ensure you experience improved performance and security.

Starburst platform

Figure 1. Starburst platform

Integrating Starburst Enterprise with AWS

As shown in Figure 2, Starburst Enterprise uses AWS services to deliver elastic scaling and optimize cost. The platform is architected with decoupled storage and compute. This allows the platform to scale as needed to analyze petabytes of data.

The platform can be deployed via AWS CloudFormation or Amazon Elastic Kubernetes Service (Amazon EKS). Starburst on AWS allows you to run analytic queries across AWS data sources and on-premises systems such as Teradata and Oracle.

Deployment architecture of Starburst platform on AWS

Figure 2. Deployment architecture of Starburst platform on AWS

Amazon EC2 Auto Scaling

Enterprises have diverse analytic workloads; their compute and memory requirements vary with time. Starburst uses Amazon EKS and Amazon EC2 Auto Scaling to elastically scale compute resources to meet the demands of their analytics workloads.

  • Amazon EC2 Auto Scaling ensures that you have the compute capacity for your workloads to handle the load elastically. It is used to architect sophisticated, elastic, and resilient applications on the AWS Cloud.
    • Starburst uses the scheduled scaling feature of Amazon EC2 Auto Scaling to scale the cluster up/down based on time. Thus, they incur no costs when the cluster is not in use.
  • Amazon EKS is a fully managed Kubernetes service that allows you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane.

Scaling cloud resource consumption on demand has a major impact on controlling cloud costs. Starburst supports scaling down elastically, which means removing compute resources doesn’t impact the underlying processes.

Amazon EC2 Spot Instances

Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. They are available at up to a 90% discount compared to On-Demand Instance prices. If EC2 needs capacity for On-Demand Instance usage, Spot Instances can be interrupted by Amazon EC2 with a two-minute notification. There are many ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance.

Starburst has integrated Spot Instances as a part of the Amazon EKS managed node groups to cost optimize the analytics workloads. This best practice of instance diversification is implemented by using the integration eksctl and instance selector with dry-run flag. This creates a list of instances of same size (vCPU/Mem ratio) and uses them in the underlying node groups.

Same size instances are required to make best use of Kubernetes Cluster Autoscaler, which is used to manage the size of the cluster.

Scaling down, handling interruptions, and provisioning compute

“Scaling in” an active application is tricky, but Starburst was built with resiliency in mind, and it can effectively manage shut downs.

Spot Instances are an ideal compute option because Starburst can handle potential interruptions natively. Starburst also uses Amazon EKS managed node groups to provision nodes in the cluster. This requires significantly less operational effort compared to using self-managed node groups. It allows Starburst to enforce best practices like capacity optimized allocation strategy, capacity rebalancing, and instance diversification.

When you need to “scale out” the deployment, Amazon EKS and Amazon EC2 Auto Scaling help to provision capacity, as depicted in Figure 3.

Depicting “scale out” in a Starburst deployment

Figure 3. Depicting “scale out” in a Starburst deployment

Benefits realized from using AWS services

In a short period, Starburst was able to increase the number of people working on AWS. They added five times the number of Solutions Architects, they have previously. Additionally, in their initial tests of their new deployment architecture, their Solutions Architects were able to complete up to three times the amount of work than they had been able to previously. Even after the workload increased more than 15 times, with two simple changes they only had a slight increase in total cost.

This cost and performance optimization allows Starburst to be more productive internally and realize value for each dollar spent. This further justified investing more into building out infrastructure footprint.


In building their architecture with AWS, Starburst realized the importance of having a robust and comprehensive cloud administration plan that they can implement and manage going forward. They are now able to balance the cloud costs with performance and stability, even after considering the SLA requirements. Starburst is planning to teach their customers about the Spot Instance and Amazon EC2 Auto Scaling best practices to ensure they maintain a cost and performance optimized cloud architecture.

If you want to see the Starburst data analytics platform in action, you can get a free trial in the AWS Marketplace: Starburst Data Free Trial.

Disaster Recovery (DR) for a Third-party Interactive Voice Response on AWS

Post Syndicated from Priyanka Kulkarni original https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-for-a-third-party-interactive-voice-response-on-aws/

Voice calling systems are prevalent and necessary to many businesses today. They are usually designed to provide a 24×7 helpline support across multiple domains and use cases. Reliability and availability of such systems are important for a good customer experience. The thoughtful design of a cost-optimized solution will allow your business to sustain the system into the future.

We address a scenario in which you are mandated to host the workload on a corporate data center (DC), and configure the backup site on Amazon Web Services (AWS). Since the primary objective of a backup site is disaster recovery (DR) management, this site is often referred to as a DR site.

Disaster Recovery on AWS

DR strategy defines the recovery objectives for downtime and data loss. The workload has a recovery time objective (RTO) and a recovery point objective (RPO). RTO is the maximum acceptable delay between the interruption of service and the restoration of service. RPO is the maximum acceptable amount of time since the last data recovery point. AWS defines four DR strategies in increasing order of complexity, and decreasing order of RTO and RPO. These are backup and restore, active/passive (pilot light or warm standby), or active/active.

Figure 1. Disaster recovery (DR) options

Figure 1. Disaster recovery (DR) options

In our use case, the DR site on AWS must serve the user traffic with RPO and RTO in seconds. Warm standby is the optimal choice in this case. It is a scaled-down version of a fully functional environment, and is always running in the cloud.

Amazon Connect is an omnichannel cloud contact center that helps you provide great customer service at a lower cost. But in some situations, Amazon Connect may not be available. In other cases, the customer may want to use their home developed or third-party contact center application. Our solution is designed to help in both these scenarios.

This architecture enables customers facing challenges of cost overhead with redundant Session Initiation Protocol (SIP) trunks for the DC and DR sites. It allows you to optimize your spend and yet retain a reliable workflow.

SIP trunk communication on AWS

Let’s see how the SIP trunk termination on the AWS network handles the failover scenario of a third-party IVR application installed on Amazon EC2 at the DR site.

There will be two connections made from the AWS Direct Connect location (DX). The first will be for a point-to-point connectivity between the corporate DC and the AWS DR site. The second connection will be originating from the multiplexer (MUX) of the telecom provider who is providing you the SIP trunk.

The telecom provider will lay the SIP trunk from its MUX to the customer router at the DX location. At this point, the mode of communication becomes IP-based. The telecom provider will send the call to the IP address attached to the Network Load Balancer (NLB) in Amazon Virtual Private Cloud (VPC).

Figure 2. Communication circuitry at telecom side

Figure 2. Communication circuitry at telecom side

AWS Network Load Balancers can now distribute traffic to AWS resources using their IP addresses and instance IDs as targets. You can also distribute the traffic with on-premises resources over AWS Direct Connect. Load balancing across AWS and on-premises resources using the same load balancer streamlines migrate-to-cloud, burst-to-cloud, or failover-to-cloud.

In the backup site, the NLB will point to the Session Border Controller (SBC). This is a special-purpose device that protects and regulates IP communications flows. You can bring your own SBC, or you can use an SBC offered in the AWS Marketplace.

Best practices for high availability of IVR solution on AWS

  • Configure the multiple Availability Zone (Multi-AZ) SBC setup
  • Make sure that the telecom provider for the SIP trunk is different from the internet service provider (ISP). This is for last mile connectivity for the DC from Direct Connect
  • Consider redundancy for Direct Connect by using a Site-to-Site VPN tunnel
Figure 3. Solution architecture of DR on AWS for a third-party IVR solution

Figure 3. Solution architecture of DR on AWS for a third-party IVR solution

Communication flow for an IVR solution deployed on a corporate DC and its DR on AWS

  1. The callers are received on the telecom providers SIP line, which terminates on the AWS Direct Connect location.
  2. At the DX location, you will configure a route in the AWS router to send the traffic to the IP address of the NLB. The NLB should be configured to perform health checks on the virtual machine in your on-premises DC. Based on these health checks, the NLB will do the routing and the failover.
  3. In a live scenario with successful health checks at the DC, the NLB will forward the call to the IP of the on-premises virtual machine. This is where the IVR application will be installed.
  4. The communication between the NLB in Amazon VPC and the virtual machine in DC, will happen over Direct Connect.
  5. In a DR scenario, the NLB will failover the communication to SBCs in Amazon VPC.


This solution is useful when a third-party IVR system is deployed in a corporate data center, and the passive DR site is hosted on AWS. Cost optimization on telecom components is an important aspect of this design. AWS Direct Connect provides dedicated connectivity to the AWS environment, from 50 Mbps up to 10 Gbps. This gives you managed and controlled latency. It also provides provisioned bandwidth, so your workload can connect to AWS resources in a reliable, scalable, and cost-effective way.

The solution in this blog explains the end-to-end flow of communication, from the user to the IVR agents. It also provides insights into managing failover and failback between DR and the DR site.

Further Reading:

Optimize costs by up to 70% with new Amazon T3 Dedicated Hosts

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/optimize-costs-by-up-to-70-with-new-amazon-t3-dedicated-hosts/

This post is written by Andy Ward, Senior Specialist Solutions Architect, and Yogi Barot, Senior Specialist Solutions Architect.

Customers have been taking advantage of Amazon Elastic Compute Cloud (Amazon EC2) Dedicated Hosts to enable them to use their eligible software licenses from vendors such as Microsoft and Oracle since the feature launched in 2015. Amazon EC2 Dedicated Hosts have gained new features over the years. For example, Customers can launch different-sized instances within the same instance family and use AWS License Manager to track and manage software licenses. Host Resource Groups have enabled customers to take advantage of automated host management. Further, the ability to use license included Windows Server on Dedicated Hosts has opened up new possibilities for cost-optimization.

The ability to bring your own license (BYOL) to Amazon EC2 Dedicated Hosts has been an invaluable cost-optimization tool for customers. Since the introduction of Dedicated Hosts on Amazon EC2, customers have requested additional flexibility to further optimize their ability to save on licensing costs on AWS. We listened to that feedback, and are now launching a new type of Amazon EC2 Dedicated Host to enable additional cost savings.

In this blog post, we discuss how our customers can benefit from the newest member of our Amazon EC2 Dedicated Hosts family – the T3 Dedicated Host. The T3 Dedicated Host is the first Amazon EC2 Dedicated Host to support general-purpose burstable T3 instances, providing the most cost-efficient way of using eligible BYOL software on dedicated hardware.

Introducing T3 Dedicated Hosts

When we talk to our customers about BYOL, we often hear the following:

  • We currently run our workloads on-premises and want to move our workloads to AWS with BYOL.
  • We currently benefit from oversubscribing CPU on our on-premises hosts, and want to retain our oversubscription benefits when bringing our eligible BYOL software to AWS.
  • How can we further cost-optimize our AWS environment, increasing the flexibility and cost effectiveness of our licenses?
  • Some of our virtual servers use minimal resources. How can we use smaller instance sizes with BYOL?

T3 Dedicated Hosts differ from our other EC2 Dedicated Hosts. Where our traditional EC2 Dedicated Hosts provide fixed CPU resources, T3 Dedicated Hosts support burstable instances capable of sharing CPU resources, providing a baseline CPU performance and the ability to burst when needed. Sharing CPU resources, also known as oversubscription, is what enables a single T3 Dedicated Host to support up to 4x more instances than comparable general-purpose Dedicated Hosts. This increase in the number of instances supported can enable customers to save on licensing and infrastructure costs by as much as 70%.

Advantages of T3 Dedicated Hosts

 T3 Dedicated Hosts drive a lower total cost of ownership (TCO) by delivering a higher instance density than any other EC2 Dedicated Host. Burstable T3 instances allow customers to consolidate a higher number of instances with low-to-moderate average CPU utilization on fewer hosts than ever before.

T3 Dedicated Hosts also offer smaller instance sizes, in a greater number of vCPU and memory combinations, than other EC2 Dedicated Hosts. Smaller instance sizes can contribute to lower TCO and help deliver consolidation ratios equivalent to or greater than on-premises hosts.

AWS hypervisor management features provide consistent performance for customer workloads. Customers can choose between a wide selection of instance configurations with different vCPU and memory sizes, mixing and matching instances sizes from t3.nano up to t3.2xlarge

You can use your existing eligible per-socket, per-core, or per-VM software licenses, including licenses for Windows Server, SQL Server, SUSE Linux Enterprise Server and Red Hat Enterprise Linux. As licensing terms often change over time, we recommend checking eligibility for BYOL with your license vendor.

You can track your license usage using your license configuration in AWS License Manager. For more information, see the Track your license using AWS License Manager blog post and the Manage Software Licenses with AWS License Manager video on YouTube.

When to use T3 Dedicated Hosts

T3 Dedicated Hosts are best suited for running instances such as small and medium databases and application servers, virtual desktops, and development and test environments. In common with on-premises hypervisor hosts that allow CPU oversubscription, T3 Dedicated Hosts are less suitable for workloads that experience correlated CPU burst patterns.

T3 Dedicated Hosts support all instance sizes of the T3 family, with a wide variety of CPU and RAM ratios. Additionally, as T3 Dedicated Hosts are powered by the AWS Nitro System, they support multiple instance sizes on a single host. Customers can run up to 192 instances on a single T3 Dedicated Host, each capable of supporting multiple processes. The maximum instance limits are shown in the following table:

Instance Family Sockets Physical Cores nano micro small medium large xlarge 2xlarge
t3 2 48 192 192 192 192 96 48 24






Any combination of T3 instance types can be run, up to the memory limit of the host (768GB). Examples of supported blended instance type combinations are:

  • 132 t3.small and 60 t3.large
  • 128 t3.small and 64 t3.large
  • 24 t3.xlarge and 12 t3.2xlarge

Use Cases

If you are looking for ways to decrease your license costs and host footprint in order to achieve the lowest TCO, then using T3 Dedicated Hosts enables a set of previously unavailable scenarios to help you achieve this goal. The ability to run a greater number of instances per host compared to existing Dedicated Hosts leads directly to lower licensing and infrastructure costs on AWS, for suitable BYOL workloads.

The following three scenarios are typical examples of benefits that can be realized by customers using T3 Dedicated Hosts.

Retaining Existing Server Consolidation Ratios While Migrating to AWS

On-premises, you are taking advantage of the fact that you can easily oversubscribe your physical CPUs on VMware hosts and achieve high-levels of consolidation. As you can license Windows Server on a per-physical-core basis, you only need to license the physical cores of the VMware hosts, and not the vCPUs of the Windows Server virtual machines.

  • You are currently running 7 x 48 core VMware Hosts on-premises.
  • Each host is running 150 x 2 vCPU low average-CPU-utilization Windows Server virtual machines.
  • You have Windows Server Datacenter licenses that are eligible for BYOL to AWS.

In this scenario, T3 Dedicated Hosts enable you to achieve similar, or better, levels of consolidation. Additionally, the number of Windows Server Datacenter licenses required in order to bring your workloads to AWS is reduced from 336 cores to 288 cores – a saving of 14%.

On-Premises VMware Hosts T3 Dedicated Hosts Savings
Physical Servers (48 Cores) 7 6
2 vCPU VMs per Host 150 192
Total number of VMs 1000 1000
Total Windows Server Datacenter Licenses (Per Core) 336 288 14%

Reducing License Requirements While Migrating To AWS

On-premises you are taking advantage of the fact that you can easily oversubscribe your physical CPUs on VMware hosts and achieve high-levels of consolidation. You can now achieve far greater levels of consolidation by moving your virtual machines to T3 Dedicated Hosts, which have double the amount of RAM compared to your current on-premises VMware hosts.

  • You are currently using 10 x 36 core, 384GB RAM VMware Hosts on-premises.
  • Each host is running 96 x 2 vCPU, 4GB RAM low average-CPU-utilization Windows Server virtual machines.

By taking advantage of oversubscription and the increased RAM on T3 Dedicated Hosts, you can now achieve far greater levels of consolidation. Additionally, you are able to reduce the number of Windows Server Datacenter licenses required for BYOL. In this scenario, you can achieve a license reduction from 360 cores to 240 cores – a 33% saving.

On-Premises VMware Hosts T3 Dedicated Hosts Savings
Physical Servers 10 5
Physical Cores per Host 36 48
RAM per Host (GB) 384 768
2 vCPU, 4GB RAM VMs per Host 96 192
Total number of VMs 960 960
Total Windows Server Datacenter Licenses (Per Core) = Number of Servers * Physical Core Count 10 * 36 = 360 5 * 48 = 240 33%

Reducing License and Infrastructure Cost by Migrating from C5 Dedicated Hosts to T3 Dedicated Hosts

In this scenario, you are taking advantage of the fact that you can bring your own eligible Windows Server and SQL Server licenses to AWS for use on Dedicated Hosts. However, as your instances all have low average-CPU-utilization, your current C5 Dedicated Hosts, with fixed CPU resources, are largely underutilized.

  • You are currently using C5 EC2 Dedicated Hosts on AWS with eligible Windows Server Datacenter licenses and SQL Server 2017 Enterprise Edition (BYOL).
  • Each host is running 36 x 2 vCPU low average-CPU-utilization Windows Server virtual machines.

By migrating to T3 Dedicated Hosts, you can achieve a substantial reduction in licensing costs. As the total number of physical cores requiring licensing is reduced, you can benefit from a corresponding reduction in the number of SQL Server Enterprise Edition licenses required – a saving of 71%.

C5 Dedicated Hosts T3 Dedicated Hosts Savings
Total Number of Hosts Required 28 6
2 vCPU, 4GB VMs per Host 36 192
Total number of VMs 1008 1008
Total SQL Server EE Licenses = Number of Servers * Physical Core Count 36 * 28 = 1008 48 * 6 = 288 71%


In this blog post, we described the new T3 Dedicated Hosts and how they help customers benefit from running more instances per host in BYOL scenarios. We showed that heavily oversubscribed on-premises environments can be migrated to T3 Dedicated Hosts on AWS while lowering existing licensing and infrastructure costs. We further showed how significant licensing and infrastructure savings can be realized by moving existing workloads from EC2 Dedicated Hosts with fixed CPU resources to new T3 Dedicated Hosts.

Visit the Dedicated Hosts, AWS License Manager and host resource group pages to get started with saving costs on licensing and infrastructure.

AWS can help you assess how your company can get the most out of cloud. Join the millions of AWS customers that trust us to migrate and modernize their most important applications in the cloud. To learn more on modernizing Windows Server or SQL Server, visit Windows on AWS. Contact us to start your migration journey today.

New – Amazon EC2 VT1 Instances for Live Multi-stream Video Transcoding

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-vt1-instances-for-live-multi-stream-video-transcoding/

Global demand for video content has been rapidly increasing and now has the major audiences of Internet and mobile network traffic. Over-the-top streaming services such as Twitch continue to see an explosion of content creators who are seeking live delivery with great image quality, while live event broadcasters are increasingly looking to embrace agile cloud infrastructure to reduce costs without sacrificing reliability, and efficiently scale with demand.

Today, I am happy to announce the general availability of Amazon EC2 VT1 instances that are designed to provide the best price performance for multi-stream video transcoding with resolutions up to 4K UHD. These VT1 instances feature Xilinx® Alveo™ U30 media accelerator transcoding cards with accelerated H.264/AVC and H.265/HEVC codecs and provide up to 30% better price per stream compared to the latest GPU-based EC2 instances and up to 60% better price per stream compared to the latest CPU-based EC2 instances.

Customers with their own live broadcast and streaming video pipelines can use VT1 instances to transcode video streams with resolutions up to 4K UHD. VT1 instances feature networking interfaces of up to 25 Gbps that can ingest multiple video streams over IP with low latency and low jitter. This capability makes it possible for these customers to fully embrace scalable, cost-effective, and resilient infrastructure.

Amazon EC2 VT1 Instance Type
EC2 VT1 instances are available in three sizes. The accelerated H.264/AVC and H.265/HEVC codecs are integrated into Xilinx Zynq ZU7EV SoCs. Each Xilinx® Alveo™ U30 media transcoding accelerator card contains two Zynq SoCs.

Instance size vCPUs Xilinx U30 card Memory Network bandwidth EBS-optimized bandwidth 1080p60 Streams per instance
vt1.3xlarge 12 1 24GB Up to 25 Gbps Up to 4.75 Gbps 8
vt1.6xlarge 24 2 48GB 25 Gbps 4.75 Gbps 16
vt1.24xlarge 96 8 192GB 25 Gbps 19 Gbps 64

The VT1 instances are suitable for transcoding multiple streams per instance. The streams can be processed independently in parallel or mixed (picture-in-picture, side-by-side, transitions). The vCPU cores help with implementing image processing, audio processing, and multiplexing. The Xilinx® Alveo™ U30 card can simultaneously output multiple streams at different resolutions (1080p, 720p, 480p, and 360p) and in both H.264 and H.265.

Each VT1 instance can be configured to produce parallel encoding with different settings, resolutions and transmission bit rate (“ABR ladders“). For example, a 4K UHD stream can be encoded at 60 frames per second with H.265 for high resolution display. Multiple lower resolutions can be encoded with H.264 for delivery to standard displays.

Get Started with EC2 VT1 Instances
You can now launch VT1 instances in the Amazon EC2 console, AWS Command Line Interface (AWS CLI), or using an SDK with the Amazon EC2 API.

We provide a number of sample video processing pipelines for the VT1 instances. There are tutorials and code examples in the GitHub repository that cover how to tune the codecs for image quality and transcoding latency, call the runtime for the U30 cards directly from your own applications, incorporate video filters such as titling and watermarking, and deploy with container orchestration frameworks.

Xilinx provides the “Xilinx Video Transcoding SDK” which includes:

VT1 instances can be coupled with Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) to efficiently scale transcoding workloads and with Amazon CloudFront to deliver content globally. VT1 instances can also be launched with Amazon Machine Images (AMIs) and containers developed by AWS Marketplace partners, such as Nginx for supplemental video processing functionality.

You can complement VT1 instances with AWS Media Services for reliable packaging and origination of transcoded content. To learn more, you can use a solution library of Live Streaming on AWS to build a live video workflow using these AWS services.

Available Now
Amazon EC2 VT1 instances are now available in the US East (N. Virginia), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo) Regions. To learn more, visit the EC2 VT1 instance page. Please send feedback to the AWS forum for Amazon EC2 or through your usual AWS support contacts.

– Channy

Enabling parallel file systems in the cloud with Amazon EC2 (Part I: BeeGFS)

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/enabling-parallel-file-systems-in-the-cloud-with-amazon-ec2-part-i-beegfs/

This post was authored by AWS Solutions Architects Ray Zaman, David Desroches, and Ameer Hakme.

In this blog series, you will discover how to build and manage your own Parallel Virtual File System (PVFS) on AWS. In this post you will learn how to deploy the popular open source parallel file system, BeeGFS, using AWS D3en and I3en EC2 instances. We will also provide a CloudFormation template to automate this BeeGFS deployment.

A PVFS is a type of distributed file system that distributes file data across multiple servers and provides concurrent data access to multiple execution tasks of an application. PVFS focuses on high-performance access to large datasets. It consists of a server process and a client library, which allows the file system to be mounted and used with standard utilities. PVFS on the Linux OS originated in the 1990’s and today several projects are available including Lustre, GlusterFS, and BeeGFS. Workloads such as shared storage for video transcoding and export, batch processing jobs, high frequency online transaction processing (OLTP) systems, and scratch storage for high performance computing (HPC) benefit from the high throughput and performance provided by PVFS.

Implementation of a PVFS can be complex and expensive. There are many variables you will want to take into account when designing a PVFS cluster including the number of nodes, node size (CPU, memory), cluster size, storage characteristics (size, performance), and network bandwidth. Due to the difficulty in estimating the correct configuration, systems procured for on-premises data centers are typically oversized, resulting in additional costs, and underutilized resources. In addition, the hardware procurement process is lengthy and the installation and maintenance of the hardware adds additional overhead.

AWS makes it easy to run and fully manage your parallel file systems by allowing you to choose from a variety of Amazon Elastic Compute Cloud (EC2) instances. EC2 instances are available on-demand and allow you to scale your workload as needed. AWS storage-optimized EC2 instances offer up to 60 TB of NVMe SSD storage per instance and up to 336 TB of local HDD storage per instance. With storage-optimized instances, you can easily deploy PVFS to support workloads requiring high-performance access to large datasets. You can test and iterate on different instances to find the optimal size for your workloads.

D3en instances leverage 2nd-generation Intel Xeon Scalable Processors (Cascade Lake) and provide a sustained all core frequency up to 3.1 GHz. These instances provide up to 336 TB of local HDD storage (which is the highest local storage capacity in EC2), up to 6.2 GiBps of disk throughput, and up to 75 Gbps of network bandwidth.

I3en instances are powered by 1st or 2nd generation Intel® Xeon® Scalable (Skylake or Cascade Lake) processors with 3.1 GHz sustained all-core turbo performance. These instances provide up to 60 TB of NVMe storage, up to 16 GB/s of sequential disk throughput, and up to 100 Gbps of network bandwidth.

BeeGFS, originally released by ThinkParQ in 2014, is an open source, software defined PVFS that runs on Linux. You can scale the size and performance of the BeeGFS file-system by configuring the number of servers and disks in the clusters up to thousands of nodes.

BeeGFS architecture

D3en instances offer HDD storage while I3en instances offer NVMe SSD storage. This diversity allows you to create tiers of storage based on performance requirements. In the example presented in this post you will use four D3en.8xlarge (32 vCPU, 128 GB, 16x14TB HDD, 50 Gbit) and two I3en.12xlarge (48 vCPU, 384 GB, 4 x 7.5-TB NVMe) instances to create two storage tiers. You may choose different sizes and quantities to meet your needs. The I3en instances, with SSD, will be configured as tier 1 and the D3en instances, with HDD, will be configured as tier 2. One disk from each instance will be formatted as ext4 and used for metadata while the remaining disks will be formatted as XFS and used for storage. You may choose to separate metadata and storage on different hosts for workloads where these must scale independently. The array will be configured RAID 0, since it will provide maximum performance. Software replication or other RAID types can be employed for higher durability.

BeeGFS architecture

Figure 1: BeeGFS architecture

You will deploy all instances within a single VPC in the same Availability Zone and subnet to minimize latency. Security groups must be configured to allow the following ports:

  • Management service (beegfs-mgmtd): 8008
  • Metadata service (beegfs-meta): 8005
  • Storage service (beegfs-storage): 8003
  • Client service (beegfs-client): 8004

You will use the Debian Quick Start Amazon Machine Image (AMI) as it supports BeeGFS. You can enable Amazon CloudWatch to capture metrics.

How to deploy the BeeGFS architecture

Follow the steps below to create the PVFS described above. For automated deployment, use the CloudFormation template located at AWS Samples.

  1. Use the AWS Management Console or CLI to deploy one D3en.8xlarge instance into a VPC as described above.
  2. Log in to the instance and update the system:
    • sudo apt update
    • sudo apt upgrade
  3. Install the XFS utilities and load the kernel module:
    • sudo apt-get -y install xfsprogs
    • sudo modprobe -v xfs

Format the first disk ext4 as it is used for metadata, the rest are formatted xfs. The disks will appear as “nvme???” which actually represent the HDD drives on the D3en instances.

4. View a listing of available disks:

    • sudo lsblk

5. Format hard disks:

    • sudo mkfs -t ext4 /dev/nvme0n1
    • sudo mkfs -t xfs /dev/nvme1n1
    • Repeat this command for disks nvme2n1 through nvme15n1

6. Create file system mount points:

    • sudo mkdir /disk00
    • sudo mkdir /disk01
    • Repeat this command for disks disk02 through disk15

7. Mount the filesystems:

    • sudo mount /dev/nvme0n1 /disk00
    • sudo mount /dev/nvme0n1 /disk01
    • Repeat this command for disks disk02 through disk15

Repeat steps 1 through 7 on the remaining nodes. Remember to account for fewer disks for i3en.12xlarge instances or if you decide to use different instance sizes.

8. Add the BeeGFS Repo to each node:

    • sudo apt-get -y install gnupg
    • wget https://www.beegfs.io/release/beegfs_7.2.3/dists/beegfs-deb10.list
    • sudo cp beegfs-deb10.list /etc/apt/sources.list.d/
    • sudo wget -q https://www.beegfs.io/release/latest-stable/gpg/DEB-GPG-KEY-beegfs -O- | sudo apt-key add -
    • sudo apt update

9. Install BeeGFS management (node 1 only):

    • sudo apt-get -y install beegfs-mgmtd
    • sudo mkdir /beegfs-mgmt
    • sudo /opt/beegfs/sbin/beegfs-setup-mgmtd -p /beegfs-mgmt/beegfs/beegfs_mgmtd

10. Install BeeGFS metadata and storage (all nodes):

    • sudo apt-get -y install beegfs-meta beegfs-storage beegfs-meta beegfs-client beegfs-helperd beegfs-utils
    • # -s is unique ID based on node - change this!, -m is hostname of management server
    • sudo /opt/beegfs/sbin/beegfs-setup-meta -p /disk00/beegfs/beegfs_meta -s 1 -m ip-XXX-XXX-XXX-XXX
    • # Change -s to nodeID and -i to (nodeid)0(disk), -m is hostname of management server
    • sudo /opt/beegfs/sbin/beegfs-setup-storage -p /disk01/beegfs_storage -s 1 -i 101 -m ip-XXX-XXX-XXX-XXX
    • sudo /opt/beegfs/sbin/beegfs-setup-storage -p /disk02/beegfs_storage -s 1 -i 102 -m ip-XXX-XXX-XXX-XXX
    • Repeat this last command for the remaining disks disk03 through disk15

11. Start the services:

    • #Only on node1
    • sudo systemctl start beegfs-mgmtd
    • #All servers
    • sudo systemctl start beegfs-meta
    • sudo systemctl start beegfs-storage

At this point, your BeeGFS cluster is running and ready for use by a client system. The client system requires BeeGFS client software in order to mount the cluster.

12. Deploy an m5n.2xlarge instance into the same subnet as the PVFS cluster.

13. Log in to the instance, install, and configure the client:

    • sudo apt update
    • sudo apt upgrade
    • sudo apt-get -y install gnupg
    • #Need linux sources for client compilation
    • sudo apt-get -y install linux-source
    • sudo apt-get -y install linux-headers-4.19.0-14-all
    • wget https://www.beegfs.io/release/beegfs_7.2.3/dists/beegfs-deb10.list
    • sudo cp beegfs-deb10.list /etc/apt/sources.list.d/
    • sudo wget -q https://www.beegfs.io/release/latest-stable/gpg/DEB-GPG-KEY-beegfs -O- | sudo apt-key add -
    • sudo apt update
    • sudo apt-get -y install beegfs-client beegfs-helperd beegfs-utils
    • sudo /opt/beegfs/sbin/beegfs-setup-client -m ip-XXX-XXX-XXX-XX # use the ip address of the management node
    • sudo systemctl start beegfs-helperd
    • sudo systemctl start beegfs-client

14. Create the storage pools:

    • sudo beegfs-ctl --addstoragepool —desc="tier1" —targets=501,502,503,601,602,603
    • sudo beegfs-ctl --addstoragepool --desc="tier2" --targets=101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,201,202,203,204,205,206,207,208,209,210,
    • sudo beegfs-ctl --liststoragepools
    • Pool ID Pool Description                      Targets                 Buddy Groups
    • ======= ================== ============================ ============================
      • Default
      • tier1 501,502,503,601,602,603
      • tier2 101,102,103,104,105,106,107,
        • 108,109,110,111,112,113,114,
        • 115,201,202,203,204,205,206,
        • 207,208,209,210,211,212,213,
        • 214,215,301,302,303,304,305,
        • 306,307,308,309,310,311,312,
        • 313,314,315,401,402,403,404,
        • 405,406,407,408,409,410,411,
        • 412,413,414,415

15. Mount the pools to the file system:

    • sudo beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/tier1
    • sudo beegfs-ctl --setpattern --storagepoolid=3 /mnt/beegfs/tier2

The BeeGFS PVFS is now ready to be used by the client system.

How to test your new BeeGFS PVFS

BeeGFS provides StorageBench to evaluate the performance of BeeGFS on the storage targets. This benchmark measures the streaming throughput of the underlying file system and devices independent of the network performance. To simulate client I/O, this benchmark generates read/write locally on the servers without any client communication.

It is possible to benchmark specific targets or all targets together using the “servers” parameter. A “read” or “write” parameter sets the type pf test to perform. The “threads” parameter is set to the number of storage devices.

Try the following commands to test performance:

Write test (1x d3en):

sudo beegfs-ctl --storagebench --servers=1 --write --blocksize=512K —size=20G —threads=15

Write test (4x d3en):

sudo beegfs-ctl --storagebench --alltargets --write --blocksize=512K —size=20G —threads=15

Read test (4x d3en):

sudo beegfs-ctl --storagebench --servers=1,2,3,4 --read --blocksize=512K --size=20G --threads=15

Write test (1x i3en):

sudo beegfs-ctl --storagebench --servers=5 --write --blocksize=512K --size=20G --threads=3

Read test (2x i3en):

sudo beegfs-ctl --storagebench --servers=5,6 --read --blocksize=512K —size=20G —threads=3

StorageBench is a great way to test what the potential performance of a given environment looks like by reducing variables like network throughput and latency, but you may want to test in a more real-world fashion. For this, tools like ‘fio’ can generate mixed read/write workloads against files on the client BeeGFS mountpoint.

First, we need to define which directory goes to which Storage Pool (tier) by setting a pattern:

sudo beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/tier1 sudo beegfs-ctl --setpattern --storagepoolid=3 /mnt/beegfs/tier2

You can see how a file gets striped across the various disks in a pool by adding a file and running the command:

sudo beegfs-ctl —getentryinfo /mnt/beegfs/tier1/myfile.bin

Install fio:

sudo apt-get install -y fio

Now you can run a fio test against one of the tiers.  This example command runs eight threads running a 75/25 read/write workload against a 10-GB file:

sudo fio --numjobs=8 --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt/beegfs/tier1/test --bs=512k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75

Cleaning up

To avoid ongoing charges for resources you created, you should:


In this blog post we demonstrated how to build and manage your own BeeGFS Parallel Virtual File System on AWS. In this example, you created two storage tiers using the I3en and D3en. The I3en was used as the first tier for SSD storage and the D3en was used as a second tier for HDD storage. By using two different tiers, you can optimize performance to meet your application requirements.

Amazon EC2 storage-optimized instances make it easy to deploy the BeeGFS Parallel Virtual File System. Using combinations of SSD and HDD storage available on the I3en and D3en instance types, you can achieve the capacity and performance needed to run the most demanding workloads. Read more about the D3en and I3en instances.

Field Notes: Automate Disaster Recovery for AWS Workloads with Druva

Post Syndicated from Girish Chanchlani original https://aws.amazon.com/blogs/architecture/field-notes-automate-disaster-recovery-for-aws-workloads-with-druva/

This post was co-written by Akshay Panchmukh, Product Manager, Druva and Girish Chanchlani, Sr Partner Solutions Architect, AWS.

The Uptime Institute’s Annual Outage Analysis 2021 report estimated that 40% of outages or service interruptions in businesses cost between $100,000 and $1 million, while about 17% cost more than $1 million. To guard against this, it is critical for you to have a sound data protection and disaster recovery (DR) strategy to minimize the impact on your business. With the greater adoption of the public cloud, most companies are either following a hybrid model with critical workloads spread across on-premises data centers and the cloud or are all in the cloud.

In this blog post, we focus on how Druva, a SaaS based data protection solution provider, can help you implement a DR strategy for your workloads running in Amazon Web Services (AWS). We explain how to set up protection of AWS workloads running in one AWS account, and fail them over to another AWS account or Region, thereby minimizing the impact of disruption to your business.

Overview of the architecture

In the following architecture, we describe how you can protect your AWS workloads from outages and disasters. You can quickly set up a DR plan using Druva’s simple and intuitive user interface, and within minutes you are ready to protect your AWS infrastructure.

Figure 1. Druva architecture

Druva’s cloud DR is built on AWS using native services to provide a secure operating environment for comprehensive backup and DR operations. With Druva, you can:

  • Seamlessly create cross-account DR sites based on source sites by cloning Amazon Virtual Private Clouds (Amazon VPCs) and their dependents.
  • Set up backup policies to automatically create and copy snapshots of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Relational Database Service (Amazon RDS) instances to DR Regions based on recovery point objective (RPO) requirements.
  • Set up service level objective (SLO) based DR plans with options to schedule automated tests of DR plans and ensure compliance.
  • Monitor implementation of DR plans easily from the Druva console.
  • Generate compliance reports for DR failover and test initiation.

Other notable features include support for automated runbook initiation, selection of target AWS instance types for DR, and simplified orchestration and testing to help protect and recover data at scale. Druva provides the flexibility to adopt evolving infrastructure across geographic locations, adhere to regulatory requirements (such as, GDPR and CCPA), and recover workloads quickly following disasters, helping meet your business-critical recovery time objectives (RTOs). This unified solution offers taking snapshots as frequently as every five minutes, improving RPOs. Because it is a software as a service (SaaS) solution, Druva helps lower costs by eliminating traditional administration and maintenance of storage hardware and software, upgrades, patches, and integrations.

We will show you how to set up Druva to protect your AWS workloads and automate DR.

Step 1: Log into the Druva platform and provide access to AWS accounts

After you sign into the Druva Cloud Platform, you will need to grant Druva third-party access to your AWS account by pressing Add New Account button, and following the steps as shown in Figure 2.

Figure 2. Add AWS account information

Druva uses AWS Identity and Access Management (IAM) roles to access and manage your AWS workloads. To help you with this, Druva provides an AWS CloudFormation template to create a stack or stack set that generates the following:

  • IAM role
  • IAM instance profile
  • IAM policy

The generated Amazon Resource Name (ARN) of the IAM role is then linked to Druva so that it can run backup and DR jobs for your AWS workloads. Note that Druva follows all security protocols and best practices recommended by AWS. All access permissions to your AWS resources and Regions are controlled by IAM.

After you have logged into Druva and set up your account, you can now set up DR for your AWS workloads. The following steps will allow you to set up DR for AWS infrastructure.

Step 2: Identify the source environment

A source environment refers to a logical grouping of Amazon VPCs, subnets, security groups, and other infrastructure components required to run your application.

Figure 3. Create a source environment

In this step, create your source environment by selecting the appropriate AWS resources you’d like to set up for failover. Druva currently supports Amazon EC2 and Amazon RDS as sources that can be protected. With Druva’s automated DR, you can failover these resources to a secondary site with the press of a button.

Figure 4. Add resources to a source environment

Note that creating a source environment does not create or update existing resources or configurations in your AWS account. It only saves this configuration information with Druva’s service.

Step 3: Clone the environment

The next step is to clone the source environment to a Region where you want to failover in case of a disaster. Druva supports cloning the source environment to another Region or AWS account that you have selected. Cloning essentially replicates the source infrastructure in the target Region or account, which allows the resources to be failed over quickly and seamlessly.

Figure 5. Clone the source environment

Step 4: Set up a backup policy

You can create a new backup policy or use an existing backup policy to create backups in the cloned or target Region. This enables Druva to restore instances using the backup copies.

Figure 6. Create a new backup policy or use an existing backup policy

Figure 7. Customize the frequency of backup schedules

Step 5: Create the DR plan

A DR plan is a structured set of instructions designed to recover resources in the event of a failure or disaster. DR aims to get you back to the production-ready setup with minimal downtime. Follow these steps to create your DR plan.

  1. Create DR Plan: Press Create Disaster Recovery Plan button to open the DR plan creation page.

Figure 8. Create a disaster recovery plan

  1. Name: Enter the name of the DR plan.
    Service Level Objective (SLO): Enter your RPO and RTO.
    • Recovery Point Objective – Example: If you set your RPO as 24 hours, and your backup was scheduled daily at 8:00 PM, but a disaster occurred at 7:59 PM, you would be able to recover data that was backed up on the previous day at 8:00 PM. You would lose the data generated after the last backup (24 hours of data loss).
    • Recovery Time Objective – Example: If you set your RTO as 30 hours, when a disaster occurred, you would be able to recover all critical IT services within 30 hours from the point in time the disaster occurs.

Figure 9. Add your RTO and RPO requirements

  1. Create your plan based off the source environment, target environment, and resources.
Source Account By default, this is the Druva account in which you are currently creating the DR plan.
Source Environment Select the source environment applicable within the Source Account (your Druva account in which you’re creating the DR plan).
Target Account Select the same or a different target account.
Target Environment Select the Target Environment, applicable within the Target Account.
Create Policy If you do not have a backup policy, then you can create one.
Add Resources Add resources from the source environment that you want to restore. Make sure the verification column shows a ‘Valid Backup Policy’ status. This ensures that the backup policy is frequently creating copies as per the RPO defined previously.

Figure 10. Create DR plan based off the source environment, target environment, and resources

  1. Identify target instance type, test plan instance type, and run books for this DR plan.
Target Instance Type Target Instance Type can be selected. If instance type is not selected then:

  • Select an instance type which is large in size.
  • Select an instance type which is smaller.
  • Fail the creation of the instance.
Test Plan Instance Type There are many options. To reduce incurring cost, the lower instance type can be selected from all available AWS instance types.
Run Books Select this option if you would like to shutdown the source server after failover occurs.

Figure 11. Identify target instance type, test plan instance type, and run books

Step 6: Test the DR plan

After you have defined your DR plan, it is time to test it so that you can—when necessary—initiate a failover of resources to the target Region. You can now easily try this on the resources in the cloned environment without affecting your production environment.

Figure 12. Test the DR plan

Testing your DR plan will help you to find answers for some of the questions like: How long did the recovery take? Did I meet my RTO and RPO objectives?

Step 7: Initiate the DR plan

After you have successfully tested the DR plan, it can easily be initiated with the click of a button to failover your resources from the source Region or account to the target Region or account.


With the growth of cloud-based services, businesses need to ensure that mission-critical applications that power their businesses are always available. Any loss of service has a direct impact on the bottom line, which makes business continuity planning a critical element to any organization. Druva offers a simple DR solution which will help you keep your business always available.

Druva provides unified backup and cloud DR with no need to manage hardware, software, or costs and complexity. It helps automate DR processes to ensure your teams are prepared for any potential disasters while meeting compliance and auditing requirements.

With Druva, you can easily validate your RTO and RPO with automated regular DR testing, cross-account DR for protection against attacks and accidental deletions, and ensure backups are isolated from your primary production account for DR planning. With cross-Region DR, you can duplicate the entire Amazon VPC environment across Regions to protect you against Regionwide failures. In conclusion, Druva is a complete solution built with a goal to protect your native AWS workloads from any disasters.

To learn more, visit: https://www.druva.com/use-cases/aws-cloud-backup/

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.
Akshay Panchmukh

Akshay Panchmukh

Akshay Panchmukh is the Product Manager focusing on cloud-native workloads at Druva. He loves solving complex problems for customers and helping them achieve their business goals. He leverages his SaaS product and design thinking experience to help enterprises build a complete data protection solution. He works combining the best of technology, data, product, empathy, and design to execute a successful product vision.

Scaling Data Analytics Containers with Event-based Lambda Functions

Post Syndicated from Brian Maguire original https://aws.amazon.com/blogs/architecture/scaling-data-analytics-containers-with-event-based-lambda-functions/

The marketing industry collects and uses data from various stages of the customer journey. When they analyze this data, they establish metrics and develop actionable insights that are then used to invest in customers and generate revenue.

If you’re a data scientist or developer in the marketing industry, you likely often use containers for services like collecting and preparing data, developing machine learning models, and performing statistical analysis. Because the types and amount of marketing data collected are quickly increasing, you’ll need a solution to manage the scale, costs, and number of required data analytics integrations.

In this post, we provide a solution that can perform and scale with dynamic traffic and is cost optimized for on-demand consumption. It uses synchronous container-based data science applications that are deployed with asynchronous container-based architectures on AWS Lambda. This serverless architecture automates data analytics workflows utilizing event-based prompts.

Synchronous container applications

Data science applications are often deployed to dedicated container instances, and their requests are routed by an Amazon API Gateway or load balancer. Typically, an Amazon API Gateway routes HTTP requests as synchronous invocations to instance-based container hosts.

The target of the requests is a container-based application running a machine learning service (SciLearn). The service container is configured with the required dependency packages such as scikit-learn, pandas, NumPy, and SciPy.

Containers are commonly deployed on different targets such as on-premises, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Compute Cloud (Amazon EC2), and AWS Elastic Beanstalk. These services run synchronously and scale through Amazon Auto Scaling groups and a time-based consumption pricing model.

Figure 1. Synchronous container applications diagram

Figure 1. Synchronous container applications diagram

Challenges with synchronous architectures

When using a synchronous architecture, you will likely encounter some challenges related to scale, performance, and cost:

  • Operation blocking. The sender does not proceed until Lambda returns, and failure processing, such as retries, must be handled by the sender.
  • No native AWS service integrations. You cannot use several native integrations with other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon EventBridge, and Amazon Simple Queue Service (Amazon SQS).
  • Increased expense. A 24/7 environment can result in increased expenses if resources are idle, not sized appropriately, and cannot automatically scale.

The New for AWS Lambda – Container Image Support blog post offers a serverless, event-based architecture to address these challenges. This approach is explained in detail in the following section.

Benefits of using Lambda Container Image Support

In our solution, we refactored the synchronous SciLearn application that was deployed on instance-based hosts as an asynchronous event-based application running on Lambda. This solution includes the following benefits:

  • Easier dependency management with Dockerfile. Dockerfile allows you to install native operating system packages and language-compatible dependencies or use enterprise-ready container images.
  • Similar tooling. The asynchronous and synchronous solutions use Amazon Elastic Container Registry (Amazon ECR) to store application artifacts. Therefore, they have the same build and deployment pipeline tools to inspect Dockerfiles. This means your team will likely spend less time and effort learning how to use a new tool.
  • Performance and cost. Lambda provides sub-second autoscaling that’s aligned with demand. This results in higher availability, lower operational overhead, and cost efficiency.
  • Integrations. AWS provides more than 200 service integrations to deploy functions as container images on Lambda, without having to develop it yourself so it can be deployed faster.
  • Larger application artifact up to 10 GB. This includes larger application dependency support, giving you more room to host your files and packages in oppose to hard limit of 250 MB of unzipped files for deployment packages.

Scaling with asynchronous events

AWS offers two ways to asynchronously scale processing independently and automatically: Elastic Beanstalk worker environments and asynchronous invocation with Lambda. Both options offer the following:

  • They put events in an SQS queue.
  • They can be designed to take items from the queue only when they have the capacity available to process a task. This prevents them from becoming overwhelmed.
  • They offload tasks from one component of your application by sending them to a queue and process them asynchronously.

These asynchronous invocations add default, tunable failure processing and retry mechanisms through “on failure” and “on success” event destinations, as described in the following section.

Integrations with multiple destinations

“On failure” and “on success” events can be logged in an SQS queue, Amazon Simple Notification Service (Amazon SNS) topic, EventBridge event bus, or another Lambda function. All four are integrated with most AWS services.

“On failure” events are sent to an SQS dead-letter queue because they cannot be delivered to their destination queues. They will be reprocessed them as needed, and any problems with message processing will be isolated.

Figure 2 shows an asynchronous Amazon API Gateway that has placed an HTTP request as a message in an SQS queue, thus decoupling the components.

The messages within the SQS queue then prompt a Lambda function. This runs the machine learning service SciLearn container in Lambda for data analysis workflows, which are integrated with another SQS dead letter queue for failures processing.

Figure 2. Example asynchronous-based container applications diagram

Figure 2. Example asynchronous-based container applications diagram

When you deploy Lambda functions as container images, they benefit from the same operational simplicity, automatic scaling, high availability, and native integrations. This makes it an appealing architecture for our data analytics use cases.

Design considerations

The following can be considered when implementing Docker Container Images with Lambda:

  • Lambda supports container images that have manifest files that follow these formats:
    • Docker image manifest V2, schema 2 (Docker version 1.10 and newer)
    • Open Container Initiative (OCI) specifications (v1.0.0 and up)
  • Storage in Lambda:
  • On create/update, Lambda will cache the image to speed up the cold start of functions during execution. Cold starts occur on an initial request to Lambda, which can lead to longer startup times. The first request will maintain an instance of the function for only a short time period. If Lambda has not been called during that period, the next invocation will create a new instance
  • Fine grain role policies are highly recommended for security purposes
  • Container images can use the Lambda Extensions API to integrate monitoring, security and other tools with the Lambda execution environment.


We were able to architect this synchronous service based on a previously deployed on instance-based hosts and design it to become asynchronous on Amazon Lambda.

By using the new support for container-based images in Lambda and converting our workload into an asynchronous event-based architecture, we were able to overcome these challenges:

  • Performance and security. With batch requests, you can scale asynchronous workloads and handle failure records using SQS Dead Letter Queues and Lambda destinations. Using Lambda to integrate with other services (such as EventBridge and SQS) and using Lambda roles simplifies maintaining a granular permission structure. When Lambda uses an Amazon SQS queue as an event source, it can scale up to 60 more instances per minute, with a maximum of 1,000 concurrent invocations.
  • Cost optimization. Compute resources are a critical component of any application architecture. Overprovisioning computing resources and operating idle resources can lead to higher costs. Because Lambda is serverless, it only incurs costs on when you invoke a function and the resources allocated for each request.

Use IAM Access Analyzer to generate IAM policies based on access activity found in your organization trail

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/use-iam-access-analyzer-to-generate-iam-policies-based-on-access-activity-found-in-your-organization-trail/

In April 2021, AWS Identity and Access Management (IAM) Access Analyzer added policy generation to help you create fine-grained policies based on AWS CloudTrail activity stored within your account. Now, we’re extending policy generation to enable you to generate policies based on access activity stored in a designated account. For example, you can use AWS Organizations to define a uniform event logging strategy for your organization and store all CloudTrail logs in your management account to streamline governance activities. You can use Access Analyzer to review access activity stored in your designated account and generate a fine-grained IAM policy in your member accounts. This helps you to create policies that provide only the required permissions for your workloads.

Customers that use a multi-account strategy consolidate all access activity information in a designated account to simplify monitoring activities. By using AWS Organizations, you can create a trail that will log events for all Amazon Web Services (AWS) accounts into a single management account to help streamline governance activities. This is sometimes referred to as an organization trail. You can learn more from Creating a trail for an organization. With this launch, you can use Access Analyzer to generate fine-grained policies in your member account and grant just the required permissions to your IAM roles and users based on access activity stored in your organization trail.

When you request a policy, Access Analyzer analyzes your activity in CloudTrail logs and generates a policy based on that activity. The generated policy grants only the required permissions for your workloads and makes it easier for you to implement least privilege permissions. In this blog post, I’ll explain how to set up the permissions for Access Analyzer to access your organization trail and analyze activity to generate a policy. To generate a policy in your member account, you need to grant Access Analyzer limited cross-account access to access the Amazon Simple Storage Service (Amazon S3) bucket where logs are stored and review access activity.

Generate a policy for a role based on its access activity in the organization trail

In this example, you will set fine-grained permissions for a role used in a development account. The example assumes that your company uses Organizations and maintains an organization trail that logs all events for all AWS accounts in the organization. The logs are stored in an S3 bucket in the management account. You can use Access Analyzer to generate a policy based on the actions required by the role. To use Access Analyzer, you must first update the permissions on the S3 bucket where the CloudTrail logs are stored, to grant access to Access Analyzer.

To grant permissions for Access Analyzer to access and review centrally stored logs and generate policies

  1. Sign in to the AWS Management Console using your management account and go to S3 settings.
  2. Select the bucket where the logs from the organization trail are stored.
  3. Change object ownership to bucket owner preferred. To generate a policy, all of the objects in the bucket must be owned by the bucket owner.
  4. Update the bucket policy to grant cross-account access to Access Analyzer by adding the following statement to the bucket policy. This grants Access Analyzer limited access to the CloudTrail data. Replace the <organization-bucket-name>, and <organization-id> with your values and then save the policy.
        "Version": "2012-10-17",
            "Sid": "PolicyGenerationPermissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            "Action": [
            "Resource": [
            "Condition": {
                "StringLike": {"aws:PrincipalArn":"arn:aws:iam::${aws:PrincipalAccount}:role/service-role/AccessAnalyzerMonitorServiceRole*"            }

By using the preceding statement, you’re allowing listbucket and getobject for the bucket my-organization-bucket-name if the role accessing it belongs to an account in your Organizations and has a name that starts with AccessAnalyzerMonitorServiceRole. Using aws:PrincipalAccount in the resource section of the statement allows the role to retrieve only the CloudTrail logs belonging to its own account. If you are encrypting your logs, update your AWS Key Management Service (AWS KMS) key policy to grant Access Analyzer access to use your key.

Now that you’ve set the required permissions, you can use the development account and the following steps to generate a policy.

To generate a policy in the AWS Management Console

  1. Use your development account to open the IAM Console, and then in the navigation pane choose Roles.
  2. Select a role to analyze. This example uses AWS_Test_Role.
  3. Under Generate policy based on CloudTrail events, choose Generate policy, as shown in Figure 1.
    Figure 1: Generate policy from the role detail page

    Figure 1: Generate policy from the role detail page

  4. In the Generate policy page, select the time window for which IAM Access Analyzer will review the CloudTrail logs to create the policy. In this example, specific dates are chosen, as shown in Figure 2.
    Figure 2: Specify the time period

    Figure 2: Specify the time period

  5. Under CloudTrail access, select the organization trail you want to use as shown in Figure 3.

    Note: If you’re using this feature for the first time: select create a new service role, and then choose Generate policy.

    This example uses an existing service role “AccessAnalyzerMonitorServiceRole_MBYF6V8AIK.”

    Figure 3: CloudTrail access

    Figure 3: CloudTrail access

  6. After the policy is ready, you’ll see a notification on the role page. To review the permissions, choose View generated policy, as shown in Figure 4.
    Figure 4: Policy generation progress

    Figure 4: Policy generation progress

After the policy is generated, you can see a summary of the services and associated actions in the generated policy. You can customize it by reviewing the services used and selecting additional required actions from the drop down. To refine permissions further, you can replace the resource-level placeholders in the policies to restrict permissions to just the required access. You can learn more about granting fine-grained permissions and creating the policy as described in this blog post.


Access Analyzer makes it easier to grant fine-grained permissions to your IAM roles and users by generating IAM policies based on the CloudTrail activity centrally stored in a designated account such as your AWS Organizations management accounts. To learn more about how to generate a policy, see Generate policies based on access activity in the IAM User Guide.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the IAM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Mathangi Ramesh

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Apply the principle of separation of duties to shell access to your EC2 instances

Post Syndicated from Vesselin Tzvetkov original https://aws.amazon.com/blogs/security/apply-the-principle-of-separation-of-duties-to-shell-access-to-your-ec2-instances/

In this blog post, we will show you how you can use AWS Systems Manager Change Manager to control access to Amazon Elastic Compute Cloud (Amazon EC2) instance interactive shell sessions, to enforce separation of duties. Separation of duties is a design principle where more than one person’s approval is required to conclude a critical task, and it is an important part of the AWS Well-Architected Framework. You will be using AWS Systems Manager Session Manager in this post to start a shell session in managed EC2 instances.

To get approval, the operator requests permissions by creating a change request for a shell session to an EC2 instance. An approver reviews and approves the change request. The approver and requestor cannot be the same Identity and Access Management (IAM) principal. Upon approval, an AWS Systems Manager Automation runbook is started. The Automation runbook adds a tag to the operator’s IAM principal that allows it to start a shell in the specified targets. By default, the operator needs to start the session within 10 minutes of approval (although the validity period is configurable). After the 10 minutes elapse, the Automation runbook removes the tag from the principal, which means that the permissions to start new sessions are revoked.

To implement the solution described in this post, you use attribute-based access control (ABAC) based policy. In order to start a Systems Manager session, the operator’s IAM principal must have the tag key SecurityAllowSessionInstance, and the tag value set to the target EC2 instance ID. All operator principals have attached the same managed policy, which allows the session to start only if the tag is present and the value is equal to the instance ID. Figure 1 shows an example in which the IAM principal tag SecurityAllowSessionInstance has the value i-1234567890abcdefg, which is the same as the instance ID.

Figure 1: Tag and managed policy pattern

Figure 1: Tag and managed policy pattern

In this post, we will take you through the following steps:

  1. Review the architecture of the solution. (See the Architecture section.)
  2. Set up Systems Manager and Change Manager in the console.
  3. Deploy an AWS CloudFormation template that will provision the following:
    • A change management template AllowSsmSessionStartTemplate to request permission for a Session Manager shell session on a specified EC2 instance.
    • The Systems Manager Automation runbook with three steps that: adds a tag to the principal; waits 10 minutes (configurable); and removes the tag. The tag key is SecurityAllowSessionInstance.
    • An IAM managed policy to be added to an IAM principal, which allows starting a Systems Manager session only if the tag AllowStartSsmSessionBasedOnIamTags is present.
    • An Amazon SNS topic change-manager-ssm-approval where approvers can get notification about requests.
    • An IAM role named SsmSessionControlChangeMangerRole, to be used for the Systems Manager Automation runbook.

    Note: Before you use the change template, you will approve the change management template in the AWS Management Console (one time).

  4. Perform simple test cases to demonstrate how an operator can obtain permission and start a session in a managed instance.
  5. Perform status monitoring.

You can use this solution across your AWS Organizations to give you the benefit of centrally managing change-related tasks in one member account, which you specify to be the delegated administrator account. For more information about how to set this up, see Setting up Change Manager for an organization.

Note: The operator can have multiple sessions in different EC2 instances simultaneously, but the sessions must be approved and started one after another because of tag overwrite on approval.

For more information about change management actions, including approvals and starting the runbook, see Auditing and logging Change Manager activity in the AWS Systems Manager User Guide.


The architecture of this solution is shown in Figure 2.

Figure 2: Solution architecture

Figure 2: Solution architecture

The main steps shown in Figure 2 are the following:

  1. Request: The requestor (which can be the operator) creates a change request in Systems Manager Change Manager and selects the template AllowSsmSessionStartTemplate. You need to provide the following mandatory parameters: name of change, approvals (users, group, or roles), IAM role for the execution of change, target account, EC2 instance ID, operator’s principal type (user or role), and operator’s principal name.
  2. Send notification: The notification is sent to the Amazon SNS topic change-manager-ssm-approval for the new change request.
  3. Approve: The approver reviews and approves the request.
  4. Start automation: The Automation runbook AllowStartSsmSession is started at the time specified in the change request.
  5. Tag: The operator’s IAM principal is tagged with the key SecurityAllowSessionInstance. After 10 minutes, the runbook completes by removing the tag from the IAM principal.
  6. Start session: The operator can start a session to the instance by using Systems Manager Session Manager within 10 minutes of approval. A notification is sent to the SNS topic change-manager-ssm-approval, where the operator can also subscribe to be notified.

Roles and permissions

The provided managed policy AllowStartSsmSessionBasedOnIamTags gives permission to start the Systems Manager session when the instance ID is equal to principal tag, and additionally to terminate the session. The managed policy allows the operator to keep an already active session beyond the approval interval and terminate it as preferred. Resumption of the session is not supported, and the operator will need to start a new session instead.

WARNING: You should validate that the operator principal (which is an IAM user or role) does not have permissions on the actions ssm:StartSession, ssm:TerminateSession, ssm:ResumeSession outside the managed policy used in this solution.

WARNING: It is very important that the operator must not have permission to change the relevant IAM roles, users, policies, or principal tags, so that the operator cannot bypass the approval process.

Set up Systems Manager and Change Manager

You need to initially activate Systems Manager and Systems Manager Change Manager in your account. If you have already activated them, you can skip this section.

Note: You should enable Systems Manager as described in Setting up AWS Systems Manager, according to your company needs. The minimal requirement is to set up the service-linked role AWSServiceRoleForAmazonSSM that will be used by Systems Manager.

To create the service-linked IAM role

  1. Open the IAM console. In the navigation pane, choose Roles, then choose Create role.
  2. For the AWS Service role type, choose Systems Manager.
  3. Choose the use case Systems Manager – Inventory and Maintenance Windows, then choose Next: Permissions.
  4. Keep all default values, choose Next: Tags, and then choose Next: Review.
  5. Review the role and then choose Create role.

For more information, see Creating a service-linked role for Systems Manager.

Next, you set up Systems Manager Change Manager as described in Setting up Change Manager. Your specifics will vary depending on whether you use AWS Organizations or a single account.

Define the IAM users or groups that are allowed to approve change templates

Every change template should be approved before use (optional). The approval can be done by users and groups. If you use IAM roles in your organization, you will need a temporary user, which you can set up as described in Creating IAM users (console). Alternatively, you can use the change templates without explicit approval, as described later in this section.

To add reviewers for change templates

  1. In the AWS Systems Manager console, choose Change Manager, choose Settings, then choose Template reviewers.
  2. On the Select IAM approvers page, review the Users tab and Groups tab, as shown in Figure 3, and add approvers if necessary.


Figure 3: Change Manger settings

Figure 3: Change Manger settings

If you prefer not to explicitly review and approve the change template before use, you must turn off approval as follows.

To turn off approval of change templates before use

  1. In the Systems Manager console, choose Change Manager, then choose Settings.
  2. Under Best practices, set the option Require template review and approval before use to disabled.

Deploy the solution

After you complete the setup, you will perform the following steps one time in your selected account and AWS Region. The solution manages the permissions in all Regions you select, because IAM roles and policies are global entities.

To launch the stack

  1. Choose the following Launch Stack button to open the AWS CloudFormation console pre-loaded with the template. You must sign in to your AWS account in order to launch the stack in the required Region.
    Select the Launch Stack button to launch the template
  2. On the CloudFormation launch panel, specify the parameter Approval validity in minutes to correspond to your company policy, or keep the default value of 10 minutes.

(Optional) To approve the template

  1. To request approval of the Change Manager template, in the Systems Manager console, choose Change Manager, and then choose Change templates. Select AllowSsmSessionStartTemplate and submit for approval.
  2. To approve the Change Manager template, sign in to the Systems Manager console as the required approver user or group. Choose Change Manager, and then choose Change templates. Select AllowSsmSessionStartTemplate and choose Actions, Approve template. For more information, see Reviewing and approving or rejecting change templates.
  3. (Optional) The Systems Manager session approvers should subscribe to the SNS topic change-manager-ssm-approval, to get notification on new requests.

Now you’re ready to use the solution.

Test the solution

Next, we’ll demonstrate how you can test the solution end-to-end by doing the following: creating two IAM roles (Operator and Approver), launching an EC2 instance, requesting access by Operator to the instance, approving the request by Approver, and finally starting a Systems Manager session on the EC2 instance by Operator. You will run the test in the single account where you deployed the solution. We assume that you have set up Systems Manager as described in the Set up Systems Manager and Change Manager section.

Note: If you’re not using IAM roles in your organization, you can use IAM users instead, as described in Creating IAM users (console).

To prepare to test the solution

  1. Open the IAM console and create an IAM role named Operator in your account, and attach the following managed policies: ReadOnlyAccess (AWS managed) and AllowStartSsmSessionBasedOnIamTags (which you created in this post). For more information, see Creating IAM roles
  2. Create a second IAM role named Approver in your account, and attach the following AWS managed policies: ReadOnlyAccess and AmazonSSMFullAccess.
  3. Create an IAM role named EC2Role with a trust policy to the EC2 service (ec2.amazonaws.com) and attach the AWS managed policy AmazonEC2RoleforSSM. Alternatively, you can confirm that your existing EC2 instances have the AmazonEC2RoleforSSM policy attached to their role. For more information, see Creating a role for an AWS service (console).
  4. Open the Amazon EC2 console and start a test EC2 instance of type Amazon Linux 2 with the IAM role EC2Role that you created in step 3. You can keep the default values for all the other parameters. You don’t need to set up VPC Security Group rules to allow inbound SSH to the EC2 instance. Take note of the instance-id, because you will need it later. For more information, see Launch an Amazon EC2 Instance.
  5. Open the Amazon SNS console. Under Simple Notification Service, for Topics, subscribe to the SNS topic change-manager-ssm-approval. For more information, see Subscribing to an Amazon SNS topic.

To do a positive test of the solution

  1. Open the Systems Manager console, sign in as Operator, choose Change Manager, and create a change request.
  2. Select the template AllowSsmSessionStartTemplate.
  3. On the Specify change details page, enter a name and description, and select the IAM role Approver as approver.
  4. For Target notification topic, select the SNS topic change-manager-ssm-approval, as shown in Figure 4. Choose Next.

    Figure 4: Creating a change request

    Figure 4: Creating a change request

  5. On the Specify parameters page, provide the automation IAM role SsmSessionControlChangeMangerRole, the instance-id you noted earlier, the principal name Operator, and the principal type role, as shown in Figure 5.

    Figure 5: Specify parameters for the change request

    Figure 5: Specify parameters for the change request

  6. Next, sign in as Approver. In the Systems Manager console, choose Change Manager.
  7. On the Requests tab, as shown in Figure 6, select the request and choose Approve. (For more information, see Reviewing and approving or rejecting change requests (console).) The Automation runbook will be started.

    Figure 6: Change Manager overview

    Figure 6: Change Manager overview

  8. Sign in as Operator. Within the approval validity time that you provided in the template (10 minutes is the default), connect to the instance by using Systems Manager as described in Start a session.When the session has started and you see Unix shell at the instance, the positive test is done.

Next, you can do a negative test, to demonstrate that access isn’t possible after the approval validity period (10 minutes) has elapsed.

To do a negative test of the solution

  1. Do steps 1 through 7 of the previous procedure, if you haven’t already done so.
  2. Sign in as IAM role Operator. Wait several minutes longer than the approval validity time (10-minute default) and connect to the instance by using Systems Manager as described in Start a session.You will see that the IAM role Operator doesn’t have permission to start a session.

Clean up the resources

After the tests are finished, terminate the EC2 instance to avoid incurring future costs and remove the roles if these are no longer needed.

Status monitoring

In the Systems Manager console, on the Change Manager page, on the Requests tab, you can find all service requests and their status, and a link to the log of the runbook, as shown in Figure 7.

Figure 7: Change Manger runbook log

Figure 7: Change Manger runbook log

In the example shown in Figure 7, you can see the status of the following steps in the Automation runbook: tagging the principal, waiting, and removing the principal tag. For more information about audit and login, see Auditing and logging Change Manager activity.


In this post, you‘ve learned how you can enforce separation of duties by using an approval workflow in AWS Systems Manager Change Manager. You can also extend this pattern to use it with AWS Organizations, as described in Setting up Change Manager for an organization. For more information, see Configuring Change Manager options and best practices.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Systems Manager forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Vesselin Tzvetkov

Vesselin is a senior security architect at AWS Professional Services and is passionate about security architecture and engineering innovative solutions. Outside of technology, he likes classical music, philosophy, and sports. He holds a Ph.D. in security from TU-Darmstadt and a M.S. in electrical engineering from Bochum University in Germany.

Pedro Galvao

Pedro is a security engineer at AWS Professional Services. His favorite activity is to help customer doing awesome security engineering work on AWS.

How IPONWEB Adopted Spot Instances to Run their Real-time Bidding Workloads

Post Syndicated from Roman Boiko original https://aws.amazon.com/blogs/architecture/how-iponweb-adopted-spot-instances-to-run-their-real-time-bidding-workloads/

IPONWEB is a global leader that builds programmatic, real-time advertising technology and infrastructure for some of the world’s biggest digital media buyers and sellers. The core of IPONWEB’s business is Real-time Bidding (RTB). IPONWEB’s platform processes, transmits, and auctions huge volumes of bid requests and bid responses in real time. They are then able to determine the optimal ad impression to be displayed to an end user.

IPONWEB processes over 1 trillion impression auctions per month. The volumes of impressions require significant compute power to make calculations and complete the entire bidding auction within 200 milliseconds. IPONWEB’s flagship RTB products, BidSwitch and BSWX, have been running on Amazon EC2 reserved and On-Demand Instances. The nature of the workload for this volume of compute for this short a time, has been notably spiky. IPONWEB maintained a core workload running 24/7 on EC2 instances covered by Reserved Instances and Savings Plans.

To optimize costs further, IPONWEB decided to use Amazon EC2 Spot Instances. Although this decision is also advantageous for workload efficiency, adopting Spot Instances did introduce a few challenges that needed to be addressed.

Some challenges in using Spot Instances for this use case

The Real-time Bidding (RTB) workload is critical to the business. Any interruption of bid processing has a negative impact on revenue. While the spikiness issue needed to be addressed, it was imperative to find a solution that could be implemented in a smooth and seamless fashion.

Efficient distribution over Availability Zones (AZs) within the AWS Region being used was imperative. RTB auctions process trillions of requests and consume large volumes of network traffic. Consequently, IPONWEB is continuously optimizing traffic costs along with compute costs. The architectural approach is to put as much workload as possible in one AZ to reduce inter-AZ traffic. For better performance, IPONWEB’s main MongoDB database and Spot Instances are running in the same AZ.

Under certain circumstances, using Spot Instances in other AZs can still be a good choice. AWS charges for traffic between AZs, so IPONWEB defined a break point. This occurs when usage of an EC2 Spot Instance in a different AZ became equal to usage of an EC2 On-Demand Instance in the same AZ.

This break point calculation led to the following best practices (in order of most value to least, for this use case):

  1. The most efficient practice was to run Spot Instances in the same AZ as the database
  2. Next effective was to run Spot Instances in the other AZs
  3. The third option was to run On-Demand Instances in the same AZ
  4. Least effective was to run On-Demand Instances in the other AZs

IPONWEB’s solution

IPONWEB evaluated Auto Scaling groups using both Spot and On-Demand Instances. They concluded that using Auto Scaling alone would not satisfy all the requirements, and it would be difficult to distribute workloads between AZs adequately. This is because an Auto Scaling group tries to distribute the instances across all the AZs evenly. However, IPONWEB needed to scale in one AZ initially, to avoid unnecessary cross AZ traffic. In addition, one Auto Scaling group can’t compensate for the sudden termination of a large number of Spot Instances. IPONWEB created a more flexible and reliable solution based on using four Auto Scaling groups.

  1. The first Auto Scaling group runs only on Spot Instances in the same AZ as the main database application.
  2. The second Auto Scaling group is running only Spot Instances in the other AZs.
  3. The third Auto Scaling group is running EC2 On-Demand Instances in the database AZ.
  4. The fourth Auto Scaling group is for EC2 On-Demand Instance on the other AZs.

The application workload running on the instances in these four Auto Scaling groups is evenly distributed. This is done using an Application Load Balancer (ALB), or IPONWEB’s purpose-built load balancer.

Figure 1. Spot and On-Demand scaling by different AZs

Figure 1. Spot and On-Demand Instances scaling by different AZs

IPONWEB then needed to create a proper scaling policy for each Auto Scaling group. They decided to implement a self-monitoring mechanism in the application. This monitored how much compute resources (CPU and memory) are still available at any given time. Using this information, the application determines if it is able to take the next processing request. If it’s unable to, a log entry is created, and an “empty” response is returned. IPONWEB created a composite metric based on the drop rate and CPU utilization of each host.

IPONWEB used that custom metric to create scaling policies for every Auto Scaling group. Each Auto Scaling group has a specific threshold. This provides granular control, such as when, and under what conditions the particular Auto Scaling group scales-out or scales-in. It will set a lower threshold for the first Auto Scaling group and higher thresholds for the next Auto Scaling groups.

For example, the group with Spot Instances in the same AZ starts scaling at 75% utilization. The group with Spot Instances in other AZs starts scaling at 80% utilization.

  • Because each instance in every group receives the same number of incoming requests to process, the first group will be scaling earlier than the others.
  • If some Spot Instances are shut down in the first group, then the total utilization will increase, and the second group will begin scaling.
  • If there is a shortage of Spot Instances in other AZs as well, then the group with EC2 On-Demand Instances will scale out.
  • When the Spot Instances are available and can be used again, the first group will use them. Then total utilization will decrease, causing the other groups to scale in.

Typically, the majority of this workload is done by the first group. The other groups are running with a bare minimum of two instances in each. After going live with this solution, IPONWEB observed that they can usually run the major portion of their workloads with the cheapest option. In this case, it was using Spot Instances in the same AZ as their database. There were several times when a large number of the Spot Instances were shut down. When this happened, the scaling diverted to On-Demand Instances. In general, the solution worked well and there was no degradation on the running services.


This solution helps IPONWEB utilize compute resources more efficiently using AWS, to place and process more bids. After the migration, IPONWEB increased infrastructure on AWS by ~15% without incurring additional costs. This solution can also be used for other workloads where you must have a granular control on the scaling while minimizing the costs.

Victor Gorelik, VP of Cloud Infrastructure at IPONWEB said:

“If you use cloud without optimizations, it is expensive. There are two ways to optimize the costs – use either Savings Plans or Spot Instances. But it feels like using Savings Plans is a step back to the traditional model where you buy hardware. On the other hand, using Spots is making you more Cloud native and is the way to move forward. I believe that in the future the majority of our workloads will be run on Spot.”

Read more here:

15 years of silicon innovation with Amazon EC2

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/15-years-of-silicon-innovation-with-amazon-ec2/

The Graviton Hackathon is now available to developers globally to build and migrate apps to run on AWS Graviton2

This week we are celebrating 15 years of Amazon EC2 live on Twitch August 23rd – 24th with keynotes and sessions from AWS leaders and experts. You can watch all the sessions on-demand later this week to learn more about innovation at Amazon EC2.

As Jeff Barr noted in his blog, EC2 started as a single instance back in 2006 with the idea of providing developers on demand access to compute infrastructure and have them pay only for what they use. Today, EC2 has over 400 instance types with the broadest and deepest portfolio of instances in the cloud. As we strive to be the best cloud for every workload, customers consistently ask us for higher performance, and lower costs for their workloads. One way we deliver on this is by building silicon specifically optimized for the cloud. Our journey with custom silicon started in 2012 when we began looking into ways to improve the performance of EC2 instances by offloading virtualization functionality that is traditionally run on underlying servers to a dedicated offload card. Today, the AWS Nitro System is the foundation upon which our modern EC2 instances are built, delivering better performance and enhanced security. The Nitro System has also enabled us to innovate faster and deliver new instances and unique capabilities including Mac Instances, AWS Outposts, AWS Wavelength, VMware Cloud on AWS, and AWS Nitro Enclaves. We also have strong collaboration with partners to build custom silicon optimized for the AWS cloud with their latest generation processors to continue to deliver better performance and better price performance for our joint customers. Additionally, we’ve also designed AWS Inferentia and AWS Trainium chips to drive down the cost and boost performance for deep learning workloads.

One of the biggest innovations that help us deliver higher performance at lower cost for customer workloads are the AWS Graviton2 processors, which are the second generation of Arm-based processors custom-built by AWS. Instances powered by the latest generation AWS Graviton2 processors deliver up to 40% better performance at 20% lower per-instance cost over comparable x86-based instances in EC2. Additionally, Graviton2 is our most power efficient processor. In fact, Graviton2 delivers 2 to 3.5 times better performance per Watt of energy use versus any other processor in AWS.

Customers from startups to large enterprises including Intuit, Snap, Lyft, SmugMug, and Nextroll have realized significant price performance benefits for their production workloads on AWS Graviton2-based instances. Recently, EPIC Games added support for Graviton2 in Unreal Engine to help its developers build high performance games. What’s even more interesting is that AWS Graviton2-based instances supported 12 core retail services during Amazon Prime Day this year.

Most customers get started with AWS Graviton2-based instances by identifying and moving one or two workloads that are easy to migrate, and after realizing the price performance benefits, they move more workloads to Graviton2. In her blog, Liz Fong-Jones, Principal Developer Advocate at Honeycomb.io, details her journey of adopting Graviton2 and realizing significant price performance improvements. Using experience from working with thousands of customers like Liz who have adopted Graviton2, we built a program called the Graviton Challenge that provides a step-by-step plan to help you move your first workload to Graviton2-based instances.

Today, to further incentivize developers to get started with Graviton2, we are launching the Graviton Hackathon, where you can build a new app or migrate an app to run on Graviton2-based instances. Whether you are an existing EC2 user looking to optimize price performance for your workload, an Arm developer looking to leverage Arm-based instances in the cloud, or an open source developer adding support for Arm, you can participate in the Graviton Hackathon for a chance to win prizes for your project, including up to $10,000 in prize money. We look forward to the new applications which will be able to take advantage of the price performance benefits of Graviton. To learn more about Graviton2, watch the EC2 15th Birthday event sessions on-demand later this week, register to the attend the Graviton workshop at the upcoming AWS Summit Online, or register for the Graviton Challenge.

Cloud computing has made cutting edge, cost effective infrastructure available to everyday developers. Startups can use a credit card to spin up instances in minutes and scale up and scale down easily based on demand. Enterprises can leverage compute infrastructure and services to drive improved operational efficiency and customer experience. The last 15 years of EC2 innovation have been at the forefront of this shift, and we are looking forward to the next 15 years.

Confidential computing: an AWS perspective

Post Syndicated from David Brown original https://aws.amazon.com/blogs/security/confidential-computing-an-aws-perspective/

Customers around the globe—from governments and highly regulated industries to small businesses and start-ups—trust Amazon Web Services (AWS) with their most sensitive data and applications. At AWS, keeping our customers’ workloads secure and confidential, while helping them meet their privacy and data sovereignty requirements, is our highest priority. Our investments in security technologies and rigorous operational practices meet and exceed even our most demanding customers’ confidential computing and data privacy standards. Over the years, we’ve made many long-term investments in purpose-built technologies and systems to keep raising the bar of security and confidentiality for our customers.

In the past year, there has been an increasing interest in the phrase confidential computing in the industry and in our customer conversations. We’ve observed that this phrase is being applied to various technologies that solve very different problems, leading to confusion about what it actually means. With the mission of innovating on behalf of our customers, we want to offer you our perspective on confidential computing.

At AWS, we define confidential computing as the use of specialized hardware and associated firmware to protect customer code and data during processing from outside access. Confidential computing has two distinct security and privacy dimensions. The most important dimension—the one we hear most often from customers as their key concern—is the protection of customer code and data from the operator of the underlying cloud infrastructure. The second dimension is the ability for customers to divide their own workloads into more-trusted and less-trusted components, or to design a system that allows parties that do not, or cannot, fully trust one another to build systems that work in close cooperation while maintaining confidentiality of each party’s code and data.

In this post, I explain how the AWS Nitro System intrinsically meets the requirements of the first dimension by providing those protections to customers who use Nitro-based Amazon Elastic Compute Cloud (Amazon EC2) instances, without requiring any code or workload changes from the customer side. I also explain how AWS Nitro Enclaves provides a way for customers to use familiar toolsets and programming models to meet the requirements of the second dimension. Before we get to the details, let’s take a closer look at the Nitro System.

What is the Nitro System?

The Nitro System, the underlying platform for all modern Amazon EC2 instances, is a great example of how we have invented and innovated on behalf of our customers to provide additional confidentiality and privacy for their applications. For ten years, we have been reinventing the EC2 virtualization stack by moving more and more virtualization functions to dedicated hardware and firmware, and the Nitro System is a result of this continuous and sustained innovation. The Nitro System is comprised of three main parts: the Nitro Cards, the Nitro Security Chip, and the Nitro Hypervisor. The Nitro Cards are dedicated hardware components with compute capabilities that perform I/O functions, such as the Nitro Card for Amazon Virtual Private Cloud (Amazon VPC), the Nitro Card for Amazon Elastic Block Store (Amazon EBS), and the Nitro Card for Amazon EC2 instance storage.

Nitro Cards—which are designed, built, and tested by Annapurna Labs, our in-house silicon development subsidiary—enable us to move key virtualization functionality off the EC2 servers—the underlying host infrastructure—that’s running EC2 instances. We engineered the Nitro System with a hardware-based root of trust using the Nitro Security Chip, allowing us to cryptographically measure and validate the system. This provides a significantly higher level of trust than can be achieved with traditional hardware or virtualization systems. The Nitro Hypervisor is a lightweight hypervisor that manages memory and CPU allocation, and delivers performances that is indistinguishable from bare metal (we recently compared it against our bare metal instances in the Bare metal performance with the AWS Nitro System post).

The Nitro approach to confidential computing

There are three main types of protection provided by the Nitro System. The first two protections underpin the key dimension of confidential computing—customer protection from the cloud operator and from cloud system software—and the third reinforces the second dimension—division of customer workloads into more-trusted and less-trusted elements.

  1. Protection from cloud operators: At AWS, we design our systems to ensure workload confidentiality between customers, and also between customers and AWS. We’ve designed the Nitro System to have no operator access. With the Nitro System, there’s no mechanism for any system or person to log in to EC2 servers (the underlying host infrastructure), read the memory of EC2 instances, or access any data stored on instance storage and encrypted EBS volumes. If any AWS operator, including those with the highest privileges, needs to do maintenance work on the EC2 server, they can do so only by using a strictly limited set of authenticated, authorized, and audited administrative APIs. None of these APIs have the ability to access customer data on the EC2 server. Because these technological restrictions are built into the Nitro System itself, no AWS operator can bypass these controls and protections. For additional defense-in-depth against physical attacks at the memory interface level, we offer memory encryption on various EC2 instances. Today, memory encryption is enabled by default on all Graviton2-based instances (T4g, M6g, C6g, C6gn, R6g, X2g), and Intel-based M6i instances, which have Total Memory Encryption (TME). Upcoming EC2 platforms based on the AMD Milan processor will feature Secure Memory Encryption (SME).
  2. Protection from AWS system software: The unique design of the Nitro System utilizes low-level, hardware-based memory isolation to eliminate direct access to customer memory, as well as to eliminate the need for a hypervisor on bare metal instances.

    • For virtualized EC2 instances (as shown in Figure 1), the Nitro Hypervisor coordinates with the underlying hardware-virtualization systems to create virtual machines that are isolated from each other as well as from the hypervisor itself. Network, storage, GPU, and accelerator access use SR-IOV, a technology that allows instances to interact directly with hardware devices using a pass-through connection securely created by the hypervisor. Other EC2 features such as instance snapshots and hibernation are all facilitated by dedicated agents that employ end-to-end memory encryption that is inaccessible to AWS operators.

      Figure 1: Virtualized EC2 instances

      Figure 1: Virtualized EC2 instances

    • For bare metal EC2 instances (as shown in Figure 2), there’s no hypervisor running on the EC2 server, and customers get dedicated and exclusive access to all of the underlying main system board. Bare metal instances are designed for customers who want access to the physical resources for applications that take advantage of low-level hardware features—such as performance counters and Intel® VT—that aren’t always available or fully supported in virtualized environments, and also for applications intended to run directly on the hardware or licensed and supported for use in non-virtualized environments. Bare metal instances feature the same storage, networking, and other EC2 capabilities as virtualized instances because the Nitro System implements all of the system functions normally provided by the virtualization layer in an isolated and independent manner using dedicated hardware and purpose-built system firmware. We used the very same technology to create Amazon EC2 Mac instances. Because the Nitro System operates over an independent bus, we can attach Nitro cards directly to Apple’s Mac mini hardware without any other physical modifications.

      Figure 2: Bare metal EC2 instance

      Figure 2: Bare metal EC2 instance

  3. Protection of sensitive computing and data elements from customers’ own operators and software: Nitro Enclaves provides the second dimension of confidential computing. Nitro Enclaves is a hardened and highly-isolated compute environment that’s launched from, and attached to, a customer’s EC2 instance. By default, there’s no ability for any user (even a root or admin user) or software running on the customer’s EC2 instance to have interactive access to the enclave. Nitro Enclaves has cryptographic attestation capabilities that allow customers to verify that all of the software deployed to their enclave has been validated and hasn’t been tampered with. A Nitro enclave has the same level of protection from the cloud operator as a normal Nitro-based EC2 instance, but adds the capability for customers to divide their own systems into components with different levels of trust. A Nitro enclave provides a means of protecting particularly sensitive elements of customer code and data not just from AWS operators but also from the customer’s own operators and other software.As the main goal of Nitro Enclaves is to protect against the customers’ own users and software on their EC2 instances, a Nitro enclave considers the EC2 instance to reside outside of its trust boundary. Therefore, a Nitro enclave shares no memory or CPU cores with the customer instance. To significantly reduce the attack surface area, a Nitro enclave also has no IP networking and offers no persistent storage. We designed Nitro Enclaves to be a platform that is highly accessible to all developers without the need to have advanced cryptography knowledge or CPU micro-architectural expertise, so that these developers can quickly and easily build applications to process sensitive data. At the same time, we focused on creating a familiar developer experience so that developing the trusted code that runs in a Nitro enclave is as easy as writing code for any Linux environment.


To summarize, the Nitro System’s unique approach to virtualization and isolation enables our customers to secure and isolate sensitive data processing from AWS operators and software at all times. It provides the most important dimension of confidential computing as an intrinsic, on-by-default, set of protections from the system software and cloud operators, and optionally via Nitro Enclaves even from customers’ own software and operators.

What’s next?

As mentioned earlier, the Nitro System represents our almost decade-long commitment to raising the bar for security and confidentiality for compute workloads in the cloud. It has allowed us to do more for our customers than is possible with off-the-shelf technology and hardware. But we’re not stopping here, and will continue to add more confidential computing capabilities in the coming months.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


David Brown

David is the Vice President of Amazon EC2, a web service that provides secure, resizable compute capacity in the cloud. He joined AWS in 2007, as a software developer based in Cape Town, working on the early development of Amazon EC2. Over the last 12 years, he has had several roles within Amazon EC2, working on shaping the service into what it is today. Prior to joining Amazon, David worked as a software developer within a financial industry startup.

Happy 15th Birthday Amazon EC2

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/happy-15th-birthday-amazon-ec2/

Fifteen years ago today I wrote the blog post that launched the Amazon EC2 Beta. As I recall, the launch was imminent for quite some time as we worked to finalize the feature set, the pricing model, and innumerable other details. The launch date was finally chosen and it happened to fall in the middle of a long-planned family vacation to Cabo San Lucas, Mexico. Undaunted, I brought my laptop along on vacation, and had to cover it with a towel so that I could see the screen as I wrote. I am not 100% sure, but I believe that I actually clicked Publish while sitting on a lounge chair near the pool! I spent the remainder of the week offline, totally unaware of just how much excitement had been created by this launch.

Preparing for the Launch
When Andy Jassy formed the AWS group and began writing his narrative, he showed me a document that proposed the construction of something called the Amazon Execution Service and asked me if developers would find it useful, and if they would pay to use it. I read the document with great interest and responded with an enthusiastic “yes” to both of his questions. Earlier in my career I had built and run several projects hosted at various colo sites, and was all too familiar with the inflexible long-term commitments and the difficulties of scaling on demand; the proposed service would address both of these fundamental issues and make it easier for developers like me to address widely fluctuating changes in demand.

The EC2 team had to make a lot of decisions in order to build a service to meet the needs of developers, entrepreneurs, and larger organizations. While I was not part of the decision-making process, it seems to me that they had to make decisions in at least three principal areas: features, pricing, and level of detail.

Features – Let’s start by reviewing the features that EC2 launched with. There was one instance type, one region (US East (N. Virginia)), and we had not yet exposed the concept of Availability Zones. There was a small selection of prebuilt Linux kernels to choose from, and IP addresses were allocated as instances were launched. All storage was transient and had the same lifetime as the instance. There was no block storage and the root disk image (AMI) was stored in an S3 bundle. It would be easy to make the case that any or all of these features were must-haves for the launch, but none of them were, and our customers started to put EC2 to use right away. Over the years I have seen that this strategy of creating services that are minimal-yet-useful allows us to launch quickly and to iterate (and add new features) rapidly in response to customer feedback.

Pricing – While it was always obvious that we would charge for the use of EC2, we had to decide on the units that we would charge for, and ultimately settled on instance hours. This was a huge step forward when compared to the old model of buying a server outright and depreciating it over a 3 or 5 year term, or paying monthly as part of an annual commitment. Even so, our customers had use cases that could benefit from more fine-grained billing, and we launched per-second billing for EC2 and EBS back in 2017. Behind the scenes, the AWS team also had to build the infrastructure to measure, track, tabulate, and bill our customers for their usage.

Level of Detail – This might not be as obvious as the first two, but it is something that I regularly think about when I write my posts. At launch time I shared the fact that the EC2 instance (which we later called the m1.small) provided compute power equivalent to a 1.7 GHz Intel Xeon processor, but I did not share the actual model number or other details. I did share the fact that we built EC2 on Xen. Over the years, customers told us that they wanted to take advantage of specialized processor features and we began to share that information.

Some Memorable EC2 Launches
Looking back on the last 15 years, I think we got a lot of things right, and we also left a lot of room for the service to grow. While I don’t have one favorite launch, here are some of the more memorable ones:

EC2 Launch (2006) – This was the launch that started it all. One of our more notable early scaling successes took place in early 2008, when Animoto scaled their usage from less than 100 instances all the way up to 3400 in the course of a week (read Animoto – Scaling Through Viral Growth for the full story).

Amazon Elastic Block Store (2008) – This launch allowed our customers to make use of persistent block storage with EC2. If you take a look at the post, you can see some historic screen shots of the once-popular ElasticFox extension for Firefox.

Elastic Load Balancing / Auto Scaling / CloudWatch (2009) – This launch made it easier for our customers to build applications that were scalable and highly available. To quote myself, “Amazon CloudWatch monitors your Amazon EC2 capacity, Auto Scaling dynamically scales it based on demand, and Elastic Load Balancing distributes load across multiple instances in one or more Availability Zones.”

Virtual Private Cloud / VPC (2009) – This launch gave our customers the ability to create logically isolated sets of EC2 instances and to connect them to existing networks via an IPsec VPN connection. It gave our customers additional control over network addressing and routing, and opened the door to many additional networking features over the coming years.

Nitro System (2017) – This launch represented the culmination of many years of work to reimagine and rebuild our virtualization infrastructure in pursuit of higher performance and enhanced security (read more).

Graviton (2018) -This launch marked the launch of Amazon-built custom CPUs that were designed for cost-sensitive scale-out workloads. Since that launch we have continued this evolutionary line, launching general purpose, compute-optimized, memory-optimized, and burstable instances powered by Graviton2 processors.

Instance Types (2006 – Present) -We launched with one instance type and now have over four hundred, each one designed to empower our customers to address a particular use case.

Celebrate with Us
To celebrate the incredible innovation we’ve seen from our customers over the last 15 years, we’re hosting a 2-day live event on August 23rd and 24th covering a range of topics. We kick off the event with today at 9am PDT with Vice President of Amazon EC2 Dave Brown’s keynote “Lessons from 15 Years of Innovation.

Event Agenda

August 23 August 24
Lessons from 15 Years of Innovation AWS Everywhere: A Fireside Chat on Hybrid Cloud
15 Years of AWS Silicon Innovation Deep Dive on Real-World AWS Hybrid Examples
Choose the Right Instance for the Right Workload AWS Outposts: Extending AWS On-Premises for a Truly Consistent Hybrid Experience
Optimize Compute for Cost and Capacity Connect Your Network to AWS with Hybrid Connectivity Solutions
The Evolution of Amazon Virtual Private Cloud Accelerating ADAS and Autonomous Vehicle Development on AWS
Accelerating AI/ML Innovation with AWS ML Infrastructure Services Accelerate AI/ML Adoption with AWS ML Silicon
Using Machine Learning and HPC to Accelerate Product Design Digital Twins: Connecting the Physical to the Digital World

Register here and join us starting at 9 AM PT to learn more about EC2 and to celebrate along with us!


How to authenticate private container registries using AWS Batch

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/how-to-authenticate-private-container-registries-using-aws-batch/

This post was contributed by Clayton Thomas, Solutions Architect, AWS WW Public Sector SLG Govtech.

Many AWS Batch users choose to store and consume their AWS Batch job container images on AWS using Amazon Elastic Container Registries (ECR). AWS Batch and Amazon Elastic Container Service (ECS) natively support pulling from Amazon ECR without any extra steps required. For those users that choose to store their container images on other container registries or Docker Hub, often times they are not publicly exposed and require authentication to pull these images. Third-party repositories may throttle the number of requests, which impedes the ability to run workloads and self-managed repositories require heavy tuning to offer the scale that Amazon ECS provides. This makes Amazon ECS the preferred solution to run workloads on AWS Batch.

While Amazon ECS allows you to configure repositoryCredentials in task definitions containing private registry credentials, AWS Batch does not expose this option in AWS Batch job definitions. AWS Batch does not provide the ability to use private registries by default but you can allow that by configuring the Amazon ECS agent in a few steps.

This post shows how to configure an AWS Batch EC2 compute environment and the Amazon ECS agent to pull your private container images from private container registries. This gives you the flexibility to use your own private and public container registries with AWS Batch.


The solution uses AWS Secrets Manager to securely store your private container registry credentials, which are retrieved on startup of the AWS Batch compute environment. This ensures that your credentials are securely managed and accessed using IAM roles and are not persisted or stored in AWS Batch job definitions or EC2 user data. The Amazon ECS agent is then configured upon startup to pull these credentials from AWS Secrets Manager. Note that this solution only supports Amazon EC2 based AWS Batch compute environments, thus AWS Fargate cannot use this solution.

High-level diagram showing event flow

Figure 1: High-level diagram showing event flow

  1. AWS Batch uses an Amazon EC2 Compute Environment powered by Amazon ECS. This compute environment uses a custom EC2 Launch Template to configure the Amazon ECS agent to include credentials for pulling images from private registries.
  2. An EC2 User Data script is run upon EC2 instance startup that retrieves registry credentials from AWS Secrets Manager. The EC2 instance authenticates with AWS Secrets Manager using its configured IAM instance profile, which grants temporary IAM credentials.
  3. AWS Batch jobs can be submitted using private images that require authentication with configured credentials.


For this walkthrough, you should have the following prerequisites:

  1. An AWS account
  2. An Amazon Virtual Private Cloud with private and public subnets. If you do not have a VPC, this tutorial can be followed. The AWS Batch compute environment must have connectivity to the container registry.
  3. A container registry containing a private image. This example uses Docker Hub and assumes you have created a private repository
  4. Registry credentials and/or an access token to authenticate with the container registry or Docker Hub.
  5. A VPC Security Group allowing the AWS Batch compute environment egress connectivity to the container registry.

A CloudFormation template is provided to simplify setting up this example. The CloudFormation template and provided EC2 user data script can be viewed here on GitHub.

The CloudFormation template will create the following resources:

  1. Necessary IAM roles for AWS Batch
  2. AWS Secrets Manager secret containing container registry credentials
  3. AWS Batch managed compute environment and job queue
  4. EC2 Launch Configuration with user data script

Click the Launch Stack button to get started:

Launch Stack

Launch the CloudFormation stack

After clicking the Launch stack button above, click Next to be presented with the following screen:

Figure 2: CloudFormation stack parameters

Figure 2: CloudFormation stack parameters

Fill in the required parameters as follows:

  1. Stack Name: Give your stack a unique name.
  2. Password: Your container registry password or Docker Hub access token. Note that both user name and password are masked and will not appear in any CF logs or output. Additionally, they are securely stored in an AWS Secrets Manager secret created by CloudFormation.
  3. RegistryUrl: If not using Docker Hub, specify the URL of the private container registry.
  4. User name: Your container registry user name.
  5. SecurityGroupIDs: Select your previously created security group to assign to the example Batch compute environment.
  6. SubnetIDs: To assign to the example Batch compute environment, select one or more VPC subnet IDs.

After entering these parameters, you can click through next twice and create the stack, which will take a few minutes to complete. Note that you must acknowledge that the template creates IAM resources on the review page before submitting.

Finally, you will be presented with a list of created AWS resources once the stack deployment finishes as shown in Figure 3 if you would like to dig deeper.

Figure 3: CloudFormation created resources

Figure 3: CloudFormation created resources

User data script contained within launch template

AWS Batch allows you to customize the compute environment in a variety of ways such as specifying an EC2 key pair, custom AMI, or an EC2 user data script. This is done by specifying an EC2 launch template before creating the Batch compute environment. For more information on Batch launch template support, see here.

Let’s take a closer look at how the Amazon ECS agent is configured upon compute environment startup to use your registry credentials.

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

Content-Type: text/cloud-config; charset="us-ascii"

- jq
- aws-cli

- /usr/bin/aws configure set region $(curl
- export SECRET_STRING=$(/usr/bin/aws secretsmanager get-secret-value --secret-id your_secrets_manager_secret_id | jq -r '.SecretString')
- export USERNAME=$(echo $SECRET_STRING | jq -r '.username')
- export PASSWORD=$(echo $SECRET_STRING | jq -r '.password')
- export REGISTRY_URL=$(echo $SECRET_STRING | jq -r '.registry_url')
- echo $PASSWORD | docker login --username $USERNAME --password-stdin $REGISTRY_URL
- export AUTH=$(cat ~/.docker/config.json | jq -c .auths)
- echo 'ECS_ENGINE_AUTH_TYPE=dockercfg' >> /etc/ecs/ecs.config
- echo "ECS_ENGINE_AUTH_DATA=$AUTH" >> /etc/ecs/ecs.config


This example script uses and installs a few tools including the AWS CLI and the open-source tool jq to retrieve and parse the previously created Secrets Manager secret. These packages are installed using the cloud-config user data type, which is part of the cloud-init packages functionality. If using the provided CloudFormation template, this script will be dynamically rendered to reference the created secret, but note that you must specify the correct Secrets Manager secret id if not using the template.

After performing a Docker login, the generated Auth JSON object is captured and passed to the Amazon ECS agent configuration to be used on AWS Batch jobs that require private images. For an explanation of Amazon ECS agent configuration options including available Amazon ECS engine Auth types, see here. This example script can be extended or customized to fit your needs but must adhere to requirements for Batch launch template user data scripts, including being in MIME multi-part archive format.

It’s worth noting that the AWS CLI automatically grabs temporary IAM credentials from the associated IAM instance profile the CloudFormation stack created in order to retrieve the Secret Manager secret values. This example assumes you created the AWS Secrets Manager secret with the default AWS managed KMS key for Secrets Manager. However, if you choose to encrypt your secret with a customer managed KMS key, make sure to specify kms:Decrypt IAM permissions for the Batch compute environment IAM role.

Submitting the AWS Batch job

Now let’s try an example Batch job that uses a private container image by creating a Batch job definition and submitting a Batch job:

  1. Open the AWS Batch console
  2. Navigate to the Job Definition page
  3. Click create
  4. Provide a unique Name for the job definition
  5. Select the EC2 platform
  6. Specify your private container image located in the Image field
  7. Click create

Figure 4: Batch job definition

Now you can submit an AWS Batch job that uses this job definition:

  1. Click on the Jobs page
  2. Click Submit New Job
  3. Provide a Name for the job
  4. Select the previously created job definition
  5. Select the Batch Job Queue created by the CloudFormation stack
  6. Click Submit
Submitting a new Batch job

Figure 5: Submitting a new Batch job

After submitting the AWS Batch job, it will take a few minutes for the AWS Batch Compute Environment to create resources for scheduling the job. Once that is done, you should see a SUCCEEDED status by viewing the job and filtering by AWS Batch job queue shown in Figure 6.

Figure 6: AWS Batch job succeeded

Figure 6: AWS Batch job succeeded

Cleaning up

To clean up the example resources, click delete for the created CloudFormation stack in the CloudFormation Console.


In this blog, you deployed a customized AWS Batch managed compute environment that was configured to allow pulling private container images in a secure manner. As I’ve shown, AWS Batch gives you the flexibility to use both private and public container registries. I encourage you to continue to explore the many options available natively on AWS for hosting and pulling container images. Amazon ECR or the recently launched Amazon ECR public repositories (for a deeper dive, see this blog announcement) both provide a seamless experience for container workloads running on AWS.